Optimization and Coding for Speed
High-level advice
- The compilers aren't all-powerful; don't place too much trust in them
- Hand-written ASM is king (or queen, as it were)
- There is no hardware FPU, so design your code around integer and fixed-point arithmetic (see the sketch below)
- Multiply is about as slow as Divide; Trigonometry is fast!
The rest of this page will assume a basic understanding of assembly language.
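To illustrate the fixed-point advice above: a multiply of two unsigned 16.16 fixed-point values can be built directly on the CORDIC multiplier. This is a minimal sketch; the register names a, b, lo and hi are placeholders, and signed inputs would need extra handling:
QMUL a,b ' 32x32 -> 64-bit unsigned product via the CORDIC unit
GETQX lo ' lower 32 bits of the product
GETQY hi ' upper 32 bits of the product
SHR lo,#16 ' discard the lowest 16 fraction bits
SHL hi,#16 ' align the integer part
OR lo,hi ' lo = 16.16 result (bits 47..16 of the product)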
Branching
Mind the Branch!
All branch instructions, when taken, consume 4 cycles when the target lies in Cog RAM or LUT RAM, twice as long as a normal instruction. They are much slower when the target is in Hub RAM: at least 13 cycles!!!
So the following rules of thumb emerge:
Code placement
Code that is frequently jumped to should be placed in Cog or LUT memory, whereas code that runs mostly straight through without branching can be placed in Hub memory with little penalty.
Loops
Tight loops should always be in Cog or LUT RAM. Use the REP instruction to implement the innermost loop where possible.
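For illustration, a minimal REP loop that sums 16 longs sitting in Cog RAM; the register names sum, idx and buf are placeholders:
MOV sum,#0
MOV idx,#0
REP @.sumloop,#16 ' hardware loop: no branch instruction, no branch penalty
ALTS idx,#buf ' redirect the next instruction's S field to buf+idx
ADD sum,0-0 ' sum += buf[idx]
ADD idx,#1
.sumloop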
_RET_ over RET
Where possible (no other condition code needed on the final instruction), use the _RET_ condition code instead of the RET instruction - this saves 2 cycles!
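A minimal sketch, with x and y as placeholder registers:
' subroutine: x = (x << 2) + y
shifted_add
SHL x,#2
_RET_ ADD x,y ' the last ALU instruction doubles as the return - no separate RET needed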
Arithmetic
Avoiding QMUL
When multiplying small values, use the ALU's MUL and MULS instructions. These only perform 16x16 multiplication, but they are much quicker than the CORDIC unit's QMUL instruction.
When multiplying by powers of two, use the bit-shift instructions! (that's old hat, isn't it?)
In cases where latency matters, it is often better to construct a larger multiply from multiple MUL(S) and/or shift operations. See also: 32x16 multiply idiom
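For example, a hedged sketch of the 32x16 idea: multiply a full 32-bit value a by a 16-bit value b and keep the lower 32 bits of the product (a, b, lo and tmp are placeholder registers):
GETWORD tmp,a,#1 ' tmp = upper 16 bits of a
MUL tmp,b ' tmp = a[31:16] * b
GETWORD lo,a,#0 ' lo = lower 16 bits of a
MUL lo,b ' lo = a[15:0] * b
SHL tmp,#16 ' align the upper partial product
ADD lo,tmp ' lo = low 32 bits of a*b
Six two-cycle instructions give a fixed 12-cycle latency, far less than the up-to-58-cycle wait of a QMUL followed immediately by GETQX.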
CORDIC pipelining (Easy Mode)
GETQX (and GETQY) can take a very long time (up to 58 cycles!) when used immediately after a CORDIC operation has been submitted. Luckily, you don't have to put up with that! Since the CORDIC unit operates independently in the background, any amount of other work you can do between submitting the command and grabbing the result subtracts from GETQX's blocking time. It's free real estate! If the result is already available, GETQX takes the minimum 2 cycles.
QMUL a,b
RDLONG u,r ' <- this instruction executes "for free" ...
GETQX c ' ...because this one has to wait less for its data to become available
CORDIC pipelining (Advanced Mode)
The CORDIC unit allows submitting a command every 8 cycles. This means that multiple commands can be in flight at once. (They don't have to be the same kind of command, either.) This can cause a huge increase in performance! But take care: The CORDIC X/Y result buffers only hold one value. If you don't grab a result in time, it will be overwritten! So it becomes important to mind the cycle counts!
Slow and nasty:
QMUL a,b
GETQX v
QMUL c,d
GETQX w
The same thing, but almost twice as fast with overlapped instructions:
QMUL a,b
QMUL c,d
GETQX v
GETQX w
Due to the aforementioned 8-cycle rhythm, the second QMUL and GETQX in the above example will each have 6 additional wait cycles in which other work could be done:
QMUL a,b
WAITX #6-2 ' <- do other work here (6 cycles = 3 instructions!!!)
QMUL c,d
WAITX #46-2 ' <- do other work here (TODO verify max safe cycle count here)
GETQX v
WAITX #6-2 ' <- do other work here
GETQX w
(of course only 4 cycles are available if both GETQX and GETQY are used)
CORDIC pipelining (HARDCORE MODE)
The true master class: pipelined CORDIC operations in loops. It becomes very important to maintain exact cycle counts and alignments. To hide the initial latency, the loop needs to be pre-heated, i.e. the initial commands need to be sent before the loop proper (and conversely, extra results are left unused afterwards, but this is usually not a problem: TODO document precise GETQX blocking behaviour).
A basic example might look like
' Pre-heat
QDIV rasL,rasZ
ADD rasL,stepL
ADD rasZ,stepZ
WAITX #56-8-2 ' Waiting the right amount of time here is essential!
' actual loop
REP @.loop,#123
QDIV rasL,rasZ
GETQX perpL
ADD rasL,stepL
ADD rasZ,stepZ
WAITX #56-8-2
.loop
' loop is left with one last CORDIC operation still in progress (its result goes unused)
TODO more on this and verify the example
Memory access
With 3 different memory types (4 if external PSRAM is considered), it is not always obvious which one should be used. The following table tries (and likely fails) to summarize the strengths and weaknesses in short:
| | Cog RAM | LUT RAM | Hub RAM | Ext. PSRAM (on P2EDGE) |
|---|---|---|---|---|
| Size | 2016 bytes (as 504 longs) | 2048 bytes (as 512 longs) | 512 KiB | 32 MiB |
| Address granularity | 32 bits (16, 8, 4 or 1 bit with special instructions) | 32 bits | 8 bits | 32 bits |
| Special Feature | Can be directly used as operands. | Can be shared between pairs of Cogs (SETLUTS). Can hold streamer lookup data. Can hold XBYTE tables. | Fast block transfers. Fast FIFO interface. Byte-masked writes (WMLONG). | HUGE! |
| Code Execution | YES | YES | YES (slow branches, FIFO tied up) | NO |
| Random Read | 4 cycles (with MOV or GETBYTE/etc.); 2 cycles (ALTS immediately consumed) | 3 cycles (RDLUT) | 9..17 cycles (RDLONG); 2 cycles (FIFO); 4+(13) cycles (RDFAST random access trick, see below) | ~100 cycles (depends on implementation) |
| Random Write | 4 cycles (with MOV or SETBYTE/etc.); 2 cycles (ALTD/ALTR from operation) | 2 cycles (WRLUT) | 3..11 cycles (WRLONG); 2 cycles (FIFO); 4 cycles (RDFAST random access trick, see below) | ~80 cycles (depends on implementation) |
| Block copy to Cog RAM | SLOW: 4*N cycles (REP/ALTI/MOV); 2*N cycles (Hub round-trip) | SLOW: 5*N cycles (REP/ALTD/RDLUT); 2*N cycles (Hub round-trip) | FAST: 1*N cycles (SETQ+RDLONG) | N/A |
| Block copy to LUT RAM | SLOW: 4*N cycles (REP/ALTS/WRLUT); 2*N cycles (Hub round-trip) | SLOW: 5*N cycles (REP/RDLUT/WRLUT); 2*N cycles (Hub round-trip) | FAST: 1*N cycles (SETQ2+RDLONG) | N/A |
| Block copy to Hub RAM | FAST: 1*N cycles (SETQ+WRLONG) | FAST: 1*N cycles (SETQ2+WRLONG) | SLOW (must round-trip through Cog/LUT, 2*N cycles asymptotically) | 4*N cycles (Streamer DMA) |
Note 1: Block copy speeds are given for N longs and without setup overhead. All copies involving Hub RAM incur the usual latency, etc.
This really doesn't tell the whole story, but hopefully hints at the complexity enough for you to start forming your own mental model.
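As an example of the fast path in the table, a minimal sketch of a Hub-to-Cog block read; buffer is a placeholder label for a run of Cog registers and hubaddr a register holding a Hub byte address:
SETQ #16-1 ' number of longs to transfer, minus one
RDLONG buffer,hubaddr ' block read into buffer..buffer+15, roughly one long per cycle after setup
Using SETQ2 instead of SETQ targets LUT RAM, and the same pattern with WRLONG writes blocks back out to Hub RAM.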
Avoid Hub stacks
CALLA and RETA (and their PTRB equivalents, CALLB and RETB) should be avoided: they push and pop the return address through a stack in Hub RAM, so every call and return pays the Hub access latency. Prefer the internal 8-level hardware stack (CALL/RET) or a register link (CALLD) where possible.
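A minimal sketch of the preferred pattern, using the internal hardware stack; do_work, counter and result are placeholders:
CALL #do_work ' return address is pushed onto the hardware stack, not into Hub RAM
' ...execution continues here after the return...
do_work
ADD counter,#1 ' (placeholder body)
_RET_ MOV result,counter ' pop the hardware stack and return, folded into the last instruction
The hardware stack is only 8 levels deep, so very deep call chains may still need a stack in memory; if so, keeping it in LUT RAM (WRLUT/RDLUT through a pointer register) is still much cheaper than a Hub stack.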
Use the FIFO!
The FIFO allows quick reading or writing of sequential data streams in Hub RAM by buffering the data and thus hiding the bus latency. It should be used for such streams whenever possible. Do keep in mind that branching to code in Hub RAM (HUBEXEC) always resets the FIFO state!
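A minimal sketch of a sequential read through the FIFO; src, val and checksum are placeholder registers:
RDFAST #0,src ' start the FIFO streaming from Hub address src (blocking form: waits until ready)
REP @.rdloop,#64
RFLONG val ' 2 cycles per long once the FIFO is streaming
ADD checksum,val ' (placeholder processing)
.rdloop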
RFVAR
The RFVAR and RFVARS instructions allow natively reading a variable-length integer format from the FIFO. There are no corresponding write instructions. Do with that knowledge what you will.
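A minimal, hedged sketch of reading such a value; msg, count and first are placeholder registers, and the data is assumed to already be stored in the matching encoding:
RDFAST #0,msg ' the FIFO must be streaming from the data first
RFVAR count ' decode a 1..4 byte variable-length value into count
RFBYTE first ' fixed-size reads continue right after the variable-length field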
FIFO random access trick
While the FIFO is designed for sequential access, the non-blocking mode of RDFAST/WRFAST can accelerate even random access beyond the regular Hub access instructions. FIFO operations also have fully stable timing, so it's easier to time them alongside, e.g., CORDIC operations.
The key is to do other work while the FIFO performs memory access in the background. Here this is illustrated by WAITX instructions, but to reap the benefits, this time must be filled with useful non-memory instructions. (Other Hub RAM access is not allowed! See also: RDFAST Startup Bug). Exceeding the necessary wait time is fine and causes no problems.
Writing is easy:
WRFAST bit31, address
WFBYTE data
WAITX #10-2 ' wait for it to get flushed (TODO: verify minimum safe timing)
Reading is trickier, since reading the data too early will cause zeroes to be read:
RDFAST bit31, address
WAITX #13-2 ' wait for FIFO to become ready, 13 cycles.
RFBYTE data
In either case a constant (named bit31 by convention) must be created in Cog RAM:
bit31 long 1<<31
This trick has proven essential for super-heavy random read use cases such as texture mapping.
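To give a flavour of how this combines with the CORDIC advice above, a hedged sketch that hides the FIFO startup time behind a CORDIC multiply; texaddr, stride, u, scale, u_scaled and texel are placeholder registers:
RDFAST bit31,texaddr ' non-blocking: the FIFO starts fetching in the background
QMUL u,scale ' submit CORDIC work in the meantime
ADD texaddr,stride ' ...more useful non-Hub work...
GETQX u_scaled ' waiting for the CORDIC result also covers the FIFO's 13-cycle startup
RFBYTE texel ' by now the FIFO is ready: a 2-cycle read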
Hub slice alignment
When you do need to use the random Hub RAM access instructions, it is possible to minimize wait cycles by spacing the instructions appropriately according to the relative slice alignment of the addresses accessed.
TODO: finally figure this out