Assembler on Raspberry Pico and instruction timing measurement

Although there is plenty of documentation available for the Raspberry Pico, the assembly language instruction set can be confusing and unclear. In this article, we will summarize the Raspberry Pico processor instructions and measure their timing.

Processor

The RP2040 processor used in the Raspberry Pico is a dual-core Cortex-M0+ implementing the ARMv6-M architecture, running at 125 MHz as standard (guaranteed frequency 133 MHz). The instruction set is Thumb-1, with a few added instructions from Thumb-2. So you won't find 32-bit ARM instructions here, and unfortunately not even the if-then (IT) conditional block instruction, which belongs to the full Thumb-2 instruction set (ARMv6T2).

The original ARM processors use a 32-bit ARM instruction set (each instruction is 32 bits long). Thumb-1, a limited set of 16-bit instructions, was derived from it for smaller processors; functionally it is a subset of the ARM instructions. If a processor supports both instruction sets, it is possible to switch between them through bit 0 of the address loaded into the PC register. With bit 0 set, the processor interprets Thumb instructions; with bit 0 cleared, it uses the ARM instruction set.

The Raspberry Pico processor supports only the Thumb instruction set. Therefore, bit 0 of the PC register must always be set. Attempting to switch to the ARM instruction set (by jumping to an address with bit 0 cleared) will result in a hard fault.

Registers

R0..R12 ... registers for general use. In most Thumb-1 instructions, only registers R0..R7 are supported.

Some registers are used by the C compiler for special purposes. Register R12 is the Intra Procedure Call Scratch Register (IPC). Register R11 is used as the Frame Pointer FP. Register R10 is the Stack Limit SL. Register R9 is used as Static Base SB.

R13 (SP) ... Stack Pointer. The processor has two stack pointers. MSP is the Main Stack Pointer; it is used when the processor starts and during interrupts. PSP is the Process Stack Pointer, used for processes in a multitasking system. The stack pointer currently mapped to R13 (SP) is selected in the CONTROL register. The stack grows downward and SP points to its lowest occupied address, i.e. the most recently stored item. When a register is pushed onto the stack, the SP value is decremented.

R14 (LR) ... Link Register. During subroutine calls in ARM processors, the return address is not stored in the stack, but in the link register LR. When returning from a subroutine, a jump is made back to the contents of the link register. This has the advantage of faster execution of the subroutine because the stack does not have to be manipulated. However, if subroutines are called in greater depth, it is necessary to ensure that the LR register is preserved and restored. In most cases, a free register can be used, and so even in this case the subroutine call can be fast.

R15 (PC) ... Program Counter. The PC program counter points to the current address in program memory. Instructions in Thumb mode must be aligned to a multiple of 2 (i.e., an even address) because each instruction takes up 2 or 4 bytes. Therefore, the lowest unused bit of the PC register is used to indicate whether the ARM instruction set (bit 0 is cleared) or the Thumb instruction set (bit 0 is set) is being used in program processing. If you read the PC register in Pico, bit 0 will always be set.

PSR ... Program Status Register is a special register containing the processor state. It is accessed through 3 subregisters: the APSR (Application Program Status Register, contains flags), the IPSR (Interrupt Program Status Register) and the EPSR (Execution Program Status Register).

PRIMASK ... is the interrupt masking register. In Pico it contains only the lowest bit 0, indicating global interrupt disable.

CONTROL ... is the processor control register. It sets the privileged mode and stack pointer mode.

When a function is called in gcc, registers R0 to R3 carry the input arguments to the function, and the return value comes back in R0 (R0 and R1 for a 64-bit value). If more input arguments are needed, the additional arguments are passed via the stack.

The called function must preserve the contents of registers R4 to R11 and register R13 (SP). Conversely, the function does not have to preserve registers R0 to R3, R12 (IPC) and R14 (LR). In particular, register R14 (LR) must be watched: a function that calls another function has to save its own LR first, so that it can still return to its caller.

Flags

N, bit 31 ... Negative. The flag is a copy of bit 31 (the highest bit) of the operation result and indicates a negative number.

Z, bit 30 ... Zero. If flag Z is set to 1, it indicates a zero result of the operation (or a match in a comparison operation).

C, bit 29 ... Carry. Flag C set to 1 indicates an overflow of the result of an unsigned operation.

V, bit 28 ... Overflow. Flag V set to 1 indicates an overflow of the result of a signed operation.

Let us consider flags C and V in more detail. Think of the flags as extending the registers by 1 high bit. The carry flag C is set when this extension bit is set. When adding with an input carry C, only the final result matters, not the intermediate steps. Example:

[0] 0xFFFFFFFE ... first operand
+ [0] 0x00000001 ... the second operand, adding it would be the intermediate result [0] 0xFFFFFFFF
+ carry C=1 ... addition of input carry

The result is [1] 0x00000000. The extension bit is set, so the output flag C is 1.
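The "one extra high bit" view of the carry flag can be tried out on a PC. The following is only a minimal sketch in plain C, not code from the measurement program; the names adc_result and adc32 are made up for this illustration. It performs the addition in 64 bits so that the extension bit becomes visible.

```c
#include <stdint.h>

/* Hypothetical helper type: result of a 32-bit add with carry (like ADCS). */
typedef struct {
    uint32_t result;  /* low 32 bits of the sum */
    unsigned carry;   /* extension bit: 1 if the sum did not fit into 32 bits */
} adc_result;

/* Adds a + b + carry_in and reports the carry out of bit 31. */
static adc_result adc32(uint32_t a, uint32_t b, unsigned carry_in)
{
    /* Do the whole sum in 64 bits, so only the final result decides the carry. */
    uint64_t wide = (uint64_t)a + (uint64_t)b + (uint64_t)(carry_in & 1u);
    adc_result r;
    r.result = (uint32_t)wide;
    r.carry = (unsigned)((wide >> 32) & 1u);
    return r;
}
```

Calling adc32(0xFFFFFFFE, 0x00000001, 1) reproduces the example: the intermediate sum has no carry, but the final result is 0x00000000 with carry 1.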

When subtracting, the situation is different. In ARM processors, subtraction is implemented as adding the inverted value (NOT) plus 1, i.e., adding a negative operand. Thus, when flag C is set, it does not mean an underflow of the result below 0 (i.e., the "borrow" flag), but an inverted underflow flag. Example:

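When subtracting, we first bitwise invert the second operand and change the operation to addition. Take the subtraction 2 - 1:

[0] 0x00000002 ... first operand
+ [0] 0xFFFFFFFE ... the inverted second operand (NOT 0x00000001), adding it would be the intermediate result [1] 0x00000000
+ 1 ... the added 1, the final result is [1] 0x00000001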

As you can see, the result of the operation is the correct value 1, but the C flag has been set. For subtraction, we have to read C as an inverted Borrow flag. If the operation also takes the input carry C, it is added in the same way as for addition, and it does not matter that it has the meaning of an inverted underflow.
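The inverted-borrow behaviour is easy to check on a PC. This is a minimal sketch in plain C under the same "invert and add 1" assumption as described above; sub_result and sub32 are hypothetical names used only here.

```c
#include <stdint.h>

/* Hypothetical helper type: result of a 32-bit subtract (like SUBS). */
typedef struct {
    uint32_t result;  /* low 32 bits of a - b */
    unsigned carry;   /* C flag: 1 = no borrow, 0 = borrow occurred */
} sub_result;

/* Computes a - b as a + NOT(b) + 1 and returns the carry out of bit 31. */
static sub_result sub32(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a + (uint64_t)(uint32_t)~b + 1u;
    sub_result r;
    r.result = (uint32_t)wide;
    r.carry = (unsigned)((wide >> 32) & 1u);
    return r;
}
```

For 2 - 1 the sketch returns the result 1 with C = 1 (no borrow). For 1 - 2 it returns 0xFFFFFFFF with C = 0, which is the case a classic Borrow flag would report as set.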

To understand the overflow flag V, we can again extend the operands by one high bit, this time a copy (duplicate) of the operand's sign bit. The overflow flag V will be set if the result overflows out of range, which happens when the extension bit differs from the sign bit of the result. Example:

[1] 0x80000000 ... the first operand, this is the most negative number -2147483648
+ [1] 0xFFFFFFFF ... the second operand, the number -1

The result is [1] 0x7FFFFFFF, i.e. the number -2147483649. The extension bit and the sign bit differ, so the result lies outside the signed number range and therefore the overflow flag V will be set.
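The rule "V is set when the extension bit differs from the sign bit" can be sketched in plain C for a PC. The function name add_sets_v is an assumption made for this illustration, and the conversion from uint32_t to int32_t relies on the usual two's complement behaviour of common compilers.

```c
#include <stdint.h>

/* Hypothetical helper: signed 32-bit addition that reports the V flag.
   Stores the low 32 bits of the sum in *result and returns 1 on overflow. */
static unsigned add_sets_v(uint32_t a, uint32_t b, uint32_t *result)
{
    /* Sign-extend both operands, so bit 32 and above hold copies of the sign. */
    int64_t wide = (int64_t)(int32_t)a + (int64_t)(int32_t)b;
    uint64_t bits = (uint64_t)wide;
    unsigned extension = (unsigned)((bits >> 32) & 1u); /* extension bit */
    unsigned sign = (unsigned)((bits >> 31) & 1u);      /* sign bit of the 32-bit result */
    *result = (uint32_t)bits;
    return extension ^ sign; /* different bits mean the result is out of range */
}
```

Adding 0x80000000 (-2147483648) and 0xFFFFFFFF (-1) gives the 32-bit result 0x7FFFFFFF with V = 1, while 1 + 1 leaves V = 0.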

Instructions

For each instruction, the table lists the number of clock cycles T and the duration of the instruction in ns on a Raspberry Pico running at the default 125 MHz. For the Load and Store instructions, the time depends on the memory being accessed. When accessing regular RAM, the time is typically 2 clock cycles. When accessing the SIO registers (GPIO), the access time is 1 clock cycle. When accessing some registers and external Flash, the time may be longer due to additional wait states.
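The ns values in the table follow directly from the clock frequency. As a minimal sketch (plain C, the function name cycles_to_ns is an assumption for this illustration), the conversion with rounding to the nearest ns looks like this:

```c
#include <stdint.h>

/* Hypothetical helper: duration of `cycles` clock cycles in ns,
   for a CPU clock given in kHz, rounded to the nearest ns. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t khz)
{
    /* 1 cycle lasts 1000000 / khz ns; 64-bit math avoids overflow. */
    return (uint32_t)(((uint64_t)cycles * 1000000u + khz / 2u) / khz);
}
```

At 125000 kHz this gives 8 ns for 1 cycle, 16 ns for 2 cycles and 24 ns for 3 cycles, matching the values listed in the Timing column.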

If an "S" appears at the end of the instruction name, it means that the instruction affects flags (does not apply to CMP and TST instructions, which always affect flags).

Unless otherwise specified, the instruction supports only registers R0..R7. The extended register set R0..R15 is supported only by the instructions MOV Rd,Rm, MOV PC,Rm, ADD Rd,Rm, ADD PC,Rm, CMP Rn,Rm, BX Rm, BLX Rm, MRS Rd,<reg>, and MSR <reg>,Rd.

Subtract instructions use Carry as the inverse of the Borrow flag.

When writing assembler instructions, the # character precedes a numeric constant.

Instructions must not access an unaligned address (otherwise they will hard-fault) - that is, 16-bit access instructions must not access an odd address, and 32-bit access instructions must only access a 32-bit aligned address.
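The alignment rule can be expressed as a simple bit test. The following plain C sketch uses a made-up function name, access_is_aligned, and assumes access sizes of 1, 2 or 4 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if an access of size_bytes (1, 2 or 4)
   to the given address is aligned and would not hard-fault. */
static bool access_is_aligned(uint32_t address, uint32_t size_bytes)
{
    /* For a power-of-two size, the low bits of the address must be zero. */
    return (address & (size_bytes - 1u)) == 0u;
}
```

For example, a 16-bit access to 0x20000001 fails the test, and a 32-bit access to 0x20000002 fails as well, while byte accesses are always allowed.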

In the instructions, 32 bits is referred to as a word, 16 bits is a half-word, and 8 bits is a byte.

Group	Instruction	Timing T	Flags	Meaning	Note
Move	MOVS Rd,#<imm>	1 (8ns)	N Z - -	Rd <- imm	imm is a constant 0..255
	MOVS Rd,Rm	1 (8ns)	N Z - -	Rd <- Rm
	MOV Rd,Rm	1 (8ns)	- - - -	Rd <- Rm	Registers R0..R15. If Rd=PC, Rm must not be PC. If Rd=SP, Rm must not be SP.
	MOV PC,Rm	2 (16ns)	- - - -	PC <- Rm	Registers R0..R14. Performs a jump to the address in register Rm. Bit 0 of the address is ignored.
Add	ADDS Rd,Rn,#<imm>	1 (8ns)	N Z C V	Rd <- Rn + imm	imm is a constant 0..7
	ADDS Rd,Rn,Rm	1 (8ns)	N Z C V	Rd <- Rn + Rm
	ADD Rd,Rm	1 (8ns)	- - - -	Rd <- Rd + Rm	Registers R0..R14
	ADD PC,Rm	2 (16ns)	- - - -	PC <- PC + Rm	Registers R0..R14. Performs a jump relative to the next address + 2 (e.g. at #0 skips 1 following instruction). Bit 0 of the resulting address is ignored.
	ADDS Rd,#<imm>	1 (8ns)	N Z C V	Rd <- Rd + imm	imm is a constant 0..255
	ADCS Rd,Rm	1 (8ns)	N Z C V	Rd <- Rd + Rm + C	sum with carry
	ADD SP,#<imm>	1 (8ns)	- - - -	SP <- SP + imm	imm is a constant 0..508, must be a multiple of 4
	ADD Rd,SP,#<imm>	1 (8ns)	- - - -	Rd <- SP + imm	imm is a constant 0..1020, must be a multiple of 4
	ADR Rd,<label>	1 (8ns)	- - - -	Rd <- label	label is in range PC..PC+1020. Address must be aligned to 32 bits (i.e. a multiple of 4). It is replaced by the ADD Rd,PC,#<imm> instruction during compilation.
	ADD Rd,PC,#<imm>	1 (8ns)	- - - -	Rd <- PC + imm	imm is a constant 0..1020, it must be a multiple of 4. PC has the value of the following address + 2 aligned down to a multiple of 4.
Subtract	SUBS Rd,Rn,Rm	1 (8ns)	N Z C V	Rd <- Rn - Rm
	SUBS Rd,Rn,#<imm>	1 (8ns)	N Z C V	Rd <- Rn - imm	imm is a constant 0..7
	SUBS Rd,#<imm>	1 (8ns)	N Z C V	Rd <- Rd - imm	imm is a constant 0..255
	SBCS Rd,Rm	1 (8ns)	N Z C V	Rd <- Rd - Rm - not C
	SUB SP,#<imm>	1 (8ns)	- - - -	SP <- SP - imm	imm is a constant 0..508, must be a multiple of 4
Negate	NEGS Rd,Rn	1 (8ns)	N Z C V	Rd <- 0 - Rn	negation, synonym of RSBS Rd,Rn,#0 instruction
	RSBS Rd,Rn,#0	1 (8ns)	N Z C V	Rd <- 0 - Rn	negation, synonym of NEGS Rd,Rn instruction
Multiply	MULS Rd,Rm	1 (8ns)	N Z - -	Rd <- Rd * Rm	C and V flags remain unchanged (they are undefined in older ARM versions)
Compare	CMP Rn,Rm	1 (8ns)	N Z C V	Rn - Rm ?	Registers R0..R15.
	CMN Rn,Rm	1 (8ns)	N Z C V	Rn + Rm ?	compare negative, i.e. comparison of Rn with the negation of Rm
	CMP Rn,#<imm>	1 (8ns)	N Z C V	Rn - imm ?	imm is a constant 0..255
Logical	ANDS Rd,Rm	1 (8ns)	N Z - -	Rd <- Rd and Rm
	EORS Rd,Rm	1 (8ns)	N Z - -	Rd <- Rd xor Rm	exclusive or
	ORRS Rd,Rm	1 (8ns)	N Z - -	Rd <- Rd or Rm
	BICS Rd,Rm	1 (8ns)	N Z - -	Rd <- Rd and not Rm	bit clear
	MVNS Rd,Rm	1 (8ns)	N Z - -	Rd <- not Rm	move not
	TST Rn,Rm	1 (8ns)	N Z - -	Rn and Rm ?	AND test
Shift	LSLS Rd,Rm,#<shift>	1 (8ns)	N Z C -	Rd <- Rm << shift	Logical shift to the left. shift is a constant 0..31. 0 is inserted into the lower bits. Carry is the last bit shifted out of the top.
	LSLS Rd,Rs	1 (8ns)	N Z C -	Rd <- Rd << Rs	Logical shift to the left. Bits 0..7 in Rs contain the number of shifts. 0 is inserted into the lower bits. Carry is the last bit shifted out of the top.
	LSRS Rd,Rm,#<shift>	1 (8ns)	N Z C -	Rd <- Rm >> shift	Logical shift to the right. shift is a constant 1..32. 0 is inserted into the upper bits. Carry is the last bit shifted out of the bottom.
	LSRS Rd,Rs	1 (8ns)	N Z C -	Rd <- Rd >> Rs	Logical shift to the right. Bits 0..7 in Rs contain the number of shifts. 0 is inserted into the upper bits. Carry is the last bit shifted out of the bottom.
	ASRS Rd,Rm,#<shift>	1 (8ns)	N Z C -	Rd <- Rm >> shift	Arithmetic shift to the right. shift is a constant 1..32. Bit 31 (sign) is replicated into the upper bits. Carry is the last bit shifted out of the bottom.
	ASRS Rd,Rs	1 (8ns)	N Z C -	Rd <- Rd >> Rs	Arithmetic shift to the right. Bits 0..7 in Rs contain the number of shifts. Bit 31 (sign) is replicated into the upper bits. Carry is the last bit shifted out of the bottom.
Rotate	RORS Rd,Rs	1 (8ns)	N Z C -	Rd <- Rd >> Rs, wrap	Rotation to the right. Bits 0..7 in Rs contain the number of shifts. The bits shifted out of the bottom are moved into the upper bits. Carry is the last bit shifted out of the bottom.
Load	LDR Rd,[Rn,#<imm>]	1-2 (8-16ns)	- - - -	Rd <- [Rn + imm]	imm is offset 0..124, must be a multiple of 4.
	LDRH Rd,[Rn,#<imm>]	1-2 (8-16ns)	- - - -	Rd <- [Rn + imm] [16]	imm is offset 0..62, multiple of 2. Reads 16 bits, fills the upper 16 bits with 0.
	LDRB Rd,[Rn,#<imm>]	1-2 (8-16ns)	- - - -	Rd <- [Rn + imm] [8]	imm is offset 0..31. Reads 8 bits, fills the upper 24 bits with 0.
	LDR Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd <- [Rn + Rm]
	LDRH Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd <- [Rn + Rm] [16]	It reads 16 bits. Fills the upper 16 bits with 0.
	LDRSH Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd <- [Rn + Rm] [16]	It reads 16 bits. The upper 16 bits are sign-extended from bit 15.
	LDRB Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd <- [Rn + Rm] [8]	It reads 8 bits. Fills the upper 24 bits with 0.
	LDRSB Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd <- [Rn + Rm] [8]	It reads 8 bits. The upper 24 bits are sign-extended from bit 7.
	LDR Rd,<label>	1-2 (8-16ns)	- - - -	Rd <- [label]	Loads 32-bits from the specified address. The address must be in the range PC..PC+1020 and must be aligned to 4 bytes.
	LDR Rd,[SP,#<imm>]	1-2 (8-16ns)	- - - -	Rd <- [SP + imm]	imm is a constant 0..1020, it must be a multiple of 4.
	LDM Rn!,{<reglist>}	1+N (8+N*8ns)	- - - -	load <reglist>	Reads N registers according to reglist, from base address Rn, excluding register Rn. Register Rn will be incremented after the operation. The registers must be listed in ascending order in the list. Another name for the instruction is LDMIA (Load Multiple, Increment After).
	LDM Rn,{<reglist>}	1+N (8+N*8ns)	- - - -	load <reglist>	Reads N registers according to reglist, from base address Rn, including register Rn. Register Rn must be specified in the register list. The registers must be listed in ascending order. Another name for the instruction is LDMIA (Load Multiple, Increment After).
Store	STR Rd,[Rn,#<imm>]	1-2 (8-16ns)	- - - -	Rd -> [Rn + imm]	imm is offset 0..124, it must be a multiple of 4.
	STRH Rd,[Rn,#<imm>]	1-2 (8-16ns)	- - - -	Rd -> [Rn + imm] [16]	imm is offset 0..62, multiple of 2. Stores 16 bits.
	STRB Rd,[Rn,#<imm>]	1-2 (8-16ns)	- - - -	Rd -> [Rn + imm] [8]	imm is offset 0..31. It stores 8 bits.
	STR Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd -> [Rn + Rm]
	STRH Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd -> [Rn + Rm] [16]	Stores 16 bits.
	STRB Rd,[Rn,Rm]	1-2 (8-16ns)	- - - -	Rd -> [Rn + Rm] [8]	Stores 8 bits.
	STR Rd,[SP,#<imm>]	1-2 (8-16ns)	- - - -	Rd -> [SP + imm]	imm is a constant 0..1020, it must be a multiple of 4.
	STM Rn!,{<reglist>}	1+N (8+N*8ns)	- - - -	store <reglist>	Stores N registers according to reglist, from base address Rn, excluding register Rn. Register Rn will be incremented after the operation. The registers must be listed in ascending order in the list. Another name for the instruction is STMIA (Store Multiple, Increment After).
Push	PUSH {<reglist>}	1+N (8+N*8ns)	- - - -	push <reglist>	Stores the registers into the stack according to the list.
	PUSH {<reglist>,LR}	1+N (8+N*8ns)	- - - -	push <reglist,LR>	Stores the registers onto the stack according to the list, including the LR register.
Pop	POP {<reglist>}	1+N (8+N*8ns)	- - - -	pop <reglist>	Restores registers from the stack according to the list.
	POP {<reglist>,PC}	1+N (8+N*8ns)	- - - -	pop <reglist,PC>	Restores registers from the stack according to the list, including the PC register. This jumps to the address originally saved from LR. The address loaded into PC must have its lowest bit set to 1 (otherwise hardfault).
Branch	B<cc> <label>	1-2 (8-16ns)	- - - -	if (cc) then PC <- label	Jump to the given label if the condition is met. The label must lie within -256..+254 bytes of the address of the next instruction + 2. When the condition is met (and the jump is taken), the instruction takes 2 clock cycles. If it is not met (and execution continues), it takes 1 clock cycle.
	B <label>	2 (16ns)	- - - -	PC <- label	Unconditional jump. Label must be within a range of max. +- 2 KB.
	BL <label>	3 (24ns)	- - - -	LR <- PC, PC <- label	Saves the following address into LR and jumps to the label. The label must be in the range +- 16 MB.
	BX Rm	2 (16ns)	- - - -	PC <- Rm	Indirect jump to an address from the Rm register. Bit 0 of the address must be set (otherwise hardfault). Registers R0..R15.
	BLX Rm	2 (16ns)	- - - -	LR <- PC, PC <- Rm	Storage of the following address into LR and indirect jump to the address from the Rm register. Bit 0 of the address must be set (otherwise hardfault). Registers R0..R15.
Extend	SXTH Rd,Rm	1 (8ns)	- - - -	Rd <- Rm [16]	Extends the lower 16 bits from Rm to Rd, fills the upper 16 bits with a sign bit.
	SXTB Rd,Rm	1 (8ns)	- - - -	Rd <- Rm [8]	Extends the lower 8 bits from Rm to Rd, fills the upper 24 bits with a sign bit.
	UXTH Rd,Rm	1 (8ns)	- - - -	Rd <- Rm [16]	Extends the lower 16 bits from Rm to Rd, clears the upper 16 bits.
	UXTB Rd,Rm	1 (8ns)	- - - -	Rd <- Rm [8]	Extends the lower 8 bits from Rm to Rd, clears the upper 24 bits
Reverse	REV Rd,Rm	1 (8ns)	- - - -	Rd <- revb Rm	Reverses the byte order of the Rm register
	REV16 Rd,Rm	1 (8ns)	- - - -	Rd <- revh Rm	Swaps the two bytes within each half-word of Rm (b[0] <-> b[1], b[2] <-> b[3])
	REVSH Rd,Rm	1 (8ns)	- - - -	Rd <- revsh Rm	Reverses the byte order of the lower half of Rm and extends the sign to the upper half.
State	SVC #<imm>	-	- - - -		Supervisor Call, imm is a constant 0..255
	CPSID i	1 (8ns)	- - - -		Global interrupt disable
	CPSIE i	1 (8ns)	- - - -		Global interrupt enable
	MRS Rd,<reg>	3 (24ns)	- - - -		Read the system register. Registers R0..R15
	MSR <reg>,Rd	3 (24ns)	- - - -		Write to the system register. Registers R0..R15
	BKPT #<imm>	-	- - - -		Breakpoint. imm is a constant 0..255.
Hint	SEV	1 (8ns)	- - - -		Send event.
	WFE	2 (16ns)	- - - -		Waiting for the event. The time can be extended until the event arrives.
	WFI	2 (16ns)	- - - -		Waiting for an interrupt. The time can be extended until the interrupt arrives.
	YIELD	1 (8ns)	- - - -		Hint for multithreaded software that the current task can give up the processor to another task.
	NOP	1 (8ns)	- - - -		No operation
Barriers	ISB	3 (24ns)	- - - -		Instruction synchronization
	DMB	3 (24ns)	- - - -		Data memory barrier
	DSB	3 (24ns)	- - - -		Data synchronization barrier
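The Extend and Reverse instructions are easier to remember with a reference model at hand. The following is only a sketch in plain C for a PC, written from the descriptions in the table; the function names simply mirror the instruction mnemonics and are not part of any SDK.

```c
#include <stdint.h>

/* REV: reverses the order of all four bytes. */
static uint32_t rev(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: swaps the two bytes inside each half-word. */
static uint32_t rev16(uint32_t x)
{
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

/* REVSH: swaps the bytes of the lower half-word and sign-extends it. */
static uint32_t revsh(uint32_t x)
{
    uint32_t swapped = ((x >> 8) & 0xFFu) | ((x & 0xFFu) << 8);
    return (swapped & 0x8000u) ? (swapped | 0xFFFF0000u) : swapped;
}

/* SXTB: keeps the lower 8 bits and fills the upper 24 bits with the sign bit. */
static uint32_t sxtb(uint32_t x)
{
    uint32_t low = x & 0xFFu;
    return (low & 0x80u) ? (low | 0xFFFFFF00u) : low;
}
```

With the input 0x12345678, rev gives 0x78563412 and rev16 gives 0x34127856. With the input 0x12345680, revsh gives 0xFFFF8056, because the swapped lower half 0x8056 is negative.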

Measuring instruction timings

Yes, the datasheet lists the clock cycles, how long each instruction takes. But still, there is nothing like "feeling" the instructions :-), checking the timings and trying out the function of the instructions.

We can't measure the execution time of a single instruction directly in clock cycles or nanoseconds; we don't have precise enough means for that. Instead, we will measure the total time of many repeated executions of the same instruction. For this we need a precise clock. The Pico has a 64-bit counter that increments every 1 us. In the Pico SDK, we read its value using the time_us_64 function. When measuring a time interval, we record the counter state both at the beginning and at the end. The difference gives the elapsed time in us.

Here it is important to note that the timer runs freely and wraps around only when its full 64-bit range overflows. Even if the timer wraps around, we can still find the elapsed interval by subtracting the start state from the end state. This is an advantage over a counter that resets at some programmed upper bound; with such a counter, we could not measure time without handling the wrap explicitly. For our measurement, even the lower 32 bits are enough. We can make the measurement with the following loop:

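int i;
uint32_t t = (uint32_t)time_us_64();   // start time, lower 32 bits only
for (i = 1000; i > 0; i--) fnc();      // call the measured function 1000 times
t = (uint32_t)time_us_64() - t;        // elapsed time in us (wrap-around safe)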

The measured function is repeated 1000 times. Because the loop time is measured in us and the loop runs 1000 times, the measured number directly represents the time of one loop step in ns.
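That the subtraction still works when the lower 32 bits of the counter wrap around can be checked on a PC without any hardware. This is a minimal sketch in plain C; elapsed_us is a made-up name and the timer values are invented for the demonstration.

```c
#include <stdint.h>

/* Hypothetical helper: elapsed time between two readings of the lower
   32 bits of a free-running microsecond counter. */
static uint32_t elapsed_us(uint32_t start, uint32_t end)
{
    /* Unsigned subtraction is performed modulo 2^32, so a single
       wrap-around between the two readings cancels out. */
    return end - start;
}
```

For a start reading of 0xFFFFFFF0 and an end reading of 0x00000010 the function returns 0x20, i.e. 32 us, even though the end value is numerically smaller than the start value.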

At this point, it would be impractical to insert the instruction under test directly into the loop. The loop has some overhead, and contains several instructions that would distort our result. First, we insert an empty function into the loop and measure the base loop time without the added instructions, this time will later be subtracted from the measured time.

We will use a function written in assembler as the measured function, which will contain a sequence of 100 measured instructions. This will minimize the effects of the surrounding instructions needed for overhead. We will write the instructions in assembler source code. Of course, we don't have to write the instruction 100 times, because the assembler has a directive for repeating a block (.rept in the GNU assembler). The code to measure the instruction "MOVS Rd,#<imm>" is then just this instruction repeated 100 times, followed by the return from the subroutine.

We can certainly assume that with a different register and a different constant the timing of the instruction will not change. The instruction is executed 100 times. It is a subroutine, so there must be a BX LR instruction at the end, to ensure a return from the subroutine. As we can see from the datasheet (and as we will verify later), the BX LR return instruction takes 2 clock cycles. The instructions to service the loop and call the subroutine take another 7 clock cycles. So we subtract a total of 9 clock cycles from the measured time of one loop step. But even if we didn't subtract them, the 100 steps of the tested instruction ensure that it would be a small inaccuracy that is lost by rounding.

Thus, we have measured the time of the entire loop in us. For 1000 loop steps, this represents the time of one loop step in ns. We convert this to CPU clock cycles by multiplying by the CPU frequency, so clk = ns/1000000*khz (khz is the CPU frequency in kHz, which we either entered manually or found using the frequency_count_khz function). We subtract the 9 clock cycles of overhead from the clock cycles of one loop step and divide the obtained value by the number of instructions tested, i.e. 100. This gives the time of one instruction in clock cycles.
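The conversion can be sketched in plain C as follows. This is not the code of the measurement program; the function and constant names are assumptions made for the illustration, and only the values 9 (overhead in clock cycles) and 100 (instructions per call) are taken from the text above.

```c
#include <stdint.h>

#define LOOP_OVERHEAD_CLK 9u   /* call, loop service and BX LR, from the text */
#define INSTR_PER_CALL    100u /* measured instructions in one function call */

/* Hypothetical helper: clock cycles of one loop step, clk = ns/1000000*khz,
   rounded to the nearest cycle. */
static uint32_t loop_clocks(uint32_t loop_ns, uint32_t khz)
{
    return (uint32_t)(((uint64_t)loop_ns * khz + 500000u) / 1000000u);
}

/* Hypothetical helper: clock cycles per measured instruction, rounded. */
static uint32_t clocks_per_instruction(uint32_t loop_ns, uint32_t khz)
{
    uint32_t clk = loop_clocks(loop_ns, khz);
    if (clk < LOOP_OVERHEAD_CLK) return 0; /* implausible measurement */
    return (clk - LOOP_OVERHEAD_CLK + INSTR_PER_CALL / 2u) / INSTR_PER_CALL;
}
```

As an invented input, a loop step of 872 ns at 125000 kHz corresponds to 109 clock cycles, which yields 1 clock cycle per instruction; 1672 ns corresponds to 209 clock cycles and 2 clock cycles per instruction.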

Of course, not all instructions can simply be placed in a test loop. The instructions really are executed when the function is called, so we need to prepare the program so that the instruction is executed and the program still runs correctly. For example, if we are testing the PUSH instruction, we may execute it 100 times, but at the end of the function we will include the ADD SP,#400 instruction to ensure that the stack pointer is restored. We can also repeat a sequence of multiple instructions if necessary. For example, when testing the LDM instruction, we must repeatedly set the base register using the LDR instruction. But this is not a problem because we know the timing of the auxiliary LDR instruction from before, so we can subtract it from the result.
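The stack pointer bookkeeping of the PUSH test can be checked with a tiny model. This is only a sketch in plain C for a PC; the function names and the starting address are invented, and each pushed register is assumed to take 4 bytes of a downward-growing stack.

```c
#include <stdint.h>

/* Hypothetical model: SP value after `repeats` PUSH instructions,
   each storing `regs_per_push` registers of 4 bytes. */
static uint32_t sp_after_pushes(uint32_t sp, uint32_t regs_per_push, uint32_t repeats)
{
    uint32_t i;
    for (i = 0; i < repeats; i++)
        sp -= 4u * regs_per_push; /* PUSH decrements SP before storing */
    return sp;
}

/* Hypothetical model of ADD SP,#imm: gives the stack space back. */
static uint32_t sp_restore(uint32_t sp, uint32_t imm)
{
    return sp + imm;
}
```

With one register per PUSH and 100 repetitions, SP ends up 400 bytes lower, so a single ADD SP,#400 restores it. The value 400 is a multiple of 4 and lies within the 0..508 range the table gives for ADD SP,#<imm>.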

And as we can (perhaps surprisingly) find out, the measured instruction timing results actually match the datasheet exactly. :-) Unnecessary work? I think not, it has yielded many useful insights.

The source code for the instruction timing measurement program can be downloaded here. It contains a version of the program for both output to the UART port and output to the virtual USB COM port. The package includes a complete compilation environment for Windows, only the GCC-ARM compiler is needed.