Assembly trick : char set membership test. ·

While I was working on my json parser project I took a look at how the compiler had digested the following whitespace skipping code :


/*
 * Skip whitespaces.
 */
const char *ns_js_skp_whs(
    const char *pos
)
{
    char c;
    while ((c = *pos)) {
        if ((c == ' ') || (c == '\r') || (c == '\n') || (c == '\t')) pos++;
        else break;
    }
    return pos;
}

In this function, we want to skip every character in the set {' ', '\r', '\n', '\t'}.

The diligent reader may note that this function could be much shorter. We’ll cover this in a moment.

Assembly level -o0
#

Let’s see what it looks like in -o0 :

(gdb) disassemble ns_js_skp_whs
Dump of assembler code for function ns_js_skp_whs:
   0x000000000045056c <+0>:     sub     sp, sp, #0x20
   0x0000000000450570 <+4>:     str     x0, [sp, #8]
   0x0000000000450574 <+8>:     b       0x4505b4 <ns_js_skp_whs+72>
   0x0000000000450578 <+12>:    ldrb    w0, [sp, #31]
   0x000000000045057c <+16>:    cmp     w0, #0x20
   0x0000000000450580 <+20>:    b.eq    0x4505a8 <ns_js_skp_whs+60>  // b.none
   0x0000000000450584 <+24>:    ldrb    w0, [sp, #31]
   0x0000000000450588 <+28>:    cmp     w0, #0xd
   0x000000000045058c <+32>:    b.eq    0x4505a8 <ns_js_skp_whs+60>  // b.none
   0x0000000000450590 <+36>:    ldrb    w0, [sp, #31]
   0x0000000000450594 <+40>:    cmp     w0, #0xa
   0x0000000000450598 <+44>:    b.eq    0x4505a8 <ns_js_skp_whs+60>  // b.none
   0x000000000045059c <+48>:    ldrb    w0, [sp, #31]
   0x00000000004505a0 <+52>:    cmp     w0, #0x9
   0x00000000004505a4 <+56>:    b.ne    0x4505cc <ns_js_skp_whs+96>  // b.any
   0x00000000004505a8 <+60>:    ldr     x0, [sp, #8]
   0x00000000004505ac <+64>:    add     x0, x0, #0x1
   0x00000000004505b0 <+68>:    str     x0, [sp, #8]
   0x00000000004505b4 <+72>:    ldr     x0, [sp, #8]
   0x00000000004505b8 <+76>:    ldrb    w0, [x0]
   0x00000000004505bc <+80>:    strb    w0, [sp, #31]
   0x00000000004505c0 <+84>:    ldrb    w0, [sp, #31]
   0x00000000004505c4 <+88>:    cmp     w0, #0x0
   0x00000000004505c8 <+92>:    b.ne    0x450578 <ns_js_skp_whs+12>  // b.any
   0x00000000004505cc <+96>:    ldr     x0, [sp, #8]
   0x00000000004505d0 <+100>:   add     sp, sp, #0x20
   0x00000000004505d4 <+104>:   ret

As you can see, the compiler basically does each check sequentially, which makes sense since we told it to do so via -o0.

Now let’s witness the magic of -o3 :

(gdb) disassemble ns_js_skp_whs
Dump of assembler code for function ns_js_skp_whs:
   0x0000000000435320 <+0>:     ldrb    w1, [x0]
   0x0000000000435324 <+4>:     cbz     w1, 0x435348 <ns_js_skp_whs+40>
   0x0000000000435328 <+8>:     mov     x2, #0x2600                   // #9728
   0x000000000043532c <+12>:    movk    x2, #0x1, lsl #32
   0x0000000000435330 <+16>:    cmp     w1, #0x20
   0x0000000000435334 <+20>:    b.hi    0x435348 <ns_js_skp_whs+40>  // b.pmore
   0x0000000000435338 <+24>:    lsr     x1, x2, x1
   0x000000000043533c <+28>:    tbz     w1, #0, 0x435348 <ns_js_skp_whs+40>
   0x0000000000435340 <+32>:    ldrb    w1, [x0, #1]!
   0x0000000000435344 <+36>:    cbnz    w1, 0x435330 <ns_js_skp_whs+16>
   0x0000000000435348 <+40>:    ret

This looks nothing like the previous version, and if you had given me just this assembly, I’d have hardly guessed that it was doing a set membership test.

Assembly analysis
#

Let’s see what’s happening here :

ASCII translations
#

Here are the numerical values of the character constants used by the C source file :

char	ASCII value
`\0`	0
`\t`	9
`\n`	10
`\r`	13
`' '`	32

Use man ascii if you need to know this sort of info.

Prologue + read
#

As per the arm64 calling convention, the argument (pos) is in x0.

The prologue will load the first character into w1, and initialize x2 with a clever constant that we’ll elaborate on.

   0x0000000000435320 <+0>:     ldrb    w1, [x0]
   0x0000000000435324 <+4>:     cbz     w1, 0x435348 <ns_js_skp_whs+40>
   0x0000000000435328 <+8>:     mov     x2, #0x2600                   // #9728
   0x000000000043532c <+12>:    movk    x2, #0x1, lsl #32

Notes : w1 will be written again when the next character is loaded. x2 will always contain 0x100002600.

Main comparison loop
#

The next lines contain the core of the comparison trick.

   0x0000000000435330 <+16>:    cmp     w1, #0x20
   0x0000000000435334 <+20>:    b.hi    0x435348 <ns_js_skp_whs+40>  // b.pmore
   0x0000000000435338 <+24>:    lsr     x1, x2, x1
   0x000000000043533c <+28>:    tbz     w1, #0, 0x435348 <ns_js_skp_whs+40>
   0x0000000000435340 <+32>:    ldrb    w1, [x0, #1]!
   0x0000000000435344 <+36>:    cbnz    w1, 0x435330 <ns_js_skp_whs+16>

The first two lines compare the current character to 0x20 (remember that it is the decimal value for ' ') and jump to the return section if it is strictly superior to it.

The next two lines shift our clever constant by the current character’s numerical value and jump to the return section if the resulting value has its first bit set. We will come back to this in the next section.

Then, the last two instructions load the next character, updating pos in the process, and :

jump back to the start of the comparison loop if the character is non-null.
proceed to the return section if the character is null.

What’s going on ?
#

To understand what’s going on, let’s take a look at those two lines again :

   0x0000000000435338 <+24>:    lsr     x1, x2, x1
   0x000000000043533c <+28>:    tbz     w1, #0, 0x435348 <ns_js_skp_whs+40>

So there must be something going on with x2. Remember that its value throughout the function is 0x100002600.

Let’s compute the following number :

(1 << ' ') | (1 << '\r') | (1 << '\n') | (1 << '\t')

which translates to :

(1 << 32) | (1 << 13) | (1 << 10) | (1 << 9)

which gives :

0x100000000 | 0x2000 | 0x400 | 0x200

which gives us back our mysterious x2 constant :

0x100002600.

Now as the reader may realize, shifting this number back by any of {' ', '\r', '\n', '\t'} will cause bit 0 to be set.

This is how, with a simple shift and test of bit 0, we can detect that a char is a member of a given set.

The applicability of this trick depends on :

the register size. Here, we can only test among a set of constants with values of 63 or less.
the behavior or LSR, as the dedicated section will cover.

Better C generates better assembly
#

Let’s simplify our C function.

As the reader might have guessed, we don’t really need the null test, since the set membership test does it for us.

/*
 * Skip whitespaces.
 */
const char *ns_js_skp_whs(
    const char *pos
)
{
    char c;
    while (((c = *pos) == ' ') || (c == '\r') || (c == '\n') || (c == '\t')) pos++;
    return pos;
}

Resulting assembly :

(gdb) disassemble ns_js_skp_whs
Dump of assembler code for function ns_js_skp_whs:
   0x0000000000435320 <+0>:     mov     x2, #0x2600                     // #9728
   0x0000000000435324 <+4>:     movk    x2, #0x1, lsl #32
   0x0000000000435328 <+8>:     b       0x435330 <ns_js_skp_whs+16>
   0x000000000043532c <+12>:    add     x0, x0, #0x1
   0x0000000000435330 <+16>:    ldrb    w1, [x0]
   0x0000000000435334 <+20>:    cmp     w1, #0x20
   0x0000000000435338 <+24>:    b.hi    0x435344 <ns_js_skp_whs+36>  // b.pmore
   0x000000000043533c <+28>:    lsr     x1, x2, x1
   0x0000000000435340 <+32>:    tbnz    w1, #0, 0x43532c <ns_js_skp_whs+12>
   0x0000000000435344 <+36>:    ret

And now we see that the null comparison has disappeared.

LSR weirdness
#

One could wonder why those two lines are required, as we are later comparing against a bitmask.

   0x0000000000435334 <+20>:    cmp     w1, #0x20
   0x0000000000435338 <+24>:    b.hi    0x435344 <ns_js_skp_whs+36>  // b.pmore

This is due to a weirdness in the behavior of LSR.

Quoting the aarch64 spec :

LSRV <Xd>, <Xn>, <Xm>
...
Xm : Is the 64-bit name of the second general-purpose source register holding a shift amount from 0 to 63 in its bottom 6 bits, encoded in the "Rm" field.

This causes the following shifts behave exactly the same :

    # shift by 0
    mov x1, #0x00
    # or shift by 64
    mov x1, #0x40
    # will produce the original value if used as a shift operand.
    lsr x0, x0, x1

This essentially means that this LSR trick only works with shifts < 0x40, which mandates a compare + branch to compare. Here the compiler was smart enough to detect the range of our set and just added an upper bound comparison.

This behavior is different from the behavior of ARMV7’s LSR which reads the entire byte. For this architecture, the compare + branch would not be needed. Always know the version of the architecture that you’re writing your assembly for, otherwise you may end up with surprises.

Improvement : simple fixes
#

Some improvements can still be made to this function :

1 : it loads the constant with two instructions.
2 : it branches at the start of the function, to skip the initial add. This is dumb as it causes a branch + the extra x0 add will be executed at each iteration.
3 : the main loop has two branches : we first check that the current char is < 0x20, and then we test if it is in the target set.

We can update the code the following way :

we’ll hardcode the constant after the code and do a PC-relative load. This may or may not be a good idea as a load may take time, but the PC-local cache line is probably in cache.
we will always add 1 to x0 during the ldrb to make the loop more compact, and we will decrement x0 before ret. This is OK since we only return once but may execute the loop multiple times.

The third point will be more interesting to update :

first, let’s note that the bitmask trick works for constants up to 63, due to :
- the register mask size.
- the behavior of LSR in aarch64.
thus, we can replace the is <= 0x20 check by a is < 0x40 check in the code without any change of behavior.
this is equivalent to checking that bits 6 and 7 of our character are 0.
so now we can rephrase our check : test that bits 6 and 7 of the character are 0, and that the bitmask shifted by the character value has bit 0 set.
with a CSEL (conditional select) we can rework our assembly to do the following :
- if bits 6 or 7 of x1 are set, then store 0 in x1.
- otherwise store the shifted mask in x1
- then, test bit 0 of x1 and branch back if it is set.

Here is the resulting assembly :

(lldb) dis
prc`ns_js_skp_whs:
->  0x4294c0 <+0>:  ldr    x2, 0x4294e0 ; <+32>
    0x4294c4 <+4>:  ldrb   w1, [x0], #0x1
    0x4294c8 <+8>:  tst    w1, #0xc0
    0x4294cc <+12>: lsr    x1, x2, x1
    0x4294d0 <+16>: csel   x1, x1, xzr, eq
    0x4294d4 <+20>: tbnz   w1, #0x0, 0x4294c4 ; <+4>
    0x4294d8 <+24>: sub    x0, x0, #0x1
    0x4294dc <+28>: ret
    0x4294e0 <+32>: udf    #0x2600
    0x4294e4 <+36>: udf    #0x1

I had to switch to LLDB as GDB was choking on my hand-written assembly function : it was hanging at program start.

Improvement : using smarter instructions
#

Let’s change the high level algorithm : we’ll update the range of supported constants to check to [0, 62] and be sure to have bit 63 set to 0.

We’ll then use this instruction to assign a value to 63 to everything superior to 63, which will effectively exclude all those chars of the match set :

prc`ns_js_skp_whs:
    0x4294c0 <+0>:  ldr    x2, 0x4294dc ; <+28>
    0x4294c4 <+4>:  ldrb   w1, [x0], #0x1
    0x4294c8 <+8>:  umin   w1, w1, #0x63
    0x4294cc <+12>: lsr    x1, x2, x1
    0x4294d0 <+16>: tbnz   w1, #0x0, 0x4294c4 ; <+4>
    0x4294d4 <+20>: sub    x0, x0, #0x1
    0x4294d8 <+24>: ret
    0x4294dc <+28>: udf    #0x2600
    0x4294e0 <+32>: udf    #0x1

When compiling, we see a strange compiling error :

Compiling arm64.S
arm64.S: Assembler messages:
arm64.S:15: Error: selected processor does not support `umin w1,w1,0x63'
make: *** [Makefile:755: build/lib/obj/_tgt_ns/arm64.S.o] Error 1

This is due to the target that I’m compiling my assembly file with :

   .arch armv8.5-a

As the AARCH64 doc says, this instruction is only available if FEAT_CSSC is available in the target architecture, and this feature is optional in ARMV8.5. Adding this feature to the supported assembly fixes the build error.

   .arch armv8.5-a+cssc

Though, we’re not at the end of our troubles, as it appears that my computer (Apple M2 with Asahi) doesn’t support this instruction :

lldb build/prc/prc
(lldb) target create "build/prc/prc"
Current executable set to '/home/bt/bt/work/emb/build/prc/prc' (aarch64).
(lldb) run
Process 31568 launched: '/home/bt/bt/work/emb/build/prc/prc' (aarch64)
Process 31568 stopped
* thread #1, name = 'prc', stop reason = signal SIGILL: illegal opcode
    frame #0: 0x00000000004294c8 prc`ns_js_skp_whs + 8
(lldb) dis
prc`ns_js_skp_whs:
    0x4294c0 <+0>:  ldr    x2, 0x4294dc ; <+28>
    0x4294c4 <+4>:  ldrb   w1, [x0], #0x1
->  0x4294c8 <+8>:  umin   w1, w1, #0x63
    0x4294cc <+12>: lsr    x1, x2, x1
    0x4294d0 <+16>: tbnz   w1, #0x0, 0x4294c4 ; <+4>
    0x4294d4 <+20>: sub    x0, x0, #0x1
    0x4294d8 <+24>: ret
    0x4294dc <+28>: udf    #0x2600
    0x4294e0 <+32>: udf    #0x1

The instruction is not supported by my CPU hence it generated a sync exception reporting an undefined instruction when encountering it.

This version is thus not-applicable to my CPU but teaches us a great lesson in optimization :

The architecture that your CPU runs on matters a lot as it affects the set of tricks at your disposal to grind perf.

Doing it in C
#

OK so now we know this super clever trick used by the compiler. We can now just use it in our C code to avoid relying on the compiler’s cleverness. We just have to carefully compute the masks, which can be easily achieved by two macros and some bitwise trickery :

/*
 * Skip whitespaces.
*/
_weak_ const char *ns_js_skp_whs(
    const char *pos
)
{

    /* Produce a bitmask for a char range. stt must be <= end. Undefined behavior otherwise. */
    #define RANGE(stt, end) ((((u64) 1 << ((end) + 1 - (stt))) - 1) << (stt))
    #define VALUE(c) RANGE(c, c)

    /* Compute the whitespace mask. */
    const u64 msk = VALUE(' ') | VALUE('\r') | RANGE('\t', '\n');

    /* Process naively. This line purposefully has a bug which the next section will cover. */                                                          
    char c;
    while ((msk >> (c = *pos)) & 0x1) pos++;

    /* Complete. */
    return pos;
}

Let’s compile it in -o0 just to prevent the compiler from being clever, and be horrified by the resulting assembly.

(lldb) disassemble  -n ns_js_skp_whs
prc`ns_js_skp_whs:
0x4505f0 <+0>:  sub    sp, sp, #0x20
0x4505f4 <+4>:  str    x0, [sp, #0x8]
0x4505f8 <+8>:  mov    x0, #0x2600 ; =9728
0x4505fc <+12>: movk   x0, #0x1, lsl #32
0x450600 <+16>: str    x0, [sp, #0x18]
0x450604 <+20>: b      0x450614       ; <+36> at js.c:93:20
0x450608 <+24>: ldr    x0, [sp, #0x8]
0x45060c <+28>: add    x0, x0, #0x1
0x450610 <+32>: str    x0, [sp, #0x8]
0x450614 <+36>: ldr    x0, [sp, #0x8]
0x450618 <+40>: ldrb   w0, [x0]
0x45061c <+44>: strb   w0, [sp, #0x17]
0x450620 <+48>: ldrb   w0, [sp, #0x17]
0x450624 <+52>: ldr    x1, [sp, #0x18]
0x450628 <+56>: lsr    x0, x1, x0
0x45062c <+60>: and    x0, x0, #0x1
0x450630 <+64>: cmp    x0, #0x0
0x450634 <+68>: b.ne   0x450608       ; <+24> at js.c:93:39
0x450638 <+72>: ldr    x0, [sp, #0x8]
0x45063c <+76>: add    sp, sp, #0x20
0x450640 <+80>: ret

I mean what the hell is that… Why is it even using any stack at all… That’s where -o0 gets you.

But between two retches we can still see that the compiler uses the same mask plus shift than earlier which is exactly what we want.

C undefined behaviors and why we care
#

Looking at the assembly in the section above, the attentive reader may ask :

We’re using LSR right ? So how about the comparison with 64 ? Where is the check and jump that was previously used by the compiler itself ?

This is a legit question, as in its current state, given what we covered earlier, this algorithm is not working.

But confusingly enough, the compiler was actually right in its assembly generation.

To explain why, let’s refer to GNU’s C documentation at section 3.8 on bit shifting :

For both « and », if the second operand is greater than the bit-width of the first operand, or the second operand is negative, the behavior is undefined.

So basically our C program has a bug : it should manually check for the value of the char and only shift if it is less than (<) 64, otherwise, the result of the shift is undefined.

From the compiler’s standpoint, words matter : undefined means ‘whose behavior is not defined’, understand ‘whose behavior can legitimately produce any result’. Hence, the compiler does not have to ensure what result the operation has and could theoretically do everything it wants. Here, it simply used the aarch64 LSR instruction and stuck to its behavior. But for all intents and purposes, it could very well have generated an assembly sequence that :

checks the char range and aborts if >= 64. or
checks the char range and moves 0xdeadbeef in x0 instead of the shift result. or
blows up your machine.

All those scenarios would all have been perfectly acceptable behaviors as per the C standard.

This teaches us one other optimization lesson :

Beware of undefined behaviors. They may cause your optimization to work on a given platform/compiler, and not on others.

Assembly level -o0#

Assembly analysis#

ASCII translations#

Prologue + read#

Main comparison loop#

What’s going on ?#

Better C generates better assembly#

LSR weirdness#

Improvement : simple fixes#

Improvement : using smarter instructions#

Doing it in C#

C undefined behaviors and why we care#