Introduction.#
In the previous chapter we introduced a bunch of C-level tricks to optimize our parsing.
We will now dive down to the assembly level to explore how we can increase performance even further.
Two warnings before we start.
First, the assembly world is cruel and cold. You may spend hours working on what you believe to be a clever optimization using fancy SIMD instructions, only to discover that it simply worsens your perf by a factor of 2. You’ll actually see this happen in practice. This chapter will essentially be an enumeration of the different assembly variants of the string and array skips, which account for most of our JSON parser’s runtime. Do not expect any fun story here. It will be just that, plus watching my slow descent into insanity.
Working at the assembly level essentially means figuring out how to tune our assembly so that it is executed more efficiently by our processor, which depends heavily on its microarchitecture. A trick that works well on a given machine may do nothing, or even decrease performance, on another machine with the same architecture but a different microarchitecture.
As an example, let’s imagine two hypothetical ARM64 machines: one with a very efficient branch predictor which predicts accurately 95% of the time, and another with no branch predictor at all, or a very simple and stupid one.
The first machine will essentially always speculate successfully, which means that our job will be to increase instruction bandwidth regardless of branches. The second machine’s speculation capabilities will be dubious, and our first job will be to reduce the number of branches to a minimum, using tricks like conditional arithmetic (see the sketch below). This will lead to programs that look very different. You’ll see an example of that, and the associated performance metrics on my machine.
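To make “conditional arithmetic” concrete, here is a minimal C sketch (illustrative only, not code from this parser); the branchless form trades the two branches for pure data flow and typically compiles down to CSEL/CINC-style instructions on ARM64:

/* Branchy: two conditional branches, great when prediction is accurate. */
int depth_branchy(int depth, unsigned char c)
{
    if (c == '[')
        depth++;
    else if (c == ']')
        depth--;
    return depth;
}

/* Branchless: no branches at all, only data dependencies. */
int depth_branchless(int depth, unsigned char c)
{
    return depth + (c == '[') - (c == ']');
}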
Calling assembly from C.#
As stated before, this chapter will essentially be a summary of my autistic assembly-writing frenzy. I’ll cover many versions of programs that do exactly the same thing (skipping strings and skipping arrays) but with different perf metrics.
It is important for our C code to be able to call into the selected assembly variant as easily as possible.
To do this, we’ll use two things: the weak attribute and an assembly trampoline.
First, we’ll prefix our two C functions ns_js_fsk_str and ns_js_fsk_arr with __attribute__((weak)), which tells the compiler that the provided definition should only be used as a fallback in case no other function (more generally, symbol) with the same name is provided at link time.
The last two words, link time, do matter here. Indeed, since weak symbol resolution is done at link time, this by design prevents the compiler from inlining behind our back and reduces the scope of its optimizations.
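For reference, here is a minimal sketch of the C-side pattern (the body is a naive, escape-unaware stand-in, not the real implementation):

/* Fallback definition: kept by the linker only if no strong
 * ns_js_fsk_str symbol (e.g. one of our assembly variants) is
 * provided at link time. */
__attribute__((weak)) const char *ns_js_fsk_str(const char *pos)
{
    pos++;              /* skip the opening '"' */
    while (*pos != '"') /* naive scan, ignores escapes */
        pos++;
    return pos + 1;
}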
Let’s see that in action: here are the perf metrics for a run without the weak attribute:
nkb && taskset -c 4 build/prc/prc -rdb /tmp/regs.json -C 10 -c 10
Compiling arm64.S
Packing lib_ns.o
Auto-packing build/prc/prc.o
Auto-linking build/prc/prc
stp 0 : 36.591.
stp 1 : 29.859.
stp 2 : 29.650.
stp 3 : 29.682.
stp 4 : 29.681.
stp 5 : 29.677.
stp 6 : 29.690.
stp 7 : 29.682.
stp 8 : 29.687.
stp 9 : 29.681.
stp 10 : 29.670.
stp 11 : 29.680.
stp 12 : 29.682.
stp 13 : 29.687.
stp 14 : 29.682.
stp 15 : 29.697.
stp 16 : 29.671.
stp 17 : 29.709.
stp 18 : 29.717.
stp 19 : 29.719.
Average : 29.691.
Now here are the perf metrics with the weak attribute:
nkb && taskset -c 4 build/prc/prc -rdb /tmp/regs.json -C 10 -c 10
Compiling arm64.S
Packing lib_ns.o
Auto-packing build/prc/prc.o
Auto-linking build/prc/prc
stp 0 : 39.208.
stp 1 : 31.896.
stp 2 : 31.672.
stp 3 : 31.717.
stp 4 : 31.703.
stp 5 : 31.713.
stp 6 : 31.726.
stp 7 : 31.761.
stp 8 : 31.725.
stp 9 : 31.725.
stp 10 : 31.708.
stp 11 : 31.718.
stp 12 : 31.717.
stp 13 : 31.723.
stp 14 : 31.699.
stp 15 : 31.711.
stp 16 : 31.712.
stp 17 : 31.715.
stp 18 : 31.725.
stp 19 : 31.765.
Average : 31.719.
So just by letting the C code call into our assembly functions, we already lost perf. We’ll compensate for that later on.
On the assembly side, we’ll define trampoline functions named like our C functions, each branching to the variant of our choice.
.align 4
.global ns_js_fsk_str
.type ns_js_fsk_str, %function
ns_js_fsk_str:
b ns_js_fsk_str1
.size ns_js_fsk_str, .-ns_js_fsk_str
.align 4
.global ns_js_fsk_arr
.type ns_js_fsk_arr, %function
ns_js_fsk_arr:
b ns_js_fsk_arr22
.size ns_js_fsk_arr, .-ns_js_fsk_arr
That sets us up for assembly programming.
String skip.#
Our JSON file is mostly composed of strings, and skipping them quickly is important.
Our first optimization will be centered around ns_js_fsk_str.
Compiler version.#
(lldb) dis -n ns_js_fsk_str
prc`ns_js_fsk_str:
0x45cd80 <+0>: ldrb w1, [x0], #0x1
0x45cd84 <+4>: mov w3, #0x5c ; =92
0x45cd88 <+8>: mov w2, w1
0x45cd8c <+12>: ldrb w1, [x0], #0x1
0x45cd90 <+16>: cmp w1, #0x22
0x45cd94 <+20>: ccmp w2, w3, #0x4, eq
0x45cd98 <+24>: b.eq 0x45cd88 ; <+8> at js.c:605:3
0x45cd9c <+28>: ret
Clever as usual, it generates a minimal loop which reads one byte at a time and remembers the previous one, exiting on a '"' that is not directly preceded by a '\'.
As mentioned in the previous chapter, this algorithm is actually incorrect: to parse an actual JSON string, we must keep track of the number of backslashes preceding a quote, and consider it a string end only if that number is even.
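To make the rule concrete, here is a hedged C sketch (a hypothetical helper, not the parser’s actual code):

/* Hypothetical helper: a '"' ends the string only when the number of
 * backslashes directly before it is even (odd means it is escaped). */
static const char *skip_str_correct(const char *pos)
{
    pos++; /* skip the opening '"' */
    while (1) {
        if (*pos == '"') {
            long n = 0;
            while (pos[-1 - n] == '\\')
                n++;
            if ((n & 1) == 0)
                return pos + 1; /* unescaped quote: string end */
        }
        pos++;
    }
}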
Variant 0: using SIMD.#
SIMD&FP is mandatory in AArch64 and is supported by my processor.
We’ll use a CMEQ to compare 8 characters at a time to '"', and break at the first match. Then we’ll count trailing zeros to determine the location of the first encountered quote, and adjust the char pointer to that location. Finally, we’ll count the number of '\'s preceding the quote and break if it is even.
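A quick note on the mask-to-index step, with a hedged C model: after CMEQ every matching byte is 0xff, so counting trailing zeros (what the RBIT + CLZ pair below computes) and dividing by 8 gives the byte index of the first match for a little-endian load:

#include <stdint.h>

/* Byte index of the first 0xff lane of a CMEQ-style mask.
 * __builtin_ctzll stands in for RBIT + CLZ; the mask is assumed
 * nonzero (the CBZ in the asm guarantees that). */
static unsigned first_match_index(uint64_t cmeq_mask)
{
    return (unsigned) __builtin_ctzll(cmeq_mask) >> 3;
}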
Using CMEQ to compare 8 chars at a time.
# JS string skip.
.global ns_js_fsk_str0
.type ns_js_fsk_str0, %function
ns_js_fsk_str0:
# Skip first '"' char.
add x0, x0, #1
# Constants init.
ldr d0, 6f;
# Main loop.
# Read 8 bytes, compare them to '"' (simd).
# Then if none is equal, reiterate.
# Otherwise, move x0 after the first '"' and break.
1:
ldr d1, [x0], #8;
cmeq v1.8b, v1.8b, v0.8b
mov x1, v1.D[0]
cbz x1, 1b
# Loop exit. Adjust x0 to point to the char after '"'.
sub x0, x0, #7
rbit x1, x1
clz x1, x1
add x0, x0, x1, lsr 3
# If previous char is '\', corner case, as we need to
# Determine if it is meaningful.
ldrb w1, [x0, #-2]
cmp x1, 0x5c
b.ne 3f
# Count the number of '\' before '"'.
# If it is even, return.
# If it is odd, re-enter the loop.
mov x1, 0
sub x2, x0, #3
2:
add x1, x1, #1
ldrb w3, [x2], #-1
cmp x3, 0x5c
b.eq 2b;
tbnz x1, #0, 1b
3:
ret
# Constants.
6:
.quad 0x2222222222222222
.size ns_js_fsk_str0, .-ns_js_fsk_str0
Multiple factors make this code efficient:
- counting the '\'s only after we find a quote keeps the main loop lightweight.
- guarding the '\' count behind a branch makes the branch predictor systematically skip this section, as backslashed quotes are very rare.
Average : 29.081.
This already compensated for the perf loss caused by the weak attribute.
For now this is the only variant for the str skip, though I may give a try to the multi-read trick shown in the array skip section.
Array skip.#
As shown by valgrind in the previous chapter, we spend most of our time in ns_js_fsk_arr. This makes it our main target for optimization.
This entire section will cover this function, and the perf metrics shown will include the trick described in the ns_js_fsk_str section above.
Our baseline will hence be:
Average : 29.081.
C code#
As a reminder, here is our base optimized C code:
s8 ns_js_fsk_arr_tbl [256] = {
    [']'] = -1,
    ['['] = 1
};

#define ITR(pos, v, cnt, qt, op, cl, tbl) \
    for (u8 i = 0; i < 8; pos++, i++, v >>= 8) { \
        u8 c = (u8) v; \
        cnt += tbl[c]; \
        if (!cnt) { \
            return pos + 1; \
        } \
        if (c == qt) { \
            pos = ns_js_fsk_str(pos); \
            goto stt; \
        } \
    }

/*
 * Skip an array.
 */
_weak_ const char *ns_js_fsk_arr(
    const char *pos
)
{
    check(*pos == '[');
    pos++;
    u32 cnt = 1;
    while (1) {
    stt:;
        uint64_t v = *(uint64_t *) pos;
        uint64_t v1 = *(uint64_t *) (pos + 8);
        ITR(pos, v, cnt, '"', '[', ']', ns_js_fsk_arr_tbl);
        ITR(pos, v1, cnt, '"', '[', ']', ns_js_fsk_arr_tbl);
    }
}
We make two memory reads of 64 bits, then iterate over each byte.
Let’s see what gcc does with that.
Compiler version.#
Here is the version generated by GCC:
GCC output.
(lldb) dis -n ns_js_fsk_arr
prc`ns_js_fsk_arr:
0x45cdcc <+0>: stp x29, x30, [sp, #-0x20]!
0x45cdd0 <+4>: mov x1, x0
0x45cdd4 <+8>: add x0, x0, #0x1
0x45cdd8 <+12>: mov x29, sp
0x45cddc <+16>: stp x19, x20, [sp, #0x10]
0x45cde0 <+20>: adrp x20, 52
0x45cde4 <+24>: add x20, x20, #0x5a0 ; ns_js_fsk_arr_tbl
0x45cde8 <+28>: ldur x2, [x1, #0x1]
0x45cdec <+32>: ldur x3, [x1, #0x9]
0x45cdf0 <+36>: and w1, w2, #0xff
0x45cdf4 <+40>: ldrsb w19, [x20, w1, sxtw]
0x45cdf8 <+44>: adds w19, w19, #0x1
0x45cdfc <+48>: b.eq 0x45cfcc ; <+512> at js.c:663:3
0x45ce00 <+52>: cmp w1, #0x22
0x45ce04 <+56>: b.eq 0x45cfe4 ; <+536> at js.c:663:3
0x45ce08 <+60>: ubfx w5, w2, #8, #8
0x45ce0c <+64>: add x1, x0, #0x1
0x45ce10 <+68>: ldrsb w4, [x20, w5, sxtw]
0x45ce14 <+72>: adds w19, w19, w4
0x45ce18 <+76>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45ce1c <+80>: cmp w5, #0x22
0x45ce20 <+84>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ce24 <+88>: ubfx w5, w2, #16, #8
0x45ce28 <+92>: add x1, x0, #0x2
0x45ce2c <+96>: ldrsb w4, [x20, w5, sxtw]
0x45ce30 <+100>: adds w19, w19, w4
0x45ce34 <+104>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45ce38 <+108>: cmp w5, #0x22
0x45ce3c <+112>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ce40 <+116>: lsr w5, w2, #24
0x45ce44 <+120>: add x1, x0, #0x3
0x45ce48 <+124>: ldrsb w4, [x20, w5, sxtw]
0x45ce4c <+128>: adds w19, w19, w4
0x45ce50 <+132>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45ce54 <+136>: cmp w5, #0x22
0x45ce58 <+140>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ce5c <+144>: ubfx x5, x2, #32, #8
0x45ce60 <+148>: add x1, x0, #0x4
0x45ce64 <+152>: ldrsb w4, [x20, w5, sxtw]
0x45ce68 <+156>: adds w19, w19, w4
0x45ce6c <+160>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45ce70 <+164>: cmp w5, #0x22
0x45ce74 <+168>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ce78 <+172>: ubfx x5, x2, #40, #8
0x45ce7c <+176>: add x1, x0, #0x5
0x45ce80 <+180>: ldrsb w4, [x20, w5, sxtw]
0x45ce84 <+184>: adds w19, w19, w4
0x45ce88 <+188>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45ce8c <+192>: cmp w5, #0x22
0x45ce90 <+196>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ce94 <+200>: ubfx x5, x2, #48, #8
0x45ce98 <+204>: add x1, x0, #0x6
0x45ce9c <+208>: ldrsb w4, [x20, w5, sxtw]
0x45cea0 <+212>: adds w19, w19, w4
0x45cea4 <+216>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cea8 <+220>: cmp w5, #0x22
0x45ceac <+224>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ceb0 <+228>: lsr x4, x2, #56
0x45ceb4 <+232>: add x1, x0, #0x7
0x45ceb8 <+236>: mov x2, x4
0x45cebc <+240>: ldrsb w4, [x20, x4]
0x45cec0 <+244>: adds w19, w19, w4
0x45cec4 <+248>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cec8 <+252>: cmp x2, #0x22
0x45cecc <+256>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ced0 <+260>: and w4, w3, #0xff
0x45ced4 <+264>: add x1, x0, #0x8
0x45ced8 <+268>: ldrsb w2, [x20, w4, sxtw]
0x45cedc <+272>: adds w19, w19, w2
0x45cee0 <+276>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cee4 <+280>: cmp w4, #0x22
0x45cee8 <+284>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45ceec <+288>: ubfx w4, w3, #8, #8
0x45cef0 <+292>: add x1, x0, #0x9
0x45cef4 <+296>: ldrsb w2, [x20, w4, sxtw]
0x45cef8 <+300>: adds w19, w19, w2
0x45cefc <+304>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cf00 <+308>: cmp w4, #0x22
0x45cf04 <+312>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cf08 <+316>: ubfx w4, w3, #16, #8
0x45cf0c <+320>: add x1, x0, #0xa
0x45cf10 <+324>: ldrsb w2, [x20, w4, sxtw]
0x45cf14 <+328>: adds w19, w19, w2
0x45cf18 <+332>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cf1c <+336>: cmp w4, #0x22
0x45cf20 <+340>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cf24 <+344>: lsr w4, w3, #24
0x45cf28 <+348>: add x1, x0, #0xb
0x45cf2c <+352>: ldrsb w2, [x20, w4, sxtw]
0x45cf30 <+356>: adds w19, w19, w2
0x45cf34 <+360>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cf38 <+364>: cmp w4, #0x22
0x45cf3c <+368>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cf40 <+372>: ubfx x4, x3, #32, #8
0x45cf44 <+376>: add x1, x0, #0xc
0x45cf48 <+380>: ldrsb w2, [x20, w4, sxtw]
0x45cf4c <+384>: adds w19, w19, w2
0x45cf50 <+388>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cf54 <+392>: cmp w4, #0x22
0x45cf58 <+396>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cf5c <+400>: ubfx x4, x3, #40, #8
0x45cf60 <+404>: add x1, x0, #0xd
0x45cf64 <+408>: ldrsb w2, [x20, w4, sxtw]
0x45cf68 <+412>: adds w19, w19, w2
0x45cf6c <+416>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cf70 <+420>: cmp w4, #0x22
0x45cf74 <+424>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cf78 <+428>: ubfx x4, x3, #48, #8
0x45cf7c <+432>: add x1, x0, #0xe
0x45cf80 <+436>: ldrsb w2, [x20, w4, sxtw]
0x45cf84 <+440>: adds w19, w19, w2
0x45cf88 <+444>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cf8c <+448>: cmp w4, #0x22
0x45cf90 <+452>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cf94 <+456>: lsr x2, x3, #56
0x45cf98 <+460>: add x1, x0, #0xf
0x45cf9c <+464>: mov x3, x2
0x45cfa0 <+468>: ldrsb w2, [x20, x2]
0x45cfa4 <+472>: adds w19, w19, w2
0x45cfa8 <+476>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45cfac <+480>: cmp x3, #0x22
0x45cfb0 <+484>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
0x45cfb4 <+488>: add x0, x0, #0x10
0x45cfb8 <+492>: ldp x2, x3, [x0]
0x45cfbc <+496>: and w1, w2, #0xff
0x45cfc0 <+500>: ldrsb w4, [x20, w1, sxtw]
0x45cfc4 <+504>: adds w19, w19, w4
0x45cfc8 <+508>: b.ne 0x45ce00 ; <+52> at js.c:663:3
0x45cfcc <+512>: mov x1, x0
0x45cfd0 <+516>: ldp x19, x20, [sp, #0x10]
0x45cfd4 <+520>: add x0, x1, #0x1
0x45cfd8 <+524>: ldp x29, x30, [sp], #0x20
0x45cfdc <+528>: ret
0x45cfe0 <+532>: mov x0, x1
0x45cfe4 <+536>: bl 0x44e840 ; ns_js_fsk_str
0x45cfe8 <+540>: b 0x45cfb8 ; <+492> at js.c:662:12
Though it is somewhat difficult to read, its working principle is simple:
First, it initializes a couple of things, and enters the main loop.
Then it reads our two u64s with:
0x45cde8 <+28>: ldur x2, [x1, #0x1]
0x45cdec <+32>: ldur x3, [x1, #0x9]
Then it processes each byte of the two registers with 16 sections like this:
0x45ce08 <+60>: ubfx w5, w2, #8, #8
0x45ce0c <+64>: add x1, x0, #0x1
0x45ce10 <+68>: ldrsb w4, [x20, w5, sxtw]
0x45ce14 <+72>: adds w19, w19, w4
0x45ce18 <+76>: b.eq 0x45cfd0 ; <+516> at js.c:666:1
0x45ce1c <+80>: cmp w5, #0x22
0x45ce20 <+84>: b.eq 0x45cfe0 ; <+532> at js.c:663:3
Each of these sections processes one specific byte (selected via ubfx) and stores the address of the current char in x1, so that the quote and end-of-array handlers know where we are. Then the lookup table is accessed via ldrsb, the array nesting counter is updated with adds, and finally both exit conditions are checked: if either a quote or an end-of-array is encountered, the CPU jumps to the relevant handler.
Performance notes:
- instead of the “store x0 + offset in x1” dance, we could just have incremented x0, but this would decrease perf, as it would create an artificial serial register dependency that prevents the CPU from reordering or executing in parallel. GCC knows that and avoids it.
- we keep overwriting x1, w5 and w4, but register renaming breaks those artificial write-after-write dependencies for us and allows parallel execution.
If no quote or end-of-array is encountered, the next u64s are loaded via:
0x45cfb8 <+492>: ldp x2, x3, [x0]
and the byte-processing sequence is restarted.
This version is full of branches, but branch prediction being as accurate as it is here, it does not matter. We’ll take a look at variants of this algorithm and explain this in more detail.
Variant 0: set membership test.#
We’ll use a variant of the char membership test described here. '"', '[' and ']' sit within a less-than-64-wide range of the ASCII charset, so we can just initialize a mask in x5 with their locations and use the lsr trick on the current char to quickly identify them.
If the bit test fires, we compare the char against each candidate manually. Otherwise we jump back.
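Here is a hedged C model of the bit test (hypothetical helper). Note that the assembly relies on LSR taking its shift modulo 64 instead of an explicit range check, which lets a few unrelated bytes ('b' for instance) false-positive; that is why the candidates are then re-checked manually:

#include <stdint.h>

/* '"' (0x22), '[' (0x5b) and ']' (0x5d) all live in [32, 96), so
 * (c - 32) can index a 64-bit mask of their positions; this builds
 * the 0x2800000000000004 constant used below. */
static const uint64_t special_mask = (1ULL << ('"' - 32))
                                   | (1ULL << ('[' - 32))
                                   | (1ULL << (']' - 32));

static int maybe_special(uint8_t c)
{
    uint8_t shf = (uint8_t) (c - 32);
    return shf < 64 && ((special_mask >> shf) & 1);
}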
Using a set membership test to detect []".
# JS Array skip.
.align 4
.global ns_js_fsk_arr0
.type ns_js_fsk_arr0, %function
ns_js_fsk_arr0:
stp x29, x30, [sp, #-0x10]!
# Skip first '[' char.
add x0, x0, #1
# init depth counter.
ldr x5, 6f
mov x4, #1
# main loop.
1:
ldrb w1, [x0], #1
sub w2, w1, #32
lsr x2, x5, x2
tbz w2, #0, 1b;
# special char detection.
cmp w1, #0x22
b.eq 2f
cmp w1, #0x5b
b.eq 3f
cmp w1, #0x5d
b.eq 4f
# False positive (shift aliasing), keep scanning.
b 1b;
2:
# '"' handling.
sub x0, x0, #1
bl ns_js_fsk_str
b 1b;
# '[' handling.
3:
add x4, x4, #1
b 1b;
# ']' handling.
4:
sub x4, x4, #1
cbnz x4, 1b;
ldp x29, x30, [sp], #0x10
ret
6:
.quad 0x2800000000000004
.size ns_js_fsk_arr0, .-ns_js_fsk_arr0
The perf numbers for this method are catastrophic:
Average : 44.022.
The reason this is bad is that we don’t give the branch predictor enough branches to make accurate predictions. In the algorithm above, every section had its own branches, and the BP could index them all separately. Here it has just one loop branch and three char branches that must carry a compound prediction for all of them.
On top of that, the loop is very short and all its instructions have true data dependencies (RAW), which force the CPU to execute them in order.
Variant 1: SIMD-based char membership test.#
We’ll try our luck with the SIMD world again.
Ideally we’d like to port the previous method to SIMD. But we can’t use variant 0’s method as-is, since it requires a 64-bit mask to do the set membership test.
We’ll use the method described in a dedicated article and use this expression as a detector:
({u8 shf = (((c - '"') ^ ((u8) 1 << 5)) - (u8) 25); !((shf <= 7) && ((0x85 >> shf) & 1));})
It involves only 8-bit values, which makes it perfectly suitable for SIMD.
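As a sanity check, here is a small stand-alone C program (not part of the parser) that brute-forces the detector over all byte values; shf lands on bits 7 ('"'), 0 ('[') and 2 (']'), which is exactly the 0x85 mask loaded into d19 below:

#include <stdint.h>
#include <stdio.h>

typedef uint8_t u8;

static int not_special(u8 c)
{
    u8 shf = (u8) (((c - '"') ^ ((u8) 1 << 5)) - (u8) 25);
    return !((shf <= 7) && ((0x85 >> shf) & 1));
}

int main(void)
{
    /* Expected output: 0x22 ('"'), 0x5b ('['), 0x5d (']') only. */
    for (int c = 0; c < 256; c++)
        if (!not_special((u8) c))
            printf("special: 0x%02x\n", c);
    return 0;
}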
A more complex but SIMD-compatible set membership test.
# Relies on the following trick to
# make a quick byte membership test for '[]"' :
# ({u8 shf = (((c - '"') ^ ((u8) 1 << 5)) - (u8) 25); !((shf <= 7) && ((0x85 >> shf) & 1));})
.align 4
.global ns_js_fsk_arr1
.type ns_js_fsk_arr1, %function
ns_js_fsk_arr1:
stp x29, x30, [sp, #-0x10]!
ldr x1, =6f
# Skip first '"' char.
add x0, x0, #1
# Init depth counter.
mov x4, #1
# Constants init.
ldp d16, d17, [x1]
ldp d18, d19, [x1, #16]
ldr d20, [x1, #32]
# Main loop.
# Read 8 bytes, check if they are in '[]"'.
# Then if none is equal, reiterate.
# If one is equal, move on to processing.
1:
ldr d4, [x0], #8;
sub v4.8b, v4.8b, v16.8b
eor v4.8b, v4.8b, v17.8b
sub v4.8b, v4.8b, v18.8b
ushl v4.8b, v20.8b, v4.8b
and v4.8b, v4.8b, v19.8b
mov x1, v4.D[0]
cbz x1, 1b
# Loop exit.
# Adjust x0 to point to the next character.
# Adjust x1 to contain the mask of the found character.
sub x0, x0, #7
rbit x2, x1
clz x2, x2
and x5, x2, #0xf8
add x0, x0, x2, lsr #3
lsr x1, x1, x5
# Process.
# jump to '['
tbnz x1, #0, 2f;
# jump to ']'
tbnz x1, #2, 3f;
# No branch taken: it's a '"'.
# '"' handling.
sub x0, x0, #1
bl ns_js_fsk_str
b 1b;
# '[' handling.
2:
add x4, x4, #1
b 1b;
# ']' handling.
3:
sub x4, x4, #1
cbnz x4, 1b;
ldp x29, x30, [sp], #0x10
ret
# Constants.
6:
# d16, quotes.
.quad 0x2222222222222222
# d17, 1 << 5.
.quad 0x2020202020202020
# d18, 25.
.quad 0x1919191919191919
# d19, mask.
.quad 0x8585858585858585
# d20, mask.
.quad 0x0101010101010101
.size ns_js_fsk_arr1, .-ns_js_fsk_arr1
On the perf side, this solution is better than the previous one, but that was not very hard. It’s still pretty bad.
Variant 2: CMEQ + EOR3.#
The previous membership test is clever, but it has too many true data dependencies, which make its execution necessarily sequential and degrade perf.
We can write dumber but faster code that just uses CMEQ three times and gathers the results. Each CMEQ will compare 8 chars with either '"', '[' or ']'. We’ll use the special EOR3 instruction, which does a three-way exclusive OR. Since a char can be equal to at most one of those, in our case EOR is semantically equivalent to an OR.
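To make the “EOR as OR” argument concrete, here is a scalar C model of a single lane (illustrative only): each compare yields 0xff or 0x00, and since at most one of the three can be 0xff, XOR and OR coincide:

#include <stdint.h>

/* One-lane model of CMEQ: 0xff on match, 0x00 otherwise. */
static uint8_t eq_mask(uint8_t b, uint8_t c)
{
    return b == c ? 0xff : 0x00;
}

/* Model of the EOR3 gather: at most one operand is 0xff per lane,
 * so the three-way XOR is equivalent to a three-way OR. */
static uint8_t special_lane(uint8_t b)
{
    return eq_mask(b, '"') ^ eq_mask(b, '[') ^ eq_mask(b, ']');
}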
Replacing our set membership test with a dumb three-way comparison.
# JS Array skip.
# Clobbers
# - d16, d17, d18 (constants).
# - d0, d1, d2, d3 (read and compare results).
# - x0, x1, x2, x4, x5. Scratch.
# Performs the '[]"' byte membership test with three CMEQs
# gathered by a single EOR3.
.align 4
.global ns_js_fsk_arr2
.type ns_js_fsk_arr2, %function
ns_js_fsk_arr2:
stp x29, x30, [sp, #-0x10]!
ldr x1, =6f
# Skip first '[' char.
add x0, x0, #1
# Init depth counter.
mov x4, #1
# Constants init.
ldp d16, d17, [x1]
ldr d18, [x1, #16]
# Main loop.
# Read 8 bytes, check if they are in '[]"'.
# Then if none is equal, reiterate.
# If one is equal, move on to processing.
1:
ldr d0, [x0], #8;
cmeq v1.8b, v0.8b, v16.8b
cmeq v2.8b, v0.8b, v17.8b
cmeq v3.8b, v0.8b, v18.8b
eor3 v1.16b, v1.16b, v2.16b, v3.16b
mov x2, v1.D[0]
cbz x2, 1b
# Loop exit.
# Adjust x0 to point to the next character.
# Adjust x1 to contain the mask of the found character.
mov x1, v0.D[0]
and x1, x2, x1
sub x0, x0, #7
rbit x2, x2
clz x2, x2
and x5, x2, #0xf8
add x0, x0, x2, lsr #3
lsr x1, x1, x5
and x1, x1, 0xff
# Process.
# jump to '['
cmp x1, #0x5b
b.eq 2f;
# jump to ']'
cmp x1, #0x5d
b.eq 3f;
# Otherwise it's a '"'.
b 4f
# '[' handling.
2:
add x4, x4, #1
b 1b;
# ']' handling.
3:
sub x4, x4, #1
cbnz x4, 1b;
ldp x29, x30, [sp], #0x10
ret
# '"' handling.
4:
sub x0, x0, #1
bl ns_js_fsk_str
b 1b;
# Constants.
6:
# d16, '"'
.quad 0x2222222222222222
# d17, '['
.quad 0x5b5b5b5b5b5b5b5b
# d18, ']'.
.quad 0x5d5d5d5d5d5d5d5d
.size ns_js_fsk_arr2, .-ns_js_fsk_arr2
All the CMEQs are independent of each other, which allows them to be executed in parallel. The EOR3 depends on all of them and must wait for their completion, but this is still shorter than our previous dependency chain.
Perf agrees:
Average : 34.119.
We’re still pathetic compared to GCC, but we’re progressing.
Variant 3: screw it, let’s just do it like GCC.#
Previous iterations have shown that my SIMD attempts were nice but GCC just lols at them if we compare perf numbers.
There’s something the GCC solution does well: it reads once, then processes every byte with its handling inlined, leaving a lot of room for branch prediction to speculate correctly.
We’ll start with a simple re-implementation of what GCC does (without the weird loop tail and the 16-byte processing).
The GCC way because apparently we're no match.
# JS Array skip.
.align 4
.global ns_js_fsk_arr
.type ns_js_fsk_arr, %function
ns_js_fsk_arr:
stp x29, x30, [sp, #-0x10]!
# Init depth counter.
mov w4, #1
ldr x6, =ns_js_fsk_arr_tbl
add x0, x0, #1
# Main loop.
1:
# Perform reads.
ldr x2, [x0]
ubfx x1, x2, #0, #8
mov x3, x0
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #8, #8
add x3, x0, #1
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #16, #8
add x3, x0, #2
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #24, #8
add x3, x0, #3
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #32, #8
add x3, x0, #4
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #40, #8
add x3, x0, #5
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #48, #8
add x3, x0, #6
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
ubfx x1, x2, #56, #8
add x3, x0, #7
ldrsb w12, [x6, w1, sxtw]
adds w4, w4, w12
b.eq 3f;
cmp x1, #0x22
b.eq 2f;
add x0, x0, #8
b 1b;
3:
add x0, x3, #1
ldp x29, x30, [sp], #0x10
ret
# '"' handling.
2 :
mov x0, x3
bl ns_js_fsk_str
b 1b;
# Constants.
.size ns_js_fsk_arr, .-ns_js_fsk_arr
Perf is nice, as expected:
Average : 29.399.
We’re already under what we started with so we’re on a good track here. We’ll just have to try and improve that a bit.
Variant 4: why speculate when you can just do it.#
Explaining why the following trick improves performance requires a bit of theory.
Branch prediction is usually accurate but sometimes it’s not.
In this case, the CPU has to flush its pipeline and start re-executing at the correct instruction.
Modern CPUs basically speculate at every branch, since we can’t realistically wait for the result of a comparison (which can take more than 10 cycles) just to fetch the next instruction. That’s the reason why branch prediction exists: located at the fetch stage, it predicts the value of the next PC from the current one.
This also implies that when the actual result of a comparison used in a conditional branch becomes known, the CPU must verify after the fact that the branch target inferred by the branch predictor was accurate. If so, it just marks the subsequent instructions as non-speculative. If not, it cancels their execution by flushing them out of the pipeline and reverting the CPU’s structures to the state they were in before the branch’s effects started.
This flush is a very convoluted and expensive process, which consumes cycles and degrades the CPU’s performance. Consider it a double-edged sword: if you predict accurately, your perf goes through the roof. If you don’t, your perf goes through the floor.
Let’s take a look at our next variant.
Moving `UBFX` and `LDRSB` out of the speculation window.
# JS Array skip.
.align 4
.global ns_js_fsk_arr10
.type ns_js_fsk_arr10, %function
ns_js_fsk_arr10:
stp x29, x30, [sp, #-0x10]!
# Init depth counter.
mov w4, #1
ldr x6, =ns_js_fsk_arr_tbl
add x0, x0, #1
# Main loop.
1:
# Perform reads.
ldr x2, [x0]
mov x3, x0
ubfx x12, x2, #0, #8
ubfx x13, x2, #8, #8
ubfx x14, x2, #16, #8
ubfx x15, x2, #24, #8
ldrsb w8, [x6, w12, sxtw]
ldrsb w9, [x6, w13, sxtw]
ldrsb w10, [x6, w14, sxtw]
ldrsb w11, [x6, w15, sxtw]
adds w4, w4, w8
b.eq 3f;
cmp x12, #0x22
b.eq 2f;
add x3, x0, #1
adds w4, w4, w9
b.eq 3f;
cmp x13, #0x22
b.eq 2f;
add x3, x0, #2
adds w4, w4, w10
b.eq 3f;
cmp x14, #0x22
b.eq 2f;
add x3, x0, #3
adds w4, w4, w11
b.eq 3f;
cmp x15, #0x22
b.eq 2f;
ubfx x12, x2, #32, #8
add x3, x0, #4
ldrsb w8, [x6, w12, sxtw]
adds w4, w4, w8
b.eq 3f;
cmp x12, #0x22
b.eq 2f;
ubfx x13, x2, #40, #8
add x3, x0, #5
ldrsb w9, [x6, w13, sxtw]
adds w4, w4, w9
b.eq 3f;
cmp x13, #0x22
b.eq 2f;
ubfx x14, x2, #48, #8
add x3, x0, #6
ldrsb w10, [x6, w14, sxtw]
adds w4, w4, w10
b.eq 3f;
cmp x14, #0x22
b.eq 2f;
ubfx x15, x2, #56, #8
add x3, x0, #7
ldrsb w11, [x6, w15, sxtw]
adds w4, w4, w11
b.eq 3f;
cmp x15, #0x22
b.eq 2f;
add x0, x0, #8
b 1b;
3:
add x0, x3, #1
ldp x29, x30, [sp], #0x10
ret
# '"' handling.
2 :
mov x0, x3
bl ns_js_fsk_str
b 1b;
# Constants.
.size ns_js_fsk_arr10, .-ns_js_fsk_arr10
The code is simple enough: we do exactly as before, but we move the first four UBFX + LDRSB pairs to the start of the loop.
Average : 29.360.
The difference is very small but it is consistent.
What we did is essentially move those 8 instructions out of many reordering windows, into a code section with no branches.
This set of instructions is composed of two subgroups, each containing only independent instructions, which can hence be dispatched in parallel, increasing the final throughput.
Here we’re taking advantage of the fact that those instructions will statistically almost all be executed anyway, by simply executing them all the time.
But as the next example shows, there’s not much room for improvement here.
Variant 5: messing with branch prediction too much.#
We’ll try to apply the previous trick to the last 4 bytes of our uint64_t too and see what we get.
Making things a bit too non-conditional.
# JS Array skip.
.align 4
.global ns_js_fsk_arr
.type ns_js_fsk_arr, %function
ns_js_fsk_arr:
stp x29, x30, [sp, #-0x10]!
# Init depth counter.
mov w4, #1
ldr x6, =ns_js_fsk_arr_tbl
add x0, x0, #1
# Main loop.
# Perform reads.
1:
ldr x2, [x0]
mov x3, x0
ubfx x12, x2, #0, #8
ubfx x13, x2, #8, #8
ubfx x14, x2, #16, #8
ubfx x15, x2, #24, #8
ldrsb w8, [x6, w12, sxtw]
ldrsb w9, [x6, w13, sxtw]
ldrsb w10, [x6, w14, sxtw]
ldrsb w11, [x6, w15, sxtw]
adds w4, w4, w8
b.eq 3f;
cmp x12, #0x22
b.eq 2f;
add x3, x0, #1
adds w4, w4, w9
b.eq 3f;
cmp x13, #0x22
b.eq 2f;
add x3, x0, #2
adds w4, w4, w10
b.eq 3f;
cmp x14, #0x22
b.eq 2f;
add x3, x0, #3
adds w4, w4, w11
b.eq 3f;
cmp x15, #0x22
b.eq 2f;
ubfx x12, x2, #32, #8
ubfx x13, x2, #40, #8
ubfx x14, x2, #48, #8
ubfx x15, x2, #56, #8
ldrsb w8, [x6, w12, sxtw]
ldrsb w9, [x6, w13, sxtw]
ldrsb w10, [x6, w14, sxtw]
ldrsb w11, [x6, w15, sxtw]
add x3, x0, #4
adds w4, w4, w8
b.eq 3f;
cmp x12, #0x22
b.eq 2f;
add x3, x0, #5
adds w4, w4, w9
b.eq 3f;
cmp x13, #0x22
b.eq 2f;
add x3, x0, #6
adds w4, w4, w10
b.eq 3f;
cmp x14, #0x22
b.eq 2f;
add x3, x0, #7
adds w4, w4, w11
b.eq 3f;
cmp x15, #0x22
b.eq 2f;
add x0, x0, #8
b 1b;
3:
add x0, x3, #1
ldp x29, x30, [sp], #0x10
ret
# '"' handling.
2 :
mov x0, x3
bl ns_js_fsk_str
b 1b;
# Constants.
.size ns_js_fsk_arr, .-ns_js_fsk_arr
Average : 29.810.
Well that’s bad.
What we can conclude here is that the instructions we made unconditional were previously being correctly identified by the branch predictor as rarely executed in reality.
By making them unconditionally executed, we ended up increasing the dynamic instruction count of our program, which decreased the performance.
Variant 6: let’s reduce the branch count and destroy our perf.#
One could think that, since branches cause speculation which takes time to undo when incorrect, performance could be increased by reducing the speculation requirements, i.e. removing branches from our program.
Here is a variant that reads 4 bytes at a time, then processes them without branches, using conditional compares and selects (a rough C model follows).
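Before the assembly, here is a rough, hedged C model of what one 4-byte step intends to do (using the lookup table from the C section above; this is not a faithful transcription, and as noted below the assembly itself ended up broken anyway):

#include <stdint.h>

extern int8_t ns_js_fsk_arr_tbl[256]; /* the s8 table from the C code */

/* 'go' drops to 0 at the first terminating byte and stays there; the
 * return value is the byte index of the stopping char (or 4 if none).
 * The ternary is what the CCMP + CSEL pairs implement. */
static int branchless_step(uint32_t v, int32_t *cnt)
{
    int go = 1, consumed = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t c = (uint8_t) (v >> (8 * i));
        *cnt += go ? ns_js_fsk_arr_tbl[c] : 0;
        go = (go && *cnt != 0 && c != '"') ? 1 : 0;
        consumed += go;
    }
    return consumed;
}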
What the hell am I doing with my Sunday...
# JS Array skip.
.align 4
.global ns_js_fsk_arr
.type ns_js_fsk_arr, %function
ns_js_fsk_arr:
stp x29, x30, [sp, #-0x10]!
# Init depth counter.
mov w4, #1
mov w5, #0x22
ldr x6, =ns_js_fsk_arr_tbl
add x0, x0, #1
cmp x0, #0
# Main loop.
1:
# Perform reads.
ldr x2, [x0]
mov x1, #0
mov x3, #1
# Extract char.
ubfx x12, x2, #0, #8
ubfx x13, x2, #8, #8
ubfx x14, x2, #16, #8
ubfx x15, x2, #24, #8
# Read lookup table.
ldrsb w8, [x6, w12, sxtw]
ldrsb w9, [x6, w13, sxtw]
ldrsb w10, [x6, w14, sxtw]
ldrsb w11, [x6, w15, sxtw]
# Determine if we should get out.
# If we should not get out, compare if we're in a '"'.
# If we should now get out, do not move forward.
add w4, w4, w8
ccmp w4, #0, #0x4, ne
ccmp w12, w5, #0x4, ne
csel x3, xzr, x3, eq
add x1, x1, x3
add w4, w4, w9
ccmp w4, #0, #0x4, ne
ccmp w13, w5, #0x4, ne
csel x3, xzr, x3, eq
add x1, x1, x3
add w4, w4, w10
ccmp w4, #0, #0x4, ne
ccmp w14, w5, #0x4, ne
csel x3, xzr, x3, eq
add x1, x1, x3
add w4, w4, w11
ccmp w4, #0, #0x4, ne
ccmp w15, w5, #0x4, ne
csel x3, xzr, x3, eq
add x1, x1, x3
b.eq 2f;
add x0, x0, #4
b 1b;
2:
ldrb w2, [x0, x1]
add x0, x0, x1
add x0, x0, #1
cmp w2, w5
b.eq 3f;
ldp x29, x30, [sp], #0x10
ret
# '"' handling.
3 :
bl ns_js_fsk_str
b 1b;
# Constants.
.size ns_js_fsk_arr, .-ns_js_fsk_arr
Two comments:
- it is not working. I tried re-running it while writing this article and it did not pass my testbench. There’s something wrong with it, but frankly I don’t care, because
- it is disastrous for perf. From my notes back when it was working, it literally multiplied the execution time by more than 5.
So let’s drop this approach in the first trash can we find and move along.
Aside: macroifying our byte-processing logic.#
Most readers who are still here (kudos) will probably be tired of seeing the same compulsive ubfx + ldrsb + adds + branch sequence.
So at this point in my development I grew a brain and made these macros to simplify the reading.
Here they are, for the record.
.macro xtr_cmp1 xrg_src, xrg_dst, xrg_dat, xrg_tbl, wrg_ctr, xrg_tm0, wrg_tm0, wrg_tm1, bit_stt, val_qot, val_add, lbl_out, lbl_qot
ubfx \xrg_tm0, \xrg_dat, #0 + \bit_stt, #8
add \xrg_dst, \xrg_src, #\val_add
ldrsb \wrg_tm1, [\xrg_tbl, \wrg_tm0, sxtw]
adds \wrg_ctr, \wrg_ctr, \wrg_tm1
b.eq \lbl_out;
cmp \xrg_tm0, #\val_qot
b.eq \lbl_qot;
.endm
.macro xtr_cmp2 xrg_src, xrg_dst, xrg_dat, xrg_tbl, wrg_ctr, xrg_tm0, wrg_tm0, wrg_tm1, val_qot, val_bas, lbl_out, lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 0, \val_qot, \val_bas + 0, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 8, \val_qot, \val_bas + 1, \lbl_out, \lbl_qot
.endm
.macro xtr_cmp4 xrg_src, xrg_dst, xrg_dat, xrg_tbl, wrg_ctr, xrg_tm0, wrg_tm0, wrg_tm1, val_qot, val_bas, lbl_out, lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 0, \val_qot, \val_bas + 0, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 8, \val_qot, \val_bas + 1, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 16, \val_qot, \val_bas + 2, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 24, \val_qot, \val_bas + 3, \lbl_out, \lbl_qot
.endm
.macro xtr_cmp8 xrg_src, xrg_dst, xrg_dat, xrg_tbl, wrg_ctr, xrg_tm0, wrg_tm0, wrg_tm1, val_qot, val_bas, lbl_out, lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 0, \val_qot, \val_bas + 0, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 8, \val_qot, \val_bas + 1, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 16, \val_qot, \val_bas + 2, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 24, \val_qot, \val_bas + 3, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 32, \val_qot, \val_bas + 4, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 40, \val_qot, \val_bas + 5, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 48, \val_qot, \val_bas + 6, \lbl_out, \lbl_qot
xtr_cmp1 \xrg_src, \xrg_dst, \xrg_dat, \xrg_tbl, \wrg_ctr, \xrg_tm0, \wrg_tm0, \wrg_tm1, 56, \val_qot, \val_bas + 7, \lbl_out, \lbl_qot
.endm
Now that we’re done covering that complete mess, and now that I’ve given you even fewer reasons to keep reading, let’s proceed to a variant that improves the perf for once.
Variant 7: skipping spaces.#
A simple look at the non-simple JSON from ARM (yeah you might have forgotten but all of this is just because I wanted to parse this quickly…) shows that it’s just full of spaces.
One could say “Hey man, why don’t you just re-process your json?” but WHO DOES THAT? The rules are the rules and we’re not gonna change them just because it would make sense, right? If so, what’s next? Converting our json DB to a binary file and processing it entirely in, what, 0.5ms? No fkng way, I’ll waste as many watts as I can and stick with my stupid rules.
More seriously, there is a bit of obsession going on right here, I won’t deny it. But I’m using this challenge as an exercise in improving perf with a stable input: sometimes you don’t choose the data that your program has to ingest, and finding optimization schemes based on that data is a useful skill.
Our next step will hence be to read 16 bytes at a time, compare the first 8 with spaces (with a simple cmp) and, if they are all spaces, move on to the next set of 8 (with only one additional 8-byte read required).
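In C terms, the fast-forward boils down to one 64-bit equality test per 8 bytes; a hedged sketch (memcpy stands in for the raw loads the assembly performs):

#include <stdint.h>
#include <string.h>

#define SPACES8 0x2020202020202020ULL /* eight ' ' chars */

/* Advance pos past any run of all-space 8-byte words; assumes a
 * non-space byte eventually terminates the run. */
static const char *skip_spaces8(const char *pos)
{
    uint64_t v;
    memcpy(&v, pos, 8);
    while (v == SPACES8) {
        pos += 8;
        memcpy(&v, pos, 8);
    }
    return pos;
}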
Fast forward if 8 chars are spaces.
# X0 : data pointer.
# X1 : next data pointer.
# X2 : data0.
# X3 : data1.
# X4 : LUT.
# X5 : depth counter.
# X6 : cst0.
# X7 : scratch.
# X8 : TMP0.
# X9 : TMP1
#
# JS Array skip.
.align 4
.global ns_js_fsk_arr
.type ns_js_fsk_arr, %function
ns_js_fsk_arr:
stp x29, x30, [sp, #-0x10]!
# Init loop variables.
add x0, x0, #1
ldr x4, =ns_js_fsk_arr_tbl
mov x5, #1
ldr x6, 6f;
# Main loop.
0:
# Perform reads.
ldr x2, [x0]
15:
ldr x3, [x0, #8]
cmp x2, x6
b.ne 1f
add x0, x0, #8
mov x2, x3
b 15b
1:
xtr_cmp8 x0, x1, x2, x4, w5, x8, w8, w9, 0x22, 0, 3f, 2f
xtr_cmp8 x0, x1, x3, x4, w5, x8, w8, w9, 0x22, 8, 3f, 2f
add x0, x0, #16
b 0b;
3:
add x0, x1, #1
ldp x29, x30, [sp], #0x10
ret
# '"' handling.
2 :
mov x0, x1
bl ns_js_fsk_str
b 0b;
# Constants.
6:
.quad 0x2020202020202020
.size ns_js_fsk_arr, .-ns_js_fsk_arr
Average : 25.906.
We actually did improve the performance with that one.
Let’s do one last improvement so that I can finally move on with my life.
Variant 8: bypass 16 bytes, use 4 bytes.#
Our final variant will process chars in a combined way. First it will read 8 bytes at a time (16 initially, to sorta-pre-fetch), and will skip spaces 8 bytes at a time. Then, once it finds the first 8-byte word that is not entirely made of spaces, it will select the first 4-byte half that contains a non-space character, and process those 4 bytes.
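The half-select step itself is a single 32-bit compare; here is a hedged C sketch of just that piece (hypothetical helper):

#include <stdint.h>

/* If the low 4 bytes are all spaces, the interesting half is the
 * high one: shift it down and advance the position by 4. */
static uint32_t select_half(uint64_t v, const char **pos)
{
    if ((uint32_t) v == 0x20202020u) {
        v >>= 32;
        *pos += 4;
    }
    return (uint32_t) v;
}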
Find the first non-space 4 bytes and process them. Then redo it.
# X0 : data pointer.
# X1 : next data pointer.
# X2 : data0.
# X3 : data1.
# X4 : LUT.
# X5 : depth counter.
# X6 : cst0.
# X7 : scratch.
# X8 : TMP0.
# X9 : TMP1
# JS Array skip.
.align 4
.global ns_js_fsk_arr
.type ns_js_fsk_arr, %function
ns_js_fsk_arr:
stp x29, x30, [sp, #-0x10]!
# Init depth counter.
mov w5, #1
ldr x4, =ns_js_fsk_arr_tbl
add x0, x0, #1
ldr x6, 12f;
ldr x13, 12f + 24
# Main loop.
0:
# Perform reads.
ldp x2, x3, [x0]
1:
cmp x2, x6
b.ne 2f
mov x2, x3
ldr x3, [x0, #16]
add x0, x0, #8
b 1b
2:
cmp w2, w6
b.ne 3f
lsr x2, x2, #32
add x0, x0, #4
3:
# 4 bytes extract compare.
xtr_cmp4 x0, x1, x2, x4, w5, x8, w8, w9, 0x22, 0, 11f, 10f
add x0, x0, #4
b 0b
# '"' handling.
10 :
mov x0, x1
bl ns_js_fsk_str
ldr w2, [x0]
b 3b;
# final ']' handling.
11:
add x0, x1, #1
ldp x29, x30, [sp], #0x10
ret
# Constants.
12:
.quad 0x2020202020202020
.size ns_js_fsk_arr, .-ns_js_fsk_arr
Average : 22.176.
Yay!
Conclusion.#
I must say that I’m proud of that result. It took a lot of sanity and time but so far that’s the best time I’ve scored.
One possible improvement would be to apply the same compound byte-processing trickery to the string skip logic, but I need to move on to other projects and am happy with my current result.