@vdt
Forked from jwbee/jq.md
Created March 19, 2025 07:15

Revisions

  1. @jwbee jwbee revised this gist Mar 18, 2025. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion jq.md
    @@ -133,7 +133,7 @@ Summary
    1.44 ± 0.03 times faster than LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    Just preloading mimalloc gives a spectracular speedup of 44%! Wow!
    Just preloading mimalloc gives a spectacular speedup of 44%! Wow!

    ## Step 6: Rebuild with mimalloc

  2. @jwbee jwbee created this gist Mar 18, 2025.
    156 changes: 156 additions & 0 deletions jq.md
    @@ -0,0 +1,156 @@
    # Make Ubuntu packages 90% faster by rebuilding them

    ## TL;DR

    You can take the same source code package that Ubuntu uses to build [jq](https://github.com/jqlang/jq), compile it again, and realize 90% better performance.

    ## Setting

    I use `jq` for processing GeoJSON files and other open data offered in JSON format. Today I am working with a 500MB GeoJSON file that contains the Alameda County Assessor's parcel map. I want to run a query that prints the city for every parcel worth less than a threshold amount. The program is

    ```.features[] | select(.properties.TotalNetValue < 193000) | .properties.SitusCity```

    This takes about 5 seconds with the file cached, on a Ryzen 9 9950X system. That seems a bit shabby and I am sure we can do better.
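
    For anyone following along, here is a minimal sketch of how the filter file and the `jq -rf` invocation fit together. The file name `select.jq` and the `-rf` flags come from the benchmark commands in the steps below; the scratch directory and the two sample parcels are made up purely for illustration.

    ```shell
    # Hypothetical scratch directory; the benchmarks in this write-up use /tmp directly.
    workdir=$(mktemp -d)

    # Save the filter to a file so it can be passed with -f.
    printf '%s\n' '.features[] | select(.properties.TotalNetValue < 193000) | .properties.SitusCity' > "$workdir/select.jq"

    # Two made-up parcels, only to show the shape of the data the filter expects.
    printf '%s\n' '{"features":[{"properties":{"TotalNetValue":150000,"SitusCity":"OAKLAND"}},{"properties":{"TotalNetValue":900000,"SitusCity":"BERKELEY"}}]}' > "$workdir/sample.geojson"

    # -r prints raw strings (no JSON quotes); -f reads the filter from a file.
    if command -v jq >/dev/null 2>&1; then
        jq -rf "$workdir/select.jq" "$workdir/sample.geojson"
    fi
    ```

    With this sample only the first parcel is below the threshold, so only its city should be printed.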

    ## Step 1: Just rebuild the package

    What happens if you grab the [jq source code from Launchpad](https://launchpad.net/ubuntu/+archive/primary/+sourcefiles/jq/1.7.1-3build1/jq_1.7.1.orig.tar.gz), then configure and rebuild it with no flags at all? Even that is about 2-4% faster than the Ubuntu binary package.

    We are using [hyperfine](https://github.com/sharkdp/hyperfine) to get repeatable results. The `jq` process is pinned to logical CPU 2 with `taskset`, to keep it away from system interrupts that run on CPU 0 and to prevent CPU migrations.

    ```
    % hyperfine --warmup 1 --runs 3 -L binary ~/jq-jq-1.7.1/jq,/usr/bin/jq "taskset -c 2 {binary} -rf /tmp/select.jq /tmp/parcels.geojson"
    Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.517 s ± 0.017 s [User: 3.907 s, System: 0.610 s]
    Range (min … max): 4.497 s … 4.531 s 3 runs
    Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.641 s ± 0.038 s [User: 4.013 s, System: 0.628 s]
    Range (min … max): 4.601 s … 4.675 s 3 runs
    Summary
    taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.03 ± 0.01 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    ## Step 2: Rebuild with clang and better flags

    Next, let's rebuild the program with my favorite compiler, a higher optimization level, LTO, and some flags that I typically want to help with debugging and profiling. Some of them are irrelevant to this case, but I use the same flags for most builds. The flags that seem to make a performance difference are:

    - -O3 vs -O2
    - -flto
    - -DNDEBUG

    The last of those saves a lot of cost in assertions that showed up strongly in the profiles.

    `% CC=clang-18 LDFLAGS="-flto -g -Wl,--emit-relocs -Wl,-z,now -Wl,--gc-sections -fuse-ld=lld" CFLAGS="-flto -DNDEBUG -fno-omit-frame-pointer -gmlt -march=native -O3 -mno-omit-leaf-frame-pointer -ffunction-sections -fdata-sections" ./configure`
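
    The one-liner above is easier to experiment with when the flags are grouped by purpose. This is only a sketch: the grouping and the comments are one reading of what each flag does, not something the benchmark depends on, and it assumes `clang-18` and `lld` are installed.

    ```shell
    # Same flags as the configure one-liner, grouped so each can be toggled.
    export CC=clang-18

    CFLAGS="-O3 -march=native"        # higher optimization level, tuned for the build host
    CFLAGS="$CFLAGS -flto"            # link-time optimization (also needed in LDFLAGS)
    CFLAGS="$CFLAGS -DNDEBUG"         # compiles out assert()
    # Debugging and profiling aids: frame pointers and line tables.
    CFLAGS="$CFLAGS -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -gmlt"
    # One section per function/object, so the linker can drop unused ones.
    CFLAGS="$CFLAGS -ffunction-sections -fdata-sections"
    export CFLAGS

    LDFLAGS="-flto -g -fuse-ld=lld"
    LDFLAGS="$LDFLAGS -Wl,--emit-relocs -Wl,-z,now -Wl,--gc-sections"
    export LDFLAGS

    # Next, from the jq source tree: ./configure && make
    ```

    Note that `-march=native` ties the binary to the CPU features of the build machine, so a binary built this way may not run on older hardware.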

    ```
    Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 3.853 s ± 0.033 s [User: 3.245 s, System: 0.608 s]
    Range (min … max): 3.822 s … 3.887 s 3 runs
    Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.631 s ± 0.047 s [User: 4.012 s, System: 0.619 s]
    Range (min … max): 4.602 s … 4.686 s 3 runs
    Summary
    taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.20 ± 0.02 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    Now we are 20% faster than the Ubuntu package with almost no effort.

    ## Step 3: Add TCMalloc

    Jq is a complex C program, and C programs of any complexity tend to rely on malloc and free, because the language offers no other cognizable way to deal with memory. Allocation is the top line in the profile by far. What if we use a better allocator, instead of the one that comes in GNU libc? Ubuntu offers a package of TCMalloc, which is actually rather obsolete and not the current TCMalloc effort, but it's an allocator package in their repo, so let's give it a whirl.

    Having added `-L/usr/lib/x86_64-linux-gnu -ltcmalloc_minimal` to the LDFLAGS and rebuilt ...
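
    A sketch of that change, assuming the `LDFLAGS` from Step 2 are still exported in the shell. The `links_against` helper is a hypothetical convenience, not part of jq or TCMalloc; it only checks that `ldd` lists the library, so a rebuild that silently fell back to the glibc allocator gets noticed.

    ```shell
    # Append the allocator to whatever LDFLAGS the previous step exported.
    LDFLAGS="${LDFLAGS:-} -L/usr/lib/x86_64-linux-gnu -ltcmalloc_minimal"
    export LDFLAGS

    # links_against BINARY NAME: succeeds if ldd lists a library matching NAME.
    # (Hypothetical helper; run it after ./configure && make.)
    links_against() {
        ldd "$1" 2>/dev/null | grep -q "$2"
    }
    ```

    After the rebuild, something like `links_against ./jq tcmalloc && echo linked` confirms the allocator made it into the binary.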

    ```
    Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 3.253 s ± 0.009 s [User: 2.625 s, System: 0.628 s]
    Range (min … max): 3.245 s … 3.262 s 3 runs
    Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.611 s ± 0.026 s [User: 4.015 s, System: 0.596 s]
    Range (min … max): 4.591 s … 4.640 s 3 runs
    Summary
    taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.42 ± 0.01 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    This is not bad. We are now more than 40% faster than the package the distribution tried to foist on us.

    ## Step 4: What about just preloading TCMalloc dynamically?

    If the allocator is the issue, it stands to reason that we can get some of that benefit just by hiding the libc allocator using a dynamic preload with the stock Ubuntu binary.

    ```
    Benchmark 1: LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.601 s ± 0.027 s [User: 3.966 s, System: 0.634 s]
    Range (min … max): 4.577 s … 4.630 s 3 runs
    Benchmark 2: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.082 s ± 0.010 s [User: 3.476 s, System: 0.606 s]
    Range (min … max): 4.071 s … 4.091 s 3 runs
    Summary
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.13 ± 0.01 times faster than LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    This by itself is good for 13%. Not bad.
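
    The hyperfine command for this step is not shown, so here is one way the comparison could be reproduced. It is a sketch, not necessarily the invocation used for the numbers above: `--warmup 1 --runs 3` are borrowed from Step 1, and `preload_cmd` is a hypothetical helper that only builds the command string.

    ```shell
    jq_cmd="taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson"
    tcmalloc=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so

    # preload_cmd LIB: prefix the jq command with an LD_PRELOAD assignment.
    # An empty LIB gives the baseline that uses the glibc allocator.
    preload_cmd() {
        printf 'LD_PRELOAD=%s %s\n' "$1" "$jq_cmd"
    }

    # Only benchmark when hyperfine and the input file are actually present.
    if command -v hyperfine >/dev/null 2>&1 && [ -r /tmp/parcels.geojson ]; then
        hyperfine --warmup 1 --runs 3 "$(preload_cmd "")" "$(preload_cmd "$tcmalloc")"
    fi
    ```

    hyperfine runs each command through a shell by default, which is why the `LD_PRELOAD=` prefix can live inside the command string.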

    ## Step 5: Dynamically loading other allocators

    Ubuntu also ships packages of [jemalloc](https://github.com/jemalloc/jemalloc) and [mimalloc](https://github.com/microsoft/mimalloc). We can try them all. It turns out that mimalloc beats all others.

    Note: mimalloc result obtained after setting `MIMALLOC_LARGE_OS_PAGES=1` in the environment.

    ```
    Benchmark 1: LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.636 s ± 0.055 s [User: 4.015 s, System: 0.621 s]
    Range (min … max): 4.579 s … 4.767 s 10 runs
    Benchmark 2: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.138 s ± 0.055 s [User: 3.511 s, System: 0.627 s]
    Range (min … max): 4.080 s … 4.255 s 10 runs
    Benchmark 3: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.067 s ± 0.030 s [User: 3.345 s, System: 0.721 s]
    Range (min … max): 4.031 s … 4.123 s 10 runs
    Benchmark 4: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 3.209 s ± 0.041 s [User: 2.934 s, System: 0.274 s]
    Range (min … max): 3.160 s … 3.274 s 10 runs
    Summary
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.27 ± 0.02 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    1.29 ± 0.02 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    1.44 ± 0.03 times faster than LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    Just preloading mimalloc gives a spectracular speedup of 44%! Wow!
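
    Before running a comparison like this on another machine, it helps to check which of the three libraries are installed. A small sketch: the library names and directory are taken from the benchmark output above, `MIMALLOC_LARGE_OS_PAGES=1` is the setting from the note, and `available_allocators` is a hypothetical helper.

    ```shell
    libdir=/usr/lib/x86_64-linux-gnu

    # Only mimalloc reads this variable; the other allocators ignore it.
    export MIMALLOC_LARGE_OS_PAGES=1

    # available_allocators DIR: print the preloadable allocator libraries found in DIR.
    available_allocators() {
        for name in libtcmalloc_minimal.so libjemalloc.so libmimalloc.so; do
            if [ -e "$1/$name" ]; then
                printf '%s\n' "$1/$name"
            fi
        done
    }

    available_allocators "$libdir"
    ```

    Each path it prints can be used as an `LD_PRELOAD` value for the stock `/usr/bin/jq`.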

    ## Step 6: Rebuild with mimalloc

    It's cool that mimalloc is fast in this case, but dynamic preloads aren't amazing for performance. Let's rebuild the program with mimalloc.

    ```
    Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 2.428 s ± 0.019 s [User: 2.161 s, System: 0.267 s]
    Range (min … max): 2.404 s … 2.464 s 10 runs
    Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    Time (mean ± σ): 4.606 s ± 0.039 s [User: 3.979 s, System: 0.627 s]
    Range (min … max): 4.522 s … 4.640 s 10 runs
    Summary
    taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.90 ± 0.02 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
    ```

    Jq rebuilt from source with a better allocator is 1.9x as fast as the Ubuntu binary package for this workload, or nearly twice as fast. In another application, processing 2.2GB of JSON in 13000 files (using [rush](https://github.com/shenwei356/rush) to parallelize), this build of jq does the job in 0.755s vs 1.424s for the Ubuntu package. That is a speedup of nearly 2x again. These are very satisfactory results.
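
    The headline ratios can be re-derived from the mean times hyperfine reported. A small sketch using `awk` for the floating-point division; the `ratio` helper is hypothetical, and the numbers are copied from the output in each step.

    ```shell
    # ratio BASELINE CANDIDATE: baseline time divided by candidate time, two decimals.
    ratio() {
        LC_ALL=C awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
    }

    # Mean wall-clock seconds: Ubuntu package first, rebuilt binary second.
    ratio 4.641 4.517   # Step 1: plain rebuild
    ratio 4.631 3.853   # Step 2: clang, -O3, LTO, NDEBUG
    ratio 4.611 3.253   # Step 3: linked against TCMalloc
    ratio 4.606 2.428   # Step 6: linked against mimalloc
    ratio 1.424 0.755   # 13000-file workload run through rush
    ```

    The Step 6 line comes out at 1.90 and the last one at 1.89, which is where "nearly 2x" comes from in both cases.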