@shalaby
Forked from little-dude/Rust Optimization.md
Created March 4, 2021 09:23
Revisions

  1. @jFransham jFransham revised this gist May 17, 2017. 1 changed file with 9 additions and 10 deletions.
    19 changes: 9 additions & 10 deletions Rust Optimization.md
    @@ -182,13 +182,12 @@ a pointer for every element, instead of a single integer for the entire array.

    Not only that, but without some truly crazy compiler optimizations this means
    that each element may not be directly after the element before it in memory.
    By our cache locality calculations, that means you have to calculate cache
misses for each element separately, and because of the space overhead of storing
    a pointer per element you get a _minimum_ of one cache miss for every 4 elements
    of Haskell's `String` type (assuming 64-bit pointers, the size of `Char` doesn't
    matter as long as it's less than 64 bits), and a ludicrous maximum of _two cache
    misses for every element_. This is why the performance-savvy in languages
    such as Haskell and Lisp know to use `Vec`-style constructs when possible.
    We'll get to how to calculate this in a moment, but essentially that means that
    Haskell's `String` type can cause a cache miss up to twice _per element_,
    whereas if you had a vector of `Char`s (assuming 32-bit chars) it could only
    cause a maximum of 1 cache miss for every 16 elements. This is why the
    performance-savvy in languages such as Haskell and Lisp know to use vector-like
    constructs when possible.
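
To make that concrete, here is a sketch (mine, not the article's) of the contiguous alternative for the matrix case discussed below: one flat `Vec` where the row offset is plain integer arithmetic rather than a second pointer to chase.

```rust
// Illustrative flat matrix: all `width * height` elements sit in one
// contiguous allocation, so iterating in row order walks memory linearly.
struct Matrix {
    width: usize,
    data: Vec<f32>, // length is width * height
}

impl Matrix {
    fn new(width: usize, height: usize) -> Self {
        Matrix { width, data: vec![0.0; width * height] }
    }

    // One multiply and one add to locate an element; no pointer chasing.
    fn get(&self, row: usize, col: usize) -> f32 {
        self.data[row * self.width + col]
    }
}

fn main() {
    let m = Matrix::new(4, 4);
    assert_eq!(m.get(3, 3), 0.0);
}
```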

    Back in the world of Rust, this means that you should avoid indirection-heavy
representations such as `Vec<Vec<_>>` for a matrix, since this means that each
    @@ -770,9 +769,9 @@ squeeze out every last drop of performance.

    Also, the compiler does really get it wrong sometimes, and can miss out on
inlining opportunities that would improve code speed. However, only add the
    `#[inline(always)]` pragma if you can prove with benchmarks that it improves the
speed, and adding these pragmas is a bit of a dark art. Your effort is probably
    better spent elsewhere.
    `#[inline(always)]` annotation if you can prove with benchmarks that it improves
the speed, and adding these annotations is a bit of a dark art. Your effort is
    probably better spent elsewhere.

    If you want to reduce the size of your code, you can try using
    `panic = "abort"`. This removes the "landing pads" that allow Rust to show a
  2. @jFransham jFransham revised this gist May 17, 2017. 1 changed file with 7 additions and 5 deletions.
    12 changes: 7 additions & 5 deletions Rust Optimization.md
    @@ -182,11 +182,13 @@ a pointer for every element, instead of a single integer for the entire array.

    Not only that, but without some truly crazy compiler optimizations this means
    that each element may not be directly after the element before it in memory.
    Worse, the elements may even be out of order. For a contrived example, you might
    need to load cache line A to read element 1, cache line B to read element 2, and
    then reload cache line A to read element 3. This is why the performance-savvy in
    languages such as Haskell and Lisp know to use `Vec`-style constructs when
    possible.
    By our cache locality calculations, that means you have to calculate cache
misses for each element separately, and because of the space overhead of storing
    a pointer per element you get a _minimum_ of one cache miss for every 4 elements
    of Haskell's `String` type (assuming 64-bit pointers, the size of `Char` doesn't
    matter as long as it's less than 64 bits), and a ludicrous maximum of _two cache
    misses for every element_. This is why the performance-savvy in languages
    such as Haskell and Lisp know to use `Vec`-style constructs when possible.

    Back in the world of Rust, this means that you should avoid indirection-heavy
representations such as `Vec<Vec<_>>` for a matrix, since this means that each
  3. @jFransham jFransham revised this gist May 17, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Rust Optimization.md
    @@ -548,7 +548,7 @@ fn bench_log_base_unrolled(b: &mut Bencher) {
    }
    ```

    `test::black_box` a magic function that prevents rustc and LLVM calculating
    `test::black_box` is a magic function that prevents rustc and LLVM calculating
    those function calls at compile-time and converting them into a constant, which
    usually they would (actually, it's not magic, it's just some inline assembly
    that doesn't do anything, since neither rustc nor LLVM will try to optimize
  4. @jFransham jFransham revised this gist May 17, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Rust Optimization.md
    @@ -469,7 +469,7 @@ allocate are worse for use-cases that require maximum speed.
    [Duff's device][duff] is fun, but array-length-generic unrolled loops are
    unlikely to be faster than the equivalent optimized naïve code nowadays, since
    any optimizing compiler worth its bits will do this kind of optimization without
    having to mangle your code and pissing of future-you.
    having to mangle your code and ruining future-you's day.

    Having said that, if you know that an array is likely to be a multiple of N
    size, try making it a `&[[T; N]]` and operating on a `[T; N]` in each iteration.
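
As a rough illustration of the `&[[T; N]]` idea (a sketch of mine rather than the article's code, using `chunks_exact` to get fixed-size windows):

```rust
// Summing in fixed blocks of 4: `chunks_exact` hands the loop body slices
// that are always exactly 4 elements long, which the compiler can often
// unroll and vectorize with the bounds checks elided.
fn sum_by_fours(data: &[f32]) -> f32 {
    let mut chunks = data.chunks_exact(4);
    let mut total = 0.0;
    for chunk in &mut chunks {
        total += chunk[0] + chunk[1] + chunk[2] + chunk[3];
    }
    // Handle the trailing partial block, if any.
    total + chunks.remainder().iter().sum::<f32>()
}

fn main() {
    let v: Vec<f32> = (1..=10).map(|x| x as f32).collect();
    assert_eq!(sum_by_fours(&v), 55.0);
}
```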
  5. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions Rust Optimization.md
    @@ -170,8 +170,8 @@ you already have an instance of the allocating type). `Box<T>` where `T` has a
    statically-known size also allocates, so be careful of recursive enums. If
    you're creating a tree structure that then becomes immutable, like with an AST,
    you might want to consider using a `TypedArena` to get tighter control over
    memory use. `TypedArena` is still unstable though, so consider whether the gains
    would be worth it.
    memory use. `TypedArena` is still unstable though, and it increases complexity,
    so it's not suitable for all use-cases.
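
For illustration, a minimal arena-allocated tree might look like this (a sketch assuming the `typed-arena` crate, not code from the article):

```rust
use typed_arena::Arena;

// Every `Expr` node is handed out by one arena, so building the tree
// doesn't pay for a separate heap allocation per `Box`ed child.
enum Expr<'a> {
    Num(i64),
    Add(&'a Expr<'a>, &'a Expr<'a>),
}

fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(l, r) => eval(l) + eval(r),
    }
}

fn main() {
    let arena = Arena::new();
    // (1 + 2) + 3, with all five nodes living inside the arena.
    let one = arena.alloc(Expr::Num(1));
    let two = arena.alloc(Expr::Num(2));
    let sum = arena.alloc(Expr::Add(one, two));
    let three = arena.alloc(Expr::Num(3));
    let root = arena.alloc(Expr::Add(sum, three));
    assert_eq!(eval(root), 6);
}
```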

    This is why you may have heard some complaints about Haskell's use of a linked
    list of characters to represent a string. I'm not going to beat [gankro's
  6. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 6 additions and 2 deletions.
    8 changes: 6 additions & 2 deletions Rust Optimization.md
    @@ -167,8 +167,11 @@ This means that `String`, `Vec`, `HashMap` and `Box<Trait>`/`Box<[T]>` all
    allocate, but any user-defined struct does not (it may contain something that
    _does_ allocate, but it doesn't require any extra allocation to construct if
    you already have an instance of the allocating type). `Box<T>` where `T` has a
    statically-known size also allocates, but boxing statically-sized types is very
    rare. The only use-case I can think of is sending huge objects between threads.
    statically-known size also allocates, so be careful of recursive enums. If
    you're creating a tree structure that then becomes immutable, like with an AST,
    you might want to consider using a `TypedArena` to get tighter control over
    memory use. `TypedArena` is still unstable though, so consider whether the gains
    would be worth it.

    This is why you may have heard some complaints about Haskell's use of a linked
    list of characters to represent a string. I'm not going to beat [gankro's
    @@ -256,6 +259,7 @@ up 64 bytes, meaning that it would only take 2 cache misses to load in the worst
    case.

    [linked lists]: http://cglab.ca/~abeinges/blah/too-many-lists/book/#an-obligatory-public-service-announcement
    [rust-forest]: https://github.com/SimonSapin/rust-forest

    ## Keep as much as possible in registers

  7. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 0 additions and 2 deletions.
    2 changes: 0 additions & 2 deletions Rust Optimization.md
    @@ -1,7 +1,5 @@
    # Achieving warp speed with Rust

    ---

    ### Contents:
    - [Number one optimization tip: don't](#number-one-optimization-tip-dont)
    - [Never optimize blindly](#never-optimize-blindly)
  8. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 23 additions and 22 deletions.
    45 changes: 23 additions & 22 deletions Rust Optimization.md
    @@ -742,32 +742,33 @@ to speed it up.

    So now I've scared you off inlining, let's talk about when you should explicitly
    add inlining annotations. Small functions that are called often are a good
    target for inlining. `Iterator::next`, for example, or `Deref::deref`. These are
    likely to be automatically inlined when called internally, but marking these as
    `#[inline]` will allow users of your library to inline them too, even if they
    don't use LTO. Only functions marked `#[inline]` will be considered for
    cross-crate inlining, but that means the definition has to be stored in the
    compiled library, causing bloat and increasing compile times.
    `#[inline(always)]` is even more niche, but it's sometimes nice to ensure that a
    tiny function will be inlined, or as a kind of documentation that the function
call is free, for anyone who comes along and tries to manually inline it to
    improve performance. It really is very rare that you would want to do this,
    though, and it's best to just trust the compiler.

    The other class of functions that are good targets for forced inlining are ones
    that you know to often be called with constant parameters. We go into this
    target for inlining. `Iterator::next`, for example, or `Deref::deref`. The
    overhead from calling these functions may be larger than the time it takes to
    run the function itself. These are likely to be automatically inlined when
    called internally, but marking these as `#[inline]` will allow users of your
    library to inline them too, even if they don't use LTO. Only functions marked
    `#[inline]` will be considered for cross-crate inlining, but that means the
    definition has to be stored in the compiled library, causing bloat and
    increasing compile times. `#[inline(always)]` is even more niche, but it's
    sometimes nice to ensure that a tiny function will be inlined, or as a kind of
documentation that the function call is free, for anyone who comes along and
    tries to manually inline it to improve performance. It really is very rare that
    you would want to do this, though, and it's best to just trust the compiler.
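
As a small illustration of the sort of function worth an `#[inline]` hint (the `Meters` newtype is made up for this sketch, not something from the article):

```rust
pub struct Meters(pub f64);

impl std::ops::Deref for Meters {
    type Target = f64;

    // Tiny, called constantly, and trivially inlinable: `#[inline]` makes
    // the body available to downstream crates even without LTO, while the
    // compiler still decides whether inlining actually happens.
    #[inline]
    fn deref(&self) -> &f64 {
        &self.0
    }
}

fn main() {
    let height = Meters(1.85);
    // Deref coercion goes through the (hopefully inlined) `deref` call.
    assert!(*height > 1.0);
}
```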

    The other class of functions that are good targets for annotating inlining are
    ones that you know to often be called with constant parameters. We go into this
later on, but `{integer}::from_str_radix` is an excellent example of this. Most
    uses of this function will have a constant as the second parameter, and so by
    judicious use of `#[inline]` we can prevent branching in the code. I tried
    benchmarking the difference between `#[inline(never)]` and `#[inline(always)]`
    on my reimplementation of `from_str_radix` (spoilers) but they're close to 3ns
    and inconsistent, which isn't within the error bars but is extremely minor.
    judicious use of `#[inline]` we can prevent branching and expensive operations
    like division for the consumers of our library. It's not worth losing sleep
    over though, since they could just use link-time optimization if they need to
    squeeze out every last drop of performance.

    Also, the compiler does really get it wrong sometimes, and can miss out on
    inlining opportunities that would improve code speed. However, only add the
    `#[inline]` pragma if you can prove with benchmarks that it improves the speed,
and adding these pragmas is a bit of a dark art. Your effort is probably better
    spent elsewhere.
inlining opportunities that would improve code speed. However, only add the
    `#[inline(always)]` pragma if you can prove with benchmarks that it improves the
speed, and adding these pragmas is a bit of a dark art. Your effort is probably
    better spent elsewhere.

    If you want to reduce the size of your code, you can try using
    `panic = "abort"`. This removes the "landing pads" that allow Rust to show a
  9. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Rust Optimization.md
    @@ -15,7 +15,7 @@
    - [Loop unrolling is still cool](#loop-unrolling-is-still-cool)
    - [`assert!` conditions beforehand](#assert-conditions-beforehand)
    - [Use link-time optimization](#use-link-time-optimization)
    - [Don't use `#[inline]`](#dont-use-inlinealways)
    - [Don't use `#[inline(always)]`](#dont-use-inlinealways)
    - [Parallelize, but not how you think](#parallelize-but-not-how-you-think)
    - [A case study](#a-case-study)
    - [Wrapping up](#wrapping-up)
  10. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Rust Optimization.md
    @@ -15,7 +15,7 @@
    - [Loop unrolling is still cool](#loop-unrolling-is-still-cool)
    - [`assert!` conditions beforehand](#assert-conditions-beforehand)
    - [Use link-time optimization](#use-link-time-optimization)
    - [Don't use `#[inline]`](#dont-use-inline)
    - [Don't use `#[inline]`](#dont-use-inlinealways)
    - [Parallelize, but not how you think](#parallelize-but-not-how-you-think)
    - [A case study](#a-case-study)
    - [Wrapping up](#wrapping-up)
  11. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 24 additions and 17 deletions.
    41 changes: 24 additions & 17 deletions Rust Optimization.md
    @@ -719,16 +719,16 @@ penalty. I am of the opinion that compile times only matter for debug builds, so
    that's a tradeoff I'm willing to make. As with everything else here, profile and
    check that the tradeoff is worthwhile.

    ## Don't use `#[inline]`
    ## Don't use `#[inline(always)]`

    `#[inline]` feels good as a performance hint, but the truth is that optimizing
    compilers are really good at working out when a function would benefit from
    being inlined, and Rust isn't constrained to the slower standardized C calling
    convention and can use `fastcc`, making function calls extremely cheap. You're
    more likely to cause the size of your executable to bloat. This takes up more
    space on your hard drive, of course, but that's not too much of a problem. If
    you have even a single bundled asset like images or audio they will likely dwarf
    the size of your executable.
    `#[inline(always)]` feels good as a performance hint, but the truth is that
    optimizing compilers are really good at working out when a function would
    benefit from being inlined, and Rust isn't constrained to the slower
    standardized C calling convention and can use `fastcc`, making function calls
    extremely cheap. You're more likely to cause the size of your executable to
    bloat. This takes up more space on your hard drive, of course, but that's not
    too much of a problem. If you have even a single bundled asset like images or
    audio they will likely dwarf the size of your executable.

    The real issue here is that it can make your program no longer fit in the CPU's
    instruction cache. The CPU will only have to go to RAM for its instructions when
    @@ -740,21 +740,28 @@ by manually marking it as inline, and you have benchmarks to back that up,
it's just as likely that you'll slow a program down with careless inlining as
    to speed it up.

    So now I've scared you off inlining, let's talk about when inlining actually
    helps. Small functions that are called often are a good target for inlining.
    `Iterator::next`, for example, or `Deref::deref`. These are likely to be
    automatically inlined when called internally, but marking these as `#[inline]`
    will ensure this behaviour, as well as allow users of your library to inline
    them too, if they don't use LTO.
    So now I've scared you off inlining, let's talk about when you should explicitly
    add inlining annotations. Small functions that are called often are a good
    target for inlining. `Iterator::next`, for example, or `Deref::deref`. These are
    likely to be automatically inlined when called internally, but marking these as
    `#[inline]` will allow users of your library to inline them too, even if they
    don't use LTO. Only functions marked `#[inline]` will be considered for
    cross-crate inlining, but that means the definition has to be stored in the
    compiled library, causing bloat and increasing compile times.
    `#[inline(always)]` is even more niche, but it's sometimes nice to ensure that a
    tiny function will be inlined, or as a kind of documentation that the function
call is free, for anyone who comes along and tries to manually inline it to
    improve performance. It really is very rare that you would want to do this,
    though, and it's best to just trust the compiler.

    The other class of functions that are good targets for forced inlining are ones
    that you know to often be called with constant parameters. We go into this
later on, but `{integer}::from_str_radix` is an excellent example of this. Most
    uses of this function will have a constant as the second parameter, and so by
    judicious use of `#[inline]` we can prevent branching in the code. I tried
    benchmarking the difference between `#[inline(never)]` and `#[inline(always)]`
    on my reimplementation of `from_str_radix` (spoilers) but they're close to 3ns,
    which isn't within the error bars but is extremely minor.
    on my reimplementation of `from_str_radix` (spoilers) but they're close to 3ns
    and inconsistent, which isn't within the error bars but is extremely minor.

    Also, the compiler does really get it wrong sometimes, and can miss out on
    inlining opportunities that would improve code speed. However, only add the
  12. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 30 additions and 0 deletions.
    30 changes: 30 additions & 0 deletions Rust Optimization.md
    @@ -1,5 +1,27 @@
    # Achieving warp speed with Rust

    ---

    ### Contents:
    - [Number one optimization tip: don't](#number-one-optimization-tip-dont)
    - [Never optimize blindly](#never-optimize-blindly)
    - [Don't bother optimizing one-time costs](#dont-bother-optimizing-one-time-costs)
    - [Improve your algorithms](#improve-your-algorithms)
    - [CPU architecture primer](#cpu-architecture-primer)
    - [Keep as much as possible in cache](#keep-as-much-as-possible-in-cache)
    - [Keep as much as possible in registers](#keep-as-much-as-possible-in-registers)
    - [Avoid `Box<Trait>`](#avoid-boxtrait)
    - [Use stack-based variable-length datatypes](#use-stack-based-variable-length-datatypes)
    - [Loop unrolling is still cool](#loop-unrolling-is-still-cool)
    - [`assert!` conditions beforehand](#assert-conditions-beforehand)
    - [Use link-time optimization](#use-link-time-optimization)
    - [Don't use `#[inline]`](#dont-use-inline)
    - [Parallelize, but not how you think](#parallelize-but-not-how-you-think)
    - [A case study](#a-case-study)
    - [Wrapping up](#wrapping-up)

    ---

    If you're looking to write fast code in Rust, good news! Rust makes it really
    easy to write really fast code. The focus on zero-cost abstractions, the
    lack of implicit boxing and the static memory management means that even naïve
    @@ -1012,6 +1034,14 @@ prediction. Ideally you'd just not do micro-benchmarks, but some functions do
    legitimately call for it. Don't trust a benchmark, especially a microbenchmark,
    until you've rerun it multiple times.

When I reran this particular benchmark (at least 10 times in total, not
including the benchmarks I ran while editing the code) to ensure that the
numbers were stable, I found that although the averages are extremely stable
(the native one was sometimes slightly slower; the 36ns value above is what I
see most of the time), the variances are mostly 0-3ns with spikes of 13-26ns. I
don't have a good explanation for this, so expect a follow-up post with tips on
writing better benchmarks.

    This is a perfect example of why low-level optimization is important, since this
    is exactly the kind of function that could be used hundreds of thousands of
    times in parsers of textual data and a 10ns speedup here could lead to
  13. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 0 additions and 8 deletions.
    8 changes: 0 additions & 8 deletions Rust Optimization.md
    @@ -1012,14 +1012,6 @@ prediction. Ideally you'd just not do micro-benchmarks, but some functions do
    legitimately call for it. Don't trust a benchmark, especially a microbenchmark,
    until you've rerun it multiple times.

When I reran this particular benchmark (at least 10 times in total, not
including the benchmarks I ran while editing the code) to ensure that the
numbers were stable, I found that although the averages are extremely stable
(the native one was sometimes slightly slower; the 36ns value above is what I
see most of the time), the variances are mostly 0-3ns with spikes of 13-26ns. I
don't have a good explanation for this, so expect a follow-up post with tips on
writing better benchmarks.

    This is a perfect example of why low-level optimization is important, since this
    is exactly the kind of function that could be used hundreds of thousands of
    times in parsers of textual data and a 10ns speedup here could lead to
  14. @jFransham jFransham revised this gist May 16, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Rust Optimization.md
    @@ -890,7 +890,7 @@ correct behaviour. You'll notice some optimizations here:
    * We rely on overflow to test `A < x < B` comparisons. This is only useful here
    because we already need the `x - A` value, so we're saving an extra comparison
    and a bitwise and. In most code `A < x && x < B` is as cheap or cheaper than
    `x - A < B` with overflow. This is where most of our gains come from.
    `x - A < B` with overflow.
    * We use `| 32` to unify the codepaths for upper- and lowercase letters,
    reducing the number of comparisons we need to do.
    * We don't do `output + digit * mul` when `output == 0` and `mul == 1`. This
  15. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions Rust Optimization.md
    @@ -158,10 +158,10 @@ element is small and the number of elements is large, because it needs to store
    a pointer for every element, instead of a single integer for the entire array.

    Not only that, but without some truly crazy compiler optimizations this means
    that each element may not be in the same cache line as the one before it and
    worse, may even be out of order (so you need to load cache line A to read
    element 1, cache line B to read element 2, and then reload cache line A to read
    element 3, to give a contrived example). This is why the performance-savvy in
    that each element may not be directly after the element before it in memory.
    Worse, the elements may even be out of order. For a contrived example, you might
    need to load cache line A to read element 1, cache line B to read element 2, and
    then reload cache line A to read element 3. This is why the performance-savvy in
    languages such as Haskell and Lisp know to use `Vec`-style constructs when
    possible.

  16. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 5 additions and 5 deletions.
    10 changes: 5 additions & 5 deletions Rust Optimization.md
    @@ -767,12 +767,12 @@ can safely be run simultaneously, so the CPU does so. This is essentially free
    parallelism without the need for locks, work queues or anything that affects
    your architecture at all, so you would be crazy not to take advantage of it.

    Just like loop unrolling, parallelizable computation also lends itself well to
    autovectorization, which is the process where the compiler realizes that you're
    doing the same thing to multiple different values and converts it to a special
    instruction that, well, does the same thing to multiple different values.
    Parallelizable computation also lends itself well to autovectorization, which is
    the process where the compiler realizes that you're doing the same thing to
    multiple different values and converts it to a special instruction that, well,
    does the same thing to multiple different values.

    You can translate the following numerical code:
    For example, the compiler could translate the following numerical code:

    ```rust
    (a1 + a2) + (b1 + b2) + (c1 + c2) + (d1 + d2)
  17. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 28 additions and 46 deletions.
    74 changes: 28 additions & 46 deletions Rust Optimization.md
    @@ -2,21 +2,20 @@

    If you're looking to write fast code in Rust, good news! Rust makes it really
    easy to write really fast code. The focus on zero-cost abstractions, the
    lack of implicit boxing and the lifetime system that means memory is managed
    statically means that even naïve code is often faster than the equivalent in
    most other languages out there, and certainly faster than naïve code in any
    equivalently-safe language. Maybe, though, like most programmers you've spent
    your whole programming career safely insulated from having to think about any of
    the details of the machine, and now you want to dig a little deeper and find out
    the real reason that Python script you rewrote in Rust runs 100x faster and uses
    a 10th of the memory. After all, they both do the same thing and run on the same
    CPU, right?
    lack of implicit boxing and the static memory management means that even naïve
    code is often faster than the equivalent in other languages, and certainly
    faster than naïve code in any equally-safe language. Maybe, though, like most
    programmers you've spent your whole programming career safely insulated from
    having to think about any of the details of the machine, and now you want to dig
    a little deeper and find out the real reason that Python script you rewrote in
    Rust runs 100x faster and uses a 10th of the memory. After all, they both do the
    same thing and run on the same CPU, right?

    So, here's an optimization guide, aimed at those who know how to program but
    maybe don't know how it maps to real ones and zeroes being bandied around on the
    bare metal. I'll try to weave practical tips about optimizing Rust code with
    explanations of the reason behind why some code is faster than others, and we'll
    end with a case study from the Rust standard library.
    maybe don't know how it maps to real ones and zeroes on the bare metal of your
    CPU. I'll try to weave practical tips about optimizing Rust code with
    explanations of the reason why it's faster than the alternative, and we'll end
    with a case study from the Rust standard library.

    This post assumes decent familiarity with programming, a beginner's familiarity
    with Rust and almost no familiarity with CPU architecture.
    @@ -878,30 +877,22 @@ I explicitly use `wrapping_*` functions not for optimization purposes (because
    overflow checks are removed at runtime), but because overflow is required for
    correct behaviour. You'll notice some optimizations here:

    * We start at the end and work backwards, keeping a "mul" counter. Since we have
    a finite number of supported radixes (radices?) and an upper bound on the size
    of the output number we could almost certainly generate a list of all possible
    values for `mul` and use indexing instead of multiplication here, but it's a
    lot of work for a speed boost that would almost certainly be minimal.
    Originally I wrote a version of this that works forwards and multiplies
    `output` by radix each loop but the backwards method is 10% faster. This seems
    to be due to better instruction-level parallelism. The multiplications can be
    parallelized and only the addition relies on the previous iteration's value
    for `output`, and addition is much faster. Additionally, we waste a cycle
    doing `0 * radix` in the other version. The forward method is still 20% faster
    than the standard library version. Any fold or reducing operations can be
improved in this way by making more of the operations parallelizable.
    * We start at the end and work backwards, keeping a "mul" counter. Originally I
    wrote a version of this that works forwards and multiplies `output` by radix
    each loop but the backwards method is 10% faster. This seems to be due to
    better instruction-level parallelism. The multiplications can be parallelized
    and only the addition relies on the previous iteration's value for `output`,
    and addition is much faster.

    Any folding operations can be improved in this way by exploiting the
    algebraic laws (in this case, the distributive law) to improve the number of
    operations that can be done in parallel.
    * We rely on overflow to test `A < x < B` comparisons. This is only useful here
    because we already need the `x - A` value, so we're saving an extra comparison
    and a bitwise and. In most code `A < x && x < B` is as cheap or cheaper than
    `x - A < B` with overflow, since even though it's one instruction extra the
    comparisons can be done in parallel, plus comparisons and bitwise `&` are
    incredibly cheap (technically `&&` should result in a branch and therefore be
    slower, and rustc does in fact emit a branch for that code, but after
    LLVM's optimizations it gets converted to a bitwise and). This is where most
    of our gains come from. The standard library's implementation doesn't unify
    the codepaths for upper- and lowercase letters like we do, and does a match on
    the entire unicode codepoint range.
    `x - A < B` with overflow. This is where most of our gains come from.
    * We use `| 32` to unify the codepaths for upper- and lowercase letters,
    reducing the number of comparisons we need to do.
    * We don't do `output + digit * mul` when `output == 0` and `mul == 1`. This
    seems to be consistently 1ns faster, but it's possible that this doesn't make
    a difference and the 1ns speedup I'm seeing is pure luck. I reran the
    @@ -915,18 +906,9 @@ correct behaviour. You'll notice some optimizations here:
    guarantees are useless because you need to drop down to unsafe to get any real
    speed" (I've seen this almost verbatim on Hacker News before) you can show
    them this.
    * I tried to unroll the loop in this function, but the logic became too
    convoluted and would have taken up more instructions. Usually you would only
    use unrolling if you can reduce the strength of operations (in the previous
    unrolling example, replacing division with comparison), which this does not,
    but it's worth trying optimizations just to see if you end up with faster
    code. Most code relies on running the same thing hundreds or thousands of
    times over, if you're already optimizing an oft-used function it's worth
    trying out different techniques and rerunning benchmarks, since you might
    stumble across a small improvement that leads to huge benefits overall.
    * We rely on const-folding for the comparison with `radix` to be optimized away,
    so we mark the function as `inline` in order to allow external crates to apply
    the same optimization.
    * We don't rely on const-folding in order to make our code fast, but it does run
    faster with a constant value for `radix`. Therefore, we add `#[inline]` to
    allow downstream crates to apply const-folding too.
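
To illustrate the overflow and `| 32` tricks from the bullets above, here is a deliberately simplified, hypothetical digit parser; it is not the actual `from_str_radix` reimplementation being described:

```rust
fn parse_digit(byte: u8, radix: u32) -> Option<u32> {
    // Range check via wrapping subtraction: anything below b'0' wraps to
    // a huge value, so a single unsigned comparison covers both bounds.
    let num = byte.wrapping_sub(b'0') as u32;
    // `| 32` maps 'A'..='Z' onto 'a'..='z', unifying the two codepaths.
    let alpha = (byte | 32).wrapping_sub(b'a') as u32;

    let digit = if num < 10 {
        num
    } else if alpha < 26 {
        alpha + 10
    } else {
        return None;
    };

    if digit < radix { Some(digit) } else { None }
}

fn main() {
    assert_eq!(parse_digit(b'7', 10), Some(7));
    assert_eq!(parse_digit(b'F', 16), Some(15));
    assert_eq!(parse_digit(b'f', 10), None);
}
```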

    The method I used for this is the basic method you should use for any
    optimization work: write a representative benchmark and then progressively tweak
  18. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions Rust Optimization.md
    @@ -450,8 +450,9 @@ having to mangle your code and pissing of future-you.

    Having said that, if you know that an array is likely to be a multiple of N
    size, try making it a `&[[T; N]]` and operating on a `[T; N]` in each iteration.
    This reduces the number of iterations you need to do in the loop and allows the
    compiler to operate more aggressively on the loop body.
    This reduces the number of iterations (and therefore, the number of times you
    need to recalculate the loop variables) and allows the compiler to operate more
    aggressively on the loop body.

    You can also use more classical loop unrolling if it allows you to reduce the
    "strength" of your operations. This means that if you have to calculate some
  19. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions Rust Optimization.md
    @@ -1068,8 +1068,8 @@ from which the `from_str_radix` and `log_base` code was adapted. A lot of the
    points in this article are expansions upon points behind one of those two links.

    I hope that whether you're a soft-shell Rustacean or a grizzled veteran, this
    has given you a better sense of when some code smells, from a performance
    perpective. Go make things go vroom.
    has given you a better sense of when some code may be poorly performing, and
    what to do about it. Go make things go vroom.

    [ripgrep article]: http://blog.burntsushi.net/ripgrep/
    [fastware]: https://www.youtube.com/watch?v=o4-CwDo2zpg
  20. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 4 additions and 5 deletions.
    9 changes: 4 additions & 5 deletions Rust Optimization.md
    @@ -1,4 +1,4 @@
    # Rust Optimization Tips
    # Achieving warp speed with Rust

    If you're looking to write fast code in Rust, good news! Rust makes it really
    easy to write really fast code. The focus on zero-cost abstractions, the
    @@ -147,10 +147,9 @@ on the heap.
    This means that `String`, `Vec`, `HashMap` and `Box<Trait>`/`Box<[T]>` all
    allocate, but any user-defined struct does not (it may contain something that
    _does_ allocate, but it doesn't require any extra allocation to construct if
    you already have the allocated value to put inside of it). `Box<T>` where `T`
    has a statically-known size also allocates, but boxing statically-sized types is
    very rare. The only use-case I can think of is sending huge objects between
    threads.
    you already have an instance of the allocating type). `Box<T>` where `T` has a
    statically-known size also allocates, but boxing statically-sized types is very
    rare. The only use-case I can think of is sending huge objects between threads.
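
As a quick illustration of that rule of thumb (my own sketch, not from the article): a plain struct adds no allocation of its own, even when it stores something that already allocated.

```rust
// `Config` itself is just a fixed-size value; the only heap allocation
// involved is the `String` buffer the caller already created.
struct Config {
    name: String, // heap-allocated buffer, length unknown at compile time
    retries: u32, // plain value stored inline in the struct
}

fn make_config(name: String) -> Config {
    // Moving the existing `String` in: no new allocation happens here.
    Config { name, retries: 3 }
}

fn main() {
    let cfg = make_config(String::from("primary"));
    assert_eq!(cfg.retries, 3);
    assert_eq!(cfg.name, "primary");
}
```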

    This is why you may have heard some complaints about Haskell's use of a linked
    list of characters to represent a string. I'm not going to beat [gankro's
  21. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 11 additions and 3 deletions.
    14 changes: 11 additions & 3 deletions Rust Optimization.md
    @@ -1,4 +1,4 @@
    # Blazing speed without segfaults in Rust
    # Rust Optimization Tips

    If you're looking to write fast code in Rust, good news! Rust makes it really
    easy to write really fast code. The focus on zero-cost abstractions, the
    @@ -85,7 +85,10 @@ have `O(n²)` complexity that won't appear until you've tried it on a larger
    dataset. If you don't know what your algorithm is, which is likely since most
    code is written without a specific algorithm in mind, just try to have as few
    loops as possible and remember that every use of `collect` has to iterate over
    the entire collection.
    the entire collection at least once, and so the more work you can do using less
    loops the better. This is the same as optimization in any language though, so
    this is all I'll say on algorithmic complexity for now. If you want to find out
    more, there are some excellent resources out there.
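
For a concrete (hypothetical) example of doing more per loop instead of reaching for `collect`: an average computed in a single pass, rather than collecting into a `Vec` and then iterating again.

```rust
// Fold into a (count, total) pair so one traversal does both jobs.
fn average(values: impl Iterator<Item = f64>) -> Option<f64> {
    let (count, total) = values.fold((0u64, 0.0), |(n, sum), x| (n + 1, sum + x));
    if count == 0 {
        None
    } else {
        Some(total / count as f64)
    }
}

fn main() {
    let data = vec![1.0, 2.0, 3.0, 4.0];
    // The iterator is consumed exactly once; nothing is collected.
    assert_eq!(average(data.into_iter()), Some(2.5));
}
```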

    ## CPU architecture primer

    @@ -109,7 +112,12 @@ in increasingly small, increasingly fast caches. If it tries to access data that
    isn't in the smallest cache, it has to read the slightly larger cache,
    continuing up until it reaches RAM. The upshot is: you want to keep your data as
    small as possible, and for data that is accessed together to be close to each
    other so the CPU loads as much of it at once as possible.
    other so the CPU loads as much of it at once as possible. This should be enough
    information to get you through the rest of this article, but if you want to dive
    deeper into it you can check out the [structure and implementation section in
    the Wikipedia page for the CPU][cpu structure].

    [cpu structure]: https://en.wikipedia.org/wiki/Central_processing_unit#Structure_and_implementation

    ## Keep as much as possible in cache

  22. @jFransham jFransham renamed this gist May 15, 2017. 1 changed file with 45 additions and 42 deletions.
    87 changes: 45 additions & 42 deletions rust-opt-draft-2.md → Rust Optimization.md
    @@ -1,4 +1,4 @@
    # Rust Optimization Tips
    # Blazing speed without segfaults in Rust

    If you're looking to write fast code in Rust, good news! Rust makes it really
    easy to write really fast code. The focus on zero-cost abstractions, the
    @@ -234,7 +234,7 @@ case.
    ## Keep as much as possible in registers

    Now, the absolute best place for your data - registers. The more work you can do
    without non-local writes the more that `rustc` and LLVM can assume about your
    without non-local writes the more that rustc and LLVM can assume about your
    data's access patterns. This is good because it means that data can be mapped to
    the CPU's physical registers, which are the fastest memory on your entire
    computer, but even better, if you make your data suitable for registers then
    @@ -249,37 +249,37 @@ of C and C-family languages. Even if they did implement relaxations on the
    reordering rules, however, storing data in registers will still be easier to
    optimize.

    Essentially, if you can make your calculations pure, then do so. Writing to
    pointers is worse than writing to local variables. As much as possible, you
    should try to constrain mutable writes to the data that you have ownership over,
    rather than mutable references. So a mutable loop counter is fine, but passing a
    mutable reference to a loop counter through multiple layers of functions is not
    (unless they end up getting inlined, of course). This is really just an
    extension of one of my first points: clean, boring code is easier to optimize
    than spaghetti.
So how do you get rustc to allocate things to registers? Essentially, the fewer
    pointers you have to write at runtime the better. Writing to local variables
    is better than writing through a mutable pointer. As much as possible, you
    should try to constrain mutable writes to the data that you have ownership over.
    So a mutable loop counter is fine, but passing a mutable reference to a loop
    counter through multiple layers of functions is not (unless they end up getting
    inlined, of course). This is really just an extension of one of my first points:
    clean, boring code is easier to optimize than spaghetti.
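
A small sketch of that point (illustrative only, not from the article): the second version keeps its accumulator in a local, which the optimizer can keep in a register, instead of threading a `&mut` through the call.

```rust
// Writes its result through a mutable reference it was handed.
fn count_evens_into(data: &[u32], counter: &mut usize) {
    for &x in data {
        if x % 2 == 0 {
            *counter += 1; // non-local write through a pointer
        }
    }
}

// Owns its accumulator and returns it: easier for rustc/LLVM to keep the
// counter in a register for the whole loop.
fn count_evens(data: &[u32]) -> usize {
    let mut counter = 0;
    for &x in data {
        if x % 2 == 0 {
            counter += 1;
        }
    }
    counter
}

fn main() {
    let data = [1u32, 2, 3, 4, 5, 6];
    let mut n = 0;
    count_evens_into(&data, &mut n);
    assert_eq!(n, count_evens(&data));
}
```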

    ## Use `&mut Trait` over `Box<Trait>`
    ## Avoid `Box<Trait>`

    The canonical way to create trait objects is `Box<Trait>`, but the majority of
    code can get away with `&mut Trait`, which saves an allocation. If you
    absolutely need ownership then use a `Box`, but most use-cases can use an
    `&Trait` or `&mut Trait`. If possible, though, the best thing to do in these
    cases is to avoid using a trait object all together. `impl Trait` is the obvious
    way to avoid them, but that doesn't allow dynamic dispatch since it's basically
    type inference in a fancy hat. A good trick is, if you want to allow a variable
    but finite number of implementors of a type because you want to choose between
    them or iterate over them, use either a tuple or a recursive generic struct like
    this:
    code can get away with `&mut Trait`, which also has dynamic dispatch but saves
    an allocation. If you absolutely need ownership then use a `Box`, but most
    use-cases can use an `&Trait` or `&mut Trait`. Even better is to avoid using a
trait object altogether. `impl Trait` is the obvious way to avoid them, but
that doesn't allow you to store a heterogeneous collection of elements that
implement a single trait, since it's basically type inference in a fancy hat. A
good trick, when you want to allow a variable but finite number of
implementors of a trait so that you can choose between them or iterate over
them, is to use either a tuple or a recursive generic struct like this:

    ```rust
    struct SomeStruct<Head, Tail>(Head, Tail);
    struct Cons<Head, Tail>(Head, Tail);
    ```

    Since data structures in Rust don't add any indirection or space overhead, you
    can implement a trait for this structure recursively and have something that
    would be as fast as a representation that could only take a fixed number of
    items. Here's an example of how this could look for a function that takes a list
    of functions and calls them:
    can implement a trait for this structure recursively and have a function that
    can take any number of parameters that runs as fast as an equivalent function
    that takes a fixed number of parameters. Here's an example of how this could
    look for a function that takes a list of functions and calls them:

    Allocating version:

    @@ -294,11 +294,11 @@ fn call_all_fns(fns: Vec<Box<FnBox() -> ()>>) {
    Allocation-free version:

    ```rust
    struct And<First, Second>(First, Second);
    struct Cons<First, Second>(First, Second);

    trait HCons: Sized {
    fn cons<T>(self, other: T) -> And<Self, T> {
    And(self, other)
    fn cons<T>(self, other: T) -> Cons<Self, T> {
    Cons(self, other)
    }
    }

    @@ -314,7 +314,7 @@ impl<F: Fn() -> ()> Callable for F {
    fn call(self) { self() }
    }

    impl<First: Callable, Second: Callable> Callable for And<First, Second> {
    impl<First: Callable, Second: Callable> Callable for Cons<First, Second> {
    fn call(self) {
    self.0.call();
    self.1.call();
    @@ -344,14 +344,15 @@ fn main() {

    The functions passed to `call_all_fns_no_alloc` are eligible for inlining, they
    require no space overhead, and their instructions and data are directly next to
    each other in memory and are much faster to access than if each of them were
    boxed. For example, in `combine` there's a `choice` function that takes an array
    that could contain trait objects, but it also supplies a `.or()` combinator (and
    a `choice!` macro that expands to recursive `.or` calls) that returns an `Or<A,
    B>` that in turn implements `Parser`. This means that dispatch is static and the
    objects are all stored in order in memory (because it's just a set of recursive
    structs). You will still need dynamic dispatch for some cases, but using this
    method means that the number of cases where this is necessary is very small.
    each other in memory and are therefore much faster to access than if each of
    them were boxed. For example, in `combine` there's a `choice` function that
    takes an array that could contain trait objects, but it also supplies a `.or()`
    combinator (and a `choice!` macro that expands to recursive `.or` calls) that
    returns an `Or<A, B>` that in turn implements `Parser`. This means that dispatch
    is static and the objects are all stored in order in memory (because it's just a
    set of recursive structs). You will still need dynamic dispatch for some cases,
    but using this method means that the number of cases where this is necessary is
    very small.

    ## Use stack-based variable-length datatypes

    @@ -516,10 +517,10 @@ fn bench_log_base_unrolled(b: &mut Bencher) {
    }
    ```

    `test::black_box` a magic function that prevents `rustc` and LLVM calculating
    `test::black_box` a magic function that prevents rustc and LLVM calculating
    those function calls at compile-time and converting them into a constant, which
    usually they would (actually, it's not magic, it's just some inline assembly
    that doesn't do anything, since neither `rustc` nor LLVM will try to optimize
    that doesn't do anything, since neither rustc nor LLVM will try to optimize
    anything that's been accessed by inline assembly).

    This gives the following results:
    @@ -615,10 +616,12 @@ Let's check out the results now:

    ```
    test bench_log_base ... bench: 18 ns/iter (+/- 0)
    test bench_log_base_unrolled ... bench: 5 ns/iter (+/- 0)
    test bench_log_base_increasing ... bench: 6 ns/iter (+/- 0)
    test bench_log_base_nonconstbase ... bench: 199 ns/iter (+/- 5)
    test bench_log_base_unrolled ... bench: 5 ns/iter (+/- 0)
    test bench_log_base_unrolled_nonconstbase ... bench: 37 ns/iter (+/- 1)
    test bench_log_base_increasing ... bench: 6 ns/iter (+/- 0)
    test bench_log_base_increasing_nonconstbase ... bench: 8 ns/iter (+/- 1)
    ```
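
For reference, a stand-alone benchmark using `test::black_box` looks roughly like this (a hypothetical sketch with a stand-in function, requiring a nightly toolchain for the `test` feature):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Stand-in function to measure; not the article's `log_base`.
fn digits(mut n: u64) -> u32 {
    let mut count = 1;
    while n >= 10 {
        n /= 10;
        count += 1;
    }
    count
}

#[bench]
fn bench_digits(b: &mut Bencher) {
    b.iter(|| {
        // `black_box` on the input stops constant folding of the call,
        // and on the output stops the whole computation being discarded.
        black_box(digits(black_box(981_234_567)))
    });
}
```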

    @@ -886,7 +889,7 @@ correct behaviour. You'll notice some optimizations here:
    `x - A < B` with overflow, since even though it's one instruction extra the
    comparisons can be done in parallel, plus comparisons and bitwise `&` are
    incredibly cheap (technically `&&` should result in a branch and therefore be
    slower, and `rustc` does in fact emit a branch for that code, but after
    slower, and rustc does in fact emit a branch for that code, but after
    LLVM's optimizations it gets converted to a bitwise and). This is where most
    of our gains come from. The standard library's implementation doesn't unify
    the codepaths for upper- and lowercase letters like we do, and does a match on
  23. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 160 additions and 156 deletions.
    316 changes: 160 additions & 156 deletions rust-opt-draft-2.md
    @@ -83,39 +83,9 @@ performance problems it's more likely to be due to poor algorithms than to poor
    implementations. Most programmers test their code on small datasets, but if you
    have `O(n²)` complexity that won't appear until you've tried it on a larger
    dataset. If you don't know what your algorithm is, which is likely since most
    code is written without a specific algorithm in mind, there is at least one tip
    I can give to improve the complexity of the code as written.

    ## `collect` is a code smell

    Every use of `collect` is a new intermediate container created, and unless
    you're using a variable-length container that can be stored on the stack, it's
    one or more allocations too (if you don't know what that means or why it's
    important, we'll get to it later). Now with Rust's new `impl Trait` syntax you
    can write functions returning `impl Iterator<Item=MyThing>`, so you can separate
    your code into functions without affecting the complexity. [`itertools`][
    itertools] is a great tool to do more of your work lazily.

    This is especially true for reading data from a file or over a network - using
    iterators allows you to do work in constant space and a minimum of `O(n)` time,
    using `collect` forces you to allocate the whole file in memory and then iterate
    over it again, using `O(n)` space and a minimum of `O(n²)` time.

    One trick to help you doing this is collecting values into a tuple. For example,
    if you need to average a collection, sum the length and the total into a
    `(length, total)` tuple instead of iterating twice (unless you have a concrete
    type like a slice or a vector for which getting the length is free). If you need
    to iterate over multiple collections in sequence, you can use `Iterator::zip` or
    `Itertools::zip_longest`. The former will produce a sequence the length of its
    shortest argument, the latter will produce a sequence the length of its longest
    argument (using `Option<T>` to represent when one of the sequences is finished).

    This may sound familiar or obvious to you, since this is just a form of [loop
    jamming][loop jamming] but leveraging Rust's better tools for abstraction
    compared to languages like C.

    [loop jamming]: https://en.wikipedia.org/wiki/Loop_fusion
    [itertools]: https://github.com/bluss/rust-itertools
    code is written without a specific algorithm in mind, just try to have as few
    loops as possible and remember that every use of `collect` has to iterate over
    the entire collection.

    ## CPU architecture primer

    @@ -131,30 +101,15 @@ it on. The instructions, and the data.
    Instructions are stored in the instruction cache - a chunk of really, really
    fast memory that's directly readable by the CPU. Each instruction can put and/or
    take data from the CPU's registers, which is a small number of small pieces of
    memory, either 32 or 64 bits depending on your computer's word
    size<a name="floats-back"></a>[<sup>note: floats</sup>](#floats). Only a small
    amount of data can be in registers at any one time, however, and you can't take
    a pointer to a register, so sometimes the CPU must access the computer's RAM.
    Since RAM is slow, the CPU tries to read in bulk and then store the result in
    increasingly small, increasingly fast caches. If it tries to access data that
    memory, either 32 or 64 bits depending on your computer's word size. Only a
    small amount of data can be in registers at any one time, however, and you can't
    take a pointer to a register, so sometimes the CPU must access the computer's
    RAM. Since RAM is slow, the CPU tries to read in bulk and then store the result
    in increasingly small, increasingly fast caches. If it tries to access data that
    isn't in the smallest cache, it has to read the slightly larger cache,
    continuing up until it reaches RAM<a name="swap-space-back"></a>[<sup>note: swap
    space</sup>](#swap-space). The upshot is: you want to keep your data as small as
    possible, and for data that is accessed together to be close to each other so
    the CPU loads as much of it at once as possible.

    <a name="floats"></a>
    [Note: floats](#floats-back) - The CPU has a separate set of registers purely
    for acting on floating point numbers too, but that doesn't really affect how you
    code so I'm only mentioning it for completeness.

    <a name="swap-space"></a>
    [Note: swap space](#swap-space-back) - After RAM you have another layer of
    dynamic memory called swap space. This is part of your hard drive which acts as
    RAM in an emergency. In an ideal world you should have this disabled but
    unfortunately you probably have programs on your computer that don't or can't
    respect the amount of RAM you have. Memory allocations failing is still a
    problem we haven't solved as a community.
    continuing up until it reaches RAM. The upshot is: you want to keep your data as
    small as possible, and for data that is accessed together to be close to each
    other so the CPU loads as much of it at once as possible.

    ## Keep as much as possible in cache

    @@ -170,12 +125,11 @@ heap is significantly slower, since it's much less likely they they're directly
    next to each other. We'll go into exactly why this is in a moment.

    If you have to allocate, because you need variable-size containers, shared
    ownership or owned trait objects<a
    name="you-dont-need-trait-objects-back"></a>
    [<sup>note: trait objects](#you-dont-need-trait-objects), try to put data that
    will be accessed in sequence in order in RAM, so that when the CPU reads one
    element it necessarily has to read the next few elements too, meaning it doesn't
    need to stall waiting for RAM in order to operate on them.
    ownership or owned trait objects (see below for why you probably don't need
    trait objects), try to put data that will be accessed in
    sequence in order in RAM, so that when the CPU reads one element it necessarily
    has to read the next few elements too, meaning it doesn't need to stall waiting
    for RAM in order to operate on them.

    As a rule of thumb for whether something has to allocate: if you can tell me the
    amount of space the value will use up without running the program, it's stored
    @@ -275,11 +229,6 @@ runtime. The 4x4 array in the previous example would be `[[f32; 4]; 4]` and take
    up 64 bytes, meaning that it would only take 2 cache misses to load in the worst
    case.

    <a name="you-dont-need-trait-objects"></a>
    [Note: trait objects](#you-dont-need-trait-objects-back) - See below for why you
    probably don't need owned trait objects plus a way to have better cache locality
    for variable-size containers

    [linked lists]: http://cglab.ca/~abeinges/blah/too-many-lists/book/#an-obligatory-public-service-announcement

    ## Keep as much as possible in registers
Fixed-length datatypes are trivially storable on the stack, but for
dynamically-sized data it's not so simple. However, [`smallvec`][small1],
[`smallstring`][small2] and [`tendril`][small3] are all variable-length
datatypes that allow you to store small numbers of elements on the
stack (shameless plug: `smallstring` was written by me). Due to the law of small
numbers, you are very likely to have more of these small strings than larger
ones. This is good because it reduces allocation, but it's _great_ if you're
storing these in a `Vec` or `HashMap`, since you will have less indirection and
therefore better cache use. A good rule of thumb is to never have more than one
layer of pointers to dereference before you reach your value (NASA enforces this
rule in their C code, albeit for reliability and not performance).
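
A minimal sketch of how this looks with the `smallvec` crate (assuming its
current API; the inline capacity is the array type parameter):

```rust
use smallvec::{smallvec, SmallVec};

fn main() {
    // Up to four elements live inline; the fifth push spills to the heap.
    let mut buf: SmallVec<[u8; 4]> = smallvec![1, 2, 3, 4];
    assert!(!buf.spilled());
    buf.push(5);
    assert!(buf.spilled());
}
```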

    Libraries like `smallvec` are great for cache locality, since an array of
    `SmallVec<[T; 4]>` will have exactly the same cache-locality as an array of just
often countered by "`malloc` is fast". Both statements are true - the actual
    process of allocating and deallocating memory is fast, but data structures that
    allocate are worse for use-cases that require maximum speed.

    <a name="shameless-plug"></a>
    [Note: shameless plug alert](#shameless-plug-back) - I wrote `smallstring`, but
    although it's similar to `tendril`, it's not a strict subset. It's smaller and
    more flexible with regards to the number of bytes it can store on the stack
    before requiring an allocation.

    [small1]: https://github.com/servo/rust-smallvec
    [small2]: https://github.com/jFransham/smallstring
    [small3]: https://github.com/servo/tendril

## Loop unrolling is still cool

[Duff's device][duff] is fun, but array-length-generic unrolled loops are
unlikely to be faster than the equivalent optimized naïve code nowadays, since
the optimizer can usually do that transformation for you, which saves you from
mangling your code and pissing off future-you.

Having said that, if you know that an array is likely to be a multiple of N
size, try making it a `&[[T; N]]` and operating on a `[T; N]` in each iteration.
This reduces the number of iterations you need to do in the loop and allows the
compiler to operate more aggressively on the loop body.
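
For example, here's a minimal sketch (mine, using only the standard library) of
processing a slice in blocks of 4 and mopping up whatever is left over:

```rust
use std::convert::TryInto;

fn sum_in_blocks(data: &[f32]) -> f32 {
    let mut blocks = data.chunks_exact(4);
    let mut acc = [0.0f32; 4];

    for chunk in blocks.by_ref() {
        // The block is known to be exactly 4 elements long, so the compiler
        // can keep it in registers and drop per-element bounds checks.
        let block: [f32; 4] = chunk.try_into().unwrap();
        for i in 0..4 {
            acc[i] += block[i];
        }
    }

    acc.iter().sum::<f32>() + blocks.remainder().iter().sum::<f32>()
}
```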

You can also use more classical loop unrolling if it allows you to reduce the
"strength" of your operations. This means that if you have to calculate some
value for each iteration of the loop and calculating this value takes longer
than the body itself, manually unroll the body so you can calculate it less.
Example: you can implement an integer logarithm function like so:

```rust
fn log_base(mut n: usize, base: usize) -> usize {
    // Naïve version: divide by the base until it no longer fits into `n`.
    let mut out = 1;
    while n >= base {
        n /= base;
        out += 1;
    }
    out
}
```

However, `n /= base; out += 1;` is slower to calculate than `n < base`. To take
advantage of this fact, you can unroll the loop like so:

```rust
fn log_base_unrolled(mut n: usize, base: usize) -> usize {
    const UNROLL_COUNT: usize = 4;

    // We use a fixed-size array to ensure that we don't get the array count and
    // the `out` skip value out of sync.
    let premultiplied_base: [_; UNROLL_COUNT] = [
        base,
        base * base,
        base * base * base,
        base * base * base * base,
    ];

    let mut out = 1;

    loop {
        // Four cheap comparisons for every (comparatively expensive) division.
        if n < premultiplied_base[0] { return out; }
        if n < premultiplied_base[1] { return out + 1; }
        if n < premultiplied_base[2] { return out + 2; }
        if n < premultiplied_base[3] { return out + 3; }

        n /= premultiplied_base[UNROLL_COUNT - 1];
        out += UNROLL_COUNT;
    }
}
```
This is inelegant and could be better written with iterators, but from checking
benchmarks it seems that this form is significantly (about 3x) faster than a
version that replaces the hand-written set of `if` statements with a `for` loop
over `[n1, n2, n3, n4].iter().enumerate()`, and 3x faster than the naïve
implementation that doesn't unroll the loop at all (see the benchmarks below).
Besides, that's what the abstraction boundary is for - to put a pretty dress on
this disfigured wretch so that it's acceptable to be seen in public. This is by
far the best kind of optimization: one that can be constrained to a single
function, so callers don't need to know anything about its implementation.

A small note about the iterator benchmarks above: you should always use
iterators for variable-length data, but for fixed-length data they seem to mess
with enregistering, so you should use indexing where possible. The compiler
will tell you if you index out-of-bounds (although by default it just throws a
warning rather than an error; I don't know why).

    <a name="benchmark"></a>
    [Note: benchmarks](#benchmark-back) - The benchmarking functions I used are
    here:
    Here are the benchmarks I used:

```rust
#[bench]
fn bench_log_base(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);

        assert_eq!(log_base(input, 10), 16);
    });
}

#[bench]
fn bench_log_base_unrolled(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);

        assert_eq!(log_base_unrolled(input, 10), 16);
    });
}
```

anything that's been accessed by inline assembly).

    This gives the following results:

```
test bench_log_base ... bench: 18 ns/iter (+/- 0)
test bench_log_base_unrolled ... bench: 5 ns/iter (+/- 0)
```

    Wait a minute, though, what happens when we give a non-constant value for
    `base`?

```rust
#[bench]
fn bench_log_base_nonconstbase(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);
        let base = black_box(10);

        assert_eq!(log_base(input, base), 16);
    });
}

#[bench]
fn bench_log_base_unrolled_nonconstbase(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);
        let base = black_box(10);

        assert_eq!(log_base_unrolled(input, base), 16);
    });
}
```

    ```
    test bench_log_base_unrolled_nonconstbase ... bench: 37 ns/iter (+/- 1)
    test bench_log_base_nonconstbase ... bench: 199 ns/iter (+/- 5)
    ```

    They're both much slower! Can we do better? Turns out yes, we can:

```rust
fn log_base_increasing(n: usize, base: usize) -> usize {
    const UNROLL_COUNT: usize = 4;

    let premultiplied_base: [_; UNROLL_COUNT] = [
        base,
        base * base,
        base * base * base,
        base * base * base * base,
    ];

    if n < premultiplied_base[0] { return 1; }
    if n < premultiplied_base[1] { return 2; }
    if n < premultiplied_base[2] { return 3; }
    if n < premultiplied_base[3] { return 4; }

    let mut out = UNROLL_COUNT + 1;
    let mut mul = premultiplied_base[UNROLL_COUNT - 1];

    loop {
        if n < premultiplied_base[0] * mul { return out; }
        if n < premultiplied_base[1] * mul { return out + 1; }
        if n < premultiplied_base[2] * mul { return out + 2; }
        if n < premultiplied_base[3] * mul { return out + 3; }

        mul *= premultiplied_base[UNROLL_COUNT - 1];
        out += UNROLL_COUNT;
    }
}

#[bench]
fn bench_log_base_increasing(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);

        assert_eq!(log_base_increasing(input, 10), 16);
    });
}

#[bench]
fn bench_log_base_increasing_nonconstbase(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);
        let base = black_box(10);

        assert_eq!(log_base_increasing(input, base), 16);
    });
}
```

    Let's check out the results now:

    ```
    test bench_log_base ... bench: 18 ns/iter (+/- 0)
    test bench_log_base_unrolled ... bench: 5 ns/iter (+/- 0)
    test bench_log_base_increasing ... bench: 6 ns/iter (+/- 0)
    test bench_log_base_nonconstbase ... bench: 199 ns/iter (+/- 5)
    test bench_log_base_unrolled_nonconstbase ... bench: 37 ns/iter (+/- 1)
    test bench_log_base_increasing_nonconstbase ... bench: 8 ns/iter (+/- 1)
    ```

    Turns out the compiler was doing something sneaky: it can optimize integer
    division by a constant [into a multiplication combined with a shift][division].
    When it could no longer fold the constant into the function it slowed down
    considerably. It's ok to rely on const-folding if it allows you to gain
    considerable speedups and you know that the function will usually be called with
constant arguments, but be careful. The things to look out for are `if`
statements and integer division, both of which can be much slower with
non-constant values than with constants.
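
To make that concrete, here's a tiny illustrative pair of functions (mine, not
part of the benchmarks above): the first divides by a constant and can be
strength-reduced to a multiply-and-shift, while the second divides by a runtime
value and has to issue a real division instruction.

```rust
// Divisor known at compile time: eligible for the multiply-and-shift trick.
fn div_by_constant(n: u64) -> u64 {
    n / 10
}

// Divisor only known at runtime: no such luck.
fn div_by_variable(n: u64, d: u64) -> u64 {
    n / d
}
```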

    The fastest method by far converts to an `f64`, calls `.log(base)` on that, and
    then converts back. It doesn't work for large numbers, however, because of loss
    of precision. This is probably a good time to note that although adding and
    multiplying integers is faster than doing the same for floats, for code that
    does division by a non-constant value or something more complex like
    trigonometry, you should definitely use floats. The compiler can't do the
    conversion for you - it won't apply optimizations that make your code less
    precise - but you can check for areas where this would be an improvement and
    make the change manually.
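
As a minimal sketch (not part of the original benchmarks), the float-based
version is a one-liner, with the precision caveat above applying to large
inputs:

```rust
fn log_base_float(n: usize, base: usize) -> usize {
    // f64 only has a 53-bit mantissa, and the internal division of logarithms
    // can land just below an integer boundary, so this drifts for large `n`.
    (n as f64).log(base as f64) as usize + 1
}
```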

    [duff]: https://en.wikipedia.org/wiki/Duff's_device
    [division]: http://embeddedgurus.com/stack-overflow/2009/06/division-of-integers-by-constants/
  24. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions rust-opt-draft-2.md
cases is to avoid using a trait object altogether. `impl Trait` is the obvious
way to avoid them, but that doesn't allow dynamic dispatch since it's basically
type inference in a fancy hat. A good trick is, if you want to allow a variable
but finite number of implementors of a type because you want to choose between
them or iterate over them, use either a tuple or a recursive generic struct like
this:

```rust
struct SomeStruct<Head, Tail>(Head, Tail);
    ```

    Since data structures in Rust don't add any indirection or space overhead, you
  25. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 14 additions and 15 deletions.
    29 changes: 14 additions & 15 deletions rust-opt-draft-2.md
fast memory that's directly readable by the CPU. Each instruction can put and/or
take data from the CPU's registers, which is a small number of small pieces of
memory, either 32 or 64 bits depending on your computer's word
size<a name="floats-back"></a>[<sup>note: floats</sup>](#floats). Only a small
amount of data can be in registers at any one time, however, and you can't take
a pointer to a register, so sometimes the CPU must access the computer's RAM.
Since RAM is slow, the CPU tries to read in bulk and then store the result in
increasingly small, increasingly fast caches. If it tries to access data that
isn't in the smallest cache, it has to read the slightly larger cache,
continuing up until it reaches RAM<a name="swap-space-back"></a>[<sup>note: swap
space</sup>](#swap-space). The upshot is: you want to keep your data as small as
possible, and for data that is accessed together to be close to each other so
the CPU loads as much of it at once as possible.

    <a name="floats"></a>
    [Note: floats](#floats-back) - The CPU has a separate set of registers purely
problem we haven't solved as a community.

## Keep as much as possible in cache

The further away your data is from your CPU the slower your program will be. The
very worst place for data to be is on a different computer. A less awful - but
still very awful - place for data to be is on your hard drive. Better still is
in your RAM but as mentioned before, RAM is slow. _Almost_ the best possible
place for your data is in CPU cache. You may have heard some folklore that
allocating is bad, and this is the main reason why. Accessing two different
locations one after another on the stack is fine, since they're likely to be on
the same cache line. Accessing two different locations one after another on the
heap is significantly slower, since it's much less likely that they're directly
next to each other. We'll go into exactly why this is in a moment.

    If you have to allocate, because you need variable-size containers, shared
    ownership or owned trait objects<a
  26. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 79 additions and 58 deletions.
    137 changes: 79 additions & 58 deletions rust-opt-draft-2.md
mutable reference to a loop counter through multiple layers of functions is
not. This is an extension of one of my first points: clean, boring code is
easier to optimize than spaghetti.

    ## Use a `HashSet` to deduplicate data

    If you read the Accidentally Quadratic blog you'll know the number of times the
problem was using linear search to deduplicate data. A `HashSet` has `O(1)`
amortized performance for querying whether or not it contains an item. This means
    that no matter how big the `HashSet` gets it will still take the same amount of
    time to check. With a `Vec` checking membership is `O(n)`, meaning that it takes
    50x longer to check membership for a 50-element vector than it does for a
    1-element vector. Even better, if you're only using deduplication as an
    optimization instead of for correctness (i.e. "I want to process as little data
    as possible so I'll only process each unique element once") you can use a
    `HashSet` with bounded memory usage in order to prevent the high memory usage
    that using a `HashSet` can cause. This is a valueless cache with a discard
    policy of ignoring new values after a certain load is reached. You could also
    use a valueless cache with a different discard policy, like LRU (least-recently
    used) or something even more complex. A simple `HashSet` can take you very far,
    however, and you should only resort to other forms of cache if you are running
    into unbounded memory use.
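
A sketch of that bounded "valueless cache" (illustrative code of my own; `CAP`
is an arbitrary number picked for the example):

```rust
use std::collections::HashSet;

const CAP: usize = 10_000;

fn process_unique(items: &[u64], mut process: impl FnMut(u64)) {
    let mut seen = HashSet::with_capacity(CAP);

    for &item in items {
        if seen.contains(&item) {
            continue;
        }
        // Only remember new items while under budget; past that point we may
        // process the occasional duplicate, which is fine because the
        // deduplication here is an optimization, not a correctness requirement.
        if seen.len() < CAP {
            seen.insert(item);
        }
        process(item);
    }
}
```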

    This is a smaller part of a wider point, which is an extension of the point
    about algorithms above - choose the right data structure. Think of the
    high-level operations you want to perform on the data structure (checking
    membership, insertion, deletion, appending, sorting, etc.) and then choose a
    structure with the best complexity over these operations. If you need to get the
    lowest value, use a priority queue/heap. If you need the lowest value _and_ you
    need the elements to be unique, use a `BTreeMap`/`BTreeSet`. As mentioned above,
    if you need fast membership querying use a `HashSet`.
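
For instance (a small sketch of my own, not from the original post):

```rust
use std::cmp::Reverse;
use std::collections::{BTreeSet, BinaryHeap};

fn main() {
    // BinaryHeap is a max-heap, so wrap values in `Reverse` to pop the lowest.
    let mut queue = BinaryHeap::new();
    queue.push(Reverse(3));
    queue.push(Reverse(1));
    queue.push(Reverse(2));
    assert_eq!(queue.pop(), Some(Reverse(1)));

    // BTreeSet keeps elements unique *and* ordered.
    let unique_sorted: BTreeSet<u32> = [3, 1, 3, 2].iter().copied().collect();
    assert_eq!(unique_sorted.iter().next(), Some(&1));
}
```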

    The standard library's `HashSet` allows you to choose the hashing algorithm too,
    so choose a fast hashing implementation like `FNV`, as provided by the
    [`rust-fnv`][fnv] crate. One good trick is that if your data is the same size as
    a `u64` (for example, pointers), you can simply transmute the data to a `u64`
    and use that as the hash. Don't do this blindly, though, because it opens you up
    to denial-of-service attacks (which the standard library's default hasher
    is designed to prevent) and may cause the `HashSet` to get an uneven load,
    which makes it as slow as using a `Vec`. As with everything, you should try it
    and run benchmarks to see if it makes a difference.
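
A sketch of plugging in FNV, assuming the `fnv` crate's `FnvBuildHasher` type
(check the crate docs for the exact names in the version you use):

```rust
use std::collections::HashSet;

use fnv::FnvBuildHasher;

type FastSet = HashSet<u64, FnvBuildHasher>;

fn main() {
    let mut set = FastSet::default();
    set.insert(42);
    assert!(set.contains(&42));
}
```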

    [fnv]: https://github.com/servo/rust-fnv

    ## Use `&mut Trait` over `Box<Trait>`

    The canonical way to create trait objects is `Box<Trait>`, but the majority of
cases is to avoid using a trait object altogether. `impl Trait` is the obvious
way to avoid them, but that doesn't allow dynamic dispatch since it's basically
type inference in a fancy hat. A good trick is, if you want to allow a variable
but finite number of implementors of a type because you want to choose between
them or iterate over them, use either a tuple or this generic-based
representation:

    ```rust
    struct HList<Head, Tail>(Head, Tail);
    ```

    Since data structures in Rust don't add any indirection or space overhead, you
    can implement a trait for this structure recursively and have something that
    would be as fast as a representation that could only take a fixed number of
    items. Here's an example of how this could look for a function that takes a list
    of functions and calls them:

    Allocating version:

    ```rust
fn call_all_fns(fns: Vec<Box<dyn FnOnce()>>) {
    // `Box<dyn FnOnce()>` is the modern spelling of the old `Box<FnBox()>`:
    // each closure sits in its own heap allocation and every call goes through
    // dynamic dispatch.
    for f in fns {
        f();
    }
}
    ```

    Allocation-free version:

    ```rust
    struct And<First, Second>(First, Second);

trait HCons: Sized {
    fn cons<T>(self, other: T) -> And<Self, T> {
        And(self, other)
    }
}

impl<T: Sized> HCons for T {}

// This is a hack to get around the fact that manually implementing the `Fn`
// traits is currently unstable.
trait Callable {
    fn call(self);
}

impl<F: Fn() -> ()> Callable for F {
    fn call(self) { self() }
}

impl<First: Callable, Second: Callable> Callable for And<First, Second> {
    fn call(self) {
        self.0.call();
        self.1.call();
    }
}

fn call_all_fns_no_alloc<T: Callable>(fns: T) {
    fns.call();
}
    ```

    Here's what they both look like in use:

    ```rust
fn main() {
    let first_fn = || { println!("Hello!"); };
    let second_fn = || { println!("World!"); };

    call_all_fns(vec![Box::new(first_fn), Box::new(second_fn)]);

    let first_fn = || { println!("Hello!"); };
    let second_fn = || { println!("World!"); };

    call_all_fns_no_alloc(first_fn.cons(second_fn));
}
    ```

    The functions passed to `call_all_fns_no_alloc` are eligible for inlining, they
    require no space overhead, and their instructions and data are directly next to
    each other in memory and are much faster to access than if each of them were
    boxed. For example, in `combine` there's a `choice` function that takes an array
    that could contain trait objects, but it also supplies a `.or()` combinator (and
    a `choice!` macro that expands to recursive `.or` calls) that returns an `Or<A,
    B>` that in turn implements `Parser`. This means that dispatch is static and the
    objects are all stored in order in memory (because it's just a set of recursive
    structs). You will still need dynamic dispatch for some cases, but using this
    method means that the number of cases where this is necessary is very small.

    ## Use stack-based variable-length datatypes

  27. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 8 additions and 8 deletions.
    16 changes: 8 additions & 8 deletions rust-opt-draft-2.md
Fixed-length datatypes are trivially storable on the stack, but for
dynamically-sized data it's not so simple. However, [`smallvec`][small1],
[`smallstring`][small2] and [`tendril`][small3] are all variable-length
datatypes that allow you to store small numbers of elements on the
stack<a name="shamesless-plug-back"></a>[<sup>note: shameless plug</sup>](
#shameless-plug). Due to the law of small numbers, you are very likely to have
more of these small strings than larger ones. This is good because it reduces
allocation, but it's _great_ if you're storing these in a `Vec` or `HashMap`,
since you will have less indirection and therefore better cache use. A good rule
of thumb is to never have more than one layer of pointers to dereference before
you reach your value (NASA enforces this rule in their C code, albeit for
reliability and not performance).

    Libraries like `smallvec` are great for cache locality, since an array of
    `SmallVec<[T; 4]>` will have exactly the same cache-locality as an array of just
  28. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion rust-opt-draft-2.md
end with a case study from the Rust standard library.

This post assumes decent familiarity with programming, a beginner's familiarity
with Rust and almost no familiarity with CPU architecture.

## Number one optimization tip: don't

    Ok, I'll start with a few disclaimers before I get into the meat. Firstly,
    unless you're running into performance problems in real-life usage, optimize
  29. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions rust-opt-draft-2.md
compilers are really good at working out when a function would benefit from
being inlined, and Rust isn't constrained to the slower standardized C calling
convention and can use `fastcc`, making function calls extremely cheap. You're
more likely to cause the size of your executable to bloat. This takes up more
space on your hard drive, of course, but that's not too much of a problem. If
you have even a single bundled asset, like an image or audio file, it will
likely dwarf the size of your executable.

    The real issue here is that it can make your program no longer fit in the CPU's
    instruction cache. The CPU will only have to go to RAM for its instructions when
  30. @jFransham jFransham revised this gist May 15, 2017. 1 changed file with 20 additions and 15 deletions.
    35 changes: 20 additions & 15 deletions rust-opt-draft-2.md

#[inline]
fn from_str_radix(input: &str, radix: usize) -> Result<usize, ParseIntError> {
    #[inline]
    fn to_digit_ascii(ascii: u8, radix: usize) -> Result<usize, ParseIntError> {
        let decimal_digit = ascii.wrapping_sub(b'0');

        if radix > 10 && decimal_digit > 9 {
            let out = (ascii | 32).wrapping_sub(b'a') as usize;

            if out > radix - 10 {
                Err(ParseIntError { kind: IntErrorKind::InvalidDigit })
            } else {
                Ok(out + 10)
            }
        } else {
            let decimal_digit = decimal_digit as usize;
            if decimal_digit > radix {
                Err(ParseIntError { kind: IntErrorKind::InvalidDigit })
            } else {
                Ok(decimal_digit)
            }
        }
    }

The method I used for this is the basic method you should use for any
optimization work: write a representative benchmark and then progressively tweak
and rerun benchmarks until you can't shave off any more cycles. Doing this for
pure functions is much easier, so one of the first things you should do to
optimize any function that's called in a tight loop is to make it pure. This
avoids indirect writes and reads (reads and writes to places in memory that are
likely to be outside cache lines) and makes benchmarking much, much easier. If
you use test-driven development for reliability, this is the equivalent for
performance.

    Extending this to work on signed integer types is an exercise for the reader.
    Tip: unlike C, you can rely on signed integers overflowing with 2's complement