
@mrange
Last active July 18, 2024 07:24

Created Jun 23, 2018.
    # On the topic of tail calls in .NET

    Let's say you have implemented a small data pipeline library to replace LINQ.

```fsharp
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then (r v))

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> (r (m v)))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
```

Then you compare the overhead of the pipeline against LINQ:

```fsharp
let trivialTest n =
  TrivialStream.range 0 1 n
  |> TrivialStream.map    (fun v -> int64 v)
  |> TrivialStream.filter (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map    (fun v -> v + 1L)
  |> TrivialStream.sum

let linqTest n =
  Enumerable
    .Range(0, n + 1)
    .Select(fun v -> int64 v)
    .Where(fun v -> v &&& 1L = 0L)
    .Select(fun v -> v + 1L)
    .Sum()
```

    You are very satisfied to see that your trivial push data pipeline has significantly lower overhead than LINQ.

    ```
    Running LINQ with total=100000000, outer=10, inner=10000000 ...
    ... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    Running trivialpush with total=100000000, outer=10, inner=10000000 ...
    ... 375 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    ```

    Almost 6x lower overhead!
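
The benchmark harness itself isn't shown. As a hedged sketch (my own, not the author's harness; the helper name `time` is an assumption), measurements like the ones quoted above can be driven by timing `outer` invocations of the pipeline and sampling the GC collection counts before and after:

```fsharp
// A sketch of a benchmark driver (not the author's actual harness).
// It times `outer` invocations of `f` and reports elapsed milliseconds
// plus the number of gen0/gen1/gen2 GC collections that occurred.
let time outer (f : unit -> 'T) =
  // Start from a clean slate so earlier allocations don't skew the counts
  System.GC.Collect (2, System.GCCollectionMode.Forced, true)
  let cc (g : int) = System.GC.CollectionCount g
  let bcc0, bcc1, bcc2 = cc 0, cc 1, cc 2
  let sw = System.Diagnostics.Stopwatch.StartNew ()
  let mutable result = f ()
  for _ in 2..outer do result <- f ()
  sw.Stop ()
  result, sw.ElapsedMilliseconds, cc 0 - bcc0, cc 1 - bcc1, cc 2 - bcc2
```

Something like `time 10 (fun () -> trivialTest 10000000)` would then match the `outer=10, inner=10000000` setup in the runs above.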

    ## Paul Westcott ruins our day

Unfortunately you run into [Paul Westcott](https://github.com/manofstick) who asks what happens if you run the benchmark in `x86` mode rather than `x64` mode.

    ```
    Running LINQ with total=100000000, outer=10, inner=10000000 ...
    ... 1773 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    Running trivialpush with total=100000000, outer=10, inner=10000000 ...
    ... 3116 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    ```

Now the tables are turned; your beautiful little library is actually slower than LINQ! Paul then says: _"Just apply some magic oil!"_ and suggests you change your code to:

```fsharp
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    let inline paulWestcottsMagicOil u = match u with () -> ()

    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then paulWestcottsMagicOil (r v))

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> paulWestcottsMagicOil (r (m v)))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
```

The function `let inline paulWestcottsMagicOil u = match u with () -> ()` does nothing, but to your disbelief it actually helps!

    ```
    Running LINQ with total=100000000, outer=10, inner=10000000 ...
    ... 1770 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    Running trivialpush with total=100000000, outer=10, inner=10000000 ...
    ... 561 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    ```

Performance in `x64` mode isn't as good as before but still respectable:

    ```
    Running LINQ with total=100000000, outer=10, inner=10000000 ...
    ... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    Running trivialpush with total=100000000, outer=10, inner=10000000 ...
    ... 483 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
    ```

    ## What's going on?

It's great that we have a potential solution, but some question marks still remain:

    1. What does this have to do with tail calls in .NET (as implied by the title of the blog)?
    1. Why the significant drop in performance between `x64` and `x86` modes?
    1. `paulWestcottsMagicOil` does nothing yet it helps, why?
    1. What is tail call optimization (TCO) I sometimes hear about?

## Let's start with tail calls

A [tail call](https://en.wikipedia.org/wiki/Tail_call) is a function call that is the final operation in a function. When the F# compiler detects that a call is the final operation it decorates the call site with a `.tail` attribute.

In fact, in our data pipeline the calls to the next step in the pipeline are tail calls.

```fsharp
let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then (r v))  // <-- r v is a tail call
let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> (r (m v)))          // <-- r (m v) is a tail call
```

If we used the result of a function call, that call wouldn't be a tail call because it would no longer be the final operation.

    ### A silly code example

    ```fsharp
    let rec x n = if n > 0 then n + y (n - 1) else 0 // y (n - 1) is not a tail call
    and y n = if n > 0 then -n + x (n - 1) else 0 // x (n - 1) is not a tail call
    let rec v s n = if n > 0 then w (s + n) (n - 1) else s // w (s + n) (n - 1) is a tail call
    and w s n = if n > 0 then v (s - n) (n - 1) else s // v (s - n) (n - 1) is a tail call
    ```

These functions don't do anything really useful; they just demonstrate two different ways to implement the same functionality, one using tail calls (`v` & `w`) and one without (`x` & `y`). The F# compiler should then inject a `.tail` attribute for `v` & `w` but not for `x` & `y`.

    To see this we need to peek at the generated IL assembler.

    Here is an excerpt of the IL assembler for `x`:

    ```asm
    ; Calls y
    IL_0008: call int32 Program/Samples::y(int32)
    ; Increments the result
    IL_000d: add
    ; Returns from the function
    IL_000e: ret
    ```

`x` calls `y`. When `y` returns we accumulate the result and return. Because the call isn't the final operation it is not a tail call, as expected.

    It looks different for `v`:

    ```asm
    ; The following call is a tail call
    IL_000a: tail.
    ; Calls w
    IL_000c: call int32 Program/Samples::w(int32, int32)
    ; Returns from the function
    IL_0011: ret
    ```

    `v` calls `w` but as this is the final operation the call site is decorated with `.tail`.

    ### Why are tail calls a good thing?

I am sure you have run into your fair share of `StackOverflowException`s. For developers that aren't aware of the runtime ABI the name `StackOverflowException` must seem cryptic, but we quickly learn that it typically means we have a recursive function that keeps calling itself over and over until the stack overflows.

The F# and C# compilers translate our source into a stream of op codes that are executed by the .NET virtual machine.

When the virtual machine hits the op code `ret` it means it should continue at the site that called this function. The virtual machine needs to know where to continue executing, and in .NET that information is stored on the stack.

What happens is that when the virtual machine executes a `call` it stores the position of the op code following the `call` on the stack. Later, when `ret` is executed, it reads that value from the stack and continues executing at the instruction following the `call`.

    `StackOverflowException` happens when we run out of space to store the return positions.

    _In reality there is more information that is stored on the stack than just the return position but let's ignore that for now._

    Tail calls allow us to not run out of stack space when implementing a recursive algorithm.

    If we have a call that looks like this:

    ```asm
    IL_000c: call int32 Program/Samples::w(int32, int32)
    IL_0011: ret
    ```

Since the call is the final operation it is a tail call. It seems unnecessary to return to `IL_0011` just to return to the calling function. What if we did this instead?

    ```asm
    IL_000c: jmp int32 Program/Samples::w(int32, int32)
    ```

Then we jump to `w`, but we don't store a return address back to ourselves. When `w` returns it returns directly to the code that called us. We save space and we save time.

This is roughly what the jitter tries to do when it sees a tail call.

    Let's test our function above with this theory:

    ```fsharp
    // Remember x is not tail recursive
    printfn "%A" <| x 10000 // Prints 5000
    printfn "%A" <| x 100000 // Crashes on my machine with StackOverflowException
    // Remember v is tail recursive
    printfn "%A" <| v 0 10000 // Prints 5000
    printfn "%A" <| v 0 100000 // Prints 50000
    printfn "%A" <| v 0 1000000000 // Prints 500000000 (takes awhile)
    ```

    Tail calls are brilliant!

    ## Why did our code perform worse than LINQ in `x86` mode?

That is because tail calls suck!

    If we profile the `x64` code the top 3 functions are:

    ```
    [email protected] 32 %
    [email protected] 20 %
    [email protected] 18 %
    ```

    In `x86` mode the top 3 functions are:

    ```
    [email protected] 27 %
    [clr.dll] 24 %
    [email protected] 17 %
    ```

What are these `Invoke` methods? What is the CLR doing? Even if we eliminated the CLR part, that alone would not explain the complete drop in performance.

    ### Mapping Invoke methods to the code

One way to interpret the name `[email protected]` is that it is the lambda function generated for line 71. In my code, lines 71, 72 and 73 look like this:

    ```fsharp
    |> TrivialStream.map (fun v -> int64 v)
    |> TrivialStream.filter (fun v -> v &&& 1L = 0L)
    |> TrivialStream.map (fun v -> v + 1L)
    ```

Because F# inlines small lambdas, the lambda functions that show up in the performance run are the data pipeline continuation lambdas for `map` and `filter`.

### What is the CLR doing?

    You are going to love this! To answer this we have to dig into the jitted `x64` and `x86` code.

    Let's start with `x64` code:

    ```asm
    ; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
    00007ff8`5af01160 488bc2 mov rax,rdx
    ; Note that the test (fun v -> v &&& 1L = 0L) is inlined
; Test if the number is even
    00007ff8`5af01163 a801 test al,1
    00007ff8`5af01165 7512 jne 00007ff8`5af01179
    ; It is even, then we should call the next step in the data pipeline.
    ; That is a virtual call.
    ; The jitter has devirtualization tricks but they have failed to apply here so
    ; we have to do the full virtual call...
; First load the pointer to the class that implements the next step in the data pipeline
    00007ff8`5af01167 488b4908 mov rcx,qword ptr [rcx+8]
    ; Load the object Method Table Pointer
    00007ff8`5af0116b 488b01 mov rax,qword ptr [rcx]
    ; Load the class Pointer to Methods
    00007ff8`5af0116e 488b4040 mov rax,qword ptr [rax+40h]
    ; Load the address to virtual method Invoke
    00007ff8`5af01172 488b4020 mov rax,qword ptr [rax+20h]
; Jump to the next Invoke (note this is a tail call so no need to return here)
    00007ff8`5af01176 48ffe0 jmp rax
    ; The number was odd meaning we return to the caller
    00007ff8`5af01179 33c0 xor eax,eax
    00007ff8`5af0117b c3 ret
    ```

Apart from the unfortunate fact that devirtualization failed to apply, which prevents further inlining of the next step, the code above is _not_ that bad.

Sure, there is a lot of overhead compared to the simple odd/even test, but it could be worse.

    Speaking of much worse, let's look at the `x86` code for the same function:

    ```asm
    ; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
; Method prologue storing various registers... they weren't here in x64 mode :(
    014e1090 55 push ebp
    014e1091 8bec mov ebp,esp
    014e1093 57 push edi
    014e1094 56 push esi
    014e1095 53 push ebx
    014e1096 8bf1 mov esi,ecx
    ; Loading the value to test from the stack rather than picking it up from a register :(
    014e1098 8b4508 mov eax,dword ptr [ebp+8]
    ; Well at least the test is simple enough and inlined
    014e109b a901000000 test eax,1
    014e10a0 7526 jne 014e10c8
    ; The number is even
    ; But why this test? It wasn't here in x64 mode
    014e10a2 833d4010005d00 cmp dword ptr [clr!g_TrapReturningThreads (5d001040)],0
    014e10a9 7526 jne 014e10d1
    ; Ok, here we go loading the pointer to next step in the pipeline
    014e10ab 8b4e04 mov ecx,dword ptr [esi+4]
    ; Pushing input arguments on the stack as required by the x86 ABI
    ; x64 ABI supports passing values in registers
    014e10ae ff750c push dword ptr [ebp+0Ch]
    014e10b1 ff7508 push dword ptr [ebp+8]
    ; Load the object Method Table Pointer
    014e10b4 8b01 mov eax,dword ptr [ecx]
    ; Load the class Pointer to Methods
    014e10b6 8b4028 mov eax,dword ptr [eax+28h]
    ; Load the address to virtual method Invoke
    014e10b9 8b4010 mov eax,dword ptr [eax+10h]
    ; More arguments pushed to stack?
    014e10bc 6a02 push 2
    014e10be 6a02 push 2
    014e10c0 6a01 push 1
    014e10c2 50 push eax
    ; And now we call clr!JIT_TailCall? Not Invoke?
    014e10c3 e877cf4a5b call clr!JIT_TailCall (5c98e03f)
    ; The number is odd so let's return
    014e10c8 33c0 xor eax,eax
    ; Method epilogue :(
    014e10ca 5b pop ebx
    014e10cb 5e pop esi
    014e10cc 5f pop edi
    014e10cd 5d pop ebp
    014e10ce c20800 ret 8
    ```

Some differences can be attributed to the different ABIs of the `x64` and `x86` modes, but what is `clr!JIT_TailCall` doing?

    Let's look at it:

    ```asm
; Why are we looking up the current thread? :(
    5c98e03f e844060000 call clr!GetThread (5c98e688)
    5c98e044 50 push eax
    5c98e045 51 push ecx
    5c98e046 52 push edx
    ; Something about the current thread is interesting?
    5c98e047 f7400480000000 test dword ptr [eax+4],80h
    5c98e04e 7406 je clr!JIT_TailCall+0x17 (5c98e056)
    5c98e050 50 push eax
    5c98e051 e870213400 call clr!JIT_TailCallHelper (5ccd01c6)
    ; Seems we usually end up here!
    ; Ok, let's load the arguments we pushed when calling clr!JIT_TailCall
    5c98e056 8b542414 mov edx,dword ptr [esp+14h]
    5c98e05a 8b44241c mov eax,dword ptr [esp+1Ch]
    5c98e05e 8b4c2418 mov ecx,dword ptr [esp+18h]
; A lot is happening here, but it seems to be eliminating the stackframe
; of the current function. Which makes sense, because we don't want the stack
; to grow on a tail call
    5c98e062 f7c201000000 test edx,1
    5c98e068 7409 je clr!JIT_TailCall+0x34 (5c98e073)
    5c98e06a 8b7dfc mov edi,dword ptr [ebp-4]
    5c98e06d 8b75f8 mov esi,dword ptr [ebp-8]
    5c98e070 8b5df4 mov ebx,dword ptr [ebp-0Ch]
    5c98e073 ff7504 push dword ptr [ebp+4]
    5c98e076 57 push edi
    5c98e077 56 push esi
    5c98e078 8d7c8508 lea edi,[ebp+eax*4+8]
    5c98e07c 8d748c2c lea esi,[esp+ecx*4+2Ch]
    5c98e080 8b6d00 mov ebp,dword ptr [ebp]
    5c98e083 f7c202000000 test edx,2
    5c98e089 752b jne clr!JIT_TailCallLeave+0x1 (5c98e0b6)
    5c98e08b 85c9 test ecx,ecx
    5c98e08d 740e je clr!JIT_TailCall+0x5e (5c98e09d)
    5c98e08f 8b46fc mov eax,dword ptr [esi-4]
    5c98e092 83ef04 sub edi,4
    5c98e095 83ee04 sub esi,4
    5c98e098 8907 mov dword ptr [edi],eax
    ; Hmm, a loop isn't what we want to see in a time-critical part
    5c98e09a 49 dec ecx
    5c98e09b 75f2 jne clr!JIT_TailCall+0x50 (5c98e08f)
    5c98e09d 8b442408 mov eax,dword ptr [esp+8]
    5c98e0a1 8b4c241c mov ecx,dword ptr [esp+1Ch]
    5c98e0a5 8947fc mov dword ptr [edi-4],eax
    5c98e0a8 894ff8 mov dword ptr [edi-8],ecx
    5c98e0ab 8d47f8 lea eax,[edi-8]
    ; Ok, getting near the end, restoring the registers
    5c98e0ae 5e pop esi
    5c98e0af 5f pop edi
    5c98e0b0 59 pop ecx
    5c98e0b1 5a pop edx
    5c98e0b2 59 pop ecx
    5c98e0b3 8be0 mov esp,eax
    clr!JIT_TailCallLeave:
; This actually "returns" to the next step in the pipeline rather than the current step
    5c98e0b5 c3 ret
    ```

If you want to look at the actual source code for `clr!JIT_TailCall` you can have a peek at CoreCLR: https://github.com/dotnet/coreclr/blob/master/src/vm/i386/jithelp.asm#L927

So it seems that `clr!JIT_TailCall` eliminates the current stackframe in order to make sure the stack doesn't grow. In `x64` mode the jitter was able to just do a `jmp`, but for some reason that doesn't work in `x86` mode. Unfortunately eliminating the stackframe seems to be quite a lot of work.

`clr!JIT_TailCall` is most likely what's tagged as `[clr.dll]` when we measured the most costly functions.

    ## What does the function `paulWestcottsMagicOil` do again?

Let's see what the "magic oil" does to the functions `x`, `y`, `v` and `w` we looked at before:

    ```fsharp
    let inline paulWestcottsMagicOil u = match u with i -> i
    let rec x n = if n > 0 then n + paulWestcottsMagicOil (y (n - 1)) else 0 // y (n - 1) is not a tail call
    and y n = if n > 0 then -n + paulWestcottsMagicOil (x (n - 1)) else 0 // x (n - 1) is not a tail call
    let rec v s n = if n > 0 then paulWestcottsMagicOil (w (s + n) (n - 1)) else s // w (s + n) (n - 1) is a tail call
    and w s n = if n > 0 then paulWestcottsMagicOil (v (s - n) (n - 1)) else s // v (s - n) (n - 1) is a tail call
    ```

`paulWestcottsMagicOil` matches on its input argument and returns whatever it is. It does nothing!

    Let's see how the magic function changes the IL Assembler.

    Let's start with `x`:

    ```asm
    IL_0008: call int32 Program/Sample2::y(int32)
    IL_000d: add
    IL_000e: ret
    ```

    Same as before. The magic oil function does nothing!

    Let's look at `v`:

    ```asm
    IL_000a: call int32 Program/Sample2::w(int32, int32)
    IL_000f: ret
    ```

The `.tail` attribute is gone?! It seems that because `match` semantically looks at the value returned by `w`, the F# compiler doesn't recognize that this is in essence a tail call, and omits the `.tail` attribute.

This is the purpose of `paulWestcottsMagicOil`: to eliminate the `.tail` attribute. This means that we will get a `StackOverflowException` when calling `v 0 20000`.

    However, it seems to help the performance of `x86` code.

    Let's see how this affects the `x64` assembler code.

    ```asm
    ; Reserve some stack space
    00007ff8`5aee1160 4883ec28 sub rsp,28h
    00007ff8`5aee1164 488bc2 mov rax,rdx
    ; The test
    00007ff8`5aee1167 a801 test al,1
    00007ff8`5aee1169 7515 jne 00007ff8`5aee1180
; The number is even, let's call the next step in the data pipeline
    ; Load the pointer to the next step in the pipeline
    00007ff8`5aee116b 488b4908 mov rcx,qword ptr [rcx+8]
    ; Load the object Method Table Pointer
    00007ff8`5aee116f 488b01 mov rax,qword ptr [rcx]
    ; Load the class Pointer to Methods
    00007ff8`5aee1172 488b4040 mov rax,qword ptr [rax+40h]
    ; Load and call the address to virtual method Invoke
    00007ff8`5aee1176 ff5020 call qword ptr [rax+20h]
    00007ff8`5aee1179 33c0 xor eax,eax
    ; Unreserve stack space
    00007ff8`5aee117b 4883c428 add rsp,28h
    ; We are done!
    00007ff8`5aee117f c3 ret
    ; Number is odd, just return
    00007ff8`5aee1180 33c0 xor eax,eax
    ; Unreserve stack space
    00007ff8`5aee1182 4883c428 add rsp,28h
    ; We are done!
    00007ff8`5aee1186 c3 ret
    ```

In `x64` mode the code when `.tail` is present is better, as we could `jmp` to the next step in the pipeline; now the jitter chooses `call` instead. We lose some performance (25%) and we risk `StackOverflowException`.

    Let's look at `x86` code:

    ```asm
; Saving the framepointer. This is a good thing!
02b61078 55 push ebp
02b61079 8bec mov ebp,esp
; Walking a stack manually without framepointers is not fun.
; However, the ARM ABI on Linux seems to place the framepointer in a "random" position
; so it doesn't help there, but that is a story for another blog.
    ; Load number from stack
    02b6107b 8b4508 mov eax,dword ptr [ebp+8]
    ; Test number if even
    02b6107e a901000000 test eax,1
    02b61083 7517 jne 02b6109c
    ; Ok the number is even. Let's call the next step in the pipeline
    ; Load the pointer to the next step in the pipeline
    02b61085 8b4904 mov ecx,dword ptr [ecx+4]
    ; Push the input arguments on the stack
    02b61088 ff750c push dword ptr [ebp+0Ch]
    02b6108b ff7508 push dword ptr [ebp+8]
    ; Load the object Method Table Pointer
    02b6108e 8b01 mov eax,dword ptr [ecx]
    ; Load the class Pointer to Methods
    02b61090 8b4028 mov eax,dword ptr [eax+28h]
    ; Load and call the address to virtual method Invoke
    02b61093 ff5010 call dword ptr [eax+10h]
    02b61096 33c0 xor eax,eax
    ; Restore framepointer
    02b61098 5d pop ebp
    ; We are done!
    02b61099 c20800 ret 8
    ; Ok, the number was odd, just return
    02b6109c 33c0 xor eax,eax
    ; Restore framepointer
    02b6109e 5d pop ebp
    ; We are done!
    02b6109f c20800 ret 8
    ```

This looks much cleaner and it performs much better. The gain seems to come primarily from calling the next step in the pipeline directly, without going through `clr!JIT_TailCall`. We risk `StackOverflowException`, but perhaps it's worth it.

    ## What is Tail Call Optimization (TCO)?

F# doesn't have `break` and `continue`, which are common in other languages. In F#, if you want to implement a loop that stops at a certain condition, the idiom is to use tail recursion:

```fsharp
let rec find f = function
  | []    -> None
  | v::vs -> if f v then Some v else find f vs
```
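
To see this in practice, here is a small self-contained example (repeating a corrected `find`, with the missing `f` argument passed along, so the snippet runs standalone):

```fsharp
// find repeated here so the snippet is self-contained
let rec find f = function
  | []    -> None
  | v::vs -> if f v then Some v else find f vs

// Scanning a million-element list completes without stack growth,
// because the compiler turns the tail call into a loop.
let big = List.init 1000000 id
find (fun v -> v = 999999) big  // Some 999999
```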

But what about efficiency? Doesn't the use of tail calls, especially in `x86` mode, kill performance?

This is where the F# compiler applies TCO and unrolls the code above into a loop. If we reverse engineer the IL code back into C# it looks something like this:

```c#
public static FSharpOption<a> find<a>(FSharpFunc<a, bool> f, FSharpList<a> _arg1)
{
  while (_arg1.get_TailOrNull() != null)
  {
    FSharpList<a> fsharpList = _arg1;
    FSharpList<a> tailOrNull = fsharpList.get_TailOrNull();
    a headOrDefault = fsharpList.get_HeadOrDefault();
    if (f.Invoke(headOrDefault))
      return FSharpOption<a>.Some(headOrDefault);
    FSharpFunc<a, bool> fsharpFunc = f;
    _arg1 = tailOrNull;
    f = fsharpFunc;
  }
  return (FSharpOption<a>) null;
}
```

So our tail recursive function becomes an efficient loop instead. TCO cannot be applied in every situation; for example, `v` and `w` above are mutually recursive, and TCO doesn't apply there.
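
One workaround, a sketch of my own rather than anything from the original, is to merge a mutually recursive pair into a single self-recursive function, which the F# compiler can then turn into a loop:

```fsharp
// A sketch: merging the mutually recursive v/w pair into one
// self-tail-recursive function, so the F# compiler can apply TCO.
// The flag `inV` tracks which of the two original functions we are "in".
let vw s n =
  let rec loop inV s n =
    if n > 0 then
      if inV then loop false (s + n) (n - 1)  // was: v calling w
      else        loop true  (s - n) (n - 1)  // was: w calling v
    else s
  loop true s n
// vw 0 1000000000 now runs in constant stack space.
```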

Using tail recursion to implement loops is more useful than you might think, because in F# seemingly identical loops can vary a lot in performance.

```fsharp
// These loops are fast
for i in 0..10 do
  compute i

for i in 0..1..10 do
  compute i

for i in 0..-1..10 do
  compute i

// These loops are slow
for i in 0L..10L do
  compute i

for i in 0..2..10 do
  compute i

// This causes a significant GC overhead in addition to being slow
for i in 0..10000000 do
  for j in 0L..10L do
    compute i
```

If you need a fast loop in F# that increments longs by 2, tail recursion is the way to do it.
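
As a sketch of that last point (the function name is my own), a tail-recursive loop over `int64` with step 2, which the compiler compiles down to a plain loop:

```fsharp
// A sketch: a fast "increment longs by 2" loop via tail recursion.
// The compiler applies TCO, so there is no enumerator and no stack growth.
let sumEvensUpTo (e : int64) : int64 =
  let rec loop acc i = if i <= e then loop (acc + i) (i + 2L) else acc
  loop 0L 0L
// sumEvensUpTo 10L = 0 + 2 + 4 + 6 + 8 + 10 = 30
```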

    # Conclusion

    ## Tail recursion can be used to eliminate `StackOverflowException`

Because we have seen that tail recursion causes the jitter to avoid generating a stackframe (`x64` mode) or to eliminate it (`x86` mode), tail recursion allows us to implement deep recursion without `StackOverflowException`.

## Tail recursion can have a significant performance impact

In `x86` mode we saw that eliminating the stackframe using `clr!JIT_TailCall` caused a significant overhead. In `x64` mode we instead observed a performance improvement, as the jitter was able to avoid a stackframe altogether.

However, it is possible that for more complex `x64` code the jitter ends up needing to call `clr!JIT_TailCall`.

As far as I know, the C# compiler doesn't generate the `.tail` attribute, meaning that the JIT performance team has not optimized tail calls as aggressively as possible, since F# is a minor .NET language (I wish it were otherwise).

There are things the jitter could do to improve stackframe elimination, like inlining `clr!JIT_TailCall` and unrolling its loops.

    ## Tail Call Optimization (TCO) is our friend!

Implementing efficient non-trivial loops in F# is best done with tail recursion. In addition, tail recursion allows us to avoid mutable state. F# has damaged me, and now I tend to write tail recursion even in languages that support break, continue and TCO, like Kotlin.

C# doesn't currently support TCO, so tail recursion is a bad idea in C#.

## Paul Westcott is a cool guy

I hope Paul forgives me for naming a function after him. I first saw the tail call suppression trick in an ambitious PR from Paul that aimed to improve the performance of the `Seq` module.

I see PRs from Paul that try to improve the performance of various core libraries like `Seq`, generic compare/equality and `Lazy`. All of these libraries _need_ to be improved, but I realize the difficulty of getting such PRs approved, as they typically involve massive and risky changes. I have a beer instead and try to forget about it.

Paul does try, and that's why I think he should win the "Tenacious F# developer of the year" award.