Last active July 18, 2024
mrange created this gist Jun 23, 2018.
# On the topic of tail calls in .NET

Let's say you have implemented a small data pipeline library to replace LINQ.

```fsharp
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> = fun r -> Loop.range s e r b
  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> = fun r -> s (fun v -> if f v then (r v))
  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> = fun r -> s (fun v -> (r (m v)))
  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
```

Then you compare the overhead of the pipeline against LINQ:

```fsharp
let trivialTest n =
  TrivialStream.range 0 1 n
  |> TrivialStream.map    (fun v -> int64 v)
  |> TrivialStream.filter (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map    (fun v -> v + 1L)
  |> TrivialStream.sum

let linqTest n =
  Enumerable
    .Range(0, n + 1)
    .Select(fun v -> int64 v)
    .Where(fun v -> v &&& 1L = 0L)
    .Select(fun v -> v + 1L)
    .Sum()
```

You are very satisfied to see that your trivial push data pipeline has significantly lower overhead than LINQ:

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 375 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

Almost 5x lower overhead!
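To see what the pipeline actually executes, here is a hand-inlined sketch of `trivialTest` (the function name and the `step` helpers are mine, purely illustrative): after F# inlines the combinators, the composed pipeline reduces to a chain of receivers, each calling the next with the transformed value.

```fsharp
// Hand-inlined sketch of trivialTest (illustrative only): the composed
// pipeline reduces to a chain of receivers, each calling the next.
let sumEvenSuccessors n =
  let mutable acc = 0L
  let step3 v = acc <- acc + v                  // sum's receiver
  let step2 v = step3 (v + 1L)                  // map (fun v -> v + 1L)
  let step1 v = if v &&& 1L = 0L then step2 v   // filter (fun v -> v &&& 1L = 0L)
  let step0 v = step1 (int64 v)                 // map int64
  let rec loop i =                              // range 0 1 n
    if i <= n then
      step0 i
      loop (i + 1)
  loop 0
  acc
```

For a small input this is easy to check by hand: `sumEvenSuccessors 10` pushes `0..10`, keeps `0, 2, 4, 6, 8, 10`, adds one to each and sums to `36L`. The calls from one `step` to the next are exactly the tail calls this post is about.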
## Paul Westcott ruins our day

Unfortunately you run into [Paul Westcott](https://github.com/manofstick) who asks what happens if you run the benchmark in `x86` mode rather than `x64` mode.

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1773 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 3116 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

Now the tables are turned; your beautiful little library is actually slower than LINQ!

Paul then says: _"Just apply some magic oil!"_ and suggests you change your code into:

```fsharp
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    let inline paulWestcottsMagicOil u = match u with () -> ()

    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> = fun r -> Loop.range s e r b
  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> = fun r -> s (fun v -> if f v then paulWestcottsMagicOil (r v))
  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> = fun r -> s (fun v -> paulWestcottsMagicOil (r (m v)))
  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
```

The function `let inline paulWestcottsMagicOil u = match u with () -> ()` does nothing, but to your disbelief it actually helps!

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1770 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 561 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

Performance in `x64` mode isn't as good as before but is still respectable:

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 483 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

## What's going on?

It's great that we have a potential solution, but some question marks still remain:

1. What does this have to do with tail calls in .NET (as implied by the title of the blog)?
1. Why the significant drop in performance between `x64` and `x86` modes?
1. `paulWestcottsMagicOil` does nothing, yet it helps. Why?
1. What is the tail call optimization (TCO) I sometimes hear about?

## Let's start with tail calls

A [tail call](https://en.wikipedia.org/wiki/Tail_call) is a function call that is the final operation in a function. When the F# compiler detects that a call is the final operation it decorates the call site with a `.tail` attribute. In fact, in our data pipeline the calls to the next step in the pipeline are tail calls:

```fsharp
let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then (r v))  // <-- r v is a tail call
let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> (r (m v)))          // <-- r (m v) is a tail call
```

If we were to look at the result of a function call, that call would no longer be a tail call because it is no longer the final operation.

### A silly code example

```fsharp
let rec x n = if n > 0 then  n + y (n - 1) else 0  // y (n - 1) is not a tail call
and     y n = if n > 0 then -n + x (n - 1) else 0  // x (n - 1) is not a tail call

let rec v s n = if n > 0 then w (s + n) (n - 1) else s  // w (s + n) (n - 1) is a tail call
and     w s n = if n > 0 then v (s - n) (n - 1) else s  // v (s - n) (n - 1) is a tail call
```

These functions don't do anything really useful except demonstrate two different ways to implement the same functionality: one using tail calls (`v` & `w`) and one without tail calls (`x` & `y`). The F# compiler should then inject a `.tail` attribute for `v` & `w` but not for `x` & `y`. To see this we need to peek at the generated IL assembler.
Here is an excerpt of the IL assembler for `x`:

```asm
; Calls y
IL_0008: call int32 Program/Samples::y(int32)
; Adds n to the result
IL_000d: add
; Returns from the function
IL_000e: ret
```

`x` calls `y`. When `y` returns we accumulate the result and return. Because the call isn't the final operation it is not a tail call, as expected.

It looks different for `v`:

```asm
; The following call is a tail call
IL_000a: tail.
; Calls w
IL_000c: call int32 Program/Samples::w(int32, int32)
; Returns from the function
IL_0011: ret
```

`v` calls `w`, but as this is the final operation the call site is decorated with `.tail`.

### Why are tail calls a good thing?

I am sure you have run into your fair share of `StackOverflowException`. For developers that aren't aware of the runtime ABI the name `StackOverflowException` must seem cryptic, but we quickly learn that it typically means we have a recursive function that keeps calling itself over and over until we overflow.

The F# and C# compilers translate our source into a stream of op codes that are executed by the .NET virtual machine. When the virtual machine hits the op code `ret` it means it should continue at the code site that called this function. The virtual machine needs to know where to continue executing, and in .NET that information is stored on the stack. When the virtual machine executes a `call` it stores the position of the op code following the `call` on the stack. Later, when `ret` is executed, it reads that value from the stack and the virtual machine continues executing at the instruction following the `call`. A `StackOverflowException` happens when we run out of space to store the return positions.

_In reality there is more information stored on the stack than just the return position, but let's ignore that for now._

Tail calls allow us to avoid running out of stack space when implementing a recursive algorithm.
If we have a call that looks like this:

```asm
IL_000c: call int32 Program/Samples::w(int32, int32)
IL_0011: ret
```

then, since the call is the final operation, it is a tail call. It seems unnecessary to return to `IL_0011` just to return to the calling function. What if we did this?

```asm
IL_000c: jmp int32 Program/Samples::w(int32, int32)
```

Then we jump to `w` but we don't store a return address to us. When `w` returns it returns directly to the code that called us. We save space and we save time. This is roughly what the jitter tries to do when it sees a tail call.

Let's test our functions above with this theory:

```fsharp
// Remember x is not tail recursive
printfn "%A" <| x 10000        // Prints 5000
printfn "%A" <| x 100000       // Crashes on my machine with StackOverflowException

// Remember v is tail recursive
printfn "%A" <| v 0 10000      // Prints 5000
printfn "%A" <| v 0 100000     // Prints 50000
printfn "%A" <| v 0 1000000000 // Prints 500000000 (takes awhile)
```

Tail calls are brilliant!

## Why did our code perform worse than LINQ in `x86` mode?

That is because tail calls suck! If we profile the `x64` code the top 3 functions are:

```
trivialTest@71.Invoke   32 %
trivialTest@72.Invoke   20 %
trivialTest@73.Invoke   18 %
```

In `x86` mode the top 3 functions are:

```
trivialTest@71.Invoke   27 %
[clr.dll]               24 %
trivialTest@72.Invoke   17 %
```

What are the `Invoke` methods? What is the CLR doing? Even if we eliminate the CLR part it does not explain the complete drop in performance.

### Mapping Invoke methods to the code

One way to interpret a name like `trivialTest@71.Invoke` is that this is the lambda function generated for line 71.
In my code lines 71, 72 and 73 look like this:

```fsharp
|> TrivialStream.map    (fun v -> int64 v)
|> TrivialStream.filter (fun v -> v &&& 1L = 0L)
|> TrivialStream.map    (fun v -> v + 1L)
```

Because F# inlines small lambdas, the lambda functions that show up in the performance run are the data pipeline continuation lambdas for `map` and `filter`.

### What is the CLR doing?

You are going to love this! To answer this we have to dig into the jitted `x64` and `x86` code. Let's start with the `x64` code:

```asm
; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
00007ff8`5af01160 488bc2          mov     rax,rdx
; Note that the test (fun v -> v &&& 1L = 0L) is inlined
; Test if the number is even
00007ff8`5af01163 a801            test    al,1
00007ff8`5af01165 7512            jne     00007ff8`5af01179
; It is even, so we should call the next step in the data pipeline.
; That is a virtual call.
; The jitter has devirtualization tricks but they have failed to apply here so
; we have to do the full virtual call...
; First load the pointer to the class that implements the next step in the data pipeline
00007ff8`5af01167 488b4908        mov     rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5af0116b 488b01          mov     rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5af0116e 488b4040        mov     rax,qword ptr [rax+40h]
; Load the address of the virtual method Invoke
00007ff8`5af01172 488b4020        mov     rax,qword ptr [rax+20h]
; Jump to the next Invoke (note this is a tail call so no need to return here)
00007ff8`5af01176 48ffe0          jmp     rax
; The number was odd, meaning we return to the caller
00007ff8`5af01179 33c0            xor     eax,eax
00007ff8`5af0117b c3              ret
```

Apart from the unfortunate fact that devirtualization has failed to apply, which prevents further inlining of the next step, the code above is _not_ that bad. Sure, there is a lot of overhead compared to the simple odd/even test, but it could be worse.
Speaking of much worse, let's look at the `x86` code for the same function:

```asm
; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
; Method prelude storing various registers... they weren't here in x64 mode :(
014e1090 55              push    ebp
014e1091 8bec            mov     ebp,esp
014e1093 57              push    edi
014e1094 56              push    esi
014e1095 53              push    ebx
014e1096 8bf1            mov     esi,ecx
; Loading the value to test from the stack rather than picking it up from a register :(
014e1098 8b4508          mov     eax,dword ptr [ebp+8]
; Well at least the test is simple enough and inlined
014e109b a901000000      test    eax,1
014e10a0 7526            jne     014e10c8
; The number is even
; But why this test? It wasn't here in x64 mode
014e10a2 833d4010005d00  cmp     dword ptr [clr!g_TrapReturningThreads (5d001040)],0
014e10a9 7526            jne     014e10d1
; Ok, here we go loading the pointer to the next step in the pipeline
014e10ab 8b4e04          mov     ecx,dword ptr [esi+4]
; Pushing input arguments on the stack as required by the x86 ABI
; (the x64 ABI supports passing values in registers)
014e10ae ff750c          push    dword ptr [ebp+0Ch]
014e10b1 ff7508          push    dword ptr [ebp+8]
; Load the object Method Table Pointer
014e10b4 8b01            mov     eax,dword ptr [ecx]
; Load the class Pointer to Methods
014e10b6 8b4028          mov     eax,dword ptr [eax+28h]
; Load the address of the virtual method Invoke
014e10b9 8b4010          mov     eax,dword ptr [eax+10h]
; More arguments pushed to the stack?
014e10bc 6a02            push    2
014e10be 6a02            push    2
014e10c0 6a01            push    1
014e10c2 50              push    eax
; And now we call clr!JIT_TailCall? Not Invoke?
014e10c3 e877cf4a5b      call    clr!JIT_TailCall (5c98e03f)
; The number is odd so let's return
014e10c8 33c0            xor     eax,eax
; Method epilogue :(
014e10ca 5b              pop     ebx
014e10cb 5e              pop     esi
014e10cc 5f              pop     edi
014e10cd 5d              pop     ebp
014e10ce c20800          ret     8
```

Some differences can be attributed to the differing ABIs of the `x64` and `x86` modes, but what is `clr!JIT_TailCall` doing? Let's look at it:

```asm
; Why are we looking up the current thread? :(
5c98e03f e844060000      call    clr!GetThread (5c98e688)
5c98e044 50              push    eax
5c98e045 51              push    ecx
5c98e046 52              push    edx
; Something about the current thread is interesting?
5c98e047 f7400480000000  test    dword ptr [eax+4],80h
5c98e04e 7406            je      clr!JIT_TailCall+0x17 (5c98e056)
5c98e050 50              push    eax
5c98e051 e870213400      call    clr!JIT_TailCallHelper (5ccd01c6)
; Seems we usually end up here!
; Ok, let's load the arguments we pushed when calling clr!JIT_TailCall
5c98e056 8b542414        mov     edx,dword ptr [esp+14h]
5c98e05a 8b44241c        mov     eax,dword ptr [esp+1Ch]
5c98e05e 8b4c2418        mov     ecx,dword ptr [esp+18h]
; A lot of stuff happening, but it seems to be about eliminating the stackframe
; of the current function. Which makes sense because we don't want the stack to grow
; on a tail call
5c98e062 f7c201000000    test    edx,1
5c98e068 7409            je      clr!JIT_TailCall+0x34 (5c98e073)
5c98e06a 8b7dfc          mov     edi,dword ptr [ebp-4]
5c98e06d 8b75f8          mov     esi,dword ptr [ebp-8]
5c98e070 8b5df4          mov     ebx,dword ptr [ebp-0Ch]
5c98e073 ff7504          push    dword ptr [ebp+4]
5c98e076 57              push    edi
5c98e077 56              push    esi
5c98e078 8d7c8508        lea     edi,[ebp+eax*4+8]
5c98e07c 8d748c2c        lea     esi,[esp+ecx*4+2Ch]
5c98e080 8b6d00          mov     ebp,dword ptr [ebp]
5c98e083 f7c202000000    test    edx,2
5c98e089 752b            jne     clr!JIT_TailCallLeave+0x1 (5c98e0b6)
5c98e08b 85c9            test    ecx,ecx
5c98e08d 740e            je      clr!JIT_TailCall+0x5e (5c98e09d)
5c98e08f 8b46fc          mov     eax,dword ptr [esi-4]
5c98e092 83ef04          sub     edi,4
5c98e095 83ee04          sub     esi,4
5c98e098 8907            mov     dword ptr [edi],eax
; Hmm, a loop isn't what we want to see in a time-critical part
5c98e09a 49              dec     ecx
5c98e09b 75f2            jne     clr!JIT_TailCall+0x50 (5c98e08f)
5c98e09d 8b442408        mov     eax,dword ptr [esp+8]
5c98e0a1 8b4c241c        mov     ecx,dword ptr [esp+1Ch]
5c98e0a5 8947fc          mov     dword ptr [edi-4],eax
5c98e0a8 894ff8          mov     dword ptr [edi-8],ecx
5c98e0ab 8d47f8          lea     eax,[edi-8]
; Ok, getting near the end, restoring the registers
5c98e0ae 5e              pop     esi
5c98e0af 5f              pop     edi
5c98e0b0 59              pop     ecx
5c98e0b1 5a              pop     edx
5c98e0b2 59              pop     ecx
5c98e0b3 8be0            mov     esp,eax
clr!JIT_TailCallLeave:
; This actually "returns" to the next step in the pipeline rather than to the current step
5c98e0b5 c3              ret
```

If you want to look at the actual source code for `clr!JIT_TailCall` you can have a peek at dotnet.core: https://github.com/dotnet/coreclr/blob/master/src/vm/i386/jithelp.asm#L927

So it seems that `clr!JIT_TailCall` eliminates the current stackframe in order to make sure the stack doesn't grow. In `x64` mode the jitter was able to just do a `jmp`, but for some reason that doesn't work in `x86` mode. Unfortunately it seems to be quite a lot of work to eliminate the stackframe. `clr!JIT_TailCall` is most likely what's tagged as `[clr.dll]` when we measured the most costly functions.

## What does the function `paulWestcottsMagicOil` do again?

Let's see what the magic oil does to the functions `x`, `y`, `v` and `w` we looked at before:

```fsharp
let inline paulWestcottsMagicOil u = match u with i -> i

let rec x n = if n > 0 then  n + paulWestcottsMagicOil (y (n - 1)) else 0  // y (n - 1) is not a tail call
and     y n = if n > 0 then -n + paulWestcottsMagicOil (x (n - 1)) else 0  // x (n - 1) is not a tail call

let rec v s n = if n > 0 then paulWestcottsMagicOil (w (s + n) (n - 1)) else s  // w (s + n) (n - 1) is a tail call
and     w s n = if n > 0 then paulWestcottsMagicOil (v (s - n) (n - 1)) else s  // v (s - n) (n - 1) is a tail call
```

`paulWestcottsMagicOil` matches on its input argument and returns whatever it is. It does nothing! Let's see how the magic function changes the IL assembler, starting with `x`:

```asm
IL_0008: call int32 Program/Sample2::y(int32)
IL_000d: add
IL_000e: ret
```

Same as before; the magic oil function does nothing! Let's look at `v`:

```asm
IL_000a: call int32 Program/Sample2::w(int32, int32)
IL_000f: ret
```

The `.tail` attribute is gone?! It seems that because the match semantically looks at the value returned by `w`, the F# compiler doesn't realize this is in essence still a tail call, and omits the `.tail` attribute.
This is the purpose of `paulWestcottsMagicOil`: to eliminate the `.tail` attribute. This means that we will get a `StackOverflowException` when calling `v 0 20000`. However, it seems to help the performance of `x86` code.

Let's see how this affects the `x64` assembler code:

```asm
; Reserve some stack space
00007ff8`5aee1160 4883ec28        sub     rsp,28h
00007ff8`5aee1164 488bc2          mov     rax,rdx
; The test
00007ff8`5aee1167 a801            test    al,1
00007ff8`5aee1169 7515            jne     00007ff8`5aee1180
; Number is even, let's call the next step in the data pipeline
; Load the pointer to the next step in the pipeline
00007ff8`5aee116b 488b4908        mov     rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5aee116f 488b01          mov     rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5aee1172 488b4040        mov     rax,qword ptr [rax+40h]
; Load and call the address of the virtual method Invoke
00007ff8`5aee1176 ff5020          call    qword ptr [rax+20h]
00007ff8`5aee1179 33c0            xor     eax,eax
; Unreserve stack space
00007ff8`5aee117b 4883c428        add     rsp,28h
; We are done!
00007ff8`5aee117f c3              ret
; Number is odd, just return
00007ff8`5aee1180 33c0            xor     eax,eax
; Unreserve stack space
00007ff8`5aee1182 4883c428        add     rsp,28h
; We are done!
00007ff8`5aee1186 c3              ret
```

In `x64` mode the code when `.tail` is present is better, as we could `jmp` to the next step in the pipeline; now the jitter chooses `call` instead. We lose some performance (25%) and we risk `StackOverflowException`.

Let's look at the `x86` code:

```asm
; Saving the framepointer. This is a good thing!
; Trying to walk a stack manually without framepointers is not fun.
; However, the ARM ABI on Linux seems to place the framepointer in a "random" position
; so it doesn't help there, but that is a story for another blog.
02b61078 55              push    ebp
02b61079 8bec            mov     ebp,esp
; Load the number from the stack
02b6107b 8b4508          mov     eax,dword ptr [ebp+8]
; Test if the number is even
02b6107e a901000000      test    eax,1
02b61083 7517            jne     02b6109c
; Ok, the number is even. Let's call the next step in the pipeline
; Load the pointer to the next step in the pipeline
02b61085 8b4904          mov     ecx,dword ptr [ecx+4]
; Push the input arguments on the stack
02b61088 ff750c          push    dword ptr [ebp+0Ch]
02b6108b ff7508          push    dword ptr [ebp+8]
; Load the object Method Table Pointer
02b6108e 8b01            mov     eax,dword ptr [ecx]
; Load the class Pointer to Methods
02b61090 8b4028          mov     eax,dword ptr [eax+28h]
; Load and call the address of the virtual method Invoke
02b61093 ff5010          call    dword ptr [eax+10h]
02b61096 33c0            xor     eax,eax
; Restore framepointer
02b61098 5d              pop     ebp
; We are done!
02b61099 c20800          ret     8
; Ok, the number was odd, just return
02b6109c 33c0            xor     eax,eax
; Restore framepointer
02b6109e 5d              pop     ebp
; We are done!
02b6109f c20800          ret     8
```

This looks much cleaner and it performs much better. The gain seems to come primarily from calling the next step in the pipeline directly without having to go through `clr!JIT_TailCall`. We risk `StackOverflowException`, but perhaps it's worth it.

## What is Tail Call Optimization (TCO)?

F# doesn't have `break` and `continue` as is common in other languages. In F#, if you want to implement a loop that stops at a certain condition, the idiom is to use tail recursion:

```fsharp
let rec find f = function
  | []      -> None
  | v :: vs -> if f v then Some v else find f vs
```

But what about efficiency? Doesn't the use of tail calls, especially in `x86` mode, kill the performance? Here is where the F# compiler applies TCO and unrolls the code above into a loop.
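A quick usage sketch (a self-contained copy of `find` with the predicate threaded through the recursive call; the example values are mine):

```fsharp
// Self-contained copy of the tail-recursive find, for illustration.
let rec find f = function
  | []      -> None
  | v :: vs -> if f v then Some v else find f vs

// First element divisible by 7, if any.
let firstMultipleOf7 = find (fun v -> v % 7 = 0) [ 3; 14; 15 ]  // Some 14
let noMatch          = find (fun v -> v > 100)   [ 3; 14; 15 ]  // None
```

Because the recursive call is in tail position, the compiler can turn it into a plain loop, so even a very long list won't overflow the stack.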
If we reverse engineer the IL code into C# it looks something like this:

```c#
public static FSharpOption<a> find<a>(FSharpFunc<a, bool> f, FSharpList<a> _arg1)
{
  while (_arg1.get_TailOrNull() != null)
  {
    FSharpList<a> fsharpList = _arg1;
    FSharpList<a> tailOrNull = fsharpList.get_TailOrNull();
    a headOrDefault = fsharpList.get_HeadOrDefault();
    if (f.Invoke(headOrDefault))
      return FSharpOption<a>.Some(headOrDefault);
    FSharpFunc<a, bool> fsharpFunc = f;
    _arg1 = tailOrNull;
    f = fsharpFunc;
  }
  return (FSharpOption<a>) null;
}
```

So our tail recursive function becomes an efficient loop instead. TCO cannot be applied in every situation; for example, `v` and `w` above are mutually recursive, and TCO doesn't apply there.

Tail recursion to implement loops is more useful than you might think, because in F# seemingly identical loops vary a lot in terms of performance:

```fsharp
// These loops are fast
for i in 0..10     do compute i
for i in 0..1..10  do compute i
for i in 10..-1..0 do compute i

// These loops are slow
for i in 0L..10L   do compute i
for i in 0..2..10  do compute i

// This causes a significant GC overhead in addition to being slow
for i in 0..10000000 do
  for j in 0L..10L do
    compute i
```

If you need a fast loop in F# that increments longs by 2, tail recursion is the way to do it.

# Conclusion

## Tail recursion can be used to eliminate `StackOverflowException`

Because we have seen that tail recursion causes the jitter to avoid generating a stackframe (`x64` mode) or to eliminate the stackframe (`x86` mode), tail recursion allows us to implement deep recursion without `StackOverflowException`.

## Tail recursion can have a significant performance impact

In `x86` mode we saw that eliminating the stackframe using `clr!JIT_TailCall` caused a significant overhead. In `x64` mode we instead observed a performance improvement, as the jitter was able to avoid a stackframe altogether. However, it is possible that for more complex `x64` code the jitter ends up needing to call `clr!JIT_TailCall`.
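The earlier advice about fast loops over longs stepping by 2 can be sketched as a tail-recursive function (the name and exact shape here are mine, not from the post); TCO compiles the tail call into a plain loop, so there is no enumerator allocation and no tail-call overhead:

```fsharp
// Hypothetical tail-recursive stand-in for the slow "for i in 0L .. 2L .. e"
// style loop; the compiler unrolls the tail call into a jump.
let sumEvenLongsUpTo (e : int64) =
  let rec loop acc i =
    if i <= e then loop (acc + i) (i + 2L)  // tail call, unrolled to a loop
    else acc
  loop 0L 0L
```

For example, `sumEvenLongsUpTo 10L` adds `0 + 2 + 4 + 6 + 8 + 10` and returns `30L`, without allocating an `int64` range enumerator.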
As far as I know, the C# compiler doesn't generate the `.tail` attribute, and since F# is a minor .NET language the JIT performance team has probably not optimized tail calls as aggressively as they could have (I wish it were otherwise). There are things the jitter could do to improve the stackframe elimination, like inlining `clr!JIT_TailCall` and unrolling its loop.

## Tail Call Optimization (TCO) is our friend!

Implementing efficient non-trivial loops in F# is best done with tail recursion. In addition, tail recursion allows us to avoid mutable state. F# has damaged me, and now I tend to write tail recursion even in languages that have support for `break`, `continue` and TCO, like Kotlin. C# doesn't currently support TCO, so tail recursion is a bad idea in C#.

## Paul Westcott is a cool guy

I hope Paul forgives me for naming a function after him. I first saw the tail call suppression trick in an ambitious PR from Paul that aimed to improve the performance of the `Seq` module. I see PRs from Paul that try to improve the performance of various core libraries like `Seq`, generic compare/equality and `Lazy`. All of these libraries _need_ to be improved, but I realize the difficulty in getting the PRs approved as they typically involve massive and risky changes. I have a beer instead and try to forget it. Paul does try, and that's why I think he should win the "Tenacious F# developer of the year" award.