Last active July 18, 2024
mrange created this gist Jun 23, 2018.
# On the topic of tail calls in .NET

Let's say you have implemented a small data pipeline library to replace LINQ.

```fsharp
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> = fun r -> Loop.range s e r b
  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> = fun r -> s (fun v -> if f v then (r v))
  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> = fun r -> s (fun v -> (r (m v)))
  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
```

Then you compare the overhead of the pipeline against LINQ:

```fsharp
let trivialTest n =
  TrivialStream.range 0 1 n
  |> TrivialStream.map    (fun v -> int64 v)
  |> TrivialStream.filter (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map    (fun v -> v + 1L)
  |> TrivialStream.sum

let linqTest n =
  Enumerable
    .Range(0, n + 1)
    .Select(fun v -> int64 v)
    .Where(fun v -> v &&& 1L = 0L)
    .Select(fun v -> v + 1L)
    .Sum()
```

You are very satisfied to see that your trivial push data pipeline has significantly lower overhead than LINQ:

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 375 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

Almost 5x lower overhead!
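To see what the pipeline actually executes, here is a hand-inlined sketch of `trivialTest` (the function name and the `step` helpers are mine, purely illustrative): after F# inlines the combinators, the composed pipeline reduces to a chain of receivers, each calling the next with the transformed value.

```fsharp
// Hand-inlined sketch of trivialTest (illustrative only): the composed
// pipeline reduces to a chain of receivers, each calling the next.
let sumEvenSuccessors n =
  let mutable acc = 0L
  let step3 v = acc <- acc + v                  // sum's receiver
  let step2 v = step3 (v + 1L)                  // map (fun v -> v + 1L)
  let step1 v = if v &&& 1L = 0L then step2 v   // filter (fun v -> v &&& 1L = 0L)
  let step0 v = step1 (int64 v)                 // map int64
  let rec loop i =                              // range 0 1 n
    if i <= n then
      step0 i
      loop (i + 1)
  loop 0
  acc
```

For a small input this is easy to check by hand: `sumEvenSuccessors 10` pushes `0..10`, keeps `0, 2, 4, 6, 8, 10`, adds one to each and sums to `36L`. The calls from one `step` to the next are exactly the tail calls this post is about.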
## Paul Westcott ruins our day

Unfortunately you run into [Paul Westcott](https://github.com/manofstick) who asks what happens if you run the benchmark in `x86` mode rather than `x64` mode.

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1773 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 3116 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

Now the tables are turned; your beautiful little library is actually slower than LINQ!

Paul then says: _"Just apply some magic oil!"_ and suggests you change your code into:

```fsharp
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    let inline paulWestcottsMagicOil u = match u with () -> ()

    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> = fun r -> Loop.range s e r b
  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> = fun r -> s (fun v -> if f v then paulWestcottsMagicOil (r v))
  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> = fun r -> s (fun v -> paulWestcottsMagicOil (r (m v)))
  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
```

The function `let inline paulWestcottsMagicOil u = match u with () -> ()` does nothing, but to your disbelief it actually helps!

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1770 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 561 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

Performance in `x64` mode isn't as good as before but is still respectable:

```
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 483 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
```

## What's going on?

It's great that we have a potential solution, but some question marks still remain:

1. What does this have to do with tail calls in .NET (as implied by the title of the blog)?
1. Why the significant drop in performance between `x64` and `x86` modes?
1. `paulWestcottsMagicOil` does nothing, yet it helps. Why?
1. What is the tail call optimization (TCO) I sometimes hear about?

## Let's start with tail calls

A [tail call](https://en.wikipedia.org/wiki/Tail_call) is a function call that is the final operation in a function. When the F# compiler detects that a call is the final operation it decorates the call site with a `.tail` attribute. In fact, in our data pipeline the calls to the next step in the pipeline are tail calls:

```fsharp
let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then (r v))  // <-- r v is a tail call
let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> (r (m v)))          // <-- r (m v) is a tail call
```

If we were to look at the result of a function call, that call would no longer be a tail call because it is no longer the final operation.

### A silly code example

```fsharp
let rec x n = if n > 0 then  n + y (n - 1) else 0  // y (n - 1) is not a tail call
and     y n = if n > 0 then -n + x (n - 1) else 0  // x (n - 1) is not a tail call

let rec v s n = if n > 0 then w (s + n) (n - 1) else s  // w (s + n) (n - 1) is a tail call
and     w s n = if n > 0 then v (s - n) (n - 1) else s  // v (s - n) (n - 1) is a tail call
```

These functions don't do anything really useful except demonstrate two different ways to implement the same functionality: one using tail calls (`v` & `w`) and one without tail calls (`x` & `y`). The F# compiler should then inject a `.tail` attribute for `v` & `w` but not for `x` & `y`. To see this we need to peek at the generated IL assembler.
Here is an excerpt of the IL assembler for `x`:

```asm
; Calls y
IL_0008: call int32 Program/Samples::y(int32)
; Adds n to the result
IL_000d: add
; Returns from the function
IL_000e: ret
```

`x` calls `y`. When `y` returns we accumulate the result and return. Because the call isn't the final operation it is not a tail call, as expected.

It looks different for `v`:

```asm
; The following call is a tail call
IL_000a: tail.
; Calls w
IL_000c: call int32 Program/Samples::w(int32, int32)
; Returns from the function
IL_0011: ret
```

`v` calls `w`, but as this is the final operation the call site is decorated with `.tail`.

### Why are tail calls a good thing?

I am sure you have run into your fair share of `StackOverflowException`. For developers that aren't aware of the runtime ABI the name `StackOverflowException` must seem cryptic, but we quickly learn that it typically means we have a recursive function that keeps calling itself over and over until we overflow.

The F# and C# compilers translate our source into a stream of op codes that are executed by the .NET virtual machine. When the virtual machine hits the op code `ret` it means it should continue at the code site that called this function. The virtual machine needs to know where to continue executing, and in .NET that information is stored on the stack. When the virtual machine executes a `call` it stores the position of the op code following the `call` on the stack. Later, when `ret` is executed, it reads that value from the stack and the virtual machine continues executing at the instruction following the `call`. A `StackOverflowException` happens when we run out of space to store the return positions.

_In reality there is more information stored on the stack than just the return position, but let's ignore that for now._

Tail calls allow us to avoid running out of stack space when implementing a recursive algorithm.
If we have a call that looks like this:

```asm
IL_000c: call int32 Program/Samples::w(int32, int32)
IL_0011: ret
```

then, since the call is the final operation, it is a tail call. It seems unnecessary to return to `IL_0011` just to return to the calling function. What if we did this?

```asm
IL_000c: jmp int32 Program/Samples::w(int32, int32)
```

Then we jump to `w` but we don't store a return address to us. When `w` returns it returns directly to the code that called us. We save space and we save time. This is roughly what the jitter tries to do when it sees a tail call.

Let's test our functions above with this theory:

```fsharp
// Remember x is not tail recursive
printfn "%A" <| x 10000        // Prints 5000
printfn "%A" <| x 100000       // Crashes on my machine with StackOverflowException

// Remember v is tail recursive
printfn "%A" <| v 0 10000      // Prints 5000
printfn "%A" <| v 0 100000     // Prints 50000
printfn "%A" <| v 0 1000000000 // Prints 500000000 (takes awhile)
```

Tail calls are brilliant!

## Why did our code perform worse than LINQ in `x86` mode?

That is because tail calls suck! If we profile the `x64` code the top 3 functions are:

```
trivialTest@71.Invoke   32 %
trivialTest@72.Invoke   20 %
trivialTest@73.Invoke   18 %
```

In `x86` mode the top 3 functions are:

```
trivialTest@71.Invoke   27 %
[clr.dll]               24 %
trivialTest@72.Invoke   17 %
```

What are the `Invoke` methods? What is the CLR doing? Even if we eliminate the CLR part it does not explain the complete drop in performance.

### Mapping Invoke methods to the code

One way to interpret a name like `trivialTest@71.Invoke` is that this is the lambda function generated for line 71.
In my code lines 71, 72 and 73 look like this:

```fsharp
|> TrivialStream.map    (fun v -> int64 v)
|> TrivialStream.filter (fun v -> v &&& 1L = 0L)
|> TrivialStream.map    (fun v -> v + 1L)
```

Because F# inlines small lambdas, the lambda functions that show up in the performance run are the data pipeline continuation lambdas for `map` and `filter`.

### What is the CLR doing?

You are going to love this! To answer this we have to dig into the jitted `x64` and `x86` code. Let's start with the `x64` code:

```asm
; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
00007ff8`5af01160 488bc2          mov     rax,rdx
; Note that the test (fun v -> v &&& 1L = 0L) is inlined
; Test if the number is even
00007ff8`5af01163 a801            test    al,1
00007ff8`5af01165 7512            jne     00007ff8`5af01179
; It is even, so we should call the next step in the data pipeline.
; That is a virtual call.
; The jitter has devirtualization tricks but they have failed to apply here so
; we have to do the full virtual call...
; First load the pointer to the class that implements the next step in the data pipeline
00007ff8`5af01167 488b4908        mov     rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5af0116b 488b01          mov     rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5af0116e 488b4040        mov     rax,qword ptr [rax+40h]
; Load the address of the virtual method Invoke
00007ff8`5af01172 488b4020        mov     rax,qword ptr [rax+20h]
; Jump to the next Invoke (note this is a tail call so no need to return here)
00007ff8`5af01176 48ffe0          jmp     rax
; The number was odd, meaning we return to the caller
00007ff8`5af01179 33c0            xor     eax,eax
00007ff8`5af0117b c3              ret
```

Apart from the unfortunate fact that devirtualization has failed to apply, which prevents further inlining of the next step, the code above is _not_ that bad. Sure, there is a lot of overhead compared to the simple odd/even test, but it could be worse.
Speaking of much worse, let's look at the `x86` code for the same function:

```asm
; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
; Method prelude storing various registers... they weren't here in x64 mode :(
014e1090 55              push    ebp
014e1091 8bec            mov     ebp,esp
014e1093 57              push    edi
014e1094 56              push    esi
014e1095 53              push    ebx
014e1096 8bf1            mov     esi,ecx
; Loading the value to test from the stack rather than picking it up from a register :(
014e1098 8b4508          mov     eax,dword ptr [ebp+8]
; Well at least the test is simple enough and inlined
014e109b a901000000      test    eax,1
014e10a0 7526            jne     014e10c8
; The number is even
; But why this test? It wasn't here in x64 mode
014e10a2 833d4010005d00  cmp     dword ptr [clr!g_TrapReturningThreads (5d001040)],0
014e10a9 7526            jne     014e10d1
; Ok, here we go loading the pointer to the next step in the pipeline
014e10ab 8b4e04          mov     ecx,dword ptr [esi+4]
; Pushing input arguments on the stack as required by the x86 ABI
; (the x64 ABI supports passing values in registers)
014e10ae ff750c          push    dword ptr [ebp+0Ch]
014e10b1 ff7508          push    dword ptr [ebp+8]
; Load the object Method Table Pointer
014e10b4 8b01            mov     eax,dword ptr [ecx]
; Load the class Pointer to Methods
014e10b6 8b4028          mov     eax,dword ptr [eax+28h]
; Load the address of the virtual method Invoke
014e10b9 8b4010          mov     eax,dword ptr [eax+10h]
; More arguments pushed to the stack?
014e10bc 6a02            push    2
014e10be 6a02            push    2
014e10c0 6a01            push    1
014e10c2 50              push    eax
; And now we call clr!JIT_TailCall? Not Invoke?
014e10c3 e877cf4a5b      call    clr!JIT_TailCall (5c98e03f)
; The number is odd so let's return
014e10c8 33c0            xor     eax,eax
; Method epilogue :(
014e10ca 5b              pop     ebx
014e10cb 5e              pop     esi
014e10cc 5f              pop     edi
014e10cd 5d              pop     ebp
014e10ce c20800          ret     8
```

Some differences can be attributed to the differing ABIs of the `x64` and `x86` modes, but what is `clr!JIT_TailCall` doing? Let's look at it:

```asm
; Why are we looking up the current thread? :(
5c98e03f e844060000      call    clr!GetThread (5c98e688)
5c98e044 50              push    eax
5c98e045 51              push    ecx
5c98e046 52              push    edx
; Something about the current thread is interesting?
5c98e047 f7400480000000  test    dword ptr [eax+4],80h
5c98e04e 7406            je      clr!JIT_TailCall+0x17 (5c98e056)
5c98e050 50              push    eax
5c98e051 e870213400      call    clr!JIT_TailCallHelper (5ccd01c6)
; Seems we usually end up here!
; Ok, let's load the arguments we pushed when calling clr!JIT_TailCall
5c98e056 8b542414        mov     edx,dword ptr [esp+14h]
5c98e05a 8b44241c        mov     eax,dword ptr [esp+1Ch]
5c98e05e 8b4c2418        mov     ecx,dword ptr [esp+18h]
; A lot of stuff happening, but it seems to be about eliminating the stackframe
; of the current function. Which makes sense because we don't want the stack to grow
; on a tail call
5c98e062 f7c201000000    test    edx,1
5c98e068 7409            je      clr!JIT_TailCall+0x34 (5c98e073)
5c98e06a 8b7dfc          mov     edi,dword ptr [ebp-4]
5c98e06d 8b75f8          mov     esi,dword ptr [ebp-8]
5c98e070 8b5df4          mov     ebx,dword ptr [ebp-0Ch]
5c98e073 ff7504          push    dword ptr [ebp+4]
5c98e076 57              push    edi
5c98e077 56              push    esi
5c98e078 8d7c8508        lea     edi,[ebp+eax*4+8]
5c98e07c 8d748c2c        lea     esi,[esp+ecx*4+2Ch]
5c98e080 8b6d00          mov     ebp,dword ptr [ebp]
5c98e083 f7c202000000    test    edx,2
5c98e089 752b            jne     clr!JIT_TailCallLeave+0x1 (5c98e0b6)
5c98e08b 85c9            test    ecx,ecx
5c98e08d 740e            je      clr!JIT_TailCall+0x5e (5c98e09d)
5c98e08f 8b46fc          mov     eax,dword ptr [esi-4]
5c98e092 83ef04          sub     edi,4
5c98e095 83ee04          sub     esi,4
5c98e098 8907            mov     dword ptr [edi],eax
; Hmm, a loop isn't what we want to see in a time-critical part
5c98e09a 49              dec     ecx
5c98e09b 75f2            jne     clr!JIT_TailCall+0x50 (5c98e08f)
5c98e09d 8b442408        mov     eax,dword ptr [esp+8]
5c98e0a1 8b4c241c        mov     ecx,dword ptr [esp+1Ch]
5c98e0a5 8947fc          mov     dword ptr [edi-4],eax
5c98e0a8 894ff8          mov     dword ptr [edi-8],ecx
5c98e0ab 8d47f8          lea     eax,[edi-8]
; Ok, getting near the end, restoring the registers
5c98e0ae 5e              pop     esi
5c98e0af 5f              pop     edi
5c98e0b0 59              pop     ecx
5c98e0b1 5a              pop     edx
5c98e0b2 59              pop     ecx
5c98e0b3 8be0            mov     esp,eax
clr!JIT_TailCallLeave:
; This actually "returns" to the next step in the pipeline rather than to the current step
5c98e0b5 c3              ret
```

If you want to look at the actual source code for `clr!JIT_TailCall` you can have a peek at dotnet.core: https://github.com/dotnet/coreclr/blob/master/src/vm/i386/jithelp.asm#L927

So it seems that `clr!JIT_TailCall` eliminates the current stackframe in order to make sure the stack doesn't grow. In `x64` mode the jitter was able to just do a `jmp`, but for some reason that doesn't work in `x86` mode. Unfortunately it seems to be quite a lot of work to eliminate the stackframe. `clr!JIT_TailCall` is most likely what's tagged as `[clr.dll]` when we measured the most costly functions.

## What does the function `paulWestcottsMagicOil` do again?

Let's see what the magic oil does to the functions `x`, `y`, `v` and `w` we looked at before:

```fsharp
let inline paulWestcottsMagicOil u = match u with i -> i

let rec x n = if n > 0 then  n + paulWestcottsMagicOil (y (n - 1)) else 0  // y (n - 1) is not a tail call
and     y n = if n > 0 then -n + paulWestcottsMagicOil (x (n - 1)) else 0  // x (n - 1) is not a tail call

let rec v s n = if n > 0 then paulWestcottsMagicOil (w (s + n) (n - 1)) else s  // w (s + n) (n - 1) is a tail call
and     w s n = if n > 0 then paulWestcottsMagicOil (v (s - n) (n - 1)) else s  // v (s - n) (n - 1) is a tail call
```

`paulWestcottsMagicOil` matches on its input argument and returns whatever it is. It does nothing! Let's see how the magic function changes the IL assembler, starting with `x`:

```asm
IL_0008: call int32 Program/Sample2::y(int32)
IL_000d: add
IL_000e: ret
```

Same as before; the magic oil function does nothing! Let's look at `v`:

```asm
IL_000a: call int32 Program/Sample2::w(int32, int32)
IL_000f: ret
```

The `.tail` attribute is gone?! It seems that because the match semantically looks at the value returned by `w`, the F# compiler doesn't realize this is in essence still a tail call, and omits the `.tail` attribute.
This is the purpose of `paulWestcottsMagicOil`: to eliminate the `.tail` attribute. This means that we will get a `StackOverflowException` when calling `v 0 20000`. However, it seems to help the performance of `x86` code.

Let's see how this affects the `x64` assembler code:

```asm
; Reserve some stack space
00007ff8`5aee1160 4883ec28        sub     rsp,28h
00007ff8`5aee1164 488bc2          mov     rax,rdx
; The test
00007ff8`5aee1167 a801            test    al,1
00007ff8`5aee1169 7515            jne     00007ff8`5aee1180
; Number is even, let's call the next step in the data pipeline
; Load the pointer to the next step in the pipeline
00007ff8`5aee116b 488b4908        mov     rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5aee116f 488b01          mov     rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5aee1172 488b4040        mov     rax,qword ptr [rax+40h]
; Load and call the address of the virtual method Invoke
00007ff8`5aee1176 ff5020          call    qword ptr [rax+20h]
00007ff8`5aee1179 33c0            xor     eax,eax
; Unreserve stack space
00007ff8`5aee117b 4883c428        add     rsp,28h
; We are done!
00007ff8`5aee117f c3              ret
; Number is odd, just return
00007ff8`5aee1180 33c0            xor     eax,eax
; Unreserve stack space
00007ff8`5aee1182 4883c428        add     rsp,28h
; We are done!
00007ff8`5aee1186 c3              ret
```

In `x64` mode the code when `.tail` is present is better, as we could `jmp` to the next step in the pipeline; now the jitter chooses `call` instead. We lose some performance (25%) and we risk `StackOverflowException`.

Let's look at the `x86` code:

```asm
; Saving the framepointer. This is a good thing!
; Trying to walk a stack manually without framepointers is not fun.
; However, the ARM ABI on Linux seems to place the framepointer in a "random" position
; so it doesn't help there, but that is a story for another blog.
02b61078 55              push    ebp
02b61079 8bec            mov     ebp,esp
; Load the number from the stack
02b6107b 8b4508          mov     eax,dword ptr [ebp+8]
; Test if the number is even
02b6107e a901000000      test    eax,1
02b61083 7517            jne     02b6109c
; Ok, the number is even. Let's call the next step in the pipeline
; Load the pointer to the next step in the pipeline
02b61085 8b4904          mov     ecx,dword ptr [ecx+4]
; Push the input arguments on the stack
02b61088 ff750c          push    dword ptr [ebp+0Ch]
02b6108b ff7508          push    dword ptr [ebp+8]
; Load the object Method Table Pointer
02b6108e 8b01            mov     eax,dword ptr [ecx]
; Load the class Pointer to Methods
02b61090 8b4028          mov     eax,dword ptr [eax+28h]
; Load and call the address of the virtual method Invoke
02b61093 ff5010          call    dword ptr [eax+10h]
02b61096 33c0            xor     eax,eax
; Restore framepointer
02b61098 5d              pop     ebp
; We are done!
02b61099 c20800          ret     8
; Ok, the number was odd, just return
02b6109c 33c0            xor     eax,eax
; Restore framepointer
02b6109e 5d              pop     ebp
; We are done!
02b6109f c20800          ret     8
```

This looks much cleaner and it performs much better. The gain seems to come primarily from calling the next step in the pipeline directly without having to go through `clr!JIT_TailCall`. We risk `StackOverflowException`, but perhaps it's worth it.

## What is Tail Call Optimization (TCO)?

F# doesn't have `break` and `continue` as is common in other languages. In F#, if you want to implement a loop that stops at a certain condition, the idiom is to use tail recursion:

```fsharp
let rec find f = function
  | []      -> None
  | v :: vs -> if f v then Some v else find f vs
```

But what about efficiency? Doesn't the use of tail calls, especially in `x86` mode, kill the performance? Here is where the F# compiler applies TCO and unrolls the code above into a loop.
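A quick usage sketch (a self-contained copy of `find` with the predicate threaded through the recursive call; the example values are mine):

```fsharp
// Self-contained copy of the tail-recursive find, for illustration.
let rec find f = function
  | []      -> None
  | v :: vs -> if f v then Some v else find f vs

// First element divisible by 7, if any.
let firstMultipleOf7 = find (fun v -> v % 7 = 0) [ 3; 14; 15 ]  // Some 14
let noMatch          = find (fun v -> v > 100)   [ 3; 14; 15 ]  // None
```

Because the recursive call is in tail position, the compiler can turn it into a plain loop, so even a very long list won't overflow the stack.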
If we reverse engineer the IL code into C# it looks something like this:

```c#
public static FSharpOption<a> find<a>(FSharpFunc<a, bool> f, FSharpList<a> _arg1)
{
  while (_arg1.get_TailOrNull() != null)
  {
    FSharpList<a> fsharpList = _arg1;
    FSharpList<a> tailOrNull = fsharpList.get_TailOrNull();
    a headOrDefault = fsharpList.get_HeadOrDefault();
    if (f.Invoke(headOrDefault))
      return FSharpOption<a>.Some(headOrDefault);
    FSharpFunc<a, bool> fsharpFunc = f;
    _arg1 = tailOrNull;
    f = fsharpFunc;
  }
  return (FSharpOption<a>) null;
}
```

So our tail recursive function becomes an efficient loop instead. TCO cannot be applied in every situation; for example, `v` and `w` above are mutually recursive, and TCO doesn't apply there.

Tail recursion to implement loops is more useful than you might think, because in F# seemingly identical loops vary a lot in terms of performance:

```fsharp
// These loops are fast
for i in 0..10     do compute i
for i in 0..1..10  do compute i
for i in 10..-1..0 do compute i

// These loops are slow
for i in 0L..10L   do compute i
for i in 0..2..10  do compute i

// This causes a significant GC overhead in addition to being slow
for i in 0..10000000 do
  for j in 0L..10L do
    compute i
```

If you need a fast loop in F# that increments longs by 2, tail recursion is the way to do it.

# Conclusion

## Tail recursion can be used to eliminate `StackOverflowException`

Because we have seen that tail recursion causes the jitter to avoid generating a stackframe (`x64` mode) or to eliminate the stackframe (`x86` mode), tail recursion allows us to implement deep recursion without `StackOverflowException`.

## Tail recursion can have a significant performance impact

In `x86` mode we saw that eliminating the stackframe using `clr!JIT_TailCall` caused a significant overhead. In `x64` mode we instead observed a performance improvement, as the jitter was able to avoid a stackframe altogether. However, it is possible that for more complex `x64` code the jitter ends up needing to call `clr!JIT_TailCall`.
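The earlier advice about fast loops over longs stepping by 2 can be sketched as a tail-recursive function (the name and exact shape here are mine, not from the post); TCO compiles the tail call into a plain loop, so there is no enumerator allocation and no tail-call overhead:

```fsharp
// Hypothetical tail-recursive stand-in for the slow "for i in 0L .. 2L .. e"
// style loop; the compiler unrolls the tail call into a jump.
let sumEvenLongsUpTo (e : int64) =
  let rec loop acc i =
    if i <= e then loop (acc + i) (i + 2L)  // tail call, unrolled to a loop
    else acc
  loop 0L 0L
```

For example, `sumEvenLongsUpTo 10L` adds `0 + 2 + 4 + 6 + 8 + 10` and returns `30L`, without allocating an `int64` range enumerator.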
As far as I know, the C# compiler doesn't generate the `.tail` attribute, and since F# is a minor .NET language the JIT performance team has probably not optimized tail calls as aggressively as they could have (I wish it were otherwise). There are things the jitter could do to improve the stackframe elimination, like inlining `clr!JIT_TailCall` and unrolling its loop.

## Tail Call Optimization (TCO) is our friend!

Implementing efficient non-trivial loops in F# is best done with tail recursion. In addition, tail recursion allows us to avoid mutable state. F# has damaged me, and now I tend to write tail recursion even in languages that have support for `break`, `continue` and TCO, like Kotlin. C# doesn't currently support TCO, so tail recursion is a bad idea in C#.

## Paul Westcott is a cool guy

I hope Paul forgives me for naming a function after him. I first saw the tail call suppression trick in an ambitious PR from Paul that aimed to improve the performance of the `Seq` module. I see PRs from Paul that try to improve the performance of various core libraries like `Seq`, generic compare/equality and `Lazy`. All of these libraries _need_ to be improved, but I realize the difficulty in getting the PRs approved as they typically involve massive and risky changes. I have a beer instead and try to forget it. Paul does try, and that's why I think he should win the "Tenacious F# developer of the year" award.