Skip to content

Instantly share code, notes, and snippets.

@topin89
Last active March 20, 2023 19:21
Show Gist options
  • Save topin89/84d1f78511dee9d32782b96638d94b6b to your computer and use it in GitHub Desktop.
Save topin89/84d1f78511dee9d32782b96638d94b6b to your computer and use it in GitHub Desktop.

Revisions

  1. topin89 revised this gist Mar 20, 2023. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -243,6 +243,7 @@ Bench: virtual functions optimized
    ```
    * MSVC 2022 17.4.3
    Release:
    ```
    @@ -418,6 +419,7 @@ Bench: virtual functions optimized
    ```
    * clang 15.0.7
    Release:
    ```
  2. topin89 revised this gist Mar 20, 2023. 1 changed file with 22 additions and 10 deletions.
    32 changes: 22 additions & 10 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -185,7 +185,9 @@ Edit: there's new measurements from Windows machine:
    Windows 10 (different machine, so I can't compare it with the above benchmarks):
    * clang 15.0.6 (MSVC runtime)
    * Release
    Release:
    ```
    Bench: virtual functions
    Regular loop, time: 154 ms area: 7.50355e+14
    @@ -212,7 +214,8 @@ Bench: virtual functions optimized
    Unrolled loop, time: 39 ms area: 7.39093e+14
    ```
    * Debug:
    Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 251 ms area: 7.50355e+14
    @@ -240,7 +243,8 @@ Bench: virtual functions optimized
    ```
    * MSVC 2022 17.4.3
    * Release
    Release:
    ```
    Bench: virtual functions
    Regular loop, time: 153 ms area: 7.50355e+14
    @@ -267,7 +271,8 @@ Bench: virtual functions optimized
    Unrolled loop, time: 37 ms area: 7.39093e+14
    ```
    * Debug
    Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 265 ms area: 7.50355e+14
    @@ -296,7 +301,9 @@ Bench: virtual functions optimized
    Ubuntu 20.04, chrooted in Ubuntu 22.04
    * gcc 12.1.0
    * Release
    Release:
    ```
    Bench: virtual functions
    Regular loop, time: 141 ms area: 7.45026e+14
    @@ -323,7 +330,8 @@ Bench: virtual functions optimized
    Unrolled loop, time: 31 ms area: 7.36963e+14
    ```
    * Debug:
    Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 224 ms area: 7.45026e+14
    @@ -352,7 +360,8 @@ Bench: virtual functions optimized
    Ubuntu 20.04
    * gcc 11.1.0
    * Release
    Release:
    ```
    Bench: virtual functions
    @@ -380,7 +389,8 @@ Bench: virtual functions optimized
    Unrolled loop, time: 28 ms area: 7.36963e+14
    ```
    * Debug:
    Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 227 ms area: 7.45026e+14
    @@ -408,7 +418,8 @@ Bench: virtual functions optimized
    ```
    * clang 15.0.7
    * Release
    Release:
    ```
    Bench: virtual functions
    Regular loop, time: 136 ms area: 7.45026e+14
    @@ -435,7 +446,8 @@ Bench: virtual functions optimized
    Unrolled loop, time: 31 ms area: 7.36963e+14
    ```
    * Debug:
    Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 208 ms area: 7.45026e+14
  3. topin89 revised this gist Mar 20, 2023. 1 changed file with 7 additions and 3 deletions.
    10 changes: 7 additions & 3 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -185,7 +185,7 @@ Edit: there's new measurements from Windows machine:
    Windows 10 (different machine, so I can't compare it with the above benchmarks):
    * clang 15.0.6 (MSVC runtime)
    ** Release
    * Release
    ```
    Bench: virtual functions
    Regular loop, time: 154 ms area: 7.50355e+14
    @@ -212,7 +212,7 @@ Bench: virtual functions optimized
    Unrolled loop, time: 39 ms area: 7.39093e+14
    ```
    ** Debug:
    * Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 251 ms area: 7.50355e+14
    @@ -266,6 +266,7 @@ Bench: virtual functions optimized
    Regular loop, time: 41 ms area: 7.50355e+14
    Unrolled loop, time: 37 ms area: 7.39093e+14
    ```
    * Debug
    ```
    Bench: virtual functions
    @@ -321,6 +322,7 @@ Bench: virtual functions optimized
    Regular loop, time: 45 ms area: 7.45026e+14
    Unrolled loop, time: 31 ms area: 7.36963e+14
    ```
    * Debug:
    ```
    Bench: virtual functions
    @@ -349,9 +351,9 @@ Bench: virtual functions optimized
    ```
    Ubuntu 20.04
    ```
    * gcc 11.1.0
    * Release
    ```
    Bench: virtual functions
    Regular loop, time: 128 ms area: 7.45026e+14
    @@ -377,6 +379,7 @@ Bench: virtual functions optimized
    Regular loop, time: 43 ms area: 7.45026e+14
    Unrolled loop, time: 28 ms area: 7.36963e+14
    ```
    * Debug:
    ```
    Bench: virtual functions
    @@ -431,6 +434,7 @@ Bench: virtual functions optimized
    Regular loop, time: 67 ms area: 7.45026e+14
    Unrolled loop, time: 31 ms area: 7.36963e+14
    ```
    * Debug:
    ```
    Bench: virtual functions
  4. topin89 revised this gist Mar 20, 2023. 1 changed file with 210 additions and 3 deletions.
    213 changes: 210 additions & 3 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -240,7 +240,7 @@ Bench: virtual functions optimized
    ```
    * MSVC 2022 17.4.3
    ** Release
    * Release
    ```
    Bench: virtual functions
    Regular loop, time: 153 ms area: 7.50355e+14
    @@ -266,7 +266,7 @@ Bench: virtual functions optimized
    Regular loop, time: 41 ms area: 7.50355e+14
    Unrolled loop, time: 37 ms area: 7.39093e+14
    ```
    ** Debug
    * Debug
    ```
    Bench: virtual functions
    Regular loop, time: 265 ms area: 7.50355e+14
    @@ -293,4 +293,211 @@ Bench: virtual functions optimized
    Unrolled loop, time: 91 ms area: 7.39093e+14
    ```
    Edit: old result and conclusions are removed, since they were incorrectly measured. You still can find them in old revisions
    Ubuntu 20.04, chrooted in Ubuntu 22.04
    * gcc 12.1.0
    * Release
    ```
    Bench: virtual functions
    Regular loop, time: 141 ms area: 7.45026e+14
    Unrolled loop, time: 128 ms area: 7.36963e+14

    Bench: switch case
    Regular loop, time: 83 ms area: 7.45026e+14
    Unrolled loop, time: 56 ms area: 7.36963e+14

    Bench: lookup table
    Regular loop, time: 21 ms area: 7.5042e+14
    Unrolled loop, time: 12 ms area: 7.65487e+14

    Bench: std::variant
    Regular loop, time: 78 ms area: 7.45026e+14
    Unrolled loop, time: 54 ms area: 7.36963e+14

    Bench: std::variant optimized
    Regular loop, time: 68 ms area: 7.45026e+14
    Unrolled loop, time: 29 ms area: 7.36963e+14

    Bench: virtual functions optimized
    Regular loop, time: 45 ms area: 7.45026e+14
    Unrolled loop, time: 31 ms area: 7.36963e+14
    ```
    * Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 224 ms area: 7.45026e+14
    Unrolled loop, time: 186 ms area: 7.36963e+14

    Bench: switch case
    Regular loop, time: 236 ms area: 7.45026e+14
    Unrolled loop, time: 197 ms area: 7.36963e+14

    Bench: lookup table
    Regular loop, time: 66 ms area: 7.5042e+14
    Unrolled loop, time: 47 ms area: 7.65487e+14

    Bench: std::variant
    Regular loop, time: 1092 ms area: 7.45026e+14
    Unrolled loop, time: 1076 ms area: 7.36963e+14

    Bench: std::variant optimized
    Regular loop, time: 1098 ms area: 7.45026e+14
    Unrolled loop, time: 1087 ms area: 7.36963e+14

    Bench: virtual functions optimized
    Regular loop, time: 83 ms area: 7.45026e+14
    Unrolled loop, time: 57 ms area: 7.36963e+14
    ```
    Ubuntu 20.04
    ```
    * gcc 11.1.0
    * Release
    ```
    Bench: virtual functions
    Regular loop, time: 128 ms area: 7.45026e+14
    Unrolled loop, time: 128 ms area: 7.36963e+14
    Bench: switch case
    Regular loop, time: 88 ms area: 7.45026e+14
    Unrolled loop, time: 55 ms area: 7.36963e+14
    Bench: lookup table
    Regular loop, time: 19 ms area: 7.5042e+14
    Unrolled loop, time: 10 ms area: 7.65487e+14
    Bench: std::variant
    Regular loop, time: 114 ms area: 7.45026e+14
    Unrolled loop, time: 108 ms area: 7.36963e+14
    Bench: std::variant optimized
    Regular loop, time: 115 ms area: 7.45026e+14
    Unrolled loop, time: 111 ms area: 7.36963e+14
    Bench: virtual functions optimized
    Regular loop, time: 43 ms area: 7.45026e+14
    Unrolled loop, time: 28 ms area: 7.36963e+14
    ```
    * Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 227 ms area: 7.45026e+14
    Unrolled loop, time: 188 ms area: 7.36963e+14
    Bench: switch case
    Regular loop, time: 231 ms area: 7.45026e+14
    Unrolled loop, time: 198 ms area: 7.36963e+14
    Bench: lookup table
    Regular loop, time: 65 ms area: 7.5042e+14
    Unrolled loop, time: 48 ms area: 7.65487e+14
    Bench: std::variant
    Regular loop, time: 1140 ms area: 7.45026e+14
    Unrolled loop, time: 1165 ms area: 7.36963e+14
    Bench: std::variant optimized
    Regular loop, time: 1150 ms area: 7.45026e+14
    Unrolled loop, time: 1151 ms area: 7.36963e+14
    Bench: virtual functions optimized
    Regular loop, time: 79 ms area: 7.45026e+14
    Unrolled loop, time: 59 ms area: 7.36963e+14
    ```

    * clang 15.0.7
    * Release
    ```
    Bench: virtual functions
    Regular loop, time: 136 ms area: 7.45026e+14
    Unrolled loop, time: 127 ms area: 7.36963e+14
    Bench: switch case
    Regular loop, time: 119 ms area: 7.45026e+14
    Unrolled loop, time: 112 ms area: 7.36963e+14
    Bench: lookup table
    Regular loop, time: 19 ms area: 7.5042e+14
    Unrolled loop, time: 15 ms area: 7.65487e+14
    Bench: std::variant
    Regular loop, time: 116 ms area: 7.45026e+14
    Unrolled loop, time: 110 ms area: 7.36963e+14
    Bench: std::variant optimized
    Regular loop, time: 122 ms area: 7.45026e+14
    Unrolled loop, time: 117 ms area: 7.36963e+14
    Bench: virtual functions optimized
    Regular loop, time: 67 ms area: 7.45026e+14
    Unrolled loop, time: 31 ms area: 7.36963e+14
    ```
    * Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 208 ms area: 7.45026e+14
    Unrolled loop, time: 178 ms area: 7.36963e+14
    Bench: switch case
    Regular loop, time: 290 ms area: 7.45026e+14
    Unrolled loop, time: 264 ms area: 7.36963e+14
    Bench: lookup table
    Regular loop, time: 105 ms area: 7.5042e+14
    Unrolled loop, time: 91 ms area: 7.65487e+14
    Bench: std::variant
    Regular loop, time: 586 ms area: 7.45026e+14
    Unrolled loop, time: 603 ms area: 7.36963e+14
    Bench: std::variant optimized
    Regular loop, time: 619 ms area: 7.45026e+14
    Unrolled loop, time: 631 ms area: 7.36963e+14
    Bench: virtual functions optimized
    Regular loop, time: 60 ms area: 7.45026e+14
    Unrolled loop, time: 48 ms area: 7.36963e+14
    ```

    So, lets compare it by categories, for regular loops:

    | | gcc 12 | gcc 11 | clang 15 Lin | clang 15 Win | MSVC |
    |-----------------------------|--------|--------|--------------|--------------|------|
    | virtual functions | 141 | 128 | 136 | 154 | 152 |
    | switch case | 83 | 88 | 119 | 131 | 130 |
    | lookup table | 21 | 19 | 19 | 18 | 21 |
    | std::variant | 78 | 114 | 116 | 133 | 124 |
    | std::variant optimized | 68 | 115 | 122 | 19 | 118 |
    | virtual functions optimized | 45 | 43 | 67 | 41 | 45 |

    and for unrolled loops:

    | | gcc 12 | gcc 11 | clang 15 Lin | clang 15 Win | MSVC |
    |-----------------------------|--------|--------|--------------|--------------|------|
    | virtual functions | 128 | 128 | 127 | 148 | 139 |
    | switch case | 56 | 55 | 112 | 119 | 90 |
    | lookup table | 12 | 10 | 15 | 12 | 18 |
    | std::variant | 54 | 108 | 110 | 121 | 79 |
    | std::variant optimized | 29 | 111 | 117 | 18 | 78 |
    | virtual functions optimized | 31 | 28 | 31 | 39 | 40 |

    Note, Win machine is not the same as Lin, so there will be no direct comparison.
    I don't think it's worth comparing Debug versions with each other, so I'll leave it as an exercise for a reader.

    So, conclusions:
    * Yes, counting area of 20 megashapes all placed in memory is not the same as counting area of 8 kiloshapes 2.5k times. All data fitting in at least L3 cache can significantly change results of a benchmark
    * Nevertheless, virtual shapes with that have one single base class counts faster than naive shapes with 4 different area functions
    * Lookup table version gained from about 2x to just a bit speed, depending on a compiler. 2x is for gcc at least on Linux, MSVC gives no significant speedup
    * Effect of unrolling switches and variants is compiler and platform dependent, but never to a degree the LuT code has.
    * On most platforms and compilers in the table, `variant` and `visit` is about the same speed as switch, except for GCC 11.
    * On most platform, the "optimized variant" code has no speedup compared to naive implementation. GCC 12 somehow improved on unrolled loops, and only Clang on windows showed exactly the same performance as the LuT code. I'm not sure why. My guess is, Windows linker can eliminate same code ([Partial-redundancy elimination](https://en.wikipedia.org/wiki/Partial-redundancy_elimination)) and clang can take advantage of it and throw away all branches, leaving single function aside. But, again, I may be completely wrong and clang 15.0.6 can do it everywhere, but clang 15.0.7 has some regression and so it can't optimize it like that. Or some clang defaults on Linux and Windows are different enough so it works on Win10 but not on Ubuntu 20.04.
    * Finally, optimized version of shapes with single middle-class (let's call it that) is still faster than four naive classes each with a different area counting function. About 3 to 4 times faster now. Using the smae function for both cases. My guess is, it's branch prediction at work, but I can't neither proof nor disproof it. *All I want to say, it's not really fair to compare optimized procedural code with unoptimized OO, especially since it's still 2 to 4 times faster.*

    I think I should clarify lots of stuff:
    * I totally agree OO is way less optimizable then Operation First, procedural style. At least, that's what my personal experience is
    * I agree that to OO is less understandable than procedural. I mean, just try to understand how FizzBuzzEnterpriseEdition even works. I tried, and I failed to find where `main` is. Maybe one or two level of indirection is fine, but five or more are pain
    * Therefore, I agree that it's almost impossible to find optimization opportunities in OO code, with IDE always pointing to base class and not the one class I need, and code being dispersed across way too many files
    * I agree there is no place for OO in tight loops that should work fast with lots of data
    * I do *not* agree you can't have same performance with Operand First approach. `variant`, `visit` and clang on visit shows that it's at least possible with modern C++ compilers. Yes, no base class, no vtables, but the structure of a code is about the same. Should we use it or not is beside the point (but probably we shouldn't except for some strange cases)
    * I do *not* agree we should not use clean code approach. Yet, do think that we should limit levels of indirection only for user API, and using rarely and carefully in a code that we fully control.

    Well, that's all for now.
  5. topin89 revised this gist Mar 18, 2023. 1 changed file with 9 additions and 2 deletions.
    11 changes: 9 additions & 2 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -4,7 +4,7 @@ Edit:
    2. This benchmark was designed on assuption that since we access shapes in predicatable way, cache won't give us much trouble. Well, [cmuratori](https://twitter.com/cmuratori/status/1637011323170209792) and [AcheretoTM](https://twitter.com/AcheretoTM/status/1637080367055159299) pointed that it's probably wrong. I checked, and I was wrong. Now virtual calls are even faster, so please disregard everything else as well.
    Now, the code don't generate all 20 MiB objects for each case, but iterate over 8192 (about 128 KB, half of my L2 cache for a single core) 2560 times. That's the same 20 MiB iterations. Also, vectors are made before every loop and destroyed after every measurement. So, it's always cold start for every loop and low memory consumption for the bench.

    Original:

    Recently, Case Muratori made an article and a video named ["Clean" Code, Horrible Performance](https://www.computerenhance.com/p/clean-code-horrible-performance)</br>

    I agree that sometimes, a particular software architecture that is regarded as better (that is, Clean Code) in general can reduce performance quite a bit. I also agree that the Clean Code example can be misleading and make people think it's fine even in cases like shape area size. There is a reason why shaders are not written in object-oriented style after all.
    @@ -186,6 +186,7 @@ Windows 10 (different machine, so I can't compare it with the above benchmarks):
    * clang 15.0.6 (MSVC runtime)
    ** Release
    ```
    Bench: virtual functions
    Regular loop, time: 154 ms area: 7.50355e+14
    Unrolled loop, time: 148 ms area: 7.39093e+14
    @@ -209,8 +210,10 @@ Bench: std::variant optimized
    Bench: virtual functions optimized
    Regular loop, time: 41 ms area: 7.50355e+14
    Unrolled loop, time: 39 ms area: 7.39093e+14
    ```
    ** Debug:
    ```
    Bench: virtual functions
    Regular loop, time: 251 ms area: 7.50355e+14
    Unrolled loop, time: 209 ms area: 7.39093e+14
    @@ -234,9 +237,11 @@ Bench: std::variant optimized
    Bench: virtual functions optimized
    Regular loop, time: 99 ms area: 7.50355e+14
    Unrolled loop, time: 74 ms area: 7.39093e+14
    ```
    * MSVC 2022 17.4.3
    ** Release
    ```
    Bench: virtual functions
    Regular loop, time: 153 ms area: 7.50355e+14
    Unrolled loop, time: 142 ms area: 7.39093e+14
    @@ -260,8 +265,9 @@ Bench: std::variant optimized
    Bench: virtual functions optimized
    Regular loop, time: 41 ms area: 7.50355e+14
    Unrolled loop, time: 37 ms area: 7.39093e+14
    ```
    ** Debug
    ```
    Bench: virtual functions
    Regular loop, time: 265 ms area: 7.50355e+14
    Unrolled loop, time: 233 ms area: 7.39093e+14
    @@ -285,5 +291,6 @@ Bench: std::variant optimized
    Bench: virtual functions optimized
    Regular loop, time: 115 ms area: 7.50355e+14
    Unrolled loop, time: 91 ms area: 7.39093e+14
    ```
    Edit: old result and conclusions are removed, since they were incorrectly measured. You still can find them in old revisions
  6. topin89 revised this gist Mar 18, 2023. 3 changed files with 153 additions and 211 deletions.
    2 changes: 1 addition & 1 deletion CMakeLists.txt
    Original file line number Diff line number Diff line change
    @@ -3,4 +3,4 @@ project(unclean VERSION 0.1.0)

    set (CMAKE_CXX_STANDARD 17)

    add_executable(unclean main.cpp)
    add_executable(shape_bench main.cpp)
    238 changes: 101 additions & 137 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -1,3 +1,10 @@
    Edit:

    1. Please disregard MSVC Results. I missed that CMake generates an app in `$builddir/Release`, and not in `$builddir`. MSVC's results were so close to clang's because they were, in fact, clang's. But don't worry it doesn't matter much, because
    2. This benchmark was designed on assuption that since we access shapes in predicatable way, cache won't give us much trouble. Well, [cmuratori](https://twitter.com/cmuratori/status/1637011323170209792) and [AcheretoTM](https://twitter.com/AcheretoTM/status/1637080367055159299) pointed that it's probably wrong. I checked, and I was wrong. Now virtual calls are even faster, so please disregard everything else as well.
    Now, the code don't generate all 20 MiB objects for each case, but iterate over 8192 (about 128 KB, half of my L2 cache for a single core) 2560 times. That's the same 20 MiB iterations. Also, vectors are made before every loop and destroyed after every measurement. So, it's always cold start for every loop and low memory consumption for the bench.

    Original:
    Recently, Case Muratori made an article and a video named ["Clean" Code, Horrible Performance](https://www.computerenhance.com/p/clean-code-horrible-performance)</br>

    I agree that sometimes, a particular software architecture that is regarded as better (that is, Clean Code) in general can reduce performance quite a bit. I also agree that the Clean Code example can be misleading and make people think it's fine even in cases like shape area size. There is a reason why shaders are not written in object-oriented style after all.
    @@ -173,153 +180,110 @@ f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)

    ```
    I did not bother with shapes, not relative performance counter, it was 2 pm after all, so here are the results for different builds with different compilers on different OSes. Also, instead of running the benchmark 100 times, I benchmarked on 20*1024*1024 shapes instead of just 1MiB shapes. While it's not as accurate, it should be fine for our comparison.
    Edit: there's new measurements from Windows machine:
    Windows 10 (different machine, so I can't compare it with the above benchmarks):
    Ubuntu 20.04:
    * gcc 11.1.0:
    * clang 15.0.6 (MSVC runtime)
    ** Release
    Bench: virtual functions
    Regular loop, time: 154 ms area: 7.50355e+14
    Unrolled loop, time: 148 ms area: 7.39093e+14
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 148 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 144 ms
    TotalAreaSwitch area: 7.07209e+14 time: 116 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 106 ms
    TotalAreaUnion area: 7.07209e+14 time: 44 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 16 ms
    TotalAreaVariant area: 7.07209e+14 time: 123 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 116 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 122 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 118 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 49 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 43 ms
    ```
    Bench: switch case
    Regular loop, time: 131 ms area: 7.50355e+14
    Unrolled loop, time: 119 ms area: 7.39093e+14
    ** Debug
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 231 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 207 ms
    TotalAreaSwitch area: 7.07209e+14 time: 242 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 210 ms
    TotalAreaUnion area: 7.07209e+14 time: 66 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 49 ms
    TotalAreaVariant area: 7.07209e+14 time: 1143 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 1163 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 1162 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 1164 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 82 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 67 ms
    ```
    Bench: lookup table
    Regular loop, time: 18 ms area: 7.45486e+14
    Unrolled loop, time: 12 ms area: 7.60836e+14
    * clang 15.0.7
    ** Release
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 147 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 149 ms
    TotalAreaSwitch area: 7.07209e+14 time: 121 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 122 ms
    TotalAreaUnion area: 7.07209e+14 time: 20 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 18 ms
    TotalAreaVariant area: 7.07209e+14 time: 122 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 126 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 129 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 125 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 50 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 46 ms
    ```
    ** Debug
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 219 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 197 ms
    TotalAreaSwitch area: 7.07209e+14 time: 304 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 276 ms
    TotalAreaUnion area: 7.07209e+14 time: 106 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 95 ms
    TotalAreaVariant area: 7.07209e+14 time: 596 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 617 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 627 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 643 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 67 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 56 ms
    ```
    Bench: std::variant
    Regular loop, time: 133 ms area: 7.50355e+14
    Unrolled loop, time: 121 ms area: 7.39093e+14
    Windows 10 (different machine, so I can't compare it with the above benchmarks):
    Bench: std::variant optimized
    Regular loop, time: 19 ms area: 7.50355e+14
    Unrolled loop, time: 18 ms area: 7.39093e+14
    * clang 15.0.6 (MSVC runtime)
    ** Release
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 199 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 206 ms
    TotalAreaSwitch area: 7.0699e+14 time: 136 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 130 ms
    TotalAreaUnion area: 7.0699e+14 time: 23 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 20 ms
    TotalAreaVariant area: 7.0699e+14 time: 145 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 132 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 30 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 30 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 101 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 103 ms
    ```
    ** Debug
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 390 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 329 ms
    TotalAreaSwitch area: 7.0699e+14 time: 288 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 252 ms
    TotalAreaUnion area: 7.0699e+14 time: 124 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 123 ms
    TotalAreaVariant area: 7.0699e+14 time: 700 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 698 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 706 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 726 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 269 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 231 ms
    ```
    * MSVC 2022 (17.4.3)
    Bench: virtual functions optimized
    Regular loop, time: 41 ms area: 7.50355e+14
    Unrolled loop, time: 39 ms area: 7.39093e+14
    ** Debug:
    Bench: virtual functions
    Regular loop, time: 251 ms area: 7.50355e+14
    Unrolled loop, time: 209 ms area: 7.39093e+14
    Bench: switch case
    Regular loop, time: 278 ms area: 7.50355e+14
    Unrolled loop, time: 237 ms area: 7.39093e+14
    Bench: lookup table
    Regular loop, time: 120 ms area: 7.45486e+14
    Unrolled loop, time: 114 ms area: 7.60836e+14
    Bench: std::variant
    Regular loop, time: 684 ms area: 7.50355e+14
    Unrolled loop, time: 687 ms area: 7.39093e+14
    Bench: std::variant optimized
    Regular loop, time: 687 ms area: 7.50355e+14
    Unrolled loop, time: 701 ms area: 7.39093e+14
    Bench: virtual functions optimized
    Regular loop, time: 99 ms area: 7.50355e+14
    Unrolled loop, time: 74 ms area: 7.39093e+14
    * MSVC 2022 17.4.3
    ** Release
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 200 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 204 ms
    TotalAreaSwitch area: 7.0699e+14 time: 137 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 129 ms
    TotalAreaUnion area: 7.0699e+14 time: 23 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 20 ms
    TotalAreaVariant area: 7.0699e+14 time: 146 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 134 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 28 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 30 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 103 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 99 ms
    ```
    Bench: virtual functions
    Regular loop, time: 153 ms area: 7.50355e+14
    Unrolled loop, time: 142 ms area: 7.39093e+14
    Bench: switch case
    Regular loop, time: 125 ms area: 7.50355e+14
    Unrolled loop, time: 91 ms area: 7.39093e+14
    Bench: lookup table
    Regular loop, time: 21 ms area: 7.45486e+14
    Unrolled loop, time: 18 ms area: 7.60836e+14
    Bench: std::variant
    Regular loop, time: 124 ms area: 7.50355e+14
    Unrolled loop, time: 79 ms area: 7.39093e+14
    Bench: std::variant optimized
    Regular loop, time: 116 ms area: 7.50355e+14
    Unrolled loop, time: 79 ms area: 7.39093e+14
    Bench: virtual functions optimized
    Regular loop, time: 41 ms area: 7.50355e+14
    Unrolled loop, time: 37 ms area: 7.39093e+14
    ** Debug
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 384 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 321 ms
    TotalAreaSwitch area: 7.0699e+14 time: 285 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 249 ms
    TotalAreaUnion area: 7.0699e+14 time: 123 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 120 ms
    TotalAreaVariant area: 7.0699e+14 time: 698 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 691 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 708 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 762 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 274 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 241 ms
    ```
    Bench: virtual functions
    Regular loop, time: 265 ms area: 7.50355e+14
    Unrolled loop, time: 233 ms area: 7.39093e+14
    Bench: switch case
    Regular loop, time: 398 ms area: 7.50355e+14
    Unrolled loop, time: 384 ms area: 7.39093e+14
    Bench: lookup table
    Regular loop, time: 243 ms area: 7.45486e+14
    Unrolled loop, time: 232 ms area: 7.60836e+14
    So, in conclusion:
    * Loop unrolled version does not produce the same result as the regular. That's totally expected, and to be fair, unrolled version should be a bit more accurate.
    * Across all machines, optimized version of `Area()` is faster than four separate versions.
    * **There can be optimized version of the `Area()` both for OO style and Operation Primal style**
    * At least in one case, clang 15.0.7 debug, OO-style works faster. But that case is not relevant for performance at all.
    * At least in one case, gcc 11.1.0, there was no difference in OO and direct call performance. It is somewhat relevant, but more about gcc being worse than MSVC in this particular case.
    * `std::variant` is some kind of middle ground on Windows release builds, always a bit slower than switch and direct call and always a bit faster than virtual functions. In some cases, it can be a good alternative for both switches and vtables. Side note: it looks like it's that good because MSVC linker can eliminate the same code and then eliminate hidden switch in `std::visit` as well.
    * That same `variant` is horrible for debugging, just like most heavy templated tools. Compiler error messages are not helpful as well.
    * It is unfair to compare optimized `Area()` procedural style with unoptimized OO style. There is a huge gap anyway, but 2-5x is not the same as 10x.
    Bench: std::variant
    Regular loop, time: 894 ms area: 7.50355e+14
    Unrolled loop, time: 887 ms area: 7.39093e+14
    There are two ways we can optimize virtual calls even further. First, instead of the `new` operator, we can allocate with a dedicated allocator for our specific case. That can hugely increase cache locality. Second, a lot more hypothetical, someone can make something like `jit_area_call(shape_object)` that generates a switch code on the fly using function address as a `switch` argument. Both of these is entirely possible to either make a part of tight loop parts while using normal tools for less CPU-heavy parts or even make clang plugin / build into a compiler with specific flag activation.
    Produced code obviously won't be the same, but the performance gap may become negligible.
    Bench: std::variant optimized
    Regular loop, time: 993 ms area: 7.50355e+14
    Unrolled loop, time: 1027 ms area: 7.39093e+14
    So yes, it should be possible for c++ to be just as fast for both procedural and OO styles at least for some tight loops in the near future. And for gcc 11.1.0, that future is now.
    Bench: virtual functions optimized
    Regular loop, time: 115 ms area: 7.50355e+14
    Unrolled loop, time: 91 ms area: 7.39093e+14
    P.S. I should really check if PGO can optimize this even further. Totally forgot about it.
    Edit: old result and conclusions are removed, since they were incorrectly measured. You still can find them in old revisions
    124 changes: 51 additions & 73 deletions main.cpp
    Original file line number Diff line number Diff line change
    @@ -15,7 +15,10 @@ using f32 = float;
    using u32 = uint32_t;

    static constexpr f32 Pi32 = M_PI;
    static constexpr size_t num_of_shapes = 1024*1024*20;
    static constexpr size_t num_of_shapes = 8192;
    static constexpr size_t num_of_outer_loops = 1024*1024*20 / num_of_shapes;

    // static constexpr size_t num_of_shapes = 1024*1024*20;

    // utilities

    @@ -93,6 +96,7 @@ class shape_base
    public:
    shape_base() {}
    virtual f32 Area() = 0;
    virtual ~shape_base() = default;
    };

    class square : public shape_base
    @@ -497,7 +501,6 @@ shape_base* generate_shape_base_opt(){
    abort(); // silencing no return warning
    }


    template<typename ShapeGenerator>
    auto generate_shapes(ShapeGenerator generate_shape) -> std::vector<decltype(generate_shape())>{
    std::vector<decltype(generate_shape())> result;
    @@ -511,78 +514,53 @@ auto generate_shapes(ShapeGenerator generate_shape) -> std::vector<decltype(gene
    }


    template<typename ShapeGenerator, typename AreaCount, typename AreaCountLoopUnrolled>
    void make_bench(ShapeGenerator gen_shape, AreaCount TotalArea, AreaCountLoopUnrolled TotalArea4, char const * const benchname){
    TimeMeasure tm_common;
    TimeMeasure tm_loop_unrolled;

    int main(int, char**) {
    TimeMeasure tm;

    auto vtbl_shapes = generate_shapes(generate_shape_base);
    auto union_shapes = generate_shapes(generate_shape_union);
    auto variant_shapes = generate_shapes(generate_shape_variant);
    auto variant_opt_shapes = generate_shapes(generate_shape_variant_opt);
    auto vtbl_opt_shapes = generate_shapes(generate_shape_base_opt);

    tm.begin();
    f32 vtlb_area = TotalAreaVTBL(vtbl_shapes.size(), vtbl_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBL area: " << vtlb_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 vtlb4_area = TotalAreaVTBL4(vtbl_shapes.size(), vtbl_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBL4 area: " << vtlb4_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 switch_area = TotalAreaSwitch(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaSwitch area: " << switch_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 switch4_area = TotalAreaSwitch4(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaSwitch4 area: " << switch4_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 union_area = TotalAreaUnion(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaUnion area: " << union_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 union4_area = TotalAreaUnion4(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaUnion4 area: " << union4_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 variant_area = TotalAreaVariant(variant_shapes.size(), variant_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant area: " << variant_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 variant4_area = TotalAreaVariant4(variant_shapes.size(), variant_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant4 area: " << variant4_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 variant_opt_area = TotalAreaVariant(variant_opt_shapes.size(), variant_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant opt area: " << variant_opt_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 variant4_opt_area = TotalAreaVariant4(variant_opt_shapes.size(), variant_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant4 opt area: " << variant4_opt_area << " time: " << tm.ms() << " ms \n";

    f32 area_common = 0;
    f32 area_unrolled = 0;
    {
    auto shapes = generate_shapes(gen_shape);

    tm_common.begin();
    for(size_t i = 0; i < num_of_outer_loops; ++i){
    area_common += TotalArea(shapes.size(), shapes.data());
    }
    tm_common.end();

    if constexpr (std::is_pointer_v<std::decay_t<decltype(shapes[0])>>){
    for(const auto shape: shapes ){
    delete shape;
    }
    }
    }

    tm.begin();
    f32 vtlb_opt_area = TotalAreaVTBL(vtbl_opt_shapes.size(), vtbl_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBLOpt area: " << vtlb_opt_area << " time: " << tm.ms() << " ms \n";
    {
    auto shapes = generate_shapes(gen_shape);
    tm_loop_unrolled.begin();
    for(size_t i = 0; i < num_of_outer_loops; ++i){
    area_unrolled += TotalArea4(shapes.size(), shapes.data());
    }
    tm_loop_unrolled.end();

    if constexpr (std::is_pointer_v<std::decay_t<decltype(shapes[0])>>){
    for(const auto shape: shapes ){
    delete shape;
    }
    }
    }
    std::cout << "Bench: " << benchname << '\n';
    std::cout << "\tRegular loop, time: " << tm_common.ms() << " ms area: " << area_common << '\n';
    std::cout << "\tUnrolled loop, time: " << tm_loop_unrolled.ms() << " ms area: " << area_unrolled << "\n\n";
    }

    tm.begin();
    f32 vtlb4_opt_area = TotalAreaVTBL4(vtbl_opt_shapes.size(), vtbl_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBLOpt4 area: " << vtlb4_opt_area << " time: " << tm.ms() << " ms \n";
    int main(int, char**) {
    make_bench(generate_shape_base, TotalAreaVTBL, TotalAreaVTBL4, "virtual functions");
    make_bench(generate_shape_union, TotalAreaSwitch, TotalAreaSwitch4, "switch case");
    make_bench(generate_shape_union, TotalAreaUnion, TotalAreaUnion4, "lookup table");
    make_bench(generate_shape_variant, TotalAreaVariant<shape_variant>, TotalAreaVariant4<shape_variant>, "std::variant");
    make_bench(generate_shape_variant_opt, TotalAreaVariant<shape_variant_opt>, TotalAreaVariant4<shape_variant_opt>, "std::variant optimized");
    make_bench(generate_shape_base_opt, TotalAreaVTBL, TotalAreaVTBL4, "virtual functions optimized");
    }
  7. topin89 revised this gist Mar 16, 2023. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Readme.md
    Original file line number Diff line number Diff line change
    @@ -118,7 +118,7 @@ using shape_variant_opt = std::variant<square_no_vt_opt, rectangle_no_vt_opt, tr

    * Finally, another `shape_base` overloaded classes, with an intermediate base class with optimized area function

    ```
    ```c++
    class shape_base_opt: public shape_base{
    public:
    shape_base_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
  8. topin89 created this gist Mar 16, 2023.
    6 changes: 6 additions & 0 deletions CMakeLists.txt
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,6 @@
    cmake_minimum_required(VERSION 3.0.0)
    project(unclean VERSION 0.1.0)

    set (CMAKE_CXX_STANDARD 17)

    add_executable(unclean main.cpp)
    325 changes: 325 additions & 0 deletions Readme.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,325 @@
    Recently, Case Muratori made an article and a video named ["Clean" Code, Horrible Performance](https://www.computerenhance.com/p/clean-code-horrible-performance)</br>

    I agree that sometimes, a particular software architecture that is regarded as better (that is, Clean Code) in general can reduce performance quite a bit. I also agree that the Clean Code example can be misleading and make people think it's fine even in cases like shape area size. There is a reason why shaders are not written in object-oriented style after all.

    Sometimes after the video was published, Casey and Robert (author of Clean Code) engaged in discussion on the merits of different architecture and on this day (16 Mar 2023) Bob said about "a hypothetical compiler that produces identical binary code irrespective of whether the input is operand or operation primal." Casey responded with essentially "it's hypothetical, modern compilers can't do that" and it is as far from reality as proposing "A hypothetical language where the dependency cost is always the same". I think this is a bit unfair since the previous points were more about architecture in general, and Bob agreed that there are cases where it's harmful for performance to use OO style with virtual functions.

    About four days ago (13 Mar 2023), I made my own benchmark to be absolutely sure things are as they are, as well as to test different approaches to the problem.

    Along with Casey's code, I tested:
    * std::variant containing four different classes with no base class:

    ```c++

    class square_no_vt
    {
    public:
    square_no_vt(f32 SideInit) : Side(SideInit) {}
    f32 Area() {return Side*Side;}

    private:
    f32 Side;
    };

    class rectangle_no_vt
    {
    public:
    rectangle_no_vt(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
    f32 Area() {return Width*Height;}

    private:
    f32 Width, Height;
    };

    class triangle_no_vt
    {
    public:
    triangle_no_vt(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
    f32 Area() {return 0.5f*Base*Height;}

    private:
    f32 Base, Height;
    };

    class circle_no_vt
    {
    public:
    circle_no_vt(f32 RadiusInit) : Radius(RadiusInit) {}
    f32 Area() {return Pi32*Radius*Radius;}

    private:
    f32 Radius;
    };

    using shape_variant = std::variant<square_no_vt, rectangle_no_vt, triangle_no_vt, circle_no_vt>;


    template<typename ShapeVariant>
    f32 TotalAreaVariant(u32 ShapeCount, ShapeVariant *Shapes) noexcept
    {
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
    Accum += std::visit([](auto shape){return shape.Area();}, Shapes[ShapeIndex]);
    }

    return Accum;
    }
    ```
    * std::variant with one base class with optimized `Area()` function
    ```c++
    class shape_base_no_vt_opt{
    public:
    shape_base_no_vt_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
    shape_index(shape_index_), side1(side1_), side2(side2_)
    {}
    f32 Area() { return c_table[shape_index] * side1 * side2; }
    private:
    static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};
    shape_type shape_index;
    f32 side1;
    f32 side2;
    };
    class square_no_vt_opt : public shape_base_no_vt_opt
    {
    public:
    square_no_vt_opt(f32 SideInit) : shape_base_no_vt_opt(Shape_Square, SideInit, SideInit) {}
    };
    class rectangle_no_vt_opt: public shape_base_no_vt_opt
    {
    public:
    rectangle_no_vt_opt(f32 WidthInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Rectangle, WidthInit, HeightInit) {}
    };
    class triangle_no_vt_opt: public shape_base_no_vt_opt
    {
    public:
    triangle_no_vt_opt(f32 BaseInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Triangle, BaseInit, HeightInit) {}
    };
    class circle_no_vt_opt: public shape_base_no_vt_opt
    {
    public:
    circle_no_vt_opt(f32 RadiusInit) : shape_base_no_vt_opt(Shape_Circle, RadiusInit, RadiusInit) {}
    };
    using shape_variant_opt = std::variant<square_no_vt_opt, rectangle_no_vt_opt, triangle_no_vt_opt, circle_no_vt_opt>;
    // same TotalAreaVariant as above
    ```

    * Finally, another `shape_base` overloaded classes, with an intermediate base class with optimized area function

    ```
    class shape_base_opt: public shape_base{
    public:
    shape_base_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
    shape_index(shape_index_), side1(side1_), side2(side2_)
    {}
    virtual f32 Area() { return c_table[shape_index] * side1 * side2; }
    private:
    static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};
    shape_type shape_index;
    f32 side1;
    f32 side2;
    };
    class square_opt : public shape_base_opt
    {
    public:
    square_opt(f32 SideInit) : shape_base_opt(Shape_Square, SideInit, SideInit) {}
    };
    class rectangle_opt: public shape_base_opt
    {
    public:
    rectangle_opt(f32 WidthInit, f32 HeightInit) : shape_base_opt(Shape_Rectangle, WidthInit, HeightInit) {}
    };
    class triangle_opt: public shape_base_opt
    {
    public:
    triangle_opt(f32 BaseInit, f32 HeightInit) : shape_base_opt(Shape_Triangle, BaseInit, HeightInit) {}
    };
    class circle_opt: public shape_base_opt
    {
    public:
    circle_opt(f32 RadiusInit) : shape_base_opt(Shape_Circle, RadiusInit, RadiusInit) {}
    };
    // used for original shapes as well
    f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)
    {
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
    Accum += Shapes[ShapeIndex]->Area();
    }
    return Accum;
    }
    ```

    I did not bother with shapes, not relative performance counter, it was 2 pm after all, so here are the results for different builds with different compilers on different OSes. Also, instead of running the benchmark 100 times, I benchmarked on 20*1024*1024 shapes instead of just 1MiB shapes. While it's not as accurate, it should be fine for our comparison.

    Ubuntu 20.04:
    * gcc 11.1.0:
    ** Release

    ```
    TotalAreaVTBL area: 7.07209e+14 time: 148 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 144 ms
    TotalAreaSwitch area: 7.07209e+14 time: 116 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 106 ms
    TotalAreaUnion area: 7.07209e+14 time: 44 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 16 ms
    TotalAreaVariant area: 7.07209e+14 time: 123 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 116 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 122 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 118 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 49 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 43 ms
    ```

    ** Debug
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 231 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 207 ms
    TotalAreaSwitch area: 7.07209e+14 time: 242 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 210 ms
    TotalAreaUnion area: 7.07209e+14 time: 66 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 49 ms
    TotalAreaVariant area: 7.07209e+14 time: 1143 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 1163 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 1162 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 1164 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 82 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 67 ms
    ```

    * clang 15.0.7
    ** Release
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 147 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 149 ms
    TotalAreaSwitch area: 7.07209e+14 time: 121 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 122 ms
    TotalAreaUnion area: 7.07209e+14 time: 20 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 18 ms
    TotalAreaVariant area: 7.07209e+14 time: 122 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 126 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 129 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 125 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 50 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 46 ms
    ```
    ** Debug
    ```
    TotalAreaVTBL area: 7.07209e+14 time: 219 ms
    TotalAreaVTBL4 area: 7.42829e+14 time: 197 ms
    TotalAreaSwitch area: 7.07209e+14 time: 304 ms
    TotalAreaSwitch4 area: 7.42829e+14 time: 276 ms
    TotalAreaUnion area: 7.07209e+14 time: 106 ms
    TotalAreaUnion4 area: 7.42829e+14 time: 95 ms
    TotalAreaVariant area: 7.07209e+14 time: 596 ms
    TotalAreaVariant4 area: 7.42829e+14 time: 617 ms
    TotalAreaVariant opt area: 7.07209e+14 time: 627 ms
    TotalAreaVariant4 opt area: 7.42829e+14 time: 643 ms
    TotalAreaVTBLOpt area: 7.07209e+14 time: 67 ms
    TotalAreaVTBLOpt4 area: 7.42829e+14 time: 56 ms
    ```

    Windows 10 (different machine, so I can't compare it with the above benchmarks):

    * clang 15.0.6 (MSVC runtime)
    ** Release
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 199 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 206 ms
    TotalAreaSwitch area: 7.0699e+14 time: 136 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 130 ms
    TotalAreaUnion area: 7.0699e+14 time: 23 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 20 ms
    TotalAreaVariant area: 7.0699e+14 time: 145 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 132 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 30 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 30 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 101 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 103 ms
    ```
    ** Debug
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 390 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 329 ms
    TotalAreaSwitch area: 7.0699e+14 time: 288 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 252 ms
    TotalAreaUnion area: 7.0699e+14 time: 124 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 123 ms
    TotalAreaVariant area: 7.0699e+14 time: 700 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 698 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 706 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 726 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 269 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 231 ms
    ```
    * MSVC 2022 (17.4.3)
    ** Release
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 200 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 204 ms
    TotalAreaSwitch area: 7.0699e+14 time: 137 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 129 ms
    TotalAreaUnion area: 7.0699e+14 time: 23 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 20 ms
    TotalAreaVariant area: 7.0699e+14 time: 146 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 134 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 28 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 30 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 103 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 99 ms
    ```
    ** Debug
    ```
    TotalAreaVTBL area: 7.0699e+14 time: 384 ms
    TotalAreaVTBL4 area: 7.42571e+14 time: 321 ms
    TotalAreaSwitch area: 7.0699e+14 time: 285 ms
    TotalAreaSwitch4 area: 7.42571e+14 time: 249 ms
    TotalAreaUnion area: 7.0699e+14 time: 123 ms
    TotalAreaUnion4 area: 7.42571e+14 time: 120 ms
    TotalAreaVariant area: 7.0699e+14 time: 698 ms
    TotalAreaVariant4 area: 7.42571e+14 time: 691 ms
    TotalAreaVariant opt area: 7.0699e+14 time: 708 ms
    TotalAreaVariant4 opt area: 7.42571e+14 time: 762 ms
    TotalAreaVTBLOpt area: 7.0699e+14 time: 274 ms
    TotalAreaVTBLOpt4 area: 7.42571e+14 time: 241 ms
    ```

    So, in conclusion:
    * Loop unrolled version does not produce the same result as the regular. That's totally expected, and to be fair, unrolled version should be a bit more accurate.
    * Across all machines, optimized version of `Area()` is faster than four separate versions.
    * **There can be optimized version of the `Area()` both for OO style and Operation Primal style**
    * At least in one case, clang 15.0.7 debug, OO-style works faster. But that case is not relevant for performance at all.
    * At least in one case, gcc 11.1.0, there was no difference in OO and direct call performance. It is somewhat relevant, but more about gcc being worse than MSVC in this particular case.
    * `std::variant` is some kind of middle ground on Windows release builds, always a bit slower than switch and direct call and always a bit faster than virtual functions. In some cases, it can be a good alternative for both switches and vtables. Side note: it looks like it's that good because MSVC linker can eliminate the same code and then eliminate hidden switch in `std::visit` as well.
    * That same `variant` is horrible for debugging, just like most heavy templated tools. Compiler error messages are not helpful as well.
    * It is unfair to compare optimized `Area()` procedural style with unoptimized OO style. There is a huge gap anyway, but 2-5x is not the same as 10x.

    There are two ways we can optimize virtual calls even further. First, instead of the `new` operator, we can allocate with a dedicated allocator for our specific case. That can hugely increase cache locality. Second, a lot more hypothetical, someone can make something like `jit_area_call(shape_object)` that generates a switch code on the fly using function address as a `switch` argument. Both of these is entirely possible to either make a part of tight loop parts while using normal tools for less CPU-heavy parts or even make clang plugin / build into a compiler with specific flag activation.
    Produced code obviously won't be the same, but the performance gap may become negligible.

    So yes, it should be possible for c++ to be just as fast for both procedural and OO styles at least for some tight loops in the near future. And for gcc 11.1.0, that future is now.

    P.S. I should really check if PGO can optimize this even further. Totally forgot about it.
    588 changes: 588 additions & 0 deletions main.cpp
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,588 @@
    #define _USE_MATH_DEFINES
    #include <math.h>
    #include <stdint.h>

    #include <chrono>
    #include <iostream>
    #include <random>
    #include <stdexcept>
    #include <type_traits>
    #include <vector>
    #include <variant>


    using f32 = float;
    using u32 = uint32_t;

    static constexpr f32 Pi32 = M_PI;
    static constexpr size_t num_of_shapes = 1024*1024*20;

    // utilities

    class TimeMeasure {
    public:
    TimeMeasure() = default;
    void begin(){ start_point = std::chrono::steady_clock::now(); }
    void end(){ end_point = std::chrono::steady_clock::now(); }

    int64_t ms(){
    return std::chrono::duration_cast<std::chrono::milliseconds>(end_point - start_point).count();
    }

    double ms_per_tick(int64_t ticks) {
    return static_cast<double>(ms()) / ticks;
    }
    private:
    std::chrono::steady_clock::time_point start_point;
    std::chrono::steady_clock::time_point end_point;
    };

    // from https://stackoverflow.com/a/34111095
    template <typename Kind, typename... Kinds>
    constexpr bool any_of_types(){
    /* The following expands to :
    * std::is_same_v<Kind, Kind0> || std::is_same_v<Kind, Kind1> || ... */
    if constexpr ((std::is_same_v<Kind, Kinds> || ...)) {
    return true;
    }

    return false;
    };

    // This should really be in the standard
    template< typename T>
    class RandomNumberGen{
    public:
    RandomNumberGen(T min = 14, T max = 9001, size_t seed = 42):
    rng(seed), uni(min, max) {}

    T operator()(){ return uni(rng);}
    private:
    using uniform_dist = std::conditional_t<
    any_of_types<T, float, double, long double>(),
    std::uniform_real_distribution<T>, std::uniform_int_distribution<T>
    >;
    std::mt19937 rng;
    uniform_dist uni;
    };

    // main dish


    enum shape_type : u32
    {
    Shape_Square,
    Shape_Rectangle,
    Shape_Triangle,
    Shape_Circle,

    Shape_Count,
    };

    struct shape_union
    {
    shape_type Type;
    f32 Width;
    f32 Height;
    };

    f32 const CTable[Shape_Count] = {1.0f, 1.0f, 0.5f, Pi32};

    class shape_base
    {
    public:
    shape_base() {}
    virtual f32 Area() = 0;
    };

    class square : public shape_base
    {
    public:
    square(f32 SideInit) : Side(SideInit) {}
    virtual f32 Area() {return Side*Side;}

    private:
    f32 Side;
    };

    class rectangle : public shape_base
    {
    public:
    rectangle(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
    virtual f32 Area() {return Width*Height;}

    private:
    f32 Width, Height;
    };

    class triangle : public shape_base
    {
    public:
    triangle(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
    virtual f32 Area() {return 0.5f*Base*Height;}

    private:
    f32 Base, Height;
    };

    class circle : public shape_base
    {
    public:
    circle(f32 RadiusInit) : Radius(RadiusInit) {}
    virtual f32 Area() {return Pi32*Radius*Radius;}

    private:
    f32 Radius;
    };

    f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)
    {
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
    Accum += Shapes[ShapeIndex]->Area();
    }

    return Accum;
    }

    f32 TotalAreaVTBL4(u32 ShapeCount, shape_base **Shapes)
    {
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;

    u32 Count = ShapeCount/4;
    while(Count--)
    {
    Accum0 += Shapes[0]->Area();
    Accum1 += Shapes[1]->Area();
    Accum2 += Shapes[2]->Area();
    Accum3 += Shapes[3]->Area();

    Shapes += 4;
    }

    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
    }

    f32 GetAreaSwitch(shape_union Shape)
    {
    f32 Result = 0.0f;

    switch(Shape.Type)
    {
    case Shape_Square: {Result = Shape.Width*Shape.Width;} break;
    case Shape_Rectangle: {Result = Shape.Width*Shape.Height;} break;
    case Shape_Triangle: {Result = 0.5f*Shape.Width*Shape.Height;} break;
    case Shape_Circle: {Result = Pi32*Shape.Width*Shape.Width;} break;

    case Shape_Count: {} break;
    }

    return Result;
    }

    f32 TotalAreaSwitch(u32 ShapeCount, shape_union *Shapes)
    {
    f32 Accum = 0.0f;

    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
    Accum += GetAreaSwitch(Shapes[ShapeIndex]);
    }

    return Accum;
    }

    f32 TotalAreaSwitch4(u32 ShapeCount, shape_union *Shapes)
    {
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;

    ShapeCount /= 4;
    while(ShapeCount--)
    {
    Accum0 += GetAreaSwitch(Shapes[0]);
    Accum1 += GetAreaSwitch(Shapes[1]);
    Accum2 += GetAreaSwitch(Shapes[2]);
    Accum3 += GetAreaSwitch(Shapes[3]);

    Shapes += 4;
    }

    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
    }


    f32 GetAreaUnion(shape_union Shape)
    {
    f32 Result = CTable[Shape.Type]*Shape.Width*Shape.Height;
    return Result;
    }

    f32 TotalAreaUnion(u32 ShapeCount, shape_union *Shapes)
    {
    f32 Accum = 0.0f;

    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
    Accum += GetAreaUnion(Shapes[ShapeIndex]);
    }

    return Accum;
    }

    f32 TotalAreaUnion4(u32 ShapeCount, shape_union *Shapes)
    {
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;

    ShapeCount /= 4;
    while(ShapeCount--)
    {
    Accum0 += GetAreaUnion(Shapes[0]);
    Accum1 += GetAreaUnion(Shapes[1]);
    Accum2 += GetAreaUnion(Shapes[2]);
    Accum3 += GetAreaUnion(Shapes[3]);

    Shapes += 4;
    }

    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
    }

    class square_no_vt
    {
    public:
    square_no_vt(f32 SideInit) : Side(SideInit) {}
    f32 Area() {return Side*Side;}

    private:
    f32 Side;
    };

    class rectangle_no_vt
    {
    public:
    rectangle_no_vt(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
    f32 Area() {return Width*Height;}

    private:
    f32 Width, Height;
    };

    class triangle_no_vt
    {
    public:
    triangle_no_vt(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
    f32 Area() {return 0.5f*Base*Height;}

    private:
    f32 Base, Height;
    };

    class circle_no_vt
    {
    public:
    circle_no_vt(f32 RadiusInit) : Radius(RadiusInit) {}
    f32 Area() {return Pi32*Radius*Radius;}

    private:
    f32 Radius;
    };

    using shape_variant = std::variant<square_no_vt, rectangle_no_vt, triangle_no_vt, circle_no_vt>;

    class shape_base_no_vt_opt{
    public:
    shape_base_no_vt_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
    shape_index(shape_index_), side1(side1_), side2(side2_)
    {}
    f32 Area() { return c_table[shape_index] * side1 * side2; }
    private:

    static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};

    shape_type shape_index;
    f32 side1;
    f32 side2;
    };

    class square_no_vt_opt : public shape_base_no_vt_opt
    {
    public:
    square_no_vt_opt(f32 SideInit) : shape_base_no_vt_opt(Shape_Square, SideInit, SideInit) {}

    };

    class rectangle_no_vt_opt: public shape_base_no_vt_opt
    {
    public:
    rectangle_no_vt_opt(f32 WidthInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Rectangle, WidthInit, HeightInit) {}
    };

    class triangle_no_vt_opt: public shape_base_no_vt_opt
    {
    public:
    triangle_no_vt_opt(f32 BaseInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Triangle, BaseInit, HeightInit) {}
    };

    class circle_no_vt_opt: public shape_base_no_vt_opt
    {
    public:
    circle_no_vt_opt(f32 RadiusInit) : shape_base_no_vt_opt(Shape_Circle, RadiusInit, RadiusInit) {}
    };

    using shape_variant_opt = std::variant<square_no_vt_opt, rectangle_no_vt_opt, triangle_no_vt_opt, circle_no_vt_opt>;


    template<typename ShapeVariant>
    f32 TotalAreaVariant(u32 ShapeCount, ShapeVariant *Shapes) noexcept
    {
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
    Accum += std::visit([](auto shape){return shape.Area();}, Shapes[ShapeIndex]);
    }

    return Accum;
    }

    template<typename ShapeVariant>
    f32 TotalAreaVariant4(u32 ShapeCount, ShapeVariant *Shapes) noexcept
    {
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;

    u32 Count = ShapeCount/4;
    while(Count--)
    {
    Accum0 += std::visit([](auto shape){return shape.Area();}, Shapes[0]);
    Accum1 += std::visit([](auto shape){return shape.Area();}, Shapes[1]);
    Accum2 += std::visit([](auto shape){return shape.Area();}, Shapes[2]);
    Accum3 += std::visit([](auto shape){return shape.Area();}, Shapes[3]);

    Shapes += 4;
    }

    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
    }


    class shape_base_opt: public shape_base{
    public:
    shape_base_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
    shape_index(shape_index_), side1(side1_), side2(side2_)
    {}
    virtual f32 Area() { return c_table[shape_index] * side1 * side2; }
    private:

    static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};

    shape_type shape_index;
    f32 side1;
    f32 side2;
    };

    class square_opt : public shape_base_opt
    {
    public:
    square_opt(f32 SideInit) : shape_base_opt(Shape_Square, SideInit, SideInit) {}

    };

    class rectangle_opt: public shape_base_opt
    {
    public:
    rectangle_opt(f32 WidthInit, f32 HeightInit) : shape_base_opt(Shape_Rectangle, WidthInit, HeightInit) {}
    };

    class triangle_opt: public shape_base_opt
    {
    public:
    triangle_opt(f32 BaseInit, f32 HeightInit) : shape_base_opt(Shape_Triangle, BaseInit, HeightInit) {}
    };

    class circle_opt: public shape_base_opt
    {
    public:
    circle_opt(f32 RadiusInit) : shape_base_opt(Shape_Circle, RadiusInit, RadiusInit) {}
    };

    shape_base* generate_shape_base(){
    static RandomNumberGen<size_t> next_shape{0, 3};
    static RandomNumberGen<f32> next_argument{};

    switch(next_shape()){
    case 0: return new square{next_argument()};
    case 1: return new rectangle{next_argument(), next_argument()};
    case 2: return new triangle{next_argument(), next_argument()};
    case 3: return new circle{next_argument()};
    }

    abort(); // silencing no return warning
    }

    shape_union generate_shape_union(){
    static RandomNumberGen<size_t> next_shape{0, 3};
    static RandomNumberGen<f32> next_argument{};

    switch(next_shape()){
    case 0: {
    f32 edge = next_argument();
    return shape_union{Shape_Square, edge, edge};
    };
    case 1: return shape_union{Shape_Rectangle, next_argument(), next_argument()};
    case 2: return shape_union{Shape_Triangle, next_argument(), next_argument()};
    case 3: {
    f32 edge = next_argument();
    return shape_union{Shape_Circle, edge, edge};
    };
    }

    abort(); // silencing no return warning
    }

    shape_variant generate_shape_variant(){
    static RandomNumberGen<size_t> next_shape{0, 3};
    static RandomNumberGen<f32> next_argument{};

    switch(next_shape()){
    case 0: return square_no_vt{next_argument()};
    case 1: return rectangle_no_vt{next_argument(), next_argument()};
    case 2: return triangle_no_vt{next_argument(), next_argument()};
    case 3: return circle_no_vt{next_argument()};
    }

    abort(); // silencing no return warning
    }

    shape_variant_opt generate_shape_variant_opt(){
    static RandomNumberGen<size_t> next_shape{0, 3};
    static RandomNumberGen<f32> next_argument{};

    switch(next_shape()){
    case 0: return square_no_vt_opt{next_argument()};
    case 1: return rectangle_no_vt_opt{next_argument(), next_argument()};
    case 2: return triangle_no_vt_opt{next_argument(), next_argument()};
    case 3: return circle_no_vt_opt{next_argument()};
    }

    abort(); // silencing no return warning
    }

    shape_base* generate_shape_base_opt(){
    static RandomNumberGen<size_t> next_shape{0, 3};
    static RandomNumberGen<f32> next_argument{};

    switch(next_shape()){
    case 0: return new square_opt{next_argument()};
    case 1: return new rectangle_opt{next_argument(), next_argument()};
    case 2: return new triangle_opt{next_argument(), next_argument()};
    case 3: return new circle_opt{next_argument()};
    }

    abort(); // silencing no return warning
    }


    template<typename ShapeGenerator>
    auto generate_shapes(ShapeGenerator generate_shape) -> std::vector<decltype(generate_shape())>{
    std::vector<decltype(generate_shape())> result;
    result.reserve(num_of_shapes);

    for(size_t i=0; i < num_of_shapes; ++i){
    result.push_back(generate_shape());
    }

    return result;
    }



    int main(int, char**) {
    TimeMeasure tm;

    auto vtbl_shapes = generate_shapes(generate_shape_base);
    auto union_shapes = generate_shapes(generate_shape_union);
    auto variant_shapes = generate_shapes(generate_shape_variant);
    auto variant_opt_shapes = generate_shapes(generate_shape_variant_opt);
    auto vtbl_opt_shapes = generate_shapes(generate_shape_base_opt);

    tm.begin();
    f32 vtlb_area = TotalAreaVTBL(vtbl_shapes.size(), vtbl_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBL area: " << vtlb_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 vtlb4_area = TotalAreaVTBL4(vtbl_shapes.size(), vtbl_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBL4 area: " << vtlb4_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 switch_area = TotalAreaSwitch(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaSwitch area: " << switch_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 switch4_area = TotalAreaSwitch4(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaSwitch4 area: " << switch4_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 union_area = TotalAreaUnion(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaUnion area: " << union_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 union4_area = TotalAreaUnion4(union_shapes.size(), union_shapes.data());
    tm.end();
    std::cout << "TotalAreaUnion4 area: " << union4_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 variant_area = TotalAreaVariant(variant_shapes.size(), variant_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant area: " << variant_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 variant4_area = TotalAreaVariant4(variant_shapes.size(), variant_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant4 area: " << variant4_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 variant_opt_area = TotalAreaVariant(variant_opt_shapes.size(), variant_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant opt area: " << variant_opt_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 variant4_opt_area = TotalAreaVariant4(variant_opt_shapes.size(), variant_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVariant4 opt area: " << variant4_opt_area << " time: " << tm.ms() << " ms \n";


    tm.begin();
    f32 vtlb_opt_area = TotalAreaVTBL(vtbl_opt_shapes.size(), vtbl_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBLOpt area: " << vtlb_opt_area << " time: " << tm.ms() << " ms \n";

    tm.begin();
    f32 vtlb4_opt_area = TotalAreaVTBL4(vtbl_opt_shapes.size(), vtbl_opt_shapes.data());
    tm.end();
    std::cout << "TotalAreaVTBLOpt4 area: " << vtlb4_opt_area << " time: " << tm.ms() << " ms \n";
    }