The previous article explained how to factorize an integer N at compile-time using variadic templates from C++11. This article gives benchmarks of the implementation using different compilers. Not only limits of the tested compilers are touched, but the optimal strategy has been selected for possible practical applications.
The algorithm constructs the trial factors in the form qk+r, where q is a multiple of a few first primes q=2*…*p and r is 1 or an integer greater than p and less than q not divisible by 2,…,p. The formula qk+r contains all the primes greater than q and significantly reduces the amount of trial factors: 2k+1 represents 50% of all integers; 6k+r (r=1,5) – 33%; 30k+r – 27%; 210k+r – 23%; 2310k+r – 21%, etc. Theoretically, the optimal algorithm should start with 2k+1 for k=1,2, then switch to 6k+r and so on, reducing the number of trials. However, the set of reminders r grows very quickly, while the percentage of the resulted integers falls down slowly. From the template deepness point of view, a higher q reduces the number of template instantiations by the next prime factor p (q=2*…*p). But in every such instantiation a loop over all reminders r (in the worst case) for the current q must be executed. Note also, that the generation of the list of reminders is also implemented at compile-time and takes non-negligible additional time for the formulas 210k+r and 2310k+r.
The benchmarks below identify a compromise between q and the number of reminders r: the formula 210k+r shows the shortest compilation time for large integers on all tested compilers. There are two implementation (both are available in the github repository):
1) Using C++11 variadic templates. This implementation is described in details in the previous article.
2) Using compile-time Typelists from Loki library by A. Alexandrescu, introduced in his book Modern C++ design. This implementation will work without C++11 support too.
In many cases, excepting Intel C++ compiler only, the second implementation compiles faster, because it applies insertions in the head of a Typelist only, which has constant complexity. However, insertions in the head of variadic templates implemented as a general concatenation of template parameters from two variadic templates.
To hold the benchmarks short and clear, only three ‘hard’ numbers are tested:
1) 65521 is the largest prime number of type unsigned short (2 bytes). Trial factorization of this number takes only a fraction of a second with the best formula 210k+r.
2) 4294967291 is the largest prime number of type unsigned int (4 bytes). Surprisingly, trial factorization of this large number at compile-time takes only a few seconds. That means also, that all other prime or composite numbers in the range of unsigned int should take less compile-time.
3) 18446744073709551615 = std::numeric_limits<unsigned long>::max() = 3 * 5 * 17 * 257 * 641 * 65537 * 6700417 is the largest integer of 8 bytes (64 bit unsigned long or __int64), composite having big prime factors. Its factorization compile-time is close to the case (2).
More complicated cases coming from the range of unsigned long and above std::numeric_limits<unsigned int>::max() should be either prime or have prime factors greater than 65521. Those numbers reach compiler or hardware limits.
If a table below is missing an entry, that means the compiler did not succeed and exited with an error. All tables show compilation time in seconds.
GCC
Command line: g++ -O3 -std=c++11 -ftemplate-depth-300000
This is my default compiler and the benchmarks were selected according to its capabilities. For complicated cases with prime factors more than 65521, g++ takes a few minutes time and a lot of RAM, exiting with out-of-memory, if insufficient.
Compile-time in [s] for | g++ 4.8.3 | 6k + r | 30k + r | 210k + r | 2310k + r |
---|---|---|---|---|---|
Variadic templates | 65521 | 0.4 | 0.4 | 0.4 | 9.1 |
4294967291 | 10.0 | 3.5 | 3.2 | 80.2 | |
18446744073709551615 | 10.0 | 3.4 | 3.3 | 81.8 | |
Loki::Typelist | 65521 | 0.3 | 0.4 | 0.4 | 1.9 |
4294967291 | 8.0 | 2.4 | 1.8 | 17.0 | |
18446744073709551615 | 8.0 | 2.5 | 1.8 | 17.0 |
Clang
Clang with the same compiler switches is significantly faster, but don’t want to take time and available resources. On big enough numbers, it crashes after only a few seconds.
Compile-time in [s] for | Clang 3.4.2 | 6k + r | 30k + r | 210k + r | 2310k + r |
---|---|---|---|---|---|
Variadic templates | 65521 | 0.5 | 0.4 | 0.4 | 5.1 |
4294967291 | - | - | 2.2 | 10.4 | |
18446744073709551615 | - | - | 2.2 | 10.4 | |
Loki::Typelist | 65521 | 0.4 | 0.4 | 0.4 | 1.3 |
4294967291 | - | - | 1.7 | 2.4 | |
18446744073709551615 | - | - | 1.7 | 2.4 |
Intel C++
With the same switches, Intel compiler takes more time than both GCC and Clang, but can handle not more cases than Clang. It is also the only one, that compiles variadic templates implementation faster than one with Typelists.
Compile-time in [s] for | Icpc 14.0.2 | 6k + r | 30k + r | 210k + r | 2310k + r |
---|---|---|---|---|---|
Variadic templates | 65521 | 0.6 | 0.5 | 0.6 | 111.5 |
4294967291 | - | - | 6.6 | 305.2 | |
18446744073709551615 | - | - | 6.5 | 306.4 | |
Loki::Typelist | 65521 | 0.6 | 0.6 | 0.8 | 420.7 |
4294967291 | - | - | 12.5 | 1441.5 | |
18446744073709551615 | - | - | 12.5 | 1436.9 |
MSVC 12 (Visual Studio 2013)
Under Windows 7 x64, MSVC 12 defines unsigned long as a 4-byte integer. To try 8 byte integers, I took __int64. No compiler switches were used explicitly, because there is no setting of template depth and c++11 features are supported by default. According to Microsoft, the template depth is fixed to 2048, but the formula of the last column could not compile completely, though the number of reminders 483 is not so high.
Compile-time in [s] for | MSVC 12 | 6k + r | 30k + r | 210k + r | 2310k + r |
---|---|---|---|---|---|
Variadic templates | 65521 | 0.4 | 0.67 | 0.8 | - |
4294967291 | - | - | 43.2 | - | |
18446744073709551615 | - | - | 42.7 | - | |
Loki::Typelist | 65521 | 0.7 | 0.7 | 0.8 | - |
4294967291 | - | - | 42.2 | - | |
18446744073709551615 | - | - | 41.8 | - |
Conclusion
As the benchmarks prove, the described and provided implementation of compile-time factorization is reliable for usage in the common range of integers up to 4 bytes. The compilation will take a couple of seconds in worst case with any of tested compilers.
Get the source code to the article
On September 15, 2015, I gave a talk about this topic at C++ Meetup in Berlin. The slides are published here