Talos 2 performance evaluated in 2018-2019

Let’s take a trip back in time … Before I had my own Talos 2 machine, some benchmarks were published on Phoronix in 2018 and 2019. I remembered that some benchmarks gave good results but that a few of them were not at the expected level of performance. On the other hand, with benchmark reports scattered here and there and using various configurations, I had forgotten which machines the Talos 2 was competing against.

Also, after the initial articles, some fixes were made very quickly and proposed to the projects concerned. There is an article, Improving performance of Phoronix benchmarks on POWER9, that analyzed some benchmarks run in the initial Phoronix article, focusing only on benchmarks that did not perform well on Talos 2: LBM Parboil, x264 video encoding, Primesieve, LAME, FLAC, OpenSSL, Scikit-Learn and Blender. The article describes the situation for each of them and suggests changes. Note that the changes were sometimes obvious: for example, it turned out that one benchmark was missing an optimization option!

Remember that some of the Phoronix benchmarks showed the Talos 2 performing well, for example Stockfish, LLVM Compilation, 7zip and Zstd compression, TinyMembench and Postgresql. We will come back to that in detail.

TODO: Finalize: So, it was time to refresh and summarize all that. In the end, the investigation will show whether there are still some benchmarks worth looking at for fixes or improvements.

Compared hardware

I started by reading all the articles and, thanks to the benchmark identifier given in some of them, I was able to run these old test suites to get a snapshot on my own configuration, see what works and what does not, practice with the phoronix-test-suite tool, etc.

I also collected some remarks from the articles and from their reader comments.

I kept a short list of common systems found in these articles, from the least to the most powerful, in theory:

TODO: Add the year of commercialization

| Processor | Year | Cores/Threads | Base Freq | Max Freq | Cache | TDP | Memory Support |
|---|---|---|---|---|---|---|---|
| Intel Xeon E3-1280 v5 (Skylake) | Q4-2015 | 4/8 | 3.70 GHz | 4.00 GHz | 8 MB | 80 W | DDR4-2133, up to 64 GB |
| Intel Core i9-7980XE | Q3-2017 | 18/36 | 2.60 GHz | 4.20 GHz | 24 MB | 165 W | DDR4-2666, up to 128 GB |
| Intel Xeon Gold 6138 | Q3-2017 | 20/40 | 2.00 GHz | 3.70 GHz | 28 MB | 125 W | DDR4-2666, up to 768 GB |
| AMD EPYC 7551 | Q2-2017 | 32/64 | 2.00 GHz | 3.00 GHz | 64 MB | 180 W | DDR4-2666, octa-channel |
| AMD EPYC 7601 | Q2-2017 | 32/64 | 2.20 GHz | 3.20 GHz | 64 MB | 180 W | DDR4-2666, octa-channel |
| AMD Ryzen Threadripper 2990WX | Q3-2018 | 32/64 | 3.00 GHz | 4.20 GHz | 64 MB | 250 W | DDR4-2933, quad-channel |
| IBM POWER9 (dual 22-core) | Q4-2017 | 44/176 | 2.80 GHz | 3.40 GHz | 120 MB | N/A | DDR4, up to 16 TB |

So, still in theory, the big POWER9 configurations should compete with (and even beat) all these systems except the 2 x Xeon Gold 6138.

Results of old benchmarks (2018 and 2019)

This section compares the Talos 2 with different machines and also with variants of the Talos 2. Note that each list is sorted by increasing performance (the best machine comes last).
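Since some tests are "lower is better" and others "higher is better", the sorting rule is easy to get wrong by hand. Here is a minimal sketch of the convention; the function name is mine, and the two small data sets are just excerpts of the 7-Zip and FLAC listings found further down.

```python
def sort_worst_to_best(results, higher_is_better):
    """Return (machine, score) pairs ordered from worst to best."""
    # "Higher is better": an ascending sort puts the best machine last.
    # "Lower is better": the sort has to be descending instead.
    return sorted(results.items(), key=lambda item: item[1],
                  reverse=not higher_is_better)


# Excerpts of two listings of this article.
seven_zip_mips = {"AMD EPYC 7601": 99574, "Talos II 2 x 22c POWER9": 162969}
flac_seconds = {"AMD EPYC 7601": 11.79, "Talos II 2 x 22c POWER9": 51.79}

for machine, score in sort_worst_to_best(seven_zip_mips, higher_is_better=True):
    print(f"{machine:28} {score}")
for machine, score in sort_worst_to_best(flac_seconds, higher_is_better=False):
    print(f"{machine:28} {score}")
```

In both cases the last printed line is the best machine for that test.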

pts/build-gcc-1.0.0

Timed GCC Compilation 7.2:

On my Talos 2, this old version fails. At installation, there is a message No rule to make 'defconfig', and then running the test gives:

    pts/build-gcc-1.0.0 [Time To Compile]
        E: ../.././gcc/match.pd:120:1 error: expected (, got NAME

So below are only results provided by Phoronix:

Test: Time to compile

    Talos II 2 x 22c POWER9 1070.70
    AMD EPYC 7551            926.08
    AMD EPYC 7601            707.34
    2 x Xeon Gold 6138       591.32

Phoronix wrote “Keep in mind the Talos II Secure Workstation was limited to a slow hard drive for this initial testing, but there are some build time references for those curious about the potential of Talos II serving as a POWER build platform.” With the provided results, let’s say that the Talos 2 is rather close to the EPYC 7551.

Multi-threaded: YES Verdict: AVERAGE To do:

pts/build-llvm-1.1.0

Timed LLVM Compilation 6.0.1:

Test: Time To Compile Seconds < Lower Is Better

    Talos 2 Power9 2x 4c                   535.10
    Talos II POWER9 Dual 4-Core            354.23
    AMD EPYC 7551                          247.00
    AMD EPYC 7601                          236.00
    Core i9 7980XE                         227.00
    Threadripper 2990WX                    221.00
    Talos II 2 x 22c POWER9                183.00
    AMD EPYC 7601                          171.58
    2x EPYC 7601                           149.00
    Talos II POWER9 Dual 18-Core           141.79
    2 x Intel Xeon Gold 6138               127.08

There are some strange results: for example, two very different results for the Talos 2 dual 4-core model, two different results for the EPYC 7601 … and also a Talos 2 that does better with dual 18-core than with dual 22-core processors (maybe due to the slow drive mentioned in the build-gcc test?).

Anyway, let’s say that the high-end Talos 2 models are in the same range as the Threadripper 2990WX and AMD EPYC 7601. We will see in another article what happens when running the same benchmarks in their recent versions.

Multi-threaded: YES Verdict: GOOD To do:

pts/compress-7zip-1.7.1

7-Zip Compression 16.02:

Test: Compress Speed Test MIPS > Higher Is Better

    Talos 2 Power9 2x 4c           40043
    AMD EPYC 7551                  79708
    Threadripper 2990WX            85484
    Core i9 7980XE                 95662
    AMD EPYC 7601                  99574
    2 x Intel Xeon Gold 6138      143505
    Talos II POWER9 Dual 18-Core  158405
    Talos II 2 x 22c POWER9       162969

Phoronix comment: “The 7-Zip compression performance was doing very well on the POWER9 hardware with the 22-core Talos II Lite was outperforming the 32 core EPYC 7601 processor and the dual 18-core Talos II system was outperforming the dual Xeon Gold 6138 Tyan server. 7-Zip is another workload that always scales well including with SMT systems and here the 176 threads of the Talos II paid off well for this compression test.”

A comment says: “Is there a reason why Rodinia is only ‘-O2’ (not ‘-O3’ like everything else), and for 7Zip, it seems no compile optimization at all? (Also, to make best use of the POWER9 processor, use ‘-mcpu=power9’).” That may explain discrepancies in results. However, I did set optimization options and they brought nothing in terms of performance.

Multi-threaded: YES Verdict: GOOD To do: Check optimization options and results in a recent version of the benchmark

pts/compress-zstd-1.0.0

Zstd Compression 1.3.4:

Test: Compressing ubuntu-16.04.3-server-i386.img, Compression Level 19 Seconds < Lower Is Better

    EPYC 7601                      163.35
    Talos 2 Power9 2x 4c           134.46 
    2 x Xeon Gold 6138             117.96
    Talos II POWER9 Dual 4 Core    109.09
    Talos II POWER9 Dual 18-Core   106.94

Phoronix comment: “POWER9 was performing extremely well in the Zstd compression benchmark. The Xeon systems were outperforming the EPYC hardware in the Zstd benchmark while the POWER9 hardware managed to beat out the Intel x86_64 CPUs in this single-thread test case.”

Multi-threaded: NO Verdict: GOOD To do:

pts/c-ray-1.1.1

C-Ray - 4K 16 Rays Per Pixel pts/c-ray-1.1.1 Seconds < Lower Is Better

    Talos 2 Power9 2x 4c       9.49
    Raptor Talos II            4.65
    AMD EPYC 7601              3.46
    2 x Intel Xeon Gold 6138   3.15

Optimization does not change anything on my machine. I don’t know what configuration “Raptor Talos II” refers to … That makes the comparison too difficult, so I ran pts/c-ray-1.2.0 and obtained these results:

    Talos 2 Power9 2x 4c                83.99
    Core i9 7980XE                      33.51
    2 x Xeon Gold 6138                  27.16
    AMD EPYC 7601                       25.36
    AMD EPYC 7601                       21.97 -march=native
    Talos II 2 x 22c POWER9             19.14 -mcpu=power9 -mtune=power9
    Threadripper 2990WX                 17.97

Phoronix comment: “The C-Ray ray-tracing performance of the Talos II was in line with the AMD Ryzen Threadripper 2990WX but had come up shy of the dual EPYC 7601 server.”

Multi-threaded: YES Verdict: GOOD To do:

pts/encode-flac-1.6.0

FLAC Audio Encoding 1.3.2:

Test: WAV To FLAC Seconds < Lower Is Better

    Talos II 2 x 22c POWER9        51.79
    Talos II POWER9 Dual 4 Core    43.99
    Talos II POWER9 Dual 18-Core   43.95
    Talos II POWER9 Dual 4-Core    40.23
    AMD EPYC 7551                  12.71
    AMD EPYC 7601                  11.79
    2 x Intel Xeon Gold 6138       10.27
    Xeon E3-1280 v5                 9.60

Optimization options do not change anything.

Phoronix comment: “For for audio encoding with FLAC and MP3 is another one of the areas where the POWER9 CPU performance is behind, but could possibly be improved with maturing POWER9 compiler support.”

The project lacks SIMD code for POWER. A patch series was written and integrated in FLAC 1.3.3; it improved the performance by a factor of 3.

Multi-threaded: NO Verdict: BAD To do: Check suggested improvements and measure their benefit

pts/encode-mp3-1.7.0

LAME MP3 Encoding 3.100:

Test: WAV To MP3 Seconds < Lower Is Better

    Talos II POWER9 Dual 4-Core    75.57
    Talos II 2 x 22c POWER9        75.27
    Talos II POWER9 Dual 18-Core   67.48
    AMD EPYC 7551                  45.69
    AMD EPYC 7601                  42.67
    2 x Xeon Gold 6138             32.29
    Xeon E3-1280 v5                30.14

Results show that performance on POWER is very bad!

In the sthbrx article, it is said: “Due to configure options not being parsed correctly this benchmark is built without any optimisation regardless of architecture. We see a massive speedup by turning optimisations on, and a further 6-8% speedup by enabling USE_FAST_LOG (which is already enabled for Intel)”. It concludes on a 5x speedup. See the dedicated article for details; it mentions that the obtained speedup is 7x! On my own machine, optimization options brought the time down to more or less 15 seconds, which is confirmed by a recent version of the benchmark.

Multi-threaded: NO Verdict: BAD To do: Check proposed improvements have been integrated in the official project

pts/openssl-1.11.0

OpenSSL 1.1.1:

Test: RSA 4096-bit Performance Signs Per Second > Higher Is Better

    Talos 2 Power9 2x 4c                    1616.9
    Talos II dual 18-core                   3971.9
    AMD EPYC 7551                           4387.4
    AMD EPYC 7601                           4598.4
    Core i9 7980XE                          4686.0
    Threadripper 2990WX                     5821.0 
    2 x Intel Xeon Gold 6138                7965.4

Phoronix comment: “The OpenSSL results also have a ways to improve, the performance on POWER9 was mixed against the AMD/Intel CPUs with the dual 18-core system failing to outperform the EPYC 7601.”

OpenSSL 1.1.0f did not include some improvements that already existed in the mainline. Updating the version used should improve the performance on Power9 by a factor of 1.7.

Multi-threaded: YES Verdict: BAD To do:

pts/parboil-1.1.2

Parboil v2.5

Test: OpenMP LBM

    Talos II POWER9 Dual 4 Core   113.43
    AMD EPYC 7551                  71.88
    Talos II 2 x 22c POWER9        66.08
    Talos II POWER9 Dual 18-Core   44.18
    AMD EPYC 7601                  37.39
    2 x Xeon Gold 6138             30.18 

Phoronix comment: “With the Lattice-Boltzmann Method Fluid Dynamics test case, the dual 18-core POWER9 configuration was competing with the EPYC CPUs in this round of OpenMP benchmarking.”

From sthbrx article: “Also this benchmark is compiled without any optimisation. Recompiling with -O3 improves the results 3.2x on POWER9.”

Test: OpenMP CUTCP

    Talos II 2 x 22c POWER9  9.69
    AMD EPYC 7551            2.76
    AMD EPYC 7601            2.61
    2 x Xeon Gold 6138       2.28 

Phoronix comment: “Some tests like this Distance-Cutoff Coulombic Potential test appear just not well optimized for POWER9 at this point.”

To check: whether the changes have been included in a recent version of the benchmark, and whether the 3x speedup applies.

Test: OpenMP Stencil

    AMD EPYC 7551           17.35
    AMD EPYC 7601           14.26
    Talos II 2 x 22c POWER9 10.51
    2 x Xeon Gold 6138       6.01 

Phoronix comment: “While in the stencil test, the Talos II system beat out both AMD EPYC systems and was mid-way to the performance of the dual Xeon Gold server.”

Multi-threaded: ? Verdict: AVERAGE To do: Focus on the OpenMP CUTCP test, which does not seem to be optimized for POWER9.

pts/phpbench-1.1.5

PHPBench 0.8.1: pts/phpbench-1.1.5 [PHP Benchmark Suite] Score > Higher Is Better

    Talos II POWER9 Dual 4-Core    166406
    AMD EPYC 7551                  365767
    Talos II POWER9 Dual 18-Core   373681
    AMD EPYC 7601                  393659
    Threadripper 2990WX            525276 
    2 x Xeon Gold 6138             606341 
    Xeon E3-1280 v5                651532 
    Core i9 7980XE                 703666 

Phoronix comment: “The Python and PHP benchmarks also show room for single-threaded performance improvements. POWER9 only came in line with the AMD EPYC hardware for the PHP language performance.”

Optimization options did not bring any visible enhancements.

Multi-threaded: NO Verdict: AVERAGE To do: Identify the source of the problem (but who would want to work on PHP?)

pts/povray-1.2.1

POV-Ray 3.7.0.7:

Test: Trace Time

    Talos 2 2x 4c              93.57
    Core i9 7980XE             28.29
    Talos II 2 x 22c POWER9    25.28
    AMD EPYC 7551              23.01
    AMD EPYC 7601              22.61
    2x Xeon Gold 6138          19.02
    Threadripper 2990WX        17.92

Even with the benefit of multi-threading, the best Talos 2 system does not reach the performance of the AMD EPYC systems.

Multi-threaded: YES Verdict: BAD To do: Investigate …

pts/primesieve-1.4.1

Primesieve 6.2:

Test: 1e12 Prime Number Generation Seconds < Lower Is Better

    Talos II POWER9 Dual 4-Core    44.84 
    Talos II POWER9 Dual 18-Core   18.81
    Talos II 2 x 22c POWER9        16.42
    EPYC 7551                      12.93
    EPYC 7601                      12.15
    2 x Xeon Gold 6138             10.63

The nominal results on POWER are not convincing, showing lower performance than AMD EPYC systems.

Following a pull request by Anton Blanchard, the author understood the issue and made changes. To be checked in a recent version of the benchmark.

Multi-threaded: YES Verdict: BAD To do: Check if changes proposed by the author have a positive and measurable impact

pts/pybench-1.1.2

PyBench 2018-02-16:

Test: Total For Average Test Times Milliseconds < Lower Is Better

    Talos II 2 x 22c POWER9       4088 
    Talos 2 Power9 2x 4c          3671 
    Talos II POWER9 Dual 18-Core  1867
    EPYC 7601                     1538
    Threadripper 2990WX           1147
    2 x Xeon Gold 6138            1127
    Xeon E3-1280 v5               1043
    Core i9 7980XE                 955

Note that I also collected results that are not really the same:

    Talos II 2 x 22c POWER9       4859
    AMD EPYC 7551                 2216
    AMD EPYC 7601                 2086
    2 x Intel Xeon Gold 6138      1395

Anyway, that does not change the order: Python on Power9 systems is 2 or 3 times slower than on x86-64 machines (2 times slower than EPYC based systems).
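To put figures on "2 or 3 times slower", the ratios can be computed directly from the two listings above. This is only a small sketch: the function name is mine, and I kept just a few machines from each listing.

```python
# PyBench totals in milliseconds (lower is better), copied from the
# two listings above.
first_listing = {
    "Talos II 2 x 22c POWER9": 4088,
    "EPYC 7601": 1538,
    "2 x Xeon Gold 6138": 1127,
    "Core i9 7980XE": 955,
}
second_listing = {
    "Talos II 2 x 22c POWER9": 4859,
    "AMD EPYC 7551": 2216,
    "AMD EPYC 7601": 2086,
    "2 x Intel Xeon Gold 6138": 1395,
}


def slowdown_factors(times_ms, reference):
    """How many times slower `reference` is than each other machine."""
    reference_time = times_ms[reference]
    return {machine: reference_time / time
            for machine, time in times_ms.items() if machine != reference}


for listing in (first_listing, second_listing):
    factors = slowdown_factors(listing, "Talos II 2 x 22c POWER9")
    for machine, factor in factors.items():
        print(f"Talos II 2 x 22c is {factor:.2f}x slower than {machine}")
```

With these figures, the factor goes from a bit more than 2 (against the EPYC 7551 in the second listing) to a bit more than 4 (against the Core i9 7980XE in the first one).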

Multi-threaded: NO Verdict: VERY BAD To do: Investigate

pts/osbench-1.0.1

Test: Create Threads us Per Event < Lower Is Better

    AMD EPYC 7551              38.25 
    AMD EPYC 7601              30.71 
    Talos 2 Power9 2x 4c       27.28
    Raptor Talos II            27.17 
    2 x Intel Xeon Gold 6138   23.07

Test: Create Processes us Per Event < Lower Is Better

    Talos 2 Power9 2x 4c       74.33
    AMD EPYC 7601              59.61
    AMD EPYC 7551              57.95
    2 x Intel Xeon Gold 6138   42.95
    Raptor Talos II            29.77

Test: Memory Allocations Ns Per Event < Lower Is Better

    AMD EPYC 7551              96.32
    2 x Intel Xeon Gold 6138   96.05
    AMD EPYC 7601              95.14
    Talos 2 Power9 2x 4c       94.70
    Raptor Talos II            83.03

Phoronix comment: “While lastly for now are the OSBench synthetic operating system benchmarks with the Raptor Talos II doing well against the EPYC and Xeon platforms.”

Talos 2 performs very well, close to the 2 x Intel Xeon Gold 6138 or even better!

On my model, adding optimization options changed only the Create Processes test, with a better score of 49 instead of 74.

Multi-threaded: N/A Verdict: GOOD To do: Check optimization options, they improved greatly the test Create Processes

pts/pgbench-1.8.4

PostgreSQL pgbench 10.3:

Test: Scaling: Buffer Test - Test: Normal Load - Mode: Read Only TPS > Higher Is Better

    Talos 2 Power9 2x 4c            11110
    Xeon E3-1280 v5                116058
    Talos 2 Power9 2x 4c optim     159835
    Talos II POWER9 Dual 4 Core    222683
    EPYC 7601                      399625
    Talos II Lite POWER9 22 Core   442106 
    Threadripper 2990WX            472250
    Talos II 2 x 22c POWER9        544186 -mcpu=power9 -mtune=power9 (-march=native on x86_64)
    Talos II POWER9 Dual 18-Core   574297 
    2 x Xeon Gold 6138             587539 

Test: Scaling: Buffer Test - Test: Normal Load - Mode: Read Write TPS > Higher Is Better

    Talos 2 Power9 2x 4c              542
    Xeon E3-1280 v5                  3803
    Talos II POWER9 Dual 4 Core      6381
    Talos II POWER9 Dual 18-Core     6451
    EPYC 7601                        6473
    2 x Xeon Gold 6138               6588
    Talos II POWER9 Dual 4-Core     14457
    Talos 2 Power9 2x 4c optim      14507

Optimization clearly boosts the performance!

Phoronix comment: “The dual 18-core POWER9 system was managing to compete with the dual Xeon Gold server for the PostgreSQL database benchmarking.”

Multi-threaded: YES Verdict: GOOD To do: Check optimization options, that provide a boost

pts/redis-1.1.0

Test: GET Requests Per Second > Higher Is Better

    Talos 2 Power9 2x 4c        904977
    Raptor Talos II            1049994
    Talos 2 Power9 2x 4c optim 1053740
    AMD EPYC 7601              1703353
    2 x Intel Xeon Gold 6138   2515784

Test: SET Requests Per Second > Higher Is Better

    Talos 2 Power9 2x 4c optim  553403
    Raptor Talos II             606874
    Talos 2 Power9 2x 4c        615384
    AMD EPYC 7601              1195935
    2 x Intel Xeon Gold 6138   1744256

There is almost no CPU activity!

Multi-threaded: NO Verdict: BAD To do: Check CPU activity

pts/rodinia-1.2.2

Rodinia - OpenMP LavaMD (pts/rodinia-1.2.2): I had a problem installing the opencl packages, and also a checksum problem on the rodinia_2.4.tar.bz2 archive.

    AMD EPYC 7601                13.26
    Talos II 2 x 22c POWER9      13.22
    AMD EPYC 7551                12.71
    2 x Intel Xeon Gold 6138      7.02

Not many results collected so let’s base our opinion on Phoronix comment: “First up was the Rodinia OpenMP benchmark where the Talos II with dual 22-core processors (44 cores / 176 threads) had the performance aligned with the Core i9 7980XE, which in turn were behind the AMD Ryzen Threadripper 2 WX series performance. With the Parboil and Rodinia scientific tests, the dual 22-core POWER9 system was just behind the EPYC 7551 for performance.”

A comment says: “Is there a reason why Rodinia is only ‘-O2’ (not ‘-O3’ like everything else), and for 7Zip, it seems no compile optimization at all? (Also, to make best use of the POWER9 processor, use ‘-mcpu=power9’).”

Multi-threaded: ? Verdict: AVERAGE To do: Try a recent version and check optimization options

pts/rust-prime-1.0.0

Rust Prime Benchmark:

Test: Prime Number Test To 200,000,000 Seconds < Lower Is Better

    Talos 2 Power9 2x 4c             13.71
    Talos 2 Power9 2x 4c optim       13.64
    Threadripper 2990WX              12.49
    Core i9 7980XE                    8.18
    2 x Intel Xeon Gold 6138          4.48
    Talos II 2 x 22c POWER9           3.64

Phoronix comment: “Rustlang performance is looking good on POWER9. The Rust Mandelbrot benchmark performed poorly with POWER9, but that certainly wasn’t the case with the Rustlang Prime benchmark.”

Multi-threaded: YES Verdict: GOOD To do: Run Rust Mandelbrot benchmark that behaves poorly, in addition to Prime benchmark

pts/scikit-learn-1.0.1

It failed to install on my machine:

Scikit-Learn 0.17.1:
    pts/scikit-learn-1.0.1
        The test quit with a non-zero exit status.
        E: ModuleNotFoundError: No module named 'sklearn.externals.six'

So I got results from only one source:

    Talos II 2 x 22c POWER9        229.62
    Talos II POWER9 Dual 18-Core   227.39
    2 x Intel Xeon Gold 6138       176.07
    EPYC 7601                      144.51

Phoronix comment: “The SciKit-Learn performance could also be better improved for POWER9, possibly via further software optimizations.”

In the sthbrx article, it is said that the benchmark uses libblas, which is a basic implementation among several and has no optimization for POWER9. Alternative libraries bring major speedups.

Multi-threaded: ? Verdict: BAD To do: Run a more recent version of the benchmark and analyze

pts/stockfish-1.1.1

v2014-11-26

Test: Total Time Nodes Per Second > Higher Is Better

    Talos II POWER9 Dual 4 Core    21485986
    Core i9 7980XE                 46289588
    EPYC 7601                      58469775 
    Threadripper 2990WX            67300757 
    2 x Xeon Gold 6138             69928856 
    Talos II POWER9 Dual 18-Core   73165064 
    Talos II 2 x 22c POWER9        79137127 
    2 x EPYC 7601                 100932062

I don’t remember where I found other results with other metrics but they showed Talos 2 between both EPYC models:

    AMD EPYC 7551                      5032    -msse -msse3 -mpopcnt
    Talos II 2 x 22c POWER9            4915    -mcpu=power9 -mtune=power9
    AMD EPYC 7601                      4474    -msse -msse3 -mpopcnt
    2 x Xeon Gold 6138                 3343    -msse -msse3 -mpopcnt

Phoronix comment: “The Stockfish chess benchmark was running very well on POWER9 where the 22-core Talos II Lite was just behind the EPYC 7601, the dual quad-core POWER9 system well ahead of the other quad and octa core Intel Xeons, and the dual 18-core box outperforming the Xeon Gold 6138 by a small margin.”

Phoronix comment: “With the multi-threaded Stockfish chess benchmark using pthreads, the dual socket POWER9 system came up short of the dual EPYC 7601 Dell PowerEdge server.”

This last comment seems to mean that the Talos 2 performs very well, but as the other results put it between the two EPYC models, and also because the dual 4-core model has poor results, I chose to give it an average score.

Multi-threaded: YES Verdict: AVERAGE To do: To confirm heterogeneous results on a recent version of the benchmark, look at optimization options

pts/tinymembench-1.0.1

Tinymembench 2018-05-28:

Test: Standard Memcpy MB/s > Higher Is Better

    2 x Xeon Gold 6138              6015.50
    Talos 2 Power9 2x 4c default   10662.90
    Talos 2 Power9 2x 4c optim     10676.10
    Talos II POWER9 Dual 4 Core    12418.40
    EPYC 7601                      12613.20
    Xeon E3-1280 v5                12877.90
    Talos II POWER9 Dual 18-Core   14515.40
    Talos II 2 x 22c POWER9        15453.00

Phoronix comment: “The Tinymembench performance on POWER9 was looking good for memory copy speed.”

Surprisingly, the 2 x Xeon Gold 6138 system loses this benchmark and … Talos 2 wins! This slow-running benchmark showed no improvement when compiled with optimization options.

Multi-threaded: NO Verdict: GOOD To do: Nothing

pts/x264-2.3.2

x264 2018-02-05:

Test: H.264 Video Encoding Frames Per Second > Higher Is Better

    Talos II POWER9 Dual 4-Core    29.14
    Xeon E3-1280 v5                42.24
    Talos II 2 x 22c POWER9        43.72
    Talos II POWER9 Dual 18-Core   51.22
    AMD EPYC 7551                 101.52
    2 x Xeon Gold 6138            125.21 
    EPYC 7601                     126.39 

Phoronix comment: “The x264 video encoding program is one test showing it’s not too well optimized right now for POWER9”. And a bit later: “Similar to our first POWER9 benchmarking session back in April, the x264 performance and more broadly the multimedia CPU performance on POWER9 still could be much better optimized. The POWER9 performance was quite low compared to the x86_64 competition.”

Not an easy project to improve.

Multi-threaded: YES Verdict: BAD To do: That would require huge efforts …

system/octave-benchmark-1.0.0

GNU Octave Benchmark 4.4.1:

    2 x Xeon Gold 6138        23.47
    AMD EPYC 7551             22.66
    AMD EPYC 7601             20.92
    Threadripper 2990WX       16.78 
    Talos II 2 x 22c POWER9   14.92

Phoronix comment: “With the GNU Octave software as a MATLAB performance, the Talos II squeezed in front of the Threadripper systems for this single-core test.”

Multi-threaded: YES Verdict: GOOD To do:

system/blender-1.0.2

Blender 2.79:

Test: Blend File: Classroom - Compute: CPU-Only

    Xeon E3-1280 v5                1656.82
    Talos II POWER9 Dual 4-Core    1391.41
    Talos II POWER9 Dual 18-Core    829.10
    EPYC 7601                       504.13
    2 x Xeon Gold 6138              415.84 

Test: Blend File: Pabellon Barcelona - Compute: CPU-Only

    Talos II POWER9 Dual 4-Core    3057.45
    Xeon E3-1280 v5                1745.08
    Talos II POWER9 Dual 18-Core   1354.60
    EPYC 7601                       972.62
    2 x Xeon Gold 6138              787.52 

Phoronix comment: “The Blender 3D modeling performance on the CPU also leaves more room for optimization on the POWER9 front.”

“failed to use more than 15 threads, even when “-t 128” was added to the Blender command line”

Multi-threaded: YES Verdict: BAD To do: Check if the project is fixed to use more than 16 threads

Conclusion

That was a lot of work for me … with the risk of reaching a conclusion almost identical to that of the very first Phoronix article. Anyway, it allowed me to dive into these topics and to reconnect with my too-long-abandoned Talos 2. The summary below will give a direction for the next step.

| Benchmark | MT | Verdict | To do |
|---|---|---|---|
| pts/build-gcc-1.0.0 | YES | AVERAGE | |
| pts/build-llvm-1.1.0 | YES | GOOD | |
| pts/compress-7zip-1.7.1 | YES | GOOD | Check optim options and results in recent version of the benchmark |
| pts/compress-zstd-1.0.0 | NO | GOOD | |
| pts/c-ray-1.1.1 | YES | GOOD | |
| pts/encode-flac-1.6.0 | NO | BAD | Check suggested improvements and measure their benefit |
| pts/encode-mp3-1.7.0 | NO | BAD | Check proposed improvements integrated in the official project |
| pts/openssl-1.11.0 | YES | BAD | |
| pts/parboil-1.1.2 | ? | AVERAGE | Focus on OpenMP CUTCP, which does not seem to be optimized for POWER9 |
| pts/phpbench-1.1.5 | NO | AVERAGE | Identify the source of the problem |
| pts/povray-1.2.1 | YES | BAD | Investigate, profile … |
| pts/primesieve-1.4.1 | YES | BAD | Check if changes proposed have a positive and measurable impact |
| pts/pybench-1.1.2 | NO | VERY BAD | Investigate |
| pts/osbench-1.0.1 | N/A | GOOD | Check optim options, they greatly improved the Create Processes test |
| pts/pgbench-1.8.4 | YES | GOOD | Check optim options, which provide a boost |
| pts/redis-1.1.0 | NO | BAD | Check CPU activity |
| pts/rodinia-1.2.2 | ? | AVERAGE | Try a recent version and check optimization options |
| pts/rust-prime-1.0.0 | YES | GOOD | Run Mandelbrot, which behaves poorly, in addition to the Prime benchmark |
| pts/scikit-learn-1.0.1 | ? | BAD | Run a more recent version of the benchmark and analyze |
| pts/stockfish-1.1.1 | YES | AVERAGE | Confirm heterogeneous results on a recent version and with optim options |
| pts/tinymembench-1.0.1 | NO | GOOD | Nothing |
| pts/x264-2.3.2 | YES | BAD | That would require huge efforts … |
| system/octave-benchmark-1.0.0 | YES | GOOD | |
| system/blender-1.0.2 | YES | BAD | Check if the project is fixed to use more than 16 threads |
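To get an overall picture, the verdicts of the table above can simply be counted. A minimal sketch, with the verdicts copied by hand from the table:

```python
from collections import Counter

# Verdict per benchmark, copied from the summary table above.
verdicts = {
    "pts/build-gcc-1.0.0": "AVERAGE",
    "pts/build-llvm-1.1.0": "GOOD",
    "pts/compress-7zip-1.7.1": "GOOD",
    "pts/compress-zstd-1.0.0": "GOOD",
    "pts/c-ray-1.1.1": "GOOD",
    "pts/encode-flac-1.6.0": "BAD",
    "pts/encode-mp3-1.7.0": "BAD",
    "pts/openssl-1.11.0": "BAD",
    "pts/parboil-1.1.2": "AVERAGE",
    "pts/phpbench-1.1.5": "AVERAGE",
    "pts/povray-1.2.1": "BAD",
    "pts/primesieve-1.4.1": "BAD",
    "pts/pybench-1.1.2": "VERY BAD",
    "pts/osbench-1.0.1": "GOOD",
    "pts/pgbench-1.8.4": "GOOD",
    "pts/redis-1.1.0": "BAD",
    "pts/rodinia-1.2.2": "AVERAGE",
    "pts/rust-prime-1.0.0": "GOOD",
    "pts/scikit-learn-1.0.1": "BAD",
    "pts/stockfish-1.1.1": "AVERAGE",
    "pts/tinymembench-1.0.1": "GOOD",
    "pts/x264-2.3.2": "BAD",
    "system/octave-benchmark-1.0.0": "GOOD",
    "system/blender-1.0.2": "BAD",
}

counts = Counter(verdicts.values())
for verdict in ("GOOD", "AVERAGE", "BAD", "VERY BAD"):
    print(f"{verdict:9} {counts[verdict]:2d} / {len(verdicts)}")
```

Out of 24 benchmarks, that gives 9 GOOD, 5 AVERAGE, 9 BAD and 1 VERY BAD.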

Future actions will depend on the verdict:

  • GOOD, when Talos 2 is in first place, I will just check that there is no regression.

I saw in the generated web page that other benchmarks run on Talos 2 used -mtune=power9 -mcpu=power9 or even, for Postgresql, -O3 -mtune=power9 -mcpu=power9. The Environment Details section lists:

  • Core i9 7980XE: CXXFLAGS="-O3 -march=native" CFLAGS="-O3 -march=native"
  • Threadripper 2990WX: CXXFLAGS="-O3 -march=native" CFLAGS="-O3 -march=native"
  • Talos II 2 x 22c POWER9: CXXFLAGS="-O3 -mtune=power9 -mcpu=power9" CFLAGS="-O3 -mtune=power9 -mcpu=power9"
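To avoid retyping these flags for each run, the environment can be prepared by a small helper. This is only a sketch under assumptions: the function name is mine, the flag strings are the ones listed in the Environment Details, and I assume that the test installation picks up CFLAGS and CXXFLAGS from the environment, as those Environment Details suggest.

```python
import os
import platform

# Flags as listed in the Environment Details above.
FLAGS_BY_ARCH = {
    "ppc64le": "-O3 -mtune=power9 -mcpu=power9",
    "x86_64": "-O3 -march=native",
}


def build_environment(machine=None):
    """Return a copy of the environment with CFLAGS/CXXFLAGS set for `machine`."""
    machine = machine or platform.machine()
    env = dict(os.environ)
    flags = FLAGS_BY_ARCH.get(machine)
    if flags is not None:
        # Same flags for C and C++, as in the Phoronix runs.
        env["CFLAGS"] = flags
        env["CXXFLAGS"] = flags
    return env


env = build_environment()
print(env.get("CFLAGS", "(no flags defined for this architecture)"))
```

The resulting dictionary can then be passed as the env argument of subprocess.run when launching the test suite. Keep in mind that the flags only matter when the test is built, so an already installed test has to be rebuilt to see any difference.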

This 44-core configuration is intended to compete with the likes of the AMD Threadripper and Intel Core i9 families, and it did manage to do so: in a majority of the benchmarks it came out ahead of the Threadripper 2990WX and Core i9 7980XE.

    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
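Before launching a run, it is worth checking that the governor was really applied to every CPU. A minimal sketch that reads the same sysfs files back; the function name is mine, and on a system without cpufreq support it just reports nothing.

```python
import glob
from collections import Counter

GOVERNOR_GLOB = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"


def governor_summary(pattern=GOVERNOR_GLOB):
    """Count how many CPUs use each scaling governor."""
    counts = Counter()
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                counts[handle.read().strip()] += 1
        except OSError:
            # The file exists but cannot be read.
            counts["unreadable"] += 1
    return counts


summary = governor_summary()
if summary:
    for governor, cpus in summary.items():
        print(f"{governor}: {cpus} CPU(s)")
else:
    print("No cpufreq information found")
```

Any governor other than performance in the output means that the results may vary from one run to another.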

Except for stockfish and osbench, the results are far behind those of the other machines, the AMD EPYC 7551 and AMD EPYC 7601, and even further behind the 2 x Intel Xeon Gold 6138.

Annexes

References of Phoronix articles related to Talos 2:

  • 2018-04-04: https://www.phoronix.com/review/power9-epyc-xeon phoronix-test-suite benchmark 1804049-AR-POWERTALO23
  • 2018-09-25: https://www.phoronix.com/review/power9-talos-2 phoronix-test-suite benchmark 1806251-AR-LINUXCPUS06
  • 2018-11-08: https://www.phoronix.com/review/power9-threadripper-core9 phoronix-test-suite benchmark 1811068-SK-TALOS205952
  • 2018-11-27: https://www.phoronix.com/review/power9-x86-servers
  • 2019-08-19: https://www.phoronix.com/review/rome-power9-arm

Notes for the next article:

About the methodology:

  • Define a set of benchmarks, at least the same as in this article
  • Run the default configuration
  • List everything that still behaves badly
  • Compare the results on my machine, using the initial article as a baseline
  • Check and set optimization options
  • Evaluate effort and potential benefit
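The comparison step of this methodology could be sketched as follows. The function name and the 5% tolerance are arbitrary choices of mine, not something coming from the phoronix-test-suite tool; the figures are the FLAC and MP3 timings of my machine noted just below.

```python
def compare_to_baseline(baseline, current, higher_is_better, tolerance=0.05):
    """Classify a new result against the baseline of the old test profile."""
    change = (current - baseline) / baseline
    # Turn the relative change into a gain: positive means better.
    gain = change if higher_is_better else -change
    if gain > tolerance:
        return "improved"
    if gain < -tolerance:
        return "regressed"
    return "unchanged"


# Seconds, lower is better: (old test profile, recent test profile).
timings = {
    "encode-flac": (40.066, 45.592),
    "encode-mp3": (75.09, 14.309),
}

for name, (old, new) in timings.items():
    print(name, compare_to_baseline(old, new, higher_is_better=False))
```

With these figures, encode-flac is reported as regressed and encode-mp3 as improved, which matches the notes below.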

pts/encode-flac-1.8.1 45.592 instead of 40.066 with pts/encode-flac-1.6.0 !!

Tried configuring with --disable-altivec and the result is about the same: 45.307

pts/encode-mp3-1.7.4 14.309 instead of 75.09 with pts/encode-mp3-1.7.0 !!

Timed LLVM Compilation 16.0: pts/build-llvm-1.5.0 [Build System: Unix Makefiles]

Average: 1169.625 Seconds
Deviation: 1.55%

OpenSSL 3.3: pts/openssl-3.3.0 [Algorithm: RSA4096]

Povray pts/povray-1.2.1:

    Test Installation 1 of 1
    1 File Needed [44.75 MB / 2 Minutes]
    File Found: povray-3.7.0.7.tar.xz [44.75MB]
    Approximate Install Size: 172 MB
    Estimated Install Time: 10 Seconds
    Installing Test @ 23:28:27
        The installer exited with a non-zero exit status.
        ERROR: C compiler cannot create executables
        LOG: ~/.phoronix-test-suite/installed-tests/pts/povray-1.2.1/install-failed.log

[PROBLEM] pts/povray-1.2.1 is not installed.