Skylake-X review: Intel Core i9-7900X and the X299 platform

Introduction Intel's new X-Series consists of i5, i7, and reissued i9 processors, all of which require the same X299 chipset that comes with the LGA2066 socket. The S-series processors will continue to be used with the 200 chipset. In some applications and games, we've encountered performance trends that didn't match our expectations. Considering that Skylake X has a speed advantage due to higher clock speeds as well as new architectural... Intel has reduced the shared Last Level Cache (LLC-L3) and transferred it from an inclusive to a non-inclusive (but exclusive) approach. This was done with the help of an efficient caching algorithm that improves the hit rate of the L2 cache ... The Basin Falls X299 chipset The Kaby Lake-X and Skylake-X processors sit in an LGA2066 socket (R4), powered by an X299 chipset with 6 watts of power, underscoring Intel's strategy of using server chipsets for their HEDT- Li... Why should it always only hit AMD when a change of architecture leads to application-specific "collapses" in the expected performance or, more simply put, the CPU in certain applications simply does not... Ashes of the Singularity: Escalation Because we were just so nice, we continue the high-altitude flight of the overclocked Core i7-6950X, because even with this benchmark the optimization problem described above is very clear:... Grand Theft Auto V (DX11) GTA V restores the old pecking order and also shows two things. First, it's still an Intel domain, but AMD has made up for it with the Ryzen CPUs! It's really amazing how to deal with some fine... Project Cars (DX11) Even with Project Cars, the chemistry between the new CPU and the engine is right, even though it was observed time and again that all 10 cores clocked up to 4.0 GHz, even though they were not all busy. But we would... Introduction During the launch article of AMD's Ryzen 7 CPUs, we had already explained all workstation and HPC benchmarks in great detail and also questioned the background for many results in some cases even down to the last detail. En... Important preliminary remark Since Intel no longer realizes the contact between Die and Heatspreader by metallic solder at Skylake-X and Kaby Lake-X, but also uses cheaper TIM (Thermal Interface Material) to use the same way. Cooling with the Chiller crowbar In order to be able to achieve usable (overclocking) results, we had to switch from the normal water cooling to the Alphacool Ice Age Chiller 2000, as already mentioned in the previous chapter. ... What is left for us after all these pages as a summary? Intel's market leadership in recent years is ultimately based on a continuous offer of more or less large updates, which of course also this time a certain amount of expected...

In some applications and games, we've encountered performance trends that didn't match our expectations. Considering that Skylake X has both a speed advantage through higher clock speeds and new architectural features, such as the redesigned caches or the 2D mesh topology, we wouldn't have expected Broadwell E models to Successors in any scenario could have trumped at all. But that is precisely what is happening in some cases. We have contacted Intel about this and the answer is our request:

… We have noticed that there are a handful of applications where the Broadwell-E part is comparable or faster than the Skylake-X part. These inversions are a result of the "mesh" architecture on Skylake-X vs. the "ring" architecture of Broadwell-E.

Every new architecture implementation requires architects to make engineering tradeoffs with the goal of improving the overall performance of the platform. The "mesh" architecture on Skylake-X is no different.

While these tradeoffs impact a handful of applications; overall, the new Skylake-X processors offer excellent IPC execution and significant performance gains across a variety of applications.

We've been looking at Intel's new mesh architecture, at least with the few details Intel provided early last week. At this point, therefore, we will offer a brief overview of the content before we return to our actual tests.

The background

Processor interconnects are lines to move data between the key components inside a processor, i.e. the CPU cores, caches, and the PCIe and memory controllers. This basic idea of processor design touches almost every internal component and thus also has a great influence on latency, which in many cases correlates with performance. In the same way, it influences the power consumption and is thus also reflected (in approaches) in the designated TDP.

Intel and AMD's interconnects debuted more than a decade ago. Intel's ring bus, for example, was introduced with Nehalem in 2007, while AMD's HyperTransport dates back to 2001. Both structures developed, but the rapidly growing core number, increasing caches, increasing performance and I/O throughput became the ever-increasing burden on interconnects. Although there are a myriad of technologies to increase the performance of existing interconnects, such as improved scheduling and routing, this usually requires an increase in frequency and thus also the voltage in order to even greater the number of performance gains.

Intel's bi-directional ring bus – shown at the top of red on a Broadwell LCC (Low Core Count) die – serves as a good example of the challenges. The data moves in a rather cumbersome way to get to the cores, cache, and I/O components. Latency then grows as a logical consequence with a growing number of cores. The second image shows a Broadwell HCC (High Core Count) die with 24 cores. A string of all the cores in a monolithic bus leads to excessive losses – so Intel simply divided the larger Die into two separate ring buses. This not only increases the complexity of scheduling, but the buffered switches, which facilitate communication between the two rings, also cause a delay of five cycles. This limits scalability enormously.

AMD's HyperTransport, too, eventually had grown beyond its load limits, so the manufacturer presented the new Infinity Fabric with the Zen architecture. The design is based on two quad-core processors (CCX) that communicate with each other via a 256-bit, bi-directional cross-connection that is equally responsible for Northbrigde and PCIe traffic. They also share the storage controller. The connection via Infinity Fabric to the next Core Complex (CCX) with its four cores, however, leads to increased latency in communication. A detailed look at the design and the latency can be found in our test of the Ryzen 5 1600X.

We also found that increasing the memory clock can lead to improvements in the characteristics of Infinity Fabric latency, which is most likely one of the main reasons why the performance of the Ryzen is improved with faster memory. capable of higher data transfer rates.

AMD is pushing for software and platform optimizations that could eliminate some of the performance issues we've been able to understand in our tests. And in what we have seen, this has indeed worked! AMD's efforts and an incessant succession of BIOS, chipset and software updates are now doing much better than the ones we measured in our first Ryzen 7 test. And AMD's work certainly continues.

Now Intel faces the same challenges and we are curious to see how quickly the blue giant can and will respond to this.

What a mesh!

Intel's new mesh architecture celebrated its launch in the Knights Landing chips. The mesh consists of rows and columns of interconnects that connect cores, caches, and I/O controllers. As you can see, the buffered switches, which are absolutely deadly for low latency, have simply disappeared. The ability to "pass" information through the cores allows for much more complex – and supposedly more efficient – routing among the components. Intel also claims that its 2D mesh requires a lower frequency and voltage than the ring bus, but does it achieve higher bandwidth and lower latency. Of course, we are still reviewing both.

Intel has moved the DDR4 controllers to the left and right sides of the 18-core HcC (High Core Count) ( similar to the Design of Knights Landing). On the ring bus, they were usually still housed at the bottom. The Skylake X Die photo also indicates six more memory controllers (second row from below, in the left and right columns). So it appears that Intel disabled two of the controllers by default. Intel probably uses the smaller LCC (Low Core Count) for the Core i9-7900X, but no details have been officially disclosed.

Things become "meshy"

Intel has developed its 2D mesh to improve scalability as well – but, as the manufacturer suggests, with some compromises. We used SiSoft Sandra's Processor Multi-Core Efficiency Test to measure inter-core, inter-module, and inter-package latency. The benchmark provides multi-threaded, multi-core- and single-threaded testing. For our test we used the multi-threaded test with the "Best Pair Match" setting (lowest latency).

The test can be used to measure performance between cores with all possible thread pairings. For the Intel Core i9-7900X, this meant 189 different results in the end, which of course becomes too confusing in the presentation. We have therefore used a data parser to convert the measurement results into average values.

Processor	Intra-Core Latency	Core-To-Core Latency	Core-To-Core Average Latency	Average Transfer Bandwidth
Core i9-7900X	14.5 – 16ns	69.3 – 82.3ns	75.56ns	83.21 GB/s
Core i9-7900X x 3200 MT/s	16 – 16.1ns	76.8 – 91.3ns	83.93ns	87.31 GB/s
Core i7-6950X	13.5 – 15.4ns	54.5 – 70.3ns	64.64ns	65.67 GB/s
Core i7-7700K	14.7 – 14.9ns	36.8 – 45.1ns	42.63ns	35.84 GB/s
Core i7-6700K	16 – 4/4ns	41.7 – 51.4ns	46.71ns	32.38 GB/s

Intra-core measurement expresses the latency between threads in numbers processed with a physical core, while the core-to-core number reflects thread-to-thread latency between two physical cores. The i9-7900X is comparable at this point to the i7-6950X, which also has ten CPU cores. As a reference, however, we have also added the four-core counterparts.

Between the two chips, we were able to detect a slightly higher intra-core latency and an average value of 10.92 ns. Apart from a slightly increased latency, we were able to measure a decent increase in bandwidth for the Core i9-7900X, which was 17.54 GB/s higher on average – a whopping 26.7 percent increase.

We generated our first set of results with DDR4-2666 RAM and repeated the measurements with DDR4-3200 memory. We received higher latency in all areas, but were also able to measure a higher bandwidth. We fear that these results are only preliminary and will later perform further latency and gaming tests with different storage transfer rates and timings to provide a more in-depth analysis.

Processor	Intra-CCX Core-to-Core Latency	Cross-CCX Core-to-Core Latency	Cross-CCX Average Latency	Average Transfer Bandwidth
Ryzen 7 1800X	40.5 – 82.8ns	120.9 – 126.2ns	122.96ns	48.1 GB/s
Ryzen 5 1600X	6/40 – 82.8ns	121.5 – 128.2ns	123.48ns	43.88 GB/s

The architecture of AMD's Ryzen processors differs significantly at this point, which of course leads to different measurement results. The intra-core latency measurement represents the communication between two logical threads on the same core, and the memory speed does not matter.

The intra-CCX measurements in AMD quantify the latency between threads in the same CCX module, but on different cores. In the past, we observed slight variances at this point, but intra-CCX latency is also hardly affected by the memory speed. In contrast, we were able to measure a 50 percent decrease in cross-CCX latency. How this can be done is simply explained. This type of latency occurs when threads are located on two different CCX modules. However, if the data rates of the memory are improved by switching from DDR4-1333- to DDR4-3200-RAM, these latencys also decrease as a logical consequence.

Structural bandwidth

We also recorded our test results in terms of structural bandwidth. At this point, the Core i9-7900X offers a significant advantage over its Broadwell E predecessor.

According to Intel, the structure is also expected to operate at a lower frequency and voltage, suggesting that the manufacturer has increased the width of the bus. AMD's Ryzen outperforms Intel's four-core, but on average offers less bandwidth than Intel's ten-core models.

Pages: