THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

Hyper-Threading Performance

The Hyper Threading technology returned to the Intel Xeon (and Core-ix) with processor codename Nehalem in 2009. It was first introduced with Willamette in late 2000?, and the first Xeon in 2001. But the Core 2 architecture, 2006, in the Xeon 5300 and 5400 series did not have this feature. As with many (potentially) performance enhancing technologies and features, there is a glaring absence of meaningful detailed performance data that might be of help to developers and information technology professionals. And it is not as if Intel and Microsoft do not have the budget for this. (Note: I am deliberately using both Hyper-Threading and Hyper Threading because it impacts search.)

The complex interaction between software and advanced hardware inherently implies that there is no magic that always improve performance without risk of negative consequences. I am of the impression that marketing people squash anything that could be interpreted as negative because they want victims to believe that everything will be wonderful.

On our side we should expect that it takes talent and years of experience to build up the skills necessary to command professional salaries. So we need to see both good and bad in order to determine the best strategy to realized the benefits of the positive while containing the impact of any deficiencies. If it were not, then why staff our positions with someone more expensive than necessary?

Hyper-Threading from Nehalem to Sandy Bridge

Nehalem and later Xeon Core-ix based processor implement temporal multi-threading. The older Pentium 4 based processors had simultaneous multi-threading. The details of both type are covered in the references at the end and will not be covered here. I have a brief discussion on HT in Processor Architectures and History.

When the Nehalem-EX (Xeon 7500 series) was near introduction, the HP benchmark team suggested that the impact of HT was in the range of 30% improvement for the OLTP oriented benchmarks and 10% for the DW/DSS benchmarks. I presumed that this was on the 8-socket ProLiant DL980 as benchmarks for the ProLiant 300 and 500 systems were handled by a different group. My expectations are that HT should have greater impact on the 1 to 4 socket systems because it is much more difficult to achieve performance gains on the more complex systems.

A few years ago, I did an assessment of a proprietary database engine that only did parallel b-tree searches without locking. Scaling was nearly linear with the number threads over both physical cores and logical processors. This was not unexpected because of the pointer chasing code means the core executes serialized memory accesses, so there should be plenty of no-op cycles for two logical processor to share. As a matter of factor, 4 threads (logical processors) per core might be even better.

Apparently that SQL Server does not realize such benefit has to do with the locking. Is there is a single resource that locks must go through? Dave Campbell's post Breakthrough performance with in-memory technologies on Hekaton cites Per-Ake Larson et al article, and mentions lock free data structures. (I found Sasha Goldshtein material Practical Concurrency Patterns: Lock-Free Operations and Practical Concurrency Patterns: Spinlock. If any knows good sources, please advise.) This might indicate that the next version of SQL Server could show more benefit from HT and perhaps Intel should consider more threads per core in the next generation?

 

HT in Small and Large Queries

Ideally we should investigate HT for both high-volume small queries (non-parallel execution plans) and low-volume large queries with parallel execution plans. Unfortunately my multi-threaded test harness is old and I do not have time to bring it up to date.

Recently I did work on a clustered system with two nodes. For whatever reason, one had HT enabled and another had HT disabled. Periodically, the active node would be switched to allow for patching. This provide invaluable data on HT performance.

Test System

The average SQL statement on this system is 1-2 CPU-ms, comparable to TPC-E. HT was off until mid-September, on until October, off until early November, on until mid-January and off until beginning of March. Certain SQL statements are not impacted by HT, and some SQL could be more than 50% faster with HT on, even queries without a parallel execution plan. The overall steady CPU load on the system is typically 25% during the day.

For large queries, I used the TPC-H data generator at scale factor 100 and the set of 22 TPC-H queries. The LineItem table is 80GB because of the use of the 4-byte date data type instead of the 8-byte datetime. The total database is 140GB.

 

Test System

The test server is a Dell PowerEdge T110 II with Intel Xeon E3-1240 processor, quad-core 3.3GHz, 32GB memory, and LSI MegaRAID SAS-9260 controller connected to 8 SATA 6Gpbs SSDs. There are 2 Crucial m4, 2 OCZ Vertex 3, 3 OCZ Vertex 3 Max IOPS and 1 OCZ Vector. The T110 only has PCI-E gen 2 slots, so I will need to get a new system for gen 3 testing. Below is a representation of the test system.

Test System

The software stack is Windows Server 2012 and SQL Server 2012 Enterprise Edition. Most testing was done with sp1 + cu3, with a few tests at RTM, and sp1 base. The storage configurations evaluated were 4 x 2 disk RAID 0, 8 disk RAID 4 and 1 x 8 disk RAID 0. All results here are 1 x 8 disk RAID 0, but write activity was minimal so RAID 5 results should be only moderately less.

Hyper-Threading can be enabled or disabled in the UEFI (formerly BIOS, now Universal Extensible Firmware?). Dell also allows setting the number of cores to enable. I think is most useful in light of SQL Server per core licensing. We can buy a single model for many functions, then enable the correct number of cores to balance performance and licensing cost.

The SQL Server DATE_CORRELATION_OPTIMIZATION setting was turned on. It so happens that correlated dates within the TPC-H tables are within 120-days of each other. Ship date is within 120 days of order date and so on. The TPC-H queries are written citing a greater than condition on one date column and a less than condition on a correlated column, so this setting helps. If there is a not tight correlation, then the setting may not help. In practice, it might be better to just explicitly specify both upper and lower bounds on the cluster key column whenever possible.

Several settings were tried on the LSI SAS-9260. No read-ahead is better than read-ahead and 128K stripe size was better than 64K, but results below are with no read-ahead and 64K stripe size.

While the TPC-H scale factor 100 data set was used, the indexes were built different than TPC-H requirements as I am not presenting official TPC-H results. The significant difference is that the clustered index key on LineItem is Shipdate+OrderKey+LineNumber, specified as unique. This is per my practice of building unique clustered indexes, and not to eliminate key lookups in the execution plan.

Hyper-Threading On versus Off

In all tests, only the execution time and CPU are reported. Statistics generation and compile times were excluded. Each test of the 22 TPC-H queries are from clean buffers and clean procedure cache. From a freshly populated database, the statistics generation time is about 40 sec. The compile time with statistics previously generated is about 4-5 sec for all 22 queries.

CPU and elapsed time were collected with SET STATISTICS TIME ON. The DMV dm_exec_query_stats reports exactly the same elapsed time, but the worker time is higher, perhaps including the compile time?

Below are elapsed time in milli-sec to execute each of the 22 TPC-C queries with Hyper-Threading off and on. There was no restriction on Max degree of parallelism so HT off is DOP 4 and HT on is DOP 8 (4 physical cores).

HT off on dur

Below are CPU (worker) time in ms for the same with HT off and on.

HT off on CPU

The sum total execution elapsed time for 22 queries is 692 sec with HT off and 559 sec with HT on. Overall, the 22 queries run 24% faster with HT. On the individual queries, all but one range from 0 to 65% faster. The exception, query 15, runs 4 times faster with HT on instead of off. Below is the elapsed time for HT off relative to HT on.

HT elapsed

The sum total execution CPU time is 2253 sec with HT off and 3374 sec with HT on. Overall, CPU is 50% higher with HT on. Of course, there are twice as much available "CPU-sec" with HT on than off, so this is not a negative. With a perfect parallel execution plan and infinitely fast IO, all (logical) cores should be at 100% during the query execution.

On the individual queries, CPU on all but Q 2 and Q15 range from 22 to 80% higher with HT on. On the Q2, CPU is 137% higher. On Q15, CPU was 45% lower with HT on than off, reflecting the 4X speedup with double the logical processors. Below is the CPU time for HT on relative to HT off.

HT elapsed

There is just the slight difference in the execution plans for Q15, shown below. The first is with HT off and the second with HT on.

Q15 HT off

Q15 HT on

I do expect HT should improve the performance of this query but it is probably reasonable to believe that most of the 4X gain in elapsed time is attributed to the difference in execution plans. I have no idea why the Query Optimizer would not use the same plan in both cases. Perhap someone from the Microsoft QO team can help?

For anyone interested, I have provided the actual execution plans for the above HT off and on sqlplan. (Too bad the web browser does not render sqlplan files!)

 

Hyper-Threading Off and On at DOP 1

Below is elapsed time with HT off and on at degree of parallelism 1. Amazingly, performance is better with HT on at DOP 1, 20% faster!

HT off on dur

How can this be? When this is the only activity, only one of two logical cores on one of the four cores is in use. The performance should be very similar to single-threaded HT off? The range is 0 to 63% faster except Query 17 which 14% slower with HT on.

Below is CPU time with HT off and on at DOP 1.

HT off on CPU

Below is the relative elapsed time with HT off to HT on.

HT elapsed

Query 1 is a simple table access aggregating 591M rows over nearly 80GB, so there can be no difference in execution plan, and yet it is 48% faster with HT on? If someone knows how this can be, please advise?

I could speculate that with HT on, there are twice as many logical processors, so it is less likely an OS task would disrupt the SQL thread handling the running query. Perhaps is so happened that thread handling the SQL query and the thread handling IO were on the same physical processor, so data was already in L2? These are just wild guesses.

In the comments below, Chris A mentions that SQL Server 2012 tries to fit batches into L2 cache. What if it estimate the amount of available L2 too high? Then with HT on, it would presume only half that to be available? But with nothing running on the other LP, now single-thread HT on runs better than HT off?

I like this hypothesis, but the HT-on benefit occurs for S2K8R2 too. Its never the theories that we like!

 

Hyper-Threading Off and On at DOP 2

Below is elapsed time for HT off and on at DOP 2. Most of the individual queries are with 12% of each other, faster or slower, except for Query 1, 28% faster with HT on. Overall HT on is 2% faster, basically noise.

HT off on dur

Below is CPU time for HT off and on at DOP 2

HT off on CPU

 

Hyper-Threading Off and On at DOP 4

Below is elapsed time for HT off and on at DOP 4. The range with HT on is 2.5X slower (Query 6) to 90% faster (Query 15). Overall, 13% slower with HT on. It is possible that both logical processors are used on certain cores instead of a single logical processor of each physical core. The same may have happened at DOP 2.

HT off on dur

Below is CPU time for HT off and on at DOP 4

HT off on CPU

 

Hyper-Threading Off, DOP 1, 2 and 4

Below is elapsed time with HT off at DOP 1, 2 and 4. Notice that scaling is sometimes better than 2X from DOP 1 to 2. I discussed this super-scaling effect in Tpch Studies at SF10

HT off on dur

Below is CPU time for HT off at DOP 1, 2 and 4

HT off on CPU

Below is the ratio of elapsed time at DOP 2 over 1, and DOP 4 over 2. Correction: this is actually speedup from DOP 1 to 2, and 2 to 4, it is the ratio of elapsed time at DOP 1 over DOP 2, and 2 over 4. The conventional expectation is the scaling should be close to linear from DOP 1 to 2, for a value of just under 2 on the graph. From 2 to 4, scaling should be still be good but less than 1 to 2. If the test system had more cores, there should be reduced gains in scaling to the higher DOP. Query 10 is an example.

HT elapsed

As it turns out, the behavior is all over the map of possibilities. One effect is super-scaling where the parallel execution plan is actually more efficient than the non-parallel plan. Four queries (2,11,13 and 18) actually exhibit super-scaling at both DOP 1-2 and DOP 2-4.

Other effects are that the parallel plan is much less efficient than the non-parallel plan, but otherwise scales with parallelism. In this case, we would expect scaling from 1 to 2 to be poor, but good at 2 to 4. Examples are queries 6 and 15.

In addition, because of the SQL Server cost formulas, we could be expect completely different execution plans and characteristics at different degrees of parallelism.

Below is the ratio of worker time at DOP 2 over 1, and DOP 4 over 2. The conventional expectation is progressively higher CPU with increasing DOP to account for the overhead of coordinating the threads.

HT elapsed

But as with elapsed time, there are radical variations in behavior. Per super-scaling effect, CPU at DOP 2 is actually lower than DOP 1 in many cases.

 

Hyper-Threading On, DOP 1, 2, 4 and 8

Below is elapsed time with HT on at DOP 1, 2, 4 and 8. Notice the more predictable behavior of queries running faster at progressively higher DOP, even when the number of threads exceeds the number of physical cores. The exception is query 15.

HT off on dur

Below is CPU time with HT on at DOP 1, 2, 4 and 8. Again the expected behavior of progressively higher CPU with increasing DOP except in the DOP 1 to 2 where the parallel execution plan is more efficient than the non-parallel plan.

HT off on CPU

Below is the ratio of elapsed time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4. Correction: DOP 1 over 2 DOP 2 over 4 and DOP 4 over 8

HT elapsed

Below is the ratio of CPU time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4.

HT elapsed

 

HT Summary

Overall Hyper-threading is a significant positive for performance. There are unexpected characteristics so it is important to collect sufficient performance data for HT off and on, so that anomalies can be investigated. It is also important to have data on query performance at range of DOP values. In a production system, one should also consider the strategy for max degree of parallelism (and cost threshold for parallelism) factoring in the number of concurrently running queries (with and without parallel execution plans.

The premise is that HT performance gains on SQL Server is heavily impacted by the locking mechanism. Then the next version of SQL Server, which is expected to make extensive use of lock-free data structures should see more substantial performance gains with HT. This should also prompt Intel to support more logical processor on each physical core, similar to the 4 threads on IBM POWER.

See also Queries barely over the Cost Threshold for Parallelism and Parallelism Strategy and Comments. In the first, I mentioned that a parallel execution plan with an intermediate DOP may allocate both logical processors on one core, instead of distributing threads one per physical core. Even if using both logical processors on one core produces a 30% throughput gain, that implies each processor running at 65% of one LP used and the other idle. Then the threads on the doubled up core would take 50% longer than the cores with only one active LP.

Another question is how should the threads be distributed? Consider one thread per core, all on a single socket first, before going to the next socket, so threads share a common L3 cache. Potentially operations such as the Parallelism Repartition Streams might run better. The other option is one thread on each socket first, before coming back to the first socket. If SQL Server could then distribute the work so that threads are assigned to access pages stored on local memory as much as possible, this would improve memory performance. Of course, this would be impossible to accomplish perfectly, but perhaps good enough is possible?

Wikipedia References

Wikipedia has pages on both Multi-Threading and processors with MT in one form or another. MT topics include the broader topic Multi-Threading, and the specific types of MT: Simultaneos Multi-Threading, Temporal Multi-Threading, along with the Intel brand name Intel Hyper-Threading on this topic.

Processors with MT include the IBM POWER5 (2004), POWER6 (2007) and POWER7 (2010), all 4 threads per core. POWER8 in 2013?

The Sun (now Oracle) UltraSPARC T1 (2005) is 4 threads. UltraSPARC T2 (2007) and later are 8 threads. Then SPARC T3 (2010) T4 (2011). The newest Oracle SPARC T5 (2013).

Intel Itanium acquired MT with Montecito (2006), and continued with Tukwila 2010 and Poulson (2012).

HT-SQL references

I should have checked earlier, Linchi post on this topic performance-impact-not-all ... (2012-01)- 2 posts actually. And Kevin pointed out Slava Oks' Weblog Be aware: To Hyper or not to Hyper

Published Monday, April 08, 2013 10:15 AM by jchang

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Adam Machanic said:

Joe,

Nice article. In my testing I've found that the sweet spot for MAXDOP with HT, especially in a concurrent workload, is around 150% of the physical core count per node. More than that, at least on the microarchitectures I've tested on (mostly Nehalem and Westmere), and additional threads very sharply begin decreasing in utility. (Which is to say that I don't see worse elapsed time, but rather no additional benefit; and I'd rather keep the worker thread count as lean as possible at any given time.)

My tests haven't been nearly as formalized as these, so it would be interesting to see your numbers specifically comparing MAXDOP 4/6/8. In such a test, correcting for parallel plan choice would be ideal. If you were to enable trace flag 8649 at the server level, it would force parallel plan choice regardless of the DOP-based cost changes, so it would force some plan stability and eliminate that as a variable.

Up for it?

--Adam

April 8, 2013 10:52 AM
 

jchang said:

there is definitely something to be said for MAXDOP less than the full set of logical processors. This may even apply to systems without HT, see http://www.qdpma.com/tpch/TpchStudiesSF10.html

I had to guess, I would say that there are OS and SQL Server background threads that may need to intermittently do something, so having a few cores or LP free would help? I wonder what the parallel plan parent thread needs? And I wonder what the SQL Server strategy is for division of work between threads. I think SQL Server occasionally has problems with asymmetric loop joins. On the one hand, we would like each thread to tackle a big chunk before checking with the parent on the next chunk. But we would also like to keep all assigned threads busy according to its capability, i.e., if both LP on one core are active on some, but only one LP in other cores, there should be different amount of work sent to each? I will try to take another look at 4,5,6,7,8 later.

April 8, 2013 1:30 PM
 

Chris Adkin said:

Whenever I read your blog postings I'm always blown away by the sheer amount of effort you always go to and the level of detail you go into. Something thats quite interesting regarding hyper threading, if my understanding is correct, is that turning it on 'Splits' the on CPU caches between hyper threads. SQL 2012 batch mode is designed such that batches fit into the level 2 cache, as such it would be really interesting to see what turning hyper threading on / off would do to batch mode and monitor execution time againt CPU utilisation and level 2 cache misses.

April 10, 2013 5:06 AM
 

jchang said:

I recall seeing something on this. If anyone could provide a link to the L2 cache size - batch size details, I would appreciate. I thought this aspect applies to parallel execution plans, but it could also be used for non-parallel plans too?

I would assume that on startup, SQL Server inquires to the L2 size. And it must determine the number of logical processors per core, to determine whether 2 (or more) threads will share the same L2. Then one must estimate the batch size to be some amount less than the L2 size, because L2 is a unified cache, shared between data and instructions and there are other data structures too.

The theory is that in processing a query, the same data may be accessed repeatedly, so if the chunk fits in L2, the access time will be very short. Now suppose that the chunk size was over-estimated, i.e., to a higher percentage of L2 than actually available for the working data.

However, a system with HT-enabled should estimate only half the L2 as available. So the single-threaded execution plan with HT on has a more CPU-efficient strategy than the single-threaded plan with HT off?

If MS would let me have source-code access, I could look into this. I am relatively good at keeping secrets, honest. OK it could be done with just the pdb files, but my debugging skills have faded and I am no longer inclined to decode assembly.

April 11, 2013 9:17 AM
 

Chris Adkin said:

I love your very last line, I know that Thomas Kejser has done some very deep debugging stuff, and I probably don't really need to mention the likes of Bob Ward . . . ;-)

April 15, 2013 5:22 PM
 

?????????? Hyper-Threading ???? ?????? ???????????? SQL Server said:

April 28, 2014 8:34 PM
 

Fabio A. said:

Actually there are many reasons for your threads to be improved in performance even when DOP is 1.

When there are less threads running than the available physical CPU (cores) you may guess the system will schedule a thread on each processor, thus that's why you guess performance should be similar.

I don't believe it works so.

As far as I got the way HT works it should always schedule threads on physical CPU trying to fulfill processors, so you should see processors with two threads running and processors totally idle.

In this way the system can speedup the threads by activating turboboost, i.e. having some processors (of the same CPU socket) idle it can dedicate more clock cycles to active processors.

Furthermore thrads on the same core may use L1 cache for intercommunication rather than L2.

Finally, with this scheduling logic, all processors on the same socket will be first "populated" with threads, rather than scaling on other sockets, thus avoiding processes communication between cores on different sockets (on NUMA systems).

As far as I understood the processor architecture, dual threading on the same core (physical processor) doesn't reduce the clock cycle of each thread, which remains the same.

Summurizing, HT can only bring benefits to your thread scheduling, because can speed-up the threads (turboboost) and avoid processes communication via nested level of cache

May 7, 2014 9:29 AM

Leave a Comment

(required) 
(required) 
Submit

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement