The secret to improving multicore performance: Data flow and cache analysis

For programmers, few experiences are more frustrating than multithreading an algorithm for a multicore application only to find that performance is barely better than on a single core. To avoid this frustration, achieve significantly faster runtimes, and predict how performance will scale as one, two, or more cores are added, programmers need greater visibility into multicore performance bottlenecks. Doug explores how analyses of data flow and cache usage can help programmers zero in on the key factors that slow runtime performance, with a cache analysis software library playing a central role.

It’s a familiar question for programmers: How can faster runtimes be achieved on multicore architectures? The simple answer is multithreading. In practice, the time-consuming task of multithreading algorithms for multicore applications often fails to produce an appreciable performance boost. Frustrated programmers look for solutions in instruction cycles, but many times coding issues are not the problem. Instead, the answer often lies in data flow and cache usage: two areas that can be easily analyzed to improve algorithm – and application – performance.

Data flow and cache analyses are effective ways to identify where and how multicore processors lose performance. Helpful tools and techniques include:

1. Processor performance counters – Counters are highly accurate and provide a snapshot of everything that takes place while code is running, including cache operations, executed instructions, and the frequency of core stalls. Limitations include the difficulty of isolating the performance of the code under study and the need to have the specific hardware target in hand.

2. Simulators and emulators – Processor suppliers sometimes provide simulator or emulator technology, which lets code run before hardware is available and may allow designers to view activity within the caches. These tools can become available as soon as a processor’s architecture is defined, but each is generally tied to a specific processor architecture.

3. Instrumentation of existing code – Developers can gain greater visibility into how data flows through cores by using a cache analysis software library that lets them run code through simulated cache structures of their own definition. Users receive detailed information about cache problems, which can lead to immediate code changes and follow-up analyses.

The third method of data flow and cache analysis – instrumentation of existing code – holds particular promise. This technique gives programmers the freedom to conduct “what-if” analyses, creating and testing multiple cache architectures – all without the need for a specific hardware target or specialized emulation.
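To make the “what-if” idea concrete, the sketch below models a direct-mapped cache with hit/miss counters and replays the same access trace through two different cache geometries. All names here are illustrative assumptions for this article, not the actual Cachelib API.

```python
# Minimal sketch of a user-defined simulated cache for "what-if" analysis.
# A direct-mapped cache: each memory line maps to exactly one cache slot.

class DirectMappedCache:
    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # tag stored in each cache slot
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index = line % self.num_lines    # which slot this line maps to
        tag = line // self.num_lines     # which line currently owns the slot
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.tags[index] = tag       # evict and refill on a miss
            self.misses += 1

# "What-if": replay one trace through two candidate cache geometries.
# Two passes over 32 lines: too big for the small cache, fits the large one.
trace = [i * 32 for i in range(32)] * 2
small = DirectMappedCache(num_lines=16, line_size=32)
large = DirectMappedCache(num_lines=64, line_size=32)
for addr in trace:
    small.access(addr)
    large.access(addr)
```

Here the small cache thrashes (every access misses), while the large one hits on the entire second pass – exactly the kind of architectural insight that is hard to obtain from counters or fixed-target simulators.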

Cache analyses and data flows: A closer look

Dataflow analysis is most useful when the application of interest consists of a small set of “well-behaved” algorithms, each having deterministic data flow and few control branches. Applications that fall into this class tend toward complex mathematical computations and machine control algorithms. Performance of database applications, on the other hand, is much more difficult to characterize because the dataflow varies wildly based upon the particular records being accessed and the database's size and structure. Fortunately, many applications of interest, such as image reconstruction and machine control, fall into the “well-behaved” class.

Example: Data flow and cache analyses

Figure 1 uses the Fast Fourier Transform (FFT) algorithm to demonstrate the kind of visibility into multicore performance bottlenecks provided by data flow and cache analyses. Performance is critical in the FFT, which is commonly used in the military and medical fields, and it is well-behaved with respect to its dataflow.

Figure 1: The Fast Fourier Transform (FFT) algorithm is used to demonstrate the kind of visibility into multicore performance bottlenecks provided by data flow and cache analyses.


The left side of Figure 1 illustrates a simple decomposition of a 16-point, 4-core FFT in which the array is shown vertically, with its points indicated by circles. Core 1 begins by working on the first four points in the array, computing values and then, at each stage, waiting for the other cores to complete their processing. The remaining cores operate in much the same way, always waiting for the other cores before recalculating their results, because each stage depends on data computed in earlier stages. The threads must be synchronized at each stage where data comes from a different core; in this instance, the last such synchronization occurs at stage three.
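The stage-by-stage waiting described above can be sketched as worker threads separated by barriers. This is a simplified skeleton of the naive decomposition (the butterfly arithmetic itself is omitted), not the article’s actual implementation:

```python
# Skeleton of the naive 16-point, 4-core decomposition: every core must
# reach a barrier at the end of each stage before any core may proceed,
# because the next stage reads data produced by other cores.
import threading

N, CORES = 16, 4
STAGES = 4                       # log2(16) radix-2 stages
barrier = threading.Barrier(CORES)
waits = [0] * CORES              # synchronization points seen by each core

def worker(core):
    for stage in range(STAGES):
        # ... compute this core's share of the butterflies for `stage` ...
        barrier.wait()           # stall until all cores finish the stage
        waits[core] += 1

threads = [threading.Thread(target=worker, args=(c,)) for c in range(CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every core pays for every barrier, so reducing the number of stages that need cross-core data directly reduces time spent stalled.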

The arrows on the left side of Figure 1 show data flowing from one core’s cache to another – information that is not normally visible when programming. When data needlessly flows between the caches of different cores, application runtime slows down.
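Instrumentation can make those invisible arrows countable. The sketch below tracks which core last wrote each array slot and counts every butterfly whose inputs come from different cores. The pairing used is one common radix-2 schedule chosen for illustration; it is not necessarily the exact decomposition drawn in Figure 1.

```python
# Sketch: count cross-core transfers in a naive 16-point, 4-core FFT
# decomposition by tracking an "owner" (last-writing core) per array slot.
N, CORES = 16, 4
owner = [i // (N // CORES) for i in range(N)]   # core 0 owns 0-3, etc.

def butterfly_pairs(stage):
    # One radix-2 pairing: elements combined at distance N/2, N/4, N/8, 1.
    half = N >> (stage + 1)
    return [(i, i + half) for i in range(N) if not i & half]

cross_core = 0
for stage in range(4):
    for a, b in butterfly_pairs(stage):
        reader = owner[a]
        if owner[b] != reader:
            cross_core += 1      # data must move between cache domains
        owner[a] = owner[b] = reader
```

Each increment of `cross_core` is one arrow crossing a cache domain – a candidate for elimination by reorganizing the decomposition.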

The right side of Figure 1 shows the runtime improvements that can result from the clues data flow and cache analyses provide. The decrease in the number of arrows crossing cache domains represents 1) an improvement in the performance of the FFT algorithm; 2) the elimination of data sharing between cores; and 3) the elimination of the double buffering needed when two processors write back to the same data that is being read.

Note that in the right side of Figure 1, Core 2 and Core 4 sit idle until the third stage. This would be a concern if computation were the limiting factor in performance, but in this case it is not – data flow is.

Figure 2 shows the time the processor spends in computation and data movement for three different FFT cases. The left side shows a single-core implementation. The middle and right stacks show time breakdowns for the first and second algorithms from Figure 1.

Figure 2: The time the processor spends in computation and data movement differs across three FFT cases. The left stack shows a single-core implementation, while the middle and right stacks show time breakdowns for the first and second algorithms from Figure 1.


The middle stack of Figure 2 shows the increased time that a processor spends accessing memory – a key factor in runtimes and one that data flow and cache analyses can help resolve. In the single-core example, memory operations do not take as long because data fits well in the caches. In the multicore example, memory access is slower because data does not fit well in the cache after it has been double buffered.

The right side of Figure 2 shows the better performance that can result when the aforementioned bottlenecks are addressed. The time spent accessing memory and the L2 cache is greatly reduced. Now the number of instructions and the number of accesses to the L1 cache are the largest bottlenecks; both can be addressed using vector processing instructions.
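As a rough illustration of how vector instructions cut instruction count, the sketch below computes the same radix-2 butterflies two ways: an element-at-a-time scalar loop versus whole-array operations, with NumPy standing in for the processor’s SIMD unit. The function names are hypothetical, chosen for this example.

```python
# Illustrative only: NumPy array operations stand in for SIMD/vector
# instructions that process many elements per instruction.
import numpy as np

def butterflies_scalar(a, b, w):
    # One instruction stream per element: high dynamic instruction count.
    out_a, out_b = a.copy(), b.copy()
    for i in range(len(a)):
        t = w[i] * b[i]
        out_a[i] = a[i] + t
        out_b[i] = a[i] - t
    return out_a, out_b

def butterflies_vector(a, b, w):
    # The same arithmetic issued over whole vectors at once.
    t = w * b
    return a + t, a - t
```

Both produce identical results; the vector form simply retires far fewer instructions per butterfly, attacking the instruction-count bottleneck identified in Figure 2.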

These charts, generated from instrumentation of the code, enable the programmer to easily visualize problem areas and focus on the changes that maximize overall performance. Although not shown in this example, code instrumentation also allows the programmer to change the size or structure of the caches to understand the implications for code performance – something that is impossible to do with processor simulators or performance counters.

Greater visibility for better performance

Multithreading an algorithm is time consuming, and too often programmers find their efforts have resulted in meager performance gains.

Data flow and cache analyses can prevent that frustration by providing visibility into multicore performance limitations. Emerson’s complimentary Cachelib cache analysis software library supports instrumentation of existing code and makes analysis of data flow and cache easier and more detailed than ever before. By emulating the caching structure, the software enables what-if code analyses of multiple algorithms and cache architectures.

When designers obtain better visibility into performance limitations and use it to build and experiment with new cache architectures, frustration lessens and better multicore performance emerges.

Doug Sandy is the senior staff technologist of the Technology and Strategy department for the Embedded Computing business of Emerson Network Power. He is responsible for evaluating the performance of computer systems, constructing models for the systems, and predicting system behavior. He also focuses on the strategic development of computing, networking, memory, and storage trends. He holds bachelor’s and master’s degrees in Electrical Engineering from California Polytechnic State University, San Luis Obispo, California. He can be contacted at Doug.Sandy@Emerson.com.

Emerson Network Power

Embedded Computing

602-438-3392

www.emersonnetworkpower.com/embeddedcomputing


To obtain a free copy of the Cachelib library, please e-mail the author at Doug.Sandy@Emerson.com.