- Memory, like Earth, is not flat
- The compiler is one of your best friends for optimization (and you do not need to buy him a beer)
- Identify your hotspots
- Optimize serial execution
- The unavoidable evils - synchronization and communication
- Load balancing
- Be aware - NUMA is everywhere
- How to avoid programming grief
- Don't reinvent the wheel - use mathematical libraries
- Monitor the system
- Measure scalability
- SGI® UV™ Tuning
1. Memory, like Earth, is not flat
In ancient times, people believed the Earth was flat, and that if you went to the edge you would fall into an endless abyss. The Earth is not flat, but many programmers still think memory is "flat," that is, a resource whose access time is independent of the program's data structures and data access patterns. This belief may bring abysmal performance.
In a modern system, accessing data in RAM may be more than 100x slower than accessing data in the CPU's registers.
To bridge this "RAM gap," we have a hierarchy of fast memory buffers - the cache memory. In Intel® and AMD® processors there are three levels of cache - L1, L2 and L3. The idea is for the program to make the most of data already resident in the faster cache memory before fetching more from the slow main memory. Data is transferred in blocks (cache lines) from memory to the cache; using the data in each block before requesting the next one from memory is critical to good performance.
The memory hierarchy is based on the principle of locality:
- Temporal locality: If we just referenced a piece of data, there is a good possibility we will reference it again soon.
- Spatial locality: If we just referenced a piece of data, there is a good possibility we are going to reference those whose addresses are next to it in memory.
How to program to enhance locality?
- Access array elements sequentially, in the order of the memory layout (in C the last index varies fastest, in Fortran the first).
- Do not mix data of different types and access patterns in the same structure.
While this is not always easy to do, a good friend will come to your rescue.
2. The compiler is one of your best friends for optimization (and you do not need to buy him a beer)
Compilers today are very smart and able to perform sophisticated automatic code optimization. However, most users ignore this fact or have only limited knowledge of the compiler's capabilities.
Compilers have up to hundreds of different flag combinations for optimizing a program. Using the flags is the easiest, fastest and least painful way to improve performance. Among the capabilities of compilers:
- Multiple levels of optimization
- Different levels of floating point accuracy
- Wide reporting capabilities
- Fine-tuned optimization techniques (loop unrolling, inlining, etc.)
- Optimization for the whole program, routines or lines of code
Caveat: Compilers are not omniscient. It is possible to write code in such a way that no compiler will be able to optimize it, and compilers will not perform an optimization when they have doubts about the correctness of the results.
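A classic example of such a doubt is pointer aliasing in C: the compiler cannot vectorize a loop if the arrays might overlap. A minimal sketch, using the C99 restrict qualifier as one way to remove the doubt:

```c
#include <stddef.h>

/* The compiler must assume a, b and c may overlap, so it cannot
   safely vectorize: a store through c could change later loads
   through a or b. */
void add_maybe_aliased(double *a, double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* C99 restrict promises the compiler that the three arrays do not
   overlap, removing the doubt about correctness; the optimization
   report should now show the loop as vectorized. */
void add_restrict(const double *restrict a, const double *restrict b,
                  double *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The vectorization report (see below) is the place to check whether the compiler actually took the hint.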
With a few simple flags, the compiler can produce reports on optimization, vectorization, etc., which show what it was able to improve - but more importantly, what it could not improve. A complication is that these reports may be huge. So where to look? Start with the parts of the code where the most time is spent. More on this in the next section, Identify your hotspots.
Changing the optimization level may help shorten the program's execution time. For example, try using -O3 instead of -O2. Although GNU compilers are included in the standard Linux distribution, it is recommended that you work with commercial compilers from Intel® and Portland Group, as they provide better performance than the open source compilers, justifying the additional cost.
3. Identify your hotspots
When you optimize your program, the first question is: how do you find where it spends most of its time, so you can invest the optimization effort there first? In optimization lingo - where are the hotspots in the program?
SGI recommends these user-friendly analysis tools:
- MPInside: An MPI profiling tool which requires no recompilation or relinking, and provides valuable information to help MPI application developers to optimize their application, providing information on synchronization overhead, data transfer matrices, etc. MPInside is included in SGI® MPI and SGI® Performance Suite.
- Rogue Wave® ThreadSpotter™: A sophisticated tool to analyze memory bandwidth and latency, data locality, and thread communications/interaction, to pinpoint performance issues.
- PerfSuite: Open Source tool from NCSA which allows you to take performance measurements of unmodified executables and perform aggregate data collection ("counting mode") or do statistical profiling ("profiling mode").
- Intel® VTune™ Amplifier XE: Intel's VTune Amplifier XE suite provides advanced profiling for parallel applications and enables performance optimizations.
While very user friendly and intuitive, these utilities are also packed with lots of sophisticated options for analysis. Spend some time to get to know them. A few recommendations while profiling:
- Use a data set that reflects your actual run and not a smaller one "to run it faster."
- You do not need a 24-hour run to profile your program. A shorter run may be good enough, provided it spends enough time in the main calculation. Take care to separate the timing of the initialization/wrap-up phases from the steady-state part.
- Try to profile on a quiet system (the same goes for scalability runs). On a system under heavy use, your results may be biased by the load of the other programs.
4. Optimize serial execution
Before running on a large number of cores, make sure your program is as efficient as possible in serial mode, on a single CPU. An inefficient serial program will only be more inefficient when run in parallel.
Before you start the big work of parallelization, make sure your program is as optimal as possible. Check for the hotspots and explore anything else that can be done to improve performance. A profiler like PerfSuite and the compiler optimization reports for the hotspots are very useful.
Caveat: A routine that is the most time consuming in single CPU execution is not necessarily what will take the most time at 100, 1,000 or 10,000 cores. Communication and I/O may potentially take the most time at the very large core counts. Profiling your program serially does not mean you do not need to do it in parallel and for different core counts - forewarned is forearmed!
5. The unavoidable evils - synchronization and communication
Synchronization and communication are necessary in parallel computing, but they are overhead on top of the real work: time a process/thread spends communicating or synchronizing is time it is not doing real calculation. Tips to improve performance:
- Keep synchronization and communication at a minimum. Check if there is a collective communication or a communication pattern that is less time consuming.
- Once your MPI program is running correctly, try to swap blocking communication routines with nonblocking ones to potentially improve performance.
- Tune the SGI UV communication environment variables for better performance.
- Test the ratio of computation to communication for different numbers of cores in your workflow.
- Use optimized communication libraries in SGI MPI.
- Take advantage of SGI MPI PerfBoost, which uses SGI Message Passing Toolkit on the fly with programs linked with other MPI libraries such as Intel MPI, Platform MPI™, MPICH and OpenMPI. You get SGI Message Passing Toolkit performance improvements for your program without recompiling.
- For hybrid parallelism with MPI and OpenMP, use SGI MPI's omplace command to properly distribute MPI ranks and OpenMP threads on SGI UV systems and distributive clusters.
- Use PerfSuite and SGI MPInside to identify bottlenecks.
6. Load balancing
Load balancing is another critical factor in parallel computing and scalability. If your processes/threads are not load balanced, they do not perform the same amount of work, and some sit waiting for the others to complete. Those unused computation cycles are a direct hit on performance and scalability. SGI provides several tools to analyze load balancing in programs, such as PerfSuite and MPInside.
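One common source of static imbalance is splitting n work items among p workers with a plain n/p division, which dumps the whole remainder on one worker. A minimal C sketch of an even partition (the helper names chunk_size/chunk_start are illustrative, not from any SGI tool):

```c
/* Even partition of n work items among p workers: the first n % p
   workers get one extra item, so chunk sizes differ by at most one. */
long chunk_size(long n, long p, long rank) {
    return n / p + (rank < n % p ? 1 : 0);
}

/* Index of the first item owned by this rank, consistent with
   chunk_size(): earlier ranks each add their own chunk size. */
long chunk_start(long n, long p, long rank) {
    long base = n / p, rem = n % p;
    return rank * base + (rank < rem ? rank : rem);
}
```

For example, 10 items over 4 workers gives chunks of 3, 3, 2, 2 instead of 2, 2, 2, 4.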
7. Be aware - NUMA is everywhere
SGI UV® 2000 is a sixth-generation ccNUMA system. Today, commodity systems with x86-64 processors have evolved to use NUMA, too. Cool. But what is NUMA and why is it so important for performance? NUMA means Non-Uniform Memory Access. In a NUMA system the memory is physically distributed among processors/nodes, but a layer of hardware and software infrastructure makes it appear as one large memory space. However, a process reaches its closest memory faster than a remote memory - in HPC lingo, there is "larger latency to the remote memory." As a result, there is a penalty for accessing non-local memory, which has a negative impact on performance. There is no need to worry, since there are many things you can do to reduce the penalties and improve performance.
- Pinning processes/threads: Do not allow threads to migrate to another node. This prevents remote access to the previously accessed "local memory." SGI MPI and Intel MPI pin MPI processes by default to successive CPUs in the cpuset they are contained in. SGI UPC pins UPC threads in the same way.
- Locality of memory: The program is written such that it accesses mostly local variables, supports parallel initialization of arrays, etc. Pinning processes goes a long way towards ensuring locality of memory for most arrays a program will reference.
- Optimize cache access: Optimal use of the cache, or cache-friendly programming, reduces the impact of remote memory access. Also be aware of, and avoid, issues like false sharing.
- Use numactl to manage resources: numactl is a command line tool for reviewing the NUMA node configuration of a system (numactl -H). Knowing the NUMA node configuration of an SGI UV® system will help you use system resources efficiently. The numactl command outputs the NUMA nodes, assigned core IDs, and available memory on each node.
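To illustrate the false-sharing point above: two counters updated by different threads but sitting in the same cache line will bounce that line between cores on every update. A minimal C sketch of the usual fix, padding each counter to its own line (64 bytes is assumed as the cache-line size - typical for x86, but verify for your CPU):

```c
#include <stddef.h>

/* Assumed x86 cache-line size; check your CPU's actual value. */
#define CACHE_LINE 64

/* Each counter occupies a full cache line, so a thread updating its
   own counter never invalidates the line holding another thread's
   counter (no false sharing). */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* With C11 the same intent can be written as:
   struct padded_counter { _Alignas(CACHE_LINE) long value; }; */
```

The cost is wasted memory per counter, which is almost always a good trade for hot, per-thread data.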
8. How to avoid programming grief
Before performance tuning, be sure your code gives correct answers! Try to answer the following questions:
- Same results with different levels of compiler optimization?
- Are the results of your parallel code independent of the number of threads or processes?
- Have you ever checked array boundaries, subroutine arguments and more in your code (i.e., with -check flag)?
- Have you checked that your code is not causing floating point exceptions?
- Have you made any assumptions regarding storage of data objects in memory?
Once you hit a bug - even if you belong to the "write/printf sect" - the following tools will help you identify the issue without growing more gray hair:
- IDB: The Intel Debugger, included with Intel Composer XE. IDB can debug both single and multithreaded applications, serial and parallel code.
- RogueWave® TotalView®: TotalView is a GUI-based source code defect analysis tool. It allows debugging of programs with one or many processes and/or threads, and supports OpenMP, multiple MPI libraries and CUDA (GPU).
- Allinea DDT: Supports many parallel programming models including OpenMP, MPI, GPUs, SGI UPC, CUDA, etc.
- GDB: The eternal GNU debugger.
9. Don't reinvent the wheel - use mathematical libraries
When you need to implement a mathematical routine in your program (or use the one you wrote yourself in a university numerical programming course years ago), check if it is implemented in one of the many high performance mathematical libraries on the market. You cannot compete with the work of a dedicated team of mathematicians and software engineers who focus on developing, maintaining and optimizing these libraries. Some examples include Intel® Math Kernel Library (MKL), Intel® Integrated Performance Primitives (IPP), Rogue Wave® IMSL®, ATLAS, ACML and FFTW. Most of these libraries are already parallelized.
Is your code running on the Intel® Xeon® processor E5 product family? If so, it's time to recompile your code and link with Intel MKL to take advantage of Intel's new Advanced Vector Extensions (AVX). Build your app with the "-xAVX" option and you should see a nice performance boost from the new extensions.
10. Monitor the system
Monitor your program's run-time interaction with the system in order to identify issues like overuse of memory, over-subscription, sleeping processes, process placement, etc. Learn the basic tools of the trade for system analysis: top, sar, strace, lsof, nodeinfo, dstat. If you use only one, "top" should always be in your quiver of tools, provided you have the correct resource columns displayed. Make sure the last-used-CPU column is displayed and sort on it to verify you are not over-subscribing cores; by default, top does not display this column.
11. Measure scalability
Congratulations! You just completed the parallelization of your program. Now it is time to check its scalability. The unit of measure you can use is "scale-up," defined as the ratio of the time to complete the program on one CPU to the time to complete it on N CPUs. We like to see linear scalability, meaning that if the program runs on 16 CPUs, it runs 16x faster than on one CPU. In general, however, this is not the case, as factors such as communication, synchronization, serialization, etc. influence the runtime.
Scalability will show you the "return on investment" while running in N CPUs. This will help you to assess the quality of the parallel implementation and, for a given problem size, to decide which is the right number of CPUs to use. If your program scalability flattens at 128 cores, there is no reason to run at 256 cores even if they are free.
Be aware that many different mechanisms are at work in a parallel application, so the behavior of its serial version may not predict its scalability. For example, the memory footprint and the amount of work per core may be very different between a serial program and its parallel version. This is why the distinction between strong and weak scalability is important:
- Strong scalability: How quickly a program can complete running on a particular data set by increasing the processor count.
Amdahl's Law: The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program (and overhead).
- Weak scalability: How to run a larger problem in approximately the same amount of time by increasing the processor count.
Gustafson's Law: Computations involving arbitrarily large data sets can be efficiently parallelized.
At the end of the day, the scalability of your program will always be limited by some factor - serial portion of the program, communication/synchronization overhead, cache invalidation, I/O, reaching a limit value on speedup. Did we already say not to use more than you need?
12. SGI® UV Tuning
The previous sections discuss topics that are almost "universal" in optimizing for modern computer architectures. However, every implementation has its "specific bits," and the SGI UV 2000 is no exception. Here are several recommendations for effective use of the capabilities of the SGI UV 2000.
In a processor that supports HyperThreading, each physical core appears to the operating system as two "logical cores." If you think "Eureka! Two for the price of one" - think again. Although the operating system can schedule two threads simultaneously, the core's resources are not duplicated. Therefore most applications show only small variations in performance when using HyperThreading.
For SGI UV 2000, most high performance computing MPI programs run best using only one process per core and using placement tools. How do you identify who's who among the processes and cores, and use the placement tools correctly? When an SGI UV system has multiple HyperThreads per core, logical CPUs are numbered such that the HyperThreads are in the upper half of the logical CPU numbers. Confused? Use the cpumap(1) or numactl -H commands to determine whether cores have multiple HyperThreads on your SGI UV system. The output of cpumap reports the number of physical and logical processors, whether HyperThreading is on or off, and how shared processors are paired. Scheduling only on the primary HyperThreads may be accomplished by scheduling MPI jobs as if only half the full number of CPUs exists, leaving the high logical CPUs idle. When an MPI job does use the logical CPUs, set GRU_RESOURCE_FACTOR to 2 so the MPI processes can utilize all available GRU resources on a Hub rather than reserving some of them for the idle HyperThreads.
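Under the numbering scheme just described (HT siblings in the upper half of the logical CPU numbers), the logical-to-physical mapping can be sketched as below. These helpers are purely illustrative and assume exactly two HyperThreads per core - always confirm the actual layout with cpumap(1) or numactl -H:

```c
/* With HyperThreading on and HT siblings numbered in the upper half,
   logical CPUs 0..ncores-1 are the primary threads and
   ncores..2*ncores-1 are their siblings: sibling(i) = i + ncores. */
int physical_core(int logical_cpu, int ncores) {
    return logical_cpu % ncores;
}

/* Nonzero if this logical CPU is in the lower (primary) half, i.e.
   one you would schedule on when running one process per core. */
int is_primary_thread(int logical_cpu, int ncores) {
    return logical_cpu < ncores;
}
```

For example, on a host with 8 physical cores, logical CPU 12 is the HT sibling of physical core 4, so a one-process-per-core job would use only logical CPUs 0 through 7.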
- Tuning environment variables
SGI MPI's Message Passing Toolkit can improve the bandwidth of large messages if MPI_GRU_CBS is set to 0. This favors large-message bandwidth at the cost of suppressing asynchronous MPI message delivery. In addition, some programs transfer large messages via the MPI_Send function. To switch on the use of unbuffered, single-copy transport, set MPI_BUFFER_MAX to 0. See the MPI(1) man page in the SGI Technical Publications Library (techpubs.sgi.com) for more details.
Small or near-neighbor MPI messages are very frequent. For small fabric hop counts, shared memory message delivery is faster than GRU messages. To deliver all messages within an SGI UV host via shared memory, set MPI_SHARED_NEIGHBORHOOD to "host" on SGI UV systems.
MPI application processes normally perform best if their local memory is allocated on the socket assigned to execute them. This cannot happen if memory on that socket is exhausted by the application or by other system consumption, for example, the file buffer cache. In that case, the double-edged sword of NUMA shows up: your process will get memory, but it will be allocated on another socket and accessed remotely, with larger latency. Use the nodeinfo(1) command to view memory consumption on the nodes assigned to your job. You can also use the dlook command to check where your program allocated its memory pages.
The MPI_SHARED_NEIGHBORHOOD environment variable controls the choice of shared memory or GRU references for communication between MPI processes on the same host, and defines the "neighborhood" size. No need to turn to Web Translate - this simply means MPI processes in the same "neighborhood" communicate with each other via shared memory. The default neighborhood is all the memory in the blade, but you can define a smaller or larger neighborhood that may better suit your program.
- MPI one-sided put/get
The use of the MPI one-sided primitives put/get on SGI UV can lead to significant performance improvements due to lower communication cost in comparison with the traditional two-sided communication primitives (send, receive, etc.). This improvement comes from two sources:
- Significantly lower communication latencies
- Reduced number of synchronization barriers
- Process/Threads Placement
Finally, do not forget SGI tools dplace and omplace for process/threads placement! Remember NUMA.