Bringing multicore computing to VME

With Moore's Law forging a shift toward multicore, VME system designers face migration challenges, starting with which type of migration is most appropriate: application consolidation or application acceleration. Another important consideration is whether to use an Asymmetric Multiprocessing (AMP) or Symmetric Multiprocessing (SMP) configuration, or whether multicore is a viable option for the VME system at all. Performance, in its many aspects, also plays a key role in multicore migration.

While semiconductor technology has continued following Moore's Law, the method for translating these additional transistors into increased application performance has changed. Ever more complex chips with continually increasing clock speeds create a power problem for processors, resulting in a turn toward multicore architectures. This shift toward multicore computing creates challenges for VME system designers seeking to increase the performance or reduce the cost of existing applications.

The typical VME system design utilizes multiple processor blades, each containing a single processor executing an independent process. Although they may exchange information, each processor functions independently as part of a cluster computing structure. The system is not necessarily homogeneous; each blade may use a different processor type or operating system, and processors are typically not x86 architectures.

Most VME system designs are mature and stable, but still may require migration to new processor architectures due to obsolescence issues or an increase in performance demands beyond what a blade can provide. System developers may also want to reduce system costs by combining several blade functions into a single, more capable blade design.

Realizing application performance gains on a new processor, however, is no longer as simple as recompiling. Processor architectures have turned to multicore approaches, which increase performance by placing two or more processor cores in a single package rather than by raising clock speed or architectural efficiency. This leaves VME developers with a fundamental problem: how to migrate applications from single-processor/single-core environments to multiprocessing/multicore environments.

The specific challenges developers will face depend on the type of migration they follow. There are two basic types. One is application consolidation, combining several applications onto a single multicore processor blade. This type of migration is typically done to reduce cost or to increase a rack's functional density. The other migration type is application acceleration, moving a single application to a multicore processor blade to achieve higher performance in that application (see Figure 1). These migration types lend themselves to a choice of two multiprocessing configurations – Symmetric Multiprocessing (SMP) or Asymmetric Multiprocessing (AMP); their effects on performance are a key factor. The question of whether switching to multicore is viable for the VME application at all must also be considered.

Figure 1

Multiprocessing styles

Multicore processors can operate in one of two multiprocessing configurations: SMP or AMP. They are fundamentally different computing environments.

Symmetric Multiprocessing

In an SMP configuration, the multiple cores have equal access to and share utilization of system resources, including main memory, operating system, and I/O. The system's software tasks can run on any core available and may run on a different core each time they execute. A single OS controls all the tasks, coordinating use of system resources and assigning tasks to cores to keep CPU utilization (the "load") balanced. This SMP model is utilized in desktop computing and servers, and is the multiprocessing model that most general-purpose operating systems support.

Asymmetric Multiprocessing

The AMP configuration runs an independent copy of the operating system, or even different operating systems, on each processor core and is often used in real-time systems to minimize interactions between tasks. A multiaxis servo system, for instance, might assign each axis to its own core running independent copies of the servo control program. An industrial automation system, on the other hand, might run an RTOS for machine control on one core and Windows for the user interface on another core.

Because of the hard partitioning between the operating systems, tasks are assigned only to specific cores for execution. As a result, each core may require sole access to some system resources, such as a memory block or peripheral. There may, therefore, need to be mechanisms in place to allocate resources to each core and protect system resources from the actions of other cores. In the AMP model, achieving load balance requires manual assignment of tasks and resources to each core, keeping CPU utilization in mind.
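The allocation and protection described above can be written down as a static plan and checked before anything runs. The following is a minimal sketch, not a real AMP framework: the `CorePlan` type, the task names, the load figures, and the resource names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CorePlan:
    """Hypothetical static plan for one core in an AMP configuration."""
    os_name: str
    tasks: dict[str, float] = field(default_factory=dict)  # task -> CPU load (0..1)
    resources: set[str] = field(default_factory=set)       # resources owned exclusively


def validate_plan(plan: dict[int, CorePlan], capacity: float = 1.0) -> list[str]:
    """Return a list of problems: overloaded cores and resources claimed twice."""
    problems = []
    owner: dict[str, int] = {}
    for core, cfg in sorted(plan.items()):
        load = sum(cfg.tasks.values())
        if load > capacity:
            problems.append(f"core {core} overloaded: {load:.2f} > {capacity:.2f}")
        for res in sorted(cfg.resources):
            if res in owner:
                problems.append(f"{res} claimed by cores {owner[res]} and {core}")
            else:
                owner[res] = core
    return problems


# Assumed example: an RTOS on core 0 and a GUI operating system on core 1.
plan = {
    0: CorePlan("RTOS", {"machine_control": 0.55}, {"can0", "mem_block_a"}),
    1: CorePlan("GUI OS", {"user_interface": 0.30}, {"display", "mem_block_b"}),
}
print(validate_plan(plan))  # an empty list means no conflicts were found
```

A check like this only catches conflicts in the plan itself. Enforcing the partition at run time still depends on whatever protection mechanisms the processor and operating systems provide.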

Processors for SMP and AMP

Most multicore processors, such as the x86 architectures, are extensions of multiprocessing architectures that were developed to serve desktop computing. Thus they and the software available for them, such as Linux and Windows, favor the SMP configuration. Among such multicore processors, the sharing of L2 cache, memory controller, and peripheral bus is typically built into the architecture. Some processors, however, do provide separate resources. The Freescale 8641D, for instance, provides independent L2 cache and memory control for each of its cores. These features may make it a more natural fit for developers pursuing an AMP-style migration.

Migrating to multicore

As mentioned, the migration of a VME application to a multicore processor will depend on the migration path desired, the software processes involved, and the target processor.

The easiest migration is application consolidation to an SMP configuration, when functions on two or more boards that employ the same OS migrate to a multicore processor. If independent peripheral resources are available for each core and there is sufficient processing capacity available, this type of migration can be simple to implement. All the developer needs to do is to execute the applications under the combined OS, letting the OS scheduler handle load balancing.

Application consolidation might follow an AMP migration when different operating systems are involved, so that developers can assign each OS to a different core. An AMP migration may also be required when resources need protection and tasks demand isolation because they can interact, or when careful load balancing is necessary because the combined load approaches processor capacity. In these cases, however, the migration can become complex. Developers must evaluate the I/O and processor utilization for each task and manually assign tasks to cores to keep resource and processor load balanced. The assignments must then be evaluated with the processes running concurrently, both to check for interactions between cores and to ensure that minimum performance levels are met under worst-case conditions. If processes run under different operating systems, developers may consider using virtualization to simplify partitioning and isolation for each process. Virtualization, however, lowers performance by adding overhead, adds another layer of complexity to the migration effort, and may not be available for all target processor architectures.
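As a starting point for the manual assignment, developers can let a simple heuristic propose a placement and then review it by hand. The sketch below uses a greedy "heaviest task first" rule, which is a common heuristic and not a method prescribed by this article. The task names and worst-case load figures are hypothetical.

```python
def assign_tasks(worst_case_load: dict[str, float], n_cores: int) -> list[list[str]]:
    """Greedy heuristic: place the heaviest remaining task on the least-loaded core."""
    cores: list[list[str]] = [[] for _ in range(n_cores)]
    totals = [0.0] * n_cores
    # Sort by descending load; break ties by name so the result is repeatable.
    for task, load in sorted(worst_case_load.items(), key=lambda kv: (-kv[1], kv[0])):
        target = totals.index(min(totals))
        cores[target].append(task)
        totals[target] += load
    return cores


def core_loads(cores: list[list[str]], worst_case_load: dict[str, float]) -> list[float]:
    """Total worst-case CPU load placed on each core."""
    return [sum(worst_case_load[t] for t in tasks) for tasks in cores]


# Assumed worst-case CPU load of each task, as a fraction of one core.
worst_case = {"radar_io": 0.45, "tracker": 0.35, "display": 0.25, "logger": 0.15}
placement = assign_tasks(worst_case, n_cores=2)
print(placement, core_loads(placement, worst_case))
```

The heuristic looks only at CPU load. I/O contention and interactions between cores still have to be measured with the processes running together.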

In application consolidation, the application code itself typically needs little or no modification because it is already in the form of independent processes. In application acceleration, however, the program must typically be rewritten into multiple processes or threads in order to take advantage of the additional processor cores. When splitting a single application into smaller tasks, the hard partitioning of AMP, with its manual partitioning and attendant risks and effort, offers few advantages. As a result, the SMP configuration is the typical target for application acceleration migrations. The migration choice then becomes how to best recode the application to realize performance gains with a minimum amount of effort.

When migrating a single application to SMP multicore, the application needs to be rewritten to run as a set of smaller, independently executable units so that the OS can control their sharing of system resources and take advantage of the parallelism offered by more than one core. These units can be relatively large (processes) or quite small (threads). Rewriting the application for multithreaded operation, though, can be an order of magnitude more time-consuming than rewriting for multiprocess operation because threads are more likely to interact and create debugging challenges. The advantage of multithreaded operation, however, is more efficient utilization of processing capacity.
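The restructuring described above can be illustrated with a small sketch: the input is cut into chunks, each chunk becomes an independent unit of work with no shared state, and the partial results are combined at the end. The workload (`sum_of_squares`) and the chunking helper are hypothetical. The executor is passed in so that the same code can run its units as threads or as processes.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def split(samples: list[int], n_units: int) -> list[list[int]]:
    """Cut the input into n_units contiguous chunks of near-equal size."""
    size, extra = divmod(len(samples), n_units)
    chunks, start = [], 0
    for i in range(n_units):
        end = start + size + (1 if i < extra else 0)
        chunks.append(samples[start:end])
        start = end
    return chunks


def sum_of_squares(chunk: list[int]) -> int:
    """Hypothetical per-unit workload that touches no shared state."""
    return sum(x * x for x in chunk)


def run_partitioned(samples: list[int], n_units: int, executor: Executor) -> int:
    """Run one unit of work per chunk and combine the partial results."""
    return sum(executor.map(sum_of_squares, split(samples, n_units)))


with ThreadPoolExecutor(max_workers=4) as pool:
    total = run_partitioned(list(range(1000)), 4, pool)
print(total)
```

Passing a `ProcessPoolExecutor` instead gives the coarser, multiprocess form of the same design. Note that in standard CPython, threads do not execute CPU-bound Python code in parallel, so the threaded run here shows the structure of the rewrite rather than the speedup.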

Implementing fine-grained multiprocessing requires that the processes be synchronized as well as architected for data flow through shared resources. Program code might need to be hand optimized for processor and resource utilization. Because of explicit resource partitioning, AMP may be more appropriate for this type of migration. This approach requires the most software development, but results in the highest performance efficiency.

Performance issues

Achieving substantial performance improvements by moving to multicore processing is not guaranteed, however, regardless of configuration. Even for perfectly parallel code, the maximum improvement attainable by adding another core to a group of N cores is a fraction 1/N of the current performance. Thus, moving from one core to two may double performance, but adding a third can give only an additional 50 percent boost, and so on. Amdahl's Law tightens this limit further: any portion of the code that must execute serially gains nothing from added cores, so it caps the overall speedup.
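The diminishing return can be checked numerically. Amdahl's Law in its usual form gives the speedup on N cores as 1 / ((1 - p) + p / N), where p is the fraction of the work that can run in parallel. The sketch below evaluates it for two values of p chosen for illustration: 1.0, which reproduces the doubling and 50 percent figures above, and 0.9, which is an arbitrary assumption and not a measurement.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup over a single core when parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def incremental_gain(parallel_fraction: float, n_cores: int) -> float:
    """Relative gain from adding one more core to a group of n_cores."""
    before = amdahl_speedup(parallel_fraction, n_cores)
    after = amdahl_speedup(parallel_fraction, n_cores + 1)
    return after / before - 1.0


for p in (1.0, 0.9):
    for n in (1, 2, 3):
        gain = incremental_gain(p, n)
        print(f"p={p:.1f}: going from {n} to {n + 1} cores adds {gain:.0%}")
```

With p below 1.0 the gain from each added core falls off faster, and the total speedup can never exceed 1 / (1 - p) no matter how many cores are added.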

In practice, the actual performance increase attained depends on factors such as congestion among shared resources. Without careful design, performance will be substantially less than Amdahl's Law predicts. For example, two processes on one core that each generate 100 messages per second cannot scale to faster operation by moving to separate cores if they must still share I/O that can only handle 200 messages a second.
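The message example above reduces to one line of arithmetic: the delivered rate is the smaller of what the cores offer and what the shared I/O can carry. In the sketch below, the 100 and 200 messages-per-second figures come from the example, while the 180 messages-per-second rate for each process after migration is a hypothetical assumption.

```python
def delivered_rate(offered_rates: list[int], io_capacity: int) -> int:
    """Messages per second that actually get through a shared I/O channel."""
    return min(sum(offered_rates), io_capacity)


IO_CAPACITY = 200  # messages per second, from the example in the text

before = delivered_rate([100, 100], IO_CAPACITY)  # both processes on one core
after = delivered_rate([180, 180], IO_CAPACITY)   # assumed rates on separate cores
print(before, after)
```

Both cases deliver the same throughput, so the extra core adds cost without adding performance until the shared I/O is upgraded or split.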

The impact of resource sharing in multiprocessing thus requires careful evaluation when migrating VME designs to multicore. Processes trying to use congested resources may experience greater latencies, increasing linearly with each added core, because data throughput must be shared among the multiple cores. Processes may also experience deadlocks, where two threads are each waiting for the other to release a resource so that they can finish execution. A related condition, priority inversion, is a particular problem for real-time systems. In priority inversion, a high-priority thread has its execution delayed by a low-priority thread that has control over a shared resource the high-priority thread requires, unless the code provides a mechanism for forcing the transfer of control.
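One common way to rule out the two-thread deadlock described above is to acquire locks in a single, agreed order everywhere in the code. The following is a minimal sketch of that convention. The `Resource` class, its `rank` field, and the buffer names are hypothetical, and the sketch addresses deadlock only, not priority inversion.

```python
import threading


class Resource:
    """Hypothetical shared resource with a fixed position in the lock order."""

    def __init__(self, name: str, rank: int) -> None:
        self.name = name
        self.rank = rank              # assumed convention: lower rank locks first
        self.lock = threading.Lock()
        self.value = 0


def update_both(src: Resource, dst: Resource, amount: int) -> None:
    """Move amount from src to dst, always locking in rank order."""
    low, high = sorted((src, dst), key=lambda r: r.rank)
    with low.lock:
        with high.lock:
            src.value -= amount
            dst.value += amount


def worker(src: Resource, dst: Resource, iterations: int) -> None:
    for _ in range(iterations):
        update_both(src, dst, 1)


def run_demo(iterations: int = 1000) -> tuple[int, int]:
    a, b = Resource("buffer_a", 1), Resource("buffer_b", 2)
    # The two threads name the resources in opposite order, which would risk
    # deadlock if each simply locked its first argument first.
    threads = [
        threading.Thread(target=worker, args=(a, b, iterations)),
        threading.Thread(target=worker, args=(b, a, iterations)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.value, b.value


print(run_demo())
```

Because both threads take `buffer_a` before `buffer_b`, neither can hold one lock while waiting for a thread that holds the other.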

Another common resource sharing challenge in multiprocessing is one process corrupting memory that another process is using. This can easily occur when two board processes get combined onto a single multicore processor, especially if they have similar memory maps. It is also possible for a single process split across two cores to create this condition if different tasks utilize the same memory space and execute simultaneously on two cores.

Something else to consider when migrating VME to multicore is that unbalanced processor loading can limit performance gains. When capacity utilization is low, load balancing is less important; there is plenty of growth room in each core. As the load grows, however, problems can arise when load balancing is inadequate. A coarse partition, for instance, might break a process into three tasks, each loading a core to about 40 percent of capacity. On a dual-core processor, manual load balancing would put two tasks on one core and one on the other (Figure 2). While this partitioning works initially, it has limited room to scale. A 25 percent increase in demand would saturate the core running two tasks while the other core still has 50 percent of its capacity unused.

Figure 2

If multiple copies of a process can execute in parallel, replicating the process on every core, rather than partitioning it across cores, may provide better scaling. To meet a given application demand on a dual-core processor, for instance, each copy of the process would handle half the work. With this approach, both cores carry the same load, and the range of scaling possible is greatly increased.
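The difference in scaling room can be worked out from the three-task example, in which each task loads a core to about 40 percent. The sketch below compares the partitioned layout with a replicated one that splits the same total work evenly. It assumes that load grows in proportion to demand, which is a simplification.

```python
def headroom(core_loads: list[float]) -> float:
    """Factor by which demand can grow before the busiest core saturates."""
    return 1.0 / max(core_loads)


TASK_LOAD = 0.40  # each of the three tasks, as a fraction of one core

partitioned = [2 * TASK_LOAD, 1 * TASK_LOAD]  # two tasks on one core, one on the other
replicated = [3 * TASK_LOAD / 2] * 2          # same total work, split evenly

for name, loads in (("partitioned", partitioned), ("replicated", replicated)):
    growth = headroom(loads)
    idle = [1.0 - load * growth for load in loads]
    print(f"{name}: demand can grow {growth:.2f}x; idle capacity at that point: "
          + ", ".join(f"{max(i, 0.0):.0%}" for i in idle))
```

Under these assumptions the partitioned layout saturates after demand grows by a quarter, leaving half of the second core idle, while the replicated layout absorbs roughly two-thirds more demand before either core saturates.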

Justifying migration

In most cases, developers will need to justify the migration of a VME system to multicore processors by estimating development efforts as well as anticipated performance increases and production cost savings. The interacting details of application, resource sharing, and multiprocessing configuration choice, however, make estimation difficult. Unexpected software interactions, resource contention, operating system mismatches, and the challenges of effective load balancing can all complicate the development effort as well as increase the likelihood of suboptimal performance.

Developers considering this move, however, have help available. Multicore systems vendors like Emerson Network Power have experience and expertise in handling multiprocessing. They can assist developers in assessing the application's potential for migration to multicore and help estimate the performance increases they can reasonably expect. Migrating an existing VME design to a new board in order to utilize the latest generations of multicore processors may not be simple. However, with careful planning and the assistance available, this can be successfully achieved, lowering system costs and increasing performance.

Doug Sandy is the senior staff technologist of the technology and strategy department for the Embedded Computing business of Emerson Network Power. He is responsible for evaluating the performance of computer systems, constructing models for the systems, and predicting the systems' behavior. He also focuses on the strategic development of computing, networking, memory, and storage trends. He holds bachelor's and master's degrees in Electrical Engineering from California Polytechnic State University, San Luis Obispo, California.

Emerson Network Power
602-438-3392
www.emersonnetworkpower/embeddedcomputing