Eric Stotzer

Multicore processors are changing application development in all areas of computing. Because they offer higher performance with lower energy consumption, multicore processors are being used in more embedded applications than ever and are enabling entirely new applications. Software for embedded systems is becoming more complex as multicore hardware enables more functions to be implemented on the same device. This complexity makes software production the critical path in embedded systems development. A high-level programming model such as OpenMP has the potential to increase programmer productivity, which, in turn, should reduce design and development costs and time to market for embedded systems. In this article, we discuss the initial work of adapting an existing shared memory programming interface, OpenMP, to enable the specification of parallel programs on an embedded multicore processor.

OpenMP is a popular standard for shared memory parallel programming in C, C++, or Fortran. It provides portable high-level programming constructs that enable users to easily expose a program's task and loop level parallelism in an incremental fashion. With OpenMP, users specify the parallelization strategy for a program at a high level by annotating the program code with compiler directives that specify how a region of code is executed by a team of threads. The compiler works out the detailed mapping of the computation to the machine. 

Figure 1: The master thread spawns a team of threads as needed. Parallelism is added incrementally until the desired performance is achieved; i.e., the sequential program evolves into a parallel program.

As shown in Figure 1, OpenMP is a thread-based programming model. The master thread executes the sequential parts of a program. When the master thread encounters a parallel region, it forks a team of worker threads that, along with the master thread, execute the region in parallel. The OpenMP programming API enables the programmer to perform the following: 

• Create and manage threads 
• Assign and distribute work (tasks) to threads 
• Specify which data is shared among threads and which data is private 
• Coordinate thread access to shared data 
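The fork-join behavior described above can be sketched in a few lines of C. This is a minimal illustration, not code from the article: the region body runs once per team member, and the team joins at the implicit barrier that closes the region. The pragma degrades gracefully, so the function also builds and runs correctly (with a count of 1) on a compiler without OpenMP support.

```c
/* Minimal fork-join sketch: the master thread forks a team at the
 * parallel region; each thread executes the region body once. */
int run_region(void) {
    int count = 0;          /* shared: one copy visible to the whole team */
    #pragma omp parallel
    {
        #pragma omp atomic  /* each team member checks in exactly once */
        count++;
    }                       /* implicit barrier: threads join the master here */
    return count;           /* number of threads that executed the region */
}
```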

Figure 2: An example of data parallelism: a parallel-for loop where each thread executes a chunk of the loop and the intermediate results are reduced to a final result. A single copy of x[] and c[] is shared by all threads.

Migration of an existing code base is fairly easy because parallelism is expressed with C/C++ directives (#pragma). As shown in Figure 2, OpenMP directives specify that a well-structured region of code is executed by a collection of threads that share in the work. Worksharing directives distribute that work among the participating threads. The programmer incrementally adds OpenMP pragmas to an existing sequential application, allowing code to be ported quickly to a multicore platform.
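The pattern in Figure 2 can be sketched as a parallel-for loop with a reduction. The array names x and c follow the figure; the function name and signature are illustrative. Each thread executes a chunk of the iterations, and the reduction clause combines the per-thread partial sums into the final result; without OpenMP the pragma is ignored and the loop runs serially with the same answer.

```c
/* Data-parallel sketch of Figure 2: a parallel-for with a reduction.
 * x[] and c[] are shared; 'sum' gets a private partial copy per thread,
 * and the partial sums are reduced into the final result. */
double dot_reduce(const double *x, const double *c, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * c[i];
    return sum;
}
```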

The specification for OpenMP is maintained by the OpenMP Architecture Review Board (ARB), which is composed of members from government, academia, and industry. OpenMP has a long history in the high-performance computing community and is supported on all major operating systems and instruction set architectures, and by compilers including GCC. The most recent OpenMP 3.0 standard extends the OpenMP programming model to incorporate tasks: the new tasking constructs create explicit tasks, while worksharing constructs create implicit tasks. OpenMP tasks can be suspended and resumed at task scheduling points. The language is also evolving to support accelerators and heterogeneous systems. For more details on OpenMP, visit openmp.org.

OpenMP Implementation
Compilers translate OpenMP into multi-threaded code with calls to a custom runtime library. An efficient runtime library that supports thread management, scheduling, shared memory, and fine-grained synchronization is essential.

OpenMP has shared and private variables. Each thread has its own copy of a private variable that the other threads cannot access. There is only one copy of a shared variable and all threads can access it.
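The shared/private distinction can be made concrete with a short sketch (illustrative names; the per-iteration atomic update is one of several ways to coordinate access to the shared copy, chosen here for clarity rather than speed):

```c
/* Shared vs. private data in a parallel loop. */
int sum_squares(int n) {
    int total = 0;          /* shared: a single copy all threads can access */
    #pragma omp parallel for shared(total)
    for (int i = 0; i < n; i++) {   /* loop index: private to each thread */
        int sq = i * i;             /* declared in the region: private per thread */
        #pragma omp atomic          /* coordinate updates to the shared copy */
        total += sq;
    }
    return total;
}
```

In practice a reduction clause would perform better than the atomic update, but the atomic form makes the single-shared-copy semantics explicit.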

OpenMP specifies a relaxed consistency shared memory model. Threads executing in parallel have a temporary view of shared memory until they reach memory synchronization or flush points in the execution flow. At a flush point, threads are required to write-back and invalidate their temporary view of memory. After the memory synchronization point, threads again have a temporary view of memory.
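A flush point can be illustrated with a simple producer/consumer handshake (a hedged sketch with illustrative names, not code from the article): each flush forces the thread to write back and invalidate its temporary view, so the update to the data is guaranteed visible before the flag is raised. Compiled without OpenMP, the pragmas are ignored and the handshake still works serially.

```c
/* Flush points in a relaxed-consistency handshake. */
static int shared_data = 0;
static int ready_flag = 0;

void producer(void) {
    shared_data = 42;
    #pragma omp flush(shared_data)  /* write back data before raising the flag */
    ready_flag = 1;
    #pragma omp flush(ready_flag)
}

int consumer(void) {
    int f;
    do {
        #pragma omp flush(ready_flag)  /* discard temporary view; re-read flag */
        f = ready_flag;
    } while (!f);
    #pragma omp flush(shared_data)     /* ensure the data read is not stale */
    return shared_data;
}
```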

Memory Model
Although many embedded multicore processors have shared memory, its consistency is not automatically maintained by the hardware. In this case, it is the responsibility of the OpenMP runtime library to perform the appropriate cache control operations to maintain the consistency of the shared memory when required.

For shared variables, cache coherency is managed in software. When a thread updates a shared variable, the updated value is first stored in its L1 cache. At a flush point, the runtime library executes the cache control operations that synchronize the local L1 cache with shared memory. The program image is loaded into shared memory and can be accessed by all cores through their program caches.
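The write-back/invalidate behavior at a flush point can be modeled with a toy single-line "L1 cache". This is purely an illustrative model of the mechanism described above, not TI's actual runtime or cache-control API:

```c
/* Toy model of software-managed coherency: each core caches one value
 * in a private "L1" line; the runtime writes it back and invalidates
 * it at a flush point. */
typedef struct {
    int value;  /* cached copy (the thread's temporary view) */
    int valid;  /* is the cached copy usable? */
    int dirty;  /* modified since the last write-back? */
} l1_line;

static int shared_mem;          /* the single shared copy */

int core_read(l1_line *c) {
    if (!c->valid) {            /* miss: fetch from shared memory */
        c->value = shared_mem;
        c->valid = 1;
        c->dirty = 0;
    }
    return c->value;
}

void core_write(l1_line *c, int v) {
    c->value = v;               /* the update lands in L1 first */
    c->valid = 1;
    c->dirty = 1;
}

void core_flush(l1_line *c) {   /* what the runtime does at a flush point */
    if (c->dirty)
        shared_mem = c->value;  /* write back to shared memory */
    c->valid = 0;               /* invalidate the temporary view */
    c->dirty = 0;
}
```

Until both cores flush, a second core can still read a stale value from its own L1, which is exactly why the runtime must perform these operations at every flush point.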

Parallel Regions Execution Model
For each parallel region, the OpenMP compiler divides the workload into multiple micro-tasks that are then assigned to worker threads at runtime. A parallel region's fork-join mechanism is implemented via the following steps: 

• After initialization, worker threads wait for a micro-task execution notification. 
• The master thread assigns micro-tasks to worker threads by sending each a message, which includes a function pointer and a data pointer, through a message queue. 
• Upon reception of the message, each worker thread executes the micro-task specified by the function pointer. The data pointer is passed as an argument to the micro-task. 
• Upon completion of the micro-task, the worker thread sends a completion message back to the master's message queue. 
• After the master completes the execution of its own micro-task, it waits for completion messages from all the worker threads.
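The message-passing protocol above can be sketched in C. This is a serial model of the mechanism (all names are illustrative, and the worker's queue is replaced by a direct call so the sketch stays self-contained); a real runtime would enqueue each message to a separate worker thread and the master would spin on the completion count.

```c
#include <assert.h>

/* A micro-task message: the function pointer and data pointer the
 * master sends through a worker's message queue. */
typedef void (*microtask_fn)(void *data);

typedef struct {
    microtask_fn fn;
    void *data;
} task_msg;

static int completions;         /* completion messages the master has received */

/* Worker side: receive a message, run the micro-task on its data,
 * then send a completion message back to the master. */
void worker_receive(task_msg m) {
    m.fn(m.data);
    completions++;
}

/* Master side: assign one micro-task per worker, then wait for all
 * completion messages before leaving the parallel region. */
void master_fork_join(microtask_fn fn, void *chunks[], int nworkers) {
    completions = 0;
    for (int w = 0; w < nworkers; w++) {
        task_msg m = { fn, chunks[w] };
        worker_receive(m);      /* in reality: enqueue to worker w's queue */
    }
    /* the master would execute its own micro-task here, then spin
       until completions == nworkers */
    assert(completions == nworkers);
}

/* Example micro-task: increment the int that 'data' points to. */
static void inc_task(void *data) { (*(int *)data)++; }
```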

OpenMP implements a thread-based shared memory parallel programming model that supports both data- and task-level parallelism. Early experience has demonstrated that OpenMP is a suitable and productive programming model for embedded systems. One challenge is that embedded processors often lack features commonly found in general-purpose processors, such as memory management units, hardware-coherent caches, and uniform memory access times. Texas Instruments (TI) plans to address these challenges as part of extending TI's C6x compiler to support OpenMP.

About the author:
Dr. Eric Stotzer is a senior member of TI’s Software Development Organization’s compiler team. He has been with TI for 22 years focusing on software development tools, compilers, architectures and parallel programming models. Eric has a PhD in computer science from the University of Houston in Texas.