Embedded Packet Processing Software for Multicore-based Systems
4G, LTE, WiMAX, IPTV – you name it, these advanced IP-based services are upon us, and more are coming. The IP communications layer of multimedia and data-centric Next Generation Networks needs tremendous performance and sophisticated improvements to connect billions of devices, while providing a network infrastructure that enables new image- and video-based Internet services. At the same time, power consumption remains a critical factor for networking equipment.
High Performance Packet Processing for Multicore
Multicore CPU technology was introduced a few years ago. It can achieve the required level of networking performance while minimizing power consumption. To fully benefit from these CPU improvements, though, packet processing software has to be specifically designed to manage and process multiple 10Gbps streams of complex traffic.
A standard networking stack relies on OS services, so it must deal with issues such as preemption, threads, timers, and locking. This leads to performance bottlenecks, including Level 1 / Level 2 (L1/L2) CPU cache misses and pipeline stalls caused by branch mispredictions. Although a networking stack can be improved to support multicore architectures, it does not scale linearly across the many cores available on multicore CPUs. Performance can be very disappointing, and much of the CPU's potential can be wasted.
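One classic illustration of this scaling problem is shared, locked state versus per-core state. The sketch below is a minimal, hypothetical example (the structure and function names are ours, not from any particular stack): a single shared counter forces every core to contend for the same lock and cache line, while per-core, cache-line-aligned counters let each core update its own private data lock-free, with aggregation deferred to read time.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CORES 8

/* Naive design: one shared counter. Every core contends for the same
 * lock and the same cache line, so throughput degrades as cores are
 * added (lock omitted here for brevity; a real stack would take it
 * on every packet). */
struct shared_stats {
    uint64_t rx_packets;
};

/* Scalable design: per-core counters. Each core updates private,
 * cache-line-aligned state with no locking; totals are aggregated
 * only when someone reads them. */
struct percore_stats {
    uint64_t rx_packets;
} __attribute__((aligned(64)));   /* one cache line per core (GCC/Clang) */

static struct percore_stats stats[MAX_CORES];

static void count_rx(int core)
{
    stats[core].rx_packets++;     /* lock-free fast-path update */
}

static uint64_t total_rx(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < MAX_CORES; i++)
        sum += stats[i].rx_packets;
    return sum;
}
```

The same pattern (per-core queues, per-core memory pools, per-core statistics) is what allows packet processing to approach linear scaling over many cores.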
Performing packet processing outside the OS, using a low-level executive, seems at first to be an attractive solution. However, it leads to major software design issues. In the first place, IP networking stacks are very complex and over time many new protocols have been added to the initial IP-UDP-TCP stack. Developing a new stack from the ground up would be very expensive. Porting an OS stack under the executive environment is faster, but is sub-optimal as it only eliminates scheduling limitations.
Secondly, having a complete networking stack under the executive environment is inefficient, as only the processing performed on each packet needs to be improved. It would be pointless to accelerate the processing of complex signaling protocols at this level as they represent only a small fraction of the overall traffic.
Finally, an OS networking stack provides stable APIs for applications. A complete re-design of the networking stack could lead to costly, time-consuming integration and validation steps.
Splitting the networking stack is the answer: it accelerates standard OS-based packet processing while preserving the well-known APIs. The lower part (typically called the Fast Path) processes each packet in the executive environment. If that processing becomes too complex, the Fast Path transparently forwards the packet to the OS networking stack. To keep the Fast Path transparent, applications should not be aware of it and should interface with the OS stack as usual. Specific software modules must be designed for that purpose, to synchronize the applications, the kernel networking stack, and the Fast Path.
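The core of this split is the per-packet dispatch decision. The following sketch is purely illustrative (the packet fields and function names are hypothetical): the common case is handled entirely in the Fast Path, while anything unusual is handed to the full OS stack over the exception path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical packet descriptor; field names are illustrative. */
struct packet {
    bool has_ip_options;   /* unusual header -> complex processing   */
    bool fragment;         /* reassembly is handled by the OS stack  */
    int  out_port;         /* filled in by the fast path             */
};

enum verdict { FAST_FORWARDED, SENT_TO_OS_STACK };

/* Dispatch decision: handle the common case entirely in the Fast
 * Path, and transparently hand anything complex to the full OS
 * networking stack (the "exception path"). */
static enum verdict fast_path_input(struct packet *p)
{
    if (p->has_ip_options || p->fragment)
        return SENT_TO_OS_STACK;   /* slow path: OS stack takes over */

    p->out_port = 1;               /* e.g. result of a FIB lookup */
    return FAST_FORWARDED;
}
```

Because the vast majority of traffic takes the first branch, the expensive OS-stack machinery is only paid for by the small fraction of packets that actually need it.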
Taking Linux as an example, a Fast Path-based architecture provides a 7-10x performance increase for packet processing as compared to standard networking stacks. As this solution affects only the OS networking stack, the APIs are preserved. Hence integration and validation are straightforward.
High Availability Requirements
A multicore-based system provides huge processing capabilities, so a single failure in this kind of equipment can affect a very large number of users with long service interruptions, which is absolutely unacceptable. Multicore packet processing therefore needs specific mechanisms to provide High Availability (HA)-ready solutions.
A High Availability architecture relies on inactive elements that are kept out of operational use. When an active element fails, the system replaces it with an inactive one to restore the expected level of service within the shortest possible time. Several strategies can be implemented, depending on how much service interruption is acceptable.
Once a failure has been detected, an inactive element is configured to replace the failing one. This means that the whole configuration has to be restored and complete information has to be provided by the system to the new element to restart the service. Taking routing as an example, the routing protocols must be configured on the new element and must complete the route learning process before service resumes. This can take several minutes, which is incompatible with some high availability requirements.
To avoid such a long interruption of service, a more sophisticated “1+1” architecture can be implemented. A pair of elements (one active, one inactive) is used, and a dedicated process maintains a coherent view of the system in both elements by synchronizing the required information between them. If the active element fails, the inactive one has all the information ready to restore the expected level of service within a very short period of time. To synchronize routing protocols, for example, the inactive element receives all the routing table updates from the active element, ensuring that it holds exactly the same level of information. It should be noted that each Control Plane protocol (ARP/NDP, IPsec, NAT, firewall…) has its own specific information, so a dedicated synchronization mechanism is required for each of them.
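The essence of 1+1 synchronization is that every state change applied on the active element is replicated to the standby, so both tables are identical at all times. A toy sketch, with made-up structures standing in for a real routing table and a real sync channel:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

#define TABLE_SIZE 16

/* Toy routing table: route[prefix_index] = next-hop id (0 = none). */
struct element {
    uint32_t route[TABLE_SIZE];
};

struct route_update {
    unsigned prefix_index;
    uint32_t next_hop;        /* 0 deletes the route */
};

static void apply_update(struct element *e, const struct route_update *u)
{
    e->route[u->prefix_index] = u->next_hop;
}

/* On the active element, every table change is also replicated to the
 * standby, so both sides always hold the same state and the standby
 * can take over immediately. */
static void active_update(struct element *active, struct element *standby,
                          const struct route_update *u)
{
    apply_update(active, u);
    apply_update(standby, u);   /* in practice: sent over a sync channel */
}
```

A real implementation would ship updates over a reliable messaging link and handle resynchronization after a standby restart, but the invariant is the same: the standby's tables are a mirror of the active's.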
Only the Control Plane’s state needs to be synchronized, since the Control Plane directly updates its Fast Path. Maintaining synchronization between Fast Paths is unnecessary: updating the Fast Path working tables is far faster than updating the Control Plane’s, and the Fast Path has been designed to ensure Non Stop Forwarding. Taking the routing example again, the Fast Path uses forwarding tables that describe an instantaneous snapshot of the routing state and can be quickly rebuilt from the Control Plane’s routing tables.
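Put differently, the forwarding table (FIB) is a pure function of the Control Plane's routing table (RIB), so after a failover it can be regenerated in a single pass rather than synchronized. A minimal sketch, with illustrative structures (a real RIB would hold prefixes and attributes, not array indices):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PREFIXES 8
#define NO_ROUTE     0

/* One candidate route in the Control Plane's RIB; there may be
 * several entries for the same prefix. */
struct rib_entry {
    unsigned prefix;     /* index into a toy prefix space */
    uint32_t next_hop;
    unsigned metric;     /* lower is better */
    int      valid;
};

/* The FIB keeps only the best next hop per prefix. Because it is
 * fully derived from the RIB, the new Control Plane can regenerate
 * it with one pass after taking over. */
static void rebuild_fib(const struct rib_entry *rib, int n,
                        uint32_t fib[MAX_PREFIXES])
{
    unsigned best[MAX_PREFIXES];
    for (int p = 0; p < MAX_PREFIXES; p++) {
        fib[p] = NO_ROUTE;
        best[p] = (unsigned)-1;
    }
    for (int i = 0; i < n; i++) {
        if (rib[i].valid && rib[i].metric < best[rib[i].prefix]) {
            best[rib[i].prefix] = rib[i].metric;
            fib[rib[i].prefix]  = rib[i].next_hop;
        }
    }
}
```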
Besides synchronization mechanisms, packet processing also has to provide monitoring services and graceful restart capabilities.
Monitoring services periodically check the health of the packet processing components to detect possible issues in order to prevent complete shutdown and anticipate switching from the active to the inactive element. These services alert the HA framework that supervises the whole equipment. The HA framework takes the decision to re-launch a software component or to partially / totally reboot the system.
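One common shape for such monitoring is a periodic loop that polls a health callback per component and alerts the HA framework only after several consecutive failures, so that transient glitches do not trigger a failover. The sketch below is an assumed design, not a specific product's API:

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*health_check_fn)(void);

struct component {
    const char     *name;
    health_check_fn check;
    int             missed;      /* consecutive failed checks */
};

#define MAX_MISSED 3             /* tolerate transient glitches */

/* One monitoring tick for one component. Returns true when the HA
 * framework must be alerted; the framework then decides whether to
 * relaunch the component or reboot part or all of the system. */
static bool monitor_step(struct component *c)
{
    if (c->check()) {
        c->missed = 0;
        return false;
    }
    return ++c->missed >= MAX_MISSED;
}

/* Sample health callbacks for illustration. */
static bool always_ok(void)   { return true;  }
static bool always_fail(void) { return false; }
```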
Graceful restart provides the capability to restart packet processing software components without interrupting traffic. Each key software component must implement this feature. If a component maintains no internal state, it simply has to be stopped and started again. More complex protocols require specific mechanisms; some of them, such as the OSPF routing protocol's graceful restart, have been standardized.
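For stateful components, one possible scheme (sketched below with invented names, not any standardized mechanism) is to checkpoint internal state before stopping and restore it on restart; meanwhile the Fast Path keeps forwarding, so traffic is not interrupted:

```c
#include <assert.h>
#include <string.h>

/* Illustrative internal state of a stateful component (e.g. NAT). */
struct nat_state {
    int sessions;
};

struct restartable {
    struct nat_state state;
    int running;
};

static void checkpoint(const struct restartable *c, struct nat_state *saved)
{
    *saved = c->state;        /* in practice: written to shared memory */
}

/* Restart the component without losing its state; the Fast Path keeps
 * forwarding on its own tables while the component is down. */
static void graceful_restart(struct restartable *c,
                             const struct nat_state *saved)
{
    c->running = 0;                          /* stop the component    */
    memset(&c->state, 0, sizeof(c->state));  /* fresh start           */
    c->state = *saved;                       /* restore checkpoint    */
    c->running = 1;                          /* resume service        */
}
```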
High performance packet processing is a critical element for moving the networking industry beyond the limitations of OS-based packet processing. It requires a specific software design in order to fully benefit from multicore CPU technology. It also has to provide High Availability extensions to be used in network infrastructure.