Millind Mittal

Today’s high-performance processors for networking applications utilize a multi-core approach that benefits from the packet- and flow-level parallelism inherent in network processing. Multicores optimized for networking applications typically augment general-purpose processor cores with offload engines that relieve the cores of specialized packet processing tasks such as parsing, classification, and security. Achieving high processing efficiency while providing full control over traffic flow with minimal intervention from the processor cores, however, calls for careful coordination among the various engines.

The task of a networking device is to perform the necessary protocol processing and to move incoming data packets along toward their final destination. Complexity arises quickly, however, due to multiple factors. Packet processing typically requires multiple processing steps, and processing needs vary with flow attributes and packet content. Further, a typical networking device has multiple I/O connections in play, with packets moving among them. Adding to the challenge is the continual growth in network bandwidth demands, which pressures networking device designs to provide the highest total packet throughput possible. At the same time, the device must control individual packet movements according to Quality of Service (QoS) levels that range from the real-time demands of streaming media to the best-effort delivery of data files. Chip designs also need to be compact in order to achieve cost and power efficiency. Additionally, developers expect a multicore solution to provide a robust application development environment, with support for fault isolation and error recovery to minimize the impact of software-induced error conditions.

Moving a data packet through a networking device involves many steps. One of the first is parsing the packet header to identify the destination address and various additional header fields. Packets then need classification based on the extracted header and payload information to determine their handling needs. Classification reveals information needed for subsequent processing, such as the security key or QoS parameters associated with the flow to which a given packet belongs. Certain management and security applications also require inspection of the data payload.
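The parse-then-classify sequence above can be sketched in a few lines. This is a simplified illustration only: the option-free IPv4 layout, the three-field flow key, and the flow-table entries are illustrative assumptions, not any particular device's formats.

```python
# Minimal sketch of the parse and classify steps. The header layout
# (option-free IPv4) and the flow table contents are illustrative
# assumptions, not a specific device's formats.
import struct

def parse_ipv4_header(pkt: bytes) -> dict:
    """Extract protocol, source, and destination addresses from a
    minimal (option-free) 20-byte IPv4 header."""
    if len(pkt) < 20:
        raise ValueError("truncated header")
    proto = pkt[9]                                # protocol field
    src, dst = struct.unpack_from("!II", pkt, 12) # src/dst addresses
    return {"proto": proto, "src": src, "dst": dst}

def classify(hdr: dict, flow_table: dict) -> dict:
    """Look up per-flow handling attributes (e.g. QoS class, security
    key id) using the fields the parser extracted."""
    key = (hdr["proto"], hdr["src"], hdr["dst"])
    # Unknown flows fall back to best-effort handling.
    return flow_table.get(key, {"qos": "best-effort"})
```

In a real device these steps run in dedicated parser and classifier engines rather than software, but the data flow between them is the same: extracted fields form a lookup key, and the lookup result drives the packet's subsequent handling.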

Multicore architectures typically employ hardware offload engines for key functions such as classification and security coding that would otherwise consume processor cores with long-latency, dependent memory accesses or bit-level manipulation tasks. To keep from overloading the memory bus, the device design also incorporates a direct memory access (DMA) controller, which allows these engines to transfer data directly to or from memory, reducing both the demands on the processor cores and the associated bus-bandwidth utilization.

Multicore architectures, however, need to use their processing elements (processor cores and offload engines) efficiently in order to address both throughput and cost considerations. A multicore architecture with shared resources and a common high-performance interconnect (Figure 1) allows for ease of use and optimum processing flexibility, both in the type and in the amount of processing per packet.


Queue and per-Resource Traffic Management Maintains the Flow within the Multicore
The multicore approach additionally requires an effective intrachip communication mechanism to guide the flow of packets through the various processing steps within the device. Achieving efficient intrachip communication along with QoS guarantees for networking applications requires two related capabilities: 1) queue management and 2) per-resource traffic management (Figure 2).


To simplify the coordination of shared-resource access, the memory architecture can implement and manage queues for each data flow path. Queues improve networking performance by decoupling resource operations: instead of waiting for a resource to become available or to acknowledge a data transfer during packet processing, a resource simply posts to the queue associated with another resource and lets queue management ensure that the data reaches its destination.
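The post-and-continue decoupling described above can be sketched as follows. The `Resource` class, its queue, and the work descriptors are hypothetical illustrations of the mechanism, not a specific device's API.

```python
# Illustrative sketch of resources decoupled by per-resource queues:
# a producer posts a work descriptor and moves on immediately; the
# consumer drains its own queue when it is free. Names and descriptor
# fields are assumptions for illustration.
from collections import deque

class Resource:
    def __init__(self, name: str):
        self.name = name
        self.queue = deque()   # work queue owned by this resource

    def post(self, descriptor: dict) -> None:
        """Called by any other resource; no handshake or wait needed."""
        self.queue.append(descriptor)

    def drain(self) -> list:
        """Process queued descriptors, in arrival order, when free."""
        done = []
        while self.queue:
            done.append(self.queue.popleft())
        return done

# A core posts work to the crypto engine's queue and continues.
crypto = Resource("crypto-engine")
crypto.post({"pkt": 1, "op": "decrypt"})
crypto.post({"pkt": 2, "op": "decrypt"})
```

The key property is that `post` never blocks: the producer's progress no longer depends on the consumer's availability, which is exactly what removes the need for synchronization between cores sharing the same engine.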

This decoupling also eliminates the need for software synchronization among multiple processor cores sharing other resources, creating the effect of virtual resources dedicated to each processor core. Such virtualization simplifies scaling of the architecture to higher performance levels by adding processor cores and resources with appropriate queue management. It also increases reliability and fault isolation by preventing one processor core from locking up a resource.

A traffic manager controls data ingress to prevent any given channel from monopolizing device resources. It also controls the egress ports to ensure that the various traffic flows share output channel bandwidth appropriately. Traffic management plays a role inside the device as well, enforcing shared-resource utilization rules based on QoS ratings.
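One common way an ingress traffic manager keeps a channel within its allotted share is a token bucket. The sketch below is a minimal illustration of that general technique; the rate and burst values are arbitrary, and real hardware implements this per channel in dedicated logic.

```python
# Hedged sketch of ingress policing with a token bucket -- one common
# technique for keeping a channel from monopolizing device resources.
# Rates and sizes are arbitrary illustration values.
class TokenBucket:
    def __init__(self, rate: int, burst: int):
        self.rate = rate        # tokens replenished per time unit
        self.burst = burst      # bucket capacity (max burst size)
        self.tokens = burst

    def tick(self, elapsed: int = 1) -> None:
        """Replenish tokens as time passes, capped at the burst size."""
        self.tokens = min(self.burst, self.tokens + self.rate * elapsed)

    def admit(self, size: int) -> bool:
        """Admit a packet costing `size` tokens if the channel is in
        profile; otherwise the packet would be dropped or marked."""
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False
```

A channel that bursts beyond its profile is refused admission until its tokens replenish, so no single ingress channel can starve the shared resources behind it.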

The various stages of traffic management can also be combined with queue management to form a framework for efficient, QoS-aware intrachip communication. The result is intrachip QoS control at each stage of packet processing. One way to implement this integrated traffic and queue management is to place at each resource an arbiter that coordinates multiple queues (Figure 3). This approach also allows a given traffic flow to be assigned to a specific queue so that all packets for that flow follow the same internal path through the device, preserving packet order within the flow and making it easier to isolate and recover from faults. Integrated queue and traffic management also minimizes core intervention in managing the processing flow within the multicore.
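The flow-to-queue assignment above rests on a deterministic mapping: every packet of a flow must land in the same queue. A hash of the flow identifier is one common way to get that property; the hash function and queue count below are illustrative choices, not a specific device's scheme.

```python
# Sketch of flow-to-queue assignment. Hashing a flow identifier to a
# fixed queue index keeps every packet of that flow on the same
# internal path, preserving in-flow packet order. CRC32 and the queue
# count are illustrative assumptions.
import zlib

NUM_QUEUES = 8

def queue_for_flow(flow_id: bytes) -> int:
    """Deterministic mapping: the same flow always maps to the same
    queue, so its packets cannot be reordered across paths."""
    return zlib.crc32(flow_id) % NUM_QUEUES
```

Because the mapping is stateless and deterministic, no per-flow table is strictly required for ordering, although devices that need per-flow QoS state typically keep one anyway.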


Coupling the queue and traffic management mechanisms so that back pressure from a queue provides the arbiter with additional information for its decisions helps maximize resource utilization without compromising QoS enforcement. For example, there is no value in giving a streaming-video flow priority access to a decoding engine when the flow's output channel queue is full. An integrated queue and traffic management system can respond to such conditions by temporarily giving other flows additional access to the resource rather than continuing to follow a static round-robin or priority-based allocation.
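The back-pressure behavior described above can be sketched as a round-robin arbiter that skips any flow whose downstream queue is full. The function and its arguments are hypothetical illustrations of the technique, not a hardware interface.

```python
# Sketch of back-pressure-aware arbitration: round-robin over per-flow
# input queues, skipping any flow whose downstream (output) queue is
# full so another flow gets the slot instead. Names are illustrative.
def arbitrate(queues, downstream_full, start=0):
    """Pick the next eligible queue index.

    queues:          list of per-flow input queues (lists)
    downstream_full: list of bools; True blocks that flow
    start:           index where the round-robin scan begins
    Returns the chosen index, or None if no queue is eligible.
    """
    n = len(queues)
    for offset in range(n):
        i = (start + offset) % n
        if queues[i] and not downstream_full[i]:
            return i
    return None
```

With static round-robin, a blocked flow would waste its slot; here the slot passes to the next flow with work to do and room downstream, which is the utilization gain the integrated scheme provides.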

Efficient processing that achieves the required throughput while preserving QoS requirements is the ultimate goal of networking device design. The use of multiple cores and various offload engines is now essential to provide the channel capacity and processing flexibility that network systems demand. These processors must implement complex traffic flow patterns through the device while sharing processor cores and offload engines among the channels as efficiently as possible. Queues and traffic management used as an integrated structure, along with a sufficient number of processor cores and the right types of offload engines, make it possible to achieve design efficiency, processing flexibility, and the associated QoS and throughput requirements.