In any modern multi-processor system, bus-interconnects are among the most complex IP blocks. These system bus-interconnects are the backbone of end-to-end communication, carrying substantial transaction bandwidth across the chip. Hence the performance of the system depends largely on how efficiently the bus-interconnects are structured. The functional correctness of the interconnects becomes an important criterion in defining the performance and efficiency of the system. A robust verification methodology is therefore required to verify different aspects of system bus-interconnects, such as end-to-end latencies, priority of transfers, correctness of the address decoding and response to illegal accesses. This article walks through some of the major issues faced while verifying a bus-interconnect design/configuration and shows how simple assertions can help identify design issues.
Typical system bus-interconnect structure
Figure 1 shows a network of bus-interconnects.
M0, M1 and M2 are the masters in the system, which are responsible for driving transactions and collecting the associated responses. These masters can be processors, DMA masters or other system masters. S0 and S1 are the slaves of the system, which process transactions from the masters and generate responses back to them. BM1, BM2, BM3 and BM4 are the bus-interconnects (bus matrices), which transfer the transactions driven by the masters towards the slaves. The point-to-point paths (Pxx, such as P01A, P30B, etc.) connect masters, interconnects and slaves. For simplicity these paths are numbered as follows.
1. Paths marked as P0x (e.g. P01A, P01B, P02A, P02B, P04) are the interfaces that start at the external masters M0, M1 and M2 and terminate at a bus matrix. For example, P01A is the interface between master M0 and bus matrix BM1.
2. Paths designated as Pxy (e.g. P13A, P23A, P14, P34, etc.) are the communicating interfaces that start at bus matrix BMx and terminate at BMy. For example, P13A starts at BM1 and ends at BM3.
3. Paths that start at bus matrix BMx and terminate at an external interface are designated as Px0 (e.g. P10, P30A, P30C, P20B, etc.). For example, P30A starts at BM3 and ends as an external interface.
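The naming convention above can be captured in a small helper, which is handy when monitors are bound to interfaces by name. This is only a sketch: the function name `classify_path` and the assumption that matrix indices are single digits are illustrative and not part of the original design.

```python
import re

# Hypothetical helper: classify a path label using the Pxy convention.
# Assumes single-digit indices and an optional letter suffix (A, B, C, ...).
_PATH = re.compile(r"^P(\d)(\d)[A-Z]?$")

def classify_path(label):
    """Return (kind, source_matrix, destination_matrix) for a path label."""
    m = _PATH.match(label)
    if m is None:
        raise ValueError(f"not a Pxy label: {label!r}")
    src, dst = int(m.group(1)), int(m.group(2))
    if src == 0 and dst == 0:
        raise ValueError("P00 is not covered by the convention")
    if src == 0:
        # P0y: external master into bus matrix BMy
        return ("master_to_matrix", None, f"BM{dst}")
    if dst == 0:
        # Px0: bus matrix BMx out to an external interface
        return ("matrix_to_external", f"BM{src}", None)
    # Pxy: bus matrix BMx to bus matrix BMy
    return ("matrix_to_matrix", f"BM{src}", f"BM{dst}")
```

For example, `classify_path("P13A")` reports a matrix-to-matrix link from BM1 to BM3.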
In a real system, each of the bus matrices comprises many internal switches to support bus width transformations, frequency conversions, arbitration schemes and the different bus protocols driven by the master and slave interfaces.
Considering the above system, four major aspects of bus-interconnect design are discussed. They are:
A) Latency across the interconnect
B) Priority of simultaneous accesses (arbitration)
C) Quality of Service / Quality of Virtual Networks
D) Address space decoding
Also, in each of the sections, it is discussed how assertions can help in validating the correctness of the interconnect design.
Latency across the interconnect
Latency is the most critical parameter of an interconnect design. End-to-end latency across the interconnect determines how good the performance of the whole system built around it will be. Interconnect latency means the time the transaction control information takes to travel from master to slave along the forward interconnect path, plus the time the transaction response/data takes to travel from the slave back to the master along the reverse path. For example, in figure 1, the latency between master M2 and slave S1 is the time a transaction takes to travel between master M2 and slave S1 through interconnect BM2. Since there are no components between the master and the slave other than the interconnect, the interconnect design defines the latency of this path. In all the examples discussed later, it is assumed that there are no additional components (such as intermediate register slices, combinational logic or sync-bridges) other than interconnects between a master-slave pair.
Latency for a particular master-slave interface pair is affected mainly by three things:
i) Presence of low latency master-slave paths
ii) Insertion of register slices to support timing closure, frequency and protocol conversions across the interfaces
iii) Addition of buffers to support bufferable transactions, enabling an early response to the master and delayed writes to the slave interface.
The following sections describe each of these features and how assertions can check the correctness of the interconnect design:
Low latency paths
Low latency access paths are those paths which have critical access latency requirements. These are typically achieved through dedicated point-to-point paths, and/or through additional access path between defined two nodes.
In the system shown in figure 1, P04 is a dedicated low latency path starting from the low latency port of master M0 (e.g. ARM processors such as the Cortex-R5 have an LLPP port). Master M0 interacts with the peripheral on the P40A interface using the P04 interface. For such interfaces, the accesses depend solely on the master/processor configuration and on the decoding of the address map for the particular interface (P04) inside the BM4 bus matrix. Verification of such interfaces and connections is straightforward.
Let us consider a different low latency scenario. Masters M0, M1 and M2 need to access the low latency peripherals on the interfaces P40B and P40C. For masters M0 and M1, there are two possible transfer paths.
1. Path01: P01A (P01B) -> BM1 -> P14 -> BM4 -> P40B/P40C
2. Path02: P01A (P01B) -> BM1 -> P13A (P13B) -> BM3 -> BM4 -> P40B/P40C
As is evident from the two paths above, Path02 is a high latency path, so any access on it will be delayed by a large number of cycles. The delay can be due to the arbitration in each of the bus matrices, the address region decoding logic of the bus matrices, or multiple protocol/frequency conversions. Hence Path02 will degrade the system performance. The correct functional path is Path01, where the transfer from the masters passes through fewer bus matrices.
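As a rough illustration of the comparison above, the number of bus matrices a path traverses can serve as a first-order proxy for its latency. The sketch below simply counts the `BM` entries in each path; the helper names are hypothetical, and real latency also depends on arbitration, decoding and conversion delays that this count ignores.

```python
# The two candidate paths from M0 to P40B, written as ordered hop lists.
PATH01 = ["P01A", "BM1", "P14", "BM4", "P40B"]
PATH02 = ["P01A", "BM1", "P13A", "BM3", "BM4", "P40B"]

def matrices_traversed(path):
    """Count the bus matrices (entries starting with 'BM') along a path."""
    return sum(1 for hop in path if hop.startswith("BM"))

def lowest_latency_candidate(paths):
    """Return the name of the path that crosses the fewest bus matrices."""
    return min(paths, key=lambda name: matrices_traversed(paths[name]))

candidates = {"Path01": PATH01, "Path02": PATH02}
best = lowest_latency_candidate(candidates)  # "Path01": 2 matrices versus 3
```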
Similarly, master M2 can access the peripherals sitting at P40B and P40C through the following paths:
1. Path11: P02A -> BM2 -> P12 -> P13A -> P34 -> BM4 -> P40B/P40C
2. Path12: P02A -> BM2 -> P23A -> BM3 -> P34 -> BM4 -> P40B/P40C
In the above case also, it is highly likely that the Path12 will be the low latency path as it goes through two levels of interconnect structure compared to three levels of interconnect structure as in case of Path11.
The same holds true when masters try to access peripherals or other system blocks at the P30A and P30C interfaces.
It is very important that the correct functional paths are determined and verified as part of the design verification flow, but it is difficult to verify the correctness of these functional paths through simulation. All test scenarios will pass if the verification is done end to end, because the slaves will always give the expected result and a normal simulation cannot determine the latency of any functional path. Only functional monitors at the bus-interconnect interfaces can determine the latencies of the paths and hence help in determining the right functional path for a particular master-slave pair. From the above examples, it may seem that the problem arises from address decoding errors, and that checking the address ranges on each of the slave interfaces (on the connected masters) of the interconnects would detect the design issues easily. That is true if the interconnects are physically separate. But what if both bus-interconnects are small switches inside a bigger interconnect, as shown in Figure 2? It is common practice to segregate a complex interconnect structure into smaller switches, as shown in figure 2, in order to increase the efficiency of the network.
In such a case, the slave address space remains the same on both destination interfaces at the bus-interconnect periphery, but erroneous address decoding in the internal switches can cause a substantial performance problem, leading to higher latencies for some of the transactions. The only way to catch this issue is to write latency assertions. An example of an assertion for low latency path detection is given below:
ASSERT: (<valid tx at the input interface>) |=> first_match (##[1:EXPECTED_LATENCY] (<valid tx at the output interface>))
where, EXPECTED_LATENCY = f (no. of register slices, delays due to protocol conversions)
An approximate EXPECTED_LATENCY can be calculated for each interface and fed to the assertion.
The above assertion can be complemented with valid slave address spaces so that it can be clearly detected whether certain peripherals are connected across the low latency paths only.
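The intent of the latency assertion can also be modeled offline on a recorded interface trace. The sketch below assumes a hypothetical trace format of `(cycle, interface, transaction_id)` tuples and reports every transaction that was seen on the input interface but did not appear on the output interface within 1 to `expected_latency` cycles. It is an illustration of the check, not a replacement for the assertion in the testbench.

```python
def latency_violations(events, src, dst, expected_latency):
    """Return the ids of transactions seen on `src` that do not show up
    on `dst` within 1..expected_latency cycles (inclusive)."""
    first_seen_src = {}
    first_seen_dst = {}
    for cycle, intf, txn in events:
        if intf == src:
            first_seen_src.setdefault(txn, cycle)
        elif intf == dst:
            first_seen_dst.setdefault(txn, cycle)
    bad = []
    for txn, start in first_seen_src.items():
        end = first_seen_dst.get(txn)
        # A transaction that never arrives, or arrives too late, is a violation.
        if end is None or not (1 <= end - start <= expected_latency):
            bad.append(txn)
    return sorted(bad)

# Hypothetical trace: t0 arrives after 3 cycles, t1 after 9, t2 never.
trace = [
    (0, "P01A", "t0"), (3, "P40B", "t0"),
    (5, "P01A", "t1"), (14, "P40B", "t1"),
    (20, "P01A", "t2"),
]
late = latency_violations(trace, "P01A", "P40B", expected_latency=4)
# late == ["t1", "t2"]
```

A transaction such as t1 that completes correctly but late is exactly the case an end-to-end data check would miss.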
Similar latency issues can also creep into the design when the intermediate protocol between the switches inside a bus-interconnect is chosen inefficiently. For example, if the master or slave interfaces follow the AXI protocol but AHB is chosen as the intermediate protocol, performance will be badly hit. Such problems are unlikely to arise, as the majority of present interconnects are designed with AXI as the intermediate communication protocol in order to support bufferable, cacheable and out-of-order transactions.
Register slices inside interconnect
In a complex system, interconnects have many interfaces supporting various protocols running at different frequencies, with varying synchronous relationships. Whenever there is a frequency conversion or protocol conversion, register slices are generally required for synchronization, timing isolation and arbitration. But for direct translation across the interfaces (for example AHB 32 bit transaction to AHB 32 bit transaction), no register slices are required. Unwanted register slices on such interfaces introduce cycle delays and hit the interconnect performance. How do we catch unwanted register slices in the design? For example, in Figure 1, register slices may be introduced in the bus matrices BM1 and BM3 in the path from master M0 to slave S0 through P01A->P13A->P30B. Unwanted register slices are typically introduced while configuring the BM1 or BM3 bus matrices, or on the physical path from BM1 to BM3. When they are not part of the functional requirements, these register slices only degrade the system performance.
The same assertion used for the detection of low latency paths can be used to check whether transactions reach the slaves within a particular time. Approximate values of the expected latencies for a particular master-slave interface pair need to be calculated and passed to the assertions.
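One possible way to derive the expected latency is a simple additive budget. The sketch below assumes that each register slice costs exactly one cycle and that conversion delays are known per path; both are assumptions made for illustration, and the function names are hypothetical. Any cycles left over after subtracting the budget from the measured latency point at logic, such as an unwanted register slice, that the budget did not account for.

```python
CYCLES_PER_REGISTER_SLICE = 1  # assumption: one cycle per slice

def expected_latency(base_cycles, register_slices, conversion_delays=()):
    """Latency budget: base path delay + register slices + conversions."""
    return (base_cycles
            + register_slices * CYCLES_PER_REGISTER_SLICE
            + sum(conversion_delays))

def unexplained_cycles(measured, base_cycles, register_slices,
                       conversion_delays=()):
    """Cycles in the measured latency that the budget does not explain."""
    return measured - expected_latency(base_cycles, register_slices,
                                       conversion_delays)

# AHB 32 bit to AHB 32 bit: no slices and no conversions are expected.
budget = expected_latency(base_cycles=2, register_slices=0)
extra = unexplained_cycles(measured=4, base_cycles=2, register_slices=0)
# extra == 2, which hints at two unplanned register stages on the path
```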
Interconnect capability to support bufferable transactions
Placement of buffers inside interconnects is an important design consideration. Buffers not only help to avoid traffic congestion, but also increase performance: the interconnect can give an early response to the master and perform a delayed write to the destination slave, instead of making the master wait for the slave's response. AMBA protocols provide protection signals through which a master can indicate to the bus-interconnect that it is ready to accept an early response even before the transaction reaches the slave. This increases the efficiency of the masters, especially when accessing normal type memory/slaves. If the interconnect is wrongly configured without any buffers to support bufferable transactions, the masters will be kept waiting until the slave responds, which degrades performance. For example, in figure 1, slave S0 may be a memory while the bus matrices have no capability to buffer transactions. The masters then have to wait for the longer response time of slave S0 every time they initiate a transaction. Had there been buffers in the bus matrices, the interconnects would be able to give early responses even before the transactions had reached the slave, and could perform the delayed write transactions to the slave themselves without holding up the master. This is a critical design issue that can degrade the overall system performance to a great extent.
By accurately predicting the effective latency between the master and slave interfaces, the assertions discussed earlier can be used to catch such major issues.
Priority of simultaneous accesses (arbitration)
The performance of a master on the bus-interconnect(s) depends largely on the arbitration scheme. Arbitration problems arise especially when the arbitration scheme is of the fixed type, i.e. each master has a fixed priority. Figure 3 shows a simple fixed arbitration scheme for a 3x2 bus matrix:
As per figure 3, M0, M1 and M2 are the three masters, with M0 having the highest priority and M2 the lowest. BM is the 3x2 bus matrix and S0 and S1 are the two slaves. Let us consider that M0 is the instruction bus master of a processor (e.g. ARM Cortex-M3) and M1 is the data bus master of the same processor. M2 can be another processor or an external master. Also, let us consider that the slaves, S0 and S1, are two memories.
Since M0 is the instruction bus master, the processor uses this interface to fetch the instructions from memory, S0 (say). M1 bus master is used to get the data from memories S0 and S1.
If M0 is given the highest priority, then M1 and M2 will never get access under certain conditions, such as during the execution of a branch-to-self instruction. Such instructions are often used when the application polls for an interrupt. If M0 has no cache, it will fetch the same instruction from memory again and again, never releasing the bus for the lower priority bus masters M1 and M2. When M0 (instruction bus) and M1 (data bus) belong to the same processor, M0 blocking M1 during arbitration can cause the processor pipeline to deadlock, as no transactions progress in the pipeline.
Hence the above priority can lead to a complete deadlock. If M2 is given the highest priority instead, the processor performance will be degraded. For example, say M0 fetches an instruction that requires a data read/write transfer to memories S0/S1 through the M1 interface. An immediate M2 access to the memory will delay the data transfer initiated by M1, leading to pipeline stalls in the processor (M0 and M1 being interfaces of the same processor).
The correct priority for the above system should be M1->M2->M0 with M1 having the highest priority. This is a very important aspect of any interconnect and a wrong priority setting/ arbitration scheme can cause severe performance degradation.
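The starvation scenario can be reproduced with a very small model of a fixed-priority arbiter. The sketch below is a simplification under stated assumptions: one grant per cycle, requests stay pending until granted, and M0 re-issues its fetch every cycle to mimic a branch-to-self loop without a cache. The function name is hypothetical.

```python
def simulate_fixed_priority(new_requests, priority):
    """Grant one pending master per cycle, highest priority first.

    new_requests: list of sets, the masters raising a request in each cycle.
    priority: list of master names, highest priority first.
    Returns the granted master for each cycle (None if nothing is pending).
    """
    pending = set()
    grants = []
    for requests in new_requests:
        pending |= set(requests)
        winner = next((m for m in priority if m in pending), None)
        grants.append(winner)
        pending.discard(winner)  # granted request is no longer pending
    return grants

# M0 fetches every cycle; M1 issues a single data access in cycle 2.
fetch_loop = [{"M0"}, {"M0"}, {"M0", "M1"}, {"M0"}, {"M0"}]

m0_first = simulate_fixed_priority(fetch_loop, ["M0", "M1", "M2"])
m1_first = simulate_fixed_priority(fetch_loop, ["M1", "M2", "M0"])
# m0_first never grants M1; m1_first grants M1 in cycle 2 and M0 otherwise.
```

With M0 at the top, M1 is never granted for as long as the fetch loop runs; with the M1->M2->M0 ordering, the data access is served as soon as it is raised.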
This functional aspect cannot be fully caught by functional simulation alone. Test scenarios with self-loops can catch the first situation, the priority deadlock, but cannot detect the second scenario, where pipeline stalls are introduced by delayed accesses from M1.
The best way to detect and verify the above scenarios is to code assertions that calculate the point-to-point transaction time, i.e. the latency of the transactions. By calculating the latency and comparing it with the expected value, it is easy to detect whether a particular interface is getting stuck. The assertion is similar to the ones discussed earlier.
The other way of coding the assertion is by trying to detect whether the address from the highest priority interface came on the slave interface and simultaneously the lower priority interfaces are held. A sample pseudo code is given below:
ASSERT: (<tx on intf A with higher priority> && <tx on intf B with lower priority>) |=> ##[1:INTERCONNECT_LATENCY] (<tx seen on intf A> && <tx on intf B is held>)
Where INTERCONNECT_LATENCY corresponds to the latency of the bus-interconnect for the defined master-slave pair.
The above assertion can be modified to check the type of arbitration implemented. Such assertions are required where choice of arbitration (round robin, FIFO, fixed, etc.) can help in avoiding deadlock situations.
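The same priority check can be sketched as a post-processing step on a trace. The model below assumes a hypothetical log of the cycles in which both interfaces requested together, plus a list of `(cycle, source_interface)` events observed on the slave side. For each contention it verifies that the first transaction to reach the slave within the latency window came from the higher priority interface.

```python
def priority_violations(contention_cycles, slave_events, high, low,
                        interconnect_latency):
    """Return the contention cycles after which the higher priority
    interface was not the first one served within the latency window."""
    ordered = sorted(slave_events)
    bad = []
    for start in contention_cycles:
        window = [src for cycle, src in ordered
                  if start < cycle <= start + interconnect_latency
                  and src in (high, low)]
        # Nothing served in time, or the lower priority interface went first.
        if not window or window[0] != high:
            bad.append(start)
    return bad

# Contention in cycles 0 and 10; intf B wins the second time.
slave_side = [(2, "A"), (3, "B"), (11, "B"), (12, "A")]
wrong = priority_violations([0, 10], slave_side, high="A", low="B",
                            interconnect_latency=3)
# wrong == [10]
```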
Quality of Service of the bus-interconnect
The priority issues discussed above pertain to simple arbitration techniques in low bandwidth interconnects. In systems with high on-chip bandwidth requirements, simple arbitration schemes are not sufficient to arbitrate between bursts of transactions from masters of different priority. In addition to the standard arbitration techniques, the priority of each incoming transaction is also taken into account by the arbitration mechanism. The priority of a transaction is driven by its master, which can assign it based on the significance of the transaction in the overall system. This procedure is termed quality of service (QOS). The AMBA AXI protocol has additional QOS signals, AWQOS and ARQOS, to support this feature of the bus-interconnect. Based on the incoming QOS value and the standard arbitration scheme selected, the bus-interconnect decides how to arbitrate simultaneous transactions from different masters. By changing the QOS value, the same master can initiate a low priority transaction at one point in time and a high priority transaction at another. A complex network can also profile QOS values dynamically over a period of time, based on traffic variation. For example, the network may increase the QOS value, and hence the priority, of a particular master when it finds that this master, which does not have the highest priority, is initiating many more transactions than the other masters.
In addition to the interface QOS values, the network can also be configured to support virtual channels across masters for transaction arbitration. These virtual channels use the same physical channels of the masters for transaction propagation. Transactions issued on different virtual channels never conflict with each other, which avoids head-of-line blocking. For example, if a high bandwidth master and a low bandwidth critical master share bandwidth without virtual channels, it is very likely that the transactions of the low bandwidth critical master will not complete within the expected time window. With virtual channels enabled, a transaction from the low bandwidth critical master passes without depending on the transactions of the high bandwidth master. This concept is part of quality of virtual network (QVN).
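Head-of-line blocking can be illustrated with a toy model. The sketch below compares a single shared queue served in arrival order with per-channel queues served round-robin; the round-robin policy, the one-transaction-per-cycle service rate and the function names are assumptions chosen for illustration, not a description of how any particular QVN implementation schedules traffic.

```python
from collections import deque

def completion_cycle_shared(queue, target):
    """Cycle in which `target` is served from one shared in-order queue."""
    for cycle, txn in enumerate(queue, start=1):
        if txn == target:
            return cycle
    return None

def completion_cycle_virtual(channels, target):
    """Cycle in which `target` is served when each virtual channel has its
    own queue and the non-empty channels are served round-robin."""
    queues = [deque(channel) for channel in channels]
    cycle = 0
    while any(queues):
        for queue in queues:
            if queue:
                cycle += 1
                if queue.popleft() == target:
                    return cycle
    return None

bulk = ["b0", "b1", "b2", "b3"]          # high bandwidth master
shared = completion_cycle_shared(bulk + ["crit"], "crit")       # 5
virtual = completion_cycle_virtual([bulk, ["crit"]], "crit")    # 2
```

In the shared queue the critical transaction waits behind all four bulk transactions, whereas with separate channels it is served in the second cycle.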
In all the above cases, it becomes necessary to check that the network/bus-interconnect has been assigned the correct QOS values. To determine the correctness of the QOS values exactly, it will be necessary to probe some of the internal nodes of the bus-interconnect/network. But a few features can be verified with simple to complex assertions. Scenarios such as dynamic modification of QOS values over a period of time for a lower priority master can be verified by keeping track of the QOS signals on the output interface.
ASSERT: (<input QOS value on a particular interface for a particular slave>) |=> ##[0:SAMPLING_PERIOD] ((<output QOS on the slave interface> > <input QOS>) && <output QOS within a valid QOS range>)
where SAMPLING_PERIOD depends on the interconnect/network configuration.
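The QOS promotion check can be modeled on sampled values as follows. The sketch assumes the 4-bit QOS range of AXI4 (0 to 15) and a hypothetical list of `(cycle, qos)` samples taken on the slave-side interface; it returns true when a raised, in-range QOS value is observed within the sampling period.

```python
QOS_MIN, QOS_MAX = 0, 15  # AXI4 AWQOS/ARQOS are 4-bit signals

def qos_promoted(input_qos, output_samples, start_cycle, sampling_period):
    """True if, within the sampling window, the slave-side QOS is higher
    than the input QOS and still inside the valid range."""
    end_cycle = start_cycle + sampling_period
    for cycle, qos in output_samples:
        in_window = start_cycle <= cycle <= end_cycle
        if in_window and qos > input_qos and QOS_MIN <= qos <= QOS_MAX:
            return True
    return False

# Hypothetical samples: QOS stays at 3 in cycle 1 and is raised to 5 in cycle 6.
samples = [(1, 3), (6, 5)]
raised_in_time = qos_promoted(3, samples, start_cycle=0, sampling_period=8)
raised_too_late = qos_promoted(3, samples, start_cycle=0, sampling_period=4)
# raised_in_time is True, raised_too_late is False
```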
Similarly, assertions can be used to check whether a lower priority master is able to get access while higher priority masters are blocking the network paths. This is mainly applicable to networks where QVN is implemented. Assertions can also be used to check whether buffer slots are allocated for lower priority masters.
ASSERT: (<Tx from higher priority master with certain QOS value> && <Tx from lower priority master with certain QOS value>) |=> (##[0:SAMPLING_PERIOD] <Tx from lower priority master interface is visible on slave interface>)
The above are some simple assertions to check a few basic QOS related issues, but detailed QOS and QVN verification requires much more complex assertions, in which multiple parameters such as IDs, QOS values, priorities and bufferable/cacheable attributes have to be considered. Functional monitors are more beneficial in such cases because of the increasing complexity of coding the assertions.
Incorrect address decoding
Address decoding issues are very common in bus-interconnects, but by their nature they are very difficult to find through functional simulation. In present day systems-on-chip, most of the address decoding is handled by the configurable interconnect networks. In a multi-level interconnect structure, it becomes very difficult to detect which interconnect is causing a decoding issue unless somebody walks through the simulation waveforms, interface by interface. Let us refer to the interconnect structure of figure 1, where master M2 performs transactions, through bus-interconnects BM2 and BM3, to slave S0 and to an external slave, say S3, on the P30C interface. Let us also consider that the slaves have address spaces as per the following table 1:
As per table 1, slave S0 supports address ranges ARg_A and ARg_B, whereas slave S3 supports address range ARg_C. When master M2 places an address within the range ARg_A, the transaction should reach slave S0. But due to a wrong interconnect configuration (erroneous address decoding), the address reaches slave S3. Slaves S0 and S3 may behave similarly, so transactions to slave S3 can complete successfully with a valid response to master M2. A normal simulation will not be able to identify whether the transactions initiated by master M2 were completed by slave S0 or by slave S3. Even if the master initiates transactions with IDs, as in the AXI protocol, the bus-interconnect will still behave incorrectly. From the design perspective, however, there is a bug in the interconnect address decoding: slave S3 accepted the transaction instead of slave S0. Such seemingly trivial issues are very difficult to catch through standard simulation without functional monitors/assertions. When address range assertions are placed on each of the interfaces, simulations can easily detect any address decoding errors.
A simple assertion to check address decoding based on the system memory map is shown below:
ASSERT: (<valid tx on intf A>) |-> (tx address on intf A >= start_address && tx address on intf A <= end_address)
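The address range check can be sketched against a memory map as follows. The numeric ranges below are hypothetical placeholders standing in for ARg_A, ARg_B and ARg_C, since the actual values belong to the system memory map; the function name and the `(slave, address)` trace format are likewise assumptions.

```python
# Hypothetical memory map: S0 owns two ranges (ARg_A, ARg_B), S3 owns one (ARg_C).
ADDRESS_MAP = {
    "S0": [(0x0000_0000, 0x0000_FFFF), (0x0002_0000, 0x0002_FFFF)],
    "S3": [(0x0001_0000, 0x0001_FFFF)],
}

def decode_errors(observed, address_map):
    """Return the (slave, address) pairs whose address lies outside every
    range that the memory map assigns to that slave interface."""
    errors = []
    for slave, address in observed:
        ranges = address_map.get(slave, ())
        if not any(start <= address <= end for start, end in ranges):
            errors.append((slave, address))
    return errors

# An address inside ARg_A shows up on S3: the transaction completes,
# but the decode is wrong.
seen = [("S0", 0x0000_0100), ("S3", 0x0000_0100), ("S3", 0x0001_0010)]
misrouted = decode_errors(seen, ADDRESS_MAP)
# misrouted == [("S3", 0x100)]
```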
As is evident from the above discussion, interconnect verification is highly challenging, so a robust verification methodology needs to be put in place. Many functional monitors need to be placed across the interconnect interfaces to check the correctness of the interconnect configuration. Assertions provide an easy means of coding simple functional monitors, but it is also necessary to complement them with complex functional monitors, protocol monitors and functional coverage monitors, depending on the design complexity.