PCI Express Primer #3: Transaction Layer
Simon Southwell
Semi-retired logic, software and systems designer. Technical writer, mentor, educator and presenter.
Introduction
In the first and second articles in this series, the PCIe physical and data link layers were discussed, and we got to the point of having a physical channel we can send data through and a means of flow controlling traffic over that channel.
The transaction layer, as we shall see, defines three categories of packets for transferring data as reads and writes into three address spaces, and a fourth category for sending messages for housekeeping and signalling. Compared to the layers discussed in the last two articles, the transaction layer has a lot of detailed rules.
So, let’s get to it.
Transaction Layer Packets
As we saw in the data link layer article, three types of transaction were identified, and it is worth reiterating what these are here. The three identified types are posted, non-posted and completion.
Posted packets are ones where no response is issued (or expected), such as a write to memory. Non-posted packets are the opposite, where a response is required, and a completion is a returned packet for an earlier packet in the opposite direction, such as read data from an earlier read. In the data link layer, we discussed these types with reference to flow control, as each type is flow controlled separately and, indeed, within each type the header and data are flow controlled separately as well. The transaction layer defines a set of transaction layer packets (TLPs), each of which fits into one of these three types. The general categories of TLP are listed below, along with the type to which they belong.
Of these TLP types most are non-posted, whilst just memory writes and messages are posted. The completion TLPs are the responses to non-posted requests, which may or may not carry data, and are of the ‘completion’ type, as you’d expect.
Other than message transactions, the access request TLPs (with a completion being a response, if applicable) are reads and writes to different address spaces—namely memory, I/O and configuration.
Memory accesses are just what you’d expect, with reads and writes of data within a memory mapped address space. According to the PCIe specifications, the I/O TLPs are to support legacy PCI, which defines a separate I/O address space, but even modern systems still make a distinction between main memory and I/O, such as in the RISC-V fence instructions. The configuration access TLPs are used to access the configuration space of the PCIe interface, which is effectively its set of control and status registers. These ‘registers’ advertise capabilities, report status and allow configuration. We will look at the configuration space details in another article. The I/O and configuration writes, unlike memory writes, are both non-posted and require a completion.
The general structure of a TLP is as shown in the diagram below:
Each TLP has a header which is either three or four double words, depending on its type and (where applicable) the address width being used (either 32-bit or 64-bit). This is followed by the data, if any. The header will indicate the length of the data, as we shall see. The maximum supported payload size is 4096 bytes (1024 DWs) by default, but an endpoint can advertise in its configuration space that this is smaller, in powers of 2, down to 128 bytes (32 DWs). An optional CRC can be added for additional data integrity. This is called the TLP digest and is a CRC with the same specification as the LCRC of the data link layer discussed in article 2, including inversion and bit swapping the bytes. A bit in the header indicates whether this TLP digest is present or not.
TLP Headers
The first double word of every TLP header has a common format that indicates how the rest of the TLP is constructed. The diagram below shows the layout of this first double word:
The first field in the header is the ‘fmt’, or format field. This dictates whether the header is 3DW or 4, and whether there is a payload or not. Basically, if bit 0 of the format is 0 it’s a 3DW header, and if 1 it’s a 4DW header. If bit 1 of the format field is 0 then there is no data payload, else there is.
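The fmt rules just described can be sketched in a few lines of Python. This is only an illustration, assuming the two-bit fmt field as described in this article; the function name decode_fmt is made up for the example and is not part of any PCIe library.

```python
def decode_fmt(fmt: int) -> tuple:
    """Return (header length in DWs, payload present) for a 2-bit fmt value.

    Hypothetical helper for illustration only.
    """
    if not 0 <= fmt <= 0b11:
        raise ValueError("fmt is treated as a 2-bit field here")
    header_dws = 4 if fmt & 0b01 else 3   # bit 0: 0 = 3DW header, 1 = 4DW header
    has_payload = bool(fmt & 0b10)        # bit 1: 0 = no data, 1 = data payload
    return header_dws, has_payload


if __name__ == "__main__":
    for fmt in range(4):
        print(f"fmt={fmt:02b} ->", decode_fmt(fmt))
```

So, for example, a 32-bit address memory read has no payload and a 3DW header (fmt 00b), whilst a 64-bit address memory write has both a payload and a 4DW header (fmt 11b).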
The type field, in conjunction with the format field, indicates the specific type of TLP this header is for. The table below gives the possible values:
The three-bit TC field defines the traffic class. We discussed virtual channels in the last article, and how these are mapped (through the configuration space) to traffic classes of TLPs. These three bits define to which class the TLP belongs and thus its priority through links that have more than one virtual channel. Traffic class 0 is always implemented and always mapped to VC0.
The TD bit is the TLP digest bit and indicates the presence of the ECRC TLP Digest word (when set). In addition, there is the EP bit, indicating that the TLP data payload is ‘poisoned’. That means that some error occurred, such as the ECRC check (if TLP digest present) failing at some hop over a link towards its endpoint destination, or perhaps error correction failed when reading memory. A packet is still forwarded, through any switches, to the destination in these cases, which is known as error forwarding. This feature is optional but, if present, the destination reports the error and discards the packet. Though this is a reported error it need not be fatal, as higher layer recovery mechanisms may exist.
The two-bit attribute field (attr[1:0]) is to do with ordering and cache coherency (snooping; see my article on caches). Bit 1 indicates relaxed ordering when set, like for PCI-X, but strict ordering when clear (as for PCI). Bit 0 is a cache snoop bit, where a 1 indicates no snooping for cache coherency, and a 0 indicates cache snooping expected. Both these bits are only set for memory requests.
The AT field, introduced in Gen 2.0, is an address type field. There are three valid values as shown below.
These are only used for memory requests and are reserved for all other types of transactions. The use of these bits relates to the address translation services (ATS) extensions. This allows endpoints to keep local mappings between untranslated addresses and physical addresses. Which type of address is being sent in the header is defined by the AT bits, as per the list above. The translation request (01b) allows endpoints to request the ‘translation agent’ (logically sitting between the root complex and main memory) to return the physical address for storing locally. Using local endpoint mappings relieves the bottleneck in the translation agent. There are also mechanisms for invalidating mappings, but more detail on ATS is beyond the scope of this article.
The last field is the length field. This indicates the length of the data payload in double words (32 bits). All data in a TLP is naturally aligned to double words, with byte enables used to align at bytes and words, where applicable. Note that a length of 0 for packets with a payload indicates 1024 double words, whilst for TLPs with no payload the length field is reserved. Note also that the address and length of a request must not be such that the access crosses a 4Kbyte boundary.
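As a rough sketch of the two length rules, the Python below decodes the length field and checks an access against the 4Kbyte boundary rule. It assumes a 10-bit length field (consistent with 0 encoding 1024 DWs) and a double-word aligned start address; the function names are hypothetical.

```python
def payload_dwords(length_field: int) -> int:
    """Decode the length field into a DW count; a value of 0 encodes 1024 DWs."""
    if not 0 <= length_field <= 0x3FF:
        raise ValueError("length is assumed to be a 10-bit field")
    return 1024 if length_field == 0 else length_field


def crosses_4k(address: int, length_field: int) -> bool:
    """True if the DWs starting at a DW-aligned address span a 4Kbyte boundary."""
    n_bytes = payload_dwords(length_field) * 4
    last_byte = address + n_bytes - 1
    # Two byte addresses lie in the same 4Kbyte page if they match above bit 11
    return (address >> 12) != (last_byte >> 12)


if __name__ == "__main__":
    print(payload_dwords(0))          # 1024
    print(crosses_4k(0x1FFC, 1))      # False: ends at 0x1FFF
    print(crosses_4k(0x1FFC, 2))      # True: ends at 0x2003
```

A requester wanting to transfer across such a boundary would need to break the access into separate requests, each of which stays within its own 4Kbyte region.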
Having defined the headers’ common first double word, present for all TLPs, let’s look in detail at the individual TLP header formats and uses.
Memory Accesses
Memory access TLPs are the fundamental means for doing reads and writes over the PCIe links. As mentioned before, memory accesses come in two forms: a 64-bit long address format and a 32-bit short address format. Both the read and write memory requests can use either format, but a requester accessing an address less than 4GBytes must use the 32-bit format. The diagram below shows the format for the memory requests’ headers for the two address types.
As mentioned above when discussing the common first header DW, the fmt and type fields identify the TLP as either a Memory Write (MWr), a Memory Read (MRd) or a Memory Read-Locked (MRdLk; see the table above). For memory writes the length field will determine the number of double words of the accompanying payload. For memory reads, this is the amount of data requested to be returned. If the TD bit is set, then the ECRC digest is present.
The second double word for memory transactions contains a requester ID, a tag and byte enables for the first and last double words. The requester ID is a unique value for every PCIe function, made up of bus, device and function numbers.
The bus and device numbers are captured by a device during certain configuration writes (more later) and must be used by that device whenever it issues requests. Many devices are single function but, if multiple function, then the device must assign unique function numbers to each function it contains. It is possible that the bus and device numbers change at run-time, and so the device must recapture these numbers if the particular type of configuration write is received once more. This feature might be used when a new device is hot-plugged and the system determines it would be useful to group the new device with devices that have neighbouring numbers, but the number it wants is already allocated. Before being assigned a bus and device number a device can’t initiate any non-posted transactions, as a requester ID is required to route back the completion.
The tag field is assigned a value by the requester that is unique amongst all its other outstanding requests, so that the request may be identified for completions (which might be out of order from the order of requests). By default, only 32 outstanding requests are allowed, but if the device is configured for extended tags in the configuration space, then all 8 bits can be used for 256 outstanding requests. The number of outstanding requests can be extended even further with the use of ‘phantom function numbers’. If a device has fewer than the full number of separate functions that can be supported (8), then the unused function numbers may be used to uniquely identify outstanding transactions in conjunction with the tag. Since a device must have at least one function, leaving 7, this extends the maximum possible outstanding transactions to 1792.
The byte enables indicate the valid bytes at the beginning and end of a transaction. This means the bytes to be written or read, when set. When set on a write, only those bytes are updated. When clear on reads, this indicates the bytes are not to be read if the data isn’t pre-fetchable. If the payload length is 1DW, the last BE must be 0000b. Also, in this case, the first BE need not be contiguous, so 0101b or 1001b etc., are valid. For multiple DW lengths the BEs must be such as to form a contiguous set of bytes, and neither must be 0000b.
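To make the requester ID concrete, here is a minimal sketch of packing and unpacking the 16-bit value. It assumes the usual PCI-style split of an 8-bit bus number, a 5-bit device number and a 3-bit function number (the latter matching the limit of 8 functions per device); the function names are invented for this example.

```python
def pack_id(bus: int, device: int, function: int) -> int:
    """Pack bus/device/function numbers into a 16-bit ID (assumed 8/5/3 bit split)."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus, device or function number out of range")
    return (bus << 8) | (device << 3) | function


def unpack_id(ident: int) -> tuple:
    """Split a 16-bit ID back into (bus, device, function)."""
    return (ident >> 8) & 0xFF, (ident >> 3) & 0x1F, ident & 0x7


if __name__ == "__main__":
    rid = pack_id(bus=0, device=1, function=0)
    print(f"{rid:04x}", unpack_id(rid))
```

The same packing applies wherever an ID of this kind appears in a header, such as the completer ID and the target ID of configuration requests discussed later.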
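The byte enable rules can be written down as a small checker. This is a sketch of only the rules stated above (the specification has further special cases that are not modelled), and it assumes that bit n of a byte enable corresponds to byte n of its double word. The names are hypothetical.

```python
# For multi-DW requests the enabled bytes must join up with the DWs in between:
FIRST_BE_CONTIGUOUS = {0b1111, 0b1110, 0b1100, 0b1000}  # run ends at byte 3
LAST_BE_CONTIGUOUS = {0b1111, 0b0111, 0b0011, 0b0001}   # run starts at byte 0


def byte_enables_valid(first_be: int, last_be: int, length_dws: int) -> bool:
    """Check first/last byte enables against the rules described in the text."""
    if length_dws == 1:
        # Single DW: last BE must be 0000b, first BE may be non-contiguous
        return last_be == 0b0000
    # Multiple DWs: neither BE may be 0000b and the bytes must be contiguous
    return first_be in FIRST_BE_CONTIGUOUS and last_be in LAST_BE_CONTIGUOUS


def enabled_bytes(first_be: int, last_be: int, length_dws: int) -> int:
    """Count the bytes actually selected by a request."""
    if length_dws == 1:
        return bin(first_be).count("1")
    middle = 4 * (length_dws - 2)  # all bytes of the middle DWs are enabled
    return middle + bin(first_be).count("1") + bin(last_be).count("1")


if __name__ == "__main__":
    print(byte_enables_valid(0b0101, 0b0000, 1))   # True
    print(byte_enables_valid(0b0101, 0b1111, 2))   # False
    print(enabled_bytes(0b1000, 0b0111, 33))       # 128
```

The last line shows how a 33DW request can select exactly 128 bytes when the access starts at the top byte of the first double word and ends one byte short in the last.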
After the second double word (or words), the address follows. This address must be aligned to a double word so the lower two bits are reserved and are implied as 0.
The above descriptions refer to both writes and reads to memory. The memory read lock variant, identified with a type of 00001b, is identical in usage to normal memory reads. It is included for legacy reasons, as PCI supported locked reads, but it has the potential for bus lockup, so new device designs are not to include support for this type of read, and normally only root complexes would issue these types of transaction.
Completions
Completions are used as responses to all non-posted requests. That is, all read requests and non-posted write requests (i.e., I/O and configurations writes). The diagram below shows the header format for a completion.
All completion headers are 3DW, so fmt bit 0 is always 0. A completion sets the TC, attribute, requester ID and tag fields to match those of the request to which it is a response.
The second DW carries a completer ID, which is the bus, device and function number for the device issuing the completion, using bus and device numbers as captured on receipt of a CfgWr0 (more later). If the bus and device hasn’t been programmed, then a completion sets the bus and device to 0. All completions use ID routing, and the requester ID sent with the non-posted transaction is used in the third DW for this purpose. After the completer ID is a 3-bit completion status which has the following valid values:
Of the non-successful statuses, we have mentioned unsupported request before, where a request, such as a vendor message, is not implemented so a completion with UR is returned. The CRS status is for configuration requests only where, say after initialisation, a configuration request can’t yet be processed but will be able to be in the future, and a retry can be scheduled. The completer abort is used only to indicate a serious error that makes the completer permanently unable to respond to a request that it would otherwise have normally responded to and is a reported error. The error that might result in such a response can be very high level, such as violating the programming model of the device.
The BCM (byte count modified) field is for PCI legacy support, and PCIe completers should set this to 0. The byte count gives the remaining byte count to complete the read request, including the payload data of the completion. For memory reads, the completions can be split into multiple completions, so long as the total amount sent exactly equals that requested. Since all I/O and configuration reads are 1DW in length, only one completion is allowed for these packets. Note that a byte count of 0 equals 4096 bytes.
The final completer field to mention is the lower address field. For completions other than for memory reads, this value is set to 0. For memory reads it is the lower byte address of the first byte in the returned data (or partial data). This is set for the first (or only) completion and will be 0 in the lower 7 bits from then on, as the completions, if split, must be naturally aligned to a read completion boundary (RCB), which is usually 128 bytes (though 64 bytes in root complex).
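The interaction of the byte count, the lower address and the RCB is perhaps easiest to see with a small model. The sketch below splits a read at every RCB boundary, which is just one permissible way of splitting (a completer may equally return larger aligned pieces, or a single completion), and it assumes a 128-byte RCB and a 7-bit lower address field. The function name is made up for the example.

```python
def split_completions(address: int, n_bytes: int, rcb: int = 128) -> list:
    """Split a memory read into completions that break on RCB boundaries.

    Returns a list of (lower_address, byte_count, bytes_in_completion) tuples,
    where byte_count is the amount remaining including this completion.
    """
    completions = []
    remaining = n_bytes
    while remaining > 0:
        to_boundary = rcb - (address % rcb)     # bytes left before the next RCB
        chunk = min(remaining, to_boundary)
        completions.append((address & 0x7F, remaining, chunk))
        address += chunk
        remaining -= chunk
    return completions


if __name__ == "__main__":
    for lower, count, chunk in split_completions(0xA0000083, 128):
        print(f"lower addr={lower:#04x} byte count={count} bytes sent={chunk}")
```

For a 128-byte read starting at byte address 0xa0000083, this gives a first completion with a lower address of 3 and a byte count of 128, carrying 125 bytes, followed by a second with a lower address of 0 and a byte count of 3.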
The diagram below shows some traffic, with requests and completions, from the pcie model with just the transaction layer enabled for display, with colour and highlights added for clarity:
In this traffic snippet we can see the down link sending out two requests: a memory read request, with a tag of 1, and a configuration read (type 0) request with a tag of 3. For the memory read, the address is given as 0xa0000080, but the first byte enable (FBE) is 1000b, enabling only the top byte of the first double word, so the data actually starts at byte address 0xa0000083. The length is given as 0x21 (33) DWs, but the last byte enable is 0111b, so the actual length of the transfer, in bytes, is 128. The traffic class is TC0 and the request has a digest word (ECRC).
The successful completion for the memory read is returned by the upstream port after the config read request, identified as for the memory access with a tag of 1. The count is set at 0x80, matching the 128-byte request (so no split completion), but the bytes are spread over the 132 returned bytes (33 DWs) since the address offset was 3, and the lower address value in the header reflects this.
The completion for the configuration read reflects that all configuration reads return a single DW, so the byte count is 4 and the payload length 1.
I/O Accesses
I/O access transactions are very similar to memory access transactions, but with some restrictions. As mentioned before, they are used to access an I/O space that’s separate from the memory address space and are really for legacy support. The diagram below shows the header for these types of TLPs.
The main thing to note here is that only 32-bit address types are supported for I/O requests, so bit 0 of the fmt field is always 0. An I/O request can only be for 1DW, so the length is always 1. Also, to comply with the BE rules for 1DW payloads, the last BE is fixed at 0000b. Since the attribute field bits are associated with memory access ordering and cache snooping, they are both set to 0 in I/O TLP headers. Other than this, I/O transactions work in much the same way as 32-bit address, 1DW memory accesses.
Configuration Space Access
The configuration space is a third address space, separate from the memory and I/O spaces. In addition, unlike memory and I/O TLPs, the transactions are not routed with an address but with an ID, containing a bus-, device-, and function number, as per the requester ID mentioned in the discussion of memory accesses. There we talked of unique bus, device, and function number for each device on the fabric, and transactions for configuration accesses use these to specify the destination configuration space. The header format for configuration request TLPs is shown below:
Like I/O TLPs, configuration TLPs are only 1DW, and the same field values are set to 0 and length set to 1, as for I/O. The device sending the configuration request also has a unique ID, with bus, device, and function number, and this is in the second DW as for other transactions. The device it is addressing is in the third DW, in lieu of the 32-bit address, with the target bus, device, and function numbers.
In addition, there is a register number. The configuration space is made up of a set of 8-bit registers with an offset associated with each of them, addressed by the register number. The PCIe device has a PCI compatible 256 register space, addressed by the register number, but extends this to a 4096 register space. The extended register number bits are used to access this extended space. Thus, the PCI compatible configuration space occupies offsets 0 to FFh, and the PCIe extended configuration space occupies offsets 100h to FFFh.
One final thing that identifies the destination is the configuration TLP’s type, either type 1 or type 0, as shown in the table of TLP types in the section on TLP headers. Type 0 configuration reads and writes are routed to a destination device (endpoint) and intermediate link hops simply route the request to the destination. Type 1 configuration accesses are destined for root complexes or switches/bridges. The configuration register set for type 1 is different from type 0, though there are common registers (more later).
Note that the bus and device numbers, as used by completions, are not fixed for a given link. Whenever an endpoint receives a type 0 configuration write, the bus and device number used in the transaction is set in the device’s configuration space and used in the CID of all completions it generates. It is sampled on all type 0 configuration writes, as it may be updated dynamically whilst the link is up. The configuration space itself will be discussed in a separate section.
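As an illustration of how the two register number fields combine, the sketch below forms a byte offset into the configuration space. It assumes the standard header layout of a 6-bit register number selecting a double word and a 4-bit extended register number above it, which together with the two implied byte bits span the 4096 byte space; the function names are hypothetical.

```python
def config_offset(ext_reg_num: int, reg_num: int) -> int:
    """Byte offset of the addressed DW (assumed 4-bit extended, 6-bit register number)."""
    if not (0 <= ext_reg_num <= 0xF and 0 <= reg_num <= 0x3F):
        raise ValueError("register number field out of range")
    return (ext_reg_num << 8) | (reg_num << 2)


def in_extended_space(offset: int) -> bool:
    """True for offsets 100h to FFFh, the PCIe extended configuration space."""
    return 0x100 <= offset <= 0xFFF


if __name__ == "__main__":
    print(hex(config_offset(0x0, 0x3F)))   # 0xfc, last DW of the PCI compatible space
    print(hex(config_offset(0x1, 0x00)))   # 0x100, first DW of the extended space
```

With the extended register number at 0, only the PCI compatible 256 bytes can be reached, which is why legacy PCI software still sees the space it expects.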
Messages
Messages convey a variety of information that isn’t an access to an addressable space. The general groups of information carried by messages are:
The general format for a message header is shown below:
Message headers are 4DW, so bit 0 of the fmt field is fixed at 1. The attribute field is also fixed at 00b. Some message types can have payloads (MsgD TLPs) as well as be assigned a traffic class. A requester ID and tag are included as normal, but in place of the byte enables is a message code defining the type of the TLP message. For most message types, the third and fourth double words are reserved, but they are used for some types, as we shall see.
Unlike the other TLPs, messages can have different routing types. The table in the TLP Header section listed the Msg/MsgD types as having their lower 3 bits of type as rrr. These bits define the routing used, as shown in the table below.
Interrupt Messages
Interrupt messages are for legacy support, though they must be implemented. The preferred interrupt signalling method is to use message signalled interrupts—MSI or MSI-X (extended). These are implemented using normal memory write transactions. PCI Express devices must support MSI, but legacy devices might not be capable, and the interrupt messages are used in that case. Switch devices, at least, must support the interrupt messages.
The interrupt messages effectively implement four ‘virtual wires’ that can be asserted or deasserted, namely A, B, C and D, mirroring the four wires in PCI. Thus, there are two types of interrupt message, Assert_INTx and Deassert_INTx, where x is one of the virtual wires. The message codes for the eight interrupt messages are as follows:
All interrupt messages use local routing (rrr = 100b). It is up to the switches to amalgamate interrupts arriving on its downstream ports and map these to interrupts on its upstream port. Also, only upstream ports (e.g., endpoint to switch) can issue these messages as it makes no sense sending interrupts ‘away’ from the CPU direction towards endpoint devices. Interrupt messages never have a payload (so no MsgD types). Ultimately, at the root complex, an actual interrupt is raised on the system interrupt resource system—e.g., an interrupt controller.
So, these messages are sent by an upstream port whenever the state of one of the interrupts changes, either to active or inactive. Duplicate messages (e.g., a second Assert_INTB without a deassertion) have no effect but are not errors and are ignored by the receiving device. Note that interrupts can be disabled individually in the command register of the configuration space and, if in an asserted state when disabled, a Deassert_INTx message must be sent.
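The virtual wire behaviour at a switch can be modelled as a logical OR of the wire states seen on each downstream port, with a message sent upstream only when the combined state changes. The class below is a simplified sketch of that idea: it ignores any remapping of wires between ports that a real switch may perform, and the class and method names are invented for the example.

```python
class VirtualWireCollector:
    """Toy model of a switch combining INTx virtual wires from downstream ports."""

    def __init__(self, n_ports: int):
        # One asserted/deasserted flag per wire, per downstream port
        self.state = [{wire: False for wire in "ABCD"} for _ in range(n_ports)]

    def receive(self, port: int, wire: str, asserted: bool):
        """Process an Assert/Deassert message; return the upstream message, if any."""
        before = any(p[wire] for p in self.state)
        self.state[port][wire] = asserted
        after = any(p[wire] for p in self.state)
        if after == before:
            return None  # duplicate, or no change in the combined wire state
        return f"{'Assert' if after else 'Deassert'}_INT{wire}"


if __name__ == "__main__":
    switch = VirtualWireCollector(n_ports=2)
    print(switch.receive(0, "B", True))    # Assert_INTB
    print(switch.receive(0, "B", True))    # None (duplicate is ignored)
    print(switch.receive(1, "B", True))    # None (wire already asserted upstream)
    print(switch.receive(0, "B", False))   # None (port 1 still asserting)
    print(switch.receive(1, "B", False))   # Deassert_INTB
```

The upstream port thus only ever sees one assertion and one deassertion per wire, however many downstream devices share it.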
Power Management Messages
We have already alluded to one of the power management messages, PM_Active_State_Nak, when discussing the data link layer in the second article in this series, used when a downstream device is requesting a lower power state by sending PM_Active_State_Request_L1 DLLPs, and this is sent if the request is rejected. There are three other message types to look at, and the full message code encodings are shown below:
None of these messages include a payload (no MsgD types) and all are traffic class 0 (TC0).
The PM_PME message signals a ‘power management event’—e.g., some change in state of power has completed. These are sent by an endpoint device towards the root complex and are another source of interrupt. All these events can be enabled/disabled in the configuration space, like the interrupt messages.
The last two power management messages are PME_Turn_Off, a request broadcast from the root complex to prepare for power removal, and PME_TO_Ack, an acknowledgment sent back to the root complex that the appropriate state is reached. From a link LTSSM point of view, the downstream component must get to L0, if in a lower power state, so the PME_TO_Ack can be sent, and then it eventually ends up in L2 (see the first article in this series). Power can then be removed (L3 power state) when the root complex has seen acknowledgement from all the devices.
Error Signalling Messages
Error signals originate from downstream components and are routed towards the root complex (routing type 000b) and do not have payloads (no MsgD types). There are three types of error messages, as listed below:
The message types reflect correctable, non-correctable but not fatal, and fatal errors. An example of a correctable error might be a TLP LCRC error, where this can be fixed with a retry. This is correctable, but the error might still be reported for analysis and debug of error rates. A non-fatal error is one which cannot be corrected but does not render the link itself unusable. It would then be up to software to process the error to recover the situation, if possible. The reception of a malformed packet might be an example of a non-fatal error. A fatal error is one where a link is now considered unreliable, for example, a time out on acknowledgements that has reached the maximum number of link retraining attempts. The three types of error can be individually enabled or disabled in the device control register of the configuration space. Some error types are listed below:
If extended capabilities are supported in the configuration space, then, if the advanced error reporting capability structure is present, the above errors have their own separate status and can be enabled or disabled individually.
Locked Transaction Messages
There is a single message used to support locked transactions. As we have seen previously, there are MRdLk and CplDLk TLP types. A locked transaction is initiated by one or more CPU locked read accesses (with subsequent CplDLk responses) followed by a number of writes to the same locations. This establishes a lock, and all other traffic is blocked from using the link path from the RC to the (legacy) endpoint. The lock is released by sending an Unlock message from the root complex. The message code value is shown below:
The Unlock messages do not have payloads (no MsgD types) and always have a traffic class of 0 (TC0).
Slot Power Limit Messages
There is a single message defined for support of slot power limiting: Set_Slot_Power_Limit. This message is used to set a slot power limitation value from a downstream port of a root complex or switch to an upstream port of a component (e.g., endpoint or switch) attached to the same link. The message code value is shown below:
The Set_Slot_Power_Limit message includes a 1DW data payload, and this data payload is copied from the slot capabilities configuration space register of the downstream port and is written into the device capabilities register’s captured slot power limit fields (a scale and limit) of the upstream port on the other side of the link. The two fields then define the upper limit of power supplied by the slot, which the device must honour.
All Set_Slot_Power_Limit messages must belong to traffic class 0 (TC0).
Vendor Defined Messages
Vendor messages are meant for PCIe expansion or vendor-specific functionality. There are two types of vendor messages defined: type 0 and type 1. Both types can be routed using one of four mechanisms: routed to RC (000b), routed by ID (010b), broadcast from RC (011b), and routed locally (100b). The message codes for the two vendor defined messages are as listed:
The main difference between type 0 and type 1 vendor messages is that receiving a type 0 vendor message when vendor messages are not implemented triggers an unsupported request (UR) error, whereas receiving a type 1 message when not implemented discards the packet without error.
The structure of a vendor message is shown in the diagram below:
In these messages, bytes 8 and 9 are a route ID (bus, device, and function numbers) when the routing type is 010b; otherwise these bytes are reserved. The last DW is defined by the vendor specific implementation. Vendor messages may contain payloads (Msg and MsgD TLPs supported). Bit 0 of the format field in the first DW is fixed at 1, as the header is always 4DWs long. The attribute field, though, is not fixed and either bit may be set, and any traffic class value can be used.
Conclusions
In this article we have gone through all the transaction layer packet types and discussed their use within the PCIe protocols. Necessarily, this has been a summary, as the amount of detail would quickly overwhelm an article such as this.
For the most part, the transaction layer is involved in reading and writing to various address spaces: memory, I/O and configuration, each with their own transaction layer packet (TLP) types. These access requests, where applicable, result in completion packets with a success/error status and, when reading, returned data, which can be split into smaller completions. Each outstanding packet request has a unique tag, and completions are matched to their request using the same tag number. We have also seen that packets can be routed using different mechanisms: address, ID, routed to RC (possibly gathered), broadcast from RC and routed to local link. As well as the different kinds of reading and writing transactions, there are message TLPs used for interrupt signalling, error reporting, power management, locked transaction support, and vendor defined messages.
We have now covered all three layers of the PCIe protocol, so that should be it, right? In this article I have mentioned the configuration space on numerous occasions, but with only enough detail to explain what was necessary. In the next (and last) article I want to look at the configuration space in a little more detail to see what information it contains and what can be controlled. Then we will finish with a quick look at later specifications, including PCIe 6.0, released in January of this year (2022), and PCIe 7.0, the development of which was announced in June. I will also try to summarise what features these articles have not covered, through lack of space and time.