| ===================================================
 | |
| PCI Express I/O Virtualization Resource on PowerNV
 | |
| ===================================================
 | |
| 
 | |
| Wei Yang <weiyang@linux.vnet.ibm.com>
 | |
| 
 | |
| Benjamin Herrenschmidt <benh@au1.ibm.com>
 | |
| 
 | |
| Bjorn Helgaas <bhelgaas@google.com>
 | |
| 
 | |
| 26 Aug 2014
 | |
| 
 | |
| This document describes the requirement from hardware for PCI MMIO resource
 | |
| sizing and assignment on PowerKVM and how generic PCI code handles this
 | |
| requirement. The first two sections describe the concepts of Partitionable
 | |
| Endpoints and the implementation on P8 (IODA2). The next two sections discuss
 | |
| considerations for enabling SR-IOV on IODA2.
 | |
| 
 | |
| 1. Introduction to Partitionable Endpoints
 | |
| ==========================================
 | |
| 
 | |
| A Partitionable Endpoint (PE) is a way to group the various resources
 | |
| associated with a device or a set of devices to provide isolation between
 | |
| partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
 | |
| to freeze a device that is causing errors in order to limit the possibility
 | |
| of propagation of bad data.
 | |
| 
 | |
| There is thus, in HW, a table of PE states that contains a pair of "frozen"
 | |
| state bits (one for MMIO and one for DMA, they get set together but can be
 | |
| cleared independently) for each PE.
 | |
| 
 | |
| When a PE is frozen, all stores in any direction are dropped and all loads
 | |
| return all 1's.  MSIs are also blocked.  Additional state captures details
 | |
| such as the error that caused the freeze, but that is not critical for this
 | |
| discussion.
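
As a rough illustration of the freeze semantics described above, the sketch below models one entry of the PE state table in software.  The class and function names are hypothetical, and the assumption that the MMIO frozen bit alone gates MMIO loads and stores is ours, not something the hardware description spells out.

```python
from dataclasses import dataclass

ALL_ONES_32 = 0xFFFF_FFFF  # what a 32-bit load sees when the PE is frozen


@dataclass
class PEState:
    """Hypothetical software model of one PE state table entry."""
    mmio_frozen: bool = False
    dma_frozen: bool = False

    def freeze(self) -> None:
        # Both frozen bits are set together.
        self.mmio_frozen = True
        self.dma_frozen = True

    def clear_mmio(self) -> None:
        # The bits can be cleared independently of each other.
        self.mmio_frozen = False

    def clear_dma(self) -> None:
        self.dma_frozen = False


def mmio_load32(pe: PEState, device_value: int) -> int:
    """Return what the CPU would observe for a 32-bit MMIO load."""
    return ALL_ONES_32 if pe.mmio_frozen else device_value


def mmio_store_reaches_device(pe: PEState) -> bool:
    """A store aimed at a frozen PE is dropped."""
    return not pe.mmio_frozen
```

Clearing only the MMIO bit in this model lets loads return real data again while DMA stays blocked.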
 | |
| 
 | |
| The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
 | |
| are matched to their corresponding PEs.
 | |
| 
 | |
| The following section provides a rough description of what we have on P8
 | |
| (IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
 | |
| is a completely separate HW entity that replicates the entire logic, so has
 | |
| its own set of PEs, etc.
 | |
| 
 | |
| 2. Implementation of Partitionable Endpoints on P8 (IODA2)
 | |
| ==========================================================
 | |
| 
 | |
| P8 supports up to 256 Partitionable Endpoints per PHB.
 | |
| 
 | |
|   * Inbound
 | |
| 
 | |
|     For DMA, MSIs and inbound PCIe error messages, we have a table (in
 | |
|     memory but accessed in HW by the chip) that provides a direct
 | |
|     correspondence between a PCIe RID (bus/dev/fn) and a PE number.
 | |
|     We call this the RTT.
 | |
| 
 | |
|     - For DMA we then provide an entire address space for each PE that can
 | |
|       contain two "windows", depending on the value of PCI address bit 59.
 | |
|       Each window can be configured to be remapped via a "TCE table" (IOMMU
 | |
|       translation table), which has various configurable characteristics
 | |
|       not described here.
 | |
| 
 | |
|     - For MSIs, we have two windows in the address space (one at the top of
 | |
|       the 32-bit space and one much higher) which, via a combination of the
 | |
|       address and MSI value, will result in one of the 2048 interrupts per
 | |
|       bridge being triggered.  There's a PE# in the interrupt controller
 | |
|       descriptor table as well which is compared with the PE# obtained from
 | |
|       the RTT to "authorize" the device to emit that specific interrupt.
 | |
| 
 | |
|     - Error messages just use the RTT.
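
The inbound lookup can be pictured with a small sketch.  The RID packing (8-bit bus, 5-bit device, 3-bit function) is the standard PCI layout; the list used as the RTT, the use of None for an unassigned entry, and the function names are assumptions made for illustration only.

```python
from typing import List, Optional

NUM_RIDS = 1 << 16  # 8-bit bus, 5-bit device, 3-bit function


def make_rid(bus: int, dev: int, fn: int) -> int:
    """Pack bus/dev/fn into a 16-bit PCIe RID."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8):
        raise ValueError("bus/dev/fn out of range")
    return (bus << 8) | (dev << 3) | fn


def new_rtt() -> List[Optional[int]]:
    """Hypothetical RTT model: one PE# slot per possible RID."""
    return [None] * NUM_RIDS


def msi_authorized(rtt: List[Optional[int]], rid: int, irq_pe: int) -> bool:
    """Allow the MSI only if the sender's PE# (from the RTT) matches
    the PE# recorded for the interrupt."""
    sender_pe = rtt[rid]
    return sender_pe is not None and sender_pe == irq_pe
```

A device whose RID maps to a different PE#, or to no PE at all, fails the comparison and cannot raise that interrupt.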
 | |
| 
 | |
|   * Outbound.  That's where the tricky part is.
 | |
| 
 | |
|     Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
 | |
|     from the CPU address space to the PCI address space.  There is one M32
 | |
|     window and sixteen M64 windows.  They have different characteristics.
 | |
|     First what they have in common: they forward a configurable portion of
 | |
|     the CPU address space to the PCIe bus, and each must be naturally
 | |
|     aligned and a power of two in size.  The rest is different:
 | |
| 
 | |
|     - The M32 window:
 | |
| 
 | |
|       * Is limited to 4GB in size.
 | |
| 
 | |
|       * Drops the top bits of the address (above the size) and replaces
 | |
| 	them with a configurable value.  This is typically used to generate
 | |
| 	32-bit PCIe accesses.  We configure that window at boot from FW and
 | |
| 	don't touch it from Linux; it's usually set to forward a 2GB
 | |
| 	portion of address space from the CPU to PCIe addresses
 | |
| 	0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
 | |
| 	reserved for MSIs, but this is not a problem at this point; we just
 | |
| 	need to ensure Linux doesn't assign anything there.  The M32 logic
 | |
| 	ignores that reservation, however, and will forward accesses in
 | |
| 	that space if we try.)
 | |
| 
 | |
|       * Is divided into 256 segments of equal size.  A table in the chip
 | |
| 	maps each segment to a PE#.  That allows portions of the MMIO space
 | |
| 	to be assigned to PEs on a segment granularity.  For a 2GB window,
 | |
| 	the segment granularity is 2GB/256 = 8MB.
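
The segment arithmetic for the M32 window can be checked with a short sketch.  The function names and the plain list standing in for the on-chip segment table are hypothetical.

```python
from typing import List, Optional

M32_SEGMENTS = 256
MB = 1 << 20
GB = 1 << 30


def m32_segment_size(window_size: int) -> int:
    """Segment size of an M32 window: the window is cut into 256 equal parts."""
    if window_size > 4 * GB or window_size & (window_size - 1):
        raise ValueError("M32 window must be a power of two and at most 4GB")
    return window_size // M32_SEGMENTS


def m32_lookup_pe(addr: int, base: int, size: int,
                  seg_table: List[int]) -> Optional[int]:
    """Map an address inside the window to a PE# through the segment table."""
    if not base <= addr < base + size:
        return None  # address is not covered by this window
    segment = (addr - base) // m32_segment_size(size)
    return seg_table[segment]
```

For the usual 2GB window, `m32_segment_size(2 * GB)` is 8MB, matching the figure quoted above.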
 | |
| 
 | |
|     Now, this is the "main" window we use in Linux today (excluding
 | |
|     SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
 | |
|     onto a segment alignment/granularity so that the space behind a bridge
 | |
|     can be assigned to a PE.
 | |
| 
 | |
|     Ideally we would like to be able to have individual functions in PEs
 | |
|     but that would mean using a completely different address allocation
 | |
|     scheme where individual function BARs can be "grouped" to fit in one or
 | |
|     more segments.
 | |
| 
 | |
|     - The M64 windows:
 | |
| 
 | |
|       * Must be at least 256MB in size.
 | |
| 
 | |
|       * Do not translate addresses (the address on PCIe is the same as the
 | |
| 	address on the PowerBus).  There is a way to also set the top 14
 | |
| 	bits which are not conveyed by PowerBus but we don't use this.
 | |
| 
 | |
|       * Can be configured to be segmented.  When not segmented, we can
 | |
| 	specify the PE# for the entire window.  When segmented, a window
 | |
| 	has 256 segments; however, there is no table for mapping a segment
 | |
| 	to a PE#.  The segment number *is* the PE#.
 | |
| 
 | |
|       * Support overlaps.  If an address is covered by multiple windows,
 | |
| 	there's a defined ordering for which window applies.
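
By contrast with the M32 window, a segmented M64 window has no lookup table, so the PE# follows directly from the address.  The sketch below is a simplified model with hypothetical names; it checks the size and alignment rules listed above and does not attempt to model the overlap ordering between windows.

```python
from typing import Optional

M64_SEGMENTS = 256
MB = 1 << 20
GB = 1 << 30


def m64_window_ok(base: int, size: int) -> bool:
    """At least 256MB, a power of two, and naturally aligned."""
    return size >= 256 * MB and size & (size - 1) == 0 and base % size == 0


def m64_segmented_pe(addr: int, base: int, size: int) -> Optional[int]:
    """In a segmented M64 window the segment number *is* the PE#."""
    if not m64_window_ok(base, size):
        raise ValueError("invalid M64 window")
    if not base <= addr < base + size:
        return None  # address is not covered by this window
    return (addr - base) // (size // M64_SEGMENTS)
```

With a 64GB window, each segment is 256MB, so moving a BAR by 256MB changes its PE# by one.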
 | |
| 
 | |
|     We have code (fairly new compared to the M32 stuff) that exploits that
 | |
|     for large BARs in 64-bit space:
 | |
| 
 | |
|     We configure an M64 window to cover the entire region of address space
 | |
|     that has been assigned by FW for the PHB (about 64GB; ignore the space
 | |
|     for the M32, which comes out of a different "reserve").  We configure it
 | |
|     as segmented.
 | |
| 
 | |
|     Then we do the same thing as with M32, using the bridge alignment
 | |
|     trick, to match to those giant segments.
 | |
| 
 | |
|     Since we cannot remap, we have two additional constraints:
 | |
| 
 | |
|     - We do the PE# allocation *after* the 64-bit space has been assigned
 | |
|       because the addresses we use directly determine the PE#.  We then
 | |
|       update the M32 PE# for the devices that use both 32-bit and 64-bit
 | |
|       spaces or assign the remaining PE# to 32-bit only devices.
 | |
| 
 | |
|     - We cannot "group" segments in HW, so if a device ends up using more
 | |
|       than one segment, we end up with more than one PE#.  There is a HW
 | |
|       mechanism to make the freeze state cascade to "companion" PEs but
 | |
|       that only works for PCIe error messages (typically used so that if
 | |
|       you freeze a switch, it freezes all its children).  So we do it in
 | |
|       SW.  We lose a bit of effectiveness of EEH in that case, but that's
 | |
|       the best we found.  So when any of the PEs freezes, we freeze the
 | |
|       other ones for that "domain".  We thus introduce the concept of
 | |
|       "master PE" which is the one used for DMA, MSIs, etc., and "secondary
 | |
|       PEs" that are used for the remaining M64 segments.
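
The software cascade between a master PE and its secondary PEs might look roughly like the following.  This is only a sketch of the idea; the class, its fields, and the callback name are hypothetical and do not correspond to actual kernel code.

```python
from typing import Iterable, List, Set


class PEDomain:
    """Hypothetical grouping of a master PE with its secondary PEs."""

    def __init__(self, master: int, secondaries: Iterable[int]):
        self.master = master              # used for DMA, MSIs, etc.
        self.secondaries = set(secondaries)  # remaining M64 segments
        self.frozen: Set[int] = set()

    def members(self) -> Set[int]:
        return {self.master} | self.secondaries

    def on_hw_freeze(self, pe: int) -> List[int]:
        """Hardware froze one member; freeze the others in software.

        Returns the PE#s that software had to freeze itself.
        """
        if pe not in self.members():
            raise ValueError("PE is not part of this domain")
        self.frozen = set(self.members())
        return sorted(self.members() - {pe})
```

For a device spread over PEs 4, 5 and 6, a hardware freeze of PE 5 makes the model freeze PEs 4 and 6 as well.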
 | |
| 
 | |
|     We would like to investigate using additional M64 windows in "single
 | |
|     PE" mode to overlay over specific BARs to work around some of that, for
 | |
|     example for devices with very large BARs, e.g., GPUs.  It would make
 | |
|     sense, but we haven't done it yet.
 | |
| 
 | |
| 3. Considerations for SR-IOV on PowerKVM
 | |
| ========================================
 | |
| 
 | |
|   * SR-IOV Background
 | |
| 
 | |
|     The PCIe SR-IOV feature allows a single Physical Function (PF) to
 | |
|     support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
 | |
|     Capability control the number of VFs and whether they are enabled.
 | |
| 
 | |
|     When VFs are enabled, they appear in Configuration Space like normal
 | |
|     PCI devices, but the BARs in VF config space headers are unusual.  For
 | |
|     a non-VF device, software uses BARs in the config space header to
 | |
|     discover the BAR sizes and assign addresses for them.  For VF devices,
 | |
|     software uses VF BAR registers in the *PF* SR-IOV Capability to
 | |
|     discover sizes and assign addresses.  The BARs in the VF's config space
 | |
|     header are read-only zeros.
 | |
| 
 | |
|     When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
 | |
|     base address for all the corresponding VF(n) BARs.  For example, if the
 | |
|     PF SR-IOV Capability is programmed to enable eight VFs, and it has a
 | |
|     1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
 | |
|     This region is divided into eight contiguous 1MB regions, each of which
 | |
|     is a BAR0 for one of the VFs.  Note that even though the VF BAR
 | |
|     describes an 8MB region, the alignment requirement is for a single VF,
 | |
|     i.e., 1MB in this example.
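
The VF BAR layout from the example above reduces to simple arithmetic, sketched below with hypothetical helper names and an arbitrary example base address.

```python
MB = 1 << 20


def vf_bar_region_size(vf_bar_size: int, num_vfs: int) -> int:
    """Total space described by one VF BAR in the PF SR-IOV Capability."""
    return vf_bar_size * num_vfs


def vf_bar_address(vf_bar_base: int, vf_bar_size: int, n: int) -> int:
    """Address of the corresponding BAR of VF(n)."""
    return vf_bar_base + n * vf_bar_size


def vf_bar_base_is_aligned(vf_bar_base: int, vf_bar_size: int) -> bool:
    """Alignment is required only to the size of a single VF BAR."""
    return vf_bar_base % vf_bar_size == 0


# Eight VFs with a 1MB VF BAR0 cover an 8MB region.  The base below is an
# arbitrary example value: 1MB-aligned, but deliberately not 8MB-aligned.
example_base = 0x8010_0000
example_region = vf_bar_region_size(1 * MB, 8)
```

The example base is acceptable even though it is not aligned to the full 8MB region.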
 | |
| 
 | |
|   There are several strategies for isolating VFs in PEs:
 | |
| 
 | |
|   - M32 window: There's one M32 window, and it is split into 256
 | |
|     equally-sized segments.  The finest granularity possible is a 256MB
 | |
|     window with 1MB segments.  VF BARs that are 1MB or larger could be
 | |
|     mapped to separate PEs in this window.  Each segment can be
 | |
|     individually mapped to a PE via the lookup table, so this is quite
 | |
|     flexible, but it works best when all the VF BARs are the same size.  If
 | |
|     they are different sizes, the entire window has to be small enough that
 | |
|     the segment size matches the smallest VF BAR, which means larger VF
 | |
|     BARs span several segments.
 | |
| 
 | |
|   - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
 | |
|     to a single PE, so it could only isolate one VF.
 | |
| 
 | |
|   - Single segmented M64 windows: A segmented M64 window could be used just
 | |
|     like the M32 window, but the segments can't be individually mapped to
 | |
|     PEs (the segment number is the PE#), so there isn't as much
 | |
|     flexibility.  A VF with multiple BARs would have to be in a "domain" of
 | |
|     multiple PEs, which is not as well isolated as a single PE.
 | |
| 
 | |
|   - Multiple segmented M64 windows: As usual, each window is split into 256
 | |
|     equally-sized segments, and the segment number is the PE#.  But if we
 | |
|     use several M64 windows, they can be set to different base addresses
 | |
|     and different segment sizes.  If we have VFs that each have a 1MB BAR
 | |
|     and a 32MB BAR, we could use one M64 window to assign 1MB segments and
 | |
|     another M64 window to assign 32MB segments.
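
The multiple-window strategy can be illustrated with the 1MB/32MB example.  The sketch assumes, for simplicity, that each VF BAR base sits exactly at the start of its own M64 window; the function names are hypothetical.

```python
MB = 1 << 20
GB = 1 << 30
SEGMENTS = 256


def window_size_for(vf_bar_size: int) -> int:
    """Window size whose segment size equals one VF BAR."""
    return SEGMENTS * vf_bar_size


def pe_for_vf_bar(window_base: int, vf_bar_base: int,
                  vf_bar_size: int, n: int) -> int:
    """PE# (segment number) that the BAR of VF(n) falls into."""
    addr = vf_bar_base + n * vf_bar_size
    return (addr - window_base) // vf_bar_size


# Two hypothetical, naturally aligned windows: one per VF BAR size.
win_small_base = 1 << 40          # 1MB segments, 256MB window
win_large_base = (1 << 40) + 8 * GB  # 32MB segments, 8GB window
```

Because both windows start numbering at segment 0, the 1MB BAR and the 32MB BAR of the same VF resolve to the same PE# in this model.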
 | |
| 
 | |
|   Finally, the plan is to use M64 windows for SR-IOV, as described in more
 | |
|   detail in the rest of this document.  For a given VF BAR, we need to
 | |
|   effectively reserve the entire 256 segments (256 * VF BAR size) and
 | |
|   position the VF BAR to start at the beginning of a free range of
 | |
|   segments/PEs inside that M64 window.
 | |
| 
 | |
|   The goal is of course to be able to give a separate PE for each VF.
 | |
| 
 | |
|   The IODA2 platform has 16 M64 windows, which are used to map MMIO
 | |
|   range to PE#.  Each M64 window defines one MMIO range and this range is
 | |
|   divided into 256 segments, with each segment corresponding to one PE.
 | |
| 
 | |
|   We decided to leverage this M64 window to map VFs to individual PEs, since
 | |
|   SR-IOV VF BARs are all the same size.
 | |
| 
 | |
|   But doing so introduces another problem: total_VFs is usually smaller
 | |
|   than the number of M64 window segments, so if we map one VF BAR directly
 | |
|   to one M64 window, some part of the M64 window will map to another
 | |
|   device's MMIO range.
 | |
| 
 | |
|   IODA2 supports 256 PEs, so segmented windows contain 256 segments.  If
 | |
|   total_VFs is less than 256, we have the situation in Figure 1.0, where
 | |
|   segments [total_VFs, 255] of the M64 window may map to some MMIO range on
 | |
|   other devices::
 | |
| 
 | |
|      0      1                     total_VFs - 1
 | |
|      +------+------+-     -+------+------+
 | |
|      |      |      |  ...  |      |      |
 | |
|      +------+------+-     -+------+------+
 | |
| 
 | |
|                            VF(n) BAR space
 | |
| 
 | |
|      0      1                     total_VFs - 1                255
 | |
|      +------+------+-     -+------+------+-      -+------+------+
 | |
|      |      |      |  ...  |      |      |   ...  |      |      |
 | |
|      +------+------+-     -+------+------+-      -+------+------+
 | |
| 
 | |
|                            M64 window
 | |
| 
 | |
| 		Figure 1.0 Direct map VF(n) BAR space
 | |
| 
 | |
|   Our current solution is to allocate 256 segments even if the VF(n) BAR
 | |
|   space doesn't need that much, as shown in Figure 1.1::
 | |
| 
 | |
|      0      1                     total_VFs - 1                255
 | |
|      +------+------+-     -+------+------+-      -+------+------+
 | |
|      |      |      |  ...  |      |      |   ...  |      |      |
 | |
|      +------+------+-     -+------+------+-      -+------+------+
 | |
| 
 | |
|                            VF(n) BAR space + extra
 | |
| 
 | |
|      0      1                     total_VFs - 1                255
 | |
|      +------+------+-     -+------+------+-      -+------+------+
 | |
|      |      |      |  ...  |      |      |   ...  |      |      |
 | |
|      +------+------+-     -+------+------+-      -+------+------+
 | |
| 
 | |
| 			   M64 window
 | |
| 
 | |
| 		Figure 1.1 Map VF(n) BAR space + extra
 | |
| 
 | |
|   Allocating the extra space ensures that the entire M64 window will be
 | |
|   assigned to this one SR-IOV device and none of the space will be
 | |
|   available for other devices.  Note that this only expands the space
 | |
|   reserved in software; there are still only total_VFs VFs, and they only
 | |
|   respond to segments [0, total_VFs - 1].  There's nothing in hardware that
 | |
|   responds to segments [total_VFs, 255].
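
The cost of the expanded reservation is easy to quantify.  The helper below is a hypothetical sketch that compares the space the VFs actually need with the space reserved for a full 256-segment window.

```python
MB = 1 << 20
SEGMENTS = 256


def iov_reservation(vf_bar_size: int, total_vfs: int) -> dict:
    """Compare needed and reserved space for one VF BAR."""
    if not 0 < total_vfs <= SEGMENTS:
        raise ValueError("total_VFs must be between 1 and 256")
    needed = total_vfs * vf_bar_size
    reserved = SEGMENTS * vf_bar_size
    return {
        "needed": needed,              # VF(n) BAR space proper
        "reserved": reserved,          # whole M64 window
        "extra": reserved - needed,    # reserved, but nothing responds
        "idle_segments": range(total_vfs, SEGMENTS),
    }
```

For 16 VFs with a 1MB VF BAR, this gives 16MB needed, 256MB reserved, and 240 idle segments.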
 | |
| 
 | |
| 4. Implications for the Generic PCI Code
 | |
| ========================================
 | |
| 
 | |
| The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
 | |
| aligned to the size of an individual VF BAR.
 | |
| 
 | |
| In IODA2, the MMIO address determines the PE#.  If the address is in an M32
 | |
| window, we can set the PE# by updating the table that translates segments
 | |
| to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
 | |
| set the PE# for the window.  But if it's in a segmented M64 window, the
 | |
| segment number is the PE#.
 | |
| 
 | |
| Therefore, the only way to control the PE# for a VF is to change the base
 | |
| of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
 | |
| amount of space required for the VF(n) BAR space, the VF BAR value is fixed
 | |
| and cannot be changed.
 | |
| 
 | |
| On the other hand, if the PCI core allocates additional space, the VF BAR
 | |
| value can be changed as long as the entire VF(n) BAR space remains inside
 | |
| the space allocated by the core.
 | |
| 
 | |
| Ideally the segment size will be the same as an individual VF BAR size.
 | |
| Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
 | |
| are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
 | |
| allocate 256 segments, the PE# of VF0 can be any value from 0 to
 | |
| (256 - numVFs), i.e., the base can be shifted by up to (256 - numVFs) segments.
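
Choosing the PE# by shifting the VF BAR base can be sketched as follows, assuming the segment size equals the VF BAR size and 256 segments have been reserved.  The function names are hypothetical.

```python
MB = 1 << 20
SEGMENTS = 256


def vf_bar_base_for_pe(window_base: int, vf_bar_size: int,
                       first_pe: int, num_vfs: int) -> int:
    """VF BAR value that places VF0 in PE(first_pe)."""
    # The largest shift keeps the last VF inside segment 255.
    if not 0 <= first_pe <= SEGMENTS - num_vfs:
        raise ValueError("VF(n) BAR space would leave the reserved window")
    return window_base + first_pe * vf_bar_size


def pe_of_vf(first_pe: int, n: int) -> int:
    """PE#s are contiguous: VF(n) lands in PE(first_pe + n)."""
    return first_pe + n
```

Since the shift is always a multiple of the VF BAR size, the alignment required by the SR-IOV spec is preserved.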
 | |
| 
 | |
| If the segment size is smaller than the VF BAR size, it will take several
 | |
| segments to cover a VF BAR, and a VF will be in several PEs.  This is
 | |
| possible, but the isolation isn't as good, and it reduces the number of PE#
 | |
| choices because instead of consuming only numVFs segments, the VF(n) BAR
 | |
| space will consume (numVFs * k) segments, where k is the number of segments
 | |
| per VF BAR.  That means there aren't as many available segments for
 | |
| adjusting the base of the VF(n) BAR space.
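
The effect of a segment size smaller than the VF BAR size can be worked through numerically.  In this hypothetical sketch, k denotes the number of segments covered by one VF BAR.

```python
MB = 1 << 20
SEGMENTS = 256


def segments_per_vf(vf_bar_size: int, segment_size: int) -> int:
    """Number of segments (and therefore PEs) covered by one VF BAR."""
    if vf_bar_size % segment_size:
        raise ValueError("VF BAR size must be a multiple of the segment size")
    return vf_bar_size // segment_size


def pes_of_vf(first_segment: int, n: int, k: int) -> range:
    """PE#s occupied by VF(n) when each VF spans k segments."""
    return range(first_segment + n * k, first_segment + (n + 1) * k)


def max_shift(num_vfs: int, k: int) -> int:
    """Segments left over for moving the base of the VF(n) BAR space."""
    return SEGMENTS - num_vfs * k
```

With 4MB VF BARs and 1MB segments, 16 VFs consume 64 segments, leaving 192 segments of slack instead of 240.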
 |