GPU Tech: Fill Rate
Ever wondered how the past shapes our present and future? My friend once told me he had to dive deep into the history of transportation engineering to truly grasp its current impact on our climate. His journey led him to write his first book. If you feel the same curiosity about modern computing technology, then you're in the right place! This month, our newsletter details how blitters evolved into more powerful graphics accelerators during the 1990s.
In 1995, I purchased a Sun SPARCstation 10 on behalf of my employer, the Center for Nonlinear Studies at Los Alamos National Laboratory. This workstation had a high-resolution (1280×1024) cathode ray tube display with 8-bit indexed color. In First Light we saw how the technology evolved from sprites, a kind of rudimentary masking performed in real time by the video graphics controller, to bitblit, which accelerates writes into the framebuffer but is limited to simple logical operations on rectangular areas.
Going through the preinstalled demos, I learned that the CG6 graphics coprocessor in this workstation could handle drawing flat shaded triangles and quads. A bit earlier, I had also witnessed the success of the very first real-time 3D games in the 1980s and the early 1990s. These games worked without any kind of acceleration and their immediate success demonstrated the existence of a price-sensitive gaming market for 3D graphics accelerators.
The No-GPU Era
For the last 20 years, gamers have competed over who has the fastest and best GPU. But gaming GPUs did not exist until 1996. A few early video games, such as Elite, Wing Commander, and Wolfenstein 3D, provided groundbreaking solutions for display functions at a time when graphics accelerators and GPUs were not yet available. These games employed various techniques to enhance the graphical experience and set the stage for the development of more advanced graphics technologies. Their authors analyzed, and cleverly optimized, the most compute-intensive operations involved in real-time 3D rendering.
Elite and Wing Commander are space combat games, and the reason is simple: space is mostly empty. There is very little content to show, and most of the screen pixels remain black. The first version of Elite hit the shelves in 1984. It was one of the very first open-world games, with a storyline but also the freedom to carry out other missions in space across several galaxies.
Low Poly Games
A polygon is defined by a list of loop-connected edges situated in the same plane. Modern GPU technology can render scenes where each polygon covers only a few pixels. Until the late 1990s, the main processor had to perform all the computations related to polygons, edges and vertices. The 3D accelerators took over to render the pixels. Limited by the processing power of the CPU, the game scenes were therefore designed with simplified shapes made from a small number of polygons. The direct consequence was an increase in the number of pixels covered by each polygon in the critical case of a close-up encounter.
Without the acceleration provided by a GPU, the first versions of Elite could not afford to shade the surfaces, because the microprocessor was almost fully loaded by the game logic, vertex processing and edge drawing. The background starfield shone through the spaceships, and occlusion of one ship by another wasn't handled either.
In addition to a low polygon count, the ships had convex hulls, preventing self occlusion: in other terms, the edges of a front-facing polygon cannot be hidden by another front facing polygon of the same ship. This property makes it easy to decide which edges must be drawn: an edge must be drawn only if it touches a visible face, and a face is visible if its normal faces toward the camera. The sign of the Z coordinate of the normals projected in camera coordinates is the only information needed, and it is easily computed with three multiplications and two additions. The property also guarantees that a face is either entirely in view, or entirely hidden.
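As a sketch (not Elite's actual code), the visibility test described above could look like this in Python, assuming rot_z holds the third row of the ship-to-camera rotation matrix and normal is the face normal expressed in the ship's own coordinates:

# Hypothetical back-face test for a convex ship.
def face_is_visible(normal, rot_z):
    # Z component of the normal in camera coordinates:
    # three multiplications and two additions, as noted above.
    nz = normal[0] * rot_z[0] + normal[1] * rot_z[1] + normal[2] * rot_z[2]
    return nz < 0  # sign convention: the camera looks down the +Z axis

# An edge is drawn only if it touches at least one visible face.
def edge_is_drawn(normal_a, normal_b, rot_z):
    return face_is_visible(normal_a, rot_z) or face_is_visible(normal_b, rot_z)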
Early versions of Microsoft Flight Simulator running on the Apple II and on the original IBM PC (in 4-color CGA graphics) showed similarly empty sceneries: the horizon separated a blue sky from a green earth, and everything else was wireframe. The PC version added a few geometric magenta mountains and black-surfaced runways.
Wireframe rendering was also the norm for mechanical computer-aided design software running on large mainframes in the 1980s, but these systems could of course handle more complex geometries.
Low poly games remained the norm well into the 21st century, but the meaning of "low poly" evolved with the increasing performance of GPUs. In 2022, Epic Games claimed that its game engine, Unreal Engine 5, could use cinema-quality assets without any simplification.
Wireframe rendering was a good solution when shading the faces was beyond the capabilities of the computers.
No Poly Games: A Better Solution?
If processing vertices is so computationally expensive, why not avoid it completely?
In 1991, another space combat game did just that: Wing Commander. While sophisticated shading and lighting was still out of reach without gaming GPUs, the users had become accustomed to colorful games, and there was a demand for new 3D first person games with more immersive graphics.
We will not describe the solution in great detail because it departs radically from modern approaches. In the image above, you may notice that the enemy ship is banking to the left relative to its pursuer, but is otherwise seen from directly behind and slightly above.
The game could count on much more memory than Elite, and used it to store tens of views of each ship from various angles. Instead of projecting the vertices of meshes, the game engine only projected the coordinate system of each ship. This operation is essentially free once the projection matrix has been calculated. From the result, it extracted the perspective-correct screen location, the apparent size based on the distance, and the relative orientation of each visible ship.
The game engine used mipmaps: collections of textures representing the same spaceship in the same relative orientation at various resolutions. As you can see in the example below, the highest resolution texture alone uses most of the space. In fact, removing the biggest texture divides the memory requirement by four.
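The factor of four comes from a simple geometric series: each mipmap level has half the width and half the height of the previous one, so it occupies a quarter of its memory. Counting the largest texture as one unit, the whole chain takes 1 + 1/4 + 1/16 + 1/64 + ... = 4/3 units, and dropping the largest level leaves 1/4 + 1/16 + ... = 1/3 unit, exactly a quarter of the total.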
All the possible relative orientations of the other ships were pre-rendered. Considering that, most of the time, spaceships fly quite far from each other, a parallel projection can be substituted for the true perspective projection. Given the low resolution, the difference cannot be noticed. The rendered images are then properly filtered down, producing the best possible rendering of each ship, under each relative orientation, using a given number of pixels. It is clear from the above illustration that a mipmap contains a lot of small thumbnails.
During the game, the apparent size was used to select the nearest size of the correct orientation. Rolling the player's ship did not result in a different selection, because the final bank rotation was not precomputed. The game used a fast bitmap rotation algorithm to align the chosen bitmap with the projected coordinate system, and it also scaled and stretched each bitmap slightly to get it to exactly the right size.
Wing Commander relied on a technology more advanced than sprites. Given its capability to scale and rotate bitmaps dynamically, it is an early example of texture mapping in two dimensions (only scaling, stretching and rotating bitmaps). All these operations can be performed using the well-known Bresenham algorithm for the rasterization of lines. The algorithm runs quickly on old CPUs, because it is simple and only requires integers. For example, it can be used to compute the oblique lines used in the Three-Shear Rotation algorithm to offset rows and columns of pixels. An oblique line close to a 45° angle can also represent the relation between old and new coordinate of pixels, before and after a slight stretch or shrink along an axis. The algorithm easily computes the rows of pixels, which must be duplicated or skipped.
After priming the model with the classic line algorithm, and nine attempts, here is a GPT-4-generated Python version of the Bresenham line algorithm, adapted to stretch or shrink bitmaps along the X axis:
def stretch(self, bitmap2, new_width):
    src_x = 0
    error = self.width - new_width
    # Set the width of bitmap2 to the new width
    bitmap2.width = new_width
    # Initialize the increments accumulated by the error term
    error_increment = self.width * 2
    src_x_increment = new_width * 2
    for dest_x in range(new_width):
        while error > 0:
            src_x += 1
            error -= src_x_increment
        # Clamp guards against reading one column past the end when stretching
        self.copy_column(min(src_x, self.width - 1), dest_x)
        error += error_increment
The Bresenham algorithm is a generic interpolation method. In the above example, it interpolates the pixel column indices of one bitmap synchronously with those of another, picking the same number of roughly equally spaced columns in both. The initialization of error is meant to ensure that, in a shrink, the chosen columns are not biased toward one side or the other.
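To see the effect of this error scheme, here is a small, hypothetical test function that applies the same stepping logic to bare column indices (it mirrors the loop above rather than reusing the class method):

# Hypothetical demonstration of the Bresenham-style column selection above.
def select_columns(src_width, new_width):
    chosen = []
    src_x = 0
    error = src_width - new_width
    error_increment = src_width * 2
    src_x_increment = new_width * 2
    for _dest_x in range(new_width):
        while error > 0:
            src_x += 1
            error -= src_x_increment
        chosen.append(min(src_x, src_width - 1))
        error += error_increment
    return chosen

print(select_columns(8, 5))   # shrink: [1, 2, 4, 6, 7], three source columns are skipped
print(select_columns(5, 8))   # stretch: [0, 1, 2, 2, 3, 3, 4, 4], columns are duplicated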
This useful algorithm is easy to implement in hardware as well. Interpolation was a key compute task in early GPUs.
The Wing Commander custom solution worked well, but it also had drawbacks.
Software Texture Mapping
To conclude the "no-GPU era", we will examine Wolfenstein 3D, one of the first real-time games to implement perspective-correct textures in a 3D game engine. Before diving into that, let's explore texture coordinates and affine texture mapping.
Thus far, we have associated a position with each vertex, which was sufficient for black and white wireframe rendering in Elite. However, more advanced graphics pipelines require additional information. In this issue, we will focus on texture mapping and fill rate rather than discussing color and color models.
Texture maps are auxiliary images, and the vertices of a textured face carry texture coordinates. For a triangular face, there is a one-to-one correspondence with another triangle in the texture map. Software typically uses normalized coordinates for textures, as the actual bitmap dimensions can be chosen dynamically during rendering when mipmaps are employed. Blender uses the range [0, 1], but other software may use different ranges. The normalized texture coordinates are denoted U and V for the X and Y dimensions of the texture map, respectively.
The spaceship model created in Blender is derived from a cube that has been split by an edge loop and slightly stretched. The vertices have inherited the coordinates from the original cube, therefore the default texture map shows the vertices laid out like an unfolded cube: the faces cover the H4 to A5 and the F2 to E7 rectangles. In the illustration, the selected vertex, in orange, has coordinates U=0.5, V=0.5. This vertex is one of the lower corners of the front top side of the spaceship. It should be evident from the map that some vertices have more than one set of U, V coordinates.
A vertex can have different texture coordinates for each face it belongs to. The selected vertex is not such a case, since it appears only at the (0.5, 0.5) location in the texture map. The unfolding of the cube duplicates many vertices and edges. Vertices join an arbitrary number of faces and can therefore have an arbitrary number of texture map coordinates associated with them. Edges are shared by only two faces and can therefore be duplicated at most once in the texture map.
On the low poly spaceship, the faces C5 to H5 form a continuously textured strip. The top side F6, F7, E6, E7 also joins continuously, but there is a gap between this top side and C5 because the edge is duplicated in the texture map.
The situation is frequent with low poly assets. With modern technology supporting highly detailed assets, a large mesh often maps continuously to a texture map and most vertices have only one set of texture coordinates. It is sometimes more convenient to duplicate the vertices located at the junction of several meshes.
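As a small illustration, using a hypothetical OBJ-like layout rather than Blender's internal format, positions and texture coordinates can be indexed separately, which is how one position ends up with several U, V pairs:

# Hypothetical mesh fragment: positions and U, V pairs are stored in separate
# lists, and each face corner references one index in each list.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
uvs = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (1.0, 0.0), (1.0, 0.5)]

# The two triangles share the edge between positions 0 and 2, but those shared
# vertices use different uv indices in each face, so the edge is duplicated in
# the texture map, like the seams of the unfolded cube.
face_a = [(0, 0), (1, 1), (2, 2)]
face_b = [(0, 3), (2, 4), (3, 2)]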
Affine Texture Mapping
We have seen earlier that pixel coordinates could be interpolated linearly to achieve a stretching or a shrinking effect, or even rotations. Linear interpolation in screen space (pixel coordinates) is a two dimensional linear algebra operation. The texture can be stretched, shrunk, mirrored, rotated (in two dimensions) or any combination of all that, but it never renders correctly with perspective. In fact, the projection of the texture 2D base vectors — those three points O, I, J of coordinates (0,0), (1,0) and (0,1) — defines entirely how the texture is mapped, and the mapping is a 2D affine transformation, thus the name "Affine Texture Mapping".
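A minimal sketch of that idea: given the screen-space projections of the three texture points O, I and J, the whole mapping reduces to two multiply-adds per axis.

# Minimal sketch of affine texture mapping: o, i, j are the projected screen
# positions of the texture-space points (0,0), (1,0) and (0,1).
def affine_map(u, v, o, i, j):
    x = o[0] + u * (i[0] - o[0]) + v * (j[0] - o[0])
    y = o[1] + u * (i[1] - o[1]) + v * (j[1] - o[1])
    return x, y

# Example: where the center of the texture lands on screen.
print(affine_map(0.5, 0.5, o=(100, 200), i=(180, 210), j=(90, 140)))   # (135.0, 175.0)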
In the example below, we have arbitrarily chosen to map the triangles of a triangle strip to a 5×15 checkerboard pattern. The base vectors OI and OJ are respectively aligned with the vanishing lines (the sides) and the horizontal lines. The diagonals therefore have no influence on the layout of the texture.
While all the patterns connect, the vanishing lines in the checkerboard pattern itself are zigzagging. The lack of perspective within each triangle is also obvious. But looking at the image from some distance, there is a feeling of depth since the more distant parts appear smaller and their texture scales down. In fact, some perspective-correct texture mapping might not be strictly required when the face covers only a few pixels.
The texture is not even correct along the reference edges of a triangle. Where the depth is constant, along the horizontal lines, there is no issue. But the successive squares along the depth direction all have the same size, and that is not the correct result.
In affine texture mapping, the U, V texture coordinates are simply interpolated linearly in screen space. The interpolation can be performed using the simple Bresenham algorithm shown above. It is also worth noting that for a triangle, the swept area of the texture map is correct. The problem is just that colors end up at the wrong location.
Perspective-correct Texture Mapping
Linear interpolation is not valid in screen space; however, it is valid in the camera axes since the transforms used to compute these coordinates involve 4x4 linear matrix multiplications. To address the issue of depth in affine texture mapping, a set of texture coordinates incorporating depth can be defined. Depth represents the distance of a vertex from the camera plane, and the result remains unchanged if the depth is normalized to 1 at the farthest vertex. The depth ratios for other vertices are the crucial parameters.
It is also important to note that linear interpolation is accurate when all the pixels processed are at the same depth. This property was exploited by Wolfenstein 3D to solve the problem without relying on a GPU.
In Wolfenstein 3D, the scene consists of textured vertical walls, hostile dogs, soldiers, and various objects such as small tables, columns, and plants. The floor and ceiling have a uniform color, and since the player cannot look up or down, the horizon is always at the same height on the screen.
Character animation is two-dimensional. Characters and objects can be partially obscured by walls. Soldiers may turn around when surprised, and they have only one shooting position facing the player. The animation is a sequence of 2D bitmaps and is not highly detailed.
All other objects always face the player, as the game uses the same bitmap for each view angle. When moving sideways, objects seem to rotate with the player's movement. The bodies of defeated enemies behave similarly. The texture mapping quality demonstrates the use of properly scaled mipmaps, and the texture resolution is reasonable given the screen resolution.
Gameplay restrictions ensure that a vertical column of pixels defining a wall remains at a constant distance from the camera. The game engine renders walls column by column, linearly interpolating to stretch or shrink the corresponding texture column. This approach, based on ray-casting, avoids complex 3D projections. Casting rays to find walls laid out on a grid nevertheless bears a resemblance to yet another variant of the Bresenham line algorithm.
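Here is a rough sketch (not id Software's actual code) of the column-by-column idea: since every pixel of a wall slice sits at the same depth, a single scale factor per screen column is enough.

# Hypothetical column renderer in the spirit of Wolfenstein 3D: one wall slice
# per screen column, scaled according to its distance from the camera plane.
def draw_wall_column(screen, screen_x, texture_column, distance,
                     screen_height=200, wall_height=1.0, focal=160.0):
    # Apparent height of the wall slice in pixels; constant for the whole column.
    column_height = max(1, int(focal * wall_height / distance))
    top = (screen_height - column_height) // 2
    tex_size = len(texture_column)
    for row in range(column_height):
        y = top + row
        if 0 <= y < screen_height:
            # Plain linear interpolation of the texture row is correct here
            # because the depth does not change along the column.
            tex_y = row * tex_size // column_height
            screen[y][screen_x] = texture_column[tex_y]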
Why it Works
In a textured triangle, pixels farther from the camera represent a larger surface area of the triangle compared to pixels located closer to the camera. This occurs because each pixel on the screen represents light cones with the same solid angle, which expand as they move further away from the camera.
As a direct result, fewer texture samples are needed for distant pixels, which implies that texture coordinates cannot be linearly interpolated.
Let's imagine a triangle edge exactly 1 unit long. The u texture coordinate varies between 0 and 1 when traversing the triangle along this edge, from its reference point placed at coordinate a after projection in camera space. In this case the position of any point along the edge of the triangle is a + u. Its coordinate after perspective projection (screen space) is (a + u)/z = a/z + u/z. Since the first term is a constant for the triangle, the second term, u/z, counts the pixels on the screen and can therefore be linearly interpolated.
When we get a value for u/z using interpolation, we must multiply it by the local depth value z (itself obtained from the interpolated 1/z) in order to get the perspective-correct texture coordinate u. The calculation is exactly the same on the vertical axis, for v.
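In their standard form, with a parameter t varying linearly in screen space from vertex 1 to vertex 2, the perspective-correct interpolation formulas read:

u = ( (1 - t)·u1/z1 + t·u2/z2 ) / ( (1 - t)·1/z1 + t·1/z2 )
v = ( (1 - t)·v1/z1 + t·v2/z2 ) / ( (1 - t)·1/z1 + t·1/z2 )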
The above formulas are easy to understand. The numerator linearly interpolates u/z (and likewise v/z) between the values defined at vertices 1 and 2. The numerator can also be interpreted as a weighted sum, and the denominator computes the sum of the weights in order to properly normalize the result. Dividing by 1/z is equivalent to multiplying by z, therefore we get the corrected texture coordinates as a result.
In the special case when the depth is the same, z1 = z2, the above formulas simplify to a linear interpolation of u and v respectively, which is the affine case since u/z = u, v/z = v and 1/z = z = 1 for every vertex. Using this special case, Wolfenstein 3D manages to create true perspective-correct texture-mapped walls without having to project wall vertices into camera axes.
Hardware Implementation
Early graphics accelerators focused on operations repeated for each pixel, while leaving vertex-level operations to the main computer. Rasterizing a transformed triangle involves several distinct cases, but the GPU primarily uses Bresenham's line algorithm to descend along two edges, line by line on the screen. On the 3dfx Voodoo, all setup code had to be executed by the CPU. Nevertheless, hardware triangle setup quickly appeared in competing products, such as the 3DLabs Permedia 2 in 1997.
Triangle Setup
Hardware frequently necessitates the conversion of polygons into triangles through a process called triangulation or tessellation, depending on when it is executed. Early GPUs depended on the CPU to perform these calculations, as well as the perspective projection into screen space, which required two divisions per vertex.
Triangles on screen often needed to be divided into two wedges: an upward-pointing wedge combined with a downward-pointing wedge, sharing one common edge. These subdivisions standardized the data so that it could be processed by fixed-logic hardware operating in screen space. The coordinates associated with the newly added vertices were linearly interpolated in screen space.
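A sketch of this split in Python (a simplified model, not any particular driver's setup code), cutting the triangle at the scan line of its middle vertex:

# Hypothetical split of a screen-space triangle into two wedges sharing the
# scan line of the middle vertex. Each vertex is a dict of screen-space values
# (x, y, u, v, z, ...) that are interpolated linearly in screen space.
def split_triangle(v0, v1, v2):
    top, mid, bot = sorted((v0, v1, v2), key=lambda v: v["y"])
    if top["y"] == mid["y"] or mid["y"] == bot["y"]:
        return [(top, mid, bot)]            # already a single flat-edged wedge
    t = (mid["y"] - top["y"]) / (bot["y"] - top["y"])
    # New vertex on the long edge, at the same scan line as 'mid'.
    cut = {key: top[key] + t * (bot[key] - top[key]) for key in top}
    return [(top, mid, cut),                # wedge with a flat bottom edge
            (mid, cut, bot)]                # wedge with a flat top edge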
When a GPU accelerates the setup process, it analyses the provided screen-space coordinates to determine how the triangle must be split into wedges and which edge bounds each side of every scan line.
It then initializes several internal counters to linearly interpolate coordinates along the edges using logic akin to Bresenham's line algorithm. The coordinates include not only the horizontal position and the depth (for Z-buffer processing), but, as seen above, also u/z, v/z, 1/z and the R, G, B color coordinates, all linearly interpolated in screen space.
Wedge Rasterization
This stage generates and processes fragments, constrained by the fill rate and often forming the performance bottleneck in older GPUs. The outer loop linearly interpolates all relevant parameters along the edges, leading to the subsequent step of linearly interpolating parameters along a segment of a display scan line covered by the wedge. The horizontal position is no longer interpolated but simply incremented one pixel at a time from the start to the end location on screen. Simultaneously, depth, color, and texture coordinates are interpolated using a variant of Bresenham's line algorithm.
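Schematically, one scan-line segment of a wedge could be processed as follows (a floating-point model for readability; the hardware used fixed-point increments):

# Schematic inner loop: walk one scan line from the left edge to the right
# edge, stepping every interpolated parameter once per pixel.
def rasterize_span(y, left, right, write_fragment):
    # 'left' and 'right' hold x plus the parameters already interpolated along
    # the two edges for this scan line: z, u/z, v/z, 1/z, r, g, b.
    count = int(right["x"]) - int(left["x"])
    if count <= 0:
        return
    step = {k: (right[k] - left[k]) / count for k in left if k != "x"}
    frag = dict(left)
    for x in range(int(left["x"]), int(right["x"])):
        write_fragment(x, y, frag)
        for k in step:
            frag[k] += step[k]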
Due to limited color information, early GPUs could not achieve photorealistic rendering. On the 3dfx Voodoo, the texture color could be used independently or multiplied by the interpolated color to enable basic lighting effects, such as using the same texture in well-lit and darker rooms or creating shadow gradients.
The depth buffer and framebuffers shared the same memory area. The 64-bit wide interface could access two pixels concurrently (16-bit color and 16-bit depth each), but updating a pixel typically required one read and one write operation. Processing a pair of pixels therefore took two cycles at 50MHz, which gives this GPU its 50MPixels/s fill rate. The Voodoo2 increased the frequency to 90MHz and achieved a 90MPixels/s fill rate. Notably, the Voodoo was well balanced: processing power, texture fetches and framebuffer memory reached their limits at about the same point. The Voodoo2's 100MHz memory provided a modest memory bandwidth margin over its 90MHz ASICs.
The GeForce 256 ended 3dfx's supremacy in 3D accelerators, and NVIDIA acquired the company in 2000.
All 3D accelerators can perform a depth test using fragment data and pixel data from the framebuffer. This test compares the fragment's depth to the pixel's depth to determine visibility. Given the limited processing power of early 3D accelerators, accurate preselection of visible assets and surfaces was essential, with the depth test serving as a supplementary tool.
The Voodoo implemented chroma-keying as an alternative to a true alpha channel. A specific fragment color indicated transparency, enabling the use of sprites as texture maps, for example. When the texture unit returned that color with chroma-keying enabled, the fragment was discarded, and the GPU proceeded to the next pixel.
When the depth test and chroma key test were passed, the fragment was written back to the framebuffer. Without an alpha channel, the Voodoo could either replace the current framebuffer pixel or add to it (accumulator). In both cases, the depth buffer was updated with the new fragment's depth.
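Putting the fragment-level decisions together, a simplified software model of this stage (not the Voodoo's actual logic) could look like this:

# Simplified model of the per-fragment tests and write-back described above.
CHROMA_KEY = (255, 0, 255)   # hypothetical "transparent" color

def process_fragment(x, y, frag_color, frag_depth, color_buf, depth_buf,
                     chroma_key_enabled=True, accumulate=False):
    if chroma_key_enabled and frag_color == CHROMA_KEY:
        return                     # discarded, move on to the next pixel
    if frag_depth > depth_buf[y][x]:
        return                     # farther than the stored pixel (comparison is configurable)
    if accumulate:
        old = color_buf[y][x]
        frag_color = tuple(min(255, o + n) for o, n in zip(old, frag_color))
    color_buf[y][x] = frag_color
    depth_buf[y][x] = frag_depth   # the depth buffer is updated in both modes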
The accumulator mode was beneficial for multi-pass rendering and used in games. Multi-pass rendering could also employ minor camera movements to enable effects like depth-of-field, anti-aliasing, and motion blur.
Bilinear Filtering
In the 3dfx Voodoo architecture, the texture mapping unit (TMU) was a separate chip. The onboard memory was divided between framebuffer and texture memory. These memory areas were entirely distinct, and during operation, the TMU leveraged a dedicated 64-bit channel to fetch four texels simultaneously from texture memory. The TMU received linearly interpolated values and error terms from the other chip, executed two divisions per pixel to compute the final texture coordinates, and then retrieved the four nearest texels to the actual projection point within texture space. The fractional part of the texture coordinates was converted into weights for a weighted average of the four texel colors. This operation, which seamlessly blends the colors of four adjacent texels, is known as bilinear filtering.
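In software, the weighted average performed by the TMU can be sketched as follows (assuming a simple 2D array of texel colors, and ignoring wrapping modes and mipmap selection):

# Minimal software model of bilinear filtering: blend the four texels
# surrounding the sample point, weighted by the fractional coordinates.
def bilinear_sample(texture, u, v):
    # 'texture' is a list of rows of (r, g, b) texels; u, v are in texel units.
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0
    x1 = min(x0 + 1, len(texture[0]) - 1)
    y1 = min(y0 + 1, len(texture) - 1)
    def lerp(a, b, t):
        return tuple(ca + (cb - ca) * t for ca, cb in zip(a, b))
    top = lerp(texture[y0][x0], texture[y0][x1], fx)
    bottom = lerp(texture[y1][x0], texture[y1][x1], fx)
    return lerp(top, bottom, fy)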
It is important to note that bilinear filtering performs optimally when the projection of a screen pixel onto the texture is approximately the size of one texel and not excessively distorted by perspective. When an object is at a significant distance, the pixel corresponds to a larger area of the texture, making it virtually impossible to read and average all relevant information in real-time to render a single pixel. Consequently, bilinear filtering must be combined with mipmaps to achieve the best results.
When implemented in hardware, bilinear filtering can be executed efficiently using a concise pipeline and parallel processing units. In contrast, earlier games that relied on software-based texture operations merely selected the color of the nearest texel, resulting in a blocky image and increased texture aliasing. The nearest and bilinear filtering modes are standard features in OpenGL.
Conclusion
The evolution of graphics hardware, from early rasterization and texture mapping techniques to modern programmable shaders and ray-tracing capabilities, has significantly impacted the way we interact with and perceive digital content. As computing power has increased, so too has the complexity and realism of graphics rendering.
The advancements in GPU technology have not only revolutionized the graphics and gaming industries but have also opened up new possibilities for artificial intelligence and machine learning applications. With the advent of general-purpose GPU (GPGPU) computing, GPUs have become a crucial component in training and deploying AI models, as they provide massive parallelism, high computational throughput, and power efficiency.
Additionally, the advent of AI has had a profound impact on graphics processing and rendering. Techniques such as NVIDIA's Deep Learning Super Sampling (DLSS) use AI to upscale lower resolution images to higher resolutions, providing improved image quality with reduced computational resources. While real-time global illumination itself is not inherently AI-driven, AI can be used to enhance or optimize global illumination techniques by approximating complex lighting interactions and accelerating the process of generating realistic lighting in real-time. This fusion of AI and graphics technology has further pushed the boundaries of what is possible in terms of visual quality and performance, opening up new possibilities in fields such as gaming, simulation, and immersive experiences.
The symbiotic relationship between graphics hardware and AI will continue to drive innovations in both domains, pushing the boundaries of what is possible in digital experiences and opening up new opportunities for research and applications across various fields.