In 2013, even more key CPU features will make their way into mainstream CPUs. At TransGaming, we believe that this convergence will continue to the point where typical systems have only one type of processing unit, with large numbers of cores and very wide vector execution units available for high-performance parallel execution. In this kind of environment, all graphics processing will ultimately take place in software. Through our SwiftShader software GPU toolkit, TransGaming has long been a world leader in software-based rendering, with widespread adoption of our technology by some of the world's top technology companies, and a US patent on key techniques issued in late 2012. This whitepaper explores the past, present, and future of software rendering and why TransGaming expects that the technology behind SwiftShader will be a critical component of future graphics systems.

Copyright © 2013 TransGaming Inc. All rights reserved.

SwiftShader Today

In 2005, TransGaming launched SwiftShader, for the first time providing a software-only implementation of a commonly used graphics API, including shader support, at performance levels fast enough for real-time interactive use.
Since then, SwiftShader has found an important niche in the graphics market as a fallback solution – ensuring that even in cases where available hardware or graphics drivers are inadequate, out of date, or unstable, our customers' software will still run.

SwiftShader: Why the Future of 3D Graphics is in Software

This fallback case is a critical one for software that needs to run no matter what system an end user has in place. TransGaming has licensed SwiftShader to companies such as Adobe for use with Flash as a fallback for the Stage3D API, and to Google to implement the WebGL API within Chrome and Native Client.
Beyond this, SwiftShader has found customers in markets as diverse as medical imaging and the defense industry. All of these customers require a solution that will put the right pixels on the screen 100% of the time.

Another important area where SwiftShader is being used today is in cloud computing and virtualization systems. Servers in data centers that include GPU capabilities are currently substantially more expensive than normal servers. Using SwiftShader and software rendering thus allows substantial savings and flexibility for developers with server-oriented applications that require some degree of graphics capability.

A key part of the reason that SwiftShader is useful as a fallback option in situations where a hardware GPU is not available or not reliable is that it is capable of achieving performance that approaches that of dedicated hardware. With a 2010-era quad-core CPU, SwiftShader scores 620 points in the popular 3DMark06 DirectX 9 benchmark; this is higher than the scores of many previous-generation integrated GPUs.

Figure 1: SwiftShader running 3DMark06

Software Rendering Future Advantages

While today's software rendering results are good enough for some applications, current-generation integrated GPUs still have a substantial performance advantage. Why then does TransGaming believe that software rendering will have a more important role in the future, beyond a reliable fallback? The answers are straightforward.
As CPUs continue to increase their parallel processing performance, they become adequate for a wider range of graphics applications, thus saving the cost of additional unnecessary hardware. Hardware manufacturers can then focus resources on optimizing and improving hardware with a single architecture, and thus avoid the costs of melding separate CPU and GPU architectures in a system. As graphics drivers and APIs get more complex and diverse, the issues of driver correctness and stability become ever more important.
In today's world, software developers must test their applications on an almost infinite variety of different GPUs, drivers, and OS revisions. With a pure software approach, these problems all go away. There are no feature variations to worry about, other than performance, and developers can always ship applications with a fully stable graphics library, knowing that it will work as expected no matter what. Software rendering thus saves time and money for all participants in the platform and ecosystem during development, testing, and maintenance.
Beyond cost savings, software rendering has numerous additional advantages. For example, graphics algorithms that today use a combination of CPU and GPU processing must split the workload in a suboptimal way, and developers must deal with the complexity of handling different bottlenecks to ensure that each pipeline remains balanced. Software rendering also simplifies optimization and debugging by using a single architecture, allowing the use of well-established CPU-side profilers and debuggers.
A simpler, uniform memory model also liberates developers from having to deal with multiple memory pools and inconsistent data access characteristics, creating additional freedom for developers to explore new graphics algorithms. Most importantly, however, software rendering allows unlimited new capabilities to be used at any time. New graphics API releases can always be compatible with existing hardware, and developers can add new functionality at any layer of their graphics stack. The only limits become those of the developer's imagination. All of this, however, can only become true if software rendering can close the performance gap.
At TransGaming we believe that this is very achievable, and that upcoming hardware advances will prove this out. To understand why requires a deeper dive into the technical side of SwiftShader.

SwiftShader: The State of the Art in Software Rendering

This section highlights some of the key technologies that differentiate SwiftShader from other renderers, and illustrates how the challenges posed by software rendering can be overcome. One of the seemingly major advantages of dedicated 3D graphics hardware is that it can switch between different operations at no significant cost.
This is particularly relevant to real-time 3D graphics because all of the graphics pipeline stages depend on a certain 'state' that determines which calculations are performed. For instance, take the alpha blending stage used to combine pixels from a new graphics operation with previously drawn pixels. This stage can use several different blending functions, each of which takes various input arguments. Handling this kind of work is a challenge for traditional software approaches that use conditional statements to perform different operations.
The resulting CPU code ends up containing more test and branch instructions than arithmetic instructions, resulting in slower performance compared to code that has been specialized for just a single combination of states, or hardware with separate logic for each blending function. A naive software solution that includes pre-built specialized routines for every combination of blending states is not feasible, because combinatorial explosion would result in excessive binary code size. The practical solution that SwiftShader uses for this type of problem is to compile only the routines for the state combinations that are needed at run time.
In other words, SwiftShader waits for the application to issue a drawing command and then compiles specialized routines which perform just those operations required by the states that are active at the time the drawing command is issued. The generated routines are then cached to avoid redundant recompilation. The end result is that SwiftShader can support all the graphics operations supported by traditional GPUs, with no render-state-dependent branching code in the processing routines. This elimination of branching code also comes with secondary benefits, such as much improved register reuse.
This technique of dynamic code generation with specialization has proven to be invaluable in making software rendering a viable choice for many applications today, and naturally extends to the run-time compilation of current types of programmable shaders. Most importantly, it opens up huge opportunities for future techniques.

In addition to using dynamic code generation, SwiftShader also achieves some of its performance through the use of CPU SIMD instructions. SwiftShader pioneered the implementation of Shader Model 3.0 software rendering by using SIMD instructions to process multiple elements such as pixels and vertices in parallel. By contrast, the classic way of using these instructions is to execute vector operations for only a single element. For example, other software renderers might implement a 3-component dot product using the following sequence of Intel x86 SSE instructions:

    mulps   xmm0, xmm1
    movhlps xmm1, xmm0
    addss   xmm1, xmm0
    shufps  xmm0, xmm0, 1
    addss   xmm1, xmm0

Note that this sequence requires five instructions to compute a single 3-component dot product – a common operation in 3D lighting calculations. This is no faster than using scalar instructions, and thus many legacy software renderers did not obtain an appreciable speedup from the use of vector instructions. SwiftShader instead uses them to compute multiple dot products in parallel:

    mulps   xmm0, xmm3
    mulps   xmm1, xmm4
    mulps   xmm2, xmm5
    addps   xmm0, xmm1
    addps   xmm0, xmm2

The number of instructions is the same, but this sequence computes four dot products at once. Each vector register component contains a scalar variable (which itself can be a logical vector component) from one of four pixels or vertices. Although this is straightforward for an operation like a dot product, the challenge that TransGaming has solved with SwiftShader is to efficiently transform all data into and out of this format, while still supporting branch operations.

Earlier versions of SwiftShader made use of our in-house developed dynamic code generator, which used a direct x86 assembly representation of the code to be generated. This offered excellent low-level control, but at the cost of the burden of dealing with different sets of SIMD extensions, and of determining code dependencies based on complex state interactions. We've since taken things to the next level by abstracting the operations into a high-level shader-like language, which integrates directly into C++. This layer, which we call Reactor, outputs an intermediate representation that can then be optimized and translated into binary code using a full compiler back-end. We chose to use the well-known LLVM framework due to its excellent support of SIMD instructions and straightforward use for run-time code generation. The combination of Reactor and LLVM forms a versatile tool for all dynamic code generation needs, exploiting the power of SIMD instructions while abstracting the complexities. A simple example of how Reactor is used in the implementation of the cross product shader instruction illustrates this well:

    void Shader::crs(Vector4 &dst, Vector4 &src0, Vector4 &src1)
    {
        dst.x = src0.y * src1.z - src0.z * src1.y;
        dst.y = src0.z * src1.x - src0.x * src1.z;
        dst.z = src0.x * src1.y - src0.y * src1.x;
    }

This looks exactly like the calculation to perform a cross product in C++, but the magic is in the use of Reactor's C++ template system.
The Reactor Vector data type is defined with overloaded arithmetic operators that generate the required instructions for SIMD processing in the output code. In addition to eliminating branches and making effective use of SIMD instructions, SwiftShader also achieves substantial speedups through the use of multi-core processing. While this may at first seem obvious and relatively straightforward, it poses both challenges and opportunities. Graphics workloads can be split into concurrently executable tasks in many different ways. One can choose between sort-first or sort-last rendering, or a hybrid approach.
Each task can also be divided into more tasks through data parallelism, function parallelism, and/or instruction parallelism. Subdividing tasks and scheduling them onto cores/threads comes with some overhead, and they are typically fixed processes once a specific approach is chosen. TransGaming has identified opportunities to minimize the overhead, and in some cases even exceed the theoretical speedup of using multiple cores, by combining dynamic code generation with the choice of subdivision/scheduling policy.
Information about the processing routines, obtained during run-time code generation, can be used during task subdivision and scheduling, while information about the subdivision/scheduling can be used during the run-time code generation. We believe this advantage to be unique to software rendering, because only CPU cores are versatile enough to do dynamic code generation, intelligent task subdivision/scheduling, and high-throughput data processing. Further information about these techniques can be found in TransGaming's patent filing, Patent #8,284,206: General purpose software parallel task engine.
While the patent was granted in late 2012, the original provisional patent was filed in early 2006, well before other modern software rendering efforts such as Intel's Larrabee became public.

Convergence and Trends

The previous sections of this whitepaper show some of the substantial advantages of software rendering, and demonstrate that the technology to use the full computing power of a modern CPU as efficiently as possible is already here. In order to fully understand the coming convergence between CPU and GPU, however, we must also consider the evolution of the GPU side of the equation.
Firstly, we must understand what makes modern GPUs exceptionally fast parallel computation engines, and what the limits of growth on the approaches used to provide that speed may be. Modern GPUs have two critical features that enable the majority of their performance: they provide a large number of heavily pipelined parallel computation units, and they drive many execution threads through these units simultaneously. This allows them to hide the long latencies that frequently occur when executing operations such as texture fetching.
While one thread is waiting for a texture fetch result, another thread occupies the computation units. Context switches are therefore designed to be very efficient on a GPU. Keeping many threads active simultaneously requires a large number of registers to be available. The more registers a given instruction sequence requires, the fewer threads can be run simultaneously. The lowest organizational level of computation on a GPU is known as a 'warp' on NVIDIA GPUs, and a 'wavefront' on AMD GPUs. This is similar to the SIMD width in a CPU vector unit.
Current generation GPU hardware typically uses 1024-bit or 512-bit wide SIMD units, compared to the 128-bit or 256-bit wide SIMD units used by current generation CPUs. The wide SIMD approach also has some important limitations. One is that control statements within a given instruction sequence cause divergence, which requires evaluating multiple code paths. With a wider SIMD width, this divergence becomes more common, eliminating some of the execution parallelism. Another limiting factor for graphics processing is that pixels are processed in rectangular tiles, so rendering triangles regularly results in leaving some lanes unused.
Another limitation is the number of registers available. Larger register files lower computational density, so GPU manufacturers must balance that against stalls caused by running out of storage for covering RAM access latencies. By contrast, CPUs are optimized for low-latency operation. On a CPU core, a significant amount of logic is devoted to scheduling that allows many functional units to be used simultaneously through out-of-order execution. Branch prediction units and shorter SIMD widths reduce the penalties for branch-heavy code, and more die space is devoted to caches and memory-management functionality.
CPUs typically support running at significantly higher clock frequencies as well. CPUs are now evolving to support increased parallelism at the SIMD width level, as well as with additional execution units available to simultaneous threads, and larger numbers of CPU cores per die. Intel's Haswell chips, available later this year, will include three 256-bit wide SIMD units per core, two of which are capable of a fused multiply-add operation. This arrangement will process up to 32 floating-point operations per cycle; with four cores on a mid-range version of this architecture, this provides about 450 raw GFLOPS at 3.5 GHz. Intel's AVX instruction set offers room to increase the SIMD width to 1024 bits, which would put the raw CPU GFLOPS at similar levels to the highest-end GPUs currently available. At the same time, GPUs are becoming more and more like CPUs, adding more advanced memory-management features such as virtual memory and the corresponding MMU complexity that is required. GPU instruction scheduling is becoming more complex as well, with out-of-order features such as register scoreboarding, and ILP extraction features such as superscalar execution.
Furthermore, GPU-vendor sponsored research suggests that running fewer threads simultaneously might lead to better performance in many cases.

Die-level Integration and Bandwidth

One of the trends that displays the clearest indications of convergence between CPUs and GPUs is the increasing frequency of die-level integration of current-generation differentiated CPU and GPU units. This trend has become more and more important with the rise of mobile devices, which require both graphics and CPU performance in a single low-power chip.