Towards democratized integrated circuit design and personalized computing

Integrated circuit (IC) design is often considered a “dark art”, reserved only for those with advanced degrees or years of training in electrical engineering. As the semiconductor industry struggles to expand its workforce, IC design needs to be made more accessible.

The advantage of personalized computing
General-purpose computers are widely used, but the improvement in their performance has slowed considerably, from more than 50% per year in the 1990s to only a few percent in recent years, because of the difficulty of scaling within limits on power, heat dissipation, space, and cost.

Instead, the research community and industry have turned to custom computing for better performance, tailoring architectures to the workloads of specific application domains. A good example is the Tensor Processing Unit (TPU) announced by Google in 2017 to accelerate deep learning workloads. Designed in 28nm CMOS technology as an application-specific integrated circuit (ASIC), the TPU demonstrated an almost 200x advantage in performance per watt over the general-purpose Haswell central processing unit (CPU), a leading server-class processor at the time of publication. These custom accelerators, or domain-specific accelerators (DSAs), achieve efficiency through custom data types and operations, custom memory accesses, massive parallelism, and dramatically reduced instruction and control overhead.

However, this customization has a high cost (approaching $300 million at 7nm, according to McKinsey) that the masses cannot afford. Field-programmable gate arrays (FPGAs) offer an attractive and cost-effective alternative for DSA implementation. With its programmable logic, programmable interconnects, and customizable building blocks (such as block random-access memory (BRAM) and digital signal processing (DSP) blocks), an FPGA can be customized to implement a DSA without going through a lengthy manufacturing process, and it can be reconfigured for a new DSA in seconds. Additionally, FPGAs have become available in public clouds, such as Amazon AWS F1 and Nimbix. One can create a DSA on an FPGA in these clouds and use it at a rate of $1-2/hour to accelerate the desired applications, even if FPGAs are not available in the local computing facility. My research lab has developed efficient FPGA-based accelerators for multiple applications, such as data compression, sorting, and genomic sequencing, with 10 to 100 times gains in performance/power efficiency over advanced processors.

The obstacle to personalized computing
However, creating DSAs in ASICs or FPGAs is considered hardware design, typically done in register-transfer level (RTL) hardware description languages such as Verilog or VHDL, with which most software programmers are unfamiliar. According to 2020 data from the US Bureau of Labor Statistics, there were over 1.8 million software developers in the United States, but fewer than 70,000 hardware engineers.

Recent advances in high-level synthesis (HLS) hold promise for making circuit design more accessible, as HLS tools can automatically compile computational kernels written in C, C++, or OpenCL into an RTL description for ASIC or FPGA designs. However, the quality of the circuits generated by existing HLS tools depends heavily on the structure of the input C/C++ code and on the hardware implementation guidance (called “pragmas”) provided by the designer. For example, for the simple 7-line code of a one-layer convolutional neural network (CNN), widely used in deep learning and shown in Figure 1, an existing commercial HLS tool generates an FPGA-based accelerator that is 108 times slower than a single-core CPU. After proper restructuring of the input C code (to tile the computation, for example) and the insertion of 28 pragmas, the final FPGA accelerator is 89 times faster than a single-core CPU (more than 9,000 times faster than the initial unoptimized HLS solution).

Fig. 1: Simple 7-line code of a one-layer convolutional neural network (CNN) widely used in deep learning
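
The figure itself is not reproduced here. As a rough sketch of the kind of loop nest the caption describes, the C function below computes one convolutional layer. The layer sizes and the names cnn_layer, in, weight, and out are assumptions chosen for illustration, and the exact statement count may differ from the original figure.

```c
/* Hypothetical layer dimensions, chosen only for illustration. */
enum { NUM_OUT = 4, NUM_IN = 3, OUT_ROWS = 6, OUT_COLS = 6, KERNEL = 3 };

/* One convolutional layer: each output pixel accumulates products over
 * all input channels and a KERNEL x KERNEL window.
 * The caller must zero-initialize `out` before the call. */
void cnn_layer(float out[NUM_OUT][OUT_ROWS][OUT_COLS],
               float in[NUM_IN][OUT_ROWS + KERNEL - 1][OUT_COLS + KERNEL - 1],
               float weight[NUM_OUT][NUM_IN][KERNEL][KERNEL])
{
    for (int o = 0; o < NUM_OUT; o++)
        for (int r = 0; r < OUT_ROWS; r++)
            for (int c = 0; c < OUT_COLS; c++)
                for (int i = 0; i < NUM_IN; i++)
                    for (int p = 0; p < KERNEL; p++)
                        for (int q = 0; q < KERNEL; q++)
                            out[o][r][c] += weight[o][i][p][q] * in[i][r + p][c + q];
}
```

Every loop iteration here is independent except for the accumulation into out, which is why tiling, pipelining, and parallelization have so much room to help.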

These pragmas (hardware design guidance) tell the HLS tool where to parallelize and pipeline the computation, how to partition data arrays to map them to on-chip memory blocks, and so on. However, most software programmers do not know how to perform these hardware-specific optimizations.
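
To make this concrete, here is a minimal sketch of a dot product annotated with pragmas in the style of AMD/Xilinx Vitis HLS. The function name, the array size, and the factor of 8 are illustrative assumptions, and exact pragma spellings vary across tools and versions. An ordinary C compiler ignores these pragmas, so the same code also runs as plain software.

```c
enum { N = 64, UNROLL = 8 };

/* Dot product with HLS-style hints (illustrative, not tuned for any device). */
int dot_product(const int a[N], const int b[N])
{
    int local_a[N];
    int local_b[N];
    /* Split each buffer across 8 memory banks so that 8 elements
     * can be read in the same clock cycle. */
#pragma HLS ARRAY_PARTITION variable=local_a cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=local_b cyclic factor=8 dim=1

    /* Copy the inputs into on-chip buffers. */
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        local_a[i] = a[i];
        local_b[i] = b[i];
    }

    int sum = 0;
    for (int i = 0; i < N; i += UNROLL) {
        /* Ask for a new outer iteration to start every cycle. */
#pragma HLS PIPELINE II=1
        for (int j = 0; j < UNROLL; j++) {
            /* Replicate the multiplier 8 times instead of looping. */
#pragma HLS UNROLL
            sum += local_a[i + j] * local_b[i + j];
        }
    }
    return sum;
}
```

Note how the unroll factor and the partition factor have to agree: unrolling by 8 only pays off if the memory can deliver 8 operands per cycle. This kind of coupling is what makes pragma selection hard for programmers without a hardware background.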

Our solutions
To enable more people to design DSAs based on software programmer-friendly code, we take a three-pronged approach:

• Architecture-guided optimization
• Automated code transformation and pragma insertion
• Support for high-level domain-specific languages (DSLs)

A good example of architecture-guided optimization is the automated generation of systolic arrays (SAs), an efficient architecture that uses only local communication between adjacent processing elements. It is used by the TPU and many other deep learning accelerators, but it is not easy to design. A 2017 Intel study showed that 4 to 18 months are needed to design a high-quality SA, even with HLS tools. Our recent work, AutoSA, provides a fully automated solution. Once a programmer has marked a section of C or C++ code to implement in the SA architecture, AutoSA can generate an array of processing elements and an associated data communication network, maximizing computational throughput. For the CNN example, AutoSA generates an optimized SA with over 9,600 lines of C code, including pragmas, achieving a speedup of more than 200x over a single-core CPU.
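
The idea of local-only communication can be illustrated with a small software model of an output-stationary systolic array that computes a matrix product. This is a hand-written sketch with assumed names and a fixed 3x3 size, not code generated by the tool described above. Each processing element (PE) reads one value from its left neighbor and one from the neighbor above, performs a multiply-accumulate, and forwards both values on the next cycle.

```c
enum { N = 3 };

/* Cycle-by-cycle model of an N x N systolic array computing c = a * b.
 * Row i of `a` enters from the left edge delayed by i cycles;
 * column j of `b` enters from the top edge delayed by j cycles. */
void systolic_matmul(int a[N][N], int b[N][N], int c[N][N])
{
    int a_reg[N][N] = {{0}}; /* value each PE forwards to the right */
    int b_reg[N][N] = {{0}}; /* value each PE forwards downward */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;

    for (int t = 0; t < 3 * N - 2; t++) {
        /* Sweep from bottom-right to top-left so that every PE still
         * sees the value its neighbor held in the previous cycle. */
        for (int i = N - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int ka = t - i; /* element of row i entering at the left edge */
                int kb = t - j; /* element of column j entering at the top edge */
                int a_in = (j > 0) ? a_reg[i][j - 1]
                         : (ka >= 0 && ka < N) ? a[i][ka] : 0;
                int b_in = (i > 0) ? b_reg[i - 1][j]
                         : (kb >= 0 && kb < N) ? b[kb][j] : 0;
                c[i][j] += a_in * b_in; /* partial sum stays inside the PE */
                a_reg[i][j] = a_in;
                b_reg[i][j] = b_in;
            }
        }
    }
}
```

Only the edge PEs touch the input matrices; every other PE talks to its immediate neighbors. In hardware this keeps wires short, which is a large part of why the architecture is efficient, and the input skewing and boundary handling hint at why it is tedious to design by hand.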

For programs that do not map easily to common computational patterns (such as SAs or stencil computation, for which we have good solutions using architecture-guided optimization), our second approach is to perform automated code transformation and pragma insertion, repeatedly parallelizing or pipelining the computation based on bottleneck analysis or guided by graph-based deep learning. Built on AMD/Xilinx’s open-source Merlin compiler (originally developed by Falcon Computing Solutions), our tool, named AutoDSE, can eliminate most, if not all, of the pragmas inserted by expert hardware designers and achieve comparable or even better performance (as demonstrated on Xilinx’s Vitis HLS library for vision acceleration).
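
A toy version of bottleneck-guided exploration can be sketched as a greedy loop: find the slowest stage, give it more parallelism, and repeat until the resource budget runs out. The cost model below (latency as work divided by parallel factor, a fixed resource cost per parallel copy) and all names are hypothetical simplifications, and this is not the algorithm used in the tool described above.

```c
/* One stage of a design, under a deliberately simple cost model. */
typedef struct {
    long work;      /* iterations the stage must execute */
    int  factor;    /* current parallelization factor */
    int  unit_cost; /* resource units per parallel copy */
} Stage;

static long stage_latency(const Stage *s)
{
    return (s->work + s->factor - 1) / s->factor; /* ceiling division */
}

static int total_cost(const Stage stages[], int n)
{
    int cost = 0;
    for (int i = 0; i < n; i++)
        cost += stages[i].factor * stages[i].unit_cost;
    return cost;
}

/* Index of the stage with the largest latency (first one on ties). */
static int find_bottleneck(const Stage stages[], int n)
{
    int worst = 0;
    for (int i = 1; i < n; i++)
        if (stage_latency(&stages[i]) > stage_latency(&stages[worst]))
            worst = i;
    return worst;
}

/* Greedily double the parallel factor of the current bottleneck while
 * the design still fits in `budget`. Returns the final bottleneck latency. */
long explore(Stage stages[], int n, int budget)
{
    for (;;) {
        int b = find_bottleneck(stages, n);
        if (stage_latency(&stages[b]) <= 1)
            break; /* nothing left to gain */
        int extra = stages[b].factor * stages[b].unit_cost; /* cost of doubling */
        if (total_cost(stages, n) + extra > budget)
            break; /* next step would not fit */
        stages[b].factor *= 2;
    }
    return stage_latency(&stages[find_bottleneck(stages, n)]);
}
```

The point of targeting the bottleneck is that resources spent on any other stage do not shorten the overall latency. A real flow would replace the cost model with estimates from the HLS tool and would consider pipelining and memory partitioning as well as parallel factors.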

The third effort is to further raise the level of design abstraction to support DSLs so that software developers in certain application domains can easily create DSAs. For example, based on the open-source HeteroCL intermediate representation, we can support Halide, a widely used image processing DSL with the advantageous property of decoupling algorithm specification from performance optimization. For the example blur filter written in 8 lines of Halide code, our tool can generate 1,455 lines of optimized HLS C code with 439 lines of pragmas, achieving a 3.9x speedup over a 28-core CPU.
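
The decoupling of algorithm from optimization can be sketched in plain C. The two functions below compute the same 3x3 blur but organize the work differently: one stores the whole intermediate image first, the other recomputes intermediate values on the fly. This is neither Halide code nor tool output; the image size and function names are assumptions for illustration.

```c
enum { W = 8, H = 6 };

/* The algorithm: a horizontal 3-tap average at (x, y), for 1 <= x <= W - 2. */
static int blur_x_at(int in[H][W], int x, int y)
{
    return (in[y][x - 1] + in[y][x] + in[y][x + 1]) / 3;
}

/* Schedule A: compute the full horizontal pass, then the vertical pass.
 * Uses extra memory for `tmp` but computes each intermediate value once. */
void blur_two_pass(int in[H][W], int out[H][W])
{
    int tmp[H][W] = {{0}};
    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            tmp[y][x] = blur_x_at(in, x, y);
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            out[y][x] = (tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) / 3;
}

/* Schedule B: fuse both passes. No intermediate buffer, but each
 * horizontal average is recomputed for three output rows. */
void blur_fused(int in[H][W], int out[H][W])
{
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            out[y][x] = (blur_x_at(in, x, y - 1) +
                         blur_x_at(in, x, y) +
                         blur_x_at(in, x, y + 1)) / 3;
}
```

Both functions write only interior pixels and produce identical results; they differ only in memory use and redundant work. In C the programmer has to rewrite the loops to switch between them, whereas a DSL that separates the two concerns lets a compiler make that choice and, for hardware targets, emit the matching pragmas.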

These efforts combine to create a user-friendly programming environment and compilation flow for software programmers, allowing them to create DSAs efficiently and cost-effectively (especially on FPGAs). This is essential to democratize personalized computing.

Broaden participation
In their 2018 Turing Award lecture, “A New Golden Age for Computer Architecture,” John L. Hennessy and David A. Patterson concluded, “The next decade will see a Cambrian explosion of new computer architectures, which means an exciting time for computer architects in academia and industry.” We would like to extend participation in this exciting journey to performance-oriented software programmers, who can build their own custom architectures and accelerators on FPGAs, or even ASICs, to achieve significant improvements in performance and power efficiency.

This article is based on Jason Cong’s recent Vision Address at the 35th International Conference on VLSI Design. The full talk can be found here.

Jason Cong

Jingsheng Jason Cong is a Fellow of the IEEE, a member of the US National Academy of Engineering, and the Volgenau Chair for Engineering Excellence in the Department of Computer Science at the University of California, Los Angeles. He is the recipient of the 2022 IEEE Robert N. Noyce Medal “for fundamental contributions to electronic design automation and FPGA design methods.” Cong’s contributions include three key areas of EDA tool development: logic synthesis algorithms for FPGAs, interconnect optimization algorithms for physical design, and high-level synthesis from programming languages that are friendly to software programmers.

Abdul J. Gaspar