Development of the VPU

The heart of neural networks

Jon Peddie

With the advent of AI, image recognition, and neural networks, a new category of processor, or function, has emerged, broadly known as a visual processing unit, or VPU. The name belies the real importance of the function, and so we prefer the awkward term VPU and neural network accelerator, which yields the mind-numbing acronym VPNNA, or VPNA. But what makes a processor a vision processor? In many regards, “you'll know one when you see one” covers the situation, but some things can be easily specified.

The visual processing and neural network accelerator — VPNA


Included in the definition are IP blocks such as a GPU (think Qualcomm or Nvidia), a wide-SIMD DSP (like Ceva or Tensilica), or a programmable camera pipeline (Apical or Intel). The term also includes full SoCs that combine one or more of the characteristics of those IP blocks with other functions, such as those from Movidius and Inuitive, which both add hardware acceleration of specific functions to their DSP-based blocks. Lastly, novel architectures such as Wave Computing's, which target a wider range of functions but are also well suited to accelerating vision tasks, are included.

The first thing that unites all these devices is that, while they are programmable, they are not intended to be the main CPU in a system: they are primarily intended as accelerators, and design decisions have been taken to optimize them for that role. Thus, even though mainstream architectures such as x86, Arm, and MIPS are fully capable of executing vision programs (and in some cases do so very effectively), they are not included in the definition because they are optimized to be the main CPU, not an accelerator.

Second, they are all massively parallel and depend for their performance on the fact that visual data comes in 2D arrays (at least) and that vision functions in general exhibit massive data parallelism. Their hardware architectures have been optimized to take advantage of this and they devote significant silicon area to it, for example, in the form of extreme multi-threading hardware or specific data handling and memory access optimization hardware.
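The data parallelism described above can be made concrete with a minimal sketch (illustrative only, not vendor code): a 3x3 box blur, the kind of per-pixel kernel these architectures accelerate. Each output pixel depends only on a small neighborhood of the input, so every pixel can be computed independently and in parallel.

```python
import numpy as np

# Illustrative sketch: a 3x3 box blur over a 2D image array.
# Every output pixel depends only on its 3x3 input neighborhood, so all
# output pixels are independent -- the massive data parallelism a VPNA's
# multi-threading and memory-access hardware is built to exploit.
def box_blur_3x3(image):
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=image.dtype)
    # Accumulate the nine shifted copies of the image (valid region only).
    for dy in range(3):
        for dx in range(3):
            out += image[dy:dy + h - 2, dx:dx + w - 2]
    return out // 9  # integer average of the 9 neighbors

img = np.arange(25, dtype=np.int32).reshape(5, 5)
blurred = box_blur_3x3(img)  # 3x3 result; each pixel averages its window
```

On a VPNA, each of those independent per-pixel computations would be mapped across many hardware lanes or threads rather than run in a host loop.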

Third, the programmable elements have a vision-specific hardware architecture. The most obvious examples of this are the inclusion of high-efficiency integer hardware and the co-issue of smaller data types so that targeting 16-bit INT doubles throughput over 32-bit, for instance, but there are other optimizations as well, such as specific methods for handling non-linear functions.
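A back-of-the-envelope sketch of the co-issue arithmetic (the 128-bit lane width below is an assumption for illustration, not any particular vendor's design): a fixed-width vector register holds twice as many 16-bit integers as 32-bit ones, so a kernel that tolerates INT16 precision retires twice the elements per instruction.

```python
import numpy as np

# Assumed vector register width for illustration only.
LANE_BITS = 128
elems_int32 = LANE_BITS // 32  # elements processed per vector instruction
elems_int16 = LANE_BITS // 16  # twice as many at half the width

# Dropping to 16 bits usually means guarding against overflow; a
# saturating add is one common approach in integer vision hardware.
a = np.array([30000, -100, 7], dtype=np.int16)
b = np.array([10000, 50, 3], dtype=np.int16)
sat = np.clip(a.astype(np.int32) + b.astype(np.int32),
              -32768, 32767).astype(np.int16)
```

The first element saturates at 32767 instead of wrapping, which is the kind of non-linear behavior (like the methods for handling non-linear functions mentioned above) that vision-specific hardware often implements directly.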

Last, the unit must be usable as a vision processor, so it must ship with an SDK that specifically targets vision. The SDK can be based on OpenCV or OpenVX, can take input directly from TensorFlow, or can use whatever the vendor sees as most appropriate, but the tool chain must be in place to enable vision processing. A key feature of such an SDK is that it can expose complex intrinsic functions that access the hardware accelerators.
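To show the style of SDK involved, here is a hypothetical sketch in plain Python (the names VisionGraph, add_node, and process are invented for illustration; they are not a real vendor API or the OpenVX binding). It mimics the graph-based approach OpenVX takes: the application declares a pipeline of vision nodes up front, and the runtime is then free to map each node onto a hardware intrinsic.

```python
# Hypothetical sketch of a graph-style vision SDK (all names invented).
# The application builds the pipeline first; execution is deferred, which
# is what lets a vendor runtime fuse nodes and dispatch them to hardware
# accelerators behind complex intrinsics.
class VisionGraph:
    def __init__(self):
        self.nodes = []

    def add_node(self, name, fn):
        self.nodes.append((name, fn))  # declaration only, nothing runs yet
        return self

    def process(self, image):
        # A real runtime would lower each node to an accelerator intrinsic;
        # here we simply run them in order on the host.
        for _, fn in self.nodes:
            image = fn(image)
        return image

g = (VisionGraph()
     .add_node("grayscale", lambda px: [sum(p) // 3 for p in px])
     .add_node("threshold", lambda px: [255 if v > 127 else 0 for v in px]))
result = g.process([(200, 220, 210), (10, 20, 30)])  # RGB pixels in, mask out
```

The deferred-execution design is the point: because no node runs until process() is called, the runtime sees the whole pipeline and can schedule it onto whatever acceleration hardware the VPNA provides.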

That definition covers a lot of hardware architectures, which is a reminder that we are very much in the same sort of situation that graphics was in during the pre-OpenGL, pre-DirectX days, when APIs were proprietary and hardware architectures proliferated. That situation may settle down over the next few years as a smaller set of APIs becomes dominant, but for now, that's where we are.

GPUs play a critical role in the implementation of a VPNA and although a GPU is not a VPNA, a VPNA can’t exist without some GPU functionality.

The vision processing unit is in some ways similar to a video processing unit. Where a video processing unit is a specific type of graphics processor specialized for encoding and decoding video, the vision processing unit is better suited to running machine vision algorithms such as convolutional neural networks. These devices may include specific resources for acquiring visual data from cameras, and they are built for parallel processing. Like video processing units, they are particularly geared toward image processing. Some are described as low power and high performance and may be plugged into interfaces that allow programmable use; other aspects of the design vary with manufacturer and design choices.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

Vision processing units are distinct from video processing units (which are specialized for video encoding and decoding) in their suitability for running machine vision algorithms such as CNNs (convolutional neural networks) and SIFT (scale-invariant feature transform).

They are distinct from GPUs, which contain specialized hardware for rasterization and texture mapping (for 3D graphics), and whose memory architecture is optimized for manipulating bitmap images in off-chip memory (reading textures and modifying frame buffers, with random access patterns). GPUs also contain video codecs, which are often referred to as a video processing unit or VPU.

Target markets are robotics, the internet of things, new classes of digital cameras for virtual reality and augmented reality, smart cameras, and integrating machine vision acceleration into smartphones and other mobile devices.

A few companies market such a function or device, for example:

Movidius Myriad X, which is the third-generation vision processing unit in the Myriad VPU line from Intel Corporation.

CEVA's NeuPro AI processor, which consists of the NeuPro Engine and the NeuPro VPU.

Developers of SoCs that include GPUs and other engines, such as ISPs and DSPs, have to be aware of the implications, needs, and demands of a VPNA so as not to be caught unprepared for the expansion of AI inference run on the SoC.