Introduction
In this article, I will discuss Intel architecture and Intel devices such as the CPU, GPU, VPU, and FPGA.
Key Terms
TDP (Thermal Design Power)
The maximum amount of heat a processor is expected to generate under any operating load, which the cooling mechanism must be designed to dissipate.
AI Accelerators
Hardware designed specifically to speed up AI and ML workloads. For instance: the NCS-2, VPUs, and FPGAs, among others.
Compatibility
Each modern Intel processor generation provides backward support for earlier chips and software while adding new and improved features.
Hyperthreading
Each physical core is presented to the OS as two virtual (logical) cores. For example, a CPU with 4 physical cores is recognized by the OS as having 8 virtual cores. Virtual cores allow multiple execution threads to run concurrently on one physical core. Note that a true 8-core CPU still performs better than a 4-core CPU with hyperthreading.
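You can see the logical-core count the OS reports using Python's standard library. A minimal sketch; the 4-core machine below is an assumed example, and `os.cpu_count()` reports whatever the current machine actually exposes:

```python
import os

# os.cpu_count() reports *logical* cores: on a 4-core CPU with
# hyperthreading enabled, it returns 8, not 4.
logical = os.cpu_count()
print(f"Logical cores visible to the OS: {logical}")

# A hypothetical 4-core machine with hyperthreading (illustration only):
physical_cores = 4      # assumed example machine
threads_per_core = 2    # hyperthreading exposes 2 logical cores per core
print(f"Example: {physical_cores} physical -> "
      f"{physical_cores * threads_per_core} logical cores")
```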
Clock Speed
The rate at which a processor completes cycles, usually given in hertz (one cycle per second). For instance, 3 GHz is 3 billion cycles per second.
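The arithmetic behind clock speed is straightforward; a small sketch:

```python
# Convert a clock speed to the duration of a single cycle, and estimate
# how long a fixed number of cycles takes. Purely arithmetic.
def cycle_time_ns(clock_hz):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz

def time_for_cycles(cycles, clock_hz):
    """Seconds needed to complete `cycles` cycles at `clock_hz`."""
    return cycles / clock_hz

print(cycle_time_ns(3e9))         # one cycle at 3 GHz is about 0.333 ns
print(time_for_cycles(3e9, 3e9))  # 3 billion cycles at 3 GHz take 1.0 s
```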
Instruction Set Extensions
Additional instructions that optimize the performance of specific operations. For example, VNNI (Vector Neural Network Instructions) speeds up the integer arithmetic at the heart of neural-network inference.
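As a rough illustration of what VNNI fuses into a single instruction, here is a plain-Python emulation of the int8 multiply-accumulate semantics; this emulates the operation's effect, it is not real intrinsics:

```python
def vnni_dot_accumulate(acc, a, b):
    """Emulate the core VNNI operation: accumulate the dot product of
    8-bit integer vectors into a wider accumulator in one fused step.
    (Real VNNI does this for 4 pairs per 32-bit lane per instruction.)"""
    assert all(-128 <= x <= 127 for x in list(a) + list(b)), "int8 range"
    return acc + sum(x * y for x, y in zip(a, b))

# Without VNNI this takes separate multiply, widen, and add instructions;
# with VNNI it is a single fused instruction per lane.
print(vnni_dot_accumulate(10, [1, 2, 3, 4], [5, 6, 7, 8]))  # 10 + 70 = 80
```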
Image Accelerator/ Hardware Accelerator
Dedicated hardware that offers functions such as H.264 and Motion JPEG encoding and decoding, a warping engine for fisheye lens correction, dense optical flow, and stereo depth perception.
Application-Specific Integrated Circuits (ASIC)
Chips hard-wired at production time to be optimally effective for one particular task. ASICs are used in devices whose function will not change over time. For instance, a circuit designed specifically for a car's backup camera.
HDDL/ VAD (High-Density Deep Learning / Vision Accelerator Design)
Devices with multiple Myriad X chips (either 4 or 8) that appear to the system as if multiple USB NCS-2 sticks were plugged in.
Intel Devices
CPU (Central Processing Unit)
The part of the computer that carries out the basic functions of the OS, such as reading and writing memory and communicating with other components. It usually takes the form of a microprocessor with one or more cores.
Criteria for choosing CPU
- Cost
When picking a CPU for a device, remember that other components (for example, the power supply) may cost far more than the CPU itself.
- Performance
Most devices are built to serve for longer than the life of a typical consumer processor, so choose a CPU whose performance will remain adequate over the device's lifetime.
- Power Requirement
Pick a CPU whose power draw fits your device's power budget; meeting the performance requirements alone is not enough.
- Ambient Temperature
If the device will operate outside normal ambient conditions, choose a processor rated for an extended temperature range.
- Lifespan
How long the processor will remain available and supported.
GPU (Graphical Processing Unit)
Used for rendering video and manipulating images. It can also run OpenVINO models and inference.
Basic building blocks
- Execution units (EUs) are streamlined multi-threaded processors. Each EU can run up to 7 threads concurrently.
- A slice is a set of 24 EUs. Slices perform programmable work, such as executing OpenVINO inference.
- Most of the rest of the GPU is the unslice, which provides the fixed-function features used for video processing.
- Video encoding and decoding are handled by the MFX, or video box.
- Video enhancement operations are handled by the VQE, or video enhancement box.
Some facts
- A higher number of EUs means higher performance.
- GT-2 is an entry-level GPU with 1 slice.
- GT-3 is a mid-level GPU with 2 slices.
- GT-3e is a GT-3 with extra RAM.
- GT-4e is a high-level GPU with extra RAM; it has 3 slices.
- Slice and Unslice run at different speeds.
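The tier facts above reduce to simple arithmetic. A small sketch, with tier names and slice counts taken from the list above:

```python
# Each slice bundles 24 EUs, and a higher EU count means more parallelism.
EUS_PER_SLICE = 24

GT_TIERS = {
    "GT-2": 1,   # entry level: 1 slice
    "GT-3": 2,   # medium level: 2 slices
    "GT-3e": 2,  # GT-3 plus extra RAM; same slice count
    "GT-4e": 3,  # high level: 3 slices plus extra RAM
}

for tier, slices in GT_TIERS.items():
    print(f"{tier}: {slices * EUS_PER_SLICE} EUs")
```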
IGPU (Integrated GPU)
A GPU located next to the CPU cores on the same processor die, sharing memory with them. You might expect an IGPU to be less efficient when handling data in large block sizes, but because it shares memory with the CPU, that's not really the case.
Key characteristics of IGPU
- Configurable Power Consumption
You can control the clock rate separately for the slice and the unslice. Unused parts of the GPU can then be shut down to lower power consumption.
- OpenCL Startup Time
OpenCL uses a just-in-time compiler to compile code for whatever hardware is present when the application loads. Compared with running OpenVINO on the CPU alone, this leads to substantially longer model load times on an IGPU.
- Model Precision and Speed
The EU instruction set and hardware of integrated GPUs are designed for 16-bit floating-point data types. This improves inference speed, since twice as many 16-bit operations can be processed per clock cycle as 32-bit operations.
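The size difference is easy to see with Python's `struct` module, which supports half precision via the `'e'` format:

```python
import struct

# FP16 values occupy 2 bytes versus 4 for FP32, so a fixed-width register
# (or memory bus) moves twice as many FP16 values per transfer.
fp32 = struct.pack("f", 3.14159)   # 32-bit float
fp16 = struct.pack("e", 3.14159)   # 16-bit float ('e' = half precision)
print(len(fp32), len(fp16))        # 4 2

# Half precision trades accuracy for speed: unpacking shows rounding error.
(approx,) = struct.unpack("e", fp16)
print(abs(approx - 3.14159) < 1e-2)  # small, acceptable for inference
```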
- Shared Components
The CPU and IGPU share the device's memory, the higher-level caches, and the memory controller on the same die. This reduces the overhead of passing data between the two devices.
VPU (Visual Processing Unit)
A processor specially optimized for the kinds of calculations performed by convolutional neural networks (CNNs). Examples of such accelerators are the Intel Myriad X and Google's TPU.
Characteristics of VPU
- Interface Unit
The component of the VPU that communicates with the host machine, which may be a CPU or another compute unit. ML models are trained on the host, and inference runs on the VPU. Various interface options exist (e.g. USB 3.1 and Gigabit Ethernet), allowing several types of pre-existing systems to be connected to a VPU.
- Imaging accelerators
These are specific kernels used for image processing. The operations range from image de-noising techniques to edge detection algorithms.
- Neural Compute Engine
A dedicated hardware accelerator designed to run neural networks at low power without loss of precision.
- Vector Processors
Processors that operate on vectors (1D arrays), in contrast to scalar processors, which deal with individual data items. The vector processor in a VPU breaks a complex operation apart and executes the pieces in parallel.
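The scalar-versus-vector distinction can be illustrated in plain Python. This is conceptual only; real vector hardware performs all lane operations in a single instruction:

```python
def scalar_add(a, b):
    # A scalar processor handles one element per instruction.
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out

def vector_add(a, b):
    # One conceptual instruction applied to the whole vector; in real
    # vector hardware every lane addition happens in parallel.
    return [x + y for x, y in zip(a, b)]

print(scalar_add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
print(vector_add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9], fewer "instructions"
```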
- On-chip CPU
A VPU has a specialized on-chip CPU. The Myriad X VPU has two on-chip CPUs: one handles the host interface, and the other coordinates the Neural Compute Engine, the vector processors, and the imaging accelerators on the chip.
Myriad X
It provides 4 TOPS of compute, of which 1 TOPS is supplied to the NCE, i.e. the Neural Compute Engine. It has 16 C-programmable 128-bit VLIW (Very Long Instruction Word) vector processors, known as SHAVEs (Streaming Hybrid Architecture Vector Engine).
The NCE supports the following features:
- Multi-channel convolution via matrix-to-matrix multiply and accumulate
- Maximum and average pooling
- Fully connected layers via vector-to-matrix multiply and accumulate
- Various post-processing layers
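The first NCE feature, convolution via matrix multiply and accumulate, can be sketched in plain Python: each receptive field is flattened into a row (the classic im2col trick), then multiplied and accumulated against the flattened kernel. A minimal single-channel sketch, not the NCE's actual implementation:

```python
def conv2d_as_matmul(image, kernel):
    """Convolution lowered to multiply-and-accumulate over flattened
    patches (im2col), the pattern the NCE's matrix hardware exploits.
    Valid padding, stride 1, single channel."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[r][c] for r in range(kh) for c in range(kw)]
    oh, ow = ih - kh + 1, iw - kw + 1
    out = []
    for r in range(oh):
        row = []
        for c in range(ow):
            # im2col row: the receptive field flattened to 1D ...
            patch = [image[r + i][c + j] for i in range(kh) for j in range(kw)]
            # ... then one multiply-and-accumulate against the kernel.
            row.append(sum(p * k for p, k in zip(patch, flat_k)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each pixel with its lower-right neighbour
print(conv2d_as_matmul(image, kernel))  # [[6, 8], [12, 14]]
```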
NCS-2 (Neural Compute Stick)
A particular VPU (the Myriad X) with 4 GB of memory in a USB 3 form factor. It is the lowest-cost and lowest-performance kind of accelerator.
- It supports 4 inference requests per stick.
- The processor used is Myriad X VPU.
- It uses OpenVINO Toolkit as a software development kit.
- It supports FP16 precision.
- It has a USB 3.1 plug-and-play interface. It can also be used with USB 2, but processing will be slower due to I/O throttling.
- Adding multiple NCS-2 sticks allows multiple inferences to run in parallel.
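The parallelism described above can be sketched with a thread pool and a placeholder inference function. `fake_infer` is a hypothetical stand-in, not the OpenVINO API:

```python
from concurrent.futures import ThreadPoolExecutor

REQUESTS_PER_STICK = 4  # per the NCS-2 spec above

def fake_infer(frame_id):
    # Placeholder for a real inference call (hypothetical stand-in).
    return f"result-{frame_id}"

def run_on_sticks(frames, sticks=1):
    # Each stick handles up to 4 concurrent requests, so total
    # parallelism scales with the number of sticks plugged in.
    with ThreadPoolExecutor(max_workers=sticks * REQUESTS_PER_STICK) as pool:
        return list(pool.map(fake_infer, frames))

print(run_on_sticks(range(6), sticks=2))  # up to 8 requests in flight
```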
FPGA (Field Programmable Gate Array)
These are generally the most powerful accelerators, but they support the fewest network layers. FPGAs therefore often run in "hetero" mode, meaning that unsupported layers of the network fall back to the CPU.
FPGAs offer great flexibility: they can be reprogrammed after development and deployment, as and when necessary, in the field.
- FPGAs need to be reprogrammed at every power-up
- Good for prototyping and low-volume production
FPGA Architecture
At the bottom of the FPGA architecture are tiles. A tile is also known as an ALM (Adaptive Logic Module), a small building block replicated across the FPGA. The FPGA is configured at the Register Transfer Level (RTL): register transfer language code is compiled into a bitstream, which is then loaded to configure the device.
A tile contains:
- I/O (Example: DSPB or Digital Signal Processing Block)
- CB (Connection Block)
- SB (Switch Block)
- CLB (Configurable Logic Block)
Each CLB has 4 CBs, and each CB has 2 SBs.
CLB
CLBs are the heart of the FPGA, and there are generally thousands of them per chip. Each block uses lookup tables to execute its configured function: AND, OR, or NOT, for example. A logic block may also contain flip-flops, transistors, and multiplexers.
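A minimal sketch of how a lookup table realizes a logic function, assuming nothing about real FPGA tooling:

```python
# An n-input LUT is just a 2**n-entry truth table written into the FPGA's
# configuration memory; reprogramming the FPGA rewrites these tables.
def make_lut(func, n_inputs):
    """Build a truth table for `func` over all n-bit input combinations."""
    lut = {}
    for i in range(2 ** n_inputs):
        bits = tuple((i >> b) & 1 for b in range(n_inputs))
        lut[bits] = func(*bits)
    return lut

AND_LUT = make_lut(lambda a, b: a & b, 2)
OR_LUT = make_lut(lambda a, b: a | b, 2)

print(AND_LUT[(1, 1)], AND_LUT[(1, 0)])  # 1 0
print(OR_LUT[(0, 0)], OR_LUT[(0, 1)])    # 0 1
```

The same table-driven mechanism covers any 2-input function; the hardware never changes, only the stored table does.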
Programmable Interconnects
They are made up of CBs and SBs and route the inputs and outputs of the CLBs.
Programmable I/O blocks
They connect the tile to external circuitry for I/O.
FPGA can be Programmed using:
- OpenVINO plugin for HDDL-F card
- DSP Builder
- High-Level Synthesis
- Foundational software tools
FPGA Specifications
- High performance, low latency
Once programmed with a suitable bitstream, an FPGA runs a neural network with high performance and very low latency. The high performance comes from its ability to run many parts of the network in parallel across the FPGA.
- Flexibility
They are flexible in several ways:
- They can be designed to conform to modern, changing, and custom networks; they are field-programmable.
- Several precision options are provided (FP16, FP11, and FP9), letting developers balance speed against accuracy.
- The bitstream can be changed without modifying the hardware. This lets you boost device performance without replacing the FPGA.
- Large Network
FPGAs are very useful for deep learning because they can accommodate massive networks, handling models with more than two million parameters.
- Robust
FPGAs are designed for 100% uptime, i.e. they can run 24 × 7, 365 days a year, and they operate across a 0 °C to 80 °C temperature range.
- Long Lifespan
Their lifetime is long: Intel guarantees FPGA availability for 10 years from product launch.
Conclusion
In this article, we studied and discussed various Intel architectures and devices.