Playing with the VideoCore IV GPU on a Raspberry Pi Zero using VC4CL
Recently, I learned about VC4CL, an implementation of OpenCL on the VideoCore IV, the GPU on every Raspberry Pi (except the Pi 4, which uses the VideoCore VI). The press about it seemed to talk about how it’s been woefully underused in many projects, so I was naturally excited to use it myself. I was lately obsessed with making a fluid simulation toy, and I figured embedded GPGPU might be the answer.
I ended up picking the Raspberry Pi Zero for my project because it was small and cheap yet packing the same GPU, and I’m always attracted to running goliath things on David-like hardware.
To begin though, getting VC4CL on the Raspberry Pi Zero was a challenge to begin with–foreshadowing I didn’t notice at the time. I followed this short and neat guide, but I would wait hours just to see gcc getting killed at the linking stage every time. Some Googling revealed that this was an OOM (out-of-memory) kill, and the solution was a temporary swap space, according to StackOverflow. The below script makes sure to allocate that, so I guarantee that it works on a Raspberry Pi Zero.
sudo apt update sudo apt upgrade -y sudo apt install cmake git -y sudo apt install ocl-icd-opencl-dev ocl-icd-dev -y sudo apt install opencl-headers -y sudo apt install clinfo -y sudo apt install libraspberrypi-dev -y sudo apt install clang clang-format clang-tidy -y mkdir opencl cd opencl git clone https://github.com/doe300/VC4CLStdLib.git git clone https://github.com/doe300/VC4CL.git git clone https://github.com/doe300/VC4C.git dd if=/dev/zero of=./tempswap count=1K bs=1M mkswap ./tempswap sudo chown root:root ./tempswap sudo chmod 600 ./tempswap sudo swapon ./tempswap cd VC4CLStdLib mkdir build cd build cmake .. make sudo make install sudo ldconfig cd ../../VC4C mkdir build cd build cmake .. make sudo make install sudo ldconfig cd ../../VC4CL mkdir build cd build cmake .. make sudo make install sudo ldconfig cd ../.. sudo swapoff ./tempswap sudo rm ./tempswap
After a couple more hours of compiling, I could finally confirm OpenCL functionality with
sudo clinfo (sudo is necessary for all OpenCL applications on Raspberry Pi because the GPU is wired in with effectively privileged memory access, read the VC4CL repo for more information).
[email protected]:~ $ sudo clinfo Number of platforms 1 Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU Platform Vendor doe300 Platform Version OpenCL 1.2 VC4CL 0.4.9999 (2cf1d93) Platform Profile EMBEDDED_PROFILE Platform Extensions cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_khr_suggested_local_work_size cl_vc4cl_performance_counters Platform Extensions function suffix VC4CL Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU Number of devices 1 Device Name VideoCore IV GPU Device Vendor Broadcom Device Vendor ID 0x14e4 Device Version OpenCL 1.2 VC4CL 0.4.9999 (2cf1d93) Driver Version 0.4.9999 Device OpenCL C Version OpenCL C 1.2 Device Type GPU Device Profile EMBEDDED_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 1 Max clock frequency 300MHz Core Temperature (Altera) 31 C Device Partition (core) Max number of sub-devices 0 Supported partition types None Supported affinity domains (n/a) Max work item dimensions 3 Max work item sizes 12x12x12 Max work group size 12 Preferred work group size multiple 1 Preferred / native vector sizes char 16 / 16 short 16 / 16 int 16 / 16 long 0 / 0 half 0 / 0 (n/a) float 16 / 16 double 0 / 0 (n/a) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals No Infinity and NANs No Round to nearest No Round to zero Yes Round to infinity No IEEE754-2008 fused multiply-add No Support is emulated in software No Correctly-rounded divide and sqrt operations No Double-precision Floating-point support (n/a) Address bits 32, Little-Endian Global memory size 67108864 (64MiB) Error Correction support No Max memory allocation 67108864 (64MiB) Unified memory for Host and Device Yes Minimum alignment for any data type 64 bytes Alignment of base address 512 bits (64 bytes) Global Memory cache type Read/Write Global Memory cache size 32768 (32KiB) Global Memory cache line size 64 bytes Image support No Local memory type Global Local memory size 67108864 (64MiB) Max number of constant args 32 Max constant buffer size 67108864 (64MiB) Max size of kernel argument 256 Queue properties Out-of-order execution No Profiling Yes Prefer user sync for interop Yes Profiling timer resolution 1ns Execution capabilities Run OpenCL kernels Yes Run native kernels No IL version SPIR-V_1.5 SPIR_1.2 SPIR versions 1.2 printf() buffer size 0 Built-in kernels (n/a) Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_int16 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_khr_suggested_local_work_size cl_vc4cl_performance_counters NULL platform behavior clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) OpenCL for the Raspberry Pi VideoCore IV GPU clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [VC4CL] clCreateContext(NULL, ...) [default] Success [VC4CL] clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1) Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU Device Name VideoCore IV GPU clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1) Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU Device Name VideoCore IV GPU clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1) Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU Device Name VideoCore IV GPU ICD loader properties ICD loader Name OpenCL ICD Loader ICD loader Vendor OCL Icd free software ICD loader Version 2.2.12 ICD loader Profile OpenCL 2.2
Just to test my concept, I reused most of the code from ESP32-fluid-simulation. Although the simulation used in that project was rather crude, I just wanted a low baseline to quickly begin with. In fact, I couldn’t even get that to work. Sure, the numbers it yielded confirmed OpenCL worked, but it was strangely slow.
Over a weekend, I threw the kitchen sink at it: sacrificing accuracy, optimizing, and even overclocking. Still, to run 10 seconds of simulation at 30 FPS, the Raspberry Pi Zero took 16.982 seconds. Meanwhile, my Raspberry Pi 3–not overclocked and using the same GPU–ran the same code in 6.376 seconds. Considering that the Pi 3 was more than twice as fast, the CPU on the Zero was clearly too slow!
The entire simulation was running on the GPU, so why? Jacobi iteration. Each iteration was embarrassingly parallel by itself, but the next iteration was dependent on the previous. That meant each iteration meant a new kernel call in order to preserve the dependency, so I ended up needing to call hundreds of kernels per second. In fact, calling a kernel was quite expensive on the Zero’s weak CPU. The algorithm itself–as parallel as it was–just wasn’t parallel enough, and the Zero couldn’t handle OpenCL’s overhead as a result.
So, that meant that I should just pursue what I originally wanted on the Raspberry Pi 3; it’s got a CPU powerful enough to handle the kernel calls. However, I really want the neat and tiny form factor of the Zero. What I might do instead is cellular automaton fluid. A technique used in some 2D games, it’s not as mathematically rigorous as Eulerian fluid simulation, but it should require way less kernel calls. Anyway, I think it would be a fun exploration.
But I digress. I did succeed in using the Raspberry Pi Zero’s GPU, though it went in a way I completely didn’t expect. I think that VC4CL–and embedded GPGPU as a whole–can offer an unprecedented level of compute that enables projects that were unthinkable before. Eulerian fluid simulation on the Zero turned out to be too CPU-bound to prove it, but I’m determined to make a project that’s really, really parallel and demonstrate.