GPU not used — CUDA error 500

Symptom

The box has an NVIDIA GPU, but inference runs on CPU. The inference container log shows:

CUDA failure 500: named symbol not found
... resolved to CPU only

In the dashboard, the Python Inference card shows CPU (fallback) instead of TensorRT or CUDA. With strict GPU enforcement enabled (STRICT_EP=1), the container exits and keeps restarting instead of falling back.

Confirm

The decisive test is NVIDIA’s own compute sample. Run it on the host:

docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0

Healthy host → prints Test PASSED.
Broken host → Failed to allocate device vector A (error code named symbol not found)!

If nvidia-smi -L lists the GPUs but this sample fails, the driver can enumerate the devices but containers cannot run compute on them. The problem is the host GPU stack, not the Xisom image.
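The pass/fail decision above can be scripted. This is a minimal sketch, not part of the Xisom tooling: the function names classify_sample_output and check_gpu_compute are hypothetical, and the match strings are the two sample outputs quoted above.

```shell
#!/usr/bin/env bash
# Classify captured output of the vectorAdd compute sample.
# Prints one of: healthy, broken-compute, unknown.
classify_sample_output() {
  local output="$1"
  case "$output" in
    *"Test PASSED"*)            echo "healthy" ;;
    *"named symbol not found"*) echo "broken-compute" ;;
    *)                          echo "unknown" ;;
  esac
}

# Hypothetical wrapper: run the sample once and classify what it printed.
check_gpu_compute() {
  local output
  output="$(docker run --rm --gpus all \
    nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 2>&1)"
  classify_sample_output "$output"
}
```

After sourcing the file, check_gpu_compute prints a single word. An unknown result usually means the sample never ran (image pull failure, no --gpus support), so read the raw output before drawing conclusions.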

You can also confirm the fallback in the running container’s log:

docker compose -f docker-compose.release.yml logs inference | grep -E "CUDA failure 500|resolved to CPU"

Fix

This is a host-level GPU compute problem in the driver layer, so the fix belongs on the host, not in the Xisom stack.

Update the NVIDIA GPU driver on the host to a current build with full CUDA compute support. On a Windows + WSL2 host, also run wsl --update followed by wsl --shutdown from Windows once the driver is updated.

Updating Docker / the container platform alone does not fix this. The compute failure lives in the GPU driver layer below the container runtime. A platform update can even reset GPU passthrough settings and make things worse.
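Before and after the driver update, it can help to record the driver version and to check whether the WSL step applies to this host. The sketch below assumes a Linux shell (native or inside WSL). The function names are hypothetical, and the WSL check relies on the kernel version string containing "microsoft", which is how WSL kernels typically identify themselves.

```shell
#!/usr/bin/env bash
# Succeed when the kernel version string indicates a WSL kernel.
# The file argument defaults to /proc/version; it is a parameter only
# so the check can be tried against sample text.
is_wsl_kernel() {
  local version_file="${1:-/proc/version}"
  grep -qi "microsoft" "$version_file" 2>/dev/null
}

# Print the installed driver version, one line per GPU.
show_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
}
```

Running show_driver_version before and after the update confirms that the new driver is the one actually loaded. If is_wsl_kernel succeeds, the wsl --update and wsl --shutdown step applies.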

Re-wire the container runtime to the refreshed driver:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Re-run the compute sample until it prints Test PASSED:

docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0

Restart the inference container and confirm it picked up the GPU. No config change is needed — the default execution mode auto-selects the best available provider:

docker compose -f docker-compose.release.yml restart inference
docker compose -f docker-compose.release.yml logs inference | grep "EP enabled"
# expect: "TensorRT EP enabled" or "CUDA EP enabled"
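Because docker compose restart reuses the same container, the log can still contain the old fallback lines from before the fix. The sketch below reports the most recent provider state instead of the first match. The function name active_provider is hypothetical, and the patterns are the log strings quoted on this page.

```shell
#!/usr/bin/env bash
# Read inference log text on stdin and print the last reported
# provider state: tensorrt, cuda, cpu-fallback, or unknown.
active_provider() {
  awk '
    /TensorRT EP enabled/ { state = "tensorrt" }
    /CUDA EP enabled/     { state = "cuda" }
    /resolved to CPU/     { state = "cpu-fallback" }
    END { print (state == "" ? "unknown" : state) }
  '
}
```

Usage: docker compose -f docker-compose.release.yml logs inference | active_provider. A result of unknown shortly after a restart may only mean that the provider line has not been logged yet.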

While the host is still broken

Run cleanly on CPU instead of fighting the GPU probe. Set EXECUTION_MODE=cpu (or FORCE_CPU=1) on the inference service and restart. The box keeps predicting on CPU until the driver is fixed.

Do not enable strict GPU enforcement (STRICT_EP=1) with a GPU mode on a broken host — it will correctly refuse to start rather than silently run on CPU, which is the opposite of what you want during the workaround.
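One way to apply the CPU workaround without editing the release file is a Compose override. This is a sketch under assumptions: the override file name docker-compose.cpu.yml and the function name write_cpu_override are hypothetical, and it assumes the inference service reads EXECUTION_MODE from its container environment.

```shell
#!/usr/bin/env bash
# Write a Compose override file that pins the inference service to CPU.
# The default path is a hypothetical name; the service name "inference"
# and the EXECUTION_MODE setting are the ones used on this page.
write_cpu_override() {
  local path="${1:-docker-compose.cpu.yml}"
  cat > "$path" <<'EOF'
services:
  inference:
    environment:
      EXECUTION_MODE: "cpu"
EOF
}
```

After calling write_cpu_override, apply it with docker compose -f docker-compose.release.yml -f docker-compose.cpu.yml up -d inference. The up -d form is used here because docker compose restart does not pick up configuration changes. Once the driver is fixed, delete the override file and run up -d again without it.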

Prevent

Make a silent CPU fallback loud where GPU is mandatory. On boxes that must run on GPU, pin the provider and enable strict enforcement (EXECUTION_MODE=cuda + STRICT_EP=1) so a missing GPU library fails the container at startup instead of quietly degrading to CPU.

Treat the compute sample as the acceptance gate after any GPU-driver or host update. A passing nvidia-smi is not sufficient evidence that inference will use the GPU.

First run is slow, not stuck. On the first GPU start the engine is compiled and cached, which takes 2–3 minutes. Subsequent restarts are fast. Don’t mistake the long startup window for the failure above.

Related

Execution provider fell back to CPU — when the host GPU is fine but the provider still drops a tier.
Hardware Setup — supported GPU tiers.
Observability & Alerts — where the active provider is shown.