Running on CPU when GPU expected

Symptom

The host GPU works (the CUDA compute sample passes), but inference is slower than expected and the Python Inference card shows a lower tier than you provisioned:

You expect TensorRT but see CUDA, or
You expect a GPU provider but see CPU (fallback) or CPU.

The inference container log often contains a benign-looking line such as:

Failed to load library libonnxruntime_providers_tensorrt.so

That line marks the provider chain degrading one tier at a time: TensorRT → CUDA → CPU. Each provider that fails to load is dropped and the next one down is tried.

Confirm

Check which provider the running service actually initialised:

docker compose -f docker-compose.release.yml logs inference | grep -E "Active EP|EP enabled|resolved to"

TensorRT EP enabled → top tier, nothing to do.
CUDA EP enabled while you expected TensorRT → the TensorRT provider was dropped (missing native parser library).
resolved to CPU only while the GPU works → both GPU providers were dropped.
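
The mapping above can be scripted for use in a health check. The following is a minimal sketch, assuming the three log phrases quoted above appear verbatim in the inference log; classify_tier is a hypothetical helper and not part of the product.

```shell
# classify_tier: read inference log text on stdin and print the tier named by
# the most recent provider line: tensorrt, cuda, cpu, or unknown.
# Hypothetical helper; the matched phrases are the ones quoted in this page.
classify_tier() {
  # Keep only the last matching line so older startups in the log are ignored.
  last=$(grep -E "TensorRT EP enabled|CUDA EP enabled|resolved to CPU only" | tail -n 1) || true
  case "$last" in
    *"TensorRT EP enabled"*)  echo tensorrt ;;
    *"CUDA EP enabled"*)      echo cuda ;;
    *"resolved to CPU only"*) echo cpu ;;
    *)                        echo unknown ;;
  esac
}

# Example:
#   docker compose -f docker-compose.release.yml logs inference | classify_tier
```

Reading only the last matching line matters because the log can hold several startups, and only the most recent one describes the running service.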

The default execution mode is auto, which is designed to degrade gracefully rather than fail. That safety net is also what hides a missing GPU library — a degraded box keeps running, just slower.
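
To make the difference between degrading and failing concrete, here is a toy model of the provider chain. It is an illustrative sketch, not the service's actual code: resolve_provider and its arguments are hypothetical, and it assumes that strict enforcement (the STRICT_EP setting described under Fix) only applies when a provider is pinned.

```shell
# resolve_provider: toy model of the provider chain (hypothetical, for illustration).
# usage: resolve_provider <mode> <strict> <available providers...>
#   mode:   auto | tensorrt | cuda | cpu
#   strict: 1 to refuse to degrade when a provider is pinned, 0 otherwise
resolve_provider() {
  mode=$1
  strict=$2
  shift 2
  available=" $* "   # pad with spaces so whole names can be matched

  case "$mode" in
    auto|tensorrt) chain="tensorrt cuda cpu" ;;
    cuda)          chain="cuda cpu" ;;
    *)             chain="cpu" ;;
  esac

  for ep in $chain; do
    case "$available" in
      *" $ep "*) echo "$ep"; return 0 ;;
    esac
    # Assumption: with a pinned mode and strict=1, the first tier must load.
    if [ "$strict" = "1" ] && [ "$mode" != "auto" ]; then
      echo "requested provider '$ep' unavailable" >&2
      return 1
    fi
  done

  echo "no provider available" >&2
  return 1
}

# resolve_provider auto 0 cuda cpu       # prints: cuda (TensorRT dropped silently)
# resolve_provider tensorrt 1 cuda cpu   # prints an error to stderr and returns 1
```

In the first call the missing TensorRT tier is invisible unless someone reads the logs; in the second it becomes a startup failure.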

Fix

Decide whether the box must run on GPU. If GPU is mandatory, make the fallback loud so the gap is visible instead of silent — pin the provider and enable strict enforcement on the inference service:
```
EXECUTION_MODE=tensorrt   # or: cuda
STRICT_EP=1
```
Recreate the inference container so the new variables take effect. The container now refuses to start, instead of degrading, if the requested provider is unavailable. That turns a silent slowdown into an obvious startup failure you can act on.
If the requested tier is TensorRT and it was dropped, the TensorRT provider’s native parser library is missing from the image. Use a GPU-enabled inference image profile (TensorRT) for this box rather than the CPU profile. Confirm the image tier matches your hardware tier before redeploying.
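
One way to see which native library is missing is to run ldd against the TensorRT provider library inside the container and list the dependencies it reports as not found. This is a minimal sketch, assuming the image contains sh, find, and ldd; missing_libs is a hypothetical helper, and the location of the provider library inside the image is not known here, so the example searches for it.

```shell
# missing_libs: read `ldd` output on stdin and print the name of every
# shared library that the dynamic loader could not resolve.
# Hypothetical helper; relies only on ldd's "name => not found" format.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Example (run on the host; assumes sh, find, and ldd exist in the image):
#   docker compose -f docker-compose.release.yml exec -T inference \
#     sh -c 'ldd "$(find / -name libonnxruntime_providers_tensorrt.so 2>/dev/null | head -n 1)"' \
#     | missing_libs
```

If find returns nothing, ldd fails on the empty path, which suggests the provider library itself is absent from the image. An empty result from missing_libs means every dependency resolved and the cause lies elsewhere.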

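Recreate the container and re-confirm the active provider. Use up -d rather than restart, because docker compose restart does not pick up changed environment variables or a different image:

docker compose -f docker-compose.release.yml up -d inference
docker compose -f docker-compose.release.yml logs inference | grep -E "Active EP|EP enabled|resolved to"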

Prevent

Match the image profile to the hardware. The inference image ships in CPU, TensorRT, and Jetson profiles. Deploying the CPU profile on a GPU box caps you at CPU by construction — there is no GPU provider in that image to fall back from.
Pin + strict on GPU-mandatory boxes. EXECUTION_MODE=auto is the right default for mixed fleets, but on a box that must use the GPU, EXECUTION_MODE=<tier> + STRICT_EP=1 converts an invisible degradation into a fail-fast startup error.
Watch the provider chip after every deploy. The dashboard’s Python Inference card reflects the live provider; treat a tier drop there as a deployment defect, not normal variance.
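
The post-deploy check can also be automated as a gate that fails when the active tier is lower than the one the box was provisioned for. This is a minimal sketch: tier_rank and check_tier are hypothetical helpers, and the active tier is assumed to be one of tensorrt, cuda, or cpu as read from the logs or the dashboard.

```shell
# tier_rank: map a tier name to a number so tiers can be compared
# (higher is faster). Unknown names rank below every real tier.
tier_rank() {
  case "$1" in
    tensorrt) echo 2 ;;
    cuda)     echo 1 ;;
    cpu)      echo 0 ;;
    *)        echo -1 ;;
  esac
}

# check_tier: succeed only if the active tier is at least the expected tier.
# usage: check_tier <expected> <active>
check_tier() {
  expected=$(tier_rank "$1")
  active=$(tier_rank "$2")
  if [ "$active" -lt "$expected" ]; then
    echo "tier drop: expected $1, running on $2" >&2
    return 1
  fi
}

# Example in a deploy script:
#   check_tier tensorrt cuda || exit 1   # fails the deploy: CUDA is below TensorRT
```

Treating an unrecognised active tier as the lowest rank means a box whose provider cannot be determined fails the gate rather than passing by default.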

Related

GPU not used — CUDA error 500: when the host GPU compute layer is actually broken.
Hardware Setup: which GPU tier maps to which image profile.
Observability & Alerts: reading the active provider and latency.
