Running on CPU when GPU expected
Symptom
The host GPU works (the CUDA compute sample passes), but inference is slower than expected and the Python Inference card shows a lower tier than you provisioned:
- You expect TensorRT but see CUDA, or
- You expect a GPU provider but see CPU (fallback) or CPU.
The inference container log often contains a benign-looking line such as:
    Failed to load library libonnxruntime_providers_tensorrt.so
This is the provider chain degrading one tier at a time: TensorRT → CUDA → CPU.
Confirm
Check which provider the running service actually initialised:
    docker compose -f docker-compose.release.yml logs inference | grep -E "Active EP|EP enabled|resolved to"
- TensorRT EP enabled → top tier, nothing to do.
- CUDA EP enabled while you expected TensorRT → the TensorRT provider was dropped (missing native parser library).
- resolved to CPU only while the GPU works → both GPU providers were dropped.
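To script this check, the three log markers can be mapped to a tier name. The following is a minimal sketch, not part of the product: the function name active_tier is hypothetical, and it assumes the log wording is exactly as quoted in this section and that the last matching line reflects the current state.

```shell
# Hypothetical helper: read inference logs on stdin and print the tier
# implied by the LAST matching marker line (tensorrt, cuda, cpu, or unknown).
# Assumption: the marker strings are exactly those quoted in this section.
#
# Usage:
#   docker compose -f docker-compose.release.yml logs inference | active_tier
active_tier() {
  local line tier="unknown"
  # "|| [ -n "$line" ]" keeps a final line that has no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *"TensorRT EP enabled"*)  tier="tensorrt" ;;
      *"CUDA EP enabled"*)      tier="cuda" ;;
      *"resolved to CPU only"*) tier="cpu" ;;
    esac
  done
  printf '%s\n' "$tier"
}
```

Printing unknown rather than guessing keeps a log without any marker (for example, a container that has not finished starting) distinct from a genuine CPU fallback.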
The default execution mode is auto, which is designed to degrade gracefully rather than fail. That safety net is also what hides a missing GPU library — a degraded box keeps running, just slower.
- Decide whether the box must run on GPU. If GPU is mandatory, make the fallback loud so the gap is visible instead of silent. Pin the provider and enable strict enforcement on the inference service:
      EXECUTION_MODE=tensorrt # or: cuda
      STRICT_EP=1
  Restart inference. The container now refuses to start (instead of degrading) if the requested provider is unavailable, turning a silent slowdown into an obvious startup failure you can act on.
- If the requested tier is TensorRT and it was dropped, the TensorRT provider’s native parser library is missing from the image. Use a GPU-enabled inference image profile (TensorRT) for this box rather than the CPU profile. Confirm the image tier matches your hardware tier before redeploying.
- Restart and re-confirm the active provider:
      docker compose -f docker-compose.release.yml restart inference
      docker compose -f docker-compose.release.yml logs inference | grep "EP enabled"
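The difference between degrading and failing fast can be pictured with a small model. This is an illustrative sketch only, not the service's actual code: resolve_provider and its arguments are hypothetical. It assumes that auto walks the chain TensorRT → CUDA → CPU, that a pinned mode without STRICT_EP=1 degrades the same way starting from the requested tier, that strict enforcement only matters for a pinned mode, and that CPU is always available.

```shell
# Toy model (hypothetical, NOT the service's code) of provider resolution.
# Usage: resolve_provider MODE STRICT [AVAILABLE...]
#   MODE       auto | tensorrt | cuda | cpu
#   STRICT     1 = fail instead of degrading; anything else = degrade
#   AVAILABLE  GPU providers that loaded (cpu is assumed always present)
resolve_provider() {
  local mode="$1" strict="$2"
  shift 2
  local available=" $* cpu "
  local chain tier
  case "$mode" in
    auto|tensorrt) chain="tensorrt cuda cpu" ;;
    cuda)          chain="cuda cpu" ;;
    cpu)           chain="cpu" ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  for tier in $chain; do
    case "$available" in
      *" $tier "*) echo "$tier"; return 0 ;;
    esac
    # Pinned + strict: the requested (first) tier is missing, so fail fast
    # instead of moving down the chain.
    if [ "$mode" != "auto" ] && [ "$strict" = "1" ]; then
      echo "requested provider '$tier' unavailable" >&2
      return 1
    fi
  done
}
```

In this model, resolve_provider tensorrt 0 cuda prints cuda (a silent drop), while resolve_provider tensorrt 1 cuda prints an error and returns non-zero, which is the startup failure the strict setting is meant to produce.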
Prevent
- Match the image profile to the hardware. The inference image ships in CPU, TensorRT, and Jetson profiles. Deploying the CPU profile on a GPU box caps you at CPU by construction: there is no GPU provider in that image to fall back from.
- Pin + strict on GPU-mandatory boxes. EXECUTION_MODE=auto is the right default for mixed fleets, but on a box that must use the GPU, EXECUTION_MODE=<tier> + STRICT_EP=1 converts an invisible degradation into a fail-fast startup error.
- Watch the provider chip after every deploy. The dashboard’s Python Inference card reflects the live provider; treat a tier drop there as a deployment defect, not normal variance.
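Treating a tier drop as a deployment defect can be automated as a post-deploy gate. The following is a sketch under stated assumptions, not a shipped tool: tier_rank and require_tier are hypothetical names, the tier names tensorrt and cuda follow the EXECUTION_MODE values used on this page, and cpu is an assumed name for the lowest tier.

```shell
# Hypothetical post-deploy gate: succeed only if the tier the service is
# actually running on is at least the tier provisioned for this box.
tier_rank() {
  case "$1" in
    tensorrt) echo 2 ;;
    cuda)     echo 1 ;;
    cpu)      echo 0 ;;
    *)        return 1 ;;   # unrecognised tier name
  esac
}

# Usage: require_tier DETECTED EXPECTED
# Returns 0 if DETECTED >= EXPECTED, 1 on a tier drop, 2 on a bad tier name.
require_tier() {
  local detected expected
  detected="$(tier_rank "$1")" || { echo "unknown tier: $1" >&2; return 2; }
  expected="$(tier_rank "$2")" || { echo "unknown tier: $2" >&2; return 2; }
  if [ "$detected" -lt "$expected" ]; then
    echo "tier drop: running on $1, expected $2" >&2
    return 1
  fi
}
```

A deploy script could call require_tier with the tier read from the logs or the dashboard and the tier the box was provisioned for, and abort the rollout on a non-zero exit status.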
Related
- GPU not used — CUDA error 500 — when the host GPU compute layer is actually broken.
- Hardware Setup — which GPU tier maps to which image profile.
- Observability & Alerts — reading the active provider and latency.