# Running on CPU when GPU expected

> Inference degrades to a lower execution provider (TensorRT → CUDA → CPU) though the GPU works.

## Symptom

The host GPU works (the CUDA compute sample passes), but inference is slower than expected and the **Python Inference** card shows a lower tier than you provisioned:

- You expect TensorRT but see CUDA, or
- You expect a GPU provider but see CPU (fallback) or CPU.

The inference container log often contains a benign-looking line such as:

```
Failed to load library libonnxruntime_providers_tensorrt.so
```

This is the provider chain degrading one tier at a time: **TensorRT → CUDA → CPU**.

This page is for when the host GPU is healthy but the provider still drops a tier.
If the container logs `CUDA failure 500`, the GPU itself is broken — see
[GPU not used — CUDA error 500](/troubleshooting/gpu-cuda-error-500/) first.

## Confirm

Check which provider the running service actually initialised:

```bash
docker compose -f docker-compose.release.yml logs inference | grep -E "Active EP|EP enabled|resolved to"
```

- `TensorRT EP enabled` → top tier, nothing to do.
- `CUDA EP enabled` while you expected TensorRT → the TensorRT provider was dropped (missing native parser library).
- `resolved to CPU only` while the GPU works → both GPU providers were dropped.

The default execution mode is `auto`, which is *designed* to degrade gracefully rather than fail. That safety net is also what hides a missing GPU library — a degraded box keeps running, just slower.

## Fix

1. Decide whether the box **must** run on GPU. If GPU is mandatory, make the fallback loud so the gap is visible instead of silent — pin the provider and enable strict enforcement on the inference service:

   ```
   EXECUTION_MODE=tensorrt   # or: cuda
   STRICT_EP=1
   ```

   Restart inference. The container now refuses to start (instead of degrading) if the requested provider is unavailable — turning a silent slowdown into an obvious startup failure you can act on.

2. If the requested tier is **TensorRT** and it was dropped, the TensorRT provider's native parser library is missing from the image. Use a GPU-enabled inference image profile (TensorRT) for this box rather than the CPU profile. Confirm the image tier matches your hardware tier before redeploying.

3. Restart and re-confirm the active provider:

   ```bash
   docker compose -f docker-compose.release.yml restart inference
   docker compose -f docker-compose.release.yml logs inference | grep "EP enabled"
   ```

## Prevent

- **Match the image profile to the hardware.** The inference image ships in CPU, TensorRT, and Jetson profiles. Deploying the CPU profile on a GPU box caps you at CPU by construction — there is no GPU provider in that image to fall back *from*.
- **Pin + strict on GPU-mandatory boxes.** `EXECUTION_MODE=auto` is the right default for mixed fleets, but on a box that must use the GPU, `EXECUTION_MODE=<tier>` + `STRICT_EP=1` converts an invisible degradation into a fail-fast startup error.
- **Watch the provider chip after every deploy.** The dashboard's Python Inference card reflects the live provider; treat a tier drop there as a deployment defect, not normal variance.

## Related

- [GPU not used — CUDA error 500](/troubleshooting/gpu-cuda-error-500/) — when the host GPU compute layer is actually broken.
- [Hardware Setup](/install-deploy/hardware-setup/) — which GPU tier maps to which image profile.
- [Observability & Alerts](/operate/monitoring/) — reading the active provider and latency.