Monitoring

The Inference Stats dashboard shows how your runtime is performing in real time: latency, throughput, error rate, and where time is spent in the pipeline. It updates continuously while a datasource is streaming.

  • Latency percentiles — p50 / p95 / p99 of end-to-end inference time.
  • Throughput — predictions per second.
  • Error rate — share of inferences that failed.
  • Execution provider — whether inference is running on TensorRT, CUDA, or CPU. A GPU box that falls back to CPU shows up here.
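As a rough illustration of how the first three figures relate to raw samples, here is a minimal sketch. The `Sample` record, the `summarize` helper, and the nearest-rank percentile method are assumptions made for this example; the dashboard's actual aggregation is not documented here and may differ (for instance, it may interpolate percentiles or exclude failed inferences from latency).

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record for one inference (not a documented schema)."""
    latency_ms: float  # end-to-end inference time
    ok: bool           # False if the inference failed


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, window_seconds):
    """Aggregate one window of samples into dashboard-style figures."""
    # Assumption: failed inferences still contribute a latency value.
    latencies = [s.latency_ms for s in samples]
    failed = sum(1 for s in samples if not s.ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "throughput_per_s": len(samples) / window_seconds,
        "error_rate": failed / len(samples),
    }
```

A wide gap between p50 and p99 in such a summary points to a slow tail rather than uniformly slow inference.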

Each inference is decomposed so you can see where time goes:

  • Queue wait — time spent waiting in the request queue
  • Pre-process — input normalization
  • Model exec — pure inference time
  • Post-process — output decoding

If latency rises, the breakdown tells you whether the model itself slowed down or the box is saturated upstream.
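One way to read the breakdown is to compare it against a known-good baseline and see which stage grew the most. The sketch below assumes a breakdown is available as a plain mapping from stage name to milliseconds; the key names and the `regressed_stage` helper are hypothetical and are not part of any documented export format.

```python
# Hypothetical stage keys mirroring the four stages listed above.
STAGES = ("queue_wait", "pre_process", "model_exec", "post_process")


def stage_shares(breakdown_ms):
    """Return each stage's fraction of the total time."""
    total = sum(breakdown_ms[stage] for stage in STAGES)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {stage: breakdown_ms[stage] / total for stage in STAGES}


def regressed_stage(baseline_ms, current_ms):
    """Return the stage whose time increased the most since the baseline."""
    return max(STAGES, key=lambda stage: current_ms[stage] - baseline_ms[stage])


baseline = {"queue_wait": 2.0, "pre_process": 3.0, "model_exec": 20.0, "post_process": 1.0}
current = {"queue_wait": 40.0, "pre_process": 3.0, "model_exec": 21.0, "post_process": 1.0}

# Queue wait grew while model exec stayed nearly flat, which suggests
# upstream saturation rather than a slower model.
print(regressed_stage(baseline, current))
```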

Two controls shape the time view:

  • Window — how far back you look (for example 5m, 1h, 24h).
  • Bucket — how wide each point on the chart is (for example 10s, 1m).

The number of points is window ÷ bucket. The dashboard auto-snaps the bucket when you change the window so charts stay readable — wider windows use wider buckets.
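The arithmetic can be sketched as follows. The duration parser and `point_count` follow directly from the window ÷ bucket rule; the snapping policy (`CANDIDATE_BUCKETS` and `MAX_POINTS`) is purely an assumption for illustration, since the dashboard's actual snapping thresholds are not stated here.

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert strings such as '10s', '5m', '1h', or '24h' to seconds."""
    value, unit = text[:-1], text[-1:]
    if unit not in UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * UNITS[unit]


def point_count(window, bucket):
    """Number of full buckets that fit in the window."""
    return parse_duration(window) // parse_duration(bucket)


# Assumed policy, not the dashboard's real one: pick the narrowest
# candidate bucket that keeps the chart at or below MAX_POINTS points.
CANDIDATE_BUCKETS = ("10s", "1m", "5m", "15m", "1h")
MAX_POINTS = 120


def snap_bucket(window):
    for bucket in CANDIDATE_BUCKETS:
        if point_count(window, bucket) <= MAX_POINTS:
            return bucket
    return CANDIDATE_BUCKETS[-1]
```

Under these assumptions, a 1h window with a 10s bucket would yield 360 points, so the sketch snaps it to 1m (60 points); a 24h window snaps to 15m (96 points).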

If the chart says “no data”, the most common reasons are:

  • No datasource is streaming yet — enable and Start a source.
  • The install is still warming up — wait for the first window of samples.

If charts stay empty, inference unexpectedly falls back to CPU, or the error rate keeps rising, see the Troubleshooting runbook.