Monitoring

The Inference Stats dashboard shows how your runtime is performing in real time: latency, throughput, error rate, and where time is spent in the pipeline. It updates continuously while a datasource is streaming.

  • Latency percentiles — p50 / p95 / p99 of end-to-end inference time.
  • Throughput — predictions per second.
  • Error rate — share of inferences that failed.
  • Execution provider — whether inference is running on TensorRT, CUDA, or CPU. A GPU box that falls back to CPU shows up here.
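As a rough illustration of how the first three numbers relate to raw samples, here is a minimal sketch. It is not the dashboard's actual implementation: the `Sample` record, the nearest-rank percentile method, and the choice to count failed inferences in latency and throughput are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical per-inference record; field names are illustrative."""
    latency_ms: float  # end-to-end inference time
    ok: bool           # False if the inference failed


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (p in 1..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(samples, window_seconds):
    """Summarize one window of samples into dashboard-style numbers."""
    latencies = sorted(s.latency_ms for s in samples)
    failed = sum(1 for s in samples if not s.ok)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        # Assumption: every inference in the window counts, failed or not.
        "throughput": len(samples) / window_seconds,
        "error_rate": failed / len(samples),
    }
```

The gap between p50 and p99 is usually the interesting part: a stable median with a growing p99 suggests occasional slow inferences rather than a uniformly slower pipeline.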

Each inference is decomposed so you can see where time goes:

  • Queue wait — time spent waiting in the request queue
  • Pre-process — input normalization
  • Model exec — pure inference time
  • Post-process — output decoding

If latency rises, the breakdown tells you whether the model itself slowed down (Model exec grows) or the box is saturated upstream (Queue wait grows).
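The same reasoning can be sketched in code: average each stage over a baseline period and a current period, then see which stage grew the most. The stage keys and the dict-per-inference record shape are hypothetical; the dashboard does not necessarily expose its data this way.

```python
# Hypothetical stage keys mirroring the four stages listed above.
STAGES = ("queue_wait", "pre_process", "model_exec", "post_process")


def mean_breakdown(records):
    """Average milliseconds per stage over a list of per-inference dicts."""
    n = len(records)
    return {stage: sum(r[stage] for r in records) / n for stage in STAGES}


def biggest_regression(baseline, current):
    """Return (stage, delta_ms) for the stage whose mean time grew the most."""
    base = mean_breakdown(baseline)
    cur = mean_breakdown(current)
    deltas = {stage: cur[stage] - base[stage] for stage in STAGES}
    stage = max(deltas, key=deltas.get)
    return stage, deltas[stage]
```

If `biggest_regression` returns `queue_wait`, requests are piling up before they reach the model; if it returns `model_exec`, the model or its execution provider is the place to look.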

Two controls shape the time view:

  • Window — how far back you look (for example 5m, 1h, 24h).
  • Bucket — how wide each point on the chart is (for example 10s, 1m).

The number of points is window ÷ bucket: a 1h window with 1m buckets draws 60 points. The dashboard auto-snaps the bucket when you change the window so charts stay readable, which means wider windows use wider buckets.
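The window ÷ bucket arithmetic, plus one possible snapping rule, can be sketched as follows. The candidate bucket list and the 120-point cap are assumptions chosen for illustration; the dashboard's real snapping rule is not documented here.

```python
UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(text):
    """Parse a duration such as '10s', '5m', or '24h' into seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def point_count(window, bucket):
    """Number of chart points: window divided by bucket."""
    return to_seconds(window) // to_seconds(bucket)


# Assumed values, not taken from the product.
CANDIDATE_BUCKETS = ("10s", "1m", "5m", "1h")
MAX_POINTS = 120


def snap_bucket(window):
    """Pick the narrowest candidate bucket that keeps the chart under MAX_POINTS."""
    for bucket in CANDIDATE_BUCKETS:
        if point_count(window, bucket) <= MAX_POINTS:
            return bucket
    return CANDIDATE_BUCKETS[-1]
```

Under these assumed values, a short window keeps fine-grained buckets while a 24h window snaps to a much wider one, which matches the behavior described above.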

If the chart says “no data”, the most common reasons are:

  • No datasource is streaming yet — enable and Start a source.
  • The install is still warming up — wait for the first window of samples.

For persistently empty charts, unexpected CPU fallback, or a rising error rate, see the Troubleshooting runbook.