I was making a dashboard in Grafana for our service when I came across the term "99th percentile response time".


Say the 99th percentile response time is 100 ms. That means that out of 100 requests, 99 completed in 100 ms or less, and only the slowest 1 took longer.

How to think about it?

You log all the requests and their response times, then sort them in ascending order of the time they took to complete. For the 95th percentile, take the value at or below which 95% of the requests fall. Put another way: drop the slowest 5% of the requests and take the max of what remains. This value is your 95th percentile response time.
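The sort-and-pick steps above can be sketched in a few lines of Python. This is only a minimal illustration using the nearest-rank method, which is one of several common ways to define a percentile; monitoring tools such as Grafana data sources may estimate it differently (for example from histogram buckets). The `percentile` helper and the sample numbers are made up for the example.

```python
import math

def percentile(response_times_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(response_times_ms)           # ascending: fastest first
    rank = math.ceil(p * len(ordered) / 100)      # 1-based position in the sorted list
    return ordered[rank - 1]

# Hypothetical sample: 20 response times in milliseconds.
samples = [12, 15, 11, 14, 13, 18, 16, 17, 19, 20,
           21, 22, 25, 23, 24, 30, 28, 35, 90, 400]

p50 = percentile(samples, 50)   # 20
p95 = percentile(samples, 95)   # 90: the slowest 5% (one request, 400 ms) is excluded
p99 = percentile(samples, 99)   # 400
print(p50, p95, p99)
```

With only 20 samples, the slowest 5% is a single request, so the 95th percentile is the second-slowest value.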

Why is this better?

Mean and median are not very good at showing outliers. Say a traffic peak hits my service at 9 AM and delays some of the responses. During that window, the Nth percentile response time shoots up far more than the mean or the median does. The 99th percentile can easily be something like 30 times worse than the median. Also, in a distributed system, a single incoming request can fan out into a lot of downstream requests. If any one of those downstream requests hits a service with a bad 99th percentile delay, the whole request becomes slow, and the more downstream calls you make, the more likely that is to happen.
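To get a feel for the fan-out effect, here is a rough back-of-the-envelope calculation. It assumes every downstream call independently has a 1% chance of landing in the slow tail (beyond the 99th percentile), which is a simplification since real delays are often correlated. The function name is invented for this sketch.

```python
def chance_of_hitting_tail(downstream_calls, tail_fraction=0.01):
    """Probability that at least one call lands in the slow tail, assuming independent calls."""
    return 1 - (1 - tail_fraction) ** downstream_calls

for calls in (1, 10, 100):
    print(calls, round(chance_of_hitting_tail(calls), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

Under these assumptions, a request that fans out into 100 downstream calls hits the slow tail at least once about 63% of the time, so the "rare" 99th percentile delay becomes the common case.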

So this is a better metric to track. It shows the worst case among the fastest X percent of requests, which exposes the slow outliers that the mean and median are not able to show.
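A small made-up data set shows the difference. Assume 100 requests where 97 take 20 ms and 3 take 600 ms; the numbers are invented purely for illustration, and the 99th percentile is computed here with the simple nearest-rank method.

```python
import math
import statistics

# Hypothetical data: 97 fast requests at 20 ms and 3 slow ones at 600 ms.
response_times_ms = [20] * 97 + [600] * 3

ordered = sorted(response_times_ms)
# Nearest-rank 99th percentile: the value at 1-based position ceil(0.99 * n).
p99 = ordered[math.ceil(99 * len(ordered) / 100) - 1]

mean = statistics.mean(response_times_ms)
median = statistics.median(response_times_ms)

print(f"median={median} ms, mean={mean} ms, p99={p99} ms")
```

Here the median stays at 20 ms and the mean only moves to 37.4 ms, while the 99th percentile is 600 ms, which is 30 times the median. The slow requests are clearly visible in the percentile and almost invisible in the other two.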

