Diagnosing Disk Performance Issues
September 5, 2019

Disk performance issues can be hard to track down but can also cause a wide variety of problems. The disk performance counters available in Windows are numerous, and being able to select the right counters for a given situation is a great troubleshooting skill. Here, we'll review two basic scenarios: measuring overall disk performance and determining whether the disks are a bottleneck.

Measuring Disk Performance

When it comes to disk performance, there are two important considerations: IOPS and byte throughput. IOPS is the raw number of disk operations performed per second. Byte throughput is the effective bandwidth the disk is achieving, usually expressed in MB/s. These numbers are closely related: a disk that can sustain more IOPS can generally provide better throughput.

These can be measured in perfmon with the following counters:

Disk Transfers/sec: Total number of IOPS. This should be about equal to Disk Reads/sec + Disk Writes/sec.
Disk Reads/sec: Disk read operations per second (IOPS that are read operations).
Disk Writes/sec: Disk write operations per second (IOPS that are write operations).
Disk Bytes/sec: Total disk throughput per second. This should be about equal to Disk Read Bytes/sec + Disk Write Bytes/sec.
Disk Read Bytes/sec: Disk read throughput per second.
Disk Write Bytes/sec: Disk write throughput per second.

These performance counters are available in both the LogicalDisk and PhysicalDisk categories. In a standard setup, with a 1:1 disk-to-partition mapping, both categories provide the same results. However, if you have a more advanced setup with storage pools, spanned disks, or multiple partitions on a single disk, you need to choose the category that matches the part of the storage stack you are measuring.

Here are the results on a test VM. In this test, diskspd was used to simulate an average mixed read/write workload. The results show the following:

- 3,610 total IOPS
- 2,872 read IOPS
- 737 write IOPS
- 17.1 MB/s total throughput
- 11.2 MB/s read throughput
- 5.9 MB/s write throughput

In this case, we're seeing a decent number of IOPS with fairly low throughput. The expected results vary greatly depending on the underlying storage and the type of workload that is running. In any case, you can use these counters to get an idea of how a disk is performing during real-world usage.
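If you want to capture these counters outside of the perfmon GUI, the same data can be pulled from the command line with typeperf, which ships with Windows. The sketch below is just one way to wrap that in a script; it assumes a Windows machine with Python installed, English-language counter names, and uses the _Total instance plus illustrative sample counts and intervals that you would tune for your own environment.

```python
# Minimal sketch: collect the IOPS and throughput counters discussed above
# for a short window via typeperf and print the averages.
import csv
import io
import subprocess

# PhysicalDisk(_Total) aggregates all disks; swap in a specific instance to target one disk.
COUNTERS = [
    r"\PhysicalDisk(_Total)\Disk Transfers/sec",
    r"\PhysicalDisk(_Total)\Disk Reads/sec",
    r"\PhysicalDisk(_Total)\Disk Writes/sec",
    r"\PhysicalDisk(_Total)\Disk Bytes/sec",
    r"\PhysicalDisk(_Total)\Disk Read Bytes/sec",
    r"\PhysicalDisk(_Total)\Disk Write Bytes/sec",
]

def sample_counters(samples=10, interval=1):
    """Run typeperf and return {counter_path: average value over the samples}."""
    cmd = ["typeperf"] + COUNTERS + ["-si", str(interval), "-sc", str(samples)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # typeperf emits quoted CSV; keep only rows with a timestamp plus one column per counter.
    rows = [r for r in csv.reader(io.StringIO(out)) if len(r) == len(COUNTERS) + 1]
    header, data = rows[0], rows[1:]
    averages = {}
    for i, name in enumerate(header[1:], start=1):
        values = [float(r[i]) for r in data if r[i].strip()]
        averages[name] = sum(values) / len(values) if values else 0.0
    return averages

if __name__ == "__main__":
    for counter, value in sample_counters().items():
        if "Bytes" in counter:
            # Convert the byte counters to MB/s for readability.
            print(f"{counter}: {value / (1024 * 1024):.1f} MB/s")
        else:
            print(f"{counter}: {value:.0f}")
```

Averaging over a handful of samples smooths out the burstiness that individual one-second readings tend to show, which makes the numbers easier to compare against a diskspd run or a baseline.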
Disk Bottlenecks

Determining whether storage is a performance bottleneck relies on a different set of counters than the above. Instead of looking at IOPS and throughput, latency and queue lengths need to be checked. Latency is the amount of time it takes to get a piece of requested data back from the disk and is measured in milliseconds (ms). Queue length refers to the number of outstanding IO requests waiting to be sent to the disk, measured as an absolute number of requests.

The specific perfmon counters are:

Avg. Disk sec/Transfer: The average number of seconds it takes to get a response from the disk. This is the total latency.
Avg. Disk sec/Read: The average number of seconds it takes to get a response from the disk for read operations. This is read latency.
Avg. Disk sec/Write: The average number of seconds it takes to get a response from the disk for write operations. This is write latency.
Current Disk Queue Length: The current number of IO requests in the queue waiting to be sent to the storage system.
Avg. Disk Read Queue Length: The average number of read IO requests in the queue waiting to be sent to the storage system. The average is taken over the perfmon sample interval (default of 1 second).
Avg. Disk Write Queue Length: The average number of write IO requests in the queue waiting to be sent to the storage system. The average is taken over the perfmon sample interval (default of 1 second).

Here are the results on a test VM. In this test, diskspd was used to simulate an IO-intensive read/write workload. Here is what the test shows:

- Total disk latency: 42 ms (the counter reads 0.042 seconds, which is 42 milliseconds)
- Read latency: 5 ms
- Write latency: 80 ms
- Total disk queue: 48
- Read queue: 2.7
- Write queue: 45

These results show that the disk is clearly a bottleneck and is underperforming for the workload. Both the write latency and the write queue are very high. If this were a real environment, we would be digging deeper into the storage to see where the issue is. It could be that there is a problem on the storage side (like a bad drive or a misconfiguration), or that the storage is simply too slow to handle the workload.

Generally speaking, the results can be interpreted with the following guidelines:

- Disk latency should be below 15 ms. Latency above 25 ms can cause noticeable performance issues, and latency above 50 ms is indicative of extremely underperforming storage.
- Disk queues should be no greater than twice the number of physical disks serving the drive. For example, if the underlying storage is a 6-disk RAID 5 array, the total disk queue should be 12 or less. For storage that isn't mapped directly to an array (such as in a private cloud or in Azure), queues should be below 10 or so. Queue length isn't directly indicative of performance issues on its own, but it can help lead to that conclusion.

These are general rules and may not apply in every scenario. However, if you see the counters exceeding the thresholds above, it warrants a deeper investigation.

General Troubleshooting Process

If a disk performance issue is suspected to be causing a larger problem, we generally start by running the second set of counters above. This determines whether the storage is actually a bottleneck or the problem is being caused by something else. If the counters indicate that the disk is underperforming, we then run the first set of counters to see how many IOPS and how much throughput we are getting. From there, we determine whether the storage is under-spec'ed or there is a problem on the storage side. In an on-premises environment, that would be done by working with the storage team. In Azure, we would review the disk configuration to see whether we're getting the advertised performance.
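To make that first pass quicker, the typeperf approach from the earlier sketch can be pointed at the bottleneck counters instead. This is again only an illustrative sketch under the same assumptions (Windows, Python, English counter names), and the thresholds in the comments are just the rules of thumb from the list above, not hard limits.

```python
# Minimal sketch: average the latency and queue counters from the
# "Disk Bottlenecks" section and flag values that cross the rough thresholds.
import csv
import io
import subprocess

COUNTERS = [
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Transfer",
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Read",
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Write",
    r"\PhysicalDisk(_Total)\Current Disk Queue Length",
    r"\PhysicalDisk(_Total)\Avg. Disk Read Queue Length",
    r"\PhysicalDisk(_Total)\Avg. Disk Write Queue Length",
]

def average_counters(counters, samples=15, interval=1):
    """Collect the counters via typeperf and return the average of each."""
    cmd = ["typeperf"] + counters + ["-si", str(interval), "-sc", str(samples)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    rows = [r for r in csv.reader(io.StringIO(out)) if len(r) == len(counters) + 1]
    header, data = rows[0], rows[1:]   # first column of each row is the timestamp
    averages = {}
    for i, name in enumerate(header[1:], start=1):
        values = [float(r[i]) for r in data if r[i].strip()]
        averages[name] = sum(values) / len(values) if values else 0.0
    return averages

if __name__ == "__main__":
    for name, value in average_counters(COUNTERS).items():
        if "sec/" in name:
            # Latency counters are reported in seconds; convert to ms and compare
            # against the ~15 ms / ~25 ms / ~50 ms rules of thumb above.
            ms = value * 1000
            verdict = "OK" if ms < 15 else "borderline" if ms < 25 else "investigate"
            print(f"{name}: {ms:.1f} ms ({verdict})")
        else:
            # Queue lengths: compare against roughly 2x the number of physical
            # disks behind the volume, or ~10 for cloud/virtualized storage.
            print(f"{name}: {value:.1f}")
```

If the latency numbers come back high, that is the cue to move on to the IOPS and throughput counters from the first section and then dig into the storage side itself.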