Understanding and interpreting CPU Steal Time for your VPS

Virtual machines (VMs) report several usage metrics, such as server load, memory usage, and steal time. People often ask what steal time is and why it is reported on their virtual machines. This article explains steal time and what it means for your virtual machine.

What is CPU Steal Time?

The definition of CPU Steal Time is “The percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.”

How does a Virtual Machine work with regards to CPU time?

In a cloud environment, the hypervisor is the interface between the physical server and the virtualized environment. The hypervisor kernel manages all tasks by scheduling the processes to the physical cores of the server. Processes such as virtual machines, networking operations and storage I/O requests are given CPU time to process jobs. CPU time is allocated between these processes.

Your virtual machine (VM) shares resources with other instances on a single host in a virtualized environment. One of the resources it shares is CPU cycles. Usually the issue is not that you receive less CPU time than you should; in many cases you can even soak up spare CPU cycles beyond your allocated size. In those cases, however, you won’t see any CPU steal time.

How can I see CPU Steal Time?

In a modern hosting environment, a small amount of steal time is unavoidable, especially with shared cloud hosting. However, the steal time virtual machines experience is not always visible from outside the virtualized operating system.

You can monitor processes and resource usage using the top command on your Linux server. The output shows a whole host of metrics and steal time is labeled as ‘st’.

Steal time can also be monitored with the iostat, vmstat, and sar commands.
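
On Linux, these tools take their figures from the kernel counters in /proc/stat. As a rough sketch (not part of any Tilaa tooling), the Python below computes a steal percentage from two snapshots of the aggregate cpu line. The function names and the two sample snapshots are made up for illustration; the field order follows the proc(5) manual page.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line as a dict of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time counted as steal between two samples."""
    # guest and guest_nice are already included in user and nice,
    # so only the first eight fields go into the total.
    keys = FIELDS[:8]
    total = sum(after.get(k, 0) - before.get(k, 0) for k in keys)
    if total <= 0:
        return 0.0
    stolen = after.get("steal", 0) - before.get("steal", 0)
    return 100.0 * stolen / total


# Two hypothetical snapshots, invented for this example.
SAMPLE_1 = ("cpu  1000 0 500 8000 100 0 50 350 0 0\n"
            "cpu0 1000 0 500 8000 100 0 50 350 0 0\n")
SAMPLE_2 = ("cpu  1300 0 600 8500 100 0 50 450 0 0\n"
            "cpu0 1300 0 600 8500 100 0 50 450 0 0\n")

if __name__ == "__main__":
    pct = steal_percent(parse_cpu_line(SAMPLE_1), parse_cpu_line(SAMPLE_2))
    print(f"steal: {pct:.1f}%")
```

On a real server you could pass in the contents of /proc/stat read a few seconds apart instead of the sample strings.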

On virtual machines, steal time is reported alongside idle time. Idle time means the hypervisor has allocated CPU time that the virtual machine is not using. As long as there is idle time, any steal time is not affecting performance.

When the idle time percentage is 0 and steal time stays elevated over a longer period of time, it is safe to assume that processes on the virtual machine are being handled with some delay.
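
The rule of thumb above can be sketched as a small check over a series of readings. This is only an illustration: the function name and both thresholds (idle_floor and steal_floor) are assumptions, not values we prescribe, and you would tune them to what is normal for your VPS.

```python
def steal_is_hurting(samples, idle_floor=1.0, steal_floor=5.0):
    """True if every sample shows almost no idle time and noticeable steal.

    samples: list of (idle_percent, steal_percent) tuples taken at regular
    intervals, for example from top or vmstat output.
    Both thresholds are illustrative assumptions.
    """
    if not samples:
        return False
    return all(idle <= idle_floor and steal >= steal_floor
               for idle, steal in samples)


# Idle time is available, so the steal time is not a concern.
relaxed = [(40.0, 8.0), (35.0, 9.0), (42.0, 7.5)]
# No idle time left and steal stays high: processes are likely delayed.
starved = [(0.0, 12.0), (0.5, 15.0), (0.0, 11.0)]

if __name__ == "__main__":
    print(steal_is_hurting(relaxed))
    print(steal_is_hurting(starved))
```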

What is causing CPU Steal Time?

CPU steal time in the cloud is a bit more complex than in a traditional dedicated physical environment. Because the reporting tools in operating systems have not been adjusted for use on a VPS, in a shared environment, or on a virtual machine, reported CPU steal time can be a false positive. Still, when you see CPU steal time, it usually does mean that processes are running into some sort of resource constraint. The three most common situations are described in more detail below.

  1. You are using a smaller virtual core size
    The options for configuring a VPS are practically unlimited. You can select multiple cores and a CPU% to fit your needs. Having more CPU threads or more virtual cores can be an advantage depending on your requirements. When you create a VM with a CPU setup of 4 × 2.4 GHz @ 60%, the 60% is the upper limit to which you can use a CPU core. The core is not dedicated to your VPS only. However, the diagnostics within the operating system of the cloud server see the core size as the full physical size, so standard commands like top report metrics based on the wrong assumption. In this scenario you will always see steal time when you request more than your allowed 60% of a CPU core. You can counteract this by upgrading your VM’s CPU%. If it only happens incidentally, you can leave everything as it is.
  2. Your cloud server is overloaded due to processes on your side
    In this scenario processes on your VPS are bringing it close to (or even over) its maximum capacity. The CPU cycles allocated to your virtual server cannot handle the workload. You will see CPU steal time while processes are waiting to be handled by the hypervisor and are being queued to the virtual CPU. Normally this is a temporary overload of the system and no action is required. The CPU steal time should disappear after a few seconds or minutes when your load goes down. If you see a direct correlation between load-heavy processes on your system and CPU steal time over a longer period of time, you need a larger VM with more CPU resources. This can easily be changed in your my.tilaa. When your high-load processes are done, you can just as easily scale back down to your initial configuration.
  3. The physical server is overloaded and multiple virtual machines are competing for resources
    In this scenario multiple VPSs on the same host are running load-heavy processes and the physical CPU has trouble processing all of the requests in a timely fashion. This is quite exceptional, since we keep our hosts well below their maximum utilization levels. We also actively monitor the load on our systems, so if we see these kinds of metrics, we can migrate virtual machines to other physical nodes and bring load levels back to normal. If you notice high CPU steal time over a longer period of time and your own processes are not causing it, there may be a problem on our side. In that case you should definitely contact us. We will check the physical host, dive into your log files, and determine what might be going on. If there is no overload on our systems, that is probably not the root cause of the CPU steal time and we must investigate other possible causes.

Should I be worried about CPU Steal Time?

If steal time is greater than 10% (above the normal value) for around 20 minutes, the VM is likely running slower than it should.
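
As a minimal sketch of that rule, the Python below checks whether steal time stays more than 10 percentage points above a baseline for 20 minutes in a row. The function name, the baseline parameter (your VM's normal steal value) and the sample lists are hypothetical; each sample is assumed to be an average steal percentage over one sampling interval.

```python
def sustained_steal(samples, interval_s, baseline=0.0,
                    margin=10.0, window_s=20 * 60):
    """True if steal stays above baseline + margin for at least window_s.

    samples: steal percentages, one per sampling interval, oldest first.
    interval_s: length of one sampling interval in seconds.
    """
    # Number of consecutive samples needed to cover the window (rounded up).
    needed = max(1, -(-window_s // interval_s))
    run = 0
    for steal in samples:
        if steal > baseline + margin:
            run += 1
            if run >= needed:
                return True
        else:
            run = 0  # the streak is broken, start counting again
    return False


# One sample per minute (hypothetical data).
short_spike = [2.0] * 10 + [14.0] * 5 + [2.0] * 10
long_plateau = [2.0] * 5 + [14.0] * 20

if __name__ == "__main__":
    print(sustained_steal(short_spike, interval_s=60))
    print(sustained_steal(long_plateau, interval_s=60))
```

A short spike resets the counter as soon as steal drops again, so only a continuous stretch of elevated steal time triggers the check.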

If you see a constant steal time on your virtual machine, do you notice any loss in performance on your applications? If this is the case, try to find out if your application is causing this. Keep us informed if you suspect the root cause is outside of your environment. In those situations, we can generally solve the problems quickly by moving your virtual machine to a different hypervisor. We would normally see high loads on our systems long before you do, and we would automatically adjust our systems to handle such peaks.

Possible solutions for CPU Steal Time

If the problem is a process eating away at resources on your end, you can try increasing the CPU% of your VPS. However, this will only be a temporary fix if the root cause is slow and inefficient code within your application.

If you know you will be running heavy processes on your VPS for a few hours, you can increase the CPU resources temporarily via the configurator in my.tilaa. When you’re done, you can revert to your original settings.

However, if all the hosts within the POD your VPS is on are full, we can migrate your server to a different POD. We monitor those metrics on our end, so 9 times out of 10 our monitoring systems have already alerted us to extremely high activity on the hypervisors and we have already taken care of the issue.

You can always contact our support and open a ticket through my.tilaa if you find suspect steal time metrics or experience performance problems with your VPS, and we’ll investigate the issue.
