Private Cloud Project – Ch01 – The Mission

This is the first post in a series that describes the private cloud project at the company I work with, from the idea to the release, with as much technical content as possible.

I work as a Windows system administrator in a team of 6-8 colleagues, and together we run what we call the Office IT Windows server infrastructure for our company: some 600 servers covering most of Microsoft's products (AD, File, Print, DHCP, Exchange, SharePoint, MSSQL etc.) and a lot of third-party applications running on Windows servers. Because the products sold by the company we work with are themselves IT based, we are just one of many teams engaged in running IT systems.

The traditional end-to-end process of supplying our internal customers, whom we call “Business Partners”, with the server machines, operating systems, networking and other infrastructure they need was until now a very stony road of opening tickets and orders in various systems, waiting for replies, and coordinating the several teams involved. In short, the process could, and often did, take more than a handful of weeks to complete. Most Business Partners never explicitly expressed their discontent with this situation, but we were aware that we needed a solution for this nuisance.

To get around the weeks-long logistics hassle, the obvious way to go would have been our own environment: our own datacenter, purchasing and networking teams, and setting things up ourselves. Not possible, though, due to the way our business runs.

In the end, virtualization was our only choice: hardware would be set up only in rare cases, and server systems would be deployed as quickly as possible after a Business Partner’s request, ideally from a self-service ordering web interface, cutting the turnaround from weeks to hours.

We would also gain the other advantages a virtualized infrastructure can bring: quicker and easier patching, no more tracking hardware life cycles for every single server, higher hardware density and utilization, and much lower energy and rack space consumption.

So, out I was sent into the intahwebz to search for solutions. As usual, price would play a role, but even more vital to the decision was manageability. Our team runs Windows systems, and that is what we wanted to stick with. In came the well-known contenders: VMware, Citrix, Xen et al. What we didn’t like about most of them: their need for an enterprise storage partner (EMC, HP EVA, Hitachi, NetApp, whatever). In the end, we came to a solution that was rather new to the market, and still is: Hyper-V cluster nodes using SMB3 shares provided by a cluster of Windows file servers, which in turn make use of Windows Storage Spaces.

And here it is, the mission: create a private cloud environment consisting of the full MS stack from top to bottom, then migrate all VMware (yes, we have that, too) VMs and most of the hardware-based servers to it. (Cream on top: do something similar for VDI as well, though that is not part of this blog post series.)

The voyage we set out on and the challenges we encountered are what’s in the next posts:
Private Cloud Project – Ch02 – The Design Concept (unreleased)
Private Cloud Project – Ch03 – Management Cluster (unreleased)
Private Cloud Project – Ch04 – File Server Cluster (SOFS) (unreleased)
Private Cloud Project – Ch05 – Hyper-V cluster (unreleased)
Private Cloud Project – Ch06 – VM migration (V2V) (unreleased)
Private Cloud Project – Ch07 – Hardware migration (P2V)

Here are some links I used for getting some basic knowledge and/or ideas:

virtualizationmatrix
Hyper-V server blog Rachfahl IT solutions (de)
Thomas Maurer’s blog
Aidan Finn’s blog
Altaro’s HyperV blog
Keith Mayer’s blog (MS)
MS decision aid

Stay tuned, feedback is very welcome.
Cheers,
Peter.

Using dynamic VHDx for IO intensive VMs may not be a good idea

For the hasty reader:
Using dynamic VHDx for I/O-intensive workloads generates high CPU usage in the management OS, but only on the first three cores. Using Sysinternals Process Explorer we found exactly three threads in the “System” process (PID 4), with their start address in “vmbusr.sys”, that are the root of the CPU usage. We looked into RSS, VMQ and other things, but basically the huge load went away when we changed from dynamic to fixed VHDx.
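If you want to check quickly whether your host shows the same pattern, sampling the per-core CPU load in the management OS is enough. A minimal sketch, assuming Python with the third-party psutil package is available on the host:

```python
# Sample per-core CPU load in the management OS to see whether only the
# first few cores are saturated, the symptom we saw with dynamic VHDx.
import psutil

# One-second sample of every logical core, as percentages.
per_core = psutil.cpu_percent(interval=1, percpu=True)

for core, load in enumerate(per_core):
    marker = "  <-- busy" if load > 90 else ""
    print(f"core {core:2d}: {load:5.1f}%{marker}")
```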

The longer story:
During the testing phase of our Private Cloud environment we also did I/O tests using SQLIO in up to 30 VMs on the hypervisor machines. The hypervisors talk SMB3 to the storage environment. We ran tests with 8k I/Os and 40-80k I/Os. We always noticed that, as soon as the VMs started doing the heavy SQLIO-based I/O, cores 0 to 2 in the management OS were under full fire:
[Screenshot 20140626-01: cores 0 to 2 of the management OS at full load]
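For readers without SQLIO at hand, the workload itself was nothing fancy. A rough stand-in, assuming an example test file path inside the VM (it is not SQLIO, no unbuffered I/O and no outstanding-I/O control, it only mimics the shape of the 8k random runs):

```python
# Rough stand-in for the SQLIO-style random-I/O runs: hammer a test file
# inside the VM with random 8 KB writes for a fixed time.
import os
import random
import time

TEST_FILE = r"C:\iotest\testfile.dat"   # example path inside the VM
FILE_SIZE = 4 * 1024**3                 # 4 GiB test file
BLOCK = 8 * 1024                        # 8 KB blocks, as in the 8k runs
DURATION = 60                           # seconds

os.makedirs(os.path.dirname(TEST_FILE), exist_ok=True)
with open(TEST_FILE, "wb") as f:        # pre-create the file once
    f.truncate(FILE_SIZE)

buf = os.urandom(BLOCK)
deadline = time.time() + DURATION
ios = 0
with open(TEST_FILE, "r+b", buffering=0) as f:
    while time.time() < deadline:
        f.seek(random.randrange(0, FILE_SIZE - BLOCK, BLOCK))
        f.write(buf)
        ios += 1

print(f"{ios} random {BLOCK // 1024}K writes in {DURATION}s "
      f"({ios * BLOCK / DURATION / 1024**2:.1f} MB/s)")
```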
Looking at the process tree in Process Explorer, we found that the System process (PID 4) was showing this load in the Threads tab:
[Screenshot 20140626-02: Threads tab of the System process, with three vmbusr.sys threads consuming the CPU]
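Process Explorer is what resolves the thread start addresses to vmbusr.sys. If you only want to see which System threads are burning CPU time, a small psutil sketch (run from an elevated prompt; it cannot show the start module) would be:

```python
# List the busiest threads of the System process (PID 4) by accumulated CPU
# time. psutil only gives thread IDs and CPU times; resolving the start
# address to vmbusr.sys still needs Process Explorer.
import psutil

system_proc = psutil.Process(4)          # the Windows "System" process
threads = system_proc.threads()          # (id, user_time, system_time) tuples

busiest = sorted(threads, key=lambda t: t.system_time + t.user_time, reverse=True)
for t in busiest[:10]:
    print(f"TID {t.id:6d}  kernel {t.system_time:10.2f}s  user {t.user_time:8.2f}s")
```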

That didn’t really help much, as you will find very little information on the web about the Hyper-V VMBus. One of the few sources is the architectural description on MSDN. It says:

VMBus – Channel-based communication mechanism used for inter-partition communication and device enumeration on systems with multiple active virtualized partitions. The VMBus is installed with Hyper-V Integration Services.

Another one is Kristian Nese’s blog post, written for Hyper-V on W2K8, but the basics should still hold.

Not very enlightening (pun intended) for our case; why should the VMs doing the SQLIO workload talk to each other? Maybe device enumeration? We tried to attack the problem from various angles: playing with the size of the SQLIO blocks, tuning SMB network interface parameters, tweaking VMQ settings (although these VMs weren’t doing any guest network traffic). In a calm minute my colleague Christoph ran the SQLIO tests with a bunch of VMs that were slightly different from the others. Tada: normal CPU load distribution across all cores! The difference was easily found: their VHDs were of fixed size. We still have to find out whether there is a number of VMs per host below which the strange behaviour does not show up.
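Since the cure for us was simply moving away from dynamic disks, the first step on any host is knowing which VMs still use them. A small sketch, assuming the Hyper-V PowerShell module on the host, that shells out to the standard cmdlets (Get-VM, Get-VMHardDiskDrive, Get-VHD) to list them; converting a disk afterwards would be Convert-VHD with -VHDType Fixed, with the VM shut down and enough free space on the share for the full-size copy:

```python
# Inventory which VMs on this Hyper-V host still sit on dynamic VHDX files,
# by shelling out to the Hyper-V PowerShell module and parsing its JSON output.
import json
import subprocess

ps_command = (
    "Get-VM | Get-VMHardDiskDrive | Get-VHD | "
    "Select-Object Path, @{n='VhdType';e={$_.VhdType.ToString()}} | "
    "ConvertTo-Json"
)
raw = subprocess.run(
    ["powershell", "-NoProfile", "-Command", ps_command],
    capture_output=True, text=True, check=True,
).stdout

disks = json.loads(raw)
if isinstance(disks, dict):     # a single disk comes back as one object, not a list
    disks = [disks]

for disk in disks:
    if disk["VhdType"] == "Dynamic":
        print(f"still dynamic: {disk['Path']}")
```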

The bad news: the behaviour already shows up with only 5 VMs. We have not done a full comparison test, but the high VMBus load also seems to put a cap on the I/O a VM with dynamic VHDx can do.

Any helpful comments, hints or tricks are highly welcome.

Cheers,
Peter.

Edit 2014-06-30: Post from a guy with a similar issue