[olug] Request for feedback from users of KVM-based virtualization in RHEL Server 6 (Luke-Jr)

Sat Jan 8 19:18:15 UTC 2011

Hi,

Thanks for your interest in my OLUG post last Monday.  Here's some background to help you understand my rationale for the use of KVM-based virtualization via a host OS (not a bare-metal hypervisor) instead of OpenVZ.

Next-generation DNA sequencing (NGS) datasets are very large.  The processing of these datasets typically occurs in two stages.  First, the relatively short DNA fragment sequences produced by an NGS instrument are processed using very computationally intensive procedures to reconstruct, where possible, the much longer contiguous DNA sequences of the individual chromosomes in the genome of the cells from which the DNA sample was isolated, and to note where such reconstruction is not possible.  Second, domain-specific end-users (scientists or physician-scientists in a particular field such as cancer, diabetes, or infectious disease research) carry out statistical tests and use various data visualization tools in order to interpret the biological or medical significance of their NGS results.

My responsibilities include supporting NGS users across the entire UNMC campus.  If you go to a popular web site devoted to NGS platforms and technologies, you will notice that they have a list of over 300 different software tools that are relevant to this area of research (http://seqanswers.com/wiki/Software/list).  This list is not exhaustive.  Also, there are new tools published in the scientific literature all the time.  In addition, there are several very popular commercial software tools for NGS data management and analysis, and we will be supporting some of those as well.

So we have a tremendous variety of general and domain-specific end-user-focused tools to support, and not all run under Linux.  The Linux-based tools include some legacy software that may only work under a specific kernel version, or that has dependencies (e.g., supporting libraries) with specific version requirements.  Sometimes these dependencies conflict with each other.  There is also the simple but useful advantage of organizing tools thematically into (desktop) VM appliances customized to address the needs of a particular type of domain-specific end-user, say, someone studying bacterial metagenomics vs. another investigator studying cancer.  The portfolios of useful tools for the different subject areas are distinct from one another.  And once these thematically organized VM appliances have been created (e.g., as VM templates) they can be readily cloned on the virtualization hosts running in the data center.  Researchers are at the same time both highly collaborative and very competitive, and different labs invariably prefer to have their own VMs for their specific research group.

So why not use a bare-metal hypervisor?  There are several reasons, but the most important one is that we want the option to use the host OS for the most demanding "primary analysis" jobs, e.g., DNA sequence alignment against a human reference genome sequence, or de novo genome assembly.

Best wishes,

Robert

On Friday, January 07, 2011 Luke Jr. wrote:

Honestly, I don't understand what you need virtualization for at all, so maybe I'm missing something obvious-- but why not OpenVZ? It has basically no overhead at all, at the expense of having all your VEs using a single kernel version.

On Monday, January 03, 2011 Robert J Boissy wrote:

> We perform some computations on some very large distributed memory computer
> clusters at the Holland Computing Center (HCC) at UNL/UNO and on the
> TeraGrid. However, very large SMP systems are also useful for many of the
> compute tasks that we need to perform. By HPC standards we don't have a
> large budget for compute and storage infrastructure, and we don't want to
> duplicate resources that are already available to us through the HCC and
> TeraGrid. What we have settled on is a flexible, low-maintenance,
> on-premises "private cloud" type approach starting off with just two very
> large SMP systems each with its own direct-attached storage JBOD array. We
> can't afford SAN or NAS storage. These are research computing systems,
> and will not run software related to UNMC's administrative or teaching
> missions.
>
> We are dealing with very large DNA sequence datasets (hundreds of GB to
> tens of TB), some of which contain DNA sequence data from tissue
> derived from human subjects, and the movement of these datasets to public
> cloud computing service providers is thus not really practical, both
> logistically and because of regulatory (HIPAA) compliance issues.
>
> KVM-based virtualization in RHEL Server 6 really appeals to us because the
> VMs we would like to deploy would themselves need to be substantial
> systems capable of the many different NGS-related compute tasks we will
> need to be able to perform. We would strongly prefer to not use a
> bare-metal hypervisor, as for some tasks we do want to use a physical
> (on-the-metal) host OS. Granted, this is not your typical enterprise
> virtualization scenario. Nevertheless, I would be very grateful if I might
> be able to hear some feedback from users of KVM-based virtualization in
> RHEL Server 6. For example, how does it compare with VMware vSphere/ESXi?
> How well does it handle virtualized desktops that might need to deliver
> graphics-intensive applications? What are the best 10 GbE adapters to use
> to support virtualization under RHEL Server 6? What are some good
> open-source or reasonably priced commercial management tools for
> small-scale on-premises private clouds?