We have decided to bring as much as we can in house and only put the workloads that have strict contractual uptime agreements on our VMware or HCI stack. The rest of the stuff goes on KVM or bare metal to save costs.
This is similar to the recommendations I give my customers, but its never this easy.
Entire teams are trained on managing VMware. Years of VMware compatible tools are in place and configured to support those workloads. Making a decision to change the underlying hypervisor is easy. Implementing that change is very difficult. An example of this is a customer that was all-in on VMware and using VMware’s Saltstack to orchestrate OS patching. Now workloads they move off of VMware have to have an entirely new patching orchestration tool chosen, licensed, deployed, staff trained, and operationalized. You’ve also now doubled your patching burden because you have to patch first the VMs remaining in VMware using the legacy patching method, then patch the non-VMware workloads with the second solution. Multiply this by all toolsets for monitoring, alerting, backup, etc and the switching costs skyrocket.
Broadcom knows all of this. They are counting on customers willing to choose to bleed from the wrist under Broadcom rather than bleed from the throat by switching.
There’s a cost to keeping an agnostic solution that maintains that portability. It means forgoing many of the features that make cloud attractive. If your enterprise is small enough it is certainly doable, but if you ever need to scale the cracks start to show.
If you’re treating Azure VMs as simply a replacement for on-prem VMs (running in VMware or KVM), then I can see where that might cause reliability issues. Best results means a different approach to running in the cloud. Cattle, not pets, etc. If you were using Azure VMs and have two VMs in different Availability Zones with your application architecture supporting the redundancy and failover cleanly, you can have a much more reliable experience. If you can evolve your application to run in k8s (AKS in the Azure world) then even more reliability can be had in cloud. However, if instead you’re putting a single VM in a single AZ for a business critical application, then yes, that is not a recipe for a good time Nonprod? Sure do it all the time, who cares. You can get away with that for awhile with prod workloads, but some events will mean downtime that is avoidable with other more cloud native approaches.
I did the on-prem philosophy for about 18 years before bolting on the cloud philosophy to my knowledge. There are pros and cons to both. Anyone that tells you that one is always the best irrespective of the circumstances and business requirements should be treated as unreliable.