Representatives from Microsoft and the AICTE announce the deal. From left to right: Yogesh Kocher (Director, Business Development, Microsoft), Dr. Mantha (Chairman, AICTE), Sanket A (General Manager, Microsoft India), Kapil Sibal (Hon. Minister for Human Resource Development, India), and Tarun Malik (Director, Product Marketing, Microsoft). Image: Courtesy of Microsoft
It’s strange how little the ground shook when behemoth Microsoft announced the world’s “largest cloud deployment.”
The customer: India’s body for technical schools, the All India Council for Technical Education (AICTE). The ambitious plan calls for rolling out a set of cloud apps to 7.5 million students and faculty members — nearly the population of New York City — at 10,000 institutes across the country.
Microsoft says it will take 90 days to complete, but that seems like a ton of wishful thinking.
These students and faculty members are set to become users of Microsoft’s Live@edu, a set of apps that includes email and calendars with a 10GB inbox, 25GB of file storage, document sharing, instant messaging and video chat. Microsoft says Live@edu already has about 22 million users, so this will not be the company’s first massive-scale cloud deployment: the Kentucky Department of Education has 700,000 people using the collaboration suite.
What’s missing from the storyline? How such a large installation will take place. What are the realities and challenges behind “the largest cloud deployment ever”?
As for the backend (what Microsoft and its partners are doing to get the server side up and running), there is little public information to go on.
There is some mention of “[deploying] a system across geographically distributed locations,” but it’s not clear whether there is an on-premises component to this cloud solution or whether Microsoft is simply hosting it in multiple distributed data centers. Of course, a lot of this info is proprietary, but we can make a few assumptions.
When I posed the question about the challenges inherent in such a large-scale deployment, Nick McElhinney of MacTech Solutions chimed in: “The two largest issues are concurrent requests and cooling. In order to handle concurrent requests Microsoft will most likely need to use a two-tier server architecture that requires two servers for each request: one to process the request and one to serve the data,” says the network consultant. “The need for two separate servers generates more heat, which will require a very advanced cooling architecture.”
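The two-tier pattern McElhinney describes is straightforward to picture in code. A purely illustrative Python sketch (the class names are hypothetical, not anything Microsoft has disclosed):

```python
# Minimal sketch of a two-tier request architecture. Each request crosses
# two servers: one to process it, one to serve the data it needs.

class DataServer:
    """Second tier: owns the data and serves reads for the front tier."""

    def __init__(self, store):
        self._store = store  # in-memory stand-in for real storage

    def fetch(self, key):
        return self._store.get(key)


class RequestServer:
    """First tier: validates and processes the request, then asks a data server."""

    def __init__(self, data_server):
        self._data = data_server

    def handle(self, user_id, key):
        if not user_id:                  # request processing happens here...
            raise ValueError("unauthenticated request")
        return self._data.fetch(key)     # ...and the data hop happens here


# Each request touches both tiers, which is why McElhinney notes the
# design doubles the server count (and the heat) per request.
front = RequestServer(DataServer({"inbox:42": ["Welcome to Live@edu"]}))
print(front.handle(user_id=42, key="inbox:42"))
```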
Shriram Natarajan, senior director of the Cloud Technology Practice at software developer Persistent Systems, pointed to concerns about geography, access and availability. “With the user population so widespread and presumably using a variety of access networks, it’s going to be a challenge to ensure uniformity of access,” Natarajan said.
“Key concerns are, ‘Where are my data centers? What arrangements do I have with the local carriers?’ Most of the time the users of a particular segment are going to use and create data on that segment itself so how will I partition my applications so that the relevant data is local?” Natarajan asks.
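Natarajan’s partitioning question boils down to routing each user segment to a region-local shard. A toy sketch, with invented region and data center names:

```python
# Sketch of the data-locality idea: route each user's reads and writes
# to a shard in their own region, so common operations stay local.
# The region names and shard map are made up for illustration.

REGION_SHARDS = {
    "north": "datacenter-delhi",
    "south": "datacenter-chennai",
    "west":  "datacenter-mumbai",
    "east":  "datacenter-kolkata",
}

def shard_for(user_region: str) -> str:
    """Pick the shard closest to the user; fall back to a default."""
    return REGION_SHARDS.get(user_region, "datacenter-mumbai")

# A student in the south creates and reads data on the Chennai shard,
# so their traffic never crosses the country for everyday operations.
print(shard_for("south"))  # -> datacenter-chennai
```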
Andrew Phillips, VP of product management at XebiaLabs, a firm that specializes in deploying apps to middleware environments like Microsoft’s .Net, elaborates on the local carrier issue: “With large-scale applications distributed across multiple data centers, the inventory of the ‘local’ servers is not sufficient: you may need to include cloud-hosted and external services in the overview too. You will also need an application view alongside a pure infrastructure ‘what’s running on this machine’ view to be able to blend out all the extraneous things running on your systems (monitoring, logging, tooling, etc.) and focus on the system architecture,” Phillips says.
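Phillips’ “application view” amounts to filtering the raw inventory, including cloud-hosted services, down to application components. A small illustrative sketch, with made-up host and process names:

```python
# Start from a raw per-machine process inventory (external services count
# too), then blend out monitoring/logging/tooling so only the application
# architecture remains. All names here are invented for illustration.

INFRA_VIEW = {
    "server-01": ["mail-frontend", "log-agent", "monitor-agent"],
    "server-02": ["calendar-service", "log-agent"],
    "cloud":     ["blob-storage", "metrics-saas"],
}

EXTRANEOUS = {"log-agent", "monitor-agent", "metrics-saas"}

def application_view(infra: dict) -> dict:
    """Filter out tooling processes, keeping only application components."""
    return {
        host: [p for p in procs if p not in EXTRANEOUS]
        for host, procs in infra.items()
    }

print(application_view(INFRA_VIEW))
# {'server-01': ['mail-frontend'], 'server-02': ['calendar-service'],
#  'cloud': ['blob-storage']}
```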
What Would Facebook Do?
Most likely the Live@edu deployment will follow the Facebook model and run multiple versions concurrently in production (e.g., blue/green or canary deployments, where new functionality is tested on a subset of users), allowing portions of the target environment to be rapidly upgraded or downgraded.
With large-scale applications it becomes very hard to carry out realistic performance or stability testing, since simulating real-world usage scenarios verges on the impossible. It is also hard to gauge customer reaction to new functionality.
For this reason, many large-scale systems test new functionality on a subset of their users and decide, based on performance, adoption and so on, whether to keep or reject the changes. According to interviews with XebiaLabs, this demands the ability to do three things (sketched in code after the list):
a) Deploy a new version to a subset of the production environment;
b) Track where the new version is running and separate its logging and statistics from the old version; and
c) Upgrade the remaining servers or downgrade the trial servers in an automated fashion.
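A minimal Python sketch of that canary pattern covering (a) through (c). The deploy() and set_log_stream() helpers are hypothetical stand-ins for real deployment tooling, not anything XebiaLabs or Microsoft has published:

```python
import random

def deploy(server: str, version: str) -> None:
    """Hypothetical stand-in for a real deployment step."""
    print(f"{server}: now running {version}")

def set_log_stream(server: str, stream: str) -> None:
    """Hypothetical stand-in for routing a server's logs and statistics."""
    print(f"{server}: logging to {stream}")

def canary_rollout(servers, new, fraction=0.1):
    """(a) Deploy the new version to a random subset of the fleet."""
    count = max(1, int(len(servers) * fraction))
    canaries = set(random.sample(servers, count))
    for s in canaries:
        deploy(s, new)
        set_log_stream(s, f"logs/{new}")  # (b) keep trial stats separate
    return canaries

def decide(servers, canaries, old, new, keep):
    """(c) Promote the remaining servers, or downgrade the trial ones."""
    targets = (set(servers) - canaries) if keep else canaries
    for s in targets:
        deploy(s, new if keep else old)

fleet = [f"web-{i:02d}" for i in range(10)]
trial = canary_rollout(fleet, new="v2")
decide(fleet, trial, old="v1", new="v2", keep=True)
```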
‘The Cloud Ate My Homework’
One must wonder how realistic it is for a deployment of this scale to offer service level agreements (SLAs) that include high availability and a promise of zero downtime. If business-critical data were at stake (and not just academic data), would anyone consider this realistic enough to attempt, or do we just suck it up and assume failure?
“If ever there was the perfect cloud deployment this could be it, with its massive consumer base using a constant, static application set that’s not mission-critical,” says Dave Laurello, CEO of Stratus Technologies, which promises an “uptime assurance” guarantee for enterprises in healthcare, manufacturing and government. “This suits the cloud providers’ approach to infrastructure and service delivery like a glove. The financial impact on the user community is virtually non-existent and students’ reputations won’t suffer if the network goes down for a few hours.”
“The converse of this example is where cloud providers don’t do so well, when the 10,000 institutions are individual, unrelated businesses, with varying workloads and different applications, and with their own customers to keep happy,” Laurello says.
With an application this large, supporting high availability will be a challenge. Downtime will also be felt more widely if the service spans multiple time zones, since it will always be peak time for some users.
“High availability deployments are tricky to coordinate because you need to handle things like load balancer configuration and update certain groups of servers at a time. Because deploying a multi-component system to support 7.5 million users can’t rely on any manual steps, it must be automated,” says XebiaLabs’ Phillips.
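A toy sketch of that coordination: update the fleet in small batches, draining each batch from the load balancer before touching it, so users always have live servers to hit. The LoadBalancer class is a stand-in for real infrastructure, not an actual API:

```python
class LoadBalancer:
    """Toy stand-in for a real load balancer's rotation membership."""

    def __init__(self, servers):
        self.active = set(servers)

    def drain(self, server):
        self.active.discard(server)   # stop sending traffic to this server

    def restore(self, server):
        self.active.add(server)       # resume traffic after the update

def rolling_update(lb, servers, version, batch_size=2):
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            lb.drain(s)
        assert lb.active, "never drain the whole fleet at once"
        for s in batch:
            print(f"{s}: updated to {version}")  # the real deploy step goes here
            lb.restore(s)                        # ideally after a health check

fleet = [f"app-{i}" for i in range(6)]
rolling_update(LoadBalancer(fleet), fleet, "v2")
```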
Because a large number of commodity servers in multiple datacenters will be used to scale the applications, the deployments will need to execute coordinated actions on hundreds or thousands of machines. The human resources required to do this are prohibitive; humans simply aren’t good at executing a large number of repetitive tasks, so getting it right on a reasonably regular basis becomes very hard, says Phillips.
“Cloud infrastructures today are built for recovery from failure, not failure prevention. The mindset is to build cheap and on a massive scale, assuming failure will happen and setting customer expectations accordingly,” says Stratus’ Laurello.
In other words, “good enough for government work” applies here.
“Approaching system design from the standpoint of inevitable failure is a self-serving, cost-driven model. It relegates customer service and satisfaction to secondary importance. This approach usually ends up costing more money over the long-run, not less. It’s the cloud users that suffer the consequences,” Laurello adds.
“In the case of the Indian education body AICTE, students will have a new excuse for missing an assignment deadline: The cloud ate my homework.”
Throw Out Impossible Manual Processes
Automation is obviously called for: automation that covers all aspects of this massive cloud deployment, including application code, server configuration, and database changes.
Deployment automation is not just about the application binaries; having no manual steps means being able to automate all aspects of the deployments. Or, at least, all aspects beyond the handful of one-off tasks confined to a small number of teams (e.g., updating a single release ticket for the entire update might feasibly be done manually, although automating this would be better, of course).
“With today’s commodity server and commodity middleware and storage designs, even actions like server configuration changes will need to be carried out on many servers during a deployment and so need to be automated,” says Phillips.
In order to scale up to meet user demand and down to reduce cost, you are likely to be adding machines to and removing them from your production environment on a regular basis. A new target server does not help you in and of itself; your application tier components need to be deployed to it too, in an automated way.
And when you remove machines, your deployment automation solution must be told that the server is no longer available and no longer needs to be updated. So you will need a deployment automation strategy that can integrate with your provisioning systems, according to XebiaLabs’ way of thinking.
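A sketch of what that integration might look like, with invented scale-event names and a hypothetical deploy() helper; a real setup would wire these handlers to the cloud platform’s provisioning events:

```python
# Keep the deployment tool's inventory in sync with the provisioning
# system: new machines must be deployed to and registered, removed
# machines must be dropped so future rollouts skip them.

inventory = {"app-1", "app-2"}   # servers the deployment tool knows about

def deploy(server: str) -> None:
    """Hypothetical stand-in for deploying the application tier."""
    print(f"{server}: application tier deployed")

def on_scale_event(event: str, server: str) -> None:
    if event == "server-added":
        deploy(server)             # new capacity is useless until deployed to
        inventory.add(server)      # future rollouts must include it
    elif event == "server-removed":
        inventory.discard(server)  # future rollouts must skip it

on_scale_event("server-added", "app-3")
on_scale_event("server-removed", "app-1")
print(sorted(inventory))           # ['app-2', 'app-3']
```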
To summarize, the main ingredients needed to make a large-scale cloud deployment work:
- Need automation that covers all aspects: application code, server configuration, database changes, etc.
- Need overview of the deployment state of the systems components across multiple geographies and data centers
- Need to run multiple deployment versions concurrently in production
- Need support for high availability/”no downtime” deployments
- Need integrations with cloud platform/provisioning systems to ensure that, as new servers come online when the system auto-scales, the application is correctly deployed and hooked up to the already running instances.
Weigh in: How will Microsoft’s biggest cloud deployment go? With Microsoft widely misunderstood on the cloud, is this make or break?