Google Cloud NEXT '17 - News and Updates

Improving utilization and portability with containers (Google Cloud Next ’17)

(Video Transcript)
[MUSIC PLAYING] CRAIG BOX: Good morning, everybody. Thank you for coming along to a talk today. I want to tell you about how we improve utilization and portability of applications using containers. My name is Craig. I'm an advocate for the Google Cloud Platform, focused on IT professionals, and also DevOps. It's great to be here on a fantastic warm, sunny day in San Francisco. But because I've come over from London with my girlfriend, we're going to spend the weekend in Yosemite. And we thought we needed to pack our snow boots just in case. Her boots didn't fit in her suitcase. But I had lots of empty space in mine. So we put her boots in a bag, and put them inside my suitcase. So I've increased the utilization of my suitcase. And by doing that, we don't actually have to pay to take another bag with us. You can tell I'm probably not hipster enough for those to be my actual suitcases. But also in saying that, I'm not the kind of person who's going to vacuum pack my shirts, so that I can fit more in my suitcase.

But if you are the airport, and you're moving thousands of bags around every minute, being able to pack more into your trucks does actually save you a lot of money. We, of course, have a very similar problem with the computers here at Google. Every new computer we add costs money. But we have more computers than most people do. So we want to pack more things onto fewer machines and improve the utilization of them. So by building a system that allows computers to decide how to do the packing, rather than people, we found that, for example, we could run our batch workloads, like building the search index, in the empty space on the same set of machines that run our serving workloads, like serving you 10 blue links when you go to do a search. And we've estimated that at the time (and this was a few years ago) just that one thing, being able to pack the batch into the empty space, saved us the cost of building an entire Google data center. These are fantastic buildings.

But as you can imagine, they don't come cheap. How do we do this? We need to move away from managing individual machines to managing fleets of applications running in containers. So first of all, a quick refresher on virtual machines and containers, because a lot of the workloads that people want to move to cloud today and into containers start on VMs. So we want to look at a recap of the strengths and weaknesses of the two platforms. Ultimately, I like to think of a container as just being a process with some restrictions on what it can and can't do. If you think of an app on your phone, it's a very close analogy to a Linux container. If you turn on your phone, you start the operating system. It's booting a machine. It takes around about 60 seconds from cold to get to the point where you can use it; that goes for any computer really, and phones too. Starting a process is just launching an application. If you don't have it yet, then you have to download it. It takes a few seconds.

But when you do have it, it should take milliseconds to launch. It should be imperceptible in most cases. Now for a lot of reasons, the applications on your phone are only allowed to do a subset of things. You want to restrict an app, for example, so that it can't download your entire contact list and send it out. The permissions you set are about what you want to allow a particular application to do. And similarly, Linux containers have restrictions set on what they can and can't do, especially with regard to the resources that they can see and use on the machines that they're running on. Now some of the characteristics of virtual machines cause us to have lower utilization. For example, in this case, we have a host operating system. But in our three VMs, we run three redundant copies of a guest OS. In each image, we bundle the libraries that relate to the particular operating system that we're running. And therefore, the apps that we run are tightly coupled to those system libraries.

And even though we have some free space available on all three of the virtual machines running on this particular host, if we want to run something that is, say, three units big, in this case, we can't do that because the space is stranded across three different machines. Now containers abstract away the operating system, first of all, so that you, as a developer, all you have to think about is your own application and the libraries that it needs to run. You think, also, about the interfaces with the other applications you want to run. And if we only had one machine that never failed and never ran out of resources, that would be all that we needed. Because we run on more than one machine, and we want to be able to scale horizontally, we invented Kubernetes– which is derivative of our internal systems, but developed as open source– to abstract away everything else that you might need to think about in a multi-machine environment, which we call a cluster. So scheduling containers is one part.
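The stranded-space point can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name and the capacity numbers are hypothetical, and a real scheduler weighs far more than a single number per machine.

```python
def fits_anywhere(free_units, job_size):
    """Return True if any single machine has enough free capacity for the job."""
    return any(free >= job_size for free in free_units)

# Three VMs on one host, each with 1 unit free (hypothetical numbers).
vm_free = [1, 1, 1]

# The same host running containers directly: the free space is one shared pool.
host_free = [sum(vm_free)]

job_size = 3
stranded = fits_anywhere(vm_free, job_size)    # no single VM has 3 units free
pooled = fits_anywhere(host_free, job_size)    # the pooled host does

print("fits on a VM:", stranded)
print("fits on the pooled host:", pooled)
```

The total free capacity is identical in both cases; only the partitioning differs, which is why the three-unit job cannot be placed on any of the VMs.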

But other things, like keeping your containers alive and healthy, grouping them together into the services that you actually want to run, connecting traffic to those services, handling logging and monitoring, and identity and authorization, these are all things that we feel, as an application developer, the system should provide for you. Now those things make up two legs of what we've come to call cloud native. So container packaging gives you predictability of deployment and efficient resource isolation. So you're just copying a file when you're deploying a container that contains all of the resources and libraries that your application needs. Next is the dynamic scheduling (which is, again, provided by Kubernetes) that allows higher quality of service and efficiency by being able to have the computer decide where to place these workloads. And that means that you can do more with fewer humans and have a lower operations cost. Now the final leg of cloud native is microservices oriented.

And that talks more to the re-usability of code, and independently developed applications, and sort of domain-driven development as much as it talks of loose coupling. It's not necessarily something that everyone will do as they move their apps to the cloud. But it's something to think about for your newer applications. So I want to talk through a few use cases from both a business and a technical perspective. For example, what do you think the most common cause of outages is? Do you think it's people writing bad software? It turns out that it's actually people deploying misconfigured software. And it's especially the case when you have the kind of bad outages that bring everything down at once. We find that we're seeing studies with over 50% of outages caused by bad configuration pushes. Another point, our friends at Ticketmaster have estimated that simply the action of moving code around and deploying it into their environments was a $60 to $90 million a year problem for them, the amount of resource and people and machinery required to deploy code across hundreds of different bespoke systems that they run.

And ultimately, this leads to questions like, why can't we move quickly? Why are these deployments failing? And why are we having to run servers at 20% utilization when we're wasting and stranding all this excess space? We need to know that our deployments will succeed. So I want to talk about how we can solve some of these things with a cloud native approach. Now I think there are three distinct changes that you need to think about if you're going to make this move. Just as cloud native has three legs, we have these three distinct migrations with individual characteristics. So you can move to containers without thinking about any of the dynamic scheduling. You can just deploy a single container on a single machine and treat it like an RPM or a package of some description. That's a perfectly valid thing to do. At the other end, you could choose to run everything in VMs. You can run clustered applications without containers; we think a container's a great format for clustering.

But there are other approaches that you can run. Now cloud native doesn't actually require cloud. You can take those first two things and run in your own environment, move your application to container clusters, and run it on premise. And we think that's a fantastic thing to do because when it comes time to move to cloud, you'll find that all the hard work has been done for you. Makes your application portable. And we want to talk about some specific workloads and what that might look like. First of all, a three-tier application, we quite often think of a presentation, or web front end, some sort of middleware, and some sort of database. In this particular environment, we've already made separation. We have a distinct, clear contract between these layers. And it's quite often a network communication as it stands. So it's easy to scale them out independently. And so if you have a workload like this, it's a really good candidate to be the first thing that you move, for the first two tiers, at least.

We're going to talk about databases separately in a minute. The batch jobs are another great use case because of their burstable nature. If you can pack your batch work into containers, then you can schedule them only when your cluster has free space, as we talked about with Google Search earlier. There are package management options now for running things like Spark on top of Kubernetes. And if you want to run clustered batch processing stuff on Google, you also have hosted options like Cloud Dataflow and Cloud Dataproc. And both of those things will let you take a Docker container, and use it as the unit of processing that you do on your data. So again, these two things can be decoupled and used separately. Now a lot of people talk about something which we call carving the monolith, which is an industry term for taking a single, giant application and breaking it up into microservices. For more traditional workloads, I like to think you can apply the same logic to virtual machines.

Kubernetes has a concept of a pod, which is the unit of scheduling. It's not just an individual container, but it's containers that tightly couple together that need to be scheduled at the same time and need to communicate with each other. And this lets you break down your work into containers where it makes sense. And you don't need to pack two different things that are sort of built by different teams into the same container, have them maintain them, connect them together in a pod, and then you can deploy them on different life cycles. And eventually, that just moves you towards a microservices implementation. But if you have VMs today where you use something like Puppet, for example, to deploy a bunch of things, you think about how we can break them down to individual containers and then map them to a pod. Now databases are a very contentious topic. There's a popular meme saying you probably shouldn't run databases in containers. But remember that all the container is is a Linux process with some restrictions about what it can and can't do.

So again, think about the difference between containers and clusters. It is possible to run Oracle on top of Docker. Oracle even published images on GitHub that let you build containers to run this. It's possible that you don't want these dynamically scheduled and just moved around your environment at will because you actually need to connect them up to the data that backs them. But there are more modern, more cloud native databases that support this kind of clustering themselves. So the way I look at this, it pays to ask, do I get value out of being a DBA? Think about it this way. If you know how to scale the database, how to back it up, how it deals with cluster partitions, how it deals with network failures, and you know how to do all the same things with a cluster, yes, it is absolutely possible to take a traditional database, something that's not cluster aware, and run it on top of Kubernetes. If you're dealing with a commercial vendor, they might not like you doing this.

They might say, well, that's not a supported environment. I'm not really happy about that. We'd rather you run on VMs. And that's OK. Because you can just run those VMs next to your cluster and call out to them. With open source databases, you sort of weigh up the pros and cons of hosted services (and you've all seen that we announced our hosted PostgreSQL service in the keynote this morning) versus running your own. And with our hosted service, there are some extensions that we just don't provide, that you may need to run custom. And there may well be specific use cases where it does make sense to run your own. And my colleagues from Samsung will talk about that in a bit. Patterns and helpers for running these databases on Kubernetes are coming thick and fast from the community. Vitess, what you see there on the logo on the left, is a new application developed by YouTube which makes it possible to scale MySQL horizontally. It works well on Kubernetes. It's a great way of running MySQL in a way that is aware of the clustered environment where workloads can move and change.

But remember, it's not all or nothing. It's still OK to use VMs. You can use the service abstraction in Kubernetes to refer to things outside your cluster, as well as inside it. If you do have a workload where locality is important, where you want to make sure it's tied to a physical machine that has a card, or a GPU, or a license dongle, for example, or if you already run something that's commercial and has cluster manager in its name, chances are you'll want to keep those things separate. And because we have both container and VM platforms, it's easy to make that choice. OK, we've made some choices about what applications we want to move. The first step is actually putting them inside a container. So for applications that you build yourself internally, the easiest thing to do is just take the output from your current build system (in this case, we have an RPM) and then write a little script or Dockerfile to build a container by installing that on top of a base image.

So in this case here, you'll see we just copy in an application that we've built. We take a CentOS image. We update it. We install any of the dependencies that we need for that application. We install the RPM. And we instruct it, when it starts, this is the application that I want to run. That's really all it takes to get a container that runs an application. It's probably going to be bigger than you want, because you're dragging in a lot of things. But it will work. And it will do what you need it to do. The next step is changing how you package your application. So if we're just building an RPM for the privilege of installing it in a Docker container, we don't really need that RPM. We should just build the Docker container directly. So we make whatever changes we need here in this pseudocode build script. We'll check out the application, build the release binaries. And now instead of calling the step to build an RPM, we could say, take that directory of output, and then build a Docker container from a Dockerfile that starts from a generic Linux operating system image, whatever it is that you need, and installs and runs these files.
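The slide being described is not reproduced in the transcript, so here is a minimal sketch of the first approach (wrapping an existing RPM), written as a small Python helper that renders the Dockerfile text. The helper name, the `centos:7` tag, the package names, and the paths are all assumptions for illustration; only the Dockerfile instructions themselves (`FROM`, `COPY`, `RUN`, `CMD`) are standard.

```python
import json

def render_dockerfile(rpm_file, dependencies, command, base_image="centos:7"):
    """Render a Dockerfile that installs a prebuilt RPM on top of a base image."""
    lines = [
        f"FROM {base_image}",                 # start from a CentOS base image
        f"COPY {rpm_file} /tmp/{rpm_file}",   # copy in the application we built
        "RUN yum -y update",                  # update the base image
    ]
    if dependencies:
        # install whatever the application depends on
        lines.append("RUN yum -y install " + " ".join(dependencies))
    lines.append(f"RUN rpm -i /tmp/{rpm_file}")   # install the RPM itself
    lines.append("CMD " + json.dumps(command))    # what to run when the container starts
    return "\n".join(lines) + "\n"

# Hypothetical application: myapp.rpm needs openssl and installs /usr/bin/myapp.
dockerfile = render_dockerfile("myapp.rpm", ["openssl"], ["/usr/bin/myapp"])
print(dockerfile, end="")
```

Writing the rendered text to a file named Dockerfile and running `docker build` in that directory would produce the image. For the second approach, building the container directly, the same shape applies with the RPM steps replaced by a `COPY` of the build output directory.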

Now that's fantastic when we start looking at continuous integration tools. So all modern CI systems can build containers. This is Jenkins here. It's very easy to get to a world where you commit a change and have a container automatically built. And the great thing about Jenkins is that you can use the clusters that you have to provide the compute to run the builders that build the containers that will then go run on these clusters. So you can use the empty space, again, to increase the utilization of the machines that you have. Alternatively, this week we've launched Google Cloud Container Builder, which is a hosted service for building container images in the cloud. So again you have a choice: do I just connect this up so it can watch for pushes to GitHub, and then automatically build containers, and put them into my secure registry? Or is this something that I want to run myself? You always have that trade-off between the things you want to manage versus taking on a managed service.

And I do encourage you to have a look at that. So with no changes to the code at all, we've gone from a situation where people say, ah, well, it worked on my machine, why doesn't it work in production, and it's so hard to deploy, and rolling back is very difficult, to having an artifact which runs the same anywhere you run it, sitting on top of the container runtime. You just start the container using a command like docker run to deploy it. And then if you need to roll back to an old image, all you need to do is restart the old image. You have them both there. You don't have to think, oh, I've upgraded the machine. I now have to downgrade it again (or worse, redeploy the entire machine) to do a safe rollback. Now let's look at the clustering. So if you remember the title of this session, the two things we're looking for are utilization and portability. So we'll take what we have that runs on a single machine and move it to run in a multi-machine clustered environment.

The first thing I suggest is if you haven't read the "Twelve-Factor App" manifesto recently, do go back and have a look at it. It was published a few years ago. But it really boils down the ideas for running scalable applications that led to the development of cloud native. Your applications are probably missing a few of these steps. So for example, Twelve-Factor says that the configuration for your application should be provided by the environment that it runs in, rather than packed into these containers. You don't want to have to build a different image to run in your development environment than you run in production. What you should do is just say, in that container, fetch the configuration from an environment variable, or from a command line flag, for example. And then have the environment provide that. In the Kubernetes case, we have an abstraction called Secrets, which are abstractions for simple credential management. We also have ConfigMaps, which are a way to pass configuration into environments.
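As a minimal sketch of that Twelve-Factor idea, the application below reads its settings from environment variables, so the same image can run unchanged in development and production. The variable names `DATABASE_URL` and `LOG_LEVEL` are hypothetical; in Kubernetes, values like these could be injected from a Secret or a ConfigMap.

```python
import os

def load_config(env=os.environ):
    """Build the application's configuration from the environment it runs in."""
    return {
        # Required: fail fast with a KeyError if the environment does not provide it.
        "database_url": env["DATABASE_URL"],
        # Optional: fall back to a default when the variable is not set.
        "log_level": env.get("LOG_LEVEL", "info"),
    }

# The same code, given two different environments (hypothetical values).
dev = load_config({"DATABASE_URL": "postgres://localhost/dev", "LOG_LEVEL": "debug"})
prod = load_config({"DATABASE_URL": "postgres://db.internal/prod"})
print(dev)
print(prod)
```

Passing the environment in as a parameter is only a convenience for testing; calling `load_config()` with no arguments reads the real process environment.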

And if you want, you can connect up to more complex things like HashiCorp Vault, or our Cloud Key Management Service, to provide credentials to your applications, things you definitely should not be checking into GitHub, for example. Given that we now have our code and our configuration, the next thing we want to do is actually deploy an application. So we'll create a Deployment object. And that will create a ReplicaSet with all of the pods having the same metadata. And we just specify how many replicas we want. We should make that choice based on the resource requirements. And we encourage you to test what it is that you get when you have a certain amount of CPU and RAM per container, because we want to set those limits to make sure that we don't run away with all the free space in the cluster. Then we create a Service, which groups those pods together, and potentially connect that to something like an Ingress to get inbound traffic in. And now we have our container application inside our cluster receiving traffic and serving.
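To make the Deployment and Service objects concrete, here is a sketch that builds both as plain Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The application name, image, port numbers, and resource limits are hypothetical, and the `apps/v1` API version is the one in current Kubernetes releases rather than the one available at the time of the talk.

```python
import json

def make_deployment(name, image, replicas, cpu, memory):
    """Describe a Deployment: N identical pods, each with CPU and memory limits."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": {"limits": {"cpu": cpu, "memory": memory}},
                    }]
                },
            },
        },
    }

def make_service(name, port, target_port):
    """Describe a Service that groups the pods carrying the matching label."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }

deployment = make_deployment("web", "example.com/web:1.0", replicas=3,
                             cpu="500m", memory="256Mi")
service = make_service("web", port=80, target_port=8080)
manifest = {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running `kubectl apply -f` on it would create both objects. The Service finds its pods purely through the shared `app` label, which is the grouping the talk describes.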

Having done all that, all of these things are API-driven. So we can get to a world where all of this is done automatically for us. We can do continuous delivery where you push your change. It gets built. It gets tested. Perhaps there's a human approval step. And then the deployment is done to the cluster. For this, we like to use a tool called Spinnaker. We've worked on it with Netflix, who built this tool initially as an enhancement on their work on VM-based deployments. And we've especially collaborated with them around being able to do Kubernetes and container-based deployments. But again, the APIs are there, so you can tie this into whatever continuous delivery system you use. Or you can even run this from Cloud Container Builder, and just have a step that does deployment. Now if you're running a common application, the open source databases, for example, you think, well, I don't actually need to go through the effort of building all of these containers myself.

Someone else has probably done that. And then I can just reuse their work. So there's a package manager for Kubernetes, which is called Helm. It contains many open source packages that are ready to deploy. But it's also an engine you can use to describe how your containers, and your pods, and services relate to each other in an environment, and then use that as the thing that gets deployed on your cluster. So similar to how a package manager, like RPM, for example, installs the thing on a single machine, Helm brings that up a level, and says, install this on the cluster wherever it happens to need to be. So when we think about anything that handles state, we need to think about storage. Traditionally, what we would do is we would have some large set of disks, and a computer attached to them that runs the database that serves that data. You'd say that the work follows the state. Now we're talking about distributed clusters in Kubernetes. You should think about turning that around, and saying, the state follows the work.

So wherever your database engine gets scheduled, the state is just attached by way of taking a network disk that contains that, putting it on the right machine automatically as a volume, which is then mapped into that pod. If you think of a use case where you have a number of different nodes which run a set of replicas for a database, they all have slightly different data, depending on what's sharded on each one. And instead of just having a ReplicaSet with each thing having the same data attached to it, we now have an abstraction called a StatefulSet, which lets you say, replica zero has disk zero, replica one has disk one, for example, mapped through to the number that you need to have. And it's a great way of running these stateful workloads on top of Kubernetes. Where am I going to store this data? The great thing about Kubernetes is that you don't need to think about the abstraction below as someone who is simply deploying and running applications.
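The replica-to-disk mapping of a StatefulSet can be sketched in a few lines. Kubernetes names StatefulSet pods with a stable ordinal suffix and names each pod's volume claim from the claim template, the set name, and the same ordinal; the set name `db` and claim template name `data` below are hypothetical.

```python
def stateful_identities(set_name, claim_template, replicas):
    """List (pod name, volume claim name) pairs for each replica ordinal."""
    return [
        (f"{set_name}-{ordinal}", f"{claim_template}-{set_name}-{ordinal}")
        for ordinal in range(replicas)
    ]

# A three-replica database: replica 0 always gets disk 0, replica 1 gets disk 1, ...
identities = stateful_identities("db", "data", 3)
for pod, claim in identities:
    print(pod, "->", claim)
```

Because the pairing depends only on the ordinal, a replica that is rescheduled onto a different machine comes back with the same name and the same disk, which is the "state follows the work" behaviour described above.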

Kubernetes abstracts away the physicality of the parts, the things sort of below the imaginary line. It might be that, in your own data center, you have network disks attached by iSCSI. If you run OpenStack, you might use the Cinder plugins. And on cloud, we have network attachable persistent disks. So all you do is say, Kubernetes, give me a volume. Or you have a volume provisioned by your administrator. And it is backed by whatever is available in your environment. So again, you don't have to write different code to run in different environments. So you end up with an application which is eminently portable because of that consistent API. The same is true of networking. How am I going to get incoming traffic? Well, it might be that I have physical load balancers that might require a network team or someone to program before they work in my physical environment. But in the cloud environment, I might just be able to make an API call, and then have that wired up for me. The same things are true of these abstractions for both storage and networking.

Now that comes in very handy when we get to step three in our three-step process, which is actually moving to the cloud. In the simplest case, we just take the Kubernetes configuration files that we've built, that we were running on our on-premise environment, we point our command line tool at a cloud environment, and we do exactly the same thing. It sounds flippant, but I actually sat down with a customer at home in London. And we said, let's get your app running on Google Container Engine. It took them longer to figure out how to get their laptop connected back to remote into their desktop from the conference room, and then figure out what environment variables they'd set wrong, which meant that none of the commands would run, than it did to actually get their environment running on the cloud. That's the magic of portability. Now, of course, there is more to it than this, because you need to synchronize your data, as well. And you generally need to have a hybrid environment where you're running across multiple places while that process is happening.

We're working on that with federated API servers, so we have the facility to have clusters that run in different places managed in the same way. And then it really just comes down to you to be able to handle the migration of data by whichever way makes the most sense for whatever it is that you're running. So with stateless workloads, you don't necessarily have any data to migrate. But when we start talking about storage, you do. So now I'd like to invite our guests from Samsung SDS, who are going to talk through a migration that they did. Bob Wise is the group's CTO. And he's going to talk a little bit about how customers decide when to move. And again, if you're a developer, this is something you can take back to your boss, and say, these are things that we really need to think about, and that will let me do these cool deployments. And then Lee Chang is a senior architect. And he's going to talk about the technical details of the particular migration that they did.

So thank you very much to Bob. [APPLAUSE] BOB WISE: I think there is a lot of good material in those slides. Hopefully, we'll be able to help you instantiate a bit by going through a real-life example that we've been working on. So Lee and I work for Samsung SDS, the Cloud Native Computing team. So what does that mean? I should say, first of all, even though I'm on the board of the CNCF, and our group is all about cloud native, there's something I don't like about the cloud native word, which is it has this strong implication that you have to build things from scratch, that they have to be natively engineered and designed for that kind of cloud native environment. And we don't believe that's true. And the example that we're going to talk about today is a great example of how is it that you can take existing applications and migrate them without a lot of re-architecture into this new world. So there's a few things we believe that are very important.

We are obsessive about automation. That's an important piece of being cloud native. We're very dedicated, and believe that open source is really where infrastructure is being built. And we're motivated by being involved in that community. Within Samsung Group, we work with various teams to help them with their transformations. But we also have an outside business. And Lee's going to have a little bit longer part of the talk. We're going to talk about one of our customer engagements. A couple of things we're going to talk about: I talked about cloud native a bit already. I want to introduce this term cluster ops, which you're going to start hearing more, which is this new term that's trying to capture that there are some new things that are happening, not just with the technology, but with the organization. How do companies organize properly around this? They will frequently need some help because this is a big transformation. As for what companies get out of this kind of cloud native work: you can use Kubernetes and containers to make much more efficient use of your infrastructure, and Google might be an example of a company that spends enormous amounts of money on its infrastructure.

But most companies spend more money on their people. And it's much more important to have great product velocity and high quality. The kind of infrastructure optimization is a less important priority. So cluster ops is three things. How do you manage your infrastructure? How do you manage your org? How do you manage your tooling? Deep embedded in that concept is the notion of an SRE. And now we're getting into these kind of more complex organizational topics that are part of what we help with. So we do executive consulting. We do professional services for migration. And we help companies adopt cluster ops. So we've been contributors involved in the Kubernetes community for a long time at this point. We were on stage with Google for the Kubernetes 1.0 launch. We made a very early bet. Why did we make that bet? Was it a good bet? So, yes, it was a great bet. Kubernetes has become the fastest-growing, best-supported orchestration system out there. It has the best options for multi-cloud.

Craig mentioned that. And it's open source. Open source is good, but it also comes with some commitment that's required in order to really adopt it well. And if you're going to use open source the way we do, deep industry participation is really important. So the way we do that is we're involved in helping run the Kubernetes SIGs. I'm on the CNCF board. We have contributors on the team. Now some of our customers don't have the time, the energy, the capacity to be engaged at that level. And this is a way that we help. So I'll also do a quick mention, if you go to our GitHub, we do all of our work in open source, as well. We have a project called K2, which is a lot of the provisioning management tooling that we built around Kubernetes. So what drives an enterprise-wide transformation? So Craig mentioned that if you're a developer, come talk to us. We can help you figure out how to do this. But probably most of our engagement– and that is help we provide.

How do you understand how to explain this to your boss? What's going to be the effect on the organization? Do you need validation that companies like Samsung are using this technology? Most of our engagements, though, go top down, where the C-suite is starting to wonder, well, why do I feel like my competitors are going faster than I am? Can we save money with container packing? Do I have a modern team? Do people really want to work here because we're using the best and most modern technology? And why is it that it seems like other people are growing faster as I read about it? So consultants will come in. It's pretty easy to find consultants that will come in and help you with this kind of problem. And most consulting groups will offer a couple of solutions straight out of the consulting 101 playbook. Build a massive parallel team. Let's help you hire some new leadership into your team. Let's completely re-architect the system. This was my slight objection to the cloud native term.

But we've taken a different approach. The way I would characterize our approach, generally, is as an ops-first approach. We frequently see agile transformation kinds of projects go a bit sideways because the development teams get into this really rapid pace. But then, because they can't actually get their applications all the way out into production, it all gets wadded up around that. So we prefer to think of operations as being the place to start with this kind of technology. We will talk about our engagement here a bit more. So the customer we're going to talk about today is a company called Zonar. They are in the industrial IoT space, have a long-running business, lots and lots of devices, lots and lots of customers. And again, Lee's going to talk about that a bit more. But to give you a sense of how a project like this really gets started from the executive level: the place that we decided would be a great place for us to engage is around a disaster recovery refresh in the light of public cloud options.

So we said, we'll analyze those options for you. We'll look at risks. We'll embed our team with your team to make sure we really understand what's going on. We'll build some small experiments. And then we'll re-plan based on doing that. So with that, Lee, why don't you come on up? And I don't know if you want to leave the slide here. Great. Thanks. LEE CHANG: Hello. Thanks, Bob. And thanks, Craig. I know we keep shifting around. So I hope you guys can keep track of who's who. So I know you all were listening really intently to what Craig talked about with regards to containers and running services in Kubernetes. Because what I'm going to be talking about– and convince you of, hopefully– is that, taking these concepts, you can actually do it in the real world. And what we did with Zonar, which is what Bob mentioned, was we analyzed– so I don't know if Bob talked about this. But what our project was was to do a disaster recovery refresh with Zonar. They had systems that were currently running.

They were looking at probably modernizing the systems– I should probably put this in my pocket– and also trying to see what they needed to do to add to their business continuity plan. One of the things that they wanted to bring to the table was potentially moving to the cloud. At Samsung CNCT, we have a lot of engineers on staff who've actually done a lot of lift and shift from on-prem systems into the cloud, VMs, containers. Like Bob mentioned, we've been doing Kubernetes since the get go. So it's been about two-some odd years that we've been working with Kubernetes, and also containers. We did containers, actually, a few years before that with Cloud Foundry. There could be arguments if that's containers or not. But that's moving small packets into the cloud. So we have a lot of experience with that. And Zonar, knowing that we were such subject-matter experts, brought us in, had us look over some of their systems that were running in their secondary data center– which was their disaster recovery system– and give them some options on potentially moving that system to the cloud– not all of it, maybe part of it.

But what we did was, like Bob mentioned, we had our teams embedded with their teams. This was a nice place to be because they were only two blocks away from our office in Seattle. It's not that we don't like traveling, but it's great to be able to just go down there. If you need to do deep dives, if you need to analyze their systems, it's just two blocks away. So we could go over there in five minutes. What we did was some POCs. We looked at their stack. We made sure that what we were going to do was always going to be feasible. So mostly what we were looking out for was, did they have any proprietary systems that would prevent them from moving to the cloud? So if they needed network equipment that was proprietary, or they needed systems that needed to be run on bare metal. Luckily, none of that was there. So it was a lot simpler story to say that we could move to containers. Since we are here to listen to people talk about containers and Kubernetes, the question wasn't if they could move to the cloud.

The question actually was, where can you run Kubernetes? Because we went through the steps. We figured that containers were a viable option for them. So where can we run Kubernetes? Kubernetes is in our DNA. We've worked with Google since day one. They were already using Google Container Engine. I forgot to tell you, this was about a year ago. So we followed that project closely. And we realized, this is probably the best place for them to move their systems for the secondary data center. Part of the benefits of using GKE– since we've been using Kubernetes for a while, we understand that with any cluster that you want to operate, there are operational loads, start-up costs, putting your staff through training to understand how you're going to run a new cluster. It's different than what they had on-prem. And the other thing is, what are you going to do for building a new cluster? They had businesses that were currently running, servicing customer needs, bringing in money. How are they going to extend that?

This is what they were trying to do with their new data center. They needed to extend that. They were getting more business. So how do you do that without increasing your capital costs? So Kubernetes, again, with GKE was a great solution for that. And another thing that they had– after many, many years, 15 plus years, of operating a data center on-prem, they had expertise to do that. So GKE is using Kubernetes underneath, and you can run Kubernetes any place, whether it be on-prem or in the cloud. So we can use Kubernetes running on GKE very easily. A little bit of context of what the customer was running. It was a classic N-tier system, three tiers. You had the web application tier. You had the middle tier, and a database back end. The database transactions, or the business transactions, were spread across multiple PostgreSQL database servers. They had about 100 database servers running. All their application systems, and middle tier and database systems were running on VMs. 90% of it was VMs, all running on bare metal in their data center, and also in their secondary data center.

They had a large installed base of head units. So their business was running head units, 450,000, 500,000 head units that were basically IoT systems. And what they did was take sensor data, telemetry data, and send that back through the internet to their data center to do the processing and do reports off of that. The traffic that was resulting from that was all run through their network appliances. So the load balancers, network routers, and switches were all hardware running in their data centers. And the majority of all their data that was landing on files was all stored on network-attached storage systems. And 200 terabytes of data was being stored remotely from their database servers in their on-prem data centers. So again, it was pointed out that their dev teams were excited about moving forward fast. Some of them had experience with containers. So there was a big drive to say, hey, if we're going to move forward, let's see if we can use containers in the new solution that we're providing for them.

So that was a great plus. We didn't have to do a lot of convincing on their side to get them to move to a container-type environment. So what we came up with, a model that we can go with– and I forgot my pointer. So I'm going to have to do a lot of hand waving here. Their current system, their current data center, and their secondary center, consisted of the head units– 400,000 to 500,000 head units that go through the internet and send their telemetry data to their current data center in Seattle. So they had their gateway. The load balancers were all physical appliances. Web servers running on bare metal, and VMs, and database servers. So the new system used the Google Direct Connect through our partner, which was a 10 GigE low-latency link, about 40 to 50 milliseconds since we're using the central location, connected to the Cloud Connect using cloud load balancing. If any of you have used Kubernetes and Ingress, you all realize, if you want services that need to talk to a Kubernetes resource, you need to use a load balancer.

So Cloud Load Balancer was a great solution for that. Container engines– since this is a GKE solution, we used Container Engine. Persistent disk to back any systems that needed stateful files. And we use Google Container Registry for all the Docker images that were built. It's a great product. It's co-located with your cluster, so you get really fast response times. And also we were able to leverage the IAM system, so the security was there built in. For the process that we did to move them from their current data center, or their secondary data center and their primary data centers, to the cloud, we used a lot of our own tools, called K2, K2-CI, which is our automatic delivery system. Those consisted of other tools. We took a lot of open source projects. We also used Google's Cloud CLI. And brought that up to do some lifecycle management. One of the biggest tools that we use from Google– and we got this sort of for free, free as in beer. You still have to pay for some of the transactions.

But we leveraged the ability to use Stackdriver for logging and monitoring. This actually contributed to reducing the operational load on the teams, in that they didn't have to create another monitoring system. Now there's going to be some swivel chairing. People are going to argue, well, you've got some swivel chairing going back and forth. But there are the benefits of having a system that's pre-built, that's running automatically– there's like a flag or some command line you could add to it– and being able to get all the feedback in real time. And also, monitoring was a big benefit. As mentioned before, automation, automation, automation is really important. We used a pipeline called CAIC– we need a new name for this. I'm sorry– K2-CI Application Delivery Pipeline to do our automated building, and testing, and deployment of applications to GKE. So what we started with was production-quality artifacts from Zonar, binaries that were already going to their production systems. We took any changes from there and put them into this deployment pipeline.

We could also use source code. But in this case, we used their binary artifacts from their artifact system and packaged that with Dockerfiles and Helm charts. So Craig talked about Helm charts back in his previous talk. We use Helm charts as the delivery mechanism. So everything was wrapped in Helm charts that are within the source code. Any changes there were reflected in the Helm charts. And actually, what Helm charts provide you is the ability to say, this is the state in which I expect Kubernetes to be running this application, which is really nice for us. We also use Helm charts to codify managed services outside of Google Container Engine itself. We have dependent resources that are running in Google Cloud. We didn't want to have a separate mechanism or a configuration system that we needed to run in addition to our pipeline system. So we actually codified and managed that in Kubernetes charts. So those usually included the cloud load-balancer solution, the networking cloud solution, and the cloud DNS solution that's provided by Google.

So those were all managed through Helm charts, which really simplified the pipeline itself. Database, so I know Craig, in his slide, had a big, red maybe on databases. We've been doing this for a number of years. Our teams have done a lot of lifts and shifts on open source databases, from MySQL to PostgreSQL. And the answer to this is, yes, you could do it. Previously, with VMs, it was a little bit more complicated, and sometimes more hairy with the wiring and everything you had to do in the back end systems. Actually, let me back up. So there were questions asked of us, why don't we use a hosted solution? Google announced their PostgreSQL solution today– yesterday. RDS is probably one of the biggest things that people know about. They have PostgreSQL running in RDS. So why don't we use them? It's a Cadillac solution. It gives you all the nice benefits of redundancy and automated failovers. The biggest reason why is the customer, Zonar, actually uses proprietary extensions. Their business, a lot of the domain logic, uses proprietary PostgreSQL extensions.

So asking their development team to refactor that, put that into the middle tier, was pretty much out of the question because they needed this project done in a relatively rapid manner. So hosting and using our own database was really the only path forward. So we said yes to the question, can we do the lift and shift for the database server? This caused consternation, a lot of debates. But we prototyped this and proved that it worked. And again, Kubernetes really benefited us in this. We used stateful services and Persistent Volumes using Persistent disks off Google. Google– I don't know if you all know this– the bigger the disk, the better IOPS, the better latency, and all that good stuff, which was really nice compared to products out there– compared to other cloud providers out there. So you have your on-prem system with the database running on bare metal, and your data landing on the network-attached storage. Or we had a stream of application traffic going through the internet over the big pipe that we had configured and provisioned with Google to Cloud Interconnect Load Balancer.

So that could talk specifically to one database server running in a pod as a stateful pod. And that guaranteed that the disk that eventually gets created with the data file– if that pod goes away, stateful service brings it up and does all the wiring for you, which made life really simple. When you first try to do these things, when you're showing the customer that this can take the load based on your calculations, it's not going to do that. So if it doesn't do it, five minutes later, spin something up with a bigger resource requirement. The quota gets increased, and boom. You can just change your WAL files. Everything just syncs up, hopefully, magically. But if it doesn't work, you could spend another day doing that. Versus a VM, where you have to reconfigure, rewire all the database files in the back end. So this really helped us show that the customer can run their workload over in the cloud fairly simply. The pipeline itself, now this is not a literal representation of our pipeline.

If I had that chart up here, even if it was landscaped, it would be wall-to-wall, a bunch of flows, and a bunch of arrows going all over the place. So this is a very high-level, probably 100,000-foot view. The way it worked was– I wish I had my pointer. So we took production artifacts from their systems on-prem. Any changes there, the pipeline picked it up, pushed that change through Docker Build into the Google Container Registry. That updates the tag inside the Helm chart. And there might be other changes to the Helm chart, but we'll simplify that and say there aren't. Once those changes are detected, we do a Helm install, the pipeline does. It goes into a staging environment. The staging environment is created on demand. So that way, if you need another test, you're not going to pollute that environment with the previous test. If the staging environment test passes, the conformance test passes, and if this is the one that goes to production, you do a Helm upgrade. And the reason why you do a Helm upgrade versus a Helm install is you don't want to affect any resources or objects that are already running on that cluster.
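As a rough illustration of the flow just described (build, push, install into an on-demand staging environment, then upgrade production), here is a minimal Python sketch that only lists the commands one run might issue. It is not the K2-CI pipeline. The image, chart, release, and namespace names are hypothetical, the `image.tag` chart value is an assumed convention (the talk describes the tag being updated inside the chart itself), and the Helm 3 command syntax is an assumption, since the talk predates it.

```python
from typing import List


def plan_release(image: str, tag: str, chart: str, release: str,
                 staging_passed: bool) -> List[List[str]]:
    """Return, in order, the commands one pipeline run would issue."""
    image_ref = f"{image}:{tag}"
    # A fresh namespace per run stands in for the on-demand staging
    # environment, so one test run cannot pollute the next.
    staging_ns = f"staging-{tag}"
    steps = [
        ["docker", "build", "-t", image_ref, "."],
        ["docker", "push", image_ref],
        # Install into staging (Helm 3 syntax, assumed).
        ["helm", "install", release, chart,
         "--namespace", staging_ns, "--create-namespace",
         "--set", f"image.tag={tag}"],
    ]
    if staging_passed:
        # Upgrade, rather than install, the release that is already
        # running in production.
        steps.append(["helm", "upgrade", release, chart,
                      "--namespace", "production",
                      "--set", f"image.tag={tag}"])
    return steps


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in plan_release("gcr.io/example-project/web", "abc123",
                            "./charts/web", "web", staging_passed=True):
        print(" ".join(cmd))
```

Returning the commands instead of executing them keeps the sketch runnable without Docker, Helm, or a cluster; a real pipeline would run each step and stop on the first failure.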

Helm upgrade only does changes [? inside opponent ?]. So it only makes changes to things that are actually changed within either the Helm chart, or the artifact itself. So if you have load balancers, you have stateful services. And if you have applications out there, it'll do a rolling update. But the load balancers and network connection DNSs, those don't get changed or get deleted if those changes aren't reflected inside the Helm chart. So some of the things that we ran into during this process, with any type of trying to understand applications from the outside, it could take a while to figure out the domain logic, also what replicating applications. One of the things that we also ran into– but this is not something that we've never seen before– is VMs could become bloated. You get monoliths at times. They had the classic monolith, where they had databases running– I'm sorry. In the VM, they actually had databases running their caching solutions. And they had web applications that were run, Nginx, Tomcat, RabbitMQ, and their Memcache systems.
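The upgrade behavior described at the start of this passage, where only resources whose definition changed are touched, can be pictured with a small Python sketch. This illustrates the idea only and is not how Helm is implemented; the resource names and specs are hypothetical.

```python
from typing import Dict


def changed_resources(current: Dict[str, dict],
                      desired: Dict[str, dict]) -> Dict[str, str]:
    """Compare running state with the chart's desired state.

    Returns only the resources that need an action; anything whose
    definition is identical is left out, and so left untouched.
    """
    changes = {}
    for name, spec in desired.items():
        if name not in current:
            changes[name] = "create"
        elif current[name] != spec:
            changes[name] = "update"
    return changes


if __name__ == "__main__":
    # Hypothetical state: only the web image tag differs.
    running = {"web": {"image": "web:1"}, "load-balancer": {"port": 443}}
    chart = {"web": {"image": "web:2"}, "load-balancer": {"port": 443}}
    print(changed_resources(running, chart))
```

In this example only `web` is reported, which mirrors the point made in the talk: the application gets a rolling update while the load balancer is left alone.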

We, with Kubernetes, were able to refactor that pretty quickly, put the applications in single containers, and just use pods to manage that, do all the wiring, and the networking for that. There's also some lack of management tools. Kubernetes is still pretty young. It's about two years old. So given any new project, there might be a lack of tools for lifecycle management. But we have K2 with the years of experience that we've had. We have K2, K2-CI. So we have those tools already built in-house. That leverages a lot of open source tools there. Some of the wins that Zonar experienced doing the [? post ?] process was they are able to use the pipeline in conjunction with GKE and really speed up their productivity, not just with the disaster recovery solution. They were able to get containers out faster, databases out faster, changing their schemas. Everything was just a lot quicker. GKE also reduced the operational load. I think I covered that a bunch of times. Not having to operate a cluster by hand, we were able to depend on Google's– or Zonar was able to depend on Google's Site Reliability engineers to do the control plane management itself.

And containerizing their applications, they experienced better scalability running on GKE. Scalability– and not only that, when disaster happens, and they move all their traffic to the cloud, they can now respond faster to production load, but they can also now be portable across different infrastructures. So they can run that container– kind of what Craig described– one application that's running in the container in the cloud. They could run locally on their own personal laptops using Minikube. Or they could also run on on-prem systems if they have a Kubernetes cluster running on-prem. And they can also run in other hosting solutions, and/or they could run another cluster themselves. GKE is really easy to start. Two command line commands, boom, you got your own cluster up and running. Do another Helm install. Helm install is available through the repo. And you could deploy another environment pretty much at whim. Some of the takeaways I hope you get from this, you can adapt any application stack– and I'm using the word any– to move to the cloud without major refactoring.

And some of the benefits that we went through– and I went through this really fast, but you can run GKE and on-prem in a hybrid model. So you don't have to lift and shift everything all at once. You can have services running in your on-prem and have services running in a cloud still talking to your on-prem as long as you have a nice direct connect provisioned. Or we actually have VPNs running. So as long as it's not full database traffic, you're able to actually run it in a hybrid model using GKE. Classic IT and Cloud Native can be on the cloud. So both systems, you can definitely run in the cloud. And originally, when we were first doing some cost analysis, this project versus them doing a data center refresh in their secondary data center, what would the costs be of refreshing their gear there versus building out in the cloud, using our services to do the lift and shift for them? They were actually saving money by bringing us in to help them do the lift and shift, and also paying for the Google services.

Thank you. [APPLAUSE] CRAIG BOX: Thank you. All right, thank you, Lee and Bob. So in conclusion, very quickly, three distinct changes we see here. Package your applications in containers. Make them work in clusters. And then deploy them to the cloud. Here's a tweet I saw a couple of days ago, which really sums this process up. So one guy, six months to move from an RPM-based deployment platform using Docker CI, Kubernetes, doing continuous deploy to Kubernetes. And took the deployment process for the VPN software from eight hours to three minutes. Thank you very much. [MUSIC PLAYING]


Interested in learning best practices for adopting containers? Migrating from VMs is a gradual process. In this video, Craig Box, Lee Chang, and Bob Wise explain how to take your existing workloads and move them, one by one, to cloud native, distributed apps.
