Google Cloud NEXT '17 - News and Updates

Google Container Engine and the path to cloud-native operations (Google Cloud Next ’17)

NEXT '17
(Video Transcript)
FABIO YEON: All right. Let's get started, then. Welcome. This is IO-314. My name is Fabio Yeon. I'm one of the engineers on the Google Container Engine team and a tech lead on the team as well. "Google Container Engine and the Path to Cloud-Native Operations," or as I like to call it, "the pi talk." Wow, no math person in the room. [CHUCKLES] All right, so let's start with the basics– containers. How many people here have used or are deploying containers in production? Wow, that's a pretty good number. How many of you have played around with containers, in general? All right, how many people have heard about containers? There, OK. So let's start, then, with some simple intro. What are containers? The most succinct explanation I've heard of a container is that a virtual machine virtualizes hardware; a container virtualizes a kernel. So what happens, then, is if you have a bunch of containers on the same machine, they're all sharing the same kernel. But the container runtime provides enough abstraction so each one of them thinks it has the machine to itself.

So, why are they useful? Well, in this little box called a container, you can put in just your application and its dependencies, and you're done. Compared to what you had to do to set up a deployment on a VM or on bare metal, you don't have to install an OS. You don't have to configure your networking. You don't have to configure and set up the dependencies. It's just you, your application, your binaries, in a little container image, and you're done. In many ways, this is what I call getting serenity in a sea of chaos. All right– whoops. Clicker. So once you start playing around with containers, you start running into slightly more complicated, complex scenarios. And this is where Kubernetes comes in and helps you out. Your little box– if you have one or two of them, sure, it's easy to get them up and running and deployed. Once you start getting more of those, then you start needing a little bit more help. Kubernetes helps you ensure that there's a place for your container to run, makes sure that the proper number of them are running, makes sure that the network packets intended for your containers get routed properly through the cluster and arrive at the appropriate place.

And, of course, if you use Kubernetes, Google Container Engine provides to you the best and purest Kubernetes experience you can get. Beyond just allowing you to set up a Kubernetes cluster easily, Google Container Engine strives to provide to you management capabilities that make it easy for you to maintain and keep your Kubernetes cluster running. So after that short little intro, let's dive in a little bit into what I'm going to call cloud-native management capabilities and show you some of the things that we do in Google Container Engine to help you be more productive, to care less about the machinations of your clusters, and to go back to what you want to do more, which is maintain your applications and your services on Kubernetes. "Change is the only constant in life." I like this quote for many different reasons. But it's because, in that little container box, you have been able to provide a level of stability for your applications and your services. Remember that around you, in the cluster, things may be changing continually, and you have to have a strategy to be able to manage some of those changes.

So just a quick overview of the agenda. Gonna talk a little bit about Kubernetes, the Kubernetes versions, and why version management is kind of important for you. Talk a little bit about node health– what it takes to maintain your node health and how to evaluate that. There's a little bit about operations logs, and then we'll talk about some strategies that we have come up with for successful operations of a cluster. So, keeping up with Kubernetes. Remember, Kubernetes is an open source project. As such, it has its own ecosystem, it has its own community, and it also has its own release cadence. Kubernetes strives to get one minor version release out a quarter. Now [CHUCKLES] a minor version is like a 1.4, a 1.5, or– sometime later this month, I think– a 1.6 is slated to come out. It seems like a small thing. It's a small number change. But because Kubernetes is still a fast-moving, fast-developing platform, every quarterly release tends to be a fairly large release, a fairly big release.

New features get launched, some existing features get upgraded from alpha to beta, beta to GA, and so forth. So every minor release of Kubernetes comes in as a big package. If you look at the release notes or change logs for every minor release, they are very comprehensive, very extensive. Beyond that, every couple of weeks patch releases come out. These may be bug fixes, security fixes, small improvements that were missed, and so on. In Google Container Engine, our goal is to provide to you the latest version of Kubernetes as quickly as possible– on the order of days, if possible, but if not, soon after. And we also provide to you a collection of versions, because we understand that not everything is perfect, and therefore sometimes you want to hold back a little bit. So we provide to you the latest version of Kubernetes, one older patch version, plus the latest version of every previous minor version that's been released. Now, I know that's a little hard to parse, which is why I put an example here.

Currently, the latest version of Kubernetes available is 1.5.3, and that is available. The previous version of that is 1.5.2, which is the previous patch release of 1.5. And then we try to provide the latest version of the 1.4 branch, which is 1.4.9, and the latest version from the 1.3 branch, which is 1.3.10. So why this collection of versions, why this particular subset of minor versions and not others? In Kubernetes, the master and the nodes can be running different versions of the Kubernetes binaries. And the official support statement– by the community and so forth– is that the master and the node can differ by, at most, two minor versions. So if your Kubernetes master is running at 1.5, the oldest version of Kubernetes that can be running on a node and still be considered supported is something in the 1.3 branch. When 1.6 comes out, that will go up to 1.4. Anything beyond that– meaning any version skew beyond those two minor releases– is considered out of support by both Kubernetes and Google Container Engine.

It goes into what is called best effort. Your node may still be running, it may still work. But the likelihood is that some functionality may or may not work and may break in the future. So Container Engine, remember, will automatically upgrade your cluster masters. This is one of the things that we offer to you when you create a Kubernetes cluster on Google Container Engine. Among the things that we do in managing a master is to upgrade it to ensure that it is at the latest version of Kubernetes. So, as an example, if six months ago you created a Kubernetes cluster on Container Engine at version 1.3, at some point your master was upgraded to 1.4. And if you didn't do anything and left your nodes at 1.3, you were still OK. It's one version difference. At some point in the near past, it was probably upgraded to 1.5. So now your master is at 1.5 and your nodes remained at 1.3. And then 1.6 comes out, and at some point in the future, when Google Container Engine upgrades your masters to 1.6– uh-oh.

If you haven't upgraded your nodes, you're now officially out of support. So even if you don't do anything else, because of the management of the master and the auto-upgrade that exists, you have to upgrade your nodes at least once every six to nine months to at least stay within the support band. Of course, we highly recommend that you upgrade more often to pick up some of the bug fixes and security fixes that may have shown up in Kubernetes. So we come now to the very first feature that we have on Google Container Engine to kind of help you manage some of this change: node auto-upgrade. Just like the cluster master, where we manage it for you, node auto-upgrade will monitor the nodes that you have in the clusters. And if we detect that one is, shall we say, older than the master, we will automatically trigger an upgrade of the nodes. It is the same exact logic that is applied if you upgrade your nodes manually. It can be enabled and disabled at any time. And just briefly– whether the auto-upgrade is enabled or not, you can always upgrade your nodes manually first.

If you want to try out, test out things, or try to get the latest version into your cluster sooner. Now, we kind of try to do this in a judicious manner. Patch upgrades, because they're typically bug fixes or security fixes, tend to be applied much more quickly. So if my nodes are at 1.5.0 or 1.5.1, for example, and I enable this feature, I will probably be upgraded to 1.5.3 fairly quickly. Because we understand that minor version changes in Kubernetes can be more disruptive– and even though Kubernetes and the community try very hard to make sure that they are not very disruptive, and that backward compatibility and all those things are taken care of– because we understand that sometimes they can be a little bit more disruptive, minor version updates happen at a slightly slower pace. So, for example, the 1.4 to 1.5 upgrades will happen at a later cadence than the patch releases. So how do you enable this thing? You can enable it in gcloud. When you create a cluster, it is in the gcloud beta channel set of commands.

The flag is --enable-autoupgrade. You can also do it when you create a node pool, and you can also do it in the UI at cluster creation time. Like I mentioned, you can also enable or disable this capability on a node pool at any given time. One thing to keep in mind: if you disable this feature while an auto-upgrade is running, it will run to completion. It just means that the next time we do a scan of your clusters, we will not auto-upgrade them. Now, let's peek a little bit behind the scenes at what Google Container Engine is doing for you when this happens, and go over the set of scenarios in case you want to opt out of this and do it yourself. So first thing, as I mentioned before, you can always trigger upgrades manually– it does not affect the settings for auto-upgrades. The first thing is that every week, Google Container Engine does a push, and we publish release notes. Among the things we publish in these release notes, along with bug fixes and feature announcements, is what versions of Kubernetes, if any, are made available to you as a user.
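
For reference, the flag he's describing looks roughly like this– cluster name, pool name, and zone are placeholders, and at the time of this talk these commands sat in the gcloud beta channel:

```sh
# Create a cluster whose default node pool has node auto-upgrade turned on.
gcloud beta container clusters create my-cluster \
    --zone us-central1-b \
    --enable-autoupgrade

# Or enable it on an additional node pool at creation time...
gcloud beta container node-pools create my-pool \
    --cluster my-cluster --zone us-central1-b \
    --enable-autoupgrade

# ...or flip it on or off later for an existing pool.
gcloud beta container node-pools update my-pool \
    --cluster my-cluster --zone us-central1-b \
    --enable-autoupgrade
```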

We also provide the same notification both in the CLI, via gcloud, whenever you list your clusters, and in the UI. Notice that there are two clusters listed in this particular output that have two different types of asterisks. The more asterisks there are, typically the more dangerous, the worse it is. In this case, it is telling you that one cluster is one version behind and the second one is two versions behind. I, unfortunately, did not have one that was three versions behind, but you can imagine that the warning would be even more severe on that one. And in the UI, of course, I think the upgrade-available notification shows up as well whenever an upgrade is available for your nodes. If you want something a little more programmatic, gcloud container get-server-config will give you all the current information about the potential configurations of Google Container Engine for a particular zone. In its output, you can see that the default cluster version is 1.5.3, and the valid master versions are 1.5.3 and 1.4.9, which is the latest version from 1.4, which we are still supporting at this time.
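
A quick sketch of that check, assuming a zone of us-central1-b:

```sh
# Show the default cluster version plus the valid master and node versions
# Container Engine currently offers in this zone.
gcloud container get-server-config --zone us-central1-b

# Clusters whose nodes have fallen behind show asterisked notes in this listing.
gcloud container clusters list
```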

And then your nodes can be any number of these versions. Notice that even though 1.2.7 is still available, it is only supported if you actually have your master on the 1.4 branch. If you have a master on the 1.5 branch of Kubernetes, the oldest version of Kubernetes that can be supported on the node is in the 1.3 branch. So thinking back a little bit– I mentioned that Google Container Engine manages your masters. One of the things that we do is upgrade them. You yourself can trigger an early upgrade of the master before we get to it, if you so choose. And the option is from the command line. It's a container clusters upgrade, and then you add the --master option. This tells the system that you want to upgrade the master. And then you can go ahead and trigger a manual upgrade of your nodes with the same command, except this time, you just take out the --master option. Why do we upgrade the master first? It's because the node can, at most, be at the same version as the master.
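
In command form, that master-then-nodes ordering looks something like this (cluster name and zone are placeholders):

```sh
# Upgrade the master first; nodes can never be newer than the master.
gcloud container clusters upgrade my-cluster --zone us-central1-b --master

# Then upgrade the nodes with the same command, minus --master
# (optionally limiting it to one pool with --node-pool).
gcloud container clusters upgrade my-cluster --zone us-central1-b
```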

You cannot have a node that is a version higher, even a patch version. So, in a hypothetical world where 1.5.4 were to come out today and be made available to Container Engine customers: if you want to try out 1.5.4 before we upgrade your cluster, you would have to upgrade your masters first to 1.5.4 and then upgrade your nodes in your node pools to 1.5.4 manually. And this is regardless of whether you have enabled auto-upgrade for your nodes or not. You can just do it yourself. Ah– so we've started doing an upgrade, or you've triggered an upgrade of the nodes, and you detect that something is going awry, something has gone wrong, and you need to stop. We have two sets of commands that are currently in the alpha channel of gcloud. The first one– actually, the second one in this list, but I think the more important one– is when you just want the upgrade to stop. You call gcloud alpha container operations cancel and give it the operation ID, which you can get from listing the operations currently running.

The operation list is probably going to say something like an auto upgrade or a manual upgrade– cluster upgrade or node upgrade, I think. You can cancel the operation. Cancelling an operation does not stop an ongoing upgrade of a node. So if a cluster has five nodes and you detected after the first one that something has gone south and you want to stop, and the second node upgrade has already started, then with the cancelling of the operation, that upgrade will finish, but the subsequent upgrades of the nodes in the node pool will not start. The upgrade will be canceled at that point. Now, your cluster is going to be in this kind of mixed-version mode, where a couple of nodes are on the new version and a couple of nodes are on the old version. So then you can do what's called a rollback. If you call gcloud alpha container node-pools rollback, we will then take those nodes that were at the newer version and roll them back to the previous version. Couple of things to keep in mind, as well, as you're doing this.
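
A sketch of that cancel-then-rollback flow, with placeholder names and assuming these commands are still in the alpha channel as described:

```sh
# Find the ID of the in-flight node upgrade operation.
gcloud container operations list --zone us-central1-b

# Cancel it; the node currently upgrading finishes, the rest are skipped.
gcloud alpha container operations cancel OPERATION_ID --zone us-central1-b

# Roll the already-upgraded nodes in the pool back to the previous version.
gcloud alpha container node-pools rollback my-pool \
    --cluster my-cluster --zone us-central1-b
```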

This is a recreation of a VM. So if you happen to have any local configuration or local data stored on those VMs, unfortunately, they do not survive an upgrade operation at this time. Healthy nodes, healthy clusters. Keeping your nodes healthy can be a little tricky. Nodes can go unhealthy for a number of reasons. You can run out of resources– maybe you over-provisioned stuff and used too much memory, too much CPU. You can run out of local disk. Maybe you have a configuration problem or a bug in the Kubelet that is causing it to crash. Maybe your workload is triggering a kernel bug, or, of course, network segmentation, network driver crashes, and so on and on. Some of these are auto-recoverable by Kubernetes, specifically the first two, where you run out of memory or resources. There are signals provided to Kubernetes– to the cluster master and the rescheduler controller on the master– and they are supposed to observe these things happening and say, oh, this node is overloaded. Let me just take away some pods from it and scatter them elsewhere.

So they're self-healing, for the most part. As long as you have enough resources in your clusters, there's nothing you need to do. But if your Kubelet starts crashing on you, or if your node became deadlocked because you happened to trigger a kernel bug– what then? This is where it starts getting a little more interesting. The second thing to remember about node health is that it is the master's evaluation of your node's health that is important. Why? Because it is the cluster master that determines what pods are scheduled where. So if your master thinks that your node is unhealthy– or vice versa– you may not schedule anything on it if the master thinks it's unhealthy, or, if the node is giving a false signal that it is healthy when it is not, we may try to schedule work onto an unhealthy node. So that's something to keep in mind. It's kind of important: the node's self-evaluation of its own health is less important here. It is what the master thinks of your node's health at that time that is important.

And, of course, repairs, like upgrades, are typically limited to recreations of a node. Oops, wrong button. So this brings us to a second feature on Google Container Engine to help you manage your cluster: node auto-repair. It is the semantic equivalent of the master repair that we already do for your cluster today. But this is where we are observing and monitoring your nodes. We are taking into account the master's view of your node's health state. We are also evaluating other signals around it. We take a look at the signals from the managed instance group backing your node pool. We also sometimes take a look at the signals provided by the node itself. With those signals– the sum of it all– we come to an evaluation of whether your node is healthy or not. Too many unhealthy signals will trigger a repair– in this case, a recreation of the node, in an attempt to get it back into a good state and have it rejoin the cluster. We also try to make sure that we rate limit this so that we're not doing it too often.

Now, why rate limit? It's because sometimes the node going unhealthy may be triggered by your workload. And if we start triggering this way too often, we may be in a situation where the node is never alive long enough to do any useful work. So we have to do some of the rate limiting as well. Currently in beta– as a matter of fact, I think it's rolling out as we speak, and so fairly soon you should be able to have this capability show up in the UI and in the CLI as well. How do I enable it? Just like auto-upgrade, the flag is --enable-autorepair. You can do this at cluster creation time, at node pool creation time. And, as a matter of fact, I would highly recommend that you enable both of them– auto-upgrade and auto-repair– to make your life a little easier when you run Kubernetes on Google Container Engine. One more thing before I go into the next slide, which I've [INAUDIBLE] here. You can also enable and disable this at any given time, just like for auto-upgrades.

You do a gcloud beta container node-pools update, and you can enable or disable auto-repair on your cluster– on your– sorry– on your node pool at any time that you want. So let's now peek behind the scenes and see a little bit of what Google Container Engine is doing on your behalf, and kind of give you an idea, if you wanted to do this yourself, of what kind of things you would have to do. So I mentioned that it is the master's view of your node's health that matters. So how does the master know what's going on in a node? The Kubelet periodically publishes a bunch of information to the master. One piece of it is its evaluation of its own node health, plus what's going on in its internals. All this information is available from the node object on your Kubernetes master. So from the command line, you yourself can get the same data by calling kubectl get nodes. Of course, this output gives a more formatted, shortened version of it. But if you look at the JSON or YAML output, you can actually get at the raw data.
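
Putting that together, a minimal sketch with placeholder names (the auto-repair flags were in the beta channel at the time of this talk):

```sh
# Enable auto-repair (alongside auto-upgrade) at cluster or node pool creation.
gcloud beta container clusters create my-cluster \
    --zone us-central1-b \
    --enable-autorepair --enable-autoupgrade

# Flip it on or off later for an existing node pool.
gcloud beta container node-pools update my-pool \
    --cluster my-cluster --zone us-central1-b \
    --enable-autorepair

# The master's view of node health, as reported by each Kubelet.
kubectl get nodes

# The raw node object, including the conditions array discussed later.
kubectl get node gke-my-cluster-my-pool-12345678-abcd -o yaml
```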

And in a little bit, in the demo, I'll kind of show you how I can look for this. So you're listing your nodes, and you've determined that one of them has gone unhealthy– not ready. So what are the steps now for you to try to get back to a good state? Well, the first thing you should do is cordon off your node. What this does is tell the scheduler, hey, regardless of what's going on in this particular node, don't schedule any further work on it. Next thing we do, we attempt to do a drain– try to do a graceful drain of all the pods, make sure they all go away someplace so that the node becomes empty. This may or may not succeed depending upon the state of the Kubelet and other things on your node. If your node had a hard deadlock because of a kernel issue, this may not work. But at least we try. The next thing is we list the managed instance group backing your node pool. And from that managed instance group, we have to identify the specific node and then tell the managed instance group, hey, please recreate that node.
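
Roughly, the manual version of what auto-repair does looks like this (node, instance group, and zone names are placeholders):

```sh
# Tell the scheduler to stop placing new pods on the bad node.
kubectl cordon gke-my-cluster-my-pool-12345678-abcd

# Try to gracefully evict the pods that are already there.
kubectl drain gke-my-cluster-my-pool-12345678-abcd --ignore-daemonsets

# Find the managed instance group backing the node pool, then recreate the VM
# from its existing instance template so it rejoins the cluster.
gcloud compute instance-groups managed list
gcloud compute instance-groups managed recreate-instances my-pool-grp \
    --zone us-central1-b \
    --instances gke-my-cluster-my-pool-12345678-abcd
```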

Because on Google Container Engine we have set up your cluster using instance group templates, a simple recreation using the existing template gets a node back into your cluster. Operations log– [CHUCKLING] as we start getting more and more capabilities on Google Container Engine to help you manage your cluster more easily, a question that keeps coming up is, like, hey, how do I find out when you guys have done something on my cluster? So the first thing that we have done right now is that we are augmenting the operations log for Container Engine to contain more information about when we do automated operations on your clusters on your behalf. Before, we had repair-cluster and upgrade-master operations available, and we're adding auto-repair-nodes and auto-upgrade-nodes as additional operation log types so that you can find out whenever we have done these things on your behalf. How do you get to the operations log? Well, gcloud container operations list will give you an ordered list of all the operations that have been done on your cluster.
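
For example (the zone and operation ID are placeholders):

```sh
# An ordered list of operations across the clusters in your project,
# including the automated repairs and upgrades mentioned above.
gcloud container operations list --zone us-central1-b

# More detail on a single operation: what was done and whether it succeeded.
gcloud container operations describe OPERATION_ID --zone us-central1-b
```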

Actually, I think it's beyond just your cluster– it's in your project, across your clusters. And then if you do a container operations describe on an operation, it gives a little more detail as to what we did and what the status was. Were we successful? Did we fail? And so on. Node pools and resource management– another favorite topic of mine. How many people here have heard about node pools or have used node pools? A handful. OK, good. So what are node pools? Node pools– the shortest definition is just a collection of nodes managed together. From a Google Container Engine perspective, what it means is that you can have multiple node pools in your cluster, each one defining a particular configuration. And when I say configuration, I mean machine type, machine resources, Kubernetes versions, scopes, and so on and on. Within a node pool, all the machines have to have the same exact configuration. They have to be the same exact machine type, they have to have the same exact Kubernetes version.

They must have the same exact machine resources. So if I ask for an n1-standard-4 with, let's say, two local SSDs, every single machine in that node pool will have the same configuration. But, of course, I can have multiple of these in a cluster, so this gives me an additional level of flexibility, as well, where I can have different machine types and mix and match these to meet my workload and my resource needs more carefully. And this is also a foundational unit for a lot of the management stuff that I talked about. So when, before, I was talking about enabling things at cluster creation or node pool creation– when you create a cluster on Google Container Engine, what is actually happening is that we're creating the master and so forth, and we are also creating a default node pool for your cluster. So all the options available for creating a node pool are available on your cluster creation as well. So let's take a look at some things that you can do with node pools that might be kind of interesting.

So can we switch to the demo machine, please? Ah, OK. So, gcloud container clusters list. And here I have my cluster that I created for this demo. It is currently running master 1.5.3, and– oh, wow, my master was upgraded since I ran this demo, so my nodes, right now, are at version 1.5.2. So now I have a notification that a node upgrade is available. It also tells me that my cluster is currently running and, of course, my master IP address. Let's take a peek underneath, then, at my node pools for this particular cluster. Of course, pedantically– it has to be plural. I have three node pools configured for this cluster. I have a main pool comprised of n1-standard-4s. I have something that I call a pre-pool of n1-standard-2s, and something that I call a lowmem pool, which is running a custom machine type– a custom-2-2048, which tells me that this is a two-core machine with only two gigs of RAM. Let's then describe the main pool– oops, let me pipe that. OK. n1-standard-4, the authentication scopes defined.

The initial node count. And, of course, under management, I can see that I have enabled both auto-repair and auto-upgrade for this particular pool. OK, nothing too interesting on this one. But if I go to my pre-pool, among the things that you'll notice on this particular listing is that it is comprised of preemptible VMs. Preemptible VMs, of course, are an option provided to you by GCP, where the costs for these are lower. But in return, the VM may be preempted out from underneath you with a moment's notice. But if you have workloads that are preemptible-friendly or can be preempted easily, this is an easy way for you to instantiate a couple of node pools in your cluster and leverage the lower cost that they provide. And then, of course, the final one, which is my lowmem pool. It is just like my main pool, except the difference here is that I'm running custom-2-2048 machines. So, like I said, low-memory, CPU-heavy kinds of workloads would benefit from this kind of node pool.
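
The commands behind that part of the demo look roughly like this (cluster, pool, and zone names are placeholders standing in for the demo's):

```sh
# The cluster, its master version, node version, and endpoint.
gcloud container clusters list

# The node pools behind one cluster, then the full configuration of one pool
# (machine type, scopes, initial node count, management settings).
gcloud container node-pools list --cluster my-cluster --zone us-central1-b
gcloud container node-pools describe main-pool \
    --cluster my-cluster --zone us-central1-b
```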

So previously, I have already pre-deployed a number of pods on this particular cluster. So let's take a look at all of them, make sure they're all running. So I have deployments that are very high-CPU. Some deployments that are just regular, main workloads. And I have something that I'm calling preemptible workloads– workloads that are eminently preemptible. With a little bit of trickery– and this is where I will cut and paste the command, because I can never remember it from memory– I call kubectl get pods and provide a JSONPath template that says, hey, fish out these little fields from your particular listing and then combine them to write this output. So in this view, you can see that all my pods are listed there like before, but now I'm also listing which node they are running on. And you can see these things scattered all over the place. My high-CPU workloads are running on a variety of different machines. My main workloads are running on a variety of other machines.
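
The cut-and-pasted command is along these lines– a JSONPath template that prints each pod next to the node it landed on:

```sh
kubectl get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
```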

But more importantly– and maybe not a good idea– I have some workloads that are running on my preemptible VMs. It should be OK, maybe, but certainly not ideal, because maybe they're not workloads that you want to run on a preemptible VM. So, how do we take care of that? Let's take a look, then, at a node and see what I can find out. So if I list the raw JSON object for a node, you will notice a couple of things here among the metadata. There's a whole section for labels. Some of these labels are applied on each node by the Kubernetes runtime, and they are identified easily by the kubernetes.io prefix. So things like machine architecture– amd64; machine OS– Linux. Down here– hostname, the hostname of the machine. You'll also notice a few of these that are actually set by Google Container Engine itself. They are prefixed with cloud.google.com. By default, every single node that we create on Google Container Engine automatically has the node pool name associated with it as part of these labels.

And, of course, the other thing there is this other label. On Google Container Engine, when you create a node pool, you are allowed to provide to it a comma-delimited list of key-value pairs that become part of your node labels. You can then use these in your pod spec in Kubernetes to ensure that those workloads, those pods, end up on a node that has this label on it. One more thing I want to show along here. This is another node. In this case, this happens to be a VM that is a preemptible VM. When you create a preemptible VM– a node pool, sorry– in Google Container Engine, we will automatically apply this label as well, so you don't have to do it. You can certainly add more, like I have done below, where I've added the pre label as well, in addition to the cloud.google.com one. You can use this label as part of a node selector on your pod spec as well. Oh, before I go on– I mentioned earlier about node conditions. If you want to explore a little bit more about doing the node health evaluation yourself, every single node contains this array of node conditions under the conditions list.
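
Two quick ways to poke at those labels yourself (the node name is a placeholder):

```sh
# Every node with its labels, including the GKE-applied ones such as
# cloud.google.com/gke-nodepool and cloud.google.com/gke-preemptible.
kubectl get nodes --show-labels

# Or pull just the labels out of one node's raw object.
kubectl get node gke-my-cluster-pre-pool-12345678-abcd \
  -o jsonpath='{.metadata.labels}'
```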

And there are a variety of these. Some of them talk about things like disk and memory pressure. And, of course, the bottom one is whether the Kubelet is ready or not. This is part of the signal that is pushed from the node to the masters periodically to let the master know what the status of the node is. And this is the signal that both the master and Google Container Engine use– well, one of the signals– to determine whether the node is healthy or not. So back to the demo. I have, then, the deployment YAML for my high-CPU workload. Of course, I have commented out, for demo purposes, the node selector. So let's go ahead and delete the comment tags and make sure that the node selector's there. I'm gonna go to my other workloads as well and remove the node selector comments. And notice that on this one, at least– while on the other ones I've used my custom labels that I've set up on each node pool– on this one, I just decided to use the one that Google Container Engine sets for preemptible VMs.
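
A minimal sketch of what that uncommented node selector looks like, keyed off the label Container Engine sets on preemptible nodes (the deployment name and image are made up):

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: preemptible-work
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: preemptible-work
    spec:
      # Only schedule these pods onto nodes carrying the GKE preemptible label.
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      containers:
      - name: worker
        image: gcr.io/my-project/worker:latest
EOF
```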

That might be kind of nice, because at this point, I'm not relying upon a custom label that I've set or the pool name or anything like that. I'm just using an automatically set label that Google Container Engine has put on it. So it's available on any VM that is preemptible. Let me go ahead and replace the deployments on Kubernetes– and the preemptible one as well. OK. kubectl get pods. A bunch terminating, a bunch running, a bunch creating containers. We'll let this run for a little bit, let everything kind of settle down again. Mhm. Wow, taking a little bit longer. OK, things seem to be settling in, settling down. Things are getting scheduled. Excellent. Ah, everything is running again. So if we go back to that somewhat complicated kubectl command that I had before, what you'll find here now is all of my high-CPU deployment is now running on my low-memory pool. My main workload is on my main pool. But more importantly than anything else, my preemptible-friendly workload is now all scheduled and running on preemptible VMs.

So with this, I was able to, shall we say, make sure that my high-CPU workloads end up on machines that have plenty of CPU and low amounts of memory– it saves me some money and ensures that the proper VMs are assigned for those. My main workload can be whatever; my generic set of workloads can go on the main pool. But more importantly, my preemptible workloads– the low-priority, low-cost ones– are all set up and now running on my preemptible VMs, exactly as I desired. A couple more things to keep in mind– there are more node selector options becoming available to you as part of the Kubernetes 1.6 release. So, for example, node selector right now is an exact match. One of the things that is coming up that might be kind of interesting, especially in this kind of scenario, is node anti-affinity. You may have workloads where I don't really care where they run, I just don't want them to run on this type of VM– a preemptible VM, for example. So that's one of the options that will come online– available in the 1.6 timeframe, I believe– where you can just state that in your pod spec and it will be applied when you deploy them.
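
A rough sketch of that 1.6-style anti-affinity, using the NotIn operator against the same preemptible label (the pod name and image are made up):

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: not-on-preemptibles
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Run anywhere except nodes labeled as preemptible.
          - key: cloud.google.com/gke-preemptible
            operator: NotIn
            values: ["true"]
  containers:
  - name: app
    image: gcr.io/my-project/app:latest
EOF
```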

So can we go back to the slides, please? All right. So, a couple of slides I put in here for you to reference. Feel free to take a look at them– how to list node pools, how to create the node pools that I've– whoops, sorry– that I've used for this demo. Notice the machine types, notice the --preemptible option, and the node labels– how to specify these on your node pools. And, of course, for assigning pods to node pools, it's the nodeSelector option in your YAML file. OK, some final notes. Upgrades can be disruptive. So to help mitigate them, you can use– the upgrades are per node pool. So you can use node pools to segment the upgrades of your cluster, so that you can have a canary-test kind of pool where you can try out newer versions. So if you have an existing workload that is running at version 1.5.1 and you want to try 1.5.3, you can spin up another node pool at 1.5.3, schedule some workload on that particular node pool, and make sure everything works fine before you let the rest of your cluster be upgraded.
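
The node pool creation commands referenced on those slides look roughly like this– names, sizes, and zone are placeholders modeled on the demo pools:

```sh
# A main pool of standard machines.
gcloud container node-pools create main-pool \
    --cluster my-cluster --zone us-central1-b \
    --machine-type n1-standard-4 --num-nodes 3 \
    --node-labels=pool=main

# A cheaper pool of preemptible VMs, labeled so pods can opt in.
gcloud container node-pools create pre-pool \
    --cluster my-cluster --zone us-central1-b \
    --machine-type n1-standard-2 --num-nodes 3 \
    --preemptible \
    --node-labels=pool=pre

# A low-memory pool built from a custom machine type: 2 vCPUs, 2048 MB of RAM.
gcloud container node-pools create lowmem-pool \
    --cluster my-cluster --zone us-central1-b \
    --machine-type custom-2-2048 --num-nodes 3 \
    --node-labels=pool=lowmem
```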

Consider having what I call one node of slack to ensure proper capacity. When we do upgrades, we take down one node, do the upgrade, bring it back up, go on to the next node, and so on. While the one node is down and being upgraded, it would be nice for you to have enough slack in your cluster so that whatever pods were running there can find a new home while the upgrade is occurring. Google Container Engine limits you to having one cluster operation running at any given time. Now, just to differentiate between Container Engine operations and Kubernetes operations– these are things like adding or deleting a node pool, updating a cluster-wide configuration, doing an upgrade or a repair, those kinds of events. All of those are serialized on Google Container Engine for various logistical reasons. So for the most part, this restriction is not a big deal, except if you're upgrading a fairly large cluster or node pool. And you can imagine that because we're doing the upgrade sequentially, it may take a little while for it to complete.

In the middle of it, if you have to do something on your cluster, go ahead and cancel the upgrade. It will stop at a good point. You can do whatever operation you have to do on your cluster, and then re-issue the upgrade command. It will just kind of pick up from where it left off, scanning the nodes that we have, skipping those that have already been upgraded, and just keep going on the rest of them. And, of course, we're continually trying to make progress and trying to minimize some of the disruptions. And we have more capabilities that we're going to be adding in the future for upgrades and repairs. Some of the best practices that we have found– pod disruptions are a reality. Try to prepare your workloads for them. There are a number of capabilities being added to Kubernetes that are gonna help you with that, so please do take advantage of them, and we'll keep integrating them into Google Container Engine. Do use deployments or replica sets for your pods. Make sure that you have sufficient redundancy.
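
One example of such a capability is the PodDisruptionBudget, which was in beta around this timeframe; a minimal sketch, with made-up names, that keeps at least two replicas of a workload up during voluntary disruptions like drains and upgrades:

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: main-workload-pdb
spec:
  # Never let voluntary disruptions take the workload below two ready pods.
  minAvailable: 2
  selector:
    matchLabels:
      app: main-workload
EOF
```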

The second point is actually kind of important. In the cloud-native world, please do not write set-once-and-forget configuration scripts. The mindset you need to have is the reconciler. You need to write reconcilers that read the current state; if it's not the desired state that you want, do some action to try to get there. And then, once that's completed, go to sleep and then start over from step one again. This way, whether it's a repair event, an upgrade event, or some other thing happening outside of the container, if you have some configuration that you absolutely need, you know that if it goes out of alignment for whatever reason at some point, it is a temporary thing that should hopefully autocorrect itself. And, of course, please use node pools. They let you organize your resources more efficiently in your cluster and allow you to have heterogeneous configurations– leverage different types of machines, different types of resources, and so on. A quick recap– change is constant. Please be prepared.
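
To make the reconciler idea from a couple of paragraphs back concrete, here is a toy sketch in shell– the label, desired value, and interval are all made up for illustration:

```sh
# Toy reconciler: keep re-asserting a node label we depend on, rather than
# setting it once and forgetting it.
while true; do
  # Observe: which nodes are not in the desired state (missing or wrong label)?
  for node in $(kubectl get nodes -l 'tier!=frontend' \
      -o jsonpath='{.items[*].metadata.name}'); do
    # Act: converge the node toward the desired state.
    kubectl label node "$node" tier=frontend --overwrite
  done
  # Sleep, then start over from step one.
  sleep 60
done
```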

Do enable auto-upgrade, and then try auto-repair. Operations logs– please look into them. Whenever we do something, we'll try to make sure it all ends up in the operations log for your clusters so nothing is hidden. Use node pools to organize your resources. And, of course, think reconciler, not set-and-forget. For those that may not be familiar, here are a couple of links for the Container Engine website and the Google Group where the release notes come out whenever we do a push on Container Engine. A lot of members of our engineering team are on Stack Overflow, so if you have a question that is Container Engine-specific, please use the Google Container Engine tag on Stack Overflow. And, of course, as always, there's the Kubernetes GitHub repository– highly technical, kind of complicated, but we're always looking for contributors for the Kubernetes project. And while you're here at GCP Next, I highly recommend going to the other container talks if you want to find out more about us.

IO205, I think, is this afternoon, and it's a good intro to other capabilities and features that are being added to Google Container Engine. And I also want to highlight IO307– even though the title says it's about Google Container Engine tips and best practices, it is actually a talk mostly about monitoring your Google Container Engine cluster. And with that, I want to say thank you for coming.

 


Adopting Kubernetes and containers helps organizations move toward cloud-native operations, but how do you manage it all once it's out there? In this video, Fabio Yeon walks through the features available in Google Container Engine for automatically monitoring, updating, and ensuring your cluster runs at peak efficiency.

Missed the conference? Watch all the talks here: https://goo.gl/c1Vs3h
Watch more talks about Infrastructure & Operations here: https://goo.gl/k2LOYG

