Google Cloud NEXT '17 - News and Updates

Spinnaker: continuous delivery from first principles to production (Google Cloud Next ’17)

(Video Transcript)
[MUSIC PLAYING] TOM FEINER: We've all been there. You're on call, in charge of an entire production system such as this. It's peak traffic on the East Coast, but you're on the other side of the world, sleeping like a baby. You have no idea what's about to hit you. In my nightmares, here's what I see. The pager goes off saying something like this. You wipe the sleep from your eyes, grab a computer to investigate. You're still half asleep and have no idea what's really going on. Traffic jams are forming all around the area. Users have no idea where to go to avoid traffic. Time to start debugging. Millions of people are counting on you. You open the CPU graph for this server group, as you know the service is CPU bound. And it looks normal. All the scaling did its thing. It was set to keep the CPU under 45%. And that's exactly what it did. The number of instances doubled when the CPU started to climb. So why isn't the service working? You feel a cold sweat forming.

This is not going to end well. All instances in this server group are currently completely failing their health checks. You connect to a sample server, and you see that it's at 100% CPU. Why? Pick a jstack, and nothing comes to you, can't see anything. So you restart the application, look at the logs, just to see the start-up sequence, see if you can find anything. But it all looks fine. The dependencies are loading. The application starts fine. But yet, once it's started, CPU jumps right back to 100%. A bad feeling creeps up your spine at this point. You connect to another server, and this one's completely idle. How come? You look around, and you can't even find the application running there at all. What's going on here? Was the CPU graph from earlier lying to me? CPU graphs usually don't lie. You check it again, and you realize, it's an average CPU graph. How did I miss that? So it appears, that some servers are running at 100% CPU, burning up, practically thrashing, not doing anything, while others are completely idle.
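The trap in this story, an average CPU graph hiding a fleet that is half pegged and half idle, can be shown with a few lines of Python. This is only an illustrative sketch: the fleet sizes and the 95%/5% cut-offs are assumptions, not figures from the talk.

```python
from statistics import mean


def summarize_cpu(per_instance_cpu):
    """Return the fleet average together with the spread the average hides."""
    return {
        "average": mean(per_instance_cpu),
        "min": min(per_instance_cpu),
        "max": max(per_instance_cpu),
        # Assumed cut-offs for "burning up" and "completely idle".
        "pegged": sum(1 for cpu in per_instance_cpu if cpu >= 95),
        "idle": sum(1 for cpu in per_instance_cpu if cpu <= 5),
    }


# Hypothetical fleet: 9 instances thrashing at 100%, 11 doing nothing.
fleet = [100.0] * 9 + [0.0] * 11
summary = summarize_cpu(fleet)
print(summary)
```

The average for this made-up fleet sits at the autoscaler's target, so the graph looks normal even though no instance is actually healthy. Plotting the minimum and maximum, or a per-instance view, next to the average would expose the split.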

So you ask yourself, am I in over my head? Your heart tries to pound its way out of your chest when you realize you have no idea how to fix this. Then, while looking at the logs, suddenly, instance gets terminated, thrown out. And you can't debug anymore. In the UI, you see even more instances committing suicide, instances which were just started a few minutes ago. This doesn't make any sense. The system's scaled up, only to kill itself in the loop? Come on. Is it time to admit failure? Wake up some other people from the team to help? You're supposed to know what you're doing here. You're a professional. You think for a minute. What on Earth could be causing this? And nothing comes to mind. OK, OK. So you pick up the phone, about to call for help. You're about to hit Send. Then it hits you. That's auto-healing. Auto-healing is like the immune system, it's supposed to replace unhealthy instances. But in this case, it's all of them, so it's doing much more harm than good.

You dig through the console to try to find out where to shut down auto-healing, wasting a few more precious minutes. You find another server which is still up and idle. Why is the application not running there? You keep combing the logs, trying to find– you're looking everywhere you can, till you finally find this. Apparently, each server, upon startup, connects to a configuration server to download this configuration. Without it, it's pretty much useless. So this configuration server, apparently fell over and managed to do two things at the same time. One, it's taking down this entire production system with it, turning a perfectly nice auto-scaling group into a set of zombie servers which are committing suicide. And two, is to give you a heart attack. Now that you understand the problem, you quickly fix the configuration server, scale up some more servers, and we're back to normal [SIGH] finally. You double check, just to make sure, and go back to bed, patting yourself on the back for saving the day.

You try to sleep, but find it impossible. Something's still bothering you. Did we really save the day? We were down for 20 minutes. 20 minutes with millions of users frustrated. Then you realize, your job is not to be a ninja, even though it's fun sometimes. Your job is to prevent these kinds of failures in the first place. You remember hearing once about the immutable server pattern, and a light pops up in your head. If this fleet of servers was immutable and completely self-contained, all of this would have been avoided. Baking eliminates the external dependencies such as this configuration server, improving the stability of the system, not to mention faster boot times. So you wake up in the morning and run off to work to start baking immutable images. [APPLAUSE] STEVEN KIM: There is an astronaut saying: in space, "there is no problem so bad that you can't make it worse." And it speaks to dealing with a level of complexity and pressure like in this story, where you can quickly feel like you're in sinking sand.

The complexity of rolling out software continues to increase. New architectures, micro-services architecture, the scope of deployment, multi-regional deployments, global deployments. And, as a response, what we've done is we've introduced measures and processes that slow us down. And I think, at this point, most of us don't think that rolling out to production hourly, daily, weekly, monthly, quarterly is a good idea, let alone feasible. And meanwhile, what's on the table is early and often feedback that we're doing the right things, frequent opportunities to pivot and change direction when we need to, and some side effects of doing more frequent roll-outs, such as the scope of change being smaller because you're doing it more frequently, so there is inherently less risk. So there's a lot of talk out there about continuous delivery. But we need to talk about what's really important in terms of first principles behind continuous delivery and not just the tooling.

And Google's been doing this a long time. We have made our share of mistakes. We have lessons learned. And based on that, we have a set of opinions. So we want to go to share those lessons learned with you, along with some tangible examples. Continuous delivery can't just be about better orchestration. It's one of the things, but that's not it. So my hope throughout this talk– all of you guys came and attended, because you care about this– my hope is to get us, the community, thinking about, and talking about, and implementing the right first principles as we go to take on continuous delivery. My name is Steven Kim. I'm an engineering manager out of New York City. And this is Tom Feiner. He's a systems operations engineer for Waze, if you didn't guess. And we're here to talk to you about continuous delivery, and a tool called the Spinnaker, which we want to walk you through. Spinnaker is an open source, multi cloud, continuous delivery platform developed at Netflix.

Google joined the project some three years ago, and, along with other members of the community like Kenzan, we open sourced in October, November of 2015. Google and Netflix, we found a kindred spirit in one another, in the way we go to approach the kinds of problems that Tom described earlier. So what I want to do today is– I want to share with you four first principles. And for each of these, walk through how they manifest at Waze to ground them a bit and give you some tangible examples. And then at the end, hopefully we'll have enough time we'll walk through a simple demo. The first principle is immutable infrastructure as Tom talked about. It means that once you deploy something, you don't change it. When there's change, you rebuild it up from the host up, or the container up, and then you tear down the old one altogether. How many of you guys have not heard of immutable servers or immutable infrastructures? OK, great. So it's widely out there. Why do we want to do this?

Why is immutable infrastructure important? We want our process of building, testing, deploying, and validating to be as deterministic as possible. We know that across environments there is a lack of hermeticity, where there are configuration drifts. And, as you go and deploy and promote from environment to environment, things change. And just because you know that it worked in one environment– whether it be your developer local workstation or a non-production staging or stress environment– there's still a lot out there through which things can go wrong from that promotion to the last production stage. So we want to minimize changes slipping in as we move up the runway. And here's the way it works. In practice, you bake. So every time you have a change, you bake an image using Packer or, for those of you who are using configuration management software like Puppet and Chef, you run that as much as you can. And then you bake it into an image. So in GCP, it's Google GCE images.

And then, at deploy time, you go ahead and dole those out into GCE instances. So for those of you who are from the container world, this seems really familiar, right? Because it's the same thing. You take your application binaries, your dependent runtime stacks. You define them in a Dockerfile, package up your configuration file into a Docker image using GCR build, which we announced earlier last week, or your favorite service such as Docker Hub or Quay. And you go ahead and create Docker container images. And then you go ahead and instantiate them using, for example, the Kubernetes scheduler into pods or whatnot. So there's some other benefits as well to immutable infrastructure. If you remember, one of the early messages about the public cloud was elasticity. Elasticity had to do with long-time window elasticity, that you don't have to go ahead and buy upfront and then eat the costs, but it also had to do with responsiveness to demand– spiky traffic, for example. Well, for those of us who might, for example, instantiate a VM instance, run your Chef or Puppet scripts on it, if that Puppet script takes 20 minutes or an hour and a half, you're not taking advantage of elasticity.

The demand has come, and it's gone. So there's some other side effects like that where you have faster start-up times. You can respond to demand much quicker in traffic spikes. And also, it opens up opportunities to take advantage of things like deployment strategies that are very particular to the cloud. So immutable infrastructure is an example at Waze that Tom gave. It was somewhat rhetorical, but I think you guys get the idea. The second practice is around the use of deployment strategies. There are a lot of tools out there that people can use to go ahead and deploy out the infrastructure and deploy changes to code. Jenkins is a popular one. But I started by saying it can't be just about orchestration. And deployment strategies is an example of this. So we're talking about deployment strategies that can specifically take advantage of some of what you get in the cloud. So let's walk through a couple. Whoop, wrong button. A simple first one is called Red/black, some people also call it Blue/green.

And it's the notion that you have a production cluster, V1, and then you have a V1.1, and you basically create a whole new stack running that. And after it's been validated through health checks, for whatever your criteria might be, you cut over traffic. The advantage here is that if something goes wrong, you can quickly cut back. There are some disadvantages as, obviously, 2Xing the amount of infrastructure and resources required to go ahead and do this. But this is a simple example where if something goes wrong, you can go out and very quickly cut back. An evolution of this is something that we're calling rolling red/black. And it's the notion that you can go ahead and define incrementality of traffic percentage cut-overs with validation gates in between. And so we can simply say 1%, 5%, 25%, 50%, and 100%, and between each cut over to the next percentage incrementality, you can go ahead and set a validation gate. So it might be a pipeline that you run which might be very robust.

Or it might be a simple smoke test, or a functional probe. And it also continues to carry the advantage of Red/black, which is if something ever goes wrong, rolling back is immediate. Another evolution from there is Canary. And Canary is a different way to do the validation gates in between the percentage roll-outs. But it takes a slightly different approach. Rather than using functional tests or probes or smoke tests, we go to look at a set of measurements which are much more common regardless of what your application is. So we look at underlying metrics. So obviously things like CPU and memory profiling, but also latency, response codes, error rates. And you take all these, and you create a weight by which you go and say, a certain Canary score is acceptable. And that Canary score also takes into account an acceptable deviation in the performance characteristics between your previous version and your current version. So here's what it means: test coverage is always a cat-and-mouse game.
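To make the two ideas concrete, here is a minimal Python sketch of a rolling red/black rollout whose validation gate is a weighted Canary score. It is not Spinnaker's implementation: the function names, the metric names, the 10% tolerance, the weights, and the pass mark of 80 are all assumptions chosen for illustration, and every metric is treated as "lower is better".

```python
from typing import Callable, Dict, List, Sequence

Metrics = Dict[str, float]


def canary_score(baseline: Metrics, canary: Metrics,
                 weights: Dict[str, float], tolerance: float = 0.10) -> float:
    """Weighted percentage of metrics where the canary stays within
    `tolerance` of the baseline (all metrics are lower-is-better)."""
    total = sum(weights.values())
    passed = 0.0
    for name, weight in weights.items():
        allowed = baseline[name] * (1 + tolerance)  # acceptable deviation
        if canary[name] <= allowed:
            passed += weight
    return 100.0 * passed / total


def rolling_red_black(steps: Sequence[int],
                      shift_traffic: Callable[[int], None],
                      gate: Callable[[int], bool]) -> bool:
    """Move traffic to the new group step by step, validating after each step.
    On a failed gate, send all traffic back to the old group and stop."""
    for percent in steps:
        shift_traffic(percent)
        if not gate(percent):
            shift_traffic(0)  # immediate rollback: old group takes everything
            return False
    return True


# Hypothetical measurements for the old version and for the new version.
baseline = {"cpu": 40.0, "latency_ms": 120.0, "error_rate": 0.010}
healthy = {"cpu": 42.0, "latency_ms": 125.0, "error_rate": 0.010}
degraded = {"cpu": 42.0, "latency_ms": 125.0, "error_rate": 0.020}
weights = {"cpu": 1, "latency_ms": 2, "error_rate": 2}

# What the new group reports at each traffic percentage in this made-up run.
readings = {1: healthy, 5: healthy, 25: degraded}


def gate(percent: int) -> bool:
    return canary_score(baseline, readings[percent], weights) >= 80.0


history: List[int] = []
completed = rolling_red_black([1, 5, 25, 50, 100], history.append, gate)
print(completed, history)
```

In this made-up run the error rate doubles once the new group takes 25% of traffic, the score drops below the assumed pass mark, and the rollout sends everything back to the old group before the 50% step is ever attempted.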

Test coverage is at 80% today. Every code commit will go to lower that test coverage. It goes in that direction. So with Canary, you can actually go ahead and say, well, no matter what we're rolling out, we're going to look at a different set of characteristics that are common to applications. So with that, let's go ahead and ask Tom to go ahead and talk about what that looks like at Waze. TOM FEINER: So 10 years ago, the cloud did not exist, at least not as it does today. Back then, most companies ran their own infrastructure. And, from what I've heard in this conference, some still do. What's up with that? It wasn't pretty then, and it isn't now. No matter how hard you tried, you could have sent API calls to your hosting provider 10 years ago asking for 500 servers, and they would have just dismissed you as some lunatic. Luckily, these days, it's possible. And, as I mentioned, it's part of Spinnaker. So, using deployment strategies, it's basically baked into Spinnaker.

And we use that all the time at Waze. All right. This is what it looks like. Oh, I'm sorry. In a large system, as Steven mentioned, it's usually impossible to test everything in advance. So we also use a Canary stage. It's not fancy like the one Netflix uses, but it does the job. It looks like this. Basically, you have a production system running with 100% of traffic flowing to it. You then run version two in a Canary and send a tiny bit of traffic to it. You then run a Canary analysis on each and every API of that server. So for each API, we test for errors, latencies, and number of requests. We perform a relative comparison against the production group. If it all works fine, we blow that up to a production scale and send 50% of traffic to the new group, but still 50% of traffic to the old one. We then run the Canary analysis again, this time expecting a very similar behavior between the two groups. If there's any significant difference, we can easily roll back by just shifting traffic back to the old group, and debugging, and then destroying the new one.
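A per-API relative comparison like the one Tom describes could be sketched as follows. This is not Waze's code: the API names, the field names, the 1.5x ratio, the error-rate floor, and the minimum request count are hypothetical. The point is that the canary receives only a sliver of traffic, so errors have to be compared as rates rather than raw counts.

```python
def error_rate(stats):
    """Errors per request; 0.0 when the group has served no requests."""
    return stats["errors"] / stats["requests"] if stats["requests"] else 0.0


def failing_apis(prod, canary, max_ratio=1.5, min_requests=100):
    """Return the APIs whose canary error rate or latency is more than
    `max_ratio` times the production value."""
    failing = []
    for api, canary_stats in canary.items():
        if canary_stats["requests"] < min_requests:
            continue  # too little canary traffic to judge this API
        prod_stats = prod[api]
        # Floor the production error rate so a perfectly clean API does not
        # make any single canary error look like an infinite regression.
        errors_bad = error_rate(canary_stats) > max_ratio * max(error_rate(prod_stats), 0.001)
        latency_bad = canary_stats["latency_ms"] > max_ratio * prod_stats["latency_ms"]
        if errors_bad or latency_bad:
            failing.append(api)
    return sorted(failing)


# Hypothetical numbers: production takes ~100x the traffic of the canary.
prod = {
    "/route": {"requests": 100000, "errors": 100, "latency_ms": 80.0},
    "/report": {"requests": 50000, "errors": 50, "latency_ms": 40.0},
}
canary = {
    "/route": {"requests": 1000, "errors": 1, "latency_ms": 85.0},
    "/report": {"requests": 500, "errors": 20, "latency_ms": 41.0},
}
print(failing_apis(prod, canary))
```

An empty result would let the pipeline scale the new group up to the 50/50 split; any entry in the list is a reason to shift traffic back to the old group.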

But if everything's fine, then we just disable the old group and eventually destroy it. STEVEN KIM: The third first principle is automation. So automation promotes a consistency. And a repeatability, a consistency, is important both in success and failure modes. When things are failing, you need a consistency of failure to be able to do root cause analysis. What really sucks is when it fails in different ways every single time. And success, a successful consistency promotes maturity of process that you can go and build on top of. And so it's about identifying and putting in mitigating measures, without losing velocity. So if you were, for example, to go ahead and write– a number of tools out there that does orchestration– and say, step one, step two, do this, and do a deploy. Something fails. You go, oh, that can't happen again. I'm going to go ahead and write a mediating measure so I can very quickly roll back and, say, write something else. And the way that this evolves, I said at the beginning of the talk that it slows process down, and it becomes very difficult to go ahead and improve if you're slowing down constantly, every time.

So it's difficult to improve if you haven't instrumented. It's difficult to– or less useful to instrument if you haven't automated. And it's hard to get to a consistency and a scale of information that gets interesting enough if you're not automating. So let me talk about what I mean by automation. So top level pipelines go out and define and execute for you your process on how you go ahead and release software. So, for example, find an image that you validated, looks good, go ahead and deploy using Red/black strategy, do some rudimentary smoke tests, wait a certain amount of period, go ahead and scale one down and scale it up– all the different deployment strategies that we talked about before. So this is good, but the automation also has to happen in a more fine, granular form underneath each of these steps. So let me show you. For that deploy to prod Red/black [? 1 ?] stage, here's what Spinnaker does. For each stage, there are a number of steps. And so it goes and deploys in default.

It shrinks the cluster. It disables it, scales it down. And then for each of those steps, for any one given step, there are a number of tasks that we go ahead and perform. So this is where it gets really tricky, determining the health of things. Do we need to go forward? Do we need to go back? Monitoring, doing necessary cache refreshes for us to be able to do our job. And then finally, for any one of these tasks, there are three cloud platform level operations that we have to go ahead and orchestrate, so dealing with load balancers, associations, disassociations, target pools, and removing instances. And for every single one of these minute actions– if something goes wrong– you need to be able to go out and auto-remediate. So, this is a lot that you would have to actually go and do to say, we have velocity and we have confidence. What allows you to go fast is confidence. And this is the amount of work that is done in Spinnaker to go ahead and promote that confidence. And you shouldn't have to build all of this yourself.

You should go and get behind an open source project, like Spinnaker, which I will go ahead and show you a little more later. And so, for automation, let's go and hear what Waze does in a little more detail. TOM FEINER: By the way, Waze really didn't want to go and build this stuff. We were really happy Spinnaker existed. And we've actually been users of Asgard for a few years back. And we seamlessly transitioned into Spinnaker from Asgard. So it's been great. And thanks for the Netflix guys for doing this, for open-sourcing. So once deployment pipelines are in place, some interesting things can start to happen. We basically can start using some fail-safe patterns to limit the blast radius of changes. For example, at Waze, we shard everything by geography, obviously. This allows pipelines to limit the blast radius of each upgrade. So, for example, we can upgrade, for a certain service, one part of the world completely using the Red/black deployments, then move to a few more parts of the world, and eventually, worldwide.

If you're lucky and your usage patterns are predictable, it's good practice to use time restriction in deployment pipelines. This allows us to say, for example, that a certain pipeline can run between six and 12 UTC only between Mondays and Wednesdays. So this ensures that, when an upgrade occurs, it affects as few users as possible. And if you're really lucky, you can try to find a sweet spot between the off-peak hours and the office hours, then you have staff caffeinated, and ready to go to solve any issue that might come up during this upgrade. STEVEN KIM: The fourth and final one, which I don't think gets talked about enough, is operational integration. And what operational integration speaks to is the notion that the process of how you go to release software needs to work hand-in-hand with how you operationalize that software. And that goes in both directions, so two simple examples. One is that, when you're releasing– build, release, and promote, and so forth– that process is gathering and generating a lot of information that is germane to the scope where you are doing root cause analysis or defect triaging.
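The time restriction Tom mentions (06:00 to 12:00 UTC, Monday through Wednesday) comes down to a small check before a pipeline is allowed to run. The sketch below is plain Python rather than Spinnaker configuration, and the constant and function names are made up for illustration.

```python
from datetime import datetime, timezone

ALLOWED_WEEKDAYS = {0, 1, 2}   # Monday=0, Tuesday=1, Wednesday=2
START_HOUR, END_HOUR = 6, 12   # window is [06:00, 12:00) UTC


def in_deploy_window(now: datetime) -> bool:
    """True if `now` (a timezone-aware datetime) falls inside the window."""
    now_utc = now.astimezone(timezone.utc)
    return (now_utc.weekday() in ALLOWED_WEEKDAYS
            and START_HOUR <= now_utc.hour < END_HOUR)


# 7 March 2017 was a Tuesday; 9 March 2017 was a Thursday.
print(in_deploy_window(datetime(2017, 3, 7, 9, 0, tzinfo=timezone.utc)))
print(in_deploy_window(datetime(2017, 3, 9, 9, 0, tzinfo=timezone.utc)))
```

Passing a timezone-aware datetime matters here: a naive one would be interpreted in the local zone of whatever machine runs the check.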

So, for example, while you're trying to go ahead and figure out why something went out, it's useful to go ahead and understand, oh, when we built that software, what are the commits that pertain to the change from the last deployment to the first? Here's the blast radius of the code and config change. What kind of validation tests were run? What were the results and output from that test? All this information that is generated at the time of doing your deployment becomes very, very useful and important when you're doing operational things. It goes in the other direction as well. So you have systems that are running. And the information that is gathered as your systems run and production are very, very useful as you're doing your process of doing deploys. A simple example, as we talked about, is Canary analysis. Canary analysis goes and looks at how your systems are behaving, and it could do it at point in time or statistically across time to go and determine what is the right validation gate to go ahead and say the system is behaving normally as we go to promote from one environment to another.

That make sense? And finally, release process inevitably crosses multiple teams. I know there's a lot of good aspirations out there about whether it's DevOps or project methodology like agile or whatever it might be, but your process inevitably has a lot of orchestration and coordination that's needed across people and teams. And so what that means is a Spinnaker pipeline, for example, can help you communicate and validate and record hand-offs that are necessary between teams as you do a release process. So to give an example of that, back to Tom. TOM FEINER: Thank you. We all know real life is messy and imperfect, especially when you're moving fast. During a post-mortem meeting a few months back, the team recognized a certain critical micro-service which doesn't have proper automated testing. It will eventually be written, but that will take time. In the meantime, we decided that we want a person, someone from QA, to run a sanity after each and every deployment to production.

Here's the problem. Deciding to do this is one thing, but actually doing this in a repeatable manner on every single deployment to production is easier said than done. People can simply forget, tasks get shifted around. Before you know it, recency bias kicks in, and we're back to square one. So here's what it looks like. What we're going to do here is that, basically, the developer commits code. That code starts a pipeline in Spinnaker which initiates a manual judgment stage. That manual judgment stage sends a notification to QA, asking them to approve this upgrade because we need resources from them to be there to run the sanity and make a decision. Once they approve– and they have a button to either approve or not approve, or they can just wait– Spinnaker takes over, runs a full Red/black deployment. And once that's done, another notification is sent to QA, asking to run the sanity. Then they have an option to either pass the sanity and finish the automatic deployment, or roll back.

And Spinnaker takes care of all of this. Basically, going forward just means deleting the old version, and moving backwards just means enabling the old version and destroying the new one. This has been very helpful for us. STEVEN KIM: So we've talked about four important ideas. And I know it's a lot to take in. And we've worked hard to make this sort of consumable and easy to adopt and create a ramp instead of a step. So let me walk you through a simple, basic demo that takes source code, and through Spinnaker, applying the key principles that we talked about, deploys and promotes something out to production. Can we have the demo PC, please? All right. Simple Go app. And really, it just goes and responds with, hello world, and the background color is yellow. So let's make a change– purple. I made it purple, right? TOM FEINER: Yeah. STEVEN KIM: So I'm going to go ahead and initiate my release process by pushing to the release branch and source. So what happens here is, first of all, right now we'll see that– what you're looking at is a Spinnaker– and we see a staging environment and a production environment.

And in the staging environment, if we go to what's out there, it says "hello world" in yellow. And also in the production environment, it also says "hello world" in yellow. So far, so good. And you'll also notice things like the version of the container, the tag on it, matches. It's the same thing that's been promoted. We've gone through this once. When I committed that code, what I have set up is a build trigger. This is not in Spinnaker, but you can go ahead and build your containers in any way. Docker Hub and Quay both have build triggers. And I set up a trigger in GCR build so that when there is a push to the release branch, I go ahead and trigger a build. And I can use the Dockerfile that's sitting in there, but I want to use a feature called GCR Build Steps which allows me to go ahead and build, among other things, lean containers. I just want to go ahead and take a quick segue to go ahead and talk about this a little bit. The notion is that instead of– a typical Dockerfile looks something like: from this base image, pull in this source, go ahead and build it, and then identify an entry point.

And then that whole thing happens in a container. One of the challenges there is that everything that you needed to build it stays with you as you go ahead and run it– so the SDK, dependent packages, all these things that you really don't need. It's a general notion that how you go out and build in a validation context is very different than how you might want to build for what you actually run in production. It's not really controversial. And so, what you ideally like to do is do it in two steps. One step, I want you to go ahead and build my binary. And then the second step, containerize just that binary. So that's what I've defined over here. If you guys want to look at more, you should go ahead and Google Container Builder. And what you'll see is I have the build that happened just now. And if I look at my Container Registry for that service, you'll notice that it's like four megabytes for that "hello world" container. Previously, I had built it using the standard method, and it is 241 megabytes.

So that's like 1 and 1/2 orders of magnitude we're talking about there. And it's because the Go SDK is in that container as well as all the different packages that I downloaded. OK, so back to our normally scheduled program. So in Spinnaker, I go ahead and manage the process to production in three pipelines– deploy to stage, validate, and then promote to prod. So the first stage, first pipeline, looks like this. The configuration is set such that every time there is a change to my Docker registry, a new tag appears. I want you to pull that down and trigger this pipeline. And it really has just one stage, deploy. And deploy– I'll look at the deployment details– sends it to a stack I designated as the staging environment. It deploys the container that triggered the pipeline. And it uses the Red/black strategy. And what that means is, once again, it goes and brings up a new [? assisted ?] Kubernetes. So it's going to be a replica set. And then once it's deemed to be healthy, it'll cut over the load balancer and disable the other one.

And you'll notice that I checked on the box that says "scale down, replace the server groups to zero instances." This is a non-production environment. I don't have to cut back in a hurry. Let's save on the resources. So that's running right now. And you can see, once again as I described before, for that one deploy stage, it's deploying, shrinking the cluster. Under each of those, you can go and see the individual tasks that are being managed to do this. So the next pipeline is called validate. And the validate pipeline is set to trigger when the deploy to stage pipeline completes successfully, the first one that I just showed you. And it has two simple steps. The first one is a validate stage. And this is what we call a run job stage in Spinnaker. And it basically allows you to point to an arbitrary container and run it. And whatever the exit status for that container is, it'll go ahead and proceed accordingly. And for me, it's actually just a simple– I think it's a curl command.

It goes and curls the endpoint that I specify, which is my staging endpoint. And if it comes back with a 200, it'll say it's a zero exit code. It'll continue. And then finally, at that point, I have a manual judgment stage which notifies somebody. And then that person has to go and say, yep, everything looks good, yes, or no. This is, by the way– I'll take a moment real quick– continuous delivery and continuous deployment. So continuous deployment is all the way to production automated. And continuous delivery typically entails somebody being a gate to go and say, yeah, this is ready to go. Finally– actually, let me see if that came through. Oh, yeah, here we go. So at this point, you'll notice that my staging environment– please be purple. Please be purple. It's purple. TOM FEINER: Yeah! [LAUGHTER] STEVEN KIM: And you notice that the container tags are also different, right? So it looks good. I'm going to go ahead and promote it. So when I hit Continue, my third and final pipeline will run.
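The validate stage described above, a container that probes the staging endpoint and whose exit status decides whether the pipeline continues, could be approximated in Python as below. Steven's job was, by his own account, probably just a curl command, so this is a stand-in rather than his script; the function names are made up, and the `fetch` parameter exists only so the logic can be exercised without a network.

```python
import urllib.error
import urllib.request


def exit_code_for(status: int) -> int:
    """Map an HTTP status to a process exit code: 0 only for 200."""
    return 0 if status == 200 else 1


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for `url`, or 0 if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def probe(url: str, fetch=fetch_status) -> int:
    """Exit code a run-job container would return for this endpoint."""
    return exit_code_for(fetch(url))
```

A container entry point would finish with something like `sys.exit(probe(staging_url))`, where `staging_url` is whatever endpoint the stage is pointed at. The pipeline then proceeds on exit code 0 and stops on anything else.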

And here's what that looks like. The configuration, as you might imagine, is set so that this pipeline will trigger when the validate pipeline completes successfully. And it'll do a couple of things. First, I find the image from the staging environment. I know it was good. I tested it. I want to use that instead of rebuilding or locating that image otherwise. And so I can give it coordinates to go ahead and say from the staging environment, take the image from the newest group that's out there and give me that image. I take that image, and I deploy a Canary. And for that deployment, I go ahead and specify the Canary environment. I don't use any strategies because I don't want to manipulate traffic in any way. I just want to bring something up and put it into the same load balancer. And I put up one instance. So now we have the old production as is, at full capacity. And then we have one instance of the new version in a Canary environment, both taking traffic from the same load balancer.

So what will happen at this point is your users are visiting and 1/9, or however much that percentage comes out to, will get the new version. And you'll start gathering data. And the idea is to start gathering telemetry around how well that's going. If that looks good, after the Canary is done, I want you to go ahead and notify somebody. They'll look at the Canary data and approve. And then at that point, they'll do a Red/black deployment to production. So it looks good. Let's go ahead and cut over the production environment using Red/black to the new version, which will disable the old one. But you'll notice, or I'll point to, that when I deploy to prod, I use Red/black, but I don't scale down the replaced server groups. So they're sitting there, not taking traffic because Red/black disabled them, but they're sitting there. Why? Because if something goes wrong, we want to be able to cut over, cut back, very, very quickly. And we have ad hoc operations where you can click a button, and then it goes and rolls back, sends traffic to the old one again, instantly.
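The "1/9" figure follows from putting one Canary instance behind the same load balancer as the production instances. A small worked example, assuming the load balancer spreads requests evenly across instances (the talk does not say how it balances):

```python
from fractions import Fraction


def canary_share(prod_instances: int, canary_instances: int = 1) -> Fraction:
    """Fraction of requests that reach the canary when every instance
    behind the load balancer receives an equal share of traffic."""
    total = prod_instances + canary_instances
    return Fraction(canary_instances, total)


print(canary_share(8))   # one canary next to 8 production instances
print(canary_share(9))   # one canary next to 9 production instances
```

So a share of 1/9 corresponds to eight production instances, and the "one out of every 10" mentioned later in the demo would correspond to nine.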

And then we tear down the Canary. And once again, we use locators that say the newest server group in the Canary environment, go ahead and tear that down– destroy server group stage. And at this point, Canary's dead. We have just old prod, new prod. Old prod is disabled. Let's not be too hasty. Let's force a two-hour wait because at worst, all it's doing is tying down resources. Then destroy that old prod. So at this point, if I look at where my pipelines are, my Canary went through. It's waiting for manual judgment. And if I go back to my clusters view, here's my production. And here is my Canary. So I don't know what to really expect because I've never done this with more than me. Everybody, pull out your phones. Can we go back to the slides, please? That is the production endpoint. So, if you please navigate to, what should happen is most of you should get a yellow page. And about one out of every 10 of you should be getting a purple page.
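The promotion pipeline walked through above can be summarized as an ordered list of stages. The sketch below is only a plain-Python outline of that sequence. The `Stage` class, the action labels, and the `run` helper are assumptions for illustration and are not Spinnaker's actual pipeline schema or stage type identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    action: str  # illustrative label, not a real Spinnaker stage type
    detail: str = ""


# Triggered when the validate pipeline completes successfully.
PROMOTE_PIPELINE = [
    Stage("find image", "find_image", "newest server group in staging"),
    Stage("deploy canary", "deploy", "1 instance, no strategy, shared load balancer"),
    Stage("approve canary", "manual_judgment", "notify a person and wait"),
    Stage("deploy prod", "deploy", "Red/black, keep replaced server groups"),
    Stage("tear down canary", "destroy_server_group", "newest group in canary"),
    Stage("grace period", "wait", "2 hours"),
    Stage("destroy old prod", "destroy_server_group", "previous prod group"),
]


def run(pipeline, approve):
    """Walk the stages in order; halt at a manual judgment that is rejected."""
    log = []
    for stage in pipeline:
        if stage.action == "manual_judgment" and not approve(stage):
            log.append(f"halted at: {stage.name}")
            break
        log.append(f"done: {stage.name}")
    return log


approved = run(PROMOTE_PIPELINE, approve=lambda stage: True)
rejected = run(PROMOTE_PIPELINE, approve=lambda stage: False)
print("\n".join(approved))
```

Keeping the manual judgment as an explicit stage makes it easy to remove later, once the team trusts the automated Canary signal enough to skip the human approval.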

So once you have it loaded up, can you guys raise your phones and point them toward me? TOM FEINER: Oh, yeah. It worked! STEVEN KIM: Oh, my god. TOM FEINER: Oh! Well, you can't see it, but– STEVEN KIM: All right. I'll go ahead and point that over to– TOM FEINER: Some are purple and some are yellow. STEVEN KIM: And so at this point, it's going to look good. Canary looked good. It showed up purple. I'm going to hit Continue. And it will proceed on to go and destroy the old prod. Is that pretty straightforward to you guys about how that would work? Now a couple of things that are important that happened there. It was all automated. There was every opportunity where we recorded every step that was taken. It followed a process that we defined. And, really, at the point where we get comfortable with things that can go wrong, we can start cutting out those manual approval stages. And we can get much smarter than this. This is a demo scenario. If you ask what Waze is doing with Spinnaker, or what Netflix is doing with Spinnaker, or a number of other organizations, they're doing far more sophisticated things.

Like, for example, Waze is using Spinnaker to go ahead and do fleet upgrades of base OS based on pipelines that go ahead and handle all of this for them. So, as I mentioned, we open-sourced in 2015. And we open-sourced with GCE and EC2 providers. As I said, Spinnaker is multi-cloud. And a Cloud Foundry provider came out there in beta at the time. And we had Red/black strategy and some basic support. In 2016, we were heads down on development, and we introduced the Kubernetes provider. The Azure provider joined in beta. Microsoft joined. OpenStack provider– and, really, those people from the community who are building these things, some of them are here in the room right now. So thank you very much. We added authentication and authorization support through OAuth 2, Google Groups, GitHub team, SAML, [INAUDIBLE]. We added much more robust support for load balancers, layer 3 and 4, layer 7 HTTP, internal SSL support. We integrated Stackdriver logging. So we're trying to make this operationally ready for you guys to go ahead and take on.

And support for things like autoscaling and autohealing. And also we're beginning to reach out into other popular systems in the ecosystem like Consul for doing things like discovery and health readiness. And in 2017, what we're working on is we're going to be adding an App Engine provider. And so that's actually pretty much done. We're ready to start taking alpha testers. So I'll tell you how to go and reach out to us later. We're beginning on formal support for Canary Strategy. That's available in the OSS realm. So Google and Netflix are working very closely on this. And we expect to be at a pretty good point in the latter half of this year with that. We're going to be introducing rolling Red/black strategy. That's also at a very good place already. Another major effort is declarative CD. And you can look at it as config as code, but it's really more than that. It's more of a holistic look at when we go and say, here's an application, so far we think about binary and config.

But also we want to go to make that bundle be defined as, and here's how you should deploy me, and manage me, upgrade me, and so forth. So what that means is that as you have Spinnaker in your organization, you can have application development teams who should be self-sufficient, define all their artifacts as code, check it into a repository, and have Spinnaker enact that on their behalf. Spinnaker is a complex system– many microservices, lots of configuration, and we recognize that this is a high friction to adoption. So in about a month, we'll be introducing formally versioned releases. And so we will run the full, synthetic transaction integration test for you to go ahead and say, here is the combination of different services that we have validated, and work for these providers. So you can look and know what you're getting. And also, we'll be done with a Spinnaker configuration management tool that allows you to install, upgrade, configure, and validate your Spinnaker environment as well.

There's a lot of documentation out there right now, but we think that this tool is going to be a necessity for people to really adopt and be successful with this. And finally, we're going to be filling in operational monitoring for the Spinnaker instance itself, so that if what's deploying your software might have performance issues, or something goes down, that somebody can be notified of that. And much more. Really, there's actually some really exciting stuff that you'll hear from us. It's not little things. It's pretty cool things. So is our website. And we develop out in the open, so if you go to the Spinnaker GitHub organization, you can see everything that we're doing. The best way to get engaged is our Slack channel. We are a community that's 2,000 strong– heavy participation. People are helping each other out. It's a great place to come, get introduced into the community, ask for any level of help. People are extremely friendly and helpful over there.

And also, just I'll mention as a shortcut, if you want to quickly get an instance of Spinnaker out for yourself to play around today, go to, search for Spinnaker, and you can, in one click, get an instance of Spinnaker up that you can go and start working with. There are also a set of code labs on that you can go ahead and look at. That's how you can go ahead and get started. [MUSIC PLAYING]


Read the video

Continuous Delivery (and Deploy) is a new look at how we should get our deployable artifacts into production. Google’s been doing this for quite some time with success, and we have opinions and corresponding best practices based on lessons we’ve learned. Steven Kim shares a set of first principles behind continuous delivery with tangible examples from Waze. He also demonstrates how users can take advantage of these first principles today using Spinnaker.
