Google Cloud NEXT '17 - News and Updates

ABCs of Google Container Engine: tips and best practices (Google Cloud Next ’17)

(Video Transcript)
[MUSIC PLAYING]

PIOTR SZCZESNIAK: I'm super excited to be here with you. My name is Piotr. I'm a software engineer at Google, and I have worked on the Kubernetes project and the Google Container Engine product since their early days. I'd like to show you how you can easily monitor, autoscale, and troubleshoot your applications running on GKE. A number of useful tools are part of Kubernetes itself, but to have a full-blown monitoring solution, we need more. So I invited my colleague Deepak to join us today. Deepak, please introduce yourself.

DEEPAK TIWARI: Absolutely, Piotr, thank you. So my name is Deepak. I am a product manager on the Google Cloud Platform team. Specifically, I work on Google Stackdriver, which is our DevOps tool. And I'm really excited to be here to talk to you about what's available for monitoring and logging, and for identifying and troubleshooting issues in your application: GCP-native, open source, as well as third-party commercial solutions. We're excited to be here.

Thank you.

PIOTR SZCZESNIAK: Thank you, Deepak. So I'd like to tell you a story of a team. This could be your team. The team worked really hard to build a product, and it's time for its launch. They have chosen to run the product on GKE, which is a great choice, and during the next hour you will see why. Let's start with a very quick recap. A pod is the smallest unit of deployment in Kubernetes. A pod consists of one or more containers that are scheduled atomically. The containers in a pod share an IP address, localhost networking, and inter-process communication. A deployment is a Kubernetes abstraction that helps you manage your application. It offers straightforward deployment, advanced updates, including rolling updates, and a self-healing mechanism. A node is a machine where containers run. When using Kubernetes, you shouldn't have to care about it: Kubernetes manages your containers rather than machines. A node can be either a virtual machine in a cloud environment or a physical machine when you run on-prem.

And kubectl is a command-line client tool that allows you to perform all operations on a Kubernetes cluster, like managing your pods. So let's get back to our team. The code is written. The initial version of the project is ready. It's high time to introduce it to the world. So let's go. To run the application, we will use the kubectl run command. The only mandatory parameters are the deployment name, which is bar-app in our case, and the container image to run. We also specify optional parameters here. We set the CPU request to one core, which will be the guaranteed amount of resources for the pod. This also helps the scheduler make an optimal decision when choosing a node for our pod to run on. The CPU request is also required by the Horizontal Pod Autoscaler, and I'll explain the reason later. The expose flag will end up with the service being available externally via a public IP address. In GKE, this means that a Google Cloud load balancer is created underneath. The port is the default HTTP port. OK, so the deployment was successfully created and exposed externally, which means that our app will be available on the internet.
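As a rough illustration of what that command asks the cluster for, the sketch below builds comparable objects as plain Python dictionaries. It assumes today's apps/v1 Deployment schema rather than whatever kubectl generated in 2017. The image name example.com/bar-app:1.0, the app label key, and the explicit port 80 (the default HTTP port) are placeholders that the talk does not give.

```python
import json


def deployment_manifest(name, image, cpu_request="1", port=80, replicas=1):
    """Build a Deployment object (apps/v1 schema) as a plain dict."""
    labels = {"app": name}  # assumption: label key chosen for this sketch
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        # The CPU request is what the scheduler reserves for the pod.
                        "resources": {"requests": {"cpu": cpu_request}},
                    }]
                },
            },
        },
    }


def service_manifest(name, port=80):
    """Build a Service that exposes the pods through a cloud load balancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",  # external IP, as described in the talk
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


# Hypothetical image name, used only for illustration.
deployment = deployment_manifest("bar-app", "example.com/bar-app:1.0")
service = service_manifest("bar-app")
print(json.dumps(deployment, indent=2))
```

The Service selector has to match the pod labels of the Deployment; that link is how traffic from the load balancer finds the application's pods.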

How can we access our application? We need to know details about the created service. This can be done with the kubectl get service command. We can see some information about the service bar-app. The most interesting for us is the external IP address. We can see it here. The application is publicly available. We have some users. And they are happy. It's easy. We have more and more users. The team is happy. The management is happy. Everyone is happy. But one day something bad happened. Users started complaining that our service doesn't work. The team has to investigate the problem as soon as possible. OK, so let's do it. The first obvious idea that comes to mind is to check the application logs. We can do it with the kubectl logs command. The command allows you to get the most recent logs from the container's stderr and stdout. In our case, we can see that not every HTTP request succeeded. The application was not able to handle some of them. So the hypothesis is that we have some performance issues.

To confirm that, we will check the resource usage of the application. We can use the kubectl top command, which can display pod-level and node-level resource usage. Indeed, the CPU usage is high. As you may remember, we requested 1,000 millicores to be available for the pod. So this means that the CPU is saturated. In such a situation, Kubernetes will try to throttle the pod. The diagnosis is clear. We just need more resources. We can either request more resources for the pod or create yet another instance of the application. The second option is better for us since it also provides more reliability. For example, when a node where one instance of the application is running goes down, we'll still have another instance running on a healthy node. In Kubernetes, we can easily add new instances of the application by using the kubectl scale command. We need to specify the deployment name and the target number of replicas. The new instance will be automatically detected by the load balancer, and traffic will be sent there. We may want to check the status of our application.

You can see that we have two instances and both are available. So, problem solved. The team is happy. They were able to resolve the problem. But Susan, the lead engineer, is wondering whether they need to scale the application manually all the time. Of course not. To decrease operational overhead, Kubernetes offers autoscaling features, which are designed to handle traffic fluctuations. One of them is the Horizontal Pod Autoscaler, which is the answer to the problem we just resolved. Let me explain how it works. The idea behind it is fairly simple. There is a Kubernetes controller that observes the application. When it notices more users of the application, it adds more instances in order to handle the traffic spike. Of course, it also supports scaling down, where fewer users just means fewer pods. Scaling down helps us save compute resources. The Horizontal Pod Autoscaler, also known as HPA, has other capabilities. It periodically adjusts the number of application replicas based on demand.

It ensures that the average CPU utilization of the application pods meets the target value. HPA works within defined minimum and maximum boundaries. And it supports a number of Kubernetes abstractions that can be resized, including replica sets and deployments. OK, how do we use the Horizontal Pod Autoscaler? We can set up autoscaling for our deployment by using the kubectl autoscale command. We specify the deployment to be scaled, which in our case is bar-app, the minimum and maximum number of replicas, which are 1 and 10, and the target CPU utilization, 70%. As you may remember, we set the CPU request to 1,000 millicores. HPA will try to keep the average CPU utilization of our deployment around 70%. This means that when instances of our application use more than 700 millicores of CPU on average, the controller will scale the deployment up, in the same way that we did manually. This prevents a situation where a pod saturates all available CPU. And here are some rumors from the Kubernetes community.
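Before moving on, the HPA arithmetic from this part of the talk can be sketched in a few lines of Python. This is a minimal illustration rather than the controller's code: it applies the scaling rule from the Kubernetes documentation, desired replicas = ceil(current replicas × current utilization / target utilization), and clamps the result to the configured bounds. The function name and parameters are made up for this sketch, and the real controller also applies tolerances and delays that are omitted here.

```python
def desired_replicas(current_replicas, avg_cpu_millicores, cpu_request_millicores,
                     target_utilization_pct, min_replicas, max_replicas):
    """Return the replica count that brings average CPU utilization near the target.

    Utilization is usage divided by the CPU request, which is why the
    request has to be set for HPA to work.
    """
    # ceil(current * (usage / request) / (target / 100)), using integers only
    numerator = current_replicas * avg_cpu_millicores * 100
    denominator = cpu_request_millicores * target_utilization_pct
    desired = -(-numerator // denominator)  # ceiling division
    # Stay within the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))


# Values from the talk: 1,000 millicore request, 70% target, 1 to 10 replicas.
print(desired_replicas(1, 950, 1000, 70, 1, 10))   # one saturated pod
print(desired_replicas(10, 1000, 1000, 70, 1, 10))  # already at the maximum
```

With a 1,000 millicore request and a 70% target, any average above 700 millicores pushes the result above the current replica count, which matches the threshold described above.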

The community is currently working on introducing two more autoscaling features. The first feature is the ability to scale an application based on signals from inside it, like the number of HTTP requests or the average request latency. This is important when the resource consumption of a deployment doesn't translate one-to-one to the need for scaling. Another feature that the Kubernetes community is working on is the Vertical Pod Autoscaler, which, instead of tuning the number of replicas, will be able to adjust the resource requirements of the application. This is important when an application doesn't scale horizontally. Once the features are available in [INAUDIBLE] of Kubernetes, they will, of course, be available in GKE. OK, so we solved the problem. And we automated the process so we can feel safe. Everyone is happy again. But again, something bad has happened. Let's open the panic room. Our application was supposed to be autoscaled. Let's check its status. We can see that, indeed, it was scaled up to 10 instances, but only six of them are available.

We can learn more about the application by using the kubectl describe command. The output of the command is usually a huge blob of various kinds of information. You can see only a piece of it here. What is most interesting for us is the events section. There we can see that our pod failed to be scheduled. The message says insufficient CPU, which means that no node has enough free resources to fulfill the requirements of the pod. We have two options. We can either free up some resources by deleting unneeded or lower-priority pods, or we can just add more resources to our cluster. We live in a cloud world, so of course there is no need to call our technician or, let's say, order new servers. We can just go to Google Cloud and request more VMs. But can we automate this process? At Google, we love automation. So that's why we introduced Cluster Autoscaler. The idea behind it is fairly simple.

Cluster Autoscaler increases cluster size by adding new nodes to the cluster when an unscheduled pod appears. To be more precise here, the nodes are VMs in Google Compute Engine. In a similar way, when Cluster Autoscaler sees unneeded nodes, it will remove them. As I mentioned, the idea is simple, but the implementation is not. For example, scaling down is an NP-hard problem. There are also a number of other hard technical problems to solve there. If you are interested in this area, please join the Kubernetes community; the work around Cluster Autoscaler is open to everyone. Cluster Autoscaler is currently in beta, but we've gotten a lot of good feedback from users and have made significant performance improvements that will be coming with the 1.6 release of Kubernetes. Let me show you how scale-up works. So a new, unscheduled [INAUDIBLE] pod appears in our cluster. Cluster Autoscaler will perform a scheduling simulation to check whether adding a new node will help with scheduling this pod. If the answer is yes, it will send a request to Google Cloud to create a new VM, which will be adopted by Kubernetes and become yet another node in the cluster.
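The scale-up decision just described can be sketched as follows. This is a deliberately simplified model, not Cluster Autoscaler's implementation: it looks only at CPU in millicores, while the real scheduling simulation considers many more constraints. The function names, node names, and numbers are hypothetical.

```python
def fits_somewhere(pod_cpu, free_cpu_by_node):
    """True if at least one existing node has enough free CPU for the pod."""
    return any(free >= pod_cpu for free in free_cpu_by_node.values())


def should_add_node(pod_cpu, free_cpu_by_node, new_node_cpu):
    """Decide whether adding one node of the given size would help a pending pod.

    Returns True only when the pod cannot be placed on any existing node
    (the "insufficient CPU" case) but would fit on a fresh, empty node.
    """
    if fits_somewhere(pod_cpu, free_cpu_by_node):
        return False  # the scheduler can place it without new capacity
    return new_node_cpu >= pod_cpu


# Hypothetical cluster: free CPU per node, in millicores.
free = {"node-a": 200, "node-b": 500}
print(should_add_node(1000, free, new_node_cpu=4000))  # pod needs a new node
print(should_add_node(400, free, new_node_cpu=4000))   # pod fits on node-b
```

Checking the "would it fit on a new node" condition matters: if the pod requests more than a whole new node offers, creating a VM would cost money without making the pod schedulable.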

Then the scheduler will retry placing the pod, this time on the new node. Scale-down is slightly more complicated. So let's imagine that we have a cluster. We have some pods running there. The state of the cluster changes over time. Some pods are created or deleted, either manually or automatically by the Horizontal Pod Autoscaler. So we can see that a pod from Node B was deleted. And a pod from Node A was deleted too. Yet another pod from Node A was deleted. So we ended up in a state where the cluster has pretty low utilization. Cluster Autoscaler observes the cluster state all the time. It will detect such a situation and perform a simulation to check whether it's possible to safely remove a node. What I mean by safely remove is that all pods from the node will find a spot on other nodes. In this case, the answer is yes. The green pod from Node C can be rescheduled to Node A. So Node C is empty now, and it can be safely removed. So the benefits of using Cluster Autoscaler are clear. It handles growth in the service's popularity, or just a temporary traffic spike, automatically.
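The "safely remove" check from the scale-down walkthrough can be sketched like this. It is a CPU-only, first-fit heuristic written for illustration, not the algorithm Cluster Autoscaler uses; because packing pods onto nodes is a hard problem, a simple heuristic like this one can answer "no" even when some placement exists. All names and numbers are hypothetical.

```python
def can_remove(node, pods_by_node, capacity_by_node):
    """Check whether every pod on `node` fits on the remaining nodes.

    pods_by_node maps a node name to the CPU requests (millicores) of its pods.
    capacity_by_node maps a node name to its total CPU (millicores).
    """
    # Free CPU on every node except the removal candidate.
    free = {
        name: capacity_by_node[name] - sum(pods)
        for name, pods in pods_by_node.items()
        if name != node
    }
    # Place the largest pods first; each goes to the first node with room.
    for pod_cpu in sorted(pods_by_node[node], reverse=True):
        target = next((n for n, f in free.items() if f >= pod_cpu), None)
        if target is None:
            return False  # this pod has nowhere to go
        free[target] -= pod_cpu
    return True


capacity = {"A": 2000, "B": 2000, "C": 2000}
pods = {"A": [1000], "B": [1000, 500], "C": [500]}
print(can_remove("C", pods, capacity))  # the single pod on C fits on A
```

An empty node passes the check trivially, which corresponds to the final step in the talk where Node C has no pods left and can be deleted.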

It saves money when the cluster is underutilized. So up until now, we've been troubleshooting with the command line. But some of us, like Steve, an engineer on Susan's team, are wondering if there is another way. It's 2017. We are migrating to the cloud world. Can we migrate from an old-fashioned command-line tool to a new, shiny UI? No problem. Kubernetes provides a great dashboard, which allows you to do, for example, all the troubleshooting that we've done so far. And don't be afraid, there is no plan to sunset great, old-fashioned kubectl. So we can see the Kubernetes dashboard, which offers a view of events within the cluster. You can see logs there, and metrics, which is an area where the Kubernetes dashboard is better than kubectl. You can see some visualizations, but only for the last couple of minutes. So we've seen a lot of useful tools so far, such as real-time monitoring, real-time logging, and support for basic troubleshooting. But to have a full-blown monitoring solution that is sufficient for production use, we need more.

Deepak can tell you what your options are.

DEEPAK TIWARI: Thank you, Piotr. So I totally agree with Piotr on both counts. First, I totally love GKE. It provides you native, built-in solutions and tooling if you want to look at how your application is performing, identify an issue when it's happening, and set up autoscaling for nodes and clusters. And that's very, very useful. But we also know that when you are running in a production environment, when you're running a large number of nodes and clusters, you sometimes need more powerful tools and more capability for monitoring, for logging, and for running your operations. So what are some of these capabilities? These capabilities are things like historical monitoring and logging. So you might not want to look at things as they're happening, but at the data from, maybe, this Monday morning. Or you might want to see something that happened on Friday, or maybe there was a spike a week ago and you want to see how this week looks compared to that.

So historical data, time series data as well as logs: you want to look at that and start to compare things. Also, you want to do advanced troubleshooting. If you're looking at a particular handler, or a particular machine, or a particular application, you want to be able to do some sophisticated queries so you can narrow down to the particular container, pods, nodes, or application that you were looking at and see what was happening there. And then, of course, visualizations. You know, there are great tools available in GKE, but more visualization, like charting, distribution metrics, and things like that, would be very, very useful. And finally, setting up alerting. So when you are looking at these metrics, looking at your log data, you might want to create metrics ad hoc on that and then set up alerts so that things, errors and exceptions, are [? pilot ?], and whenever those things happen, you get paged or alerted on that.
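The last idea, deriving a metric from log data and alerting on it, can be illustrated with a small, self-contained Python sketch. It does not use any Stackdriver API; the log entry shape (timestamp in Unix seconds, severity, message), the function names, and the threshold are all assumptions made for this example.

```python
from collections import Counter


def errors_per_minute(entries):
    """Turn log entries into a simple metric: ERROR count per minute.

    entries is an iterable of (unix_seconds, severity, message) tuples.
    The result maps the start of each minute to the number of errors in it.
    """
    counts = Counter()
    for timestamp, severity, _message in entries:
        if severity == "ERROR":
            minute_start = (timestamp // 60) * 60
            counts[minute_start] += 1
    return dict(counts)


def minutes_to_alert(counts, threshold):
    """Return the minutes whose error count exceeds the threshold."""
    return sorted(minute for minute, n in counts.items() if n > threshold)


# Hypothetical log entries.
logs = [
    (0, "INFO", "request ok"),
    (5, "ERROR", "request failed"),
    (30, "ERROR", "request failed"),
    (59, "ERROR", "request failed"),
    (61, "ERROR", "request failed"),
    (125, "INFO", "request ok"),
]
metric = errors_per_minute(logs)
print(metric)
print(minutes_to_alert(metric, threshold=2))
```

A hosted monitoring product does the same two steps at scale: a filter turns matching log entries into a time series, and an alerting policy watches that series against a threshold.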

So we'll talk about the solutions and options that are available. And of course, first we'll talk about Google Stackdriver. So what is Google Stackdriver? Let me give a little bit of an introduction. Here we have a very brief summary of what Stackdriver is. In summary, it's basically our GCP-native solution for monitoring, logging, and diagnostics, so you can keep running your application without any worries and identify issues as they're happening. But I wanted to draw your attention to the keywords that I have here in bold. So the first one is that it's an integrated solution. When we started to envision Stackdriver, we saw in the marketplace, and heard as we talked to customers, that they are using a logging solution. They're using a monitoring solution. They're using some type of solution and hooking things up with some alerting solutions. Then sometimes they have a separate error reporting solution. And then for diagnostics and other application-level monitoring, they are using something else.

But we realized that what you're really trying to do is a single workflow. You had an issue, something happened. You need to figure out what happened, go identify the issue, and resolve it. So we envisioned this as one single product that has all of these capabilities inside it, one single go-to-market, one single package. The second one is SaaS. So as you also heard in the keynote today, take the example of "Pokemon GO." They mentioned that they saw traffic that was 50 times more than not just their average, but their best estimate. So what happens is, of course, as Piotr mentioned, you can autoscale. And you can get more computing power. And you can get more storage and all that good stuff. But you also have to realize that your monitoring and logging solutions need to scale too. So now you're sending perhaps 50 times more log data, perhaps 50 times more monitoring data than you envisioned. And if you are running your own clusters, you now have to look at what you provisioned, resize your clusters, and manage that.

With Google Stackdriver, you have to do none of that. It's a completely managed SaaS solution. And then the last one is that it's a hybrid solution. So of course, we all live in a multi-cloud world. And we also live in a world where, predominantly, companies still have workloads on-prem. So there are use cases where you would want a single pane of glass. So we are not built only for GCP. We are built on GCP, but we are not only for GCP. We are built for any cloud. Right now we officially support monitoring and logging for AWS, and in the future we'll expand that. But we already provide public APIs that you can use to bring data in. Your application might be running anywhere, and you can bring in some of those signals from anywhere, from any cloud, or even from on-prem. So that's really the vision of Stackdriver and how we are thinking about it. So this, again, talks very briefly about the things that I mentioned, starting with the first point, integrated.

So within the Stackdriver package, you'll see a Monitoring solution and a Logging solution: Monitoring to bring in all the time series metric data, and then visualize it and set up alerts; Logging to bring in freeform, high-volume log data. And then debugging: if you're hosting your code on GCP, you can hook that up so you can see when errors and exceptions are happening. You can go to the line of code that is actually triggering that, and you can debug it. And then Trace is where you go when you want to get a full stack trace. And then Error Reporting proactively identifies the exceptions and errors happening in your application and gives you visibility into that without you hooking anything up. So again, a great out-of-the-box solution there. So [? sample ?] all of that in Stackdriver, be it building, innovating, [INAUDIBLE]. In terms of different signals, what is really happening? Like I mentioned, we are building truly for a hybrid world.

So you might be on any cloud's servers. You can even envision a non-cloud environment, and we support that too. Like I mentioned, AWS and GCP are officially supported. And then you might be running any operating system. We support all the key operating systems today. So your VMs and servers would be running that, and then finally your application. All of these are generating a large amount of data, which comes in the form of, as I was mentioning, logs, metrics, errors, traces, and events. And we can take all of that and provide a single way for you to analyze it, get notifications on it, and visualize that data. So let's talk in a little bit of detail about GKE monitoring and logging capabilities. Very briefly: for monitoring, when we talk about, again, time series data, you are used to seeing these kinds of charts and dashboards whichever solution you use, and that's what we're talking about in monitoring. And then for logging, you basically have all the log data, which is time-stamped and structured, with a text [INAUDIBLE] coming in.

And you might want to run queries on that data in real time, look through what is really happening, and slice and dice that data. And then perhaps use the log data to create more metrics that feed into the metric side. So that's at a high level what we provide for GKE. Now let's go into a little more detail. The way I think about it, it is divided here into four high-level steps. One is data ingestion. You really care about how you get your data into Stackdriver, whether you're running GKE or something else. And then how do I manage that data? What kinds of tools are available to me for management, and what hierarchies do you support so I can define a cluster, perhaps, and then see data that way? And then what's happening with visualization? What capabilities are available to me there? How do I actually see this? And then finally, this is really important because we are building GCP in a very open way. Similarly, we're building GKE in a very open way.

And then we are building Google Stackdriver in a very open way. So it's not just about providing the product, but also providing you the data service. If you do not want to use Google Stackdriver, but want to take all of that data out, you have that capability as well. So let's talk about that first piece here, data ingestion. Today, any log that you have (system log, audit log, or application log) comes into Stackdriver from GKE by default through stderr and stdout. Now, that's not the best solution, so I'll talk to you about the roadmap and how we are going to enhance it. But today, that path is available for you to bring data into Stackdriver by default. The other thing that you're interested in is the metric side. Heapster and cAdvisor already send system-level metrics, again by default, to Stackdriver. So you would be able to see that at the node level. Then we have public APIs available, both for custom metrics and for logs.

So you can write any metric from your application using our client libraries, which are built for metrics and for logs, or send any custom log that you might have. So that's really the capability that's available for GKE, in terms of what you can bring into Stackdriver and how you can bring it in. Now, as I mentioned, we'll touch upon a little bit of the roadmap and what's coming. But what you will see there is that what we're missing today is application-level metrics. It's not very easy today to have a packaged solution that brings all of that in. So we'll talk about that. But let's talk about management and hierarchy. The way Stackdriver is set up is that you have a concept of Stackdriver accounts, and you can keep multiple projects that you might be running on the cloud within one Stackdriver account. And you can see the metrics from those multiple projects in a single view. So you can create a single chart, for example, or a single dashboard, where you're seeing signals from all of that in one place.

So you might have multiple clusters that you are running, maybe one running the front end and one running the back end in different projects, but you can visualize them together. So that's really useful. And again, within the projects, you have the concept of groups. Within a project, you might be running multiple clusters. But if you want to define a group based on nodes, you can do that. For example, some stuff might be running on staging versus prod. You can identify that and create groups, and then see the metrics only at that level. So you have that capability. Then charts and dashboards: you can, again, visualize all this data. You can take the log data that's coming in, create log-based metrics, and visualize those, and you can visualize the system metrics that we talked about or any custom metrics that you're bringing in. You can create your custom dashboard. And then you can set up alerting on that. And then data liberation, touching upon that quickly: you have the read APIs both for monitoring and for logging.

So you have metric data as well as log data, and you can use the read APIs to take that data out. But for logs, because we know that when you're taking logging data and transferring it you sometimes need a more robust pipeline, we also provide logs export to BigQuery, GCS, and Pub/Sub, again at no charge. This is free of charge for all the GKE log data that's coming in, which also gives you the power of other GCP products, like BigQuery, so you can take all your log data there and do, perhaps, more analysis on it. We store the log data by default for 30 days, but you could keep it for longer in BigQuery, or you can take it for archival into GCS. And then we also send log data, again pretty much in real time, to Pub/Sub. So you can stream the data out and take it to a third-party solution. We have partnerships with many commercial solutions, like Splunk. And you can take the data there, for example, if you want to, all your GKE logs or any of the logs there.

So what's really coming up on the roadmap? There are two things, but first I wanted to say one thing very emphatically: we are fully committed to providing the best-in-class solution for GKE monitoring and logging. What's missing from that is application-level monitoring. And we are working very hard on that. We are working on a metadata solution, a metadata agent, to be able to capture all the information, as well as to provide you with application-level monitoring and logging in a more agent-based fashion, rather than taking it from stderr or stdout, or just relying on Heapster and cAdvisor. So there will be an agent-based solution for both. You install that, and you do not worry about anything else: application, system, and audit-level information is all available to you. And then the second thing that we're working on is a curated dashboard for GKE. So if you're running a GKE cluster or GKE nodes, you would be able to see, in Stackdriver, a fully curated dashboard.

So those are the two things we are working on, in that order. And you will see both of those in the later part of this year. So with that said, this morning in the keynote, Eric Schmidt talked about freedom to innovate but also freedom to choose. So again, we're building it in a way where you are not restricted. If you want to choose a third-party solution, there are many good open source and commercial solutions available out there. Here are a few that we have worked with and that we have heard customers like. One is Prometheus. It's an open source toolkit, as you're aware, that gets you application-level metrics. It scrapes all of that from GKE. And it provides you some light level of dashboarding as well, like PromDash and things like that, so a very good solution. Similarly, Sysdig is a third-party monitoring solution built natively for containers. That's how they came out. They were built for container monitoring, so again, a good solution, a commercial solution.

And Datadog. Datadog is a more generic third-party monitoring solution. But they have really upped their game in the container world as well, so again, a good solution and a good option. And then for log data, Splunk. Again, like I was mentioning, we already have a Pub/Sub API-based integration available with Splunk. So if you already have Splunk running in your internal cluster, or you're running Splunk Cloud, with one-click configuration you can send all your log data to a Pub/Sub topic, and then Splunk can subscribe to it. We have a well-documented and validated solution for that. So those are the things that we are excited about offering for GKE and recommending, as well as what is available within Stackdriver for you to look at, and what we're building. We're really, really excited about that. And I'll give it back to Piotr.

PIOTR SZCZESNIAK: So basically, coming back to our team, they are happy because they can focus on building a great product, rather than being involved in all that DevOps stuff.

Kubernetes and GKE offer them a built-in, simple monitoring and logging solution. They offer tools for troubleshooting. They offer various kinds of autoscaling, both at the cluster and the pod level, with more autoscaling features coming soon. GKE offers out-of-the-box advanced instrumentation with Stackdriver. And of course, it is possible to integrate with any other monitoring solution. I would like to thank you very much for coming and attending this session. If you are interested in more about Google Container Engine, I can recommend the following sessions, especially the first one, where an invited customer, Philips Hue, will show how they use autoscaling in their production environment. Thank you so much. [APPLAUSE] Thank you. [MUSIC PLAYING]



Google Cloud customers use Google Container Engine for cluster management and orchestration of containers and rely on its features for production monitoring, logging and troubleshooting. In this video, Piotr Szczesniak covers autoscaling, maintaining application SLOs, monitoring, troubleshooting and managing custom extensions to ensure smooth operations of your service.
