Google Cloud NEXT '17 - News and Updates

Auto-awesome: advanced data science on Google Cloud Platform (Google Cloud Next ’17)

(Video Transcript)
LAK LAKSHMANAN: Welcome to GCP Next. Today, this morning, on the Twitter feed, there was someone saying, oh, no, here's the keynotes. And we have Disney. We have Home Depot. We have HSBC. Google's going all enterprise. Where's all the nerd stuff? Well, here's the nerd stuff. So welcome to the nerd stuff, right? [LAUGHS] [APPLAUSE] What we're going to talk about today is how to do data science on GCP. Who does data science on GCP? Data engineers, right? So we're going to talk about how a data engineer can do data science in GCP and how GCP makes that easy, makes it auto-scaled, makes it auto-awesome. Machine learning has been in the air today, right? And in the popular imagination, if we ask what machine learning is, people say, well, you have lots of data. You do some really complex math somehow and out comes magic, right? That's basically what people's imagination of ML is. In reality, though, what is ML? ML is a lot of work.

You're going to spend a lot of time and effort collecting the data. You're going to use all your ingenuity, all your knowledge, all your experience to organize this data. And then you're going to basically bring your insights, your domain expertise in order to create a model that somehow represents what you know about the domain that you're working in and how it relates to the data. And then you're ready to go. Once you have this model in place, you can now train this model. And that training is all very automatic. And then you end up in magic, right? You do end up in magic. But in order to get there is a lot of work. And what GCP can help you with is to make this work much more tractable, much easier to do. So what we're talking about here is that GCP is a place where data science meets data engineering. And I would like to introduce to you my two colleagues, Reza, who's going to play the part of a data engineer; myself, I'm Lak. So we're going to play this role of myself as a CTO, the head of BI for a company called Alphaplex.

We make widgets. And Alex here, another of my colleagues, is going to play the part of a data scientist. So what's the scenario? The scenario here is that we run a factory. And in our factory we make widgets. And of course a factory has a bunch of different machines. And these machines heat up the factory floor. We have people working in these factories. So we have to ensure that this factory is efficiently cooled. We need to make sure that we're not spending too much on cooling the factory. But if the factory floor gets too hot, then that is going to result in wear and tear. So we need to figure out the optimal way to cool the factory floor. Fortunately, we've been collecting data, right? We've been collecting data, and we'll be able to use that data to solve this problem of how to efficiently cool the factory floor. So what we have is a factory. We have an air conditioner. And we need to figure out when to switch on the AC.

If this sounds familiar to you, this is kind of the same issue that we do in Google data centers when we need to basically figure out how to efficiently cool a data center. This is a common problem, right? So we need to figure out when to switch the AC on. We have a bunch of different machines, right? And these machines essentially heat up and cool down, right, the factory. So we need to basically figure out how to keep these machines and how to keep the people at safe temperatures. And what data we have is the inside temperature of the factory, the outside temperature. And the reason we have that is because we went to our chief scientist, Isaac. And Isaac told us that he actually has this formula, right? So Isaac has this formula that said, now, if you know the inside temperature and you have the outside temperature, you do all of this math, I'll tell you that if you have a bowl of soup, how long is it going to take that bowl of soup to cool down. So Isaac says he has the exact formula.
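The formula Isaac is offering is Newton's law of cooling. As a rough sketch of what that rule looks like in code, assuming a single cooling constant `k` (a made-up value here, not something given in the talk) and temperatures in degrees centigrade:

```python
import math


def newton_cooling(minutes, start_temp, ambient_temp, k):
    """Temperature after `minutes`, per Newton's law of cooling.

    T(t) = T_ambient + (T_start - T_ambient) * exp(-k * t)
    """
    return ambient_temp + (start_temp - ambient_temp) * math.exp(-k * minutes)


def minutes_to_reach(target_temp, start_temp, ambient_temp, k):
    """Invert the formula: how long until the soup reaches `target_temp`?"""
    ratio = (target_temp - ambient_temp) / (start_temp - ambient_temp)
    if not 0 < ratio <= 1:
        raise ValueError("target must lie between the start and ambient temperatures")
    return -math.log(ratio) / k


# Hypothetical bowl of soup: 90 C in a 20 C room, with an assumed k of 0.1 per minute.
print(minutes_to_reach(target_temp=40, start_temp=90, ambient_temp=20, k=0.1))
```

The formula only sees two temperatures and one constant, which is exactly why it struggles once machines, people and multiple AC units enter the picture.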

But we say, hey, Newton, hold your horses, right? We don't want to basically use that formula because real life is a bit more complicated than just a bowl of soup. Why is it more complicated? Because we can monitor the inside temperature. We can monitor the outside temperature. But there are other complicating factors– complicating factors like which machines are running, how many machines are running, right? What are the machines doing and how much are they heating things up, right? How many people are in the factory. And all of these kinds of things make applying that heuristic, that rule that we had, really hard to do. And if you think about machine learning in the real world, it's usually a replacement for something you used to have. You used to have a bunch of data. You used to have formulas. You used to have rules. You used to have heuristics. And now you change those out and replace them by something that learns from the data. So we have collected data over time about the factory's inside temperature, the outside temperature, what was running in the factory at different points in time.

We have that data. And so we will basically show you that journey, right? But first, let's start where we are. And let's say that we know things about our factory floor. We know the inside temperature. We know the outside temperature. Let's figure out how to basically balance the cost of cooling versus the cost of the repairs that are going to result if we let the temperature get too high. That's basically the balance that we have to strike. So Reza, let's build something cool.

REZA ROKNI: OK. So I am the data engineer, and we're a small company. And so in terms of the number of data engineers we have, it's just me. So we have some constraints in terms of how much stuff I am able to do. And one of the lines that Lak said very casually is we've been collecting data. Now, as a data engineer, that line is potentially quite a lot of work for me. So let's have a look at the kind of thing that we needed to put together to actually collect that data. So we have this factory.

And again, we have one data engineer. I want to make sure that anything that I'm writing is, first of all, production grade, two, it will scale out. So while we might be doing an experiment right now with one factory, I want to make sure that if we have factories across the globe, if we have many thousands of IoT devices in that factory, recording things like temperature, number of people, maybe tomorrow it's measuring things about the machines themselves and sending that information on, no matter how much data I have, I want to be able to always absorb this data coming from all of these IoT devices. And it's important I do this in stream mode. One, it's important for the data collection and the processing but two, as we come to see when we want to do some more inference on that data points. The next piece of collecting– after collecting the data there's a couple of other things I'm going to need to do. First of all, I'm going to need to do some processing.

And again, one engineer, I want to make sure whatever I write is very small amounts of code and the systems that I'm using are fully managed. So I don't have to wake up at 3 o'clock in the morning to restart servers or to do things like re-balance things because our loads have gone up. In that processing, I'm going to do two very simple things. First, I'm going to element-wise take that data to a repository for the propeller heads to be able to do their smart data science. The other branch of this is my very basic check to when to turn the AC system on. And that is a simple test that I'm going to run a sliding window across the streaming data that's coming in. That sliding window's going to be five minutes long. For the purpose of this demo, I'm just going to do it every– the period of that sliding window's going to be 15 seconds. So every 15 seconds that five-minute window's going to slide down. I'm going to do a calculation of a mean. And I'm going to do a really simple check, which is if the value is over 30-degrees centigrade, then turn the AC on.
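The basic check described here (a five-minute window that slides every 15 seconds, a mean, and a 30-degree threshold) can be illustrated in plain Python. This is only a sketch of the logic, not the Dataflow code from the demo: a real streaming runner handles window alignment, late data and triggering itself, and the sample readings below are invented.

```python
from statistics import mean

WINDOW_SECONDS = 5 * 60   # five-minute window, as in the talk
PERIOD_SECONDS = 15       # the window slides every 15 seconds
THRESHOLD_C = 30.0        # switch the AC on above this mean temperature


def sliding_window_decisions(readings, window=WINDOW_SECONDS,
                             period=PERIOD_SECONDS, threshold=THRESHOLD_C):
    """Return (window_end, mean_temp, ac_on) for every window holding data.

    readings: list of (timestamp_seconds, temperature_c), sorted by time.
    """
    if not readings:
        return []
    decisions = []
    last_ts = readings[-1][0]
    window_end = readings[0][0] + period
    while window_end <= last_ts + period:
        window_start = window_end - window
        temps = [t for ts, t in readings if window_start <= ts < window_end]
        if temps:
            avg = mean(temps)
            decisions.append((window_end, avg, avg > threshold))
        window_end += period
    return decisions


# Hypothetical feed: one reading every 15 s, 28 C for five minutes, then 34 C.
readings = [(ts, 28.0 if ts < 300 else 34.0) for ts in range(0, 600, 15)]
decisions = sliding_window_decisions(readings)
for window_end, avg, ac_on in decisions[::8]:
    print(f"t={window_end:>3}s mean={avg:.1f}C ac_on={ac_on}")
```

Because the mean is taken over five minutes, the AC decision lags a sudden jump in temperature; that smoothing is the point of using a window rather than reacting to each reading.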

I'm from the UK, so it would be centigrade, right? So let's think about the components that I'm going to use. First of all, I want to use the Google Cloud Pub/Sub API. It's a fully managed publish/subscribe API. It meets the requirements because it's global, so I can have many factories all sending messages to the API. I can have a single topic. So I don't need to worry that today I've got 100 messages a second and tomorrow I have hundreds of thousands of messages a second. I don't need to reshard. I don't need to think about partitioning. I just keep sending information into that topic. The next piece of the puzzle is the processing side. So I'm going to take this data that's coming from Pub/Sub, and I want to process it. One of the other advantages of Pub/Sub is that it is HA, and it will hold onto all information sent to it for up to seven days until a subscriber pulls the message and also acknowledges that the message has been received. So the component that I'm going to connect to this is Dataflow.

And Dataflow allows us to do both batch and stream processing in our pipeline. In this case, we're just going to use the streaming capabilities. Dataflow will take care of spinning up machines, autoscaling, all that other stuff that, again, I don't want to be concerned with. I just want to write simple declarative code on what my pipeline should do and then let the system get on with it. In that pipeline, I'm going to have two branches. The first branch is to go to BigQuery, which is where we're going to store our information. There are multiple choices of where you could land IoT data. Especially as it's time series, Bigtable could be another option. However, again, [INAUDIBLE] just want to run queries against this system, so BigQuery's a very good place to land this data. The other branch is the simple check that I want to do. Take the mean of the sliding window. If it's over 30 degrees, then tell the AC to either come on or off. And in this case, we're actually going to use Pub/Sub again, although we haven't got it in this diagram.

The reason we're using PubSub now is not as an ingestion mechanism, but as the glue amongst my data pipeline. So when I have a message, I put it back into PubSub. It can communicate with downstream systems in the factories to be able to turn this AC on. And just to see how this would look on our pipeline as it's running, could you connect to the laptop, please? Thank you. So this is the monitoring UI from Dataflow. And it's actually showing the pipeline in motion. We've got the IoT devices pushing data into a topic that I am reading from. I am then parsing the IoT data. So it's coming in JSON format. I'm extracting the bits that I need. And then it branches out. The first branch is that element-wise, it's sending data to BigQuery. And by element-wise I mean there's no micro batching or batching. So as data's flowing in, it's immediately available to any queries the guys want to run. On the other side, I am doing our sliding window. So I create a five-minute sliding window.

I will then extract the temperature from the data that exists within that five-minute window, pass it to a mean function. From there, we will check if the value is over 30-degrees centigrade. If it is, I'm going to send it to two locations. The first location, again, is that PubSub, so that it actually can talk to the factories downstream. The other location is actually to Firebase. And the reason I'm sending it to Firebase? Well, I'm going to show you here. So this is that basic test. It's that information being sent onto the Firebase engine. And this is essentially to help us visualize the tool. So as the data gets updated, that fact that it's gone yellow means that it was the latest update from Dataflow pushing to it. Now I can actually visualize this in things like graphs that connect to the Firebase engine. And as you'll notice, you actually know when my laptop was down because we've got these flat lines going up, because it didn't have any data at that point, and it's just jumped across.

So this system will scale. Could we go back to the slides, please? So the amount of code I had to write– this is the code for the Dataflow section. For Pub/Sub, I just created a topic. There's nothing other than a DevOps action there. In terms of the code, these are snippets. But just to give you a flavor, in Dataflow, data that comes in gets turned into a parallel collection of immutable objects. And what we do with that collection is we run a series of transforms against the data. The first transform I do is saying I want a sliding window. So here, those three lines at the top– with window into SlidingWindow, duration 5 minutes, with a period of 15 seconds. Of course, if we were doing this for real, we wouldn't use 15 seconds. We don't want the AC to come on and off that quickly. You'd put it at five minutes, but then you wouldn't see the nice flashing indication on the Firebase. Next, I want to actually do the calculation of the mean. Mean, sum, counts, all of these are standard things that people want to do with data in pipelines.

So there's a primitive built into Dataflow. That single line of code allows me to get the mean from every single window. Next, we do a very trivial test, the check threshold. And that check threshold is nothing other than returning a Boolean saying whether the value of the mean is over 30 degrees centigrade. So that's our very basic pipeline. Now we're going to hand it over to our very smart data scientists, who are going to do some magic with that data.

ALEX OSTERLOH: So I didn't get the mobile microphone, so I have to stand here, which is great. Because otherwise, I run off. So from a data science point of view, what we just saw was we have a rule in place for deciding when to turn the AC on and off. And that rule was, is the temperature bigger than 30 or less than 30, right? It's the same at Google. With many of our products, we've been making these rule-based decisions a lot. And only in the last couple of years have we turned to machine learning to actually improve the way our products work, make predictions, and increase the quality.

So one example is if you do a search on giants, it depends on probably where you are. If you're in San Francisco here, you might be interested in the San Francisco Giants probably. If you're in New York, it might be different types of Giants, football team versus baseball team. So we used to write these rules where you say, OK, here's the query. Then based on the location, if it's San Francisco or New York, we give the user different search results that are focused on what the user is probably desiring. In my case, I'm from Germany. So if I search for giants, it might be the Dortmund Giants, which I didn't even know existed. But I was searching for a team, and, of course, there's a German Giants team. So RankBrain is a good example of how we switch from using rules to going to machine learning. Because instead of just looking at the search terms themselves and how often they appear in a text and how close they are to each other, we actually use natural language processing to determine the relevance of a document.

And this is the third highest ranking criterion in how we determine whether a search result is further up or further down. And this is the biggest improvement we've seen in our search quality in the last couple of years. So if we turn to our problem, we want to cool a factory. We have this temperature coming in of 27 degrees. And then we're trying to decide how do we react if the outside temperature's higher or lower. Then we have a different number of machines running. And then you have new factors, right? So, now, people give off heat as well. So you might want to consider that. But maybe there are different factories with different factory sizes, with different numbers of air conditioning units that you can set to different levels. These are all very different criteria. So you can see, turning on or off the AC, based on these rules, can become very, very hard to do or impossible to do. So machine learning has been around for, it seems like, hundreds of years. So Isaac is thinking now, could we use machine learning for this maybe instead?

So just a little 101 on machine learning and, in this case, deep learning, which I'll show you in a couple of minutes and how it works. Basically, it's supervised learning. We use deep learning here. This is one method of machine learning that works really well for the stuff we do in Google. And it works by taking a lot of labeled data, where you know this is a cat, dog, whatever it is. You train a model in the first step. Then you evaluate how well it is doing. So you have a training data set. You have a test data set. You find out the quality, and you do this in an iterative approach until you have the quality that you desire. And then what you can do is you can take that model, put it on a phone, build a cool app to take a picture of a cat and decide, is this a cat or not, right? So this would be applying the model. So we're going to also use this for our problem here. So we are, because this is all live, training a model based on all this data that is already being captured in BigQuery, right?

So you have number of machines, inside temperature, maybe number of people, or other factors that you can use. And then you can apply this model to actually make a decision to turn on or off the AC. Whenever you do machine learning, you're trying to optimize for something, right? So in this case, we're trying to minimize the cost. And cost in our case is the cost of the wear and tear of the machines, right? Because we're saying we don't want the temperature too high because it is bad for the machines. They need to run at a certain room temperature. And we want to minimize the cost of maintenance of buying new machines, right? So this is just checking that the temperature's not too high. And on the other side, we don't want the AC running 24/7 because that's going to cost energy, and we don't want to do that. We want to be efficient in how we use that energy to power the AC, right? So we're defining a cost factor, which is a combination of those two factors.
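To make the idea of a combined cost factor concrete, here is a minimal sketch. The talk only says the cost combines time spent at unsafe temperatures with the energy used by the AC; the linear form, the two weights and the example numbers below are assumptions chosen for illustration.

```python
def cooling_cost(unsafe_minutes, ac_energy_kwh,
                 cost_per_unsafe_minute=2.0, price_per_kwh=0.15):
    """Combine wear-and-tear and energy into one number to minimize.

    Both weights are hypothetical; a real factory would calibrate them
    from repair bills and electricity prices.
    """
    wear_and_tear = unsafe_minutes * cost_per_unsafe_minute
    energy = ac_energy_kwh * price_per_kwh
    return wear_and_tear + energy


# Three made-up operating policies over the same shift.
always_on = cooling_cost(unsafe_minutes=0, ac_energy_kwh=400)
never_on = cooling_cost(unsafe_minutes=120, ac_energy_kwh=0)
balanced = cooling_cost(unsafe_minutes=10, ac_energy_kwh=150)
print(always_on, never_on, balanced)
```

With these invented numbers, neither extreme wins: running the AC constantly wastes energy, never running it wears out the machines, and the model's job is to find the policy in between.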

And we're trying to balance this out. Turns out Google has its own factory floors, right, our data centers. And we're actually applying machine learning here to optimize how we run certain workloads. You can imagine a lot of the workloads that we run are actually around machine learning, right? So we're updating models and improving our recommendations on cat videos and stuff like this. And it doesn't matter if this is running in Oregon or in Finland or anywhere else. So machine learning is a good way to actually make that decision for us. And we've been able to reduce energy costs by up to 40% when we apply machine learning. And there's actually a cool white paper out. There's a URL on there where it talks about how we do this. And these are some of the features used to actually create a model that is optimizing our own data centers. So if you remember our architecture that we had just a couple of minutes ago, we're taking all that data. We're pushing it into BigQuery.

Now we want to take that data and train a model. So you might have heard of TensorFlow. It's great because I can just download it. It's open source, and I can do this here on my laptop, which I'm actually going to do. The cool thing is that when I'm happy with what I have on the small data set that I'm testing with, I can actually flip a switch and say, now I want to train on the billion data set with all the factories on, right? So doing that switch is fairly easy– training with a smaller data set locally, but then going full production and not needing to be scared if the data set gets too big or the features get too big, right? So to demonstrate this, I'm going to be using something called Cloud Datalab, which is a Jupyter notebook hosted by Google Cloud. And I will be going into that now, if that's possible. Can we switch to the demos again? Thank you. So this is Cloud Datalab. This is a notebook. It's pretty cool because I can share this with colleagues.

And we can try to find a cool way to work with this data, right? So this is cool for experimentation. I can run Python code in here. I can run TensorFlow. It's one of the easiest ways to run TensorFlow. But I can also pull in a lot of the Python libraries that are out there, like pandas, that we're going to use here as well. So I'm not going to run everything here. Obviously, you need to do some set up, right? We need to define the project and the Cloud Storage bucket and these type of things. We need to import some libraries, define a schema that is representing the schema that we have in BigQuery. Here we are actually going to pull the data using BigQuery. So you can actually see the query in here. And in that query, we are also defining the cost. So we're taking a factor, which is called unsafe, which is the time that machines ran in an unsafe temperature, which is above 30 degrees. And we also have a factor of how much energy was used, the power that we used on the AC.

So we have a cost. We have the temperature, when the fan was turned on, the number of machines, and the outside temperature. So we're pulling that into a Pandas data frame. If you remember, we have all that data to train on. Now we're going to split the data into a training data set and a data set to evaluate our training. So this is happening. So we're using a 90-10 split of training data and testing data. And this is the output of some of the rows right in here. So you see the cost, the fan on, the number of machines. Outside temperature. That's pretty cool. So I'm going to go a little further down to the part where we're actually going to train based on all that data, right? So we're pulling all the data from BigQuery, and we're going to train a model. This is using TensorFlow. It has two hidden layers. It says it right here. It says the number of iterations. So we're going to go through. And we have two hidden layers, one with 64 neurons and one with 4 neurons.
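The 90-10 split mentioned above is simple to sketch with the standard library. This is not the notebook's code (which works on a Pandas data frame); the row layout and the fixed seed are assumptions so that the example is reproducible.

```python
import random


def train_test_split(rows, train_fraction=0.9, seed=42):
    """Shuffle the rows and cut them into a training and a held-back test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed: same split every run
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rows: (cost, fan_on, num_machines, outside_temp)
rows = [(float(i), i % 2, 5 + i % 7, 15.0 + i % 10) for i in range(100)]
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters for time-ordered sensor data: without it, the test set would contain only the most recent readings.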

So I'm actually going to run that, and hopefully something will happen. So this should just take not more than seven seconds. So it says local training done. So that means it's created a model, and it's putting it in a directory that we specified, right? So now we have a model and a directory. And now we can do some prediction on that data, right? So we can give it some temperatures, some inside, outside, number of machines running. And it's going to give us back a cost factor, right? So here comes the important part, where we actually now determine and evaluate how well did our model actually do, right? So remember, we had the training data. We have the test data. And now we want to compare the two different data points– how well did our predictions actually do against some test data that we held back and that wasn't used for training? So this is actually the table we get right here. So we have the true cost that we know about, and we have predicted cost. Since this may be little hard to look at and understand, there's also visualization.
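Alongside the table of true and predicted cost, a single summary number is often useful when comparing training runs. The talk does not say which metric the notebook reports; root mean squared error is used below purely as a common choice, and the sample values are invented.

```python
import math


def rmse(true_costs, predicted_costs):
    """Root mean squared error between held-back true costs and predictions."""
    if not true_costs or len(true_costs) != len(predicted_costs):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_costs, predicted_costs)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-back test rows and the model's predictions for them.
true_costs = [12.0, 30.5, 8.0, 22.0]
predicted_costs = [11.0, 33.0, 9.5, 20.0]
print(f"RMSE on the test set: {rmse(true_costs, predicted_costs):.2f}")
```

The error is in the same units as the cost itself, so it can be read directly as "the prediction is typically off by about this much".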

So that's one of the cool things, also, with Jupyter notebooks, that you can use all these visualization libraries that are there for Python. And you get kind of this heat map, where you, based on a certain true cost factor, you see the spectrum of what was predicted, right? So here you have the true cost. Here you have the predicted cost. And this is a pretty good visualization, giving you an idea of how well we are doing. And maybe this is an indication for you to go back up and maybe do some parameter tuning and try some things out, do a different split maybe on training and test and these type of things, until you're at a quality that is good enough for your use case. Now I'm at the point where we actually have a model. And what we're not doing here is we're not training this again on Cloud ML on a billion data points because we don't have the time for you all to see how long this takes. So we are going to keep the model as we have it. And we're going to push it into Cloud ML, which means I can give Reza, in a second, an API call that he can use to ping the Cloud ML API and give it some values.

And these are the values that are going to come in through Pub/Sub and Dataflow, give it some values and make a prediction on what action should be taken. So what you could do, actually, is you give it two numbers. You give it maybe the current temperature indications. And then you can say, OK, the way it's going right now, the way the temperature's moving right now, what if the temperature's one degree higher or two degrees higher, what would my actual cost be? So you get two values back. And that could be an indication for you to take action now or wait two minutes because the cost, actually, of turning the AC on in two minutes is going to be actually lower. So this is the type of back and forth that I can now give to Reza that he can now, hopefully, without having to tear down his pipeline that he built, actually use in the architecture that he'll be demonstrating in a second. Great. So I think we go back to slides, right?

REZA ROKNI: Yeah, thanks. So the key component that I'm actually going to have to swap out is that check that I was doing against the temperature.

So before, if we remember our pipeline, we had IoT devices all sending streams of data to Pub/Sub. Pub/Sub was being consumed by Dataflow. We were splitting the data, sending it to BigQuery, which allowed Alex to be able to do his work. On the other side, we had that simple test, where I was doing a sliding window of five minutes, every 15 seconds, calculating the mean, and doing a check. Now I just need to replace that with the inference that we have now. And I also no longer need that sliding window. So that sliding window was just there for me to get an average, rather than go point by point. Here, the system can just take each point with the values that we have and then give us a prediction back. So the one thing I'll need to do is change from that check threshold, where I was just checking the temp value against 30 degrees, to the new code, which is to call the Inference API. And the Inference endpoint is the one that has been deployed with that version number. We have the packet shape that I need to send to it.

So that packet shape is a JSON structure I'll need to put together. That JSON structure comes from the same data that I was collecting before. So in order to actually switch this out, I will just need to replace my check with a REST call to the Cloud ML service. And then from there, the Cloud ML service is going to give me a prediction back. That prediction back will be in JSON format. I need to just extract the piece of data that says whether the fan should come on or off. And then through the same path of the pipeline, I will actually send it on to the factory. And it will decide whether to issue the on or off. So if we could just go back to this laptop, please. So previously, the section that we hadn't shown was this middle bit, which was taking that data. And instead of doing the five-minute sliding window, it's actually calling the prediction service. We get the output from the prediction service. I have a simple JSON deserialization to take the values that I want out from that.
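The two JSON steps described here, building the request packet and pulling the fan decision out of the response, might look roughly like the sketch below. The field names (`inside_temp`, `outside_temp`, `num_machines`, `fan_on`) and the `instances`/`predictions` envelope are assumptions for illustration; the actual packet shape is whatever the deployed model version expects. The REST call itself is left out so the example runs without network access.

```python
import json


def build_prediction_request(reading):
    """Serialize one sensor reading into a JSON request body (assumed shape)."""
    instance = {
        "inside_temp": reading["inside_temp"],
        "outside_temp": reading["outside_temp"],
        "num_machines": reading["num_machines"],
    }
    return json.dumps({"instances": [instance]})


def extract_fan_decision(response_body):
    """Deserialize the response and return True if the fan should be on."""
    payload = json.loads(response_body)
    predictions = payload.get("predictions", [])
    if not predictions:
        raise ValueError("response contained no predictions")
    return bool(predictions[0]["fan_on"])   # hypothetical output field


# Hypothetical reading from the parsed IoT message, and a canned response.
reading = {"inside_temp": 31.2, "outside_temp": 24.0, "num_machines": 12}
request_body = build_prediction_request(reading)
canned_response = '{"predictions": [{"fan_on": true}]}'
print(request_body)
print(extract_fan_decision(canned_response))
```

Keeping the two functions separate from the HTTP call makes the swap Reza describes easy to test: the rest of the pipeline only sees a Boolean, just as it did with the threshold check.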

And then I send it again on to both PubSub and Firebase. If we go to Firebase, I'll actually go up a level now. We can see two data points. One was the original flow, which has got check whether over 30 degrees. The other one is now the inference being called from the Cloud ML service. And the reason, actually, we'd want to keep both running is, like any good data engineer, I'm not going to just switch over from production in a second, right? This is all good stuff. I trust Alex has done a great job. But I want to make sure everything continues working before I make that switch. And this is where we start doing the AB testing, right? In this case, we're AB testing a very old, basic method of checking when the fan should come on with the new inference model. But as time moves on, let's say this is all very good, successful, and I actually make the switch, we're going to have different models being built. So maybe Alex decides that actually given the data set, every week we want to rebuild the model.

Maybe it's every day. Whatever it is, the Cloud ML service will allow you to create different versions of the model as you do that. So you have V1, V2, V3, V4. This allows me to continue to make sure that I can do AB testing of that new model. So if it's happening over the day and there's a new model landing, I can maybe decide that what I'll always do as a matter of course is 30% of the workload will go into the new model. 70% will continue on the old model. Once I'm happy everything's in place, I can make the switch. Dataflow allows you to do in-situ updates. So you can actually update the Dataflow pipeline as it's running. It will take care of redistributing the code. And at that point, I switch onto the new model until another model comes along and I switch it back out. And these are the kind of things that allow us to take the data science and actually make it production-ready and usable very, very efficiently. So back over to Lak.

LAK LAKSHMANAN: Thanks. Could we get back to the demo machine?

REZA ROKNI: Do you want the slides?

LAK LAKSHMANAN: Yeah, I want the demo.

REZA ROKNI: The demo machine?

LAK LAKSHMANAN: Yes, the demo machine, please. Right, so just to recap, in order to do something like this, to make it easy, notice that the first thing that Reza did was that he made sure that he took all the data. Before he did all the sliding windows, before he did all the aggregations, there is this nice little thing that says Write Raw Data to BQ. That's where everything starts. If you're not storing data, you're not going anywhere. It's not going to anyplace that's auto-awesome. The first thing is save your data. And once you have saved the data, that is the data that Alex could basically use in his Datalab. And you'll notice that his Datalab started with BigQuery. He went into BigQuery and said, let me go ahead and pull my data, right– bq.Query, right?

That's essentially the basic first step that you need to be able to do. You need to be saving the data in order to go back and replace the kinds of systems that you are doing now with something better. I cannot emphasize this more. Make sure to save your data. Notice some of the other things that happened, right? As Alex built the Datalab model and he deployed it, it was a very simple Cloud ML deploy. It went onto the cloud. And he essentially got something that could be called with a REST API. And Reza knew that he didn't have to worry that this model that Alex had built– built by a scientist– was actually going to work in production. It was because it's basically running in the Cloud ML service. And Reza could simply invoke a REST API call from his Dataflow code. Now, these are the kinds of engineering improvements and guarantees that enable you to do good data engineering and data science on GCP, right? And Reza went ahead and pointed out a few other things, like the ability to do AB testing, to be able to have both the old service and the new service running at the same time, so you basically build confidence in your business users before you actually make the switch.
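The 30%/70% split between the new and old model that Reza described can be sketched as a deterministic router. This is one possible way to do it, not the demo's implementation: hashing a message ID (a hypothetical field) keeps each device's messages on the same side of the test even if a message is replayed.

```python
import hashlib


def choose_model(message_id, new_model_share=0.30):
    """Route a message to the 'new' or 'old' model version.

    The hash maps each ID to a stable number in [0, 1), so the same ID
    always lands on the same model.
    """
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if bucket < new_model_share else "old"


# Hypothetical message IDs from the factory's IoT devices.
ids = [f"device-{n}" for n in range(10_000)]
share_new = sum(choose_model(i) == "new" for i in ids) / len(ids)
print(f"share routed to the new model: {share_new:.3f}")
```

Raising `new_model_share` step by step toward 1.0 gives the gradual switch-over described in the talk, and setting it back to 0.0 is an immediate rollback.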

Could we get back to the slides, please? So bottom line, then, is that this is now an amazing time to be a data engineer and a data scientist. And the reason it's an amazing time to be a data engineer or a data scientist is that the amount of work that you need to do has gone dramatically down. If you're thinking about building machine learning models, the amount of work that you need to do to build a machine learning model fits in a single Datalab notebook. And if you looked at the code for that single Datalab notebook, as a first start, it was all like structured data, go ahead and preprocess, structured data, go ahead and train. It was single calls on the Python API, a very easy way to get started doing machine learning, starting with structured data. And it's a great time to be a data engineer because, again, you can write Dataflow code and have all the DevOps and have all the auto-scaling and all of these things be taken care of by the cloud, right? So what exactly does auto-awesome mean to you as a data engineer or as a data scientist?

As a data engineer, I think what it means is that whether you're doing ingestion, notice that Reza started out ingesting data with Cloud Pub/Sub. And it could scale from tens of messages to millions of messages, from local to global. And he didn't have to lift a finger, right? He just set up a topic, and that was it. And then anybody could basically post into that topic. The ingestion was auto-awesome. Transformation, how did Reza do the transformation? He used Apache Beam and Cloud Dataflow. He wrote his code in Apache Beam, which is totally open source. He executed it using Cloud Dataflow on GCP. And that essentially, again, enabled the transformation code to be automatic, right? And we talked about how you can basically deploy a new service, and you can replace a running pipeline. How amazing is that, right, to be able to replace a running pipeline, to not lose any messages in the process of replacing a running pipeline? That's the key thing, right? When we say you can replace a running pipeline, what we're saying is your old pipeline processes up to a certain message.

And the new pipeline takes over at the next message. You've not actually lost any message in the process, right? You can replace a running pipeline with no lost messages. And the other thing, we kind of glossed over it a little bit, but as the Dataflow pipeline was writing things out, it was writing them into BigQuery. And we could be querying the data, even as it was being written. You saw the Firebase app. The Firebase app was making a query into BigQuery, even as the data were streaming in. Again, we made it look very easy, right? But this is extremely, extremely hard. Think about being able to do querying on streaming data, right? And we're doing SQL queries on streaming data to power that Firebase graphical thing and, finally, to be able to do your training on your machine learning model. And the good thing is that all of this will auto-scale to thousands of machines on demand. All we are writing is we are writing code. We're writing code. We are defining the logic. And everything else is just auto-scaling.
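
The pipeline replacement described above, where the old pipeline stops at one message and the new one picks up at the next, can be illustrated with a toy simulation. This is plain Python, not Dataflow. The function names, the cutover index, and the version tags are all hypothetical, and it only shows the property that every message is processed exactly once across the handoff.

```python
def run_pipeline(messages, start, stop, transform):
    """Apply transform to messages[start:stop] and return the outputs."""
    return [transform(m) for m in messages[start:stop]]


def replace_running_pipeline(messages, cutover, old_transform, new_transform):
    """Old pipeline handles messages before cutover, new pipeline the rest."""
    old_out = run_pipeline(messages, 0, cutover, old_transform)
    new_out = run_pipeline(messages, cutover, len(messages), new_transform)
    return old_out + new_out


# Ten messages; the (hypothetical) update takes effect at message index 6.
messages = list(range(10))
outputs = replace_running_pipeline(
    messages,
    6,
    lambda m: ("v1", m),  # logic of the old pipeline
    lambda m: ("v2", m),  # logic of the replacement pipeline
)
processed_ids = [m for _, m in outputs]
print(outputs)
```

Here `processed_ids` contains every message once, in order, with the first six tagged by the old logic and the remaining four by the new logic.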

The other thing that we took a huge advantage of was that when Alex did his training, he was training it on batch data. But then when we did the predictions, we were predicting on streaming data. And it was actually a very easy transition. And the only reason that it was a very easy transition was that Reza was using Dataflow. And Dataflow is a programming model that lets you deal with batch data and streaming data in exactly the same way. So we could train a model on historical data, which is what you always do. But you predict on newly arriving data, which is what you always want. And we didn't get into this, but the model that Alex built, that was his first model. And over time we'll need to do hyper-parameter tuning. We'll need to tune the models. And all of that is also automatic. All of these things about tuning a model and making them better, those can also be all part of your whole process of making them completely automatic. So bottom line, then, what GCP offers to data scientists, to data engineers is that we offer a way to fuel your innovation.
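
The idea of treating batch and streaming data in exactly the same way can be sketched without any cloud service. The example below is plain Python standing in for the Beam programming model, not Beam itself. The `toy_predict` function is a hypothetical stand-in for a trained model, and the record layout is assumed. The same transform runs unchanged over a finite list (historical data) and over a generator (newly arriving data).

```python
from typing import Callable, Iterable, Iterator


def add_prediction(records: Iterable[dict], predict: Callable[[dict], float]) -> Iterator[dict]:
    """Attach a prediction to each record; works for any iterable source."""
    for record in records:
        yield {**record, "prediction": predict(record)}


def toy_predict(record: dict) -> float:
    """Hypothetical stand-in for a trained model: doubles the input value."""
    return record["value"] * 2


def stream() -> Iterator[dict]:
    """Simulated unbounded source that yields records as they 'arrive'."""
    for v in (3, 4):
        yield {"value": v}


# Batch: a finite collection of historical records.
batch = [{"value": 1}, {"value": 2}]

batch_results = list(add_prediction(batch, toy_predict))
stream_results = list(add_prediction(stream(), toy_predict))
print(batch_results)
print(stream_results)
```

The point of the design is that `add_prediction` never asks whether its input is bounded, so the logic written against historical data carries over to live data as is.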

You write your innovation. We give you the rocket ship. We give you the rocket ship to fuel your innovation. Whether you're talking about ingestion with Pub/Sub, you're talking about transformation with Dataflow, you're talking about analyzing data with BigQuery, doing your experimentation with Datalab, you're doing machine learning with Cloud ML, it all just works. And that's ultimately the thing that makes it all– it's all very well integrated. And it provides you a huge amount of innovation. So a quick shout out to related sessions– today at 2:40 PM there is an IoT solution on Google Cloud. And there is another talk on Apache Beam, How to do portable and parallel data processing. So both of those are very related to this session in terms of being able to do stream processing. And if you looked at the session and said, hey, these are all the kinds of things that I do these days, I strongly encourage you to go visit the certification launch. The Data Engineer certification on Google Cloud is currently in beta so it's discounted.

So please go ahead and give it a try. You can take the exam right here at Next. And thirdly, the shameless plug, there is my book, "Data Science on GCP," which basically goes from ingestion all the way to machine learning, kind of the same kind of thing that we talked about here, but using a use case that's around predicting flight delays– how to ingest data and then go all the way to doing streaming real-time predictions, using time windows, et cetera. So I step you through this entire process. The book is an early release. So you can go to the O'Reilly website. You can start reading the book now. So thank you all very much. [APPLAUSE]



A key benefit of doing data science on the cloud is the amount of time that it saves you. You shouldn’t have to wait days or months. Because many jobs are parallel, you can get your results in minutes to hours by having them execute on thousands of machines. Running data jobs on thousands of machines for minutes at a time requires fully managed services. Given the choice between a product that requires you to first configure a container, server or cluster and another product that frees you from those considerations, the serverless option is always preferable. You’ll have more time to solve the problems that actually matter to your business. In this video, Lak Lakshmanan, Alex Osterloh, and Reza Rokni walk through an example of carrying out a data science task from a Datalab notebook that marshals the auto-awesome power of Google Cloud Platform (GCP), including Google Cloud Pub/Sub, Google Cloud Dataflow and Google BigQuery, to glean insights from your data.
