Google Cloud NEXT '17 - News and Updates

Open source data processing on Google Cloud Platform (Google Cloud Next ’17)

(Video Transcript)
[MUSIC PLAYING] JAMES MALONE: Hello everybody. I'm James Malone. I'm a product manager on Google Cloud. And one of the products that I am product manager for is Google Cloud Dataproc. And today I want to talk to you about using the open source set of tools in the Apache Spark and Hadoop ecosystem for data processing on Google Cloud Platform. So we're going to talk about a few things today. First we're going to talk about the tools themselves. They're great, they bring a lot of benefits, but they also bring a lot of complexities and headaches. And we're going to talk about how we want to solve that using hardware and software innovations that are a part of Google Cloud Platform. I'm going to put my money where my mouth is and do a live demo. And we're going to create a pretty big Spark and Hadoop cluster to show you what we can do at scale, some of the speeds, and some of the tools that we have and that we bring to the Spark and Hadoop ecosystem. And then from there I want to share a bunch of really exciting new things that we just launched in preparation for Google Next.

So the first thing that I want to talk about is the complexity of the Spark and Hadoop ecosystem. These tools are really, really great, but they often bring a lot of painful headaches with them. So typically when you set up a Spark and Hadoop cluster, it's not a sort of simple or linear process. If you are familiar with setting up a Spark and Hadoop cluster on premise especially, there's a lot of steps involved. You have to get your hardware and configure that hardware. But that's only step one. Once you have the hardware, you then need to get the open source bits. Sometimes you might do this vanilla, or sometimes you might use a Hadoop or Spark vendor, like Cloudera or Hortonworks. But then you have to install that software and configure it, tune it, and only really after you're done with that are you actually ready to start processing data. Now, if you wanted to use Spark and Hadoop on day zero or day one, you're probably not using Spark and Hadoop on day one with this process.

It's probably taken days, weeks, or months to actually get this working properly. There's even more headaches if you actually want to set up multiple clusters, maybe a cluster for production and a cluster for development. So the scaling makes your life really difficult. So say you start off with a cluster and you want to expand it because you ran out of HDFS space, which is used to store data in Spark and Hadoop clusters. Well, there's going to be a lag of time between when you have the capacity that you started with and when you have the capacity that you need. This also may require you to actually take your resources offline or have impaired functionality. Probably your business or the need to actually process your data is not stopping for you while you're actually scaling this cluster up. You're actually probably compounding matters, and things are probably piling up that you wish you could get to. You also have to really babysit utilization. And this is a really big problem, especially for accountants.

You're paying for a lot of capacity, but you're probably not running your big data infrastructure at 100% all the time. That means that you're probably paying for overcapacity. Conversely, you could pay for undercapacity, but that's not really a great solution either, because then you have less capacity than you actually need to answer any of your questions. So you have to really be very careful about how you plan out your capacity and try and smooth out your utilization as much as possible. Ultimately, you're not paying for what you use. This is a really big problem. And one of the big complaints of the Spark and Hadoop ecosystem is that they're great tools, and oddly enough, a lot of it's free open source, but it can be very, very costly to run and maintain. We don't want you to have this problem. Spark and Hadoop are great tools, and they bring a lot to the table. But you shouldn't worry about the cost and complexity of running these tools to take advantage of them.

So I did mention Cloudera and Hortonworks. If anybody in the room is familiar with Cloudera, Hortonworks, or MapR and you're using them, just to throw out, all three are supported on Google Cloud Platform. They all have different ways to bootstrap clusters, so there's things like Cloudera Manager or Ambari or Cloudbreak. If you want to use those on Google Cloud, you totally can. And you'd get a lot of benefits from doing that. You're going to get low pricing for VMs, and you're going to be able to use Cloud Storage. You could possibly use Cloud Bigtable instead of HBase. So there's a lot of benefits for doing that. So just to say, if you are tied to these or really love these vendors, you can definitely bring them and lift and shift them to Google Cloud Platform. So we kind of offer a suite of ways to run the giant Apache Spark and Hadoop ecosystem. And that's maybe actually worth a note. So Spark and Hadoop are sort of the two most well-known components, but it's a really big ecosystem of related tools.

And I've never really found a better way of referring to it, other than the Apache Spark and Hadoop ecosystem. So there's a lot of tools that you can run on these clusters. And we have a whole host of options to allow you to run this software on Cloud platform. As I mentioned, you could take your on-premise or other cloud Hadoop vendor and bring it to Google Cloud, like a Cloudera or Hortonworks. We started off the journey with Cloud Dataproc by offering what we call bdutil, which is a set of command line scripts. They're open source and they're on GitHub, and they allow you to bootstrap kind of a lightweight Spark and Hadoop cluster. Some of our large customers actually started using bdutil and have moved to Dataproc. But if you want to use bdutil and get started with a kind of lightweight Spark and Hadoop cluster, you totally can. But with Cloud Dataproc, we thought we could do better. With bdutil you're still kind of manually setting up the cluster. There's still a little babysitting you have to do.

It can be kind of difficult to submit jobs. For instance, you might have to use a command line to do a lot of these things. That might not be a problem for some users, but for your average business user, you probably don't want them to have to use an SSH window or terminal window to go interact with Spark and Hadoop clusters. So Cloud Dataproc, we really tried to do a little bit better and focus on everything except your data. So we take care of the infrastructure, we take care of the software. You bring your data and your jobs, and you can use Spark and Hadoop without worrying about too much else. So there's a lot of open source components, as I mentioned. And they map to different Google Cloud products in different ways. So a lot of these products you can run on Cloud Dataproc. So if you're familiar with the Hadoop ecosystem, Cloud Dataproc uses YARN. And a lot of the applications in that ecosystem will run on YARN. So if you're using them, you can generally come and use Cloud Dataproc.

We also have other services that are, in their own ways, tied to some of these open source products. And I'll get to how they're tied in a minute. But for instance, if you're using HBase, you might want to check out Cloud Bigtable. If you're interested in Apache Beam or Crunch or Flume, Cloud Dataflow is a really great solution that you might want to check out. And additionally, if you're tied to things like Impala or Sentry or HDFS, we have products that map to sort of all of these technologies. So Cloud Dataproc handles, I would say, the majority of them. But around the edges, we also have a lot of products that can fill in the gaps. So I mentioned that these products were all inspired by open source technologies in different ways. So a lot of the open source ecosystem is based on papers that Google has released throughout the years. So for example, the relationship between Bigtable and HBase. In other cases, the open source ecosystem has kind of evolved in and of itself.

At this point, Google's also very actively involved in contributing back to that open source ecosystem, a great example being Apache Beam, which allows you to create batch or stream pipelines in kind of one language with one SDK and one model, and run them on many different runners, whether it's Cloud Dataflow or Flink or Spark. So interestingly, a lot of the things that run on Cloud Dataproc were inspired by things that Google has done in the past. And a lot of things that could run on Dataproc in the future are inspired by things that Google is doing in the present to try and grow this ecosystem all together. So let's talk about Cloud Dataproc just a little bit more specifically. So Cloud Dataproc is a way to create Spark and Hadoop clusters that are generally fast, low cost, and easy to use. And the way that we do that is we rely on a lot of the things that the Google Cloud Platform brings to the Spark and Hadoop ecosystem. And I'll talk about what things we rely on, but we've both tried to focus on making the open source components run really well and tuning some of them, and I can talk about that, but also bring a lot of the strengths of the Google Cloud Platform to open source.

And the point I make here is, it's not about just installing Spark and Hadoop on some servers and calling it good. There's a lot of really deep integrations, optimizations, and thought that goes into making Spark and Hadoop fast, easy, and cost-effective. As an example, say you want a Spark or Hadoop cluster. I started off with that really long line of having to rack servers and configure servers. But even if you don't focus on that, you still have to get the open source software and tune it and customize it and debug it. With Cloud Dataproc, we want that process to be one click. We want you to be able to fill out a very simple form, and with one click create your cluster. Cloud Dataproc is interesting, and I'll show it in my demo. We don't abstract away all of the complexities and knobs in the open source ecosystem. And in some ways, Cloud Dataproc is sort of an anti-Google Cloud product, because we do expose that messiness to you if you want it. We don't necessarily try and prevent you from doing bad things.

We want you to know this is a full-on Spark and Hadoop cluster. If you want to tune specific properties or do very specific customizations, you totally can. We try and make it easy for you and optimize as much as we can for you, but with a lot of these tools, they're very knobby. And sometimes people want to tune and poke those knobs. And we totally let you. So if you want, you can configure your cluster. And we have a set of tools and API endpoints to try and make that very easy. So that could take 20 seconds. And by 90 seconds or so, you should have your cluster. So we go from possibly hours or days or weeks to less than three or four minutes. As I mentioned, what we're trying to do is bring the open source ecosystem to Google Cloud Platform. So we have a lot of products. And we don't want Dataproc to just be isolated, because having just Spark and Hadoop by itself isn't super useful. We want you to be able to use Spark and Hadoop with other products that we have in our platform, and I have some of them shown on this slide.

For example, we want you to be able to use Spark and Hadoop with Google Cloud Storage, which is part of the demo, as a replacement for the Hadoop Distributed File System. If you want to use Bigtable, we want you to be able to use Spark and Hadoop with Bigtable. Same thing goes for BigQuery. If you're using BigQuery as an enterprise data warehouse, we want you to be able to use that powerful enterprise data warehouse with the really powerful Spark and Hadoop ecosystem. We've tried to make it very easy, focusing on moving existing workloads to Cloud Dataproc. So generally, Spark and Hadoop have been out for a while. And in one way or another, a lot of customers have data and workloads that they want to migrate. We don't want that migration to be painful. Generally that migration should be pretty easy. You copy your data, step one, to Google Cloud Storage. Step two, you should be able to change just a few lines of your job. Really, the biggest change you should need to make is just changing the URI prefix of where your data is located.

Instead of being on HDFS, it's in Cloud Storage. You really shouldn't need to make any other changes than that. So moving your actual code should be very, very easy, almost as easy as just copying your data over. And then step three, you actually use Cloud Dataproc. So this is an example job. It's from the Spark examples. And it's a job that is reading from HDFS. In this case, you can see, and it may be hard to see, and I apologize, hdfs is just struck out and we replaced the URI prefix with gs for Google Cloud Storage. So if I wanted to move this example job to Dataproc, that's the only change I need to make, aside from making sure that the data I'm reading exists in Cloud Storage and has the same file structure. So again, very, very easy to move work, because moving work and recoding things is not a value add. You're not focused on your data, and we don't really want that. Under the hood a little bit. So Cloud Dataproc is, for all intents and purposes, a Spark and Hadoop cluster.
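As a minimal sketch of that one-line migration (the cluster hostname, bucket, and paths here are hypothetical, not from the talk), the only edit to a job's input is the URI scheme:

```shell
# Hypothetical paths: the job's input moves from HDFS to Cloud Storage.
# Only the URI prefix changes; the rest of the path stays identical.
hdfs_input="hdfs://mycluster-m/data/nyc-taxi/*.csv"
gcs_input="gs://my-bucket${hdfs_input#hdfs://mycluster-m}"
echo "$gcs_input"
```

Everything after the prefix is untouched, which is why the data copied to Cloud Storage needs to keep the same file structure.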

And you can create multiple clusters in a project and submit work to those clusters. Those clusters are built on existing Cloud Platform technologies. So it's built on Compute Engine. Cloud Dataproc clusters have one or more master nodes. You have a set of worker nodes. And then you can create preemptible worker nodes that take advantage of preemptible VMs. For anybody that's not familiar with preemptible VMs, they're Compute Engine VMs, they're a special type. There's a tradeoff. They have substantially lower cost, but they also have a maximum lifetime of 24 hours. They're basically built on spare capacity in Google data centers. And if we need that capacity back, you lose the preemptible VM. Well, preemptible VMs may be great for compute-intensive workloads with Spark and Hadoop. You can use them with Cloud Dataproc. So again, we're trying to not only just rely on the lower general costs of Google Cloud Platform, but allow you to use the super low cost aspects of Google Cloud Platform to bring your data processing costs down even further.

Clusters in Cloud Dataproc do have HDFS attached to them, based on persistent disk. We generally advise customers don't actually store any type of persistent data on this, because A, you don't want your persistent data to be ephemeral with your ephemeral Spark or Hadoop cluster. And then B, you can actually get better performance often with Google Cloud Storage than you can with HDFS that's based on persistent disk. And then with your clusters, you can read and write; generally most customers will use Google Cloud Storage to do that. Going into a little bit more detail, Cloud Dataproc has an API, a REST API. And our client tools talk to that REST API. That's the same REST API that anybody can use. And the engineering team, which I have to give enormous credit to, has put a lot of time and thought into making the API expressive, yet easy to use. They're a very passionate group of people, and I would not be able to actually talk about Cloud Dataproc if they weren't dedicated and awesome.

That API really has three endpoints– one for managing clusters, one for managing jobs, and one for operations. So on all of these compute nodes, we have a Cloud Dataproc agent which talks to our control plane. And this is really how we kind of create the glue of the Spark and Hadoop cluster and allow for some interesting things, like being able to submit and manage work and interact with YARN on the cluster. We build an image, it's based on Debian. And the open source components are built using another Apache project called Apache Bigtop. So basically we build our own custom sort of open source distribution of the Spark and Hadoop ecosystem. If you haven't used Dataproc before, the team takes great pride in making sure that we not only version all of the releases of Dataproc, you can actually select which bundle of components you want to use and their version set. We also generally try to release frequently. So for example, our preview image for Cloud Dataproc is based on Hadoop 2.8, which is the latest and greatest.

Our production image right now uses very recent versions of both Hadoop and Spark. If you want to use older versions, you can always go back and use older versions of Cloud Dataproc. The goal here is to allow people to use the set of versions that they really want to and not lag for weeks, months, or quarters behind, but also not just put beta software on clusters and force people to dogfood. So as I mentioned, all of our client tools are built on a REST API. Generally there's three client tools, or three ways that people interact with clusters. There is the Google Cloud Console. There's the Google Cloud SDK, which is often seen or known as gcloud. And then there's SSH. Since this is built on Compute Engine, you can interact with the compute nodes as if they were just regular VMs, because they are. So you can SSH into the master and worker nodes. Pricing, a very common question for Cloud Dataproc. So with your– excuse me, I should be able to pronounce my own product name.

With your Cloud Dataproc cluster, you pay for the compute and storage charges that make up your cluster. And then there's a very low $0.01 premium per virtual CPU per hour on top of that, that we charge for all of the awesome things that we provide as part of Cloud Dataproc. It sounds scary, but we're actually very cost-effective, if you compare Cloud Dataproc to other providers. The source actually moved on the slide, and I kind of have to duck here, but it's there. It's actually based on an O'Reilly article, a comparison between Amazon EMR and Cloud Dataproc. And you can see in general, Cloud Dataproc runs jobs faster. That's important because we bill by the minute. We don't round up to the nearest hour. So if you use a cluster for 13 minutes, you pay for 13 minutes, not one hour. We've also really been optimizing Cloud Dataproc to make processing faster. And this is a really important point, because the goal is not to just let things stagnate. We actually want to speed up your data processing workloads.
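To make the per-minute billing concrete, here is a hedged sketch of just the Dataproc premium – compute and storage charges are excluded, and the cluster shape is a hypothetical one, not a quoted price:

```shell
# Dataproc premium only: $0.01 per vCPU per hour, billed by the minute.
# Hypothetical cluster: 1 n1-standard-16 master + 200 n1-standard-8 workers.
vcpus=$((16 + 200 * 8))   # 1616 vCPUs total
minutes=13                # billed for 13 minutes, not rounded up to an hour
premium=$(awk -v v="$vcpus" -v m="$minutes" \
  'BEGIN { printf "%.2f", v * 0.01 * m / 60 }')
echo "$premium"
```

The point of the arithmetic is the proration: a 13-minute run is charged for 13 minutes of the per-vCPU-hour premium, not a full hour.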

It does mean that you might use less Cloud Dataproc, and we actually see that as a victory. So this is an example of a feature which we'll be rolling out soon to Cloud Dataproc, where we changed out the SSL mechanism or provider used in our Cloud Storage connector from the default implementation to one based on Google's fork of OpenSSL. And you can see that reads have gotten, or will get in most cases, substantially faster. Writes will also see an improvement as well. And I mention this to really showcase that a goal is to have you use less Dataproc because you're able to do more with the same amount, or with less. So the supported components– a big ecosystem, there's a lot of stuff that you can run on Cloud Dataproc clusters. By default, we install the most common packages. So that's Spark, Hadoop, Hive, Pig. Some of those packages are tricky because something like Hadoop includes– I don't know how many weird [INAUDIBLE] subprojects and tangents of projects. But generally those are the four core components.

We build a lot of the ecosystem, however, in Bigtop. We have a repository with each release. So if you wanted to install something like Zeppelin, it's actually built in Bigtop. So if you went onto a cluster or used an initialization action, which I will show you in my demo, you could install Zeppelin, because we actually have pre-built it. The reason I call this out is, it's a very common question of how much can you install on a cluster? Well, you can install, really, whatever you want. We try to keep it lean because we don't want to burden clusters and install stuff that people aren't going to use, and is just going to take up resources and make their data processing slower and make them sad, and then that makes us sad. So I'm going to go to a Cloud Dataproc demo to tie together some of the things that I've mentioned thus far. In this demo, we're going to do a few things. We are going to create a cluster. And I'm going to create a fairly sizable cluster.

It's not the biggest cluster in the world, but I think it's modestly big. We're going to query a large set of data. And I'll talk about the data and the queries that we're going to run. We're going to see the output of those queries, and that actually hints at some of the really fancy tooling around Cloud Dataproc. And then we're going to delete the cluster, just to show you that the cluster is really designed as an ephemeral solution. You only need the cluster as long as you actually have data to process. So the data that we are going to query is based on the New York taxi dataset. So from 2009 to 2015, the New York City Taxi and Limousine Commission released all taxi trips and Uber trips in New York. Uncompressed, the dataset is about 270 gigabytes. It's in CSV format, and there's about 1.2 billion trips in this dataset. So I'm going to switch here. So this is the Google Cloud console, if you have not seen it before. And this is the form or the page to create a Cloud Dataproc cluster.

I could create this cluster through just a raw API call. I could also use gcloud. I'm showing you through the Cloud console, just because it is generally much easier to demo than trying to type in a console. First thing that we're going to do is give the cluster a name. I'm going to call my cluster james-demo. The second thing that you generally will want to fill in is the zone where you're going to create a cluster. Cloud Dataproc is generally available wherever Compute Engine is available. That is going to stay true as Compute Engine is made available in more regions and Google Cloud expands. The Cloud Dataproc team wants to be available everywhere that Google Cloud is available. So right now, the list is here and the list will grow over time. I'm going to choose US Central, just because I suspect that's going to give us pretty good performance since we're in the US right now. You then choose how you want to configure your master and worker nodes. And these are very analogous to traditional Hadoop master and worker nodes.

For master nodes, we offer three different varieties. And one of them is actually a new feature, which is having a cluster that's just one node. That means your master is your worker– good for development, exploration, possibly education, lightweight data science. The default option is just having one master. We also offer the ability to have high availability, which is three masters. It's based on ZooKeeper, and it provides YARN and HDFS high availability. I'm just going to create a standard cluster here because we don't need high availability. You can choose the machine type that you want to use in your cluster. In the web UI here, we have all of the standard non-shared CPU machine types. You can also use custom VMs with Cloud Dataproc. So if you actually wanted to use a machine type that was based on six CPUs and 58.62 gigabytes of RAM, you could absolutely do that through the API and the command line. In this case, I'm just going to choose an n1-standard-16 for my master node.

And worker nodes, same thing. Something that is worth calling out here is in the very near future, Cloud Dataproc will be supporting 64-core VMs, which were recently announced in beta on Google Cloud Platform. And I called that out because often, a lot of the improvements that are made to the underlying services of a cloud platform, whether it's Compute Engine or Cloud Storage or our networking, Cloud Dataproc benefits from those. So as a lot of other components in Google Cloud get better, it means that using Spark and Hadoop gets incrementally better as well. For my worker nodes, I'm going to use an 8-core n1-standard-8; just in our testing, these usually work pretty well. I'm going to go ahead and create kind of a modestly large cluster here. Let's go ahead and add 200 nodes. That will give us 1600 YARN cores, and about 4.7 terabytes of YARN RAM. We're not going to use a lot of HDFS, so I'm actually going to lower the amount of persistent disk that we attach because we're not really going to be writing to HDFS.

And then you can see there's some advanced options that are optional, that you can configure. I could do things like use networking, and Cloud Dataproc supports a wide variety of networking options. You can select, as I mentioned, different versions of Cloud Dataproc. In this case, the configuration I'm going to apply is the one [INAUDIBLE] I'm going to install Presto in my cluster, just to show you that you can install a wide variety of things on Cloud Dataproc. And Presto is actually not part of the sort of Apache big data ecosystem, but I'm going to install it and show you that it works. And we're going to run all of our queries through Presto, actually. So we'll go ahead and create that cluster. All right. So the cluster is in the process of bootstrapping itself. Let's go ahead and fill out the form to submit our job. Now, with Cloud Dataproc, one of the things that we've paid a lot of attention to is trying to make it really easy to interact with your cluster.
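The console form I just filled out maps to a single gcloud command. The sketch below is hedged: the cluster name mirrors the demo, but the zone, bucket, and init-action path are hypothetical, and the command is assembled and echoed rather than executed:

```shell
# Assemble (and echo, rather than run) the equivalent cluster-create command.
# Bucket and init-action path are hypothetical placeholders.
cmd="gcloud dataproc clusters create james-demo \
  --zone=us-central1-b \
  --master-machine-type=n1-standard-16 \
  --worker-machine-type=n1-standard-8 \
  --num-workers=200 \
  --initialization-actions=gs://my-bucket/init/presto.sh"
echo "$cmd"
```

The initialization action is the hook used here to install something like Presto that isn't in the default image; preemptible workers would be one more flag on the same command.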

And often when you want to submit work– in this case, we call them jobs– to clusters, you have to use complicated tooling or you have to log in and open an SSH window. And that's just no good. If you have a lot of work to do, we want it to be very easy for you to submit your Spark, or PySpark, or Spark SQL scripts to Cloud Dataproc. The other thing that I'll show you here is the Jobs API and the tools on top of it allow us to not only submit work, but we can actually see the streamed output from those jobs. And the nice thing is, I could actually submit the job and close my computer and throw it in a fire, and that job is still going to run. So it's not tied to the persistence of that one computer, which is actually a really annoying problem if you use Spark and Hadoop a lot– you open up a terminal window, and if that terminal window goes away, unless you thought about it, that job is going to just vanish when you leave your computer. So first thing, I'm going to choose the cluster that I want to run the job on.

The next thing I'm going to do is choose the job type. Right now we support six job types. I'm going to run Pig, and people always give me a hard time about this, but there's a very good reason. The reason I'm going to use Pig is because I am going to abuse Pig. I'm actually just going to use Pig to shell out to call Presto. A little convoluted, but it's convoluted on purpose to show you that we're not putting up bumpers or guardrails. You can do really, whatever you want. The goal here is to allow you to have life be very easy. So for example, if I wanted to just run Spark SQL, I could enter a file with my Spark SQL on Cloud Storage, or just the raw query text, very easy. And me using Pig to shell out to Presto, that's a bit more complicated and interesting. But that's really to show you that if you want to develop with Spark and Hadoop, we're not going to try and artificially limit you and constrain you too much. So with my Pig script here, I am going to specify my query file, which is inside of a Cloud Storage bucket.

And this SQL script is going to essentially just query Presto with a few queries. Let's go take a look at the cluster that we created. I was talking a little bit, unfortunately, and just blabbering a lot. My cluster actually was ready probably two minutes ago. So our 200 node cluster with 1600 cores is ready to go. All of the VMs are up, and it's ready to take work. If I wanted to, I can edit this cluster and do things like add and remove worker nodes. I can add and remove preemptible nodes. And that's while the cluster is running. So if I wanted to add more nodes to this cluster while it was running, I totally can. And just because it sounds like fun, let's go ahead and do that. So let's submit our job. And the job is now running on my cluster. And as the job gets started, you'll actually see the job output start spooling here live to the screen. So you can actually see, the first thing that's going to happen, if you have really good eyesight, is we're going to create an external table and then start running queries against that external table.

This is an example of us trying to make life easy for you. So the job was sent to the cluster. We're interacting with open source components and feeding the output back to you. Job's running, let's go ahead and say oh, I'm worried about this job, let's go ahead and add 20 more nodes or 10 more nodes, just because we want a really big cluster. So we're going to go ahead and update the cluster while the job is running. It's not going to impact the job at all. And you can see that the job is starting to spool output to this screen live. It's going to update. You'd have the same experience if you used the gcloud command. So if you said gcloud dataproc jobs submit pig, and then fed it the required arguments, you're going to see this output to the screen as well. This job is not writing anything anywhere. It's really just emitting the results to standard out. Obviously you could save results back to BigQuery or Cloud Storage or a number of other cloud products.

You can see that while the cluster is resizing, it doesn't go down. There is no interruption of service. It just continues to work without hassle, without you needing to do anything much more specific than that. We just went and queried all of that data. It took a minute and 33 seconds. What's really impressive about that is this data is in Cloud Storage. So between Cloud Storage, where this giant dataset is stored and the Spark and Hadoop cluster, it was able to read the data and query it in one minute and 33 seconds. And that's really because there's really fast networking in our data centers in general. And Cloud Dataproc cluster is able to use that really fast networking to go query Cloud Storage in a jiffy. So actually we're just done with this cluster. That's really what I wanted to show. So let's go ahead and say, we're done. We actually don't need this cluster anymore. We'll just go ahead and delete that cluster. And the cluster is going to go away.
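For reference, the three demo steps– submit the Pig job, resize the cluster while it runs, delete the cluster when done– can be sketched as gcloud commands. The cluster name follows the demo, but the bucket and script path are hypothetical, and the commands are echoed rather than executed:

```shell
# Echo the gcloud equivalents of the demo: submit, resize, delete.
submit="gcloud dataproc jobs submit pig --cluster=james-demo \
  --file=gs://my-bucket/queries/presto-queries.pig"
resize="gcloud dataproc clusters update james-demo --num-workers=210"
delete="gcloud dataproc clusters delete james-demo"
printf '%s\n' "$submit" "$resize" "$delete"
```

The resize is a live update against a running cluster, and the delete is what makes the ephemeral, pay-for-minutes model work.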

No fuss, no hassle. So what happened? We created a cluster that was 1600 YARN cores. It ran the Spark and Hadoop ecosystem. In this case, we took advantage of Presto and we did it in one click. To put a little bit more detail in what we did, we created an external table that was based on the data in the Cloud Storage bucket. And then we just ran a few simple queries against that data. These aren't the most complicated queries in the world, but it is reading a substantial portion of the data from Cloud Storage using Presto. Ultimately, and I've run this demo enough to run the costs on all these different clouds, we didn't have this cluster for longer than 10 minutes. So we're going to actually be charged the 10 minute minimum for Cloud Dataproc. Everything beyond 10 minutes is per minute. That 210 node cluster costs $12.85 to run. If you ran it in Amazon EMR or HDInsight, you'd be paying substantially more, respectively, just based on either their rounding up or just more expensive costs on a per-minute basis overall.

So I've told you Dataproc is awesome. We have some just kind of random quotes from Twitter. If you don't believe me, you can search Twitter, these are definitely there. We've had both customers offload from other clouds to Dataproc or from on premise to Dataproc. And we've really tried to focus on customer feedback to make Dataproc better. That's a core part of how we improve the platform overall. Ultimately, we also try to share best practices with customers to make using Dataproc easier. And I want to share some of those tips with you, so that if you are interested in using Spark and Hadoop on Cloud Platform, or you are using Cloud Dataproc already, you have a better experience. The first tip that we recommend to a lot of customers is to split clusters and jobs. Canonically on premise, you most often will have one cluster, and everybody will just submit all of their jobs to that cluster. And they may try and do it intelligently and submit their jobs over time or schedule them, or they may just submit all of the jobs and see what happens.

With Cloud Dataproc, you can have multiple clusters. And the clusters can be right sized and shaped and conditioned and configured to run specific types of jobs. So if I have a bunch of jobs and my first two jobs run pretty quickly, and I know I can run them in parallel, I can send them off to Cluster A. If I have two slower jobs that are very intensive and I want them to run separately, you can run them on separate clusters. We also support the use of cloud labels with Cloud Dataproc. Labels are a key value pair that you can associate with cloud resources. With Cloud Dataproc that's really important, because if I have these three different clusters and I wanted to tag maybe organization equals A, B, and C, I can then do cost accounting and billing accounting based on those labels and actually filter and list by those labels separately. So if you are running Spark and Hadoop in an enterprise setting, and you actually wanted to do something like be able to figure out how much people were using or trace that back, labels make it very easy.
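As a sketch of what this looks like with the gcloud CLI (the cluster names, machine types, and label values here are illustrative, not from the talk):

```shell
# Create two clusters shaped for different kinds of jobs, each tagged
# with labels so billing and listings can be filtered per organization.
gcloud dataproc clusters create cluster-a \
    --num-workers 2 \
    --worker-machine-type n1-standard-4 \
    --labels org=a,workload=fast-parallel

gcloud dataproc clusters create cluster-b \
    --num-workers 10 \
    --worker-machine-type n1-highmem-8 \
    --labels org=b,workload=heavy-batch

# List only the clusters belonging to organization "a".
gcloud dataproc clusters list --filter='labels.org=a'
```

The same `--labels` flag is accepted when submitting jobs, which is what makes the per-team cost accounting described above possible.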

And splitting clusters and jobs makes it very easy. And labels can apply to both Cloud Dataproc clusters and jobs. Development and production: if you need to have a production and development environment, and maybe a staging environment, since you can have multiple clusters, it's very easy to have separate prod, dev, test, or experimentation clusters. You just create a new cluster. The nice thing here is, if you wanted to do some development and you didn't want to break or impact other users or other jobs, you can just create a development cluster. Additionally, since multiple clusters can read the same data, for instance from Cloud Storage, if I have all of my production work running and I want to do some development (maybe I have a bunch of Spark code and I want to test it against Spark 2.2), you can create a new cluster, use a preview image, and start testing against that Cloud Storage bucket. You don't have to do anything differently, you don't have to copy data; it's very easy to do.
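A minimal sketch of that dev-cluster pattern, assuming a hypothetical bucket and job jar (names are made up for illustration):

```shell
# Create a small development cluster running the preview image,
# separate from production so nothing there can be impacted.
gcloud dataproc clusters create dev-cluster \
    --image-version preview \
    --num-workers 2

# Test the same Spark code against the same Cloud Storage data that
# production reads; nothing is copied, the job just reads gs:// paths.
gcloud dataproc jobs submit spark \
    --cluster dev-cluster \
    --class com.example.MyJob \
    --jars gs://my-bucket/jars/my-job.jar \
    -- gs://my-bucket/input/
```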

And we've had a lot of success trying to help customers migrate to the idea that you can create clusters when you need them. If you don't need them, you can delete them. There's really no reason, honestly, to– I would advocate if you're not using a cluster, delete it. We don't want you to keep it around. That's not good for you, and ultimately that just makes it not good for us. So create and delete clusters often. If you have a bunch of work and you want to schedule creating and deleting clusters, you can. As an example, if you are using something like Apache Airflow, you can actually create Cloud Dataproc clusters, run your jobs using Apache Airflow, and then delete your cluster. Great example of an ephemeral use case where you can create and delete clusters often just based on whenever you actually need them. Cloud Storage. A lot of people that shift from on-premise to Cloud Dataproc ask us, what's the performance of Cloud Storage? How does it compare with HDFS?
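The ephemeral pattern described above can be sketched in three commands (cluster and bucket names are hypothetical; in Airflow the same create/submit/delete sequence would typically be expressed with its Dataproc operators):

```shell
# 1. Create a cluster only when there is work to run.
gcloud dataproc clusters create ephemeral-cluster --num-workers 4

# 2. Run the job against data in Cloud Storage.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl.py \
    --cluster ephemeral-cluster

# 3. Delete the cluster as soon as the work is done; the input and
#    output data live in Cloud Storage, so nothing is lost.
gcloud dataproc clusters delete ephemeral-cluster --quiet
```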

Why should I use Cloud Storage? Generally speaking, Cloud Storage is going to give you very high throughput. It has its own set of features which you may find really useful, like being able to apply security controls or having auditing on Cloud Storage buckets. It also allows you to share data in Cloud Storage between many different Google Cloud products. You could do federated queries with BigQuery on that Cloud Storage bucket. So moving away from HDFS not only brings performance and cost benefits for your Spark and Hadoop use cases, but it may provide a lot of other benefits for data that would normally have just been living in HDFS, because that's where it would live. The job submission API: if you are submitting work to your clusters, you can submit work in many different ways. There's probably an infinite number of ways, when you look at it, to submit work. You could SSH in and use spark-submit. You could use something like Airflow. You could write a Python script sitting on a crontab, just submitting work whenever you have it scheduled.
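For moving existing HDFS data out to Cloud Storage, one common approach (assuming the paths shown, which are illustrative) is Hadoop's DistCp tool; Dataproc clusters ship with the Cloud Storage connector installed, so `gs://` paths work directly:

```shell
# Run from a cluster node: copy a directory out of HDFS into a
# Cloud Storage bucket, in parallel across the cluster.
hadoop distcp hdfs:///user/data/events gs://my-bucket/events
```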

Generally we advocate that you use our job submission API, and there are a lot of reasons for that. It allows you to do things like apply labels to jobs, as you saw. You can see the output from jobs. It makes things much easier, so you only have to fill in a few arguments to actually submit work to a cluster. So with the job API, we've really tried to make it easy for you to broker, interact with, and manage work on clusters. As I've shown you and talked about, you can scale clusters at any time. We scaled the cluster in the demo. As I mentioned, I didn't demo it, but you can use custom VM types for your master and worker nodes. And you can use preemptible nodes with your clusters if you want. One thing that I ran to install Presto is what's called an initialization action. Initialization actions are scripts which can be interpreted on the cluster, so these could be shell scripts, they could be Python scripts, they could be Ruby scripts.
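A sketch of submitting through the job API with a label attached (the cluster name and label are illustrative; the SparkPi example jar path is the one Dataproc's quickstarts use):

```shell
# Submit a Spark job through the Dataproc jobs API; driver output
# streams back to the terminal and is retained with the job record.
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --labels team=analytics \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Jobs submitted this way can later be listed and filtered by label.
gcloud dataproc jobs list --filter='labels.team=analytics'
```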

These scripts are run after the services start on the cluster. They're really useful for doing things like staging jars or copying data locally that you might want to test on, or, in this case as we saw, installing Presto. We also have a mechanism through our client tooling to allow you to set properties in cluster files. So if you wanted to modify, say, properties in the core-site XML file, we have a way for you to easily do that, so you don't have to SSH in or write an initialization action to do it. So we've tried to create two really powerful but easy ways for you to customize clusters when you start them up. There are a bunch of new Cloud Dataproc features that I want to talk about that we've released in the last few weeks in anticipation of Next. We've actually gotten a few more out that I can talk about as well. The first is restartable jobs. In our API and our client tooling, traditionally when a job ran and failed, that was sort of it. The job would just sit there, having failed.
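Both customization mechanisms can be combined in one cluster create command; a sketch, assuming a hypothetical install script in your own bucket (Google also publishes maintained initialization actions, including one for Presto, in a public repository):

```shell
# Create a cluster with an initialization action and overridden
# cluster properties. --properties takes file_prefix:key=value pairs,
# e.g. "core" for core-site.xml, "spark" for spark-defaults.conf.
gcloud dataproc clusters create custom-cluster \
    --initialization-actions gs://my-bucket/scripts/install-presto.sh \
    --properties spark:spark.executor.memory=4g,core:fs.gs.block.size=134217728
```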

In some cases, this may not be desirable. You actually may want a job to restart. For example, you may have a streaming job that runs out of memory in a corner case, and you want that job to restart. With our client tooling now, when you submit a job to Cloud Dataproc, you can specify the number of times per hour that you want that job to restart. This is really useful for long-running jobs. We designed it with an eye on streaming applications, as we look to better support streaming applications on Dataproc. That way you can be confident that if your work fails, we're going to kick that job off again for you. GPUs were recently announced on Cloud Platform and made available on Compute Engine. This is a good example of how, when innovations occur in other parts of Cloud Platform, if they make sense to bring to the Spark and Hadoop ecosystem, we absolutely want to bring them. So when you create Cloud Dataproc clusters now, you can specify the number of accelerators, which are essentially GPUs, that you want to attach to either your master or worker nodes in your cluster, from zero, which is the default, up to 8.
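A sketch of both features from the CLI (cluster names and jar paths are illustrative; the accelerator flags were in the beta track around the time of this talk, hence `gcloud beta`):

```shell
# Restartable job: allow the driver to be restarted up to 5 times
# per hour if it fails, useful for long-running streaming work.
gcloud dataproc jobs submit spark \
    --cluster streaming-cluster \
    --max-failures-per-hour 5 \
    --class com.example.StreamingJob \
    --jars gs://my-bucket/jars/streaming-job.jar

# Attach one GPU to each worker node at cluster creation time.
gcloud beta dataproc clusters create gpu-cluster \
    --worker-accelerator type=nvidia-tesla-k80,count=1
```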

If anybody is really interested in GPUs and Spark, you probably know this is an evolving ecosystem, and I think the Spark support and other ecosystem support for GPUs is an evolving story. But there are a lot of cases where you may want to use GPUs now or in the future, and it was very important to us to let you use them with Cloud Dataproc as soon as possible. I briefly mentioned single node clusters. A common request we had from customers was that they wanted to run very small tests and didn't need a full blown cluster. They just wanted to run a really quick test using Spark, or do really lightweight data science. With single node clusters, you create a cluster which is just one node that acts as both the master and worker. That way you don't have to create a full blown Spark and Hadoop cluster, because if you're leading a class on how to use Spark, for example, you don't need a giant cluster to do that.

You actually just need a smallish VM that has Spark installed. And with single node clusters, we've really tried to make it easy for you to create these lightweight clusters for experimentation and development. Regional endpoints and private IP addresses: when we launched Cloud Dataproc almost a year ago, our API, and the client tools that use it, supported one region, which was global. Recently we've added individual Cloud Dataproc regions for each Compute Engine region. So in our API and via our client tools, you can now isolate to a specific region. You could say, if you're using our tools, use the US region, and the calls will go to that region. We've distributed our control plane, and essentially we've stood up individual stacks of the control infrastructure for Cloud Dataproc in each of these individual regions. The benefit there is you may get better performance. For example, if you are in Europe and you wanted to use the Europe region, you may get much better performance by isolating to that region for applications that are working entirely in Europe.
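Both of these look like this from the CLI (names and regions are illustrative; the single-node flag sat in the beta track around the time of this talk):

```shell
# Single node cluster: one VM acting as both master and worker,
# good enough for lightweight data science or a teaching sandbox.
gcloud beta dataproc clusters create sandbox --single-node

# Pin both the API calls and the cluster to a regional endpoint,
# e.g. for a workload that lives entirely in Europe.
gcloud dataproc clusters create eu-cluster \
    --region europe-west1 \
    --zone europe-west1-b
```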

Moving on to private IPs: clusters and cloud resources have often had public IPs attached to them. One very, very common request, from large enterprise customers especially, was: I want clusters, but I don't want them to have a public IP attached, because that makes my security people very sadface. So with the recent launch of Cloudpath last week, we now support creating clusters that do not have a public IP attached to them. An important differentiator here is that we try to make it very easy. This isn't something that's unique to Dataproc, but the way that we've tried to innovate and make life better for all Google Cloud customers here is to make it very easy to create clusters that don't have a public IP attached to them. Both of these features are in beta, and they were both commonly requested features. That drives back to how we try to respond to customer feedback. So honestly, if you are interested in using Dataproc, or you are using it and you have ideas, requests, gripes, loves, complaints, send them our way.
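A sketch of creating such a cluster (the subnetwork name is hypothetical; this was a beta feature at the time of the talk, hence `gcloud beta`):

```shell
# Internal IPs only: no public addresses are attached to the cluster.
# The subnetwork needs Private Google Access enabled so nodes can
# still reach Cloud Storage and the Dataproc APIs.
gcloud beta dataproc clusters create internal-only-cluster \
    --subnet my-private-subnet \
    --no-address
```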

We try to be very, very customer-focused and hands on. How to get started with Cloud Dataproc: there are a few different sessions which either directly or indirectly mention Cloud Dataproc. Some of these sessions have unfortunately already passed, but the good news is, I believe the videos will be on YouTube. This session is really the introduction to a lot of these concepts; many of the other sessions go into very specific ideas and concepts. If you want to get hands-on literally right now, there are a ton of Codelabs downstairs, and you can get started with the Cloud Dataproc Codelabs. We also have a set of quickstarts and tutorials to show you how to create clusters and submit jobs. And to go back to initialization actions, a lot of what we do that's customer-facing and open source lives in various GitHub repositories. So we have our initialization actions there, for example, for installing Zeppelin and Presto. If you need help, there are many different ways you can get help with Cloud Dataproc.

We have documentation online. We also have our Release Notes. I again have to give the engineering team credit here: we release updates to Cloud Dataproc fairly frequently, which we really take pride in, because we're trying to bring you the latest and greatest of this large ecosystem along with a ton of innovations to make it better. So if you are interested in Dataproc, watch our Release Notes pretty closely, because they usually change about every week for the better. If you actually want hands-on help, a really good place is Stack Overflow, using the google-cloud-dataproc tag. You could probably indirectly meet many members of the Cloud Dataproc team just through their answers on Stack Overflow. And then we have our informal email list. And if you are a Google Cloud support customer, that's an option as well. And with that, if there are any questions, I'm happy to take them. Other than that, thank you guys very much. [MUSIC PLAYING]



The great power provided by open source data processing tools has often come with the burden of great responsibility. The open source data processing ecosystem, including Apache Spark and Apache Hadoop, is robust but frequently hard to administer, use at scale, and manage. Google Cloud Platform (GCP), as the open cloud, can help you utilize these OSS tools at scale, cost effectively, and with hands-off management. By combining Google Cloud's strengths in both software and hardware, we offer Google Cloud Dataproc. With Cloud Dataproc, you can quickly, easily, and cost-effectively use the Spark and Hadoop ecosystem instead of running your own infrastructure. In this video, James Malone covers the basics of using Cloud Dataproc to get started with managed Spark and Hadoop within minutes.

