Google Cloud NEXT '17 - News and Updates

Moving your Spark and Hadoop workloads to Google Cloud Platform (Google Cloud Next ’17)

(Video Transcript)
[MUSIC PLAYING] DENNIS HUO: Good afternoon everybody and thanks for coming. To set the stage a little bit, I want to begin with what you probably already know. Hadoop and its ecosystem tools have become increasingly crucial for businesses of all types. And this is in large part for its promise of providing a way to unlock value and insights from your data at massive scale, and in particular to do so with unprecedented agility, so that you can quickly react day to day to changes in the landscape of your industry. Now since you're here today in this session, you've probably also encountered, and maybe dealt with yourself personally too, this unfortunate fact that your administrative overhead and operational complexity of managing your Hadoop infrastructure is also scaling up, and perhaps superlinearly with the increasing sophistication and the diversity of your workloads. And this is something that threatens to undermine the very value proposition that it brings in the first place by eroding that agility.

So that's where Google Cloud Platform and Cloud Dataproc come in. We'll see how a migration to cloud and an incremental reorganization of the way you think about your Hadoop infrastructure and your Spark workloads can let you preserve all the power of this open source stack that you've built your systems on already, but also remove those brick walls and let you get back to focusing on your data, focusing on your insights. So my name is Dennis Huo. I'm tech lead of Google Cloud Dataproc. And in today's session, we'll be going through an end-to-end deep dive into what it looks like to migrate your Hadoop and Spark workloads to Google Cloud Platform. Now along the way, we're going to find that no step really has a one size fits all answer. But rather, we'll be focusing on some overarching guiding principles and reusable building blocks that you can use to build the right migration strategy for your business needs. So let's begin with, what is Google Cloud Dataproc?

Dataproc is Google Cloud Platform's managed Hadoop and Spark service. Its overarching goal is ultimately just to help tip the scales away from having to think about Hadoop as low level infrastructure. So to help illustrate a little bit of what it means to be fast and easy and cost effective, imagine, if you will, the next version of Hadoop or Spark comes out. The shiny new version promises something like a 10x speed up of all your workloads or, say, you can simplify one of your 10 stage pipelines into a one liner in Spark. So just consider, what might it take to go from that time where you've heard about these versions and want to try it out to being able to deploy this, to canary this and be ready to switch over your production pipelines to the latest and greatest at full production scale. Well, with Dataproc, it should be no more than a one liner on the command line, or alternatively, just about four clicks on the Cloud Console. And about 90 seconds later you should be up and running with the latest and greatest, ready to canary your workloads even at full production scale.

And this same kind of speed and simplicity is also what lends itself to cost savings, ultimately, because you can more easily ensure that you're only deploying the resources that you need when you need them. So right off the bat, you can significantly reduce your cost just by making sure that you're actually paying for what you actually use. To put this in perspective in the big picture, ultimately there is a whole spectrum of approaches to running Hadoop on Google Cloud Platform. And this ranges from thinking about GCP as your infrastructure to thinking about GCP as a platform. Now along this whole spectrum, Google works with third party Hadoop vendors and provides open source connector libraries and tools. So in general, your migration doesn't have to be an all or nothing proposition. Now some of the concepts that we're talking about today are going to be applicable no matter what kind of approach to Hadoop on GCP you take, while others are going to be highlighting new architectures that are made possible by moving to this fully managed Dataproc side of the spectrum.

Taking a peek under the hood at how Dataproc is made, Dataproc layers in features in between the layers of the underlying Hadoop stack, and adds these features for easy visibility into status and monitoring, adds control hooks and configuration hooks for things like managing your clusters, resizing your clusters, optimizations. But it does so on top of the open source interfaces of the Hadoop stack, ultimately. So what this means is Dataproc adheres to the principle of letting you drop down one level into the lower level stack whenever you need. So you don't have to abandon any of your low level, advanced customizations just to get the value added features of Dataproc. You can bring all that with you and still keep those low level customizations. So as an example here, we have Dataproc clients and raw Hadoop clients. So if you want, you can interact entirely with your Dataproc clusters using the standard underlying raw Hadoop clients, connecting over TCP.

And if you don't want to, you never even have to acknowledge the existence of Dataproc's Jobs API. On the other side of the spectrum, you might choose to only interact with your clusters through Dataproc Jobs APIs, through these standardized web-based REST interfaces, and never even have to be the wiser about VMs existing under the hood at all. But both of these options are available. Dataproc's Hadoop distribution is based on the open source Apache Bigtop project, and natively will install and support Pig, Hive, Hadoop, and Spark. Other ecosystem tools like Tez, Zeppelin, and Kafka can all be activated from the Dataproc distribution using something called initialization actions, which are just scripts that you can use to customize your Dataproc clusters for your own needs. So that brings us to planning a proof of concept, or a POC. Thinking back to that on-premise static cluster, you may find yourself becoming increasingly dependent on increasingly monolithic Hadoop clusters.

And in a lot of ways, some of the key strengths of Hadoop as a great data platform are also the source of its weaknesses. And what I mean by that is there are all these components in the Hadoop ecosystem that are so well integrated with the rest of the Hadoop stack that it's easy to augment your existing Hadoop cluster with powerful tools like Oozie for scheduling or Zeppelin or Jupyter for data science, maybe various HiveServer2 clients. And along the way, you don't really notice, but then one day you'll look back and you'll see that somehow you've dumped all this systemic complexity in one place on your cluster and loaded it down in such a way that it's that much more difficult, in aggregate, to now reshape or rearchitect your workloads when it becomes necessary to do so. So what this means for your POC is, as you're thinking about how to migrate, you want to make sure you're not just thinking about forklifting your entire Hadoop stack onto GCP, complexity and messy hacks included.

But instead you want to keep your eyes open for opportunities to simplify your architecture as you go. Keep in mind that clusters aren't an end unto themselves. You don't want single clusters to constitute your entire data analytics universe in a sealed, hermetic way anymore. You think about the broader platform on the whole. So for example, you'll use clusters as building blocks towards your greater data platform at large. And perhaps you'll use BigQuery for the bulk of your data warehousing needs. Maybe you'll find that Stackdriver is the easiest way to unify your logging and monitoring across all your GCP services. Now as one of the first steps towards your POC and subsequent migration, you'll want to make sure that you outline your high level goals, the driving forces of what you're trying to get out of your migration to GCP. And typically these are going to be high level goals that can be achieved in any number of ways.

So for example, if you're thinking about cost savings, you'll want to make sure you're thinking about price performance and total cost of ownership instead of getting stuck on comparing CPU for CPU or byte for byte storage. Another example in terms of scaling elasticity– if you're unlocking this elasticity, GCP isn't just a way to speed up the hardware requisitioning cycle. It can be. It can be a faster way to requisition hardware in a traditional way. But really, you get the most mileage out of larger architectural changes. So for example, by separating out your storage and compute and scaling them independently, or separating out different workloads and scaling them independently, this is how you can achieve the best scalability for what you've paid for doing this migration in terms of making changes to your architecture. Now these are just some examples of goals that you might come in with and how GCP features might relate to them. Ultimately you know best what your driving goals are.

And this is how you will shape your overall migration plan. Now it's probably also worth mentioning right now, I know I've been talking a lot about change: change this, change the way you're thinking, change your architecture. And admittedly it's scary to talk about change. Change comes with cost in itself. It can be a source of complexity in itself. And it's definitely a valid concern to think about the cost of change. But what's important is that you should neither write off change altogether nor blindly accept change as good. So it can be good, it can be bad. But it doesn't have to be a significant cost. And there are definitely opportunities here, during your Hadoop migration especially, for low hanging fruit where the cost isn't that significant and you have significant benefits for the long term viability of your architecture. So as long as you're weighing those costs and benefits, that's what's important, even if it doesn't always land on the side of choosing that big architecture change, as long as you know what's going into that decision.

On a more concrete level, you'll want to identify as many of the characteristics that go into the requirements of your long term architecture as possible. So really characterize your existing stack and figure out what you're doing. And I focus on the characteristics more than the requirements because ultimately you're not just coming up with a checklist of requirements. Characterizing your stack is a great opportunity to identify which components you really love, the ones that are really providing you value and will end up in your long term architecture. You'll also find that a lot of things are really just workarounds or a means to an end, things that actually bring you pain but were a necessary evil in the meantime, and where you can really slice out complexity by doing your cloud migration and adopting something a little more cloud native for those pieces. On a more practical level, the logistics of planning out your timeline and capacity planning, that's something that you'll want to do as early as possible in your workflow.

So coming up with your estimates of what you'll need over the course of your migration to make sure that you're not suddenly blocked by something silly like a quota request when it comes time to actually ramp up your migration. So that brings us to data migration. This is one of the first concrete technical items that you will be tackling during the course of your POC and migration. The cornerstone of all your migration is likely to be the Google Cloud Storage connector for Hadoop. So this is a connector that Google provides and implements the Hadoop file system interfaces on top of Google Cloud Storage. So we strive really for out-of-the-box compatibility. So this means that, for the most part, your migration will take place using familiar Hadoop tools that you're already using inside your Hadoop stack, running simple DistCp jobs into GCS. Other options that we provide include HDFS on persistent disk, which comes standard with Dataproc. Or alternatively you might consider using local SSD as a higher performance alternative.

So it's good for things where you might want to pay a little bit extra in order to guarantee very low round trip latency, for things like business intelligence or online data analytics. A slightly larger architectural change involves potentially replacing your entire HBase cluster with Google Cloud Bigtable. So in the same vein as GCS being a file system replacement, Google provides HBase compatible clients for Cloud Bigtable so that you can transparently migrate over any workloads that normally run against HBase and use Google Cloud Bigtable instead and treat it like an HBase cluster. And this is something that will drastically reduce your operational complexity by no longer having to manage your HBase cluster yourself. Another option is using BigQuery for your data warehousing and SQL based analytics. So in the same vein we provide connectors for BigQuery, for Hadoop, so that you can still interact with your storage in BigQuery while doing the bulk of your data analytics in BigQuery itself, with zero ops, without managing your clusters.

Timeline-wise, some of the data storage plans that you take on will help shape your migration strategy in itself. So as an example, one recurring pattern is you might find that you're archiving data into GCS. Or you might find that regardless of how you want to migrate your workloads, you're starting to do disaster recovery backup into GCS. So in these cases, it's very natural to start adopting ad hoc use cases inside of Dataproc. So now that your data is already there, you might find that an unexpected backfill comes up. And you can use Dataproc to get that done quickly. Or maybe new pipeline development where you don't want to occupy space on-premise, it's a natural fit for using ad hoc Dataproc usage. Digging a little bit deeper into the compatibility of GCS and HDFS, like I mentioned, the GCS connector is designed to be compatible for the vast majority of Hadoop use cases, so a drop in replacement. Anything that uses file input/output formats, Spark text, binary files, all these things just work without you having to think about it.

That said, there are some caveats, right, because Google Cloud Storage is an object store, not a file system. And this is something that's becoming also increasingly common within the Hadoop ecosystem itself is to use these object stores. So different cloud providers and private cloud solutions like OpenStack Swift also have similar object store semantics. And this is something that's recognized and becoming better supported within the Hadoop community itself day to day. So that will get better. But in the meantime, it means that there are some more specialized HDFS features that might require still using HDFS. This includes directory semantics. So an object store doesn't actually have directories, so you don't have things like directory access times. Additionally, objects are immutable once they are created. So one use case that might still do well on HDFS instead of GCS would be if you have a workload that requires appendable files. In general though, a lot of these constraints come from architectural designs that are actually benefits.
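To make the object-store caveats above concrete, here is a toy in-memory sketch in Python. It is not the GCS API and not how the GCS connector is implemented; the class and the object names are hypothetical. It only illustrates two of the semantics the talk mentions: a "directory" is nothing more than a shared name prefix, and an object is written whole, so an append has to rewrite the object.

```python
# Toy model of object-store semantics (illustrative only, not the GCS API).
class ToyObjectStore:
    def __init__(self):
        # Flat namespace: full object name -> immutable bytes.
        self._objects = {}

    def put(self, name, data):
        # Objects are written whole; a put replaces any existing object.
        self._objects[name] = bytes(data)

    def get(self, name):
        return self._objects[name]

    def list(self, prefix=""):
        # "Listing a directory" is just filtering object names by prefix.
        return sorted(n for n in self._objects if n.startswith(prefix))

    def append(self, name, more):
        # No in-place append: read the whole object and write a new one.
        self.put(name, self.get(name) + bytes(more))


store = ToyObjectStore()
store.put("logs/2017/03/a.txt", b"hello")
store.put("logs/2017/03/b.txt", b"world")
store.put("other/c.txt", b"!")

print(store.list("logs/"))          # only names under the "logs/" prefix
store.append("logs/2017/03/a.txt", b" again")
print(store.get("logs/2017/03/a.txt"))
```

Because there is no directory object in this model, there is also nowhere to keep per-directory metadata such as an access time, which is the kind of HDFS feature the speaker says may still call for HDFS.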

So for example, some of this increased round trip latency comes as a result of no longer having any single point of failure or any single bottleneck. So you can have arbitrarily many files in a single GCS bucket if you want, a single GCS directory, because it's all massively distributed on Google infrastructure, right? And as an implementation detail, that means there's an extra layer of indirection, which is why a round trip might take a little bit longer. In general, the same principles for optimizing your HDFS workloads apply to GCS as well. So using small files can hurt in terms of round trip latency. So avoiding lots of small reads, avoiding lots of sequential small metadata requests, using bulk reads of data whenever possible all still equally apply to GCS. So to visualize what it might look like to actually run this data transfer from your on-premise cluster into GCS, there are generally two categories. You can have push-based or pull-based transfer of your data.

The most straightforward way here is push-based, where you simply install the GCS connector on-premise and use the same public APIs to access GCS as any other client. So this is a very straightforward approach as long as you can modify your on-premise cluster and you have some spare capacity for running, say, DistCp MapReduce jobs. And this will just push everything over your public internet gateway into Google Cloud Storage. And in transit, GCS requires HTTPS. So everything is protected by that SSL layer. As an alternative, if you can't do that, you can set up a Cloud VPN gateway to bridge your on-premise network with a Compute Engine network. So Dataproc supports deploying your clusters into custom subnetworks that have all the IP address and firewall rules that you need to make your VPN work. So once you've bridged your networks, Dataproc clusters, being built on raw, native, compatible Hadoop clients, can, out of the box, interact with your on-premise cluster and pull data out through that VPN tunnel.
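As a rough sketch of the two transfer styles just described, the Python below only builds `hadoop distcp` command lines and prints them; it does not execute anything. The NameNode address, bucket name, and map counts are hypothetical placeholders. `-m` is DistCp's option for the maximum number of simultaneous copy tasks.

```python
import shlex


def distcp_command(source, destination, max_maps=None):
    """Build a `hadoop distcp` argv list; nothing is executed here."""
    argv = ["hadoop", "distcp"]
    if max_maps is not None:
        argv += ["-m", str(max_maps)]  # cap on simultaneous copy tasks
    argv += [source, destination]
    return argv


# Hypothetical names: substitute your own NameNode address and bucket.
ON_PREM_HDFS = "hdfs://onprem-namenode.example.com:8020/warehouse/events"
GCS_TARGET = "gs://example-migration-bucket/warehouse/events"

# Push: run on the on-premise cluster, which has the GCS connector installed.
push = distcp_command(ON_PREM_HDFS, GCS_TARGET, max_maps=20)

# Pull: run on a Dataproc cluster that reaches the NameNode through the VPN.
pull = distcp_command(ON_PREM_HDFS, GCS_TARGET, max_maps=100)

print(shlex.join(push))
print(shlex.join(pull))
```

The source and destination are identical in both cases. What changes is where the job runs and how much copy parallelism that cluster can supply, which is why a pull from a resizable Dataproc cluster gives you a knob on transfer rate.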

So you simply specify your hdfs:// on-premise NameNode path. And everything just works. And one benefit of running Dataproc clusters like this to pull the data out of on-premise is that you can use Dataproc's cluster sizing features in order to adjust your transfer rate as needed. And this also doesn't have to be putting data into GCS. You might be performing these transfers into HDFS on Dataproc instead. So that brings us to some overarching cloud architecture concepts. Like I mentioned before, there's no one size fits all answer. So again, these are just some building blocks that you can assemble into your own migration strategy. But these are recurring patterns that are well utilized in a variety of places. So to begin with, I want to think back again to that on-premise cluster. And traditionally, Hadoop places a lot of emphasis on data locality and bin packing and things like this. And it makes a lot of sense in that it's common to come in and plan out a Hadoop cluster based on storage size.

So you'll say I need like an x petabyte cluster, and that's a certain number of disks. And you can adjust your CPU and RAM a little bit. But in large part, you're still driven by saying, you need X petabytes of storage. And you just have to plan to then make do with what compute comes with it, for the most part. And ultimately, this is a leaky abstraction. This is letting storage implementation details start to dictate application layer decisions of how your workloads are shaped and how your workloads are sized. So on Google Cloud Platform, you don't have to worry about bin packing. That's Google's problem to worry about. And instead, you can separate out your storage and compute. And not only do you get your cost savings by only having to pay for what you're actually using and having independent scaling, this actually also simplifies your workflows, because you don't have to worry about shaping your workloads to the right size for your hardware anymore.

But rather you do it the other way around, pick the right hardware for your workflows. So this is a concept that actually unlocks a lot of other architectural concepts as well. And to put it in terms of numbers, imagine if you have a job that takes, say, 10,000 vCPU hours. On cloud, it costs the same whether you provision 20,000 vCPUs for half an hour or 100 vCPUs for 100 hours, if the math works out. So in that case, it costs the same, but the data is so much more valuable to you now rather than later. So you don't have to try to slip it into some spare capacity and let it run between the cracks of some existing infrastructure. You just burst into the cloud and get the job done immediately so that you can have the data as quickly and as valuably as possible. One nice feature of Compute Engine that Dataproc natively supports is preemptible VMs. So preemptible VMs are just like any other VM. They operate just like any other VM. The only difference is that Google reserves the right to reclaim these VMs, to preempt them, if other on demand or production workloads make it necessary.

And as a result, they only cost 20% of the normal cost of a VM, so a significant cost savings. And normally it might be difficult to manage these kinds of things, joining and leaving your cluster by yourself. But on Dataproc, that's Dataproc's job to make sure that it replenishes these VMs if they leave the cluster and make sure that they smoothly join and leave the cluster over the course of your workload. So that means that it's basically just fire and forget. You set your number of preemptible VMs on your Dataproc cluster, let the job run, and you get better price performance. Another Compute Engine feature that Dataproc natively supports is custom VM types, which Urs talked about in today's keynote as well. These, again, go back to the notion that you don't have to worry about bin packing. That's Google's problem. So you don't have to worry about stranding resources because of an oddly sized workload or having to underprovision for a given workload, right? Pick exactly the right mix of CPU and RAM that you need, and Dataproc will work it out.
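The pricing arithmetic from the last few minutes of the talk can be checked with a few lines of Python. This is a simplified sketch: it expresses cost in units of "one on-demand vCPU hour" instead of using real prices, takes the "20% of the normal cost" figure for preemptible VMs from the talk, and ignores billing minimums, preemption-related reruns, and the assumption that the job scales linearly. The function names are hypothetical.

```python
from fractions import Fraction

# Preemptible price relative to on-demand, using the 20% figure from the talk.
PREEMPTIBLE_PRICE_RATIO = Fraction(1, 5)


def vcpu_hours(vcpus, hours):
    """Total compute consumed, in vCPU hours (exact arithmetic)."""
    return Fraction(vcpus) * Fraction(hours)


def relative_cost(total_vcpu_hours, preemptible_share):
    """Cost in units of one on-demand vCPU hour.

    preemptible_share is the fraction of the work done on preemptible VMs.
    """
    share = Fraction(preemptible_share)
    on_demand = total_vcpu_hours * (1 - share)
    preemptible = total_vcpu_hours * share * PREEMPTIBLE_PRICE_RATIO
    return on_demand + preemptible


burst = vcpu_hours(20_000, Fraction(1, 2))  # 20,000 vCPUs for half an hour
slow = vcpu_hours(100, 100)                 # 100 vCPUs for 100 hours

print(burst == slow)                          # same vCPU hours, same cost
print(relative_cost(burst, 0))                # all on-demand
print(relative_cost(burst, Fraction(1, 2)))   # half the work on preemptibles
```

Under these assumptions both shapes consume 10,000 vCPU hours, and running half of the work on preemptible VMs brings the bill from 10,000 units down to 6,000.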

Dataproc will dynamically calculate the right Hadoop and HDFS configs based on the actual requested amount of CPU and RAM. One of the bigger architectural concepts that is enabled by that disaggregation of storage and compute– and I mentioned this a little bit before too, is that you don't want to think about single clusters as your entire universe. You don't want your Hadoop cluster to be your hermetically sealed data analytics universe. Instead, the broader platform as a whole is your data analytics platform. And so in this concept, you'll separate out different workloads into different clusters. That way this eliminates a lot of this multiplicative complexity where you try to bin pack a lot of different things into the same place. So you can significantly simplify things that way. And what's more, you can independently optimize these. So for example, you might find that some workloads, like say your business intelligence or your online analytics, prefers paying a little bit more for very high performance and very predictable performance.

So you might opt for something like HDFS on SSD and cache a lot of your data assets inside of that HDFS for fast performance. And on the other hand, you might have some overnight ETL pipelines, batch jobs, that you just want to optimize for cost. So in that case, they can be completely differently shaped. You might mix in a lot of preemptible VMs to sacrifice a little bit of that predictability of variable job run time in order to get much better cost savings. In the same vein, not only can you chop up clusters based on the shape of your workloads, but also chop them up in terms of your lifecycle. So what this means is on the far end of the spectrum, if you have ephemeral clusters, this really gets rid of a lot of the headaches that come from dealing with high complexity, long lived clusters, so single use clusters that are made possible by Google's minute granularity billing and fast deployment time. And this makes it trivially easy to ensure that your hardware is able to evolve in lockstep with the evolution of your workloads.
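The ephemeral, single-use cluster pattern can be sketched as "create, run, delete". The Python below is a minimal illustration, not a recommended production wrapper: the helper names, cluster name, region, JAR path, and bucket are hypothetical, and by default it only prints the commands. The `gcloud dataproc` subcommands and the `--region`, `--num-workers`, `--cluster`, `--jar`, and `--quiet` flags are real, but check them against your installed gcloud version.

```python
import shlex
import subprocess


def ephemeral_run_commands(cluster, region, jar, job_args, num_workers=5):
    """Return the create, submit, and delete commands for one single-use cluster."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              "--region", region, "--num-workers", str(num_workers)]
    submit = ["gcloud", "dataproc", "jobs", "submit", "hadoop",
              "--cluster", cluster, "--region", region,
              "--jar", jar, "--", *job_args]
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              "--region", region, "--quiet"]
    return [create, submit, delete]


def run_ephemeral(commands, dry_run=True):
    """Print the commands, or run them with the delete guaranteed by finally."""
    create, submit, delete = commands
    if dry_run:
        for argv in commands:
            print(shlex.join(argv))
        return
    subprocess.run(create, check=True)
    try:
        subprocess.run(submit, check=True)
    finally:
        # Always tear the cluster down, even if the job fails.
        subprocess.run(delete, check=True)


commands = ephemeral_run_commands(
    cluster="nightly-etl-20170310",        # hypothetical name
    region="us-central1",                  # the single value to flip
    jar="gs://example-bucket/jobs/etl.jar",
    job_args=["gs://example-bucket/in", "gs://example-bucket/out"],
)
run_ephemeral(commands, dry_run=True)
```

Since the cluster exists only for the duration of the job, the region, machine shape, and image version are ordinary parameters of each run instead of properties of long-lived infrastructure.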

This also means things like even migrating to a whole different region are as trivial as flipping one configuration value in your ongoing ephemeral cluster workload. In the middle ground, there's also this notion of using semi-long lived clusters. So in this case, instead of using a single very large and forever lived cluster, chop it up into a small handful of semi-long lived clusters where you can still preserve much of the look and feel of an always on cluster. You always have clusters available there. But instead, you're sort of load balancing your jobs into individual clusters inside a group, transparently. And by doing so, what you can do is, for example, if you want to canary a brand new version, you get a nice hermetic environment for performing this A/B testing. You can canary one entire cluster unto itself without any risk of leaking over and breaking your other production clusters. But you can canary that with a brand new version, naturally let some amount of your production workloads go to that cluster.
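One way to picture the group of semi-long-lived clusters with a canary is a small routing function that sends a fixed share of jobs to the canary cluster and spreads the rest over the stable ones. This is a hypothetical sketch in Python, not a Dataproc feature: the function, cluster names, and the 5% canary share are all assumptions chosen for illustration.

```python
import random


def pick_cluster(stable_clusters, canary_cluster, canary_fraction, rng=random):
    """Choose a target cluster for one job.

    A small share of jobs goes to the canary; the rest is spread
    uniformly over the stable clusters.
    """
    if canary_cluster is not None and rng.random() < canary_fraction:
        return canary_cluster
    return rng.choice(stable_clusters)


rng = random.Random(17)                   # seeded so runs are repeatable
stable = ["etl-a", "etl-b", "etl-c"]      # hypothetical cluster names
counts = {}
for _ in range(10_000):
    target = pick_cluster(stable, "etl-canary", 0.05, rng)
    counts[target] = counts.get(target, 0) + 1

print(counts)
```

Promoting the new version then amounts to recreating the stable clusters one at a time on the canary's image, and scaling out amounts to adding another name to the list.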

And if it's all green, it's very easy to now upgrade. You're not trying to change the tires on a moving vehicle anymore. You get to do this in chunks one cluster at a time. Similarly, you don't have scaling bottlenecks anymore, or single points of failure, because now you can keep adding more clusters to your group in order to increase your scale without worrying about the single bottleneck of scaling. So that brings us to managing your clusters and jobs. So I want to take a look at what it actually looks like, what it actually feels like, to manage your clusters and jobs on Dataproc. And I have slides here for posterity. But we will do this live in a demo so you can get a better feel of what it actually looks like. So I mentioned before, a one liner on the command line. And this one is a little bit more complex. The basic command that you'll run is, whoops, gcloud dataproc clusters create. Give it a cluster name. And actually you can hit Enter right then if you want. In my case, I'm going to call this dhuo-demo-1.

I'm going to give it five workers. I'm also going to turn on this property for enabling Stackdriver monitoring. So you can see how we can build some custom dashboards. Also I'm going to attach a tag. You'll see how that plays into everything. And importantly I've also put a time command here. So you can keep me honest. We can see how long this cluster actually takes to deploy. So these are all clients that are based on Dataproc's web-based REST interfaces. That means that you get ubiquitous and consistent access from everywhere. Excitingly, one of my colleagues is also concurrently using this demo project to do another live demo next door. And while that is doubly tempting for the demo fates, there's no reason to be worried, precisely because we're using different clusters. It doesn't matter if we're sharing the same resources. It's completely safe. We can do anything we want. We won't break each other. As I mentioned, I kicked off that cluster from the command line.
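The transcript does not capture the exact command typed on stage, so the following is a reconstruction under assumptions. The Python helper below is hypothetical and only assembles and prints a `gcloud dataproc clusters create` command with five workers, a tag, and a Stackdriver monitoring property. The property name `dataproc:dataproc.monitoring.stackdriver.enable` and the tag value `dhuo-live-demo` (mentioned later in the talk) are best guesses at what was used, and current gcloud versions also expect a region to be set.

```python
import shlex


def create_cluster_command(name, num_workers, tags=(), properties=None):
    """Assemble a `gcloud dataproc clusters create` argv list (not executed)."""
    argv = ["gcloud", "dataproc", "clusters", "create", name,
            "--num-workers", str(num_workers)]
    if tags:
        argv += ["--tags", ",".join(tags)]
    if properties:
        # --properties takes comma-separated prefix:key=value pairs.
        argv += ["--properties",
                 ",".join(f"{key}={value}" for key, value in properties.items())]
    return argv


cmd = create_cluster_command(
    "dhuo-demo-1",
    num_workers=5,
    tags=["dhuo-live-demo"],
    properties={"dataproc:dataproc.monitoring.stackdriver.enable": "true"},
)

# Prefixing with the shell's `time` reports how long the create call takes.
print("time " + shlex.join(cmd))
```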

And sure enough, it's deploying in this web interface. So these are different clients all talking to the same REST interface. And it also means that third party vendors are able to integrate with these same open source REST APIs using auto generated libraries. So we have native support in things like Apache Airflow, in Luigi, in Python mrjob. So in all these open source projects, you can get Dataproc support out of the box. I'll also show what it looks like to create a cluster entirely from the UI. You just select the dropdown, give it a name, and hit Enter. It's that easy. So there, I created another cluster. In fact, let's create another one, for fun. In this case, I mentioned you might want to use the latest and greatest versions of Hadoop. And what might that entail? Well we always provide a preview image version, which has the latest and greatest software that's going to be targeted for the next release. So right now Hadoop 2.8 hasn't actually been released yet. But we build against snapshots of Hadoop 2.8.

So we can select this preview image. And in those four clicks, I think that was four clicks, we will have a Hadoop 2.8 snapshot. Not official Hadoop 2.8, but a snapshot cluster that lets us be ready for that next release. So going back to here, our time command says that took one minute and eight seconds, so actually less than 90 seconds. In 68 seconds we've now deployed a cluster, and it's ready to go. The second that it says that it's done, it is indeed ready. And I'm going to go ahead and run an SSH tunnel. So it sounds tricky, but it's really just another one liner. I'm going to do dynamic port forwarding on port 12345. -N, so it's not taking standard input. And I'm going to start up a new browser, point it at that cluster. And let's see, I have to turn this off. So sure enough, this cluster's ready. This is your standard familiar YARN UI. So we're not locking you out of any of your familiar Hadoop idioms by adding Dataproc. So we have our five nodes.

This cluster is ready to go. I mentioned being able to submit from a raw client, so let's go ahead and SSH into that, demo-1. So on this cluster, I have access to all my low level clients. Let's go ahead and run hadoop jar lib/hadoop-mapreduce. It's examples.jar. And we can teragen some data. And like I mentioned before also, drop in replacement for HDFS. Instead of specifying an HDFS path, I just say gs:// and everything just works magically. teragen, let's call it tmp1. So this is not using Dataproc's Jobs API. So this is just using the raw Hadoop client. And here's your familiar driver output. It will tell you the status of your MapReduce job as it goes and everything. And sure enough, as this is running, we'll see it pop up in YARN. So we can see the status of this. And in this case, there are no application tags. This was run raw, not through Dataproc's Jobs API. But at least we can see, it just works against GCS without any extra fancy configuration.

Alternatively, you can use Dataproc's gcloud command. So jobs submit. Specify your cluster. Specify your JAR file. And I just have the Hadoop MapReduce examples JAR local here. I'll just open it and [INAUDIBLE]. So [INAUDIBLE] that JAR. Standard Hadoop JAR file, it's all compatible. I don't have to recompile it to run on Dataproc or anything like that. Pull it out of any Hadoop distro. And I have this local. In this terminal, I'm not SSHed into the cluster. And if I want, I can completely seal off this cluster from the outside world too. We recently just launched the ability to deploy private IP only clusters. So you can create clusters that don't have any public IP addresses, and you can still interact with them through all your Dataproc Jobs APIs. And there's still full access to all GCP services. So let's go ahead and run this. And I'm intentionally running this without any arguments here just so you can see. There's no SSH session here, but we funnel out the driver outputs.

So I made a mistake. I didn't give any arguments. And so the job will fail and print out this help text. It will tell me, oh, I need to specify an argument. So even though I'm not SSHed in, Dataproc's job submission client preserves the look and feel of interacting with these drivers so that you can up arrow and you can have a fast development loop to react to any errors that are coming from your job. So let's go ahead and also specify teragen demo/teragen_temp2. In this case, we are submitting through the Dataproc Jobs API. So let's see. I want to show these side by side, but my screen's a little small. In any case, this look and feel again preserves what it feels like to run from your standard Hadoop client. So if you have something that likes to stash away the driver output as you're running, it's easy to switch over to the Dataproc jobs client without changing the way you think about it. And I just did Control-C there. Normally I'd be out of luck if I did Control-C after running a Hadoop JAR.
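Since the two submission paths could not be shown side by side on stage, here is a hedged side-by-side sketch in Python. It only builds and prints the command lines. The JAR path, bucket, and row count are hypothetical placeholders because the transcript does not give them; `hadoop jar`, `gcloud dataproc jobs submit hadoop`, and `gcloud dataproc jobs wait` are real commands, and `JOB_ID` stands in for the ID that a submission prints.

```python
import shlex

# Hypothetical placeholders: the transcript does not give the full JAR path,
# the bucket name, or the row count used on stage.
EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"
OUTPUT = "gs://example-bucket/demo/teragen_temp2"
JOB_ARGS = ["teragen", "1000000", OUTPUT]

# 1) Raw Hadoop client, run from an SSH session on the cluster's master node.
raw_client = ["hadoop", "jar", EXAMPLES_JAR, *JOB_ARGS]

# 2) Dataproc Jobs API, run from any machine with gcloud configured.
#    Everything after "--" is passed to the job unchanged.
jobs_api = ["gcloud", "dataproc", "jobs", "submit", "hadoop",
            "--cluster", "dhuo-demo-1", "--jar", EXAMPLES_JAR,
            "--", *JOB_ARGS]


def wait_command(job_id):
    """Re-attach to the driver output of an already submitted job."""
    return ["gcloud", "dataproc", "jobs", "wait", job_id]


for argv in (raw_client, jobs_api, wait_command("JOB_ID")):
    print(shlex.join(argv))
```

The job arguments are the same list in both forms, which is the point of the demo: the JAR and its arguments do not change, only the client that launches them. The `wait` form is what makes a Control-C on the submitting terminal harmless, as the speaker goes on to show.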

But with Dataproc jobs, you can rejoin waiting for any job. So I Control-C it. But I just have to say wait for that job, and it just rejoins that driver stream right where I left off. I could have thrown away this laptop, opened up a brand new laptop, and continued without skipping a beat. And this also means that that job– so here was that job that we intentionally failed. It gave us help text. And here is that job that succeeded, that we just ran. So ubiquitous access from different job clients. You can see the job output here. So that makes it very easy. You can check in on your jobs with your phone if you want. It's very convenient. Now, let's see, so going over this Stackdriver monitoring. We also enabled Stackdriver monitoring. And one concept I talked about is using this to provide overarching monitoring of multiple different Dataproc clusters in one place. So let's create a group here. We tagged our cluster with– I believe it was dhuo-live-demo. So let's call it dhuo-live-demo here.
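[Editor's note] Returning to the rejoin step described at the start of this passage: a minimal sketch of re-attaching to a job's driver output, assuming the `gcloud dataproc jobs wait` subcommand. The job ID shown is a hypothetical placeholder, and the helper names are assumptions made for this sketch. The command is only built here; `rejoin` would run it, and is not called.

```python
import subprocess


def build_wait_command(job_id, region=None):
    """Build the argv that re-attaches to a submitted job's driver output."""
    if not job_id:
        raise ValueError("job_id must be a non-empty string")
    argv = ["gcloud", "dataproc", "jobs", "wait", job_id]
    if region:
        argv.append(f"--region={region}")
    return argv


def rejoin(job_id, region=None):
    """Run the wait command and return gcloud's exit code (requires gcloud installed)."""
    return subprocess.run(build_wait_command(job_id, region)).returncode


# Hypothetical job ID; a real one is printed when the job is submitted.
wait_cmd = build_wait_command("example-job-id")
print(" ".join(wait_cmd))
```

Because the job is tracked by the service rather than by the terminal that submitted it, the same command works from any machine with access to the project.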

And you can filter it by tag. So let's just say that if the tag exists, this constitutes a group. So this way I can manage multiple Dataproc clusters with the same tag and put them in the same monitoring group. So let's see what it takes to create a brand new monitoring dashboard from scratch and see if we can do it live. So I just created a group. Now all I have to do is go in, create a brand new dashboard. It's untitled. I'll call it dhuo demo, and start dragging and dropping to add these charts. We'll see, there are built in host side monitoring values. And so let's see, network traffic, let's make it aggregate. So we use the stack area. And we can filter by group. So let's choose our dhuo-live-demo group. Great, so now we have a chart. Data's going to start coming in. Let's add another chart. We'll add a couple of these. Let's see, outbound is good, filter. So it's that easy. All I have to do, point and click, and I have myself a nice monitoring dashboard.

Aside from using Stackdriver in itself, you can also just click through to your cluster– oh, that was the wrong cluster– click through to your cluster, and there's already some amount of built-in stuff. So this was our teragen job running. So you can see CPU utilization, network bytes, and it pops up. So that we have something more interesting to monitor, let's kick off something a little bit more beefy. So a bit of a throwback here. For old times' sake, let's sort a bunch of data. So I think I have some data here. And let's put the output in terasort. So let me submit this job, and we're going to sort a bunch of data and see what monitoring looks like. We can also take a peek at logs. So we're well integrated with Stackdriver Logging. And what you can do here is, going through Cloud Dataproc cluster, you can search for dhuo-demo-1. And sure enough, we have one cluster there. And as that job kicks off, we have these different categories of logs. We have daemon logs, so you can see about the health of your overall cluster.

There's also user logs. So user logs will be associated with each of the jobs. And this is all funneled into Stackdriver so that you can search here. You can actually create log-based metrics too. So that's something that we recently released as well with a bit of a refactor of our logging. So you can set up monitoring charts and alerts based on what kind of logs are coming through. So here we have the job that we just started, that we just kicked off. So you can see how it's kicking off a lot of individual map tasks, all your usual stuff. Usually you don't have to dig through these unless something goes wrong. Now taking a look, this is kind of taking a long time. I'm impatient. It took like a minute to get a third of the way through. So what does that mean? Is that 30 minutes to complete this job? And yeah, we don't have that kind of time. So, we're impatient. We want to make it run fast. What can we do? What if we make this thing have 20 times as many resources?

How hard is it to scale that up? Well, that was about four clicks. I lost count. But about four clicks later, we resized our cluster up to a hundred nodes instead of just five nodes. It's not quite done yet. It's not 100% instantaneous. But the second we clicked that, we saw these GCE VMs start to pop up. So you see these green checked boxes. They're starting to come in. They still have a little bit of initialization to do. But as they come in, we're going to see that we have this running job. And in YARN we're going to see this active node count start going up. So even in the middle of a job, we felt that it was slow. So we can immediately do a couple of clicks in the console and immediately start seeing new nodes come in. Well, we'll take another look there. We're also going to– let's go ahead and add a few other monitoring charts. So I use some host side monitoring. What we did, at cluster deployment time, was install the Stackdriver agent, which reports guest-reported utilization metrics.
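[Editor's note] The resize in the demo was done with a few clicks in the console. A command-line equivalent can be sketched as below, assuming the `gcloud dataproc clusters update` subcommand with its `--num-workers` flag. The cluster name is a hypothetical placeholder and the helper names are assumptions made for this sketch; the command is built, not executed.

```python
def build_resize_command(cluster, num_workers, region=None):
    """Build the argv that changes a cluster's primary worker count."""
    if not isinstance(num_workers, int) or num_workers < 1:
        raise ValueError("num_workers must be a positive integer")
    argv = [
        "gcloud", "dataproc", "clusters", "update", cluster,
        f"--num-workers={num_workers}",
    ]
    if region:
        argv.append(f"--region={region}")
    return argv


def scale_factor(old_workers, new_workers):
    """Return how many times larger the resized cluster is."""
    return new_workers / old_workers


# Hypothetical cluster name; 5 and 100 are the node counts mentioned in the talk.
resize_cmd = build_resize_command("demo-cluster", 100)
factor = scale_factor(5, 100)

print(" ".join(resize_cmd))
print(f"{factor:g}x the original worker count")
```

Going from five workers to a hundred is the 20x increase in resources mentioned just before the resize.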

So this means we can even look at fork rate– sure, why not? We can look at CPU, according to the guest, so that you can see if there's any kind of discrepancy. And this one, I guess we'll split it out by line instead of stacked. So that way we can see sort of the variability across different nodes. And another thing is that since we created this dashboard based on this tag that's going to apply to all the VMs, as we add more VMs, it will automatically become part of this chart. As we have more clusters, it will also become part of this chart, if we're targeting those other clusters with the same group tag. So let's see, what else is fun– memory used. Memory used is likely something that you'd want to set maybe an alert on or something like this to see– or when debugging a job, this is where you can potentially look at some skew. So just by hovering over it, you can see what kind of skew you're dealing with. So some workers might be using a lot of memory and OOMing.
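[Editor's note] The kind of memory skew visible by hovering over the chart can also be checked numerically. The sketch below is an illustration only: the per-worker readings are made-up sample values, not data from the demo, and comparing the heaviest worker against the median is one simple choice of skew measure rather than anything Stackdriver computes for you.

```python
from statistics import median


def memory_skew(used_by_worker):
    """Return (worker, ratio): the heaviest worker and its usage relative to the median."""
    if not used_by_worker:
        raise ValueError("need at least one worker reading")
    typical = median(used_by_worker.values())
    heaviest = max(used_by_worker, key=used_by_worker.get)
    # A ratio near 1.0 means balanced work; a large ratio points at a hot partition.
    ratio = used_by_worker[heaviest] / typical if typical else float("inf")
    return heaviest, ratio


# Made-up readings in GiB for four hypothetical workers.
sample = {"w-0": 4, "w-1": 4, "w-2": 4, "w-3": 12}
worker, ratio = memory_skew(sample)
print(f"{worker} uses {ratio:.1f}x the median memory")
```

A worker sitting far above the median is a hint that its share of the data is too large and that the input may need to be partitioned differently.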

And you'll find that you need to chop up your data differently. So this is also a development tool for your pipelines. I should have come back here earlier. We missed them coming in. But sure enough, we have a hundred active nodes. So it was that quick. And actually it comes in faster than it takes to deploy a cluster. We resize it up. In less than a minute, we should have our extra capacity. And also sure enough, this whole thing sped up. Should have seen it in action, but the second we added all those nodes, this thing sped up. And suddenly instead of waiting 30 minutes, it's done within the span of basically our demo. So let's see, once you're done– so right, I also created a Hadoop 2 cluster. So just for fun, let's take a peek at that. I think I called that 3, Hadoop 3, or demo-3. So here is the shiny new Hadoop 2.8 UI, subtly different from Hadoop 2.7. But if you click on About, you can see this is indeed built from a Hadoop 2.8 snapshot. So this cluster will be running Hadoop 2.8 with Spark 2.1.
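[Editor's note] Creating a canary cluster on the preview image can be sketched as below. This assumes the `gcloud dataproc clusters create` subcommand with its `--image-version`, `--num-workers`, and `--tags` flags, and assumes that `preview` is an accepted image version value, as it was around the time of this talk. The cluster name, tag, and worker count are hypothetical placeholders; the command is built, not executed.

```python
def build_create_command(cluster, num_workers, image_version=None, tags=(), region=None):
    """Build the argv that creates a Dataproc cluster."""
    argv = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--num-workers={num_workers}",
    ]
    if image_version:
        # Assumption: "preview" selects the preview image, as at the time of the talk.
        argv.append(f"--image-version={image_version}")
    if tags:
        # Tags let a monitoring group pick up every VM in the cluster.
        argv.append("--tags=" + ",".join(tags))
    if region:
        argv.append(f"--region={region}")
    return argv


# Hypothetical cluster name and tag.
create_cmd = build_create_command(
    "demo-canary", 5, image_version="preview", tags=["live-demo"]
)
print(" ".join(create_cmd))
```

Since such a cluster is disposable, a failed experiment on the preview image costs only the few minutes it takes to delete the cluster and create another.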

And in general, this is a great way to canary. If you're doing development, a fast development cycle, clusters are so throw away anyway. If something goes wrong, throw it away and you only lost a couple of minutes. So generally, you can use this preview image to stay abreast of all the developments in the open source ecosystem and have very agile development against the latest and greatest. So we can wait for that to finish. Once you're done with your clusters, it's just as easy to delete your clusters as it was to create them. You can click some checkboxes, click Delete, delete multiple clusters at once if you want. And we're just sorting data. We don't have to watch it finish. So, yeah, it's that easy. So that about summarizes it. Let's go back to the slides and we can recap what we saw. So again, to recap what we saw, Dataproc is built on these web-based REST APIs that allow consistent and ubiquitous access from a variety of clients, both from Google-provided tools like that gcloud command line and Cloud Console, as well as third parties for integration.

So there's a thriving ecosystem here of things like Airflow adding orchestration first class for Dataproc and Luigi and Python mrjob that all integrate with these web-based APIs. We also saw how job submission preserves the look and feel, whether you're using low level clients or gcloud dataproc jobs submit. But by submitting through the Jobs API, you get this value add of logging, of job tracking, of status tracking, of viewing your job driver output from the Cloud Console, and more to come as well. We also saw that it's trivially easy to update your cluster, even in the middle of a job. So if you underestimated how many resources your job would take and you want to get it done fast, no big deal. No harm done. Click a couple of buttons and you're good to go again. You can speed up that job. And finally we saw integration with Stackdriver, both on the logging side where you can see your user logs, you can see your daemon logs, and on the monitoring side where it's easy to just point and click in order to create customized monitoring dashboards in the same place across different GCP services, so that you can monitor everything, track everything, set up alerts all through Stackdriver.

So let's recap what it looks like to put together everything we learned, just to illustrate what this might look like at various stages of your migration. So to begin with, you might have some amount of archival data. You might be looking to free up space on-premise. So you're trying to archive some data to GCS. Maybe you're trying to set up disaster recovery, so backing up your data to GCS already. In such a case it becomes very natural, even if you're not forcing any kind of compute migration, to start adopting Dataproc ad-hoc. So oftentimes what happens is some kind of unexpected backfill that you otherwise wouldn't be able to accommodate on-premise, no big deal. You can run it and you can run it quickly on Dataproc bursting into cloud. New development– why constrain yourself to working on your development cycle according to whatever scraps are available on-premise when instead you can sort of burst into cloud and feel free to break things as much as you want and not worry about production as you're doing it?

In the interim, you might find yourself with somewhat of a hybrid cloud setup. So by using a VPN tunnel to bridge your on-premise network with a GCE network, Dataproc natively supports deploying onto these same subnetworks so that they can talk natively to your on-premise clusters. And they might be pulling some of the data. They might be helping mediate some of your data transfer. They might also be doing ETL or analytics directly against your on-premise cluster. So either way they're augmenting your on-premise cluster. And in the meantime on the other side, you'll find that you still have some subset of things as more of your data starts making use of GCP's storage options. You'll find that you can burst your development pipeline still. So same idea as purely bursting into the cloud, but these will be augmenting your overall platform, even with Dataproc clusters that aren't directly accessing your on-premise cluster. And finally we see a cloud native architecture where everything is based on cloud storage.

And again, instead of any Hadoop cluster being your universe of data analytics tooling, GCP as a whole is your data platform. And you'll be able to unify all of your logging and monitoring across different tools in one place using Stackdriver. You might find third party tools to be useful for performing your scheduling and orchestration and productionisation of your Dataproc jobs. And you'll likely find some mixture of completely ephemeral jobs that are very hands off, easy to manage, while others might not have come along so easily but are still amenable to semi-long lived clusters. And again, with semi-long lived clusters, you still eliminate the headaches of worrying about this two-year-old cluster where you're worried that a gust of wind might blow it over and you don't want to touch it. Here, you can canary brand new versions of software. And if a new guy comes in and breaks an entire production cluster, no big deal. You throw it away, create a new one, let it join the group.

And it's that easy to manage your infrastructure. So that pretty much concludes today's session. A couple of related sessions: this afternoon, there's going to be BD205, which is a higher-level view of Dataproc with some neat demos, less migration oriented. Machine learning on Spark MLlib is going to be tomorrow morning. And IO321 was concurrent with this one, but you can catch it on recorded sessions if you have the time for a deeper infrastructure level dive into GCS performance. Just some doc links– you can find a lot of good reading materials under You can also see those initialization actions I talked about hosted on GitHub and maintained by the Google team, along with a lot of ecosystem contributions. And various sources of help– naturally you'll have your Cloud Support contracts where you can contact high level escalations of Cloud Support troubleshooting. There's also a very active StackOverflow community. And of course under our normal documentation, you can find a lot of helpful links.

So thanks. [MUSIC PLAYING]



For those who use popular open source data processing tools like Apache Spark and Hadoop, there can be several steps involved before being able to really focus on the data itself: creating a cluster to use the open source tools, finding a software package to easily install and manage the tools, then finding people to create jobs and applications and to operate, maintain and scale your clusters. In this video, you’ll learn how you can use Google Cloud’s managed Spark and Hadoop service, Google Cloud Dataproc, to take advantage of your existing investments in Spark and Hadoop. You’ll have the opportunity to see how easily your existing data and code can be migrated to Google Cloud Platform (GCP) and how within Cloud Dataproc clusters can be right-sized, run ephemerally and cleanly separated to maximize your invested resources.

