Google Cloud NEXT '17 - News and Updates

Apache Beam: From the Dataflow SDK to the Apache Big Data Ecosystem (Google Cloud Next ’17)

(Video Transcript)
[MUSIC PLAYING] FRANCES PERRY: So today we're going to be talking about Apache Beam, which is an open source project that focuses on batch and streaming big data processing. Now there's a technical talk on Apache Beam later today. This talk is really about the growth of the Apache Beam community. And in particular, this is the talk about a journey– how we went from being a Google-owned project into an open, community-driven project under the umbrella of the Apache Software Foundation. Now this transition drastically changed the way we work, from the day-to-day tooling that we were using to the assumptions that we can make in our code. And along with the new community that we're building, it also drastically increased the scope of the things that the project can accomplish. So with us today we have four members of the Beam community to talk about this journey. My name is Frances Perry. I've been an engineer at Google for a number of years, working on the technology that led into Beam.

And I'm also on the Project Management Committee for Beam now. Tyler Akidau is an engineer at Google who primarily focuses on our internal real-time and streaming data processing, internally and in the cloud, and he's also a PMC member. Thomas Groh is a Google engineer and a committer on Beam. And he works right in the core of the project, building sort of the heart of the Apache Beam abstractions. And finally, Jesse Anderson is the managing director of the Big Data Institute and also a committer on Apache Beam. So we're going to divide this journey and the story of this journey up across the four of us. So Tyler will start by introducing the history of this technology at Google and why Google was interested in donating this product and this code base to the Apache Software Foundation. Next, Thomas is going to talk about the transition itself– how the code had to adapt to this move and also how the people and the engineers had to adapt to this move. Jesse will talk about what attracted him to the Apache Beam project and how he got involved in the growing community.

And then finally, I'll be back to wrap this up with a discussion of what's next for the Beam community and how you can get involved. So before we dive into this, though, I'd love to get a rough feel for the background of the people in this room. So very quickly, could you raise your hand if you're already familiar with the Apache Software Foundation? Awesome. How about the current big data ecosystem– so Spark, Flink, Hadoop– some? OK. And Google Cloud Dataflow– great. And of those of you interested in learning more about Beam, who's interested in becoming a user of Beam, either on GCP or elsewhere? Couple. And who's interested in contributing to the Beam project and joining us? Excellent. I'm thrilled to see that. So I hope that as we talk through some of this today, you'll really get a feel for how important the Beam community is for this project. So with that, we're going to leave it to Tyler to talk about some of the initial history. TYLER AKIDAU: Thank you.

All right. So as Frances said, I'm going to talk through a little bit of the history of Apache Beam from Google's perspective. Specifically, I'm going to give a brief history of four projects that led up to the creation of all the technology that's in Beam. And I'm also going to talk through Google's motivation for being a part of the creation of the Beam project itself. So first up, a brief history of data processing at Google– our journey with large scale data processing began with MapReduce back in 2003. Before MapReduce, anyone within Google, or outside it really, who wanted to perform large scale data processing had to create a custom system. This custom system consisted of two things. It consisted of the data processing logic– so the actual code that described the type of data processing you wanted to perform– as well as a complicated set of distributed systems logic to give you the scalability and fault tolerance over unreliable machines that you needed.

And this was a real hassle, because everybody was constantly rebuilding these custom systems and dealing with both their data processing logic and their distributed systems logic. And so what the folks who created MapReduce realized was, if we could split those two apart, it would make lives a lot easier. And so that's what MapReduce did. It provided a simple API for describing your data processing logic, coupled with a scalable, fault-tolerant execution engine for running it. And this was very successful at Google. And we thought it was so great that we really wanted to share it with the world. So we wrote a paper. As MapReduce continued to be successful within Google, some common patterns evolved. People wanted to solve more complex data processing problems than you could express with just a single MapReduce, so they would write large pipelines of MapReduces. You'd have these sequences and trees of MapReduces. And since they had all these MapReduces lined up that needed to be run, everybody started writing custom systems to orchestrate the execution of these MapReduces.
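The split Tyler describes can be sketched in plain Python. This is a toy illustration only: the function names here are invented, not Google's MapReduce API. The mapper and reducer are the user's data processing logic, while run_mapreduce stands in for the execution engine that, in the real system, handled distribution, retries, and fault tolerance across machines.

```python
from collections import defaultdict

def map_phase(record):
    # Data processing logic: emit (word, 1) for each word in a line.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Data processing logic: sum the counts for one word.
    yield (key, sum(values))

def run_mapreduce(records, mapper, reducer):
    # Stand-in for the execution engine. In the real system, this is
    # the part that handled scalability and unreliable machines.
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    results = {}
    for key, values in shuffled.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

counts = run_mapreduce(["the cat", "the dog"], map_phase, reduce_phase)
print(counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```

The point is the division of labor: a different run_mapreduce could distribute the work across machines, and the user's mapper and reducer would not change.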

And additionally, to make these run really well and perform really efficiently, they started hand-tuning the code. So they'd take this nice, clean logical code that they'd written, and they'd start pulling pieces of it out into other MapReduces and performing these hand optimizations that would obfuscate what the actual logic was supposed to be, but make the MapReduces run faster. So FlumeJava aimed to change that. And it did so in three ways. One is that it expanded the API for MapReduce and generalized it into something that could express an arbitrarily complex and large pipeline in a clean, clear manner. Additionally, it was able to take this clearly expressed pipeline, optimize it under the covers, and come up with an efficient and performant execution plan for actually running this thing on real hardware. And then additionally, it provided orchestration of these various MapReduces that were actually the physical representation of the code that users had written.

And as before, this worked out really well. It was very successful at Google, and we desperately wanted to share it with the outside world. So we wrote a paper. Around the same time, there was a class of problems that were being solved with MapReduce and Flume but weren't really being served well by them. These were the types of problems that dealt with unbounded data– so essentially, infinite streaming data sets that needed to be processed continuously– and that also typically really benefited from low latency results– so getting answers quickly. And so the MillWheel project was created to try to address those use cases. And what it brought to the table was a true streaming engine that could operate record by record. So as data flowed in, one record at a time, it could process it and give you your results. And in doing so, it was able to provide those low latency results that folks were wanting over unbounded data sets. But it also did so without compromising correctness.

And this was even in light of things like machine failures and out-of-order data. And as before– as the trend you've probably seen so far– we really wanted to share this with the rest of the world. So again, we wrote a paper. So finally, having built these three systems, we looked back at what we'd done and said, we've got these three systems that touch different aspects of the area. It would be really nice if we took all of the great ideas from those and combined them together into one system. And so that's what Cloud Dataflow was. And Cloud Dataflow really is two things at a high level. It's this combined, unified batch plus streaming programming model and SDKs for actually writing these pipelines, as well as a fully managed execution engine for running them. And as our history has shown, we would love to share this with the rest of the world. And this time we finally got it right: we shared the full system itself. And that's what Cloud Dataflow is.
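The unified batch plus streaming idea can be sketched in plain Python. This is an illustration of the concept, not the Beam or Dataflow API: the same counting logic works whether the input is a bounded list (batch) or elements arriving one at a time (streaming).

```python
import re

def count_words(lines):
    # One piece of logic, written once. Whether `lines` is a bounded
    # list (batch) or an unbounded stream is the engine's concern.
    counts = {}
    for line in lines:
        for word in re.findall(r"[a-z']+", line):
            counts[word] = counts.get(word, 0) + 1
        yield dict(counts)  # emit an updated result after each element

# Batch: bounded input, keep only the final result.
batch_result = list(count_words(["to be", "or not to be"]))[-1]

# Streaming: the same logic, with a usable result after every element.
stream = count_words(iter(["to be", "or not to be"]))
first_update = next(stream)

print(batch_result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(first_update)  # {'to': 1, 'be': 1}
```

Batch is just the case where the stream ends and you take the last answer; that observation is the heart of the unified model.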

So you can take this SDK and download it, write pipelines with it, ship them off to the cloud service, and they will be executed for you, which is fantastic. And just for good measure, we also wrote a paper about it. So that's the history of all these systems that led up to Beam. But where does Beam fit into the picture here? Well, like I said, Google Cloud Dataflow is really these two things. It's this programming model and SDKs for writing your pipelines and this managed service for executing them. So when Google, along with a few partners– Talend, Data Artisans, Cloudera, others– decided about a year ago to create Apache Beam, really what we did was we took the programming model and SDKs and donated them to the Apache Software Foundation. That's what became Apache Beam. And Google Cloud Dataflow itself remains the managed service for executing Apache Beam pipelines. And yes, just to be clear, Batch plus Streaming equals Beam. That's where the name came from. But what were the motivations behind Beam?

Well, there are really four that are worth touching on here. One is the technology. All along, we've loved this idea of sharing awesome technology with the rest of the world. And that's part of the joy of open source software: truly being able to share not just the technology, but everything that creates that technology. And that leads to the second, and perhaps most important, motivation: communities. Instead of just being this project that's built within Google with whatever resources Google is able to throw at it, with whatever ideas Googlers may have, now it's developed externally with the rest of the world, incorporating a diversity of ideas. We can basically enlarge the number of people that are actively working on it and really push forward the state of the art in ways that we never could otherwise. Just look at the amazing diversity within the Apache Hadoop ecosystem and what's happened there. Thirdly is the portability aspect. We really think that Beam has a chance to be the programmatic equivalent of what SQL is to declarative data processing.

So you can write your pipeline once in Beam. And then you can choose to run it wherever makes sense. You can choose to run it on Spark. Or you can run it on Flink, or you can run it on Cloud Dataflow. The core concepts are the same. And also, it's a chance for us to draw a line in the sand and not just give you a lowest common denominator of what you could write for a pipeline, but really say, this is the state of the art, this is what things should be, and then try to help ensure that runners everywhere move up to that high standard. And lastly, of course, there's a business reason for Google to participate in Beam. The more people we have writing Beam pipelines, the more people there are that might want to run them on Google Cloud. And our goal as Cloud Dataflow developers is to make Dataflow the absolute best place to run Beam pipelines. But then wearing our Beam hats, we just want to make sure that Beam is the absolute best way to write pipelines in general, no matter where you run them.
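That write-once, run-anywhere point can be sketched in plain Python. This is a toy illustration: direct_runner and sharded_runner are invented names, not Beam runners. The pipeline is written once, and each runner decides how to execute it.

```python
def pipeline(lines):
    # The user's logic, written once, independent of any engine.
    return [line.upper() for line in lines]

def direct_runner(pipeline_fn, data):
    # A simple local, in-process runner.
    return pipeline_fn(data)

def sharded_runner(pipeline_fn, data, shards=2):
    # A toy "distributed" runner: split the input, run the same
    # pipeline on each shard, then recombine the results.
    results = []
    for i in range(shards):
        results.extend(pipeline_fn(data[i::shards]))
    return sorted(results)

data = ["spark", "flink", "dataflow"]
print(direct_runner(pipeline, data))   # ['SPARK', 'FLINK', 'DATAFLOW']
print(sharded_runner(pipeline, data))  # ['DATAFLOW', 'FLINK', 'SPARK']
```

Both runners execute the same unmodified pipeline; in Beam, that is the role of the runner you pick for Spark, Flink, or Cloud Dataflow.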

So with that, I've talked about the history and the motivations behind Beam. So I'm going to hand off to Thomas now. And he's going to talk a little bit about the mechanics of bringing the Beam project to life. THOMAS GROH: Thank you. So like Tyler said, I'm going to be talking about how we took the Java SDK for Cloud Dataflow, and our development process, and moved all of this to the Beam project and the Apache Software Foundation– what that's changed about how we develop, and what we've actually had to implement to make that happen. So I'm going to start at the first Beam pull request. This is taking all of the SDK for Java for Dataflow and dropping it into the Beam repository. And as you may be able to tell, we came here with a bit of history– more than 1,500 commits. We're adding 165,000 [INAUDIBLE]. And this is all after we've taken a lot of effort to tease out a lot of the Dataflow-specific parts of the system.

But that's not all. We still need to integrate runners that are provided by Cloudera and Data Artisans for Spark and Flink. We still need to reorganize all of our directories, because we built in a lot of platform-specific assumptions when we thought we were always going to be running on GCP. And we need to just get Dataflow out of the name. It's everywhere. It's in all our packages. It's in all our namespaces. And we need to move that and put it in Beam. And these are all pretty significant backward-incompatible changes. So anyone who's jumping ship immediately is going to break. And while we're doing all of these changes and we're breaking all of our current users who are trying to jump ship and we're moving everything to Beam, we're also, as engineers, parachuting into this new distributed community. And working in a distributed community is very, very different from working in a local community. When you're working locally, you can wander 10 desks down, grab a coworker, grab a whiteboard, and just go at it until both of you agree on what you're trying to do.

When you're in a distributed community and the person is 10 time zones away, that's basically impossible. So we've had to develop a new kind of distributed process that includes this whole set of contributors and puts everyone on equal footing. And that's meant we put a lot of work into working on the mailing list first. If you're not working on the mailing list, you're not really working. It doesn't really count. And that's built a lot of process where we have documentation. We build designs, and we propose designs. And then months or years later, we can look back at those designs and we can say, well, why did we do this? What was the context? And we can pull all of that back out. But all of the stuff we dropped in and contributed has a lot of history behind it that isn't captured that way. And as a result, it's really difficult for us to say, well, why did we do this, because that's all from years and years of internal experience. So it's not really easy for us to share it.

So we've combated this by writing blog posts. We've taken things we've determined internally, like style guides and our design principles, and we brought those to the community and said, here, we have experience doing this. We think these are great. And then we've worked with the community to publish those design principles and style guides alongside some of the more mechanical parts of contribution, like here's how I push code, here's how I think about the code I'm writing. But now we've got a mailing list and a community and a repository. And the repository's filled with code. But all that code is really mostly still just the Dataflow SDK for Java. And the vision of Beam is that we have a single model. And that model works across languages. And that model is extensible. And it's a platform for other people to build on. But when we have just the Dataflow SDK, that's not so much the case. So when we want to say, well, we've got this Beam model, then we need to really codify, well, what is the Beam model really?

And we've done this by building a couple of new things like the Runner API and the Fn API, which are kind of an extensible core which you can build new SDKs on and which you can build new runners on. And they're the extension points for what Beam as a project is. It's not just a single implementation of our programming model. It's the model. And all of these implementations build on top of that. And then we've got a vision. And we've got our code base. And we've got all these features that we're developing. And we've got a bunch of contributors who've managed to start building on top of this. But beyond that, when you've got a repository and all these cool ideas in it, what you've really got is a bunch of text files in a repository. And they might be impressive text files, but they're still just text files. So we've built the infrastructure to enable these contributors to go write tests, build their code, run their code, and try to make sure the community gets access to everything at every point in this process.

Designs go on the mailing list. And the community reviews them. Code goes on GitHub, and the community reviews it. And then we've also brought along the culture of testing and making sure we know what our code is doing and what's happening. And then because Beam is this thing that we don't just want to run, but we want it to run well and we want it to get better over time, we're in the process right now of building benchmarking tools and performance testing tools, again, for the entire community. So it's, again, not one implementation, but all of us. And with this infrastructure and this extensible platform, we've seen a lot of contribution from the community and new people getting involved and adding things like I/O connectors, adding new runners, adding new libraries. And the runner authors have been really valuable, because runners are doing a lot of development by themselves. They're saying, we've got this cool new idea. We're going to bring it in.

And then they come to us on the mailing list. And they're like, we're trying to do this. Why don't we put it in the model? And what that means is that we're all getting better together– not just, individually, we've got this cool new thing and it only works on Spark, or it only works on Flink, or it only works on Dataflow. We're saying, build it once. Run it anywhere. It's just going to get better over time. And then we've also got a diverse community who have their own perspectives. They see how Beam is used, how they want to use Beam, and they bring it back. And they say, well, I want to contribute. I want to do the thing. And it's very exciting to see new people come in and say, I've got this great idea. I'm trying to get this new thing to work here. I want to work with you and develop that. And Jesse has been with us for a pretty long time. And he's one of these community members who came in and said, well, here's how I see Beam being used.

And I'm going to give you to him now. JESSE ANDERSON: So why was I looking for Beam before I knew about it? When Frances was asking you about why you're here, hopefully you were thinking about the same things and seeing some of the same things that I am, because I teach pretty extensively. And I was seeing some issues out in the big data community for which I was looking for a solution. So there is a picture of me. You can see that I have a new body now. It's kind of like "Futurama." But there I have that new body. And that was me walking around Strata San Jose 2016– about a year ago– asking everybody I knew, saying, I'm seeing this. I'm seeing this. I'm seeing this. And I'll go through the things that I was seeing. But I was really trying to find a solution. I'm seeing these issues. And you're probably seeing those issues right now. And then I bumped into this guy at Google named William Vambenepe. And I said to William, everyone's having to rewrite.

How many of you are sick of rewriting every time you went from, say, MapReduce to another framework, or from one framework to another? Are you tired of that? I know I am. This is an absolute waste of our time. And then I don't know about you, but since I train people in big data, I have to learn every single one of these APIs. There's value for me in doing that. There isn't value for you to do that. You trying to learn every single one of these new APIs that don't work with each other, that's a big issue. That's a total waste of your time. And then I put my CTO hat on– hopefully you got enough CTO speak during the keynotes today– well, here's my CTO speak. Guess what? Some of these projects are going to die. There is not enough room for all these frameworks to survive. There is not enough market share. Inevitably, we're already seeing some of these products dying. And you know what's going to happen? Those companies are going to be left in the lurch.

They're either going to have to prop that product up, or they're going to have to move again. And guess what? When you make that case to your CTO or your manager and you say, oh, guess what, we just rewrote for this one last year, and that project died, and now we're going to have to go over here– those are not the kind of conversations you want to have with your CTO. So I asked, what are people doing? Because as I was going around, everybody was saying, I don't know, I'm seeing these things, too. Or they may have been seeing them. And then William said, Apache Beam. And I said, OK, well, what's that? Give me some more information. And there we have it. So here's what I want you to take away from this session. If we're learning framework-specific APIs, then every time a new framework comes out, we have to completely change. And they completely change their existing APIs. That doesn't create any value at all. Have you ever seen an SEC filing where somebody said, hey, we moved from MapReduce over to Apache Spark?

You ever seen an SEC filing that says that? No. Companies do not care about what framework you are using. They care about what business value you're creating. And if we can have a system like Beam that allows you to create that business value faster, better, not having to rewrite code all the time, guess what? It's a total waste of your time to do rewrites. It's a total waste of your time to learn all these new framework APIs. There's just a whole bunch of problems there. So it looks like a lot of you were familiar with the big data ecosystem. And this is a diagram I often share of the big data ecosystem. So looking at this, we see all sorts of different things. We see Real-Time clusters. We see Pub/Sub clusters. We see Real-Time SQL clusters, Hadoop clusters, NoSQL clusters. And you see all these edge things around consuming them, working with them. So here's the problem with this. How many different APIs do you see on this slide? There's 20, maybe 30 different APIs on this slide.

And you know what's the fun kicker about this? The API for that Real-Time cluster to reach into that Pub/Sub cluster is actually different. So there's an API that that Pub/Sub cluster has when you directly access it. Oh, and then there's a brand new API when, let's say, Apache Spark accesses Apache Kafka. That's a brand new separate API– separate one you have to learn. Who's excited about this? Lots of learning, lots of value being created. So these are some of the things that I was seeing. Here is a visualization of this where we're looking at this and we're saying, oh, my goodness, there are way too many different APIs, way too many different abstractions in these APIs. Wouldn't it be great if we had something that could allow us to go through these– not have to learn every single one? So let me talk about why I personally am excited about Beam. And this leads into my journey of becoming a committer. But I want you to understand, and hopefully as you're sitting here, this is why I put the effort, I put my time into it to become a committer.

And hopefully, these are issues that you're seeing now or will see in the future that I think Beam is correcting and fixing. One of them– the slide we just saw– we saw 20 or 30 different APIs. You know how many APIs you'd have to learn with Beam? One. We've got one API to rule them all. There we go. Just one API to learn. And you can deploy that API onto Apache Spark. You can do that on Cloud Dataflow. Another big one– so putting my CTO hat back on, I can move between frameworks now. You don't know how important that is until you've had to move frameworks and do a rewrite. That's actually pretty important to be able to sell this internally. If you're going to say, by the way, we just finished that Apache Spark rewrite, or whatever rewrite we did, and now when we go to Flink, we're going to have to do another rewrite– I don't really want to do that. We'll obviously have to retest. We'd have to QA, perf test– all that fun stuff. But ideally there won't be a bunch of development time involved in that.

Out of curiosity, how many of you are familiar with Spark Streaming and the Spark APIs? So hopefully, you've looked at that from an API point of view. Beam is actually the most unified API I've seen, in terms of batch and stream. If you've worked with the streaming API and the batch API, they're somewhat similar. But they're different enough that you actually have to learn– I just wrote a class on this– you actually have to learn, here are the nuances of the streaming API, here are the nuances of the batch API. It's actually not as coherent and not as similar an API as you'd like to think. And that's a big issue, because we would ideally want something like Beam, as Tyler showed. It's Beam. It's batch, and it's streaming. And when you start to go through that API, you'll actually see, oh, OK, this is how they do that. I don't have to write very different code for my streaming versus my batch analytics. Another one– a unified API to all the ecosystem.
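The event-time windowing that lets one model cover both batch and streaming analytics can be sketched in plain Python. This is an illustration of the concept, not Beam's windowing API: each element is assigned to a fixed one-minute window by its event timestamp, so the result is the same even if elements arrive out of order.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(timestamp):
    # Assign an element to the fixed one-minute window containing it.
    return timestamp - (timestamp % WINDOW_SECONDS)

def windowed_count(events):
    # events: (event_time_seconds, word) pairs, possibly out of order.
    # Counts are grouped by (window, word), not by arrival order.
    counts = defaultdict(int)
    for ts, word in events:
        counts[(window_start(ts), word)] += 1
    return dict(counts)

events = [(5, "beam"), (61, "beam"), (30, "beam"), (70, "flink")]
print(windowed_count(events))
# {(0, 'beam'): 2, (60, 'beam'): 1, (60, 'flink'): 1}
```

Because grouping is keyed by event time rather than arrival time, the same code describes a batch job over historical data and a streaming job over live data.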

So that slide that I showed before– if we were to do that with Beam, we'd have one API into Kafka, whether we're running on Spark or we're running on this or that. It's that single API. Or if you're going to go into some other ecosystem project, it's all that single API. And that's where a lot of our effort is going in the project. Once again, CTO hat on– that risk mitigation is very, very important. If you're going to talk to your suits back at work and say, hey, I heard about Beam, you talk about the risk mitigation of not putting all our eggs in one basket and being able to move. And then they'll understand it much better. So these are some ways in which I've contributed. If these are helpful to you– I won't go into them in great depth. I wrote some positioning pieces. Once again, if you go back to work and you talk to the suits once more, and you want a positioning piece on why somebody, or a company, should actually be looking at Beam, you can go there.

I've written two pieces on positioning. I also brought an extensive point of view from training, where I've actually written a class on Apache Beam. I've done some pretty significant evangelism on Beam. Tyler and I spoke at Strata New York. I'll be going to QCon Sao Paulo later on, talking about Beam, talking to people about Beam. If you want to see some code– this talk will obviously be pretty devoid of code, but there is some more example code. If you go to that link, it will send you on to my GitHub. And then there is interacting with the team as a non-Googler. So this is an interesting point of view. That's why they put the token non-Googler up here on the stage, because they wanted you to hear from somebody who isn't from Google. So ideally, you won't think I've drunk the Kool-Aid or something. And so what do I do? Why do I like the team? What's it like interacting with the team? They're over there, so I'll kind of go like this so I can't see them.

But what are they like to interact with? I actually really enjoy interacting with the Beam team, and we've all gotten along very well. It's actually been a great journey to talk about. So I'm going to talk about that with my first commit. If you're approaching this with maybe a little bit of trepidation, thinking, I don't know about interacting with them– that's kind of how I started too. I was like, all right, let me dig into this. Let me throw out something. Let me try some code and see what that's like, because I personally have contributed to several different open source projects. I've contributed to Hadoop. I've contributed to HBase– quite a few– even Kafka. So here's a commit. This is my first commit, where I added some type descriptors. It was a pretty simple thing. But it was something where I looked at the API and thought, hey, this is really good. This would be something that would be interesting to add in. It's going to save some lines of code.

It's going to save some conceptual overhead of trying to understand Beam. Those three lightning bolts were the interactions with the Beam team– if you want to go and read those, GitHub interactions about pushes and pulls make for some good bedtime reading. And then Ken from the Beam team said, hey, looks good to me, I'm going to merge this. That was my first commit. It was pretty painless actually. If you've ever done a pull request on GitHub, it's not too bad. It's not terrible. So when Frances was talking about, hey, find either a JIRA or find something– look in the API. I saw a few people with your hands up that you're Dataflow SDK users. Maybe there's something where you'd say, there was always something I really wished was part of the Dataflow SDK. Well, on the Beam side, we've actually had some people come in and say, there is this thing I wanted. Boom, there's the push that they did. And they said, here, I've already written this and contributed it.

So eventually, I became a committer. And this is the actual announcement. I was one of the first committers. And you can see there at the bottom, there's Thomas. Thomas was made a committer at the same time as me. So it wasn't a significant amount of time, but it was an amount of effort– an effort I put in because I realized, yes, this is something I actually want to do. This is something where I see Beam doing what I want it to do. So going back to that initial talk that I had with William, some of the things that stuck out to me about Beam that I hadn't heard before were that it wasn't somebody's idea, it wasn't somebody's theory, and it wasn't somebody's PhD thesis. This was code running in production. There's a very big difference there. If you're coming from the open source community and you've seen some of that, you might have seen somebody's PhD thesis and tried to use that in production. I hope you're sitting down.

You may fall down otherwise. PhD theses don't [INAUDIBLE] well in production. Oh, my goodness, it happens. But here, what we have is real code. As Thomas showed, there was a lot of code that was contributed. It was mature code. It was code running in production. And I thought, OK, this is exactly what I can do. I can get behind a project that's really getting started, but actually has some miles on it. It's proven in production. There are some other things around it that we're going to expand on and improve on. But the core of it– the really, really hard parts– is pretty much done, and it's pretty much proven. So when you look at products and projects out there, that's a really important thing to look at in open source: are there miles on it in production? Are there people using it? That's a good metric when you put your CTO hat on and ask, should I use this technology or not? So here's a graph that I wanted to show you. And there are a few things I like about this graph.

So the blue line is Google. These are contributions to what is now Apache Beam; to the left of the incubation line, it was the Dataflow SDK. So in that pre-incubation period, it was mostly Google. But now what we're seeing, after the Apache donation there on the right side of that yellow line, is a significant uptick. And these are unique contributors per month. So I'm obviously one of those contributors. I usually try to contribute something at least every month, if not more. And so what we're seeing here is a significant uptick of people like you. It really warmed my heart to look out and see how many of you said, I'm looking to become a contributor. If you didn't see me, I clapped, because I would love to see you all participate, because it's a great community to be in. And what we're seeing is the manifestation of a great community. Actually, hey, there's a chart for that. And this chart is showing we've got some upticks in terms of people contributing, people continuing to contribute, people creating some very interesting things.

So we have this gradual uptick, and that uptick is obviously not just from Google. We also have Google continuing to contribute. That's another thing I look for in open source: is there a commercial company behind it with a vested interest? As Tyler talked about, Google has a vested interest in continuing to promote this and continuing to run this. And then there are other people like me who have a vested interest in building Beam and making it even better. So I wanted to give you some code, just in case you got bored and were looking for code. If you've ever been to big data summits or looked at big data technologies, there's word count. Well, there's also this other game of, how small can you make your word count? So I actually wrote some word counts. And I said, this word count is too big. It was like "Goldilocks and the Three Bears." So I made word count a little bit smaller, and then I made it as small as I possibly could.

So this is actually word count, a word count that you could run right now on Google Cloud Dataflow. This word count is pretty small, and it uses a few of the transforms that I've committed personally. The regex: I don't know if you've ever done a regular expression in, let's say, Spark code. More than likely, you did it inefficiently. Let me just tell you that. If you don't know what I'm talking about, come up to me afterward and I'll tell you exactly why. But more than likely, you did not [INAUDIBLE] incredibly inefficiently. So I created one. I said, here's an efficient one, because we often do regular expressions. And then we have our count. And then there's this other transform, ToString. So now we have a built-in way of taking whatever object you have, saving out its string form, and writing that to a file. These are obviously pretty simple examples, but they're examples I hope to hold out to you as, this is how you personally could start contributing to Beam.
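To make the stages concrete: the word count Jesse describes chains a regex tokenizer, a count, and a ToString-style output step. The sketch below is plain Python that mimics what those stages compute, not the Beam API itself (the real pipeline wires up Beam transforms over a distributed PCollection); it also illustrates the regex efficiency point, since the common mistake is recompiling the pattern for every record.

```python
import re
from collections import Counter

# Compile the pattern once, up front. The inefficiency Jesse alludes to
# typically comes from recompiling a regex inside a per-element lambda.
WORD_RE = re.compile(r"[A-Za-z']+")

def word_count(lines):
    """Tokenize each line, count word occurrences, and format as strings."""
    counts = Counter(word for line in lines for word in WORD_RE.findall(line))
    # The ToString-style step: render each (word, count) pair as a text line,
    # the kind of output you would then write to a file.
    return [f"{word}: {n}" for word, n in sorted(counts.items())]

print(word_count(["the quick brown fox", "the lazy dog"]))
```

The same three conceptual steps (extract, count, stringify) are what the compact Beam word count expresses declaratively.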

If you're already a Dataflow SDK user, or you're looking through the API and saying, hey, I'd really love to see something like this or that in Beam, you can do it. You can actually go through and look at some of my previous commits and see how they work. So now we're going to hand it back over to Frances, who is going to talk about what's next for Beam. FRANCES PERRY: Thanks, Jesse. All right. So I wanted to talk a little bit about the state of the project, where we're at, and what Beam's next steps are. Many of you who know the Apache Software Foundation know that projects start in a phase called incubation. This is basically a bootstrapping phase where the project shows that it can act in the Apache way. The Apache way is very much community before code: you're building an open meritocracy where anyone can come and develop. So Beam graduated– oops, those went the wrong way. There we go. So Beam graduated in December, which is basically a statement about the health of the project.

So Beam is now a top-level project at the Apache Software Foundation, moving forwards. Along with that great community we've built, though, we really need a solid technology. So one of the large focuses going forward is the portability infrastructure that Thomas alluded to. This is what lets users mix and match the different SDKs and runners, so they can choose the language that meets their needs and the runtime environment that they're after. Another thing we're doing is continuing to treat Beam as the glue that integrates other projects. So we'll be adding support for additional storage formats, and new DSLs that let you write Beam-style pipelines in, say, SQL or Scala. There's a lot of work out there, because again, people want to be able to use this technology and the power it brings in an environment that's familiar to them. And finally, one of the things we're committed to is making clear promises about backwards-compatible API changes.
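The mix-and-match portability Frances describes amounts to separating the pipeline definition from the engine that executes it. The toy sketch below illustrates that split in plain Python; it is a simplified stand-in, not the Beam API (Beam's real equivalent is a PTransform graph handed to a DirectRunner, SparkRunner, FlinkRunner, or DataflowRunner).

```python
# Toy illustration of Beam's SDK/runner split: the pipeline is a
# declarative description; the runner decides how to execute it.
def build_pipeline():
    # A pipeline as an ordered chain of transforms (a simplified stand-in
    # for Beam's graph of PTransforms).
    return [
        lambda data: (line.lower() for line in data),          # normalize
        lambda data: (line for line in data if "beam" in line),  # filter
        list,                                                   # materialize
    ]

class InProcessRunner:
    """Executes the transform chain locally, in the spirit of a direct runner."""
    def run(self, pipeline, data):
        for transform in pipeline:
            data = transform(data)
        return data

# The same pipeline definition could be handed to a Spark-, Flink-, or
# Dataflow-backed runner without being rewritten.
result = InProcessRunner().run(build_pipeline(), ["Apache Beam", "Apache Spark"])
print(result)  # ['apache beam']
```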

We've all seen what a pain it is when the API you're using is constantly changing underfoot. So the Beam project is moving towards its first stable release, where we'll be able to use semantic versioning and make some very clear promises. Those discussions are underway on the mailing list right now; it's looking like maybe early April we'll hit that milestone as a community. Now again, all this technology needs a diverse and open community, so I really encourage you to come and look at what Beam has to offer. You can go to our website. We've got very active user and dev mailing lists you can join. The website has a lot of information: podcasts, articles, and videos about the Beam model that you can learn from in more detail. There's a Quick Start, so if you've got a Flink cluster in your back pocket right now and you'd like to run Beam on it, we've got a Quick Start to help you do that. And we have JIRA issues; some of them are tagged with "starter."
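Semantic versioning encodes the compatibility promise Frances mentions directly in the version number: MAJOR.MINOR.PATCH, where only a major-version bump is allowed to break the public API. A minimal sketch of that comparison rule (these helpers are hypothetical illustrations, not part of Beam):

```python
def parse_semver(version):
    """Split a 'MAJOR.MINOR.PATCH' string into an integer tuple."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

def is_compatible_upgrade(current, candidate):
    """Under SemVer, an upgrade preserves API compatibility when the
    major version is unchanged and the version is not going backwards."""
    cur, cand = parse_semver(current), parse_semver(candidate)
    return cand[0] == cur[0] and cand >= cur

print(is_compatible_upgrade("2.0.0", "2.1.3"))  # True: minor/patch changes only
print(is_compatible_upgrade("2.1.0", "3.0.0"))  # False: a major bump may break APIs
```

This is why a first stable release matters: it fixes the major version against which those promises are made.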

So feel free to find an issue that you think you could tackle and dive in and start contributing. We really welcome new voices in the community. You'll find that if you post a message and just introduce yourself, a whole bunch of people will plus-one it, welcoming you to join us. Later in the afternoon, in a couple of hours, there's another talk on Beam that I'll be giving that focuses more on the technology. There, I'll introduce more of the details behind this unified model and how it works, and also give a demo of the same Beam pipeline running on Cloud Dataflow, Apache Spark, and Apache Flink, to show you that the portability we're building out is really coming to fruition. So with that, I'll invite the other presenters up here. We'd love to take some questions. [MUSIC PLAYING]



Early last year, Google and a number of partners initiated the Apache Beam project with the Apache Software Foundation. The core technology came from the Google Cloud Dataflow SDK, which evolved from years of internal research and development at Google. Taking a project with over a decade of engineering momentum behind it from within a single company and successfully building a thriving and diverse community to own it and drive it forwards has been a humbling and rewarding experience. Learn how and where Beam fits into the Big Data ecosystem and our goals for the future.


