Google Cloud NEXT '17 - News and Updates

Data Modeling for BigQuery (Google Cloud Next ’17)

NEXT '17
Rate this post
(Video Transcript)
[MUSIC PLAYING] DAN MCCLARY: First of all, I'd like to say I'm kind of blown away, Daniel, that we've got so many people at 4:30 in the afternoon, who've come to talk on data modeling. DANIEL MINTZ: I know. DAN MCCLARY: It's really telling. DANIEL MINTZ: Yeah. Welcome to the Daniel and Dan show. DAN MCCLARY: Yeah, and again thank you all for coming. We'll try and make data modeling as interesting as it can be. So I'm Dan McClary I'm part of the BigQuery team at Google. I'm joined by Daniel Mintz from Looker. He's the Chief Data Evangelist. And like we said, we're going to talk about data modeling. And we're going to try and make it as exciting as possible. Now, the title of a slide is a little bit misleading, because we're not going to talk about data modeling just for BigQuery. We're actually going to talk about data modeling, to some extent, it's history, what's changed, how we should challenge our assumptions. And similarly, not just how is data modeling maybe changed for massive analytical systems, but also for business intelligence and reporting.

And we'll talk a little bit about how BigQuery is different from the systems for which some of our original assumptions around data modeling originated. And then similarly about how modern VI tools can take advantage of some of these differences. So hopefully that's interesting to you guys. And we'll try to keep it light, and we'll try and take some questions, and talk through some common cases as well. So, for me, as I was thinking about how we were going to put this together, the thing I sort of asked myself is, where did we learn to model data? And some people didn't learn to model data, they just have an ORM, or something modeled data for them, or they inherited it, and that's sometimes tough. But where did we learn to model data? And I think for most of us the answer is actually quite simple, Codd. Oh, no, not that Cod, this Codd. This is Edgar Codd. If you guys haven't seen Edgar Codd before, he is due a great debt in the data and analytics space. So Edgar Codd is the man who created the relational data model.

And he started this with a paper he wrote in 1970 called "A Relational Model for Data in Large Shared Databanks." Quick history note, back in the '70s, we didn't have databases, we had databanks. Same thing, but that's what they were called. Now in this paper, Codd outlines the foundation of a relational model, right? So he defines a system of tuples, and importantly, an algebra on those tuples. And this is the thing that leads us to languages like SQL, where we can have declarative languages that are very easily able to produce execution plans that we can use to analyze data. And this gets us a long, long way, right? Like this gets us from the very early databases of the '70s, to all the things we can do with big data processing and systems like BigQuery today. Now, in that, Codd had two really primary concerns. Data independence and data inconsistency. So in his mind, these were the things that had to be addressed, right? And part of the value of the relational model is that you get data independence, right?

Your data looks a little bit like relationships in the real world, it's easy to add fields, it's easy to do stuff with it. But it also helps guard against data inconsistency. And data inconsistency can lead to wrong results, and miscomputed bills, and all kinds of pain. And if you guys are familiar with Codd, you may have heard of these things in your university class, normal forms, right? And this starts with the first normal form, which is introduced in the paper in 1970. And with the help of others evolved all the way up through the fifth normal form. Now, many of us who work in the database industry may remember taking schemas and saying I've got to get to 3NF, 4NF, or BCNF, which is Boyce Codd Normal Form, or even to fifth NF in the most extreme situations. Now there are lots of good reasons for normalization, right? So increasing data independence is actually great. Because it makes this easy for us to extend our models of what the world looks like. Our data can well represent what happens in the world.

And again, reducing data and consistency is great because anomalies are confusing and problematic. So what do we mean when we say easy to extend? Add new types, mirror the real world, I've said that a few times now. Avoiding anomalies can be different things. One of the things that Codd was really, really concerned about was avoiding duplicates and conflicting values. Conflicting values in particular can remain a challenge, but the other thing that Codd was dealing with were operational constraints. I mean, Codd was a man of his time, he was a man at the cutting edge of technology, and he was making sure that the system he was describing fit well with the operational constraints of the day. And this was the constraint of the day. You may have never seen a picture of this before. This is the IBM System/360. This is the premier database platform from 1964 to 1978. And it ran a hierarchical database called IMS. IMS well, IMS is now 51 years old, and I think still runs in production in many places.

But so Codd's primary argument was hierarchical databases, like IMF– IMS, sorry– can have these problems of consistency, have these problems of data duplication, can be really difficult to make model lots of general purpose situations. And so his relational model made it much simpler for us to capture and represent information in a queriable fashion that could answer a broad set of constraints. Things though have changed. Most of us don't come to work every day in sort of a burnt umber kind of room, with spinning tapes, and looking very fashionable, sometimes I wish I did. But things have changed. And because things have changed, its worthwhile when we think about modeling data, to say, yes, the things that Codd was after were important, and they and they matter to how we represent our data and how we analyze it. But some things have probably changed with respect to our operational constraints, and as such we should think about maybe we could do things differently. So again, I don't I don't want to make anyone think that normalization is not important.

Normalized data models are really important, they still matter. Data is moving fast, it's changing all the time, and so this notion of data independence is huge. It still matters quite a lot. And nobody wants wrong results, wrong results are just unacceptable. Now access and simplicity of queries though, I think, matter far more than they ever did. And part of the reason of this is that more and more people are analyzing data. More and more people are analyzing lots of data. And so making it simple for other people to write queries against data, or to build reports against data, is really, really important as well. So the question I keep coming back to, is how do we challenge these sort of normal assumptions in modern environments? What are the things that are different? One of the things I think is really, hugely different in modern distributed data processing systems, is that duplicate records don't cost money like they used to. If we go back to our picture of the IBM 360, that data was stored on tapes, or on giant giant magnetic hard drives.

The cost of a megabyte of storage in the '60s and '70s was tremendous. And so the graph we have here is basically average cost of a gigabyte in a year, starting from 1980. It didn't go all the way back to the '70s. And if we see, in this, is it's basically dropped to zero. So when we think about normalization, when we think about our data modeling, one of the real specters of denormalization kind of goes away, because the price just drops out. Having a duplicate record is fine. Now if a duplicate record causes inconsistency, that's an issue. So this allows us to kind of relax some of the constraints that we might have heard of when we were going through our university courses. So the old Codd's rule, you know, "the key, and nothing but the key, so help me Codd," right? We can reduce that a little bit. The notion of these things like non-prime attributes, we can have more stuff in tables than we used to, because it doesn't cost us a fortune to store data, or to store small amounts of data.

We also have the ability now to think about complex types stored inside rows in a way that we couldn't. And this is a thing that BigQuery is actually quite good at it, and somewhat natively designed for. The other thing is that distributed systems are the new normal. When Codd was writing his paper in 1970, networks weren't a thing. They weren't a thing that query processors had access to. Nowadays, when we look at big data systems like Hadoop. Spark, things like that, and systems like BigQuery, distributed systems are the way we process massive amounts of data. Now, that means that there are new costs to consider. Now, just a quick show of hands, I don't even know what a shuffle operation is? All right, so a few people know what a shuffle operation is, a few people don't. Just to make sure people understand, if I have a multistage distributed process and I have to pass data from one set of workers to another, that effectively amounts to shuffling data around. Now this is costly because I have to take those bytes, I've got to serialize, I've got to put them on a protocol stack, I got to send them down to the nick, across the wire, and bring it all the way back up again.

And that's a cost we didn't used to have in single monolithic systems. So this means when we do things like joining data, there a cost because there's transport that's happening. And so, while joins are incredibly useful and they're a key part of the way anyone analyzes data, we do have to ask ourselves, is a join really improving my workload, or is this join just something I've incurred because I've blindly normalized data according to rules from the past. So the other thing is that columnar access has become a big deal for analytical processing. So back in the '70s and then onward, we talk about OLTP systems. These are usually row oriented, and this means that when we access data, we're accessing in a row at a time. And this is one of the reasons that Codd had this notion of like, well, look at the key, nothing but the key, and facts related to that key. Because when you go and fetch that data, you're fetching the whole row. Columnar storage, such as what BigQuery uses, allows us to say, I'm interested in column one, just go get column one.

And we don't pay a cost for accessing null values, we don't pay a cost for storing null values, and we don't pay a cost for accessing the other columns, because we don't have to. We only access the column itself. And this kind of raises the question in my mind of, well, we can relax Codd's rules around keys a little bit, because in a columnar system, many fields can be key-like. So I think this, for me, sets up some priorities for data models and modern data warehouses. When I think about systems like BigQuery and other cloud data warehouses, I think we need to re-evaluate what our priorities are. And I think one of the biggest ones is make your queries easy to write. Because you want more and more people to analyze data, you want to build more systems, so make your queries easy to write. All right? If you can denormalize in a way that helps you get to an easy query, go for it. Make that data easy to join. So don't have things that require you joining data six ways, maybe just the most important joins, the most obvious joins, or things that users can get a lot of benefit out of.

Now, if you can aid performance by utilizing things like complex types, using things like nesting, which I think Daniel is going to talk about a bit, this is great. We want to get performance out of our queries, but the first order of business is to make sure people can use the data. All right, and then make data easy to update if it's required, right? So don't go overboard on the performance or on the normalization if it means that your updates process is going to be tremendously complicated. And finally I think parsimony is a huge piece of this. And when I think about the spirit of what Codd was writing about in the '70s, I think that's actually a huge piece of this. The relational model, in some sense, is all about parsimony. It's about making the data model something you can look at and say like, yes, that's like the real world. And we have the opportunity, with modern systems, to reflect that even more closely I think. So, Daniel, do you want to tell us a little bit about what was going on in business intelligence as it evolved, and what's different now?

DANIEL MINTZ: Yeah, so I'm going to come at this from the entirely different angle. So Dan started at the bottom of the stack with Codd, and the databases, the IBM System/360, and the tapes that drove it. And I'm going to come from the top of the stack, from people who are working in business and trying to understand what the data means. What it means to their business, how do they make better decisions with that data. So, when we talk about business intelligence, this is a thing that's been around at least since the early '80s. And the systems that people used for business intelligence back in the early '80s, were monolithic systems. And they were monolithic systems where you bought the BI System that sat on top of the database, or the data warehouse, and you bought all those things together. You had systems like Cognos, and Business Objects, MicroStrategy, Informatica. Lots of those are actually still around today. Lots of Fortune 500 companies, if you go into them, you're going to see these systems still in place.

And they sat on top of the very first columnar stores like Teradata and Oracle. And the reason that you had to buy a monolithic stack was pretty simple. It was because that was the only thing there was. Systems, because they were slow and because they were expensive, the only way that you could query data to get information about your business, was with these monolithic stacks. You bought all in one. Why? Well, you had slow expensive hardware. That was the defining constraint that decided what you were going to buy and how it was built. IT, which is a word that has come to mean I think a different thing among the newer breed of companies. It's like IT is the people you go to fix your laptop now, but IT used to be the department that owned all technology in the company. And when you were talking about these monolithic stacks, IT was a department that owned that stack. And they owned that stack, they probably owned the server farm where the semi backed up with your new Hyperion data warehouse, and trucked it into the data warehouse that was actually a physical warehouse.

So that was another big constraint was that IT had to own it, because they were the only people with the technological know-how to make it work. And the other constraint was we didn't have data flowing in from every device in our pocket, and every server in the cloud, and just there's so much data now. But that was not true back then, because data entry was actual data entry. There were data entry clerks who sat and enter data. If you were managing hundreds or thousands of stores, there wasn't a point of sale system that automatically sent data about what had been bought back to the mothership. Someone sat down every day, or every week, or every month, and typed that in. So you just didn't have that much data that you were dealing with. And so these systems grew up, these first wave systems grew up, and they did have some advantages. First of all, because IT owned them, the answers that you got out of them were reliable. There was a single department, a single person, in a lot of cases, who owned the system, and they said, yes, I stamp this with my stamp of approval and I said this data is right.

And so if you were trying to produce an SEC report, you knew that the data and it was right because somebody who really knew the data had blessed it. They also made it really easy to make pixel perfect stuff because you were still printing these reports out in a lot of cases. You didn't have touch screens that you could interact with them on, they had to be pixel perfect because you were printing them. And for 1980s, 1990s, they were pretty fast. You could actually get access to the data in a reasonable amount of time, you know, days, not weeks or months, and you could get your reports that you needed to run your business. But there were some disadvantages, not surprisingly. First of all, they were terribly inflexible. You'd put in a request for a new report or a new dashboard, and first of all, you only had the authority to put on that request if you were probably a C-level exec, and you only could do that if you were in a giant company that could actually afford one of these systems.

And so you'd put in that request, and then you'd wait two or three months to get your new dashboard. They were locked down. People couldn't access the data, because, my god, if you access the data, it would overload the data warehouse, and that would be crazy, because we don't want to have to spend another million dollars to buy another appliance to run more queries. And they were pretty low resolution. Luckily, there wasn't that much data, so low data resolution wasn't that big a deal, but they were. And so as the scales of data grew over time, people were not seeing the raw data, but they were seeing roll ups and summaries. Because the only way to make these systems continue to work on larger sets of data, was to reduce the size of that data by summarizing it, and accessing those in the dashboards. So not surprisingly, people started to get fed up with these first wave tools because they weren't living up to their evolving needs. And so in the late '90s, early 2000s, you start to see the second wave of Vi tools, and these are all about self-service.

You don't have to wait for IT anymore, you can self-serve to the data. And you get this proliferation, this explosion of different tools. Visualization tools, and you have a governance tool, over here, and your ETL tools over here, and you still get your data engine up here, and you still have your Legacy system, which still runs your SEC reports because those have to be perfect, and when we close quarter, we use that, right? And so you just have all of these different tools. And, you know, again, why? Well first of all computers have gotten a lot faster. You now have PCs and even laptops that can hold a reasonable amount of data. Not nearly as much data as what lives in your warehouse, but you could maybe carve off a chunk of data that you really cared about because you were in finance, and you could carve off some of that finance data, you'd get a hold of it, you wouldn't give it back, and you'd load it into a server under your desk or onto your laptop so you could do some analysis.

And your computer was fast enough to actually slice and dice that data in some reasonable ways so you could get some meaning out of it. People went to the second wave because they were fed up with the IT bottleneck. They were sick of waiting three months for a change to their dashboard. I want to change the title on dashboard, great, get in line, we'll be with you in three months. That didn't fly anymore. And, because the amount of data that people were trying to work with was growing, and they were sick of having to do roll ups, they said well, but I only care about this little slice of data, and I can fit that on my computer, so I want high resolution access to just that data. And so they could do that with these new tools. So in terms of advantages, this gave people a lot more agility than they'd ever had before, and that was a big win. They could actually self-serve to the data, and that gave them much faster time to insight. It meant that they weren't waiting months and months to get any explanation, any response.

And it also gave them that higher data resolution. But it had some disadvantages. You know, I think the tools of this era, partially by necessity, necessitated by the technology of the day, they made some choices about things that they would hold onto and keep, and things that they would throw away. And one of the things that they threw away was a shared model, a shared understanding at the business level, about what the data actually meant. That thing used to live, in the first wave tools, in the monolithic stack and in the way that the data was rolled up. You get these second wave tools, people are slicing off a little bit of the data, analyzing it in a workbook, now all of a sudden the meaning of the data, that shared understanding of what the data means, that gets left behind. You're also really dependent on data extracts, and data extracts are really troubling. Because I extracted my data last month, and you extracted your data last week, and now they're not in sync, and we don't know why.

And so rather than spending the meeting discussing what our strategy going forward should be, we spend the whole meaning trying to debug why we're getting different results out of our data. And then the third disadvantage was just that you had this tool explosion. You had so many tools that people just had trouble keeping track of, how am I supposed to get this data from here to there? And so that that created a lot of problems. So that's from the BI perspective what was happening. At the same time, that Codd and the data warehouses– or the databases– that turned into columnar data warehouses were growing underneath. So let's jump forward to today. DAN MCCLARY: So we can actually talk about the fact that maybe things are different now. Maybe we can do some new things. I think it's interesting to point out, and worthwhile, that the consumerization of data, that business intelligence and reporting enabled, actually really challenges the fact that we need good models. We need a good data model, largely relational.

But we may have to think about ways to make it easier for consumers to get at it in ways that allow them to have a shared understanding. Real shared understanding both at the warehousing level, and at the tool level. So it's a talk about BigQuery, and other things, and so would be– I would be remiss if I didn't talk about BigQuery. In part, because we think of it as a really good example for a modern data warehouse that's cloud native, that requires us to challenge our modeling assumptions. Now for those of you who don't know about BigQuery, you don't know what it is, BigQuery is Google's enterprise data warehouse. It's the warehouse we built starting with the Dremel project which we wrote a paper about 10 years ago. It's central to the way Google analyzes data. And in BigQuery, we make it available to GCP customers. And the thing about it that's really different is, unlike systems in which maybe I was designing for a fashionable 1970s INS system, or even the warehousing appliances of the '80s and '90s, Dremel and BigQuery were really designed to be query engines that operate at a data center scale.

And this means that we can take advantage of things that are somewhat unique to GCP and unique to Google. We separate, specifically, storage and computing. Now in many big data systems, this would seem like anathema. This would be a real problem, because I want to keep close to the data. However, because Google's networks are so fast– and I encourage everybody to take a look out on the show floor at the Jupiter switch that we've brought, because this is really some of the stuff that makes GCP super powerful. We have a petabit bisection in our network. Which means all of a sudden, I can have vast amounts of compute, that are separate from storage, but access that data as if it were effectively local. Now this means that our query engine can do a lot of stuff. At scale, that's really, really different. It also means I can start to play some really interesting games with the way I model my data. Now, the Compute Engine itself is Dremel. It is our Dremel architecture. And this is effectively a dynamic serving tree of many, many shards of compute workers that can apply thousands of cores to problems.

This is a dynamically organized tree, and we are able to use the resources per the query. Now the nice thing is that there's an economic advantage and, so much as you only pay for the CPU cycles you're using, but at the same time it also means we can apply a tremendous amount of compute to to some of these problems. It allows us to have multistage queries that go really fast. Joining's OK, joining's safe, but at the same time it's a distributed system. And so we need to make sure that joins are things that we want to do, that they really make sense in the context of the problem. And the other thing, I talked about shuffle and the cost of shuffle. Well, shuffle and any distribution system costs. However, BigQuery does a unique thing when it shuffles. It shuffles to remote memory. We wrote at a really interesting blog about this earlier in the year. And by having a remote shuffle layer, this allows us to handle shuffle operations much faster, and at a higher scale than most systems.

So this means that we can join data even in situations in which the normalization might help us. But it does mean that because there is a shuffle stage, things like [INAUDIBLE] can cause problems. So this is just an example of how something like a select BI with a WHERE clause in a group, would actually move in and out of remote shuffle. And the pipeline that you get from a remote memory shuffle, is such that again you can do really interesting joins. But it also means that there is a tax when you do a join. Now, our managed storage, as I mentioned earlier, is Columnar. And we take care of things like durability by default. So when you put data into BigQuery, not only is it translated into our proprietary calendar storage format capacitor, which is another great blog on, we store columns in our own files, we compress these, and encrypt them on disks. The data is always encrypted. And then we replicate it across data centers. Your data is available in multiple zones and resilient to failure.

Now this Columnar storage allows us to do some really interesting stuff, because not only is it Columnar, and allows us to have many key like attributes, it also allows us to take advantage of the fact that we can filter on these things when they're still compressed. So the notion is, I don't have to go and pay the CPU cost to decompress things, just to figure out how the WHERE clause applies, I can apply it to the data before I've decompressed it. Now, all of this means, I think, a few things. Again, we can think about having many key attributes, we can think about having complex fields and types– because in Columnar storage engine, Columnar storage system, that supports complex types, allows us to model data in a different fashion. But it also allows us to say, you know what, normalization matters to me. I have a relational model that represents my business, that represents my world, and I need to run joins. That's OK. We can bring that, and I can move whole hog into BigQuery, and have exactly the same kind of querying experience I would expect.

So Daniel, what was happening and what sort of new tools have emerged in this space to deal with the cloud experience, if you will? DANIEL MINTZ: Yeah, so we talked about the first two waves. So let's start from the sort of middle layer. So we've talked about the bottom layer, the actual hardware that's driving these, the top layer of the business side, but in the middle, you've got this way of accessing the data in it's sort of raw form. And so, I'm sure this won't come to news to most of you but, programming language development looked a little bit something like this. You start with machine code, people realized that writing ones and zeros is kind of tedious, so they're like all right, let's do assembly. You get Fortran, and COBOL, basic, then C comes along in the early '70s, this is a huge revolution. '95, it turns out, was a great year for programming languages. PHV, JavaScript, Ruby, and Java all appear in 1995. But you get this wide range.

And really, what's happening here is that the level of abstraction from the actual processor that is doing the work, is increasing. Go is this amazing language that very pithily lets you run incredibly fast programs on tons of distributed processors. Machine code, you can't do that because you wouldn't want to write that in ones and zeros. On the data side, this is kind of what it looked like. So we start with data written to files. This is before you have random access memory at all, you roll your own b-tree. Codd comes along. Then in the late '70s, we get SQL, which is really an operationalization. Wow, that's hard. Operationalization. I'm curious what the– yeah, the robot got it right. The robot that is transcribing got it right, amazingly. So you get SQL shows up in the late '70s, that's an operationalization of Codd's rules, Codd's laws. And then you get Oracle and IBM, which turn them into commercial software. In the late '80s you get the T-SQL, which is still SQL.

And the '90s, you get MySQL and PostgreSQL, which are still SQL fundamentally. And really, there hasn't been very much development on the data side. So at the same time that programming languages have been evolving enormously, data language has been kind of stuck. And so in terms of what we actually want, we wanted to define relationships and definitions once. We don't want to have to do that repeatedly. Computers are good at doing tasks repeatedly, humans like to do it once and be done with it. We want to retain all of the agility that SQL gives us, which is amazing. We want an easy way of translating what we actually mean, the business question, the question that we care about, into a data query. And sending that off to the database or the data warehouse so we can get that data that answers the business question back. We want to state performant, which is becoming less and less of a problem. But still, when you're querying petabytes of data still is an issue. And we don't really want to worry about syntax.

Because, you know, the proliferation of SQLs, means that you have to keep track of little bits of, oh, this is how this one handles date reformatting, and humans shouldn't have to worry about that, right? Computer programmers figured that out. They said, we don't want to worry about that stuff. I don't need to worry about memory management except in 1% of the cases, so I won't use C most of the time, I'll use something that takes care of memory management and garbage collection for me. Analysts are still stuck writing SQL where they're worrying about syntax. They're doing a lot of these things. And so, lots of analysts I talked to say, but I love SQL. And I love SQL, too. Dan loves SQL, too. Does anybody else love SQL? Yeah, SQL's amazing. I am getting a lot of like, meh. Yeah, I have a love hate relationship with SQL, too. That basically describes it. So SQL's amazing. It's proven it's powerful, it's versatile, it's everywhere. SQL and C fundamentally run the world, right?

But, SQL's really easy to screw up. In my first job, where I was writing SQL, in the Orders table there was the status field. And you had to say Orders. Status equals Completed. And if you didn't, you would get all the failed credit card transactions returned as part of your query. And you go, oh my god we had an amazing day, but you didn't, you just had a lot of failed credit card transactions. And so keeping track of, oh, I have to put the group by there, and I have to– what dialect am I in, how does it handle date conversions, that's stuff that programmers have gotten away from and analysts are still stuck with. And so what Looker has done– and I should say I was a Looker customer for far longer than I have worked for Looker. And so I've stumbled on Looker early on in Looker's history, and they made this promise, and I said, oh my god, I want this thing, but it doesn't– it kind of exists, maybe it won't really live up to it– and it totally has lived up to it, and that's why I sort of jumped over the fence.

But Looker started with this kernel of this idea of LookML, which is SQL, evolved. It says, let's stop worrying about all that stuff. And so it makes this data language reusable, which is not a thing that SQL is. I know when I go look at SQL I wrote two weeks ago, I give up immediately. I'm like, oh, I'll write this from scratch again, because I had no idea what I was doing. It makes data language collaborative. Because if I try reading something I wrote two weeks ago, something Dan wrote is probably even worse and harder to understand for me, because he has his own style of writing it. No offense to the way that Dan writes SQL. I'm sure it's great. It makes data language flexible. It makes it easy to organize so you don't have untitled, underscore, 43, dot SQL, untitled, underscore, 44, dot SQL. These little recipes on your desktop. I definitely have those. And it gives you version control, which is another thing that programmers a long time ago were like, hey, maybe we should keep track of what we did in the past, so if we need to go back, we can.

Data people, not so much. And the way that it does this is it says, well, really any query can be decomposed into four things. One is a set of relationships between tables. The second is the fields that you actually want. The third is what filters you want to apply to the data. And the fourth is how you want to sort it. And if you have those four things, you can really compose any query. A computer could compile those four things down into a SQL and send it off to a database. And so that's what LookML is, and that's what allows you to build a sort of third wave of BI. Which is a data platform that says, you know what, we'll have the data platform take care of those annoying things that we've always worried about, and allow people, whether they speak SQL or not, to come in and write queries without looking at the SQL, without needing to look at the SQL, write queries against the data. And that allows you to access the data directly. And it's only possible because of this data infrastructure revolution that's happened underneath.

I think big data is a lot of hype. The big data revolution has been a lot of hype, and not that much delivery quite yet. But one thing it absolutely did deliver on, is enormously fast databases. You go back even five or 10 years, when Hadoop was going to be the answer, how far we've come from then of like, well, maybe we should shard MySQL a lot. Ooh, that's really ugly. Let's use Hadoop. Oh, boy– you know? Hadoop was kind of going back to the old days where you'd like type in the program at night at 11:00 PM and then you come back the next morning, and aw, man I had a typo. And you'd have to rerun it the next night. It was very much going backwards. So these incredibly fast databases, that mostly live in the cloud, have made this revolution possible. Because all of a sudden, you didn't have to do all the preparation. You could actually set a data platform on top of these databases and just access the data, leveraging all of their power. And so, in this third wave, well, first of all.

It was only possible because the databases were that fast. Without these databases, it doesn't work. One of the other reasons that people wanted this was because they were just sick of having a million tools. If you could centralize all this work in one tool, that would make life a lot easier. And the third, as Dan talked about, I don't know if you guys noticed, but the x-axes of those two price drop graphs, one started in 1980, the other only started in 2000. And that's because this idea of shuffling data across a network, simply didn't exist in the old days. But now that it does, you can access the data in these incredibly fast databases which live in the cloud, and then bring it to your machine. And so in terms of advantages, well the third wave means you're getting reliable answers. And not just reliable answers in the sense that the data is consistent, but reliable answers in the sense that someone who knows what the data is supposed to mean, went in, and said this is the definition of our KPI.

This is the definition of average contract value, and it's a plus b, divided by c. And so when you want to access that, you don't have to remember oh, wait, was it a plus b, minus c, or over– that's in the system. The system worries about that for you. It gives you agility because you're not constrained by the data that lives on your machine, you're accessing the data right where it lives. And so you can access all the data. It lets you pick the tools that are right. So you choose what is the right amazing cloud database for you? What is the right ETL tool for you? What is the right visualization tool for you? You're not locked into these monolithic systems. And it gives you full resolution. You're not rolling up the data and summarizing it before you access it, you can always drill down right to the lowest row, because you send a billion row or 100 billion row query to BigQuery and BigQuery says, all right, I'll be right with you. There are some disadvantages.

First, is that it's a major shift in thinking, especially for the people who worked in IT when it was called IT. It's like a big deal. They have to let go. And that's hard for a lot of businesses, because these business processes are deeply embedded in the organizations. And so for organizations to think of how they can use data in these new ways is actually pretty hard. You do need a powerful data warehouse. If you've got an IBM System/360, the third wave of BI is not for you. Don't do that . And this is something that we've only begun to see, but all of a sudden you have insight coming from everywhere. Because the people who are closest to what the data actually means, the business people who understand what it means, are able to explore freely, and so they're finding stuff that they couldn't find before. And so going from information scarcity to insight abundance, is actually hard in some cases. It's the same thing of going from a bunch of word docs on your machine, now you're collaborating on stuff with a bunch of people, you need Google Drive because you need a way to keep all that stuff organized.

Otherwise you're stuck. Oh, no, you're working on that version. I already made changes to that version, I emailed– did you not see the email? You don't want to go down that road. So that is a new way of thinking, but this is possible. So now that we've bored you all with the very high level abstract, you've lived through the data model section, now we're going to get practical. And talk about what it means to model your data in this brave new world. DAN MCCLARY: Wait, wait, I had one more academic point to make. DANIEL MINTZ: Sorry, I lied. DAN MCCLARY: I know, I know, I just can't give it up. Which is though, when we think about the spirit of Codd's original work, and the notion that a relational algebra is a powerful tool to apply to modeling real world situations, the business user needs it, too, but for the business user, semantics and context matter as well. It's not it's not simply the set theory or the set theoretic operations I want to perform, it's that these have specific in-context meaning.

And so I think a lot of what we see and the interplay between classic data modeling and data modeling as it evolves for the BI user, is that context is brought in line with the relational model. DANIEL MINTZ: And doing that I think is enormously valuable, and people sometimes miss that, because the people who actually in the store, running the POS system, they're the ones who could look at the data and go, something's wrong here. A data analyst back at HQ is going to miss that, because they don't have that tactile feel for what the data should be, what it should look like. They're going to miss the insights if the data has to be shipped off to HQ, and then an analyst has to groom it and put it in the system, if they can actually access the data themselves, they're the ones that are going to come up with those insights. DAN MCCLARY: Outliers look different the further away you are. DANIEL MINTZ: That's exactly right, yeah. DAN MCCLARY: OK, so now we can be now we can be a little bit more practical.

And unfortunately, I think the one thing to sort of tell everybody, is that there's no one hard and fast rule for, thou shalt model your data this way. All we can do is talk through some of the things that we see, some of the questions that Daniel I have been asked, and what we think are reasonable rules of thumb to apply. All right, so one of the things I get asked a lot is, I'm really worried about denormalizing my data. I've heard systems like BigQuery really like denormalization. I don't know, I have this dimension, should I put it in the multiple fact tables? The answer is, of course, maybe. Everybody's least favorite answer, maybe. Now the questions you have to ask yourself are actually about the dimension. Is this a static dimension, is it way smaller than the fact? Is the dimension just a list of the 50 US states? You could probably denormalize that pretty safely. Now, if the dimension is something that changes, or is used independent of the facts, if I have a dimension table it's quite large and used by many, many different fact tables, the challenge of denormalizing it, whether as a nested structure, or into every table is a flat sort of columns, is quite large.

So it might be better left normalized, because that actually models the world better. It models the world. And large scalable joins are entirely possible with systems like BigQuery. So again, I think this notion that I have a dimension table and I really, really want to know whether or not I should put up with the fact table, should I denormalize it, well how is it used? How is it updated? How big is it? These are the sort of fundamental questions we have to ask. So another one I get asked a lot, and I think comes up a lot when people come to BigQuery, they look at our type system and they say, OK, wait a second, you've got a arrays and structures, you have these nested and repeated fields. What do I do with those again? Is everything now an array of struts? What am I supposed to do there? Now, the reality of it is, I think Daniel, you've got some thoughts on this that you want to share in a bit, is that nested and repeated fields are hugely powerful ways of preserving the logic, or the logical relational view, while getting the physical benefits of dennormalization.

So if I have a dimension table that represents a one to many relationship, I can take that dimension table that is the many, and embed it, as nested and repeated fields, into that table and maintain the logical view of, oh, yes, this order has a list of items in it. OK, that's reasonable. But there is a question. So that repeated data needs to be updated frequently, that could be a real problem. So to say the shopping basket had these items in it, that's fine. Because that happened once. To say that I'm going to have a nesting of all of the transactions going on forever, becomes a real challenge. Because now I'm saying, oh, I have an array that I'm going to modify on every row, every day, with all of the new data. That's probably not the right thing to go and nest. DANIEL MINTZ: I mean or even to go even simpler than that, items are something that belong to the order, but orders aren't really something that belonged to the user. The user can exist separate from that.

And so the idea that, oh, I need to go into the users table, which then contains the orders, which contains the items, every time that this user makes a new order– yeah. DAN MCCLARY: Well it creates it creates a real problem. Then also in terms of trying to reflect the world. If we want the people who were writing reports and people were analyzing data to be able to write queries or build reports that model the world, we have to make sure that we're not sort of overly denormalizing for the sake of performance, and as such, breaking our view of reality. DANIEL MINTZ: Yeah. DAN MCCLARY: And also if you have to, as an analyst write SQL, you would like it to not be terribly complicated. Why do we have to do an unnest on the orders in the user to get to the thing? That's not what we want. So when do we think about nesting? Daniel, you've got more thoughts you're going to share, as I said, but again, sometimes it makes sense if we're trying to preserve the logical relationship, because it's important relationship.

But we shouldn't just do it because we're blindly seeking performance or because I read the BigQuery doc and it's like, oh, well they have Nested, I better figure out how to use Nested. So one of the things I see a lot, particularly with enterprises who are coming to big worry from older systems, is that you may have a query that has many, many stages or you're used to running it on your teradata machine, and it runs for two days, and then it shows up. I feel like this is one of those situations in which you really do have to kind of challenge your assumption, but not necessarily your assumptions about the data model. The model may be fine. There may be tweaks you can make, but you have to think of the workflow a little bit. So sometimes when we run these queries where we say, I'm going to submit it, I'll come back on Wednesday and we'll see what happened. Some of that's because you're actually operating in a resource constrained environment. You may have gotten used to it, but it is kind of resource constrained.

I'm only getting a little bit of batch work, and I can submit it in these time periods. And I think the challenge that I ask people to push on that workload is, well, what if you broke it down. Are there materializations that can preserve work? Remember, at the top of the talk, we talked about the fact that the cost of storage has plummeted, such that duplicate data doesn't really cost that much, particularly if it doesn't live for very long. So if you intermediate materializations, you can not only sort of preserve your work, and maybe speed up the way you compute things over time, but you could also think, hey, maybe there's a broad intermediate form of either denormalized one, or a set of normalized relationships, that are actually really useful for a broad set of people. If I have a number of analysts who are looking at the same kind of things, maybe I can break my work down into sort of a large materialization that many analysts can then go to their leading edge computations on.

And so, one of the things I also see is this notion of like, oh, well I'm coming from the big data space. I suffered through MapReduce, we got to Hive, well it's OK. It's OK. But one of the things that really worked for me in Hive was how we partitioned our data. We had multiple levels of list partitioning, it was OK. We were able to go a little bit faster. But one of the things that I think is worth pointing out is, not only does BigQuery offer the ability to date partition tables, which is pretty common in data warehouses, but we have this sort of table sharding which allows you to effectively have prefixes that are common amongst tables of common schema, and then wildcard seachable suffixes. So you can use that to emulate something like partitioning, while providing yourself a lot of flexibility from a sort of partition management standpoint. So I think that's the thing that people can kind of look at if they're used to coming from something where list partitioning was really key to getting a little bit performance out.

Or just even organizing large amounts of data. A good example of this, actually, is a lot of the things we've done with the ground station observation data and the weather public data sets. This is a great example of like, oh, this is we got a ton of data from 1929 to today, gosh, we really ought to figure out a way to deal with that in a somewhat partitioned fashion. So, Dan, do you want to talk about the Nesting stuff. DANIEL MINTZ: Yeah. DAN MCCLARY: Because I think it's interesting to watch someone from the outside think about our nested structures. DANIEL MINTZ: Totally. And I'll be honest, this challenges me. I'm a deeply relational thinker. I want things to be up and down, side to side, and just– DAN MCCLARY: Tables are cool. DANIEL MINTZ: Yeah, tables are good. Tables are good. Tables are good. So let's use a really simple example. I can print it on the screen simple example. I've got sessions, or orders, or some data that is conceptually nested. In the real world, there is a nesting pattern going on here.

And so normally, no pun intended, normally, the question that you would ask yourself is well should I normalize this data? Which will be more efficient on the space front, and the answer, as that graph from before showed us, is don't bother. That's not a good reason. There may be good reasons to do it, but space efficiency is not a reason to do it when you have an infinitely scalable cloud. OK so should I denormalize it so I don't have to worry about joins. OK, normally the answer would be, well, maybe. That might be more performant for certain types of queries, but maybe not for others. And, in BigQuery at least, there is another option. There's a third option, which we never had before. So here is my really simple example. I have three orders which happened in January. This was very hard to come up with because I had to figure out fruits and vegetables that started with e. But I did it. And so we've got three orders. They each contain some number of items. Each of those items as a unit price.

Super, super simple. So one thing that we could do with this data is we could turn it into a snowflake. We could have our orders table, our orders items table, order items table, and our items table. Great. Normalized. Yay! I feel so at peace, because it's like– DAN MCCLARY: I'm pretty sure did that in my first database class. I still hearken back to those days. DANIEL MINTZ: I realized that I organized my CDs. Some people may not know what CDs are. They're these physical disks, that we– never mind. But I liked to do it in order when I was a kid. Anyway, so we could do this, and this would be totally fine. Another thing that we could do is we could denormalize. We could stick it all in one table. We've got some repeated data here. We now know the unit price of the banana twice. Unnecessary, space inefficient, but, hey, no joins. Great. In BigQuery there is a third option, which looks like this, which no one scream from horror. This is really weird looking and not OK for us who like tables, but we can just leave the table nested inside a column.

DAN MCCLARY: Daniel, we heard you liked tables, so we put tables inside your tables. DANIEL MINTZ: Is it turtles all the way– never mind. I'll give everybody a second to compose themselves after looking at this. All right, so what we have here is a table in BigQuery, which still has an order ID, still has all the top level information about an order. Order ID, order date, order total. And then inside the order items column, which is a column, we have, fundamentally, another table in each row. They all have the same schema. And I'm not going to go into the struts of arrays of struts of arrays of struts, but– DAN MCCLARY: That's a thing we could talk about in Q&A. DANIEL MINTZ: But this is totally kosher, legit, in BigQuery. You can do this. Now, why would you want to do this? Other than to drive your Databases 101 teacher crazy. Well, let me go back. The reason is actually because if I want to query something just about orders, remember that BigQuery is columnar. So that crazy thing on the right, don't need to worry about it.

Stays in memory, I never touch it. Right I can get a count of orders really simply. Select count of all from orders. It's going to give me the answer. That's great. Now, when I do want some information about the items in the orders, I can unnest that structure, go in and get the information, and then, and only then, unpack it and look at it. In the other examples, I have to choose one or the other up front. I have to either say I'm going to normalize because that makes it really easy to query stuff about orders, but kind of a pain to query stuff about the things in the orders because now I have to do two joins. Remember again, this is an incredibly simple dataset, so if you're dealing with real data, this would actually be a pain, not just kind of a fun exercise. In the denormalized one, sure, I can get stuff about the items and the orders really easily, but just to find out how many orders I had, I have to do select count of distinct order ID. And so now and scanning a bunch of stuff repeatedly that I don't need.

In this, I put off the choice about how I want to structure the data, until query time. Which is great, because I can choose the structure that makes sense for the type of queries that I want to run, This is a big deal when you're dealing with billions of rows of data. So the one problem, remember, we said, well, in our list of things we want out of our data language, Is that writing the SQL for this is going to look a little funny. We're going to use this idea of left join unnest, which is weird, and a little weird, right? So, but with a data platform, with a third wave data tool, we're not writing the SQL. So I grabbed some look LookML, which is our markup language, that deals with this. It says, yeah, there's this table at the top called orders, and it has this join called order items, and here are the dimensions inside the views, inside orders and order items. I can compose things like item total by multiplying quantity and item price. I can specify multiple time frames.

And this probably looks a little bit weird to people who haven't seen LookML before, but the fun part is I can learn this real quickly because I already speak SQL. And this is fundamentally just SQL. And then I can expose this to my business users. And so all of a sudden, they can then explore this data set freely and leave it to the tool, to the data platform, to write the appropriate SQL. They don't worry about how orders and order items are related. Whether I did the normalization, the denormalization, the nested structure, tool worries about that. So that stuff only gets written once, and now all of a sudden, they can say, well, I want to know orders by dates, I want to know how many bananas were bought on the 26th. They can just freely move around in this environment, and ask questions. We send out those queries that the platform writes off to a BigQuery, and assuming that there are more than seven rows of data, BigQuery, then, can very powerfully churn through that data and return an answer.

DAN MCCLARY: Well you preserve, importantly, you preserve the relational model of the world throughout, DANIEL MINTZ: That's right. And so the relational model now spans all the way from the bottom, from the way that the data is actually structured in memory, all the way to the top to the way that business people are accessing the data. And so we have this sort of unified model that is both how the information is organized, at the most basic level, and how people are accessing it. DAN MCCLARY: So at least logically, we're actually accomplishing a lot of what Codd set out to do. DANIEL MINTZ: Codd, would be so happy! DAN MCCLARY: He'd probably be freaked out. I think cell phones would freak him out. DANIEL MINTZ: So I'm going to hand it over to Dan to wrap up. If we're downstairs, Looker's downstairs at E14, so as you're drinking a beer, if you want to put our engineers to the test and make them actually explain LookML to you, on bigger data sets and this, come on down to E14 and you can make them do that.

And I'll enjoy watching them have to do it, so– DAN MCCLARY: Okay so I'm going to try and wrap up really quickly so we maybe have even one or two minutes for questions, because I know there's also a happy hour that's happening and I'm sure people are quite thirsty. I think for me though, the core of this is really– the relational model matters, normalization to some extent matters, it's at the foundation of systems like BigQuery you don't get relational query processors, SQL driven processors, without that model. And the logical continuity of it, I think, is really important, all the way from how we built our distributed processing engine, to the way you can model schema. Using not only the tools you have learned, but also things like nested and repeated fields. And it passes through tools like Looker into the reporting space as well. But, I think, again, it's important as we think about trying tools BigQuery, trying tools like Looker, we can't blindly accept the rules of the '70s.

We have to know that they're important, and know that they have given birth to the point we're at now, but anytime we're modeling data, we should just continue to ask ourselves did this actually increase independence, data independence, independence from our users, and their ability to analyze data? Is it helping users fundamentally do more, or find insights? Is it reducing the inconsistencies? Is it making it easier for me to actually figure out what happened? Is it making those outliers come closer to the fore so I actually know what's going on? And then finally, if we're making a modeling choice, because we believe performance is the reason, is it really meaningfully enabling performance? Or is it really kind of premature? Well you know it goes a half a second faster, and that's really important. Is it really important, or is it better to help the users? So again, all of the things we learned from our friend the fish and from Mr. Codd, are important. We just need to make sure that we're aware that the systems we work with now allow us to potentially do quite a lot more.

And with that we've got only a couple of minutes left, but on behalf of myself and Daniel, I'd really like to thank you guys for sticking around even as happy hour started. We really appreciate it. Who at data modeling could be so fun and fishy. [APPLAUSE] [MUSIC PLAYING]

 

Read the video

BigQuery is a different data warehouse, permitting new approaches to data modeling. To get the most out of this system, Dan McClary and Daniel Mintz examine where old assumptions of schema design come from, as well as how BigQuery allows them to challenge those assumptions to produce data models which are easier to query and more performant. Additionally, they examine this parallel evolution of assumptions in Business Intelligence, and how modern tools such as Looker can take full advantage BigQuery’s flexible data models.

Missed the conference? Watch all the talks here: https://goo.gl/c1Vs3h
Watch more talks about Big Data & Machine Learning here: https://goo.gl/OcqI9k


Leave a Comment

Your email address will not be published. Required fields are marked *

Loading...
1Code.Blog - Your #1 Code Blog