Google Cloud NEXT '17 - News and Updates

Find and Manage your Video Catalog with Machine Learning (Google Cloud Next ’17)

(Video Transcript)
RAM RAMANATHAN: Welcome everybody. Looks like we- I hope they know to let more people in, but welcome everybody. Ram Ramanathan, Product Manager at Google Cloud. Along with me, I have Juhyun Lee, who is a software engineer in our research organization. And also with me, we have Lynne Hurwitz, who is a developer at one of our favorite customers, Wix. So the goal for the session is to help you find cats in videos. That's really all we're going to talk about with video intelligence. How can you find your next cat video as easily as possible? Because as you guys know, the internet is all about cat videos. And so, the thing that we're trying to talk about is, jokes apart, the goal for this is really to think about how can we make sense of dark content, as [INAUDIBLE] put it. How do you find your favorite, your relevant entities, in video content? How can we help you find potentially, if you're thinking about a media organization. Like an example is how Google Play can automatically detect appropriate actors, actresses in videos, et cetera.

How can we help you do those kind of scenarios? Video content, it's exploding every day, right? I think there was an article last week talking about how every day people watch over a billion hours of video content on YouTube. Every day. And can you imagine- and there's an article from Cisco- Cisco actually had an article sometime back talking about how by 2020, 60% to 70% of all traffic on the internet is going to be video content. Well, how do you make sense of that? How can you, as an organization, start harnessing all this video content that's flowing through? Sometimes content that you're publishing, content that you have access to, and sometimes it's content that you're distributing. How can you make sense of this media content, so you can get metadata and do things with that content? Clearly, people have been doing things with structured data for a long time, right? You know, BI and data warehousing has been in the marketplace, people know what to do with BI and data warehousing.

How do I think about querying data? How do I find relevant entities? How do I walk through a lot of these pieces? That's largely an aspect of something that people do already in structured data. But for the most part, video content has been something dark. People have not really played with it, outside of consuming it. And really, the goal for this session is how do we enable you to do that. For example, how do you find your favorite dog in a consumer video? Or how do you find a video of a sports event across your petabytes of video content if you're a media organization? How can we enable you to do that? And that's really what we're going to spend some time talking about. Another aspect that you guys can imagine more and more, is more and more video content is not actually professionally created content, but it's actually crowd created content. Same way as images have exploded on mobile cameras, people are now creating more and more videos on mobile devices. So we want to start understanding this content from these devices.

And so you can start driving better recommendations, better ad targeting, et cetera, et cetera. And that's really the goal is, how do we do that with video content? So Google provides a pretty broad platform for machine learning. We'll spend a couple of cycles on this so you guys get the context of what are we talking about today. So as an organization, one aspect is we provide you a platform where you can bring your data to Google Cloud. You have a set of data scientists in your organization. They can then build their own machine learning model using TensorFlow, and then we provide you a managed service runtime using our Cloud ML Engine. So the case on the left is really your data, your model graph, running on Cloud ML using an open source framework. The one on the right is really thinking about how we can use some of our own Google internal models- like our apply the [? rate ?] targeted scenarios, and in how you can use that to harness that. So it's really targeted more at app developers.

So basically in the case for vision, we've used models that power things like image search, models that power things like Google Photos. We've exposed that as an API. And so basically you consume that with minimal customization. And so the new [INAUDIBLE] API we want to spend some time talking about is our Video Intelligence API. So Video Intelligence API is all about how you can help use Google models that power all first party Google products like YouTube, Google Photos. And how you can harness the same technology for your own unique scenario. So what are the key use cases that we're trying to solve for Video Intelligence? Really we're thinking about three main things. How do you find the main topics of what the video is about? So for example, what is a video about? What are the key relevant entities within the video? So those are really what– just kind of thing of the fundamental use case that we're trying to solve for first. The second big use case is, if anything vision has taught me, is many organizations really want to monitor inappropriate content.

Especially if you think about your crowd-sourced platform and there's a lot of people uploading content into that device. How do you monitor and manage what inappropriate content that flows through? And that's a key use case that people have to deal with, especially in Cloud [INAUDIBLE] crowd-sourced content. And the third part is how do we find the most relevant moments in a video so you can start publishing it? Unlike a document, unlike an image, if you have a 10 minute long video, how do you entice somebody to watch that video, right? So really thinking about highlights and thinking about video previews. It's a pretty key problem in the video use case. And that's another scenario we'll talk a little bit about. So really key, the three part things we are going to talk about, is how do you find the right entity in the video or what the video is about, how do we moderate inappropriate content, and how you think about finding the best moments within the video. So let's start off with the first aspect, how do you find entities?

So really we have a feature in the Video Intelligence API called label detection. So label detection is basically– if you have used label detection on Vision, that applies primarily on images. I mean, now we have label detection on video content. So really its goal is to find thousands and thousands of different entities in video content. These entities can be things like everyday objects from chairs, to things, to activities like running, jumping, falling. Or even things like scenarios like how do you think about everyday objects like cars, transportation, et cetera. And we'll talk a little bit about those entities later. And how do you call this API? So really the thing is you're passing the URI, which is basically a GCS location, a Google Cloud Storage location for where the video is. And then you can also tell it where do you want to write the output to. So it's really– the one caveat I want to make sure you guys leave with, in V1 we're really focused on batch scenarios and not real time streaming scenarios.
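[Editor's sketch] To make the call described here concrete, the following is a minimal sketch of a request body for batch label detection. It assumes the field names `inputUri`, `outputUri`, and `features` and the feature value `LABEL_DETECTION` from the public `videos:annotate` REST reference; the bucket and object names are hypothetical, and nothing is sent over the network.

```python
import json


def build_label_request(input_uri: str, output_uri: str) -> dict:
    """Build a JSON body for a batch label-detection request.

    Field names are assumed from the public videos:annotate REST reference;
    check the current documentation before relying on them.
    """
    for uri in (input_uri, output_uri):
        # The talk describes both locations as Google Cloud Storage paths.
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a gs:// location, got {uri!r}")
    return {
        "inputUri": input_uri,           # where the video is read from
        "outputUri": output_uri,         # where the annotations are written
        "features": ["LABEL_DETECTION"],
    }


# Hypothetical bucket and object names, for illustration only.
request_body = build_label_request(
    "gs://example-bucket/videos/zoo.mp4",
    "gs://example-bucket/annotations/zoo.json",
)
print(json.dumps(request_body, indent=2))
```

Because the service is batch-oriented in this version, a real client would submit this body and then poll for the finished result rather than expect an immediate answer.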

So really it's whereas you have a piece of video content on Google Cloud Storage that we read from and then we write back to. So there are lots of other streaming scenarios that you can imagine that we're not solving for that in V1, still it's something we're thinking about down the line for future versions. And the other part that we want to spend some time on is I'll talk a little bit about is what do you want to see in a video, and when do you want to see the video. So we give you a couple of aspects. We give you things like video level annotation that tells you this video is about blah. Right? So what is this video about? And then the second part is when in the video did something occur? And for that there are different ways you can do that. There are things called– what we have called shot mode that tells you tell me things relevant to the shot, and the things that are relevant at a frame level, which is basically just applying Vision API on a frame by frame perspective.

And we'll talk a little bit about the differences in when you would use A versus B. So the response that you get back, and I'm going to spend some time on this. Really, the thing that you get back from us, it will tell you the entity, so what the entity is about. Right now all the entities are written in English. Our goal over time is to support additional languages. And then we tell you where in the video something occurred. So we tell you first the level, whether it's a video level entity. Meaning this entity really defines what the video is about, or whether this video occurs at a certain shot. In which case we tell you the start and end time, or just we tell you– if you want to do a frame level entity, we'll tell you it occurs at a certain frame. And one of the things I want to spend some time more on over the next few slides is why do we give you so many options. Because understanding video content can be a little complicated, and that's the reason why we give you all these different permutations of content coming back at you.
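[Editor's sketch] The response shape described above (an entity, plus where it applies: whole video, a shot with start and end time, or a single frame) can be modeled roughly as follows. The class and field names here are hypothetical and chosen for readability; they are not the API's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabelLocation:
    level: str                         # "VIDEO", "SHOT", or "FRAME" (hypothetical names)
    start_sec: Optional[float] = None  # unset for video-level entities
    end_sec: Optional[float] = None


@dataclass
class Label:
    description: str                   # the entity, in English
    locations: List[LabelLocation]


def labels_at_level(labels: List[Label], level: str) -> List[str]:
    """Return the entities that have at least one location at the given level."""
    return [
        label.description
        for label in labels
        if any(loc.level == level for loc in label.locations)
    ]


# Made-up annotations for illustration.
labels = [
    Label("zoo", [LabelLocation("VIDEO")]),
    Label("elephant", [LabelLocation("SHOT", 12.0, 18.5)]),
    Label("mobile phone", [LabelLocation("SHOT", 30.0, 34.0),
                           LabelLocation("FRAME", 31.0, 31.0)]),
]
print(labels_at_level(labels, "VIDEO"))
print(labels_at_level(labels, "SHOT"))
```

Keeping the level on each location, rather than on the label, lets one entity appear at several granularities at once.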

So the first thing we talk about is how do you separate signal from noise. That's one of the reasons why we do this. Because the challenge with video content is if you run an hour long video with, let's say, a shot changes every minute. That's meaning that the camera angle goes from looking like this or looking like this. So can you imagine that's a shot change, the entire camera changes. Those kinds of scenarios, if you think about an hour long video, which changes every hour, every minute, so that's roughly 60 shots, right? The question is, what do you want to understand out of that? So it depends on level of granularity, depending on use case– you guys, people think of different things. So the first thing we do is we give you a video level entity. So what is it good for? So it's really focused around this is kind of truly where we talk about more than just a frame level annotation. Meaning that we just don't use Vision API to look at something and give you results.

We're actually doing additional machine learning logic on top of each frame. And Juhyun is going to spend some time about that. But one of the caveats is this really applies only when the entire video is about a certain thing. And it's really optimized for scenarios where it's short-form content, minimally one to five minutes. So if you give us an hour long TV show and you say, hey, give me this video level entity, it's going to give you very high level generic entities. And that's something you guys should be aware of, because video level annotations, it really makes sense more for short-form content. The other aspect is shot level annotations. So think of a shot as a mini video, meaning that's kind of grouped together based on the shot change and so almost as a mini video. And this is kind of a first level temporal entity. So we'll tell you when in the video something occurred and we'll kind of group it at the shot level. So it will tell you just start and end time of the shot, and within that shot we'll tell you these are the most relevant entities in that shot.

So where does that kind of come into play? You can imagine especially when you have like, let's say, you have an hour long TV show. You really want to– there's going to be a lot of context switching in that TV show, right? People are going to be– some people are going to be suddenly talking inside, people are going to be going outside. Each shot has a different set of relevant entities and what you really care about tell me exactly what's relevant within a shot. Because otherwise you're going to be dealing with a lot of noise coming at you. Because, for example, we'll talk a little bit about an event later where, for example– we have it in a video– but say I am talking right now. And if somebody does a frame level annotation they literally are going to get the same set of entities every frame every second. That's a lot of noise coming back at you, right? So that's really where shots really help you there because it's going to basically aggregate that information and say, hey, guess what this is what's really relevant within this video.
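[Editor's sketch] The noise argument above can be illustrated by collapsing per-frame labels into one entry per shot. This is a simple counting stand-in, not the service's actual shot-level logic; the timestamps, labels, and shot boundaries are made up.

```python
from collections import Counter


def aggregate_by_shot(frame_labels, shot_boundaries, top_k=3):
    """Collapse (timestamp_sec, label) pairs into the top labels per shot.

    shot_boundaries is a list of (start_sec, end_sec) pairs, end exclusive.
    """
    summary = []
    for start, end in shot_boundaries:
        counts = Counter(
            label for t, label in frame_labels if start <= t < end
        )
        summary.append({
            "start": start,
            "end": end,
            "labels": [label for label, _ in counts.most_common(top_k)],
        })
    return summary


# A speaker standing still for 60 seconds, sampled once per second,
# followed by a 5 second cut to the audience.
frames = (
    [(t, "speech") for t in range(60)]
    + [(t, "stage") for t in range(60)]
    + [(t, "audience") for t in range(60, 65)]
)
shots = [(0, 60), (60, 65)]

summary = aggregate_by_shot(frames, shots)
print(len(frames), "frame annotations ->", len(summary), "shot summaries")
```

The 125 near-identical frame annotations reduce to two shot summaries, which is the signal-versus-noise trade the speaker is describing.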

So you're not dealing with all that noise coming back at you. That's why it kind of goes back to separating signal from noise. And the other value is that over time we also can start adding additional signals. So you can start adding more than just visual signals, additional signals can flow in and tell you what the shot is about. And Juhyun will cover a little bit about that later. So this is where we can– so this is really thinking about things about context switching. So think about news shows, TV shows. Think about aspects, like for example, even like basically the video we showed at the keynote demo Sarah did. Like an ad has multiple different context switches. A shot based entity would be very relevant in that scenario. And finally we have the lowest level of granularity, which is frame level. So frame level is really thinking about there's no additional ML logic. It's really just banging, taking every frame at one FPS. And you're banging against an image detector and saying, hey what do you see in this frame of the video.

So there's a little less logic, a little less intelligence in this aspect, but there are a lot of scenarios where this makes sense. For example, there's a lot of scenarios in, say, surveillance where the video is not moving, it's always pointing in a certain direction. And which case if I do a shot change, there is no shot change there's just always one shot, right? So at that base, frame level entities make a lot more sense. The other thing is consumer videos. For example, if it's a long form video where you really want to get a different– you want to understand exactly what's happening at a much more granular level than even at the shot level. Some of these frame level entities make sense. So this is cloud. And I kept reading a lot of tweets yesterday saying there was not enough demos of the keynotes. So the goal today is to show you guys some demos. So if we switch to my demo machine, and if I can actually log in. OK. So by the way the whole goal for this session is how do I get back to BigQuery.

Because I love BigQuery so my goal is in the end I'm going to end up back in BigQuery. So let's first go with this video. So in this case this is a video about– I'm sorry, let me just refresh it. This is not a good sign. I can't believe this. OK. While that's coming up, so for example, let's go to this one. So this case, this is an ad for one of our Google Firewalls. We're basically just pulling ads from our Google channel. As you can imagine, videos have a much higher rate of copyright infringement, et cetera. So we're primarily using our own Google content. So in this case, we're just showing you an ad of– this is an example where context is switching a lot, right? It's an ad, so you can imagine every three or four seconds you switch context pretty significantly. And we basically– the key aspect is we're showing you how this changes. So for example, what you can do is basically draw a list of all the entities that are in this video. So, for example, in this case this is primarily our entire JSON dump of every single entity that we detected.

We tell you the shot level, video level about this entity. The one gap we have here is, for example, because it's changing context so much, our video level entities don't add a lot of value. Does it make sense? Because it's such a– you can imagine which things from outside to inside and a lot of things changing around. Over time you can imagine as we improve our video models and we start taking in additional signals through our reader models, we can pick up additional signals to make our reader classifier more smart. And that's something we are constantly working on. So in this case, for example, we find a car. We tell you exact– there's a shot boundary where the car is. And the other aspect is, for example, we see homes, a lot of different pieces that's flying around here. The other thing is, for example, we know in that aspect we're actually in the car because the scene is changing so fast. So the video will actually think we're in a car and kind of driving, that scenario.

So [INAUDIBLE] so you can imagine common scenarios. So our use case is where we drive that. So let me see if I can come back to this case. So in this case, this is a video about what's going on inside a zoo, right? The entire video is actually happening inside the zoo. It's a nice friendly video about animals. So in this case our video level entities actually makes a lot of sense because we're not changing context for the entire time. Our context is a homogeneous video about a tour of a potential wildlife thing. So we can actually find out it's about a zoo, it's wildlife, it's tourism, it can give you much more relevant video level entities in this case. But we also can do things like shot level entities. So in this case, we can look at these pieces, and you can imagine each of the slices are shots. Again, this is an ad, so the shots in this video are changing constantly, right? You're going to go every five/six seconds is changing a lot on you. And so you can imagine as we go from shot to shot we start telling you all the different pieces that flow in.

And a couple of things that's relevant if you think about digital marketing scenarios, right? It's a zoo video, but there are a lot of things about mobile phones and devices in there. So if you really kind of mine the data to see what are all the entities in there, this is a great way to kind of capture all that information. Because you would never have thought, why would I see a mobile device in my zoo video? So here is another aspect of a long video. For example, in this case, it's actually a video of [INAUDIBLE] giving a speech at a conference. And this is an example for where the number of shots– and this is actually not that many, right? There's not a lot of shots, for example, in this case she's actually standing in one place for pretty much all of it. She's just– the video follows her for a full minute. And so you imagine there's really no shot changes in that. So it's actually just about her talking. So you can imagine where a frame level annotation in this context would be extremely noisy, because you're just going to get bombarded with 60 frame annotations that tell you exactly the same thing.

It's not really relevant anymore. And that's why shot level aggregation is going to be useful in that aspect. And that's also a unique proposition from the video perspective, we think, compared to some of our competitors in the marketplace today. So that's one aspect. So what can you do with all this metadata? And as I said, my goal is always to come back to saying if you go back to the scenario on how we talked about structured data, how do we enable these things in really structured data? All of this is just JSON that's flowing through, right? So you can imagine we can start basically making this JSON and flatten this JSON. It's as simple– you can take as JSON out, flatten it, and then load it into BigQuery. And now that it's in BigQuery I can do things. I can just write a regular SQL query like I would do everything else. So now I want to understand, for example, in this SQL query all I'm trying to do is, hey, tell me off of all the shots what's the most common.

By scene, what's the most relevant entity in all my shots. So I can do a simple SQL query, come back, get that information out. So really what I'm trying to do here is I've basically taken all this video metadata, taken all that, plugged it, did the video API, and brought it back into BigQuery and run SQL queries. As I said, the world revolves around SQL. Nobody got that joke. So I've been in SQL for a while. Here is another example of that. So this is kind of things that you could imagine you can easily do with the video intel. That's really the value that we're trying to do, is we're trying to start making sense of all this metadata that you have in your organization. If you think about it, you now have petabytes and petabytes of media content. You can run it through Video Intelligence, start harnessing those entities, flow it into a database, and start to actually think about how you can now index your videos more intelligently, feed it to your media portals, feed it to your query engines, do analytics.
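[Editor's sketch] The flatten-then-query workflow described above can be tried locally as follows. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for BigQuery, and the nested annotation document, its key names, and the table schema are all hypothetical.

```python
import sqlite3

# Hypothetical nested annotation output for one video.
annotations = {
    "video": "gs://example-bucket/videos/zoo.mp4",
    "labels": [
        {"description": "zoo", "locations": [{"level": "VIDEO"}]},
        {"description": "elephant", "locations": [
            {"level": "SHOT", "start": 0.0, "end": 6.0},
            {"level": "SHOT", "start": 6.0, "end": 11.0},
        ]},
        {"description": "mobile phone", "locations": [
            {"level": "SHOT", "start": 6.0, "end": 11.0},
        ]},
    ],
}


def flatten(doc):
    """Yield one flat row per (label, location) pair."""
    for label in doc["labels"]:
        for loc in label["locations"]:
            yield (doc["video"], label["description"], loc["level"],
                   loc.get("start"), loc.get("end"))


rows = list(flatten(annotations))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE labels "
    "(video TEXT, entity TEXT, level TEXT, start_sec REAL, end_sec REAL)"
)
conn.executemany("INSERT INTO labels VALUES (?, ?, ?, ?, ?)", rows)

# Which entity appears in the most shots?
top_entities = conn.execute(
    "SELECT entity, COUNT(*) AS shots FROM labels "
    "WHERE level = 'SHOT' GROUP BY entity "
    "ORDER BY shots DESC, entity LIMIT 5"
).fetchall()
print(top_entities)
```

Once the rows are flat, the same `GROUP BY` query should carry over to a warehouse table with little change; only the loading step differs.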

Think about scenarios where you can now correlate your entities in the video to potentially ratings. So you can figure out when certain shows are going bad based on ratings, or can find which entity were people looking at. You can start doing more interesting scenarios. Now that you have this as metadata you can do a lot more interesting scenarios on the correlation, driving the downstream machine learning models, et cetera. Back to slides. So with that, I want to ask Juhyun to come on stage. Juhyun's going to talk a little bit about some of the underlying technology that we're doing, and also talking about where we're thinking of taking it down the way, too.

JUHYUN LEE: All right. Hi. Let's go back. So hi, I'm Juhyun from the Video Intelligence team. I'm a software engineer in Machine Perception, which is an organization under Google Research. And our mission is to make machines understand unstructured data such as video, audio, images, et cetera. I'm here to explain what's going on under the hood so that it doesn't look like the wall of Google is selling black magic to consumers.

So given a video in your GCS bucket, the first thing that we do is we decode the video. And usually a video container contains like audio stream, video stream, subtitles, and all of the metadata. And we are first interested in the video stream. And a video stream is often a sequence of images, a lot of them. For a five-minute video at 24 frames per second you will have like 7,000/8,000 images in them. And they will probably in sequence look a lot similar. So classifying each of those is not too efficient, let's say. So we subsample the images, say at one FPS. So at one frame per second we apply an image classifier, an image classifier that is very similar to what you can get through the Google Cloud Vision API. And we do it for every sampled image in the entire video. And for each frame, you will get image features such as this is an elephant with 97% accuracy, or animal, or river in that case. And if you keep going on, you will get like other labels for crocodiles and lions. And at the end of the video what we do is we have an aggregating video classifier, and this is really what sits at the core of this particular API.
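[Editor's sketch] The arithmetic behind the sampling step above is easy to check: a five-minute video at 24 frames per second has 7,200 frames, and sampling at one frame per second keeps 300 of them. The helper below is hypothetical and only computes indices; it does not decode any video.

```python
def sampled_frame_indices(duration_sec, native_fps, sample_fps=1.0):
    """Return (total_frames, indices of the frames kept when subsampling)."""
    total_frames = int(duration_sec * native_fps)
    step = native_fps / sample_fps          # native frames between samples
    n_samples = int(duration_sec * sample_fps)
    indices = [int(i * step) for i in range(n_samples)]
    return total_frames, indices


total, kept = sampled_frame_indices(duration_sec=5 * 60, native_fps=24)
print(total, "frames in the stream,", len(kept), "frames classified")
print("first sampled indices:", kept[:4])
```

Classifying 300 frames instead of 7,200 is a 24x reduction in classifier calls, at the cost of missing anything that happens entirely between two samples.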

And given all these features, throughout the entire video it will then say, hey, this is a documentary about animals in Africa. Another similar example would be if you have a video of people in costumes or pumpkins and candies, this classifier will be able to tell you, hey, this is a video about Halloween. And this is only possible because we have trained this model on millions of YouTube videos, and human rated and also like verified. Ram also talked about the shot level annotation a minute ago. And those are essentially treating each shot as a mini video, and then running with the same logic. Of course, there's more machine learning and computer vision techniques involved in this, I am just very briefly summarizing this. But you are essentially getting what's behind YouTube and Photos with this release. So far I've only talked about the visual features, but we are also considering adding audio or motion or speech or subtitles to this classifier, and do a final fusion in this aggregating classifier.
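[Editor's sketch] The aggregation step described above takes per-frame image labels and produces labels for the whole video. The talk says the real aggregator is a trained classifier; the sketch below swaps in plain mean pooling over frames as a deliberately simple stand-in, with made-up scores, just to show the data flow.

```python
from collections import defaultdict


def video_level_labels(frame_scores, min_mean=0.5):
    """Average each label's confidence over all frames and keep the strong ones.

    frame_scores is a list of {label: confidence} dicts, one per sampled frame.
    A label missing from a frame counts as 0 for that frame.
    """
    if not frame_scores:
        return []
    totals = defaultdict(float)
    for scores in frame_scores:
        for label, confidence in scores.items():
            totals[label] += confidence
    n = len(frame_scores)
    means = {label: total / n for label, total in totals.items()}
    kept = [(label, mean) for label, mean in means.items() if mean >= min_mean]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up per-frame outputs for a wildlife video.
frames = [
    {"elephant": 0.97, "animal": 0.9, "river": 0.6},
    {"crocodile": 0.9, "animal": 0.9, "river": 0.8},
    {"lion": 0.95, "animal": 0.9},
    {"elephant": 0.9, "animal": 0.8},
]
result = video_level_labels(frames)
print(result)
```

Only the label present across the whole video survives pooling, which matches the earlier caveat that video-level entities trend generic. A label like "documentary about animals in Africa" never appears in any single frame, so it needs a learned model rather than averaging.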

So, let's assume that you have a video of an animated music video. It will be sad if we give you a label saying that it's an animated cartoon. You definitely want your audio signal in there. So we are looking at various audio signals to improve the quality of the model. And one thing that I wanted to emphasize is with this release, we are not keeping the models as is. We are constantly improving the model, especially when Google internally releases a new model for YouTube or Photos, then we also update the models for the Google Cloud Video Intelligence API. And also your feedback is important. Your feedback may make me happy or not so happy. Those are incorporated in the next releases. So yeah, feel free to reach out to us to improve. All right. Yeah, that's about leaking the secret recipe that we have. Now I'm going to talk a little bit about what's coming next, a preview on video previews, also known as video summaries or thumbnailer. So assume that you have a bunch of customer videos and you have to show it in a list, you need to make a listing page.

Now you need thumbnails for each of those videos. How do you pick the right video frame for the video that you don't know about? You could pick the first frame in the video, or you could apply the Wadsworth Constant, and skip to 30 second– 30% into the video and then show that frame. At Google we use machine learning. And what we do is we have a machine learned classifier that tells what's the interesting-ness of this particular video frame as a thumbnail. So we ran it through a lot of YouTube videos, human rated and also verified, these are actually positive with A/B testing. And, yeah, so here on the screen you can see videos, the link to the YouTube video actually. And in the middle column you see a static thumbnail. So the single frame that has the highest interesting-ness score. And on the rightmost column you see moving thumbnails. What's the most interesting three second clip of that video based on the machine classifier? For example, let's show earth view in Google Maps. So this is the video when Google Maps released the earth view, and that's the corresponding static thumbnail and the dynamic thumbnail.
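[Editor's sketch] Given an interesting-ness score per sampled frame, picking a static thumbnail is an argmax, and picking a three second moving thumbnail is a sliding-window maximum. The scores below are made up, and a single score list is used for both selections purely for simplicity; it is not the classifier described in the talk.

```python
def best_static_frame(scores):
    """Index of the single frame with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def best_clip_start(scores, window=3):
    """Start index of the `window`-long run with the highest total score."""
    if len(scores) < window:
        raise ValueError("video is shorter than the requested clip")
    current = sum(scores[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(scores) - window + 1):
        # Slide the window one step: add the new frame, drop the old one.
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start


# One hypothetical interesting-ness score per second of video.
scores = [0.1, 0.2, 0.9, 0.3, 0.2, 0.6, 0.7, 0.6, 0.1]
static_idx = best_static_frame(scores)
clip_start = best_clip_start(scores, window=3)
print("static thumbnail at second", static_idx)
print("moving thumbnail covers seconds", clip_start, "to", clip_start + 2)
```

Even with one shared score list, the best single frame (second 2) falls outside the best three second clip (seconds 5 to 7), because a clip is judged on its total rather than its peak.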

So this is when we announced the Cloud Vision API in December of 2015. The whole video is about presenting what Cloud Vision API can do, and a demo on this small little Raspberry Pi robot. So it correctly picks up on the interesting thing, the only thing in the video, or like the moving thumbnail. One thing that you may ask is why is the static thumbnail not a subset of the actual animation on the right. And that's because the two classifiers are doing completely different things. For example the right-most column also incorporates motion in deciding should this be the moving thumbnail or not, and you cannot do that in a static thumbnail. So with that, let me hand it back to Ram for more use cases. [APPLAUSE]

RAM RAMANATHAN: If you just go back to one second on the right. I think a couple of things that I really loved about this demo is that this Cloud Vision API video was really all about the robot. And the fact that it picked up that the key thing about it is actually about the robot, the Raspberry Pi [INAUDIBLE] robot that shows up, is actually pretty unique.

Because it kind of somehow figured out that this video is really about how we think about Vision API applying the robot on a little Raspberry Pi. And it figured out that that's the most relevant thumbnails to pick up. It's actually, I mean, if you think about this problem– if you go back to the slides– if you think about this problem for a organization that's building media portals and people are dealing with millions of video content. I mean, if you have petabytes of content in your media archive, how do you consume it? And that's really what we're thinking about, how do we make life easier? So the two use case I'm going to spend some time talking about. First is thinking about content recommendation. So what I mean by that? So for example, you think of us, think of this as a mini YouTube, right? It's really not YouTube in a specific infrastructure, but think of a YouTube. Basically, you've got a set of videos on Google Cloud Storage, you want to start deriving some metadata on this video.

It then flows into potentially BigQuery to do some analysis, and then there's a bunch of data analysts to look at this content, do some queries on that. And then they create some features, they feed it into a Cloud ML model to drive recommendations. At the same time they're collecting things like clickstream behavior to drive the scenario, and it kind of flows through that thing. So that's really thinking about a very common recommendation scenario. Do people actually do that? So here is a public research paper that YouTube published some time back. So in this case, think of this video [INAUDIBLE]. This video [INAUDIBLE] is based on billions and billions and billions of videos that people load into YouTube. And you and me as a user come into YouTube to figure out what to watch, right? So the first thing people actually do is something like collaborative filtering where they're really thinking about, how do I apply what's kind of the set of hundreds of potentially targets that we can show to the end user?

So that's really where people are thinking about pretty sophisticated TensorFlow based models to really think about that collaborative filtering to drive that scenario. Then the next thing is really want to think about this, how do I then do ranking. So all of these potentially hundreds of content, what's the 25 that I show to the end user right there? That's where they apply second cycle downstream search models that then sort of do a kind of scenario on ranking. But a key aspect to keep in mind is to drive both these machine learning models they need to get video features to flow. And they need metadata on what the video is about to do the recommendation. Because the recommendation is both user behavior and metadata about the video. And how do you get metadata about the video? You have to use something like Video Intelligence. But that's another aspect. The other big scenario is media portal, just take another example from the world of– thing is I have millions of content.
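[Editor's sketch] The two-stage flow described above (narrow a large catalog to a candidate set, then rank the candidates) can be sketched with video labels as the metadata. This is a toy stand-in: label overlap replaces the collaborative filtering model, a similarity-times-click-rate product replaces the ranking model, and the catalog, labels, and click rates are all invented.

```python
def jaccard(a, b):
    """Similarity of two label sets: size of overlap over size of union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def recommend(watched_labels, catalog, n_candidates=3, n_final=2):
    # Stage 1: candidate generation. Keep videos that share any label with
    # what the user watched, preferring larger overlaps.
    overlapping = [
        (len(info["labels"] & watched_labels), video_id)
        for video_id, info in catalog.items()
        if info["labels"] & watched_labels
    ]
    overlapping.sort(key=lambda pair: (-pair[0], pair[1]))
    candidates = [video_id for _, video_id in overlapping[:n_candidates]]

    # Stage 2: ranking. Combine metadata similarity with a behavior signal.
    def score(video_id):
        info = catalog[video_id]
        return jaccard(info["labels"], watched_labels) * info["click_rate"]

    return sorted(candidates, key=score, reverse=True)[:n_final]


# Hypothetical catalog: labels as produced by video annotation, plus a
# click rate gathered from user behavior.
catalog = {
    "zoo_tour": {"labels": {"zoo", "wildlife", "tourism"}, "click_rate": 0.10},
    "safari":   {"labels": {"wildlife", "africa", "elephant"}, "click_rate": 0.30},
    "car_ad":   {"labels": {"car", "driving"}, "click_rate": 0.50},
    "aquarium": {"labels": {"zoo", "fish"}, "click_rate": 0.15},
}
picks = recommend({"zoo", "wildlife"}, catalog)
print(picks)
```

The car ad has the best click rate but never reaches ranking because it shares no labels with the user's history, which is why both stages depend on having video metadata at all.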

I want to start exposing content in a lot of different interesting ways. How do I do that? That's where some of the things that Juhyun just pointed out on thumbnails, around all kinds of moving images really start coming into play on video summaries. You just can't show a two by three pixellation view of any video and expect people to click on it. What you really want to do is start enticing them to click on that. How do you entice them to click on a video? You have to show what's relevant in the video as a thumbnail, because people are drawn visually. They're going to first look at something and then read the captioning. Very few people are going to read the captioning first and then look at an image. So you want to make sure you're showing the most relevant image of that video, and then you want an easy way for someone to get a feel for what that video is about. That's why those things on moving images come into play, because you really want a way for an end user to kind of hover over a video and get a really quick glance of what this video is about.

And then you draw them to the actual video. That's why these kind of techniques when you think about building media portals, video previews, really start coming to play. With that, what I want to do is [INAUDIBLE] talk about recommendations, media portals. There are lots of other scenarios that we want to enable with this, but you think about a lot of customers thinking about how they drive media workflows. One of our customer references is [INAUDIBLE]. So they're really thinking about how to use metadata from these videos. They drive it into their media workflow engine. So they enabled media publishers and media organizations to drive media workflows based on entities that flow in. One other customer who's going to come on stage is Wix, one of our favorite customers. They've been with Google Cloud Platform for many years. They were also an early partner with us in the Vision API. They are part of our journey with the Vision API. They were actually on stage same time last year at GCP Next. We were talking about how do you use Vision API to drive interesting scenarios in their organization.

But as they're looking at the launch of their new media portal with the Wix Media Platform, they've also been an early adopter of our Video Intelligence API. So with that, I'd like to invite Lynne Hurwitz from Wix to talk about how they're using Video Intelligence in their organization.

LYNNE HURWITZ: Thank you. Thank you very much, Ram. It's a pleasure to be here. My name is Lynne. I'm a developer at Wix, and I'd like to start with a quick question. How many engineers would you think it takes to create a great website? Very good. The answer is zero, if you're using Wix. For those of you who don't know Wix, let me explain. Wix is a company that allows everyone to create their own great website, providing them with the simplest tools to do so. Now, in this day and age, in order to have a really great website, what you need to do is to have high quality media in it, and a lot of it. Now, what Wix has to do also is support all of our 100 million users, who upload millions and millions of files each and every day.

So as you can imagine, in order to support all this, in order to process all this, we need a very, very strong infrastructure. And we did develop one. We call it the Wix Media Platform. This is what I'm going to talk to you about today. I'm going to tell you how you can use it. Not only Wix, each and every one of you can use it, everyone can use it. And I'm going to tell you how you can build your own media portal on top of it. So for example, meet Felix. He's a chef, an entrepreneur, who wants to build the next big thing. Now he has this great idea to build a platform in which people gather up, study, and teach how to cook. So he decided to do it with the Wix Media Platform. First things first, his whole platform is so focused on users. So obviously he would like to have user profiles and he'd like to have profile images in them. Now, from his experience, he does know this kind of thing requires file upload, image manipulation. And he was really happy to see that the Wix Media Platform does everything for him.

You can upload, using the media platform, every kind of file, specifically images, of course. And when you actually upload an image it automatically goes through the image API, through the Vision API. Now the Vision API generates different labels for this image. I'll talk about that later. And also it can detect faces from the picture. When you already have the face detection, all that's left to do is the image manipulation. In this case, we're going to use the Wix Media Platform image manipulation tool. This tool allows you to do pretty much every operation you probably know from Photoshop. But you do it on the fly. So what we do here is pretty simple. We just crop the image, resize it, and bam, you have your picture and it's ready for use. Now, I'll take a minute just now to stop everything and explain what's going on behind me. As you can see, we're in the Wix Media Platform dashboard. And you can see that we are looking at demo widgets. This is the playground image manipulation widget that is actually made by Wix.
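The crop step described above can be sketched in a few lines. This is only an illustration, not the Wix Media Platform's implementation: the `Box` type, the `square_crop_around_face` function, and the `padding` parameter are hypothetical names, and the sketch assumes face detection has already returned a pixel bounding box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def square_crop_around_face(face: Box, image_width: int, image_height: int,
                            padding: float = 0.5) -> Box:
    """Return a square crop centered on the face, clamped to the image bounds."""
    # Side of the square: the larger face dimension plus padding on every side.
    side = int(max(face.width, face.height) * (1 + 2 * padding))
    # The square can never be larger than the image itself.
    side = min(side, image_width, image_height)

    center_x = (face.left + face.right) // 2
    center_y = (face.top + face.bottom) // 2

    # Shift the square back inside the image if it would spill over an edge.
    left = min(max(center_x - side // 2, 0), image_width - side)
    top = min(max(center_y - side // 2, 0), image_height - side)
    return Box(left, top, left + side, top + side)
```

The resulting box would then be handed to whatever resize operation the image tool provides. Keeping the crop square is a design choice for profile pictures, not something stated in the talk.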

Wix is using it in order for you guys to try it out, see the features of the platform, but it's obviously more something to do with Wix's needs. Obviously everyone who uses this platform can decide what kind of features he wants to show and how he wants to show them. So this is only an example. As you can see here, on the left, you can see the different manipulations you can do and everything, and under the image you can see this HTML tag. Probably a little small, but what it does is reveal the REST API used by our platform. You also have different SDKs, more sophisticated SDKs, that you can actually do a lot with on our platform. In this case, Felix used the JavaScript SDK. And as you can see, in order to accomplish all we talked about, he only had to write these three lines of code. Now, to the next big feature of the system, video tutorials. He wants his cooks to be able to upload videos. We already know the platform allows that, OK, we have image upload, any file upload.

But the real nice thing about the media platform is how easy it makes it to work with videos. You see, the platform provides you with everything you'll ever need for a video, especially if what you want to do is to create a portal, a media portal. What it does is transcode the video to two different qualities for you. It creates a short tutorial from the highlights in the movie, and also it creates poster images for you from different scenes in the movie. Now, I'm really excited to say that the Video Intelligence API made it so much better for us. Because up until now, we actually made these poster images from random places in the video. But now, thanks to the Video Intelligence, we can know what are the main scenes in the movie and we can just take the main picture from them. So we have the main poster images, most relevant ones, which is great. So now when we have our videos up and running, yay, we have everything ready to go, right? Not quite. We're still missing something.
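One way to picture the move from random poster frames to scene-based ones is the sketch below. It assumes the video analysis step has already produced a list of shots as `(start_seconds, end_seconds)` pairs, and it treats the longest shots as the "main scenes". That heuristic, along with the names `Shot` and `poster_timestamps`, is an assumption made for illustration; the talk does not say how Wix ranks scenes.

```python
from typing import List, Tuple

# A shot is a (start_seconds, end_seconds) pair; hypothetical input format.
Shot = Tuple[float, float]


def poster_timestamps(shots: List[Shot], max_posters: int = 3) -> List[float]:
    """Pick the midpoint of the longest shots, returned in playback order."""
    # Rank shots by duration and keep the longest ones as the "main scenes".
    longest = sorted(shots, key=lambda s: s[1] - s[0], reverse=True)[:max_posters]
    # Grab the frame in the middle of each shot, away from the cut points.
    return sorted((start + end) / 2 for start, end in longest)


shots = [(0.0, 2.0), (2.0, 12.0), (12.0, 15.0), (15.0, 35.0), (35.0, 36.0)]
print(poster_timestamps(shots, max_posters=2))
```

Each returned timestamp would then be passed to a frame extractor to produce the actual poster image.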

How can we allow our users to enjoy the content? Think about it, we're going to have such a huge platform. We are going to have so many tutorials in it. How can they find the relevant content for them? So obviously, as you see, we need a search capability. Now, thanks to the Video Intelligence API once again, every time we transcode a video, we also pass it through the Video Intelligence API. Now this Video Intelligence API, as you can see, provides us with relevant labels for the video. Now, this is great because our platform obviously has the metadata for every file you upload, but using the metadata and the labels together, we have a great search capability on top of the Wix Media Platform. Now the next thing about this, because we're eventually using Wix, it's also supported in SEO. So different search engines can also see these results. Now when we have a search capability in our system, a nice way to show it would be using the poster images we have. So Felix here uploaded a video tutorial of spaghetti bolognese.
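A minimal sketch of searching over file metadata and video labels together might look like this. It is a toy in-memory inverted index, not the Wix Media Platform's search; `MediaIndex`, its methods, and the sample titles and labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class MediaIndex:
    """Toy inverted index over file titles and machine-generated labels."""

    def __init__(self) -> None:
        # Maps a lowercase word to the ids of the files that mention it.
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> List[str]:
        return text.lower().split()

    def add(self, file_id: str, title: str, labels: Iterable[str]) -> None:
        """Index a file under every word of its title and of its labels."""
        for text in (title, *labels):
            for token in self._tokens(text):
                self._postings[token].add(file_id)

    def search(self, query: str) -> List[str]:
        """Return ids of files that match every word in the query."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)


index = MediaIndex()
index.add("v1", "Weeknight dinner", ["spaghetti", "Italian cuisine"])
index.add("v2", "Knife skills", ["cooking", "kitchen knife"])
print(index.search("italian spaghetti"))
```

Because labels and metadata land in the same index, a query can match a video whose title never mentions the dish, which is the point made in the talk.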

And he has these poster images generated out of it. Now imagine someone searches for Italian cuisine or spaghetti. How can we find the best fitting poster image for him? Well, remember I told you images go through the Vision API and have labels extracted for them? Well, here we go. So we take these labels, take our labels from the Video Intelligence API, match them, and that's it. We can have the best fitting poster image for our search results. Now, another great feature powered by the Video Intelligence API, and one I think is extremely important, is safe search. Felix here obviously does not know all the users in his platform, and cannot control the content they upload. But he can still be very calm and know he has full control in his system, because the media platform would mark any offensive content uploaded. That means he can just decide what he does with it. He can let it go, he can mark it, maybe add a warning for the users, or maybe even filter it out altogether. It's his decision to make.
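The label matching described above can be sketched as a simple overlap score. This is an assumed scoring rule, not the one Wix uses: the function `best_poster`, the tie-breaking order, and the sample labels are hypothetical, and labels are compared word by word after lowercasing.

```python
from typing import Dict, Optional, Set, Tuple


def _words(labels: Set[str]) -> Set[str]:
    """Split labels such as "Italian cuisine" into lowercase words."""
    return {word for label in labels for word in label.lower().split()}


def best_poster(query: str, video_labels: Set[str],
                poster_labels: Dict[str, Set[str]]) -> Optional[str]:
    """Return the id of the poster whose labels best fit the query and the video."""
    if not poster_labels:
        return None
    query_terms = set(query.lower().split())
    video_terms = _words(video_labels)

    def score(poster_id: str) -> Tuple[int, int]:
        terms = _words(poster_labels[poster_id])
        # Query matches count first; labels shared with the video break ties.
        return (len(terms & query_terms), len(terms & video_terms))

    # Iterating over sorted ids keeps the result deterministic on exact ties.
    return max(sorted(poster_labels), key=score)


posters = {
    "p1": {"Chef", "Kitchen"},
    "p2": {"Spaghetti", "Pasta", "Food"},
    "p3": {"Table", "Food"},
}
print(best_poster("spaghetti", {"Food", "Pasta", "Cooking"}, posters))
```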

Last but not least, now Felix would like everyone to enjoy the content. So this is already taken care of. Wix Media Platform takes care of hosting and serving files flawlessly, no matter where your users are, no matter what device they're using. This is thanks to the Google CDN, of course. Now that's pretty much it for Felix, but I do want to remind you this is only an example. The Wix Media Platform is a powerful, versatile tool that every one of you can use yourselves. It allows you to manage your data, to have image manipulation, video transcoding, to tag and label your data and make it searchable, also supporting SEO, and it serves your data flawlessly. The platform provides you with unlimited possibilities. I invite you to use it, and I'm looking forward to seeing what great things you will build. Check us out on, and also today in the Google Cloud Launcher. Thank you very much.

RAM RAMANATHAN: Thank you. Perfect. So coming back to kind of key use cases- going back to the key use cases we talked about from a video perspective.

Really want to help you find the main topics from a media perspective. What the video is about, when the video did something awkward. Drive things around inappropriate content, and really thinking about how you can start highlighting the video, things around thumbnails, video previews, et cetera. So really these are the kind of key use cases we're going after now, but it doesn't mean that we're not looking at other use cases. We're constantly evolving the product. And part of what we call our private beta is to get more feedback from folks like yourselves to start playing with the API so you can tell us what else to add. What are the gaps? What are the things- that's kind of really why we're trying to go through this private beta, for people to give us feedback and start evolving the product. So you can join the private beta, just go to intelligence. You just kind of go there, and then you can basically join the alpha, and we'll screen and start on-boarding folks next week.

Basically, once I get back home.



In this video, you’ll learn how to build a comprehensive catalog of your content by understanding the most important entities within your videos and identifying ads as well as inappropriate content using Google Cloud Vision API.

