Web Hack Wednesday

Custom Speech Service and Beezy

(Video Transcript)
Hello and welcome to Web Hack Wednesday. I’m Martin Beeby.

>> And I’m Martin Kearn.

>> And today we are going to be talking about speech.

>> Good.

>> Yes.

>> Yes, is that cuz we’ve got new microphones?

>> Well, no, although these are beautiful microphones.

>> They are beautiful, very shiny.

>> But no, not because of that. I’ve just been doing loads of stuff with speech recently.

>> You’ve been talking a lot.

>> [LAUGH]

>> What’s new about that?

>> Technically I’ve been doing a lot of things about speech.

>> Right, okay, yeah.

>> And one of the things that we do a lot of is these envisioning, POC, proof-of-concept, hacks, and-

>> We love a hack.

>> Well yeah, so with these sort of engagements that we do, they’re like a week long. We bring a partner into the company and we work on a problem they’ve got. It’s not really a hack. But it’s kind of the idea is that-

>> Proof of concepts. I think of them as like proof-of-concept exercises. So you spend a few days with a company, try to figure out what their-

>> Yeah.

>> Help accelerate them, trying to figure out what they’re working on, and get them a few steps forward in that process.

>> Yeah, and these kind of hacks are, yeah, this particular one which I worked on last week I think was four days. And they came in with this really common problem, which is becoming more and more common, which is: we want to do human-computer interaction with speech. So I wanna speak to a computer and make it do something. And there’s lots of ways that we can achieve that. But we focused on the idea of building a bot. And they wanted the idea that someone was in their car speaking to their phone, their iPhone. And the iPhone would be able to take directions from them about their application.

>> Okay, yeah.

>> So if you look at, so this is Beezy. They are a company which bring together loads of different Microsoft products. It’s basically like a portal. I don’t know the best way to describe it. I’m not doing a very good job.

>> [LAUGH]

>> Imagine it as like a portal on top of SharePoint. But it’s like Office 365 and loads of different things. And it basically stores all the information, or helps companies find information inside of their document store. So people can find documents or ask questions about the company. They’re trying to make all of the information inside of a company easier to surface to the user. And they have lots of different ways of surfacing it. And one of the ways they thought would be really interesting would be through their mobile application, through a bot. So they wanted to build a bot that you could speak to, and then that bot would be able to surface the document.

>> So it’s via a mobile app.

>> Yeah.

>> Not via Facebook or Skype or something like that.

>> They actually are interested in maybe putting it in something like Skype for Business. Because all of their customers have Skype for Business because of the way they sell that product. But immediately they wanted to integrate it into an iPhone application, which they build using Xamarin.

>> Okay, right.

>> So that was the challenge: we want a bot you can speak to, and that’s where the sort of challenge comes. So if you look at my, the slide that they gave us was this. And they had this whole journey, but this was the one which kind of made me think, okay, I don’t know exactly how I’m gonna do this. Cuz John is on his way to a sales meeting. He pulls out the Beezy app to talk to the Corporate Bot.

>> Okay.

>> And he says, this is one of the scenarios. There was about eight that we worked on, but this is the one which I selected. Which is, have we worked with Partner X in France? Or, have we worked with the partner Microsoft in France before? Now, they have the information to answer that question somewhere in their system, and it was about surfacing that information. But if you think about speech to text, when it comes to have we worked with, yeah, speech to text is gonna work probably all right there.

>> Yeah.

>> Partner Microsoft, it will probably work with Microsoft cuz Microsoft’s a big enough brand to be included with-

>> But some other company, yeah.

>> All the speech recognitions. The company I used to work with, Misco, wouldn’t be a big brand. And it wouldn’t be included in this thing. So will speech to text really work in this scenario?

>> How will it differentiate between the company names and just general words?

>> Yeah, how do we train our speech system to understand this sort of domain-specific language?

>> Yeah.

>> And there’s lots more examples of this in the application. This is just one of the scenarios, but this is the first place that I thought. So the idea being they say, have we worked with a partner in France? It says, let me find out. It says yes, we found this Partner X is helping us with Client X. And then it’s got some documents which it found in the system relating to that. So it’s surfacing all this company-wide information. So one of the other sort of questions they want to ask is, does anybody in our company have skills in X? And that skill is tenant specific. So each of these Beezy implementations is tenant specific. So they might be selling to a company X and company Y. And company X have different skill sets than company Y. So the bot’s gonna have to know different skill sets for different tenants.

>> Yeah, sure.

>> So again, it’s adding to this sort of level of complexity. So I’ve fictitiously created some anonymous code to help with this. And fictitiously create some partners and some skills to explain this problem. So these are the partner names that I want these scenarios to work with. Microsoft, who I work with now. Misco, who I used to work with. IT AID, my first employer. Satra, who I built-

>> It’s like a little trip down memory lane for you, isn’t it?

>> Satra, who I built a coefficient of friction machine in VB6 for. National Rail Enquiries, my first client that I worked with at Microsoft. MySpace, which is the job that I never took. It was my choice between Microsoft and MySpace. That worked out probably quite well. I chose Microsoft. Corby Borough Council, the place that I’m from. And I’ve done some business with them at the time. Holland and Barrett, where I used to be a sales assistant.

>> Really?

>> In the store. Yeah, I’m trained in homeopathy and in product sales on sort of vitamins and minerals and things.

>> I can’t imagine you as a salesman at all, but.

>> I was really good, though. I knew all the Holland and Barrett, I won’t tell you their sales techniques, but I had them mastered. I had the badge.

>> Martin, happy to help.

>> Yeah, the equivalent of McDonald’s five stars. I had all that. They actually make you sit exams, genuinely.

>> Really?

>> Yeah, it was quite tough. And Argos, which was my first job when I was 16 working in the warehouse. And then ultimately I got to sell the magazine as people walked into the store. That was my job.

>> That was your job?

>> Yeah, cuz I was getting more sign ups than anyone.

>> [INAUDIBLE] made some from that as well isn’t it, yeah.

>> I don’t know, I’d do it in a heartbeat I’d go back to that. And then skills, the skills that someone might have. I’ve put skills that I have, which is C#, Visual Basic-

>> That’s brilliant. I’ve immediately gone to the bottom of this and seen chess, homeopathy, juggling in there.

>> Genuinely, I was teaching my three-year-old son how to play chess the other day. Because I’ve got this thing that I want him to be able to play chess in the neighborhood. I’m massively into chess.

>> I feel for you. I was recently teaching my girlfriend how to play chess. It was quite interesting.

>> I think it’s much easier when they’re younger.

>> How do you get to our age and not know how to play chess?

>> No, there’s loads of people. It’s quite an easy thing. Like I got all the way through to uni and didn’t know how to play backgammon. Do you know how to play backgammon?

>> No.

>> See, it’s just not culturally.

>> But that’s different though. I mean, I’ve also been teaching my son who has just turned seven and they’re basically about the same level. So I’m gonna try and have them have a chess game with each other. And I’ll just coach them both as they’re going through. That could be quite interesting.

>> It’s like Go as well. Like that’s the thing, I’ve never played Go very much. I’ve played it on the computer once or twice, but not really got it.

>> Right, I’ve never even heard of it.

>> Go?

>> Yeah.

>> Well, that’s a different story. So yeah, homeopathy which is fake news, [LAUGH] fake medicine. I think, I don’t know. Don’t wanna offend anyone.

>> I thought you were a five star on this stuff.

>> I am, I can sell it to you. I know exactly what it does. If you’ve got a bruise, I’m gonna serve you some arnica. I know how it works, but it doesn’t work.

>> [LAUGH]

>> Never mind. And then juggling. [LAUGH] Juggling, that’s one of the skills I have. So there’s some partners.

>> Okay, got it, good.

>> And some skills that I can use to create my little world. So Bing Speech. Now, Bing Speech is not gonna do particularly well understanding some of those names that I’ve put in there.

>> So this is the kind of standard service, isn’t it, that’s available.

>> Yeah, and it’s as good as any. There’s others.

>> It’s not really training it in any way. It’s just gonna take a piece of audio and give you some text.

>> Yeah, it’s gonna do speech to text. Bing Speech is not gonna do a great job of understanding all of those skills and all of the things. Some of them it will, and some of them it won’t. And it won’t necessarily know any context about how people are likely to speak to my application. It’s generalized speech to text. Which is fine if you’re trying to speech-to-text generalized conversation. But if you’re trying to do something in an app where you know what people are gonna ask it, wouldn’t it be better if we had some way of training it a little bit more?

>> Yes, yes, it would be better.

>> Well, I’ll show you how Bing Speech will work with this. So this is going into Azure. I go and get a speech key from Azure. So I’ve just set up a Speech Cognitive Service inside of Azure. And then I’ve got Postman here. And I’m just gonna post. Firstly I get a token from api.cognitive.microsoft.com. Don’t really need to worry about this. This is just how you use the Speech API. And then I can go over to the Speech API URL. And I’m gonna put the token in the Authorization header. Excuse me. Actually, this is taking too long so I’m just gonna go and. No, it’s definitely too long. Space, paste. And then I’m gonna post a WAV file. So this is a WAV file. I’m gonna send it over to the thing and the response comes back. It should say, have we worked with the pond on Misco before. Which should be, have we worked with the partner Misco before.
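
The two-step call described here (fetch a token with the subscription key, then POST the WAV file with that token as a bearer credential) can be sketched in Python roughly as follows. This is only an illustration: the transcript names just the token host, so the full URLs, the `language` query parameter, and the `Content-Type` value are assumptions based on how the since-retired Bing Speech REST API was commonly called, and the function names are hypothetical. The sketch builds the requests without sending them.

```python
import urllib.request

# Assumed endpoints for the (since retired) Bing Speech REST API.
# Only the host "api.cognitive.microsoft.com" is mentioned in the video.
TOKEN_URL = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken"
SPEECH_URL = (
    "https://speech.platform.bing.com/speech/recognition/"
    "interactive/cognitiveservices/v1?language=en-GB"
)


def build_token_request(subscription_key: str) -> urllib.request.Request:
    """Step 1: exchange the Azure subscription key for a short-lived token."""
    return urllib.request.Request(
        TOKEN_URL,
        data=b"",  # empty body; the key travels in a header
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
        method="POST",
    )


def build_speech_request(token: str, wav_bytes: bytes) -> urllib.request.Request:
    """Step 2: POST the WAV audio, passing the token as a bearer credential."""
    return urllib.request.Request(
        SPEECH_URL,
        data=wav_bytes,
        headers={
            "Authorization": "Bearer " + token,
            # Assumed audio description for a 16 kHz PCM WAV file.
            "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        },
        method="POST",
    )
```

Passing either request to `urllib.request.urlopen` would perform the actual network call; the token response body is the token string, which then goes into `build_speech_request`.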

>> Right, I see.

>> That’s the thing that I want this speech to text to work with. Have we worked with the partner Misco before. And if I take the skills WAV file and put it up to this generalized speech engine, I’m gonna get back, does anyone in the company have skills with asp.net. It’s actually So I was thinking about the algorithm.

>> With ASP.NET, but it’s not quite right, but it’s not bad. It understood that second-

>> You can tell what it means from that, but it’s still [INAUDIBLE].

>> Yeah, you have to do some cleaning up of this text now to get it to work. And in general, when you’re doing speech to text, this is the bread and butter of what you’re doing. But there is a new service that we have called Custom Speech, or CRIS, C-R-I-S; it’s sometimes referred to as custom speech recognition. And it’s aimed at solving this kind of problem. So this is what this Custom Speech Service is all about. The general idea is, instead of having a generalized speech detection system, I have a custom speech thing, something which I have trained. And you’re training it with two things. One is language data, right, like strings of [INAUDIBLE] utterances, effectively. So things that people will say to your application.

>> Okay.

>> And then the other thing you can train it with is an acoustic model. An acoustic model is: you upload a load of WAV files and a load of transcriptions of those WAV files. And it then fiddles with the knobs to figure out which acoustic model would work best to identify those things to get the most accurate results.
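
As a rough illustration of what "WAV files plus transcriptions" might look like on disk, the sketch below pairs each recording with its transcript in a single text file. The one-line-per-file, tab-separated layout is an assumption about the upload format rather than something stated in the video, and the file names and helper name are hypothetical.

```python
from typing import Iterable, Tuple


def build_transcriptions(pairs: Iterable[Tuple[str, str]]) -> str:
    """Return the text of a transcription file: '<wav name><TAB><transcript>' per line.

    The tab-separated layout is an assumed format, not confirmed by the source.
    """
    lines = []
    for wav_name, transcript in pairs:
        transcript = transcript.strip()
        # Tabs or newlines inside a field would corrupt the line-based layout.
        if "\t" in wav_name or "\t" in transcript or "\n" in transcript:
            raise ValueError("fields must not contain tabs or newlines")
        lines.append(wav_name + "\t" + transcript)
    return "\n".join(lines) + "\n"


# Hypothetical recordings of the two scenarios discussed in the video.
acoustic_data = build_transcriptions([
    ("partner.wav", "have we worked with the partner Misco before"),
    ("skills.wav", "does anyone in the company have skills with ASP.NET"),
])
```

Recordings made in the target environment (a car, in this scenario) are what would make such a data set useful.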

>> Okay, so that’ll be useful in scenarios like maybe airports, hospitals, or when you’ve got specific background noise.

>> In this particular example, it would’ve been interesting if I had recorded some of these things in a car. Cuz that’s a severe acoustic environment. We didn’t do that, cuz we didn’t have time. So all I trained with the Beezy example was, we just did it purely for-

>> The language.

>> The language data. And the way that you train it is, you basically have a big, long notepad or text file of all the lines and things that people could ask through your application.

>> Yeah, right.

>> All the different combinations of words in there. The first thing that you’re gonna need to do, though, once you’ve created it, is log in to the Custom Speech Service portal. And honestly, at the moment, this UI is very preview, it’s very basic.

>> [LAUGH] Very preview.

>> You have to go in and do stuff which is ridiculous actually, I think. You have to go cut and paste subscription IDs from this portal into this. That won’t be the case, I’m sure, in the final product. But at the moment, you have to go and get a key, you have to create a CRIS or a Custom Speech Service inside of Azure, which is one thing. It takes a little bit of time to spin up. Once you’ve created that, you’ll get a key and then you add that key to your subscription. So we’re going to import some language data into our application. And what I would need to do, it says no active subscription’s been found, so I need to add a subscription. So this is a bit fiddly at the moment. We go over to the Azure dashboard where I’ve created, I’ve got somewhere, a speech service. And over in Keys, I just grab this key.

>> Okay.

>> Or maybe this one. I can regenerate them. [CROSSTALK]

>> Cognitive service really I guess same idea as a subscription key.

>> It is, yeah, but it’s just that you have to do this little bit extra inside of this, which is a bit odd. So you just give this a name to make it unique and then just add it to our system. And then hopefully, that one’s actually tipped and so I’m just going to delete that one, which is still pending.

>> Okay.

>> So now I’ve got this Beebs subscription set up in my application, I can now start to import some data. So Language Data > Import New, and I’m gonna go in and type in just anything really, basic stuff. And give it my description. And then import my language data. And I’m just gonna upload this output.txt file. I’ll talk a little bit about how that’s actually created later, okay. And then what will happen is it will say it’s waiting, blah, blah, blah, processing. And actually, we’re gonna edit the video here. It takes a lot longer than is actually shown there. [LAUGH] It shows that it’s 40 utterances or 40 things that it processed. So then when I have the data uploaded, I can create a language model.

>> Okay.

>> And a language model, again, you’re just giving it a name to identify it; these aren’t really that important. [INAUDIBLE] What do you wanna update and then any old description. You have to have a base language model, so it could either be conversational or search and dictation. And then you choose the subscription you want. Click Create, and it creates it. The difference between the conversational one and the dictation one is that the conversational one’s better if you’re trying to listen to generalized conversation, the other one’s better for apps.

>> Okay.

>> Or at least that’s the thing I found. And what we can do is create a deployment. So once that model’s been created, we can then create a new deployment and, yeah, let’s deploy it. And then I’ll choose the base model. And then my little specific language model, the Beebs one, shows up. And then I create this deployment now. And what a deployment is, is it’s gonna generate three endpoints, just like the Bing Speech SDK. So the Bing Speech SDK has three endpoints that you can call. One for short dictation, one for long dictation, and one for basic HTTP posts. So what I should be able to do now in my application is, if I want to, I can employ it. Now, the thing is I can just upload text, like generalized text. A list of common words, all those skills.

>> Yeah.

>> And that would work absolutely fine. And that would be 100% okay to do that.

>> Yeah.

>> But I thought, this is actually interesting, cuz Beezy have a LUIS endpoint. We talked about LUIS last week.

>> Right, yeah.

>> And LUIS has all these utterances and has phrase lists in there with all the skills and partners. Is there some way I can generate it from the LUIS output?

>> Yeah.

>> And that’s exactly what I did, and I don’t know if this is great advice or not, but it worked. By the way, this is the deployment information you get: a subscription key, an HTTP endpoint, a WebSocket endpoint for short phrase dictation, and one for long dictation mode as well. So, the Language Understanding service, LUIS; we looked at it before. I’m gonna go into My Apps, and there’s a Beezy application in here.

>> Okay.

>> And inside of the Beezy application, we have [INAUDIBLE], and this is actually Beezy’s LUIS implementation, or rather my cut-down version of it, where I’ve just added two intents, find partner and find skill. And I’ve got two utterances in that everyone reported [INAUDIBLE] before. And I go to find skill, which is, is there someone in the team has skills in. Is anyone in the company has skills in. So just two utterances that I’ve got there. It’s not gonna be a very good LUIS model but it works [INAUDIBLE].

>> Now download that JSON file.

>> And the cool thing is that JSON file, when you export the application as JSON, has all of that stuff in there. It has all of those little utterances stored in that JSON file. So I thought, well, what I could do is write a little C# application which basically loops through that JSON and extracts all the utterances, and also joins them together with what I’ve got in these things, these lists of partners and lists of skills. So you go, there’s the utterance that the [INAUDIBLE]. By the way, I’ve got these, the [INAUDIBLE] model features. But these are the phrase lists that are in Cognitive Services. So if I go back to the application I just look at the phrase lists. Inside of features, I’ve got two phrase lists. One which contains all the partners, and one which contains all the skills. So if I go over to Visual Studio, I’ve got this little application. And inside of here, it just reads this JSON file. I’ve just stored it. Rather than go through this, I’ve got this on my GitHub.

The only thing which is slightly clever about this, which I’ll just show quickly, is the way that I’ve parsed that JSON object. I like to work with a solid C# object rather than parsing JSON or using Newtonsoft Json.NET just to navigate the JSON object. So what I did is I opened up the JSON file inside of Visual Studio Code. This is a trick that I use all the time, but no one seems to ever know that it exists. If you copy any XML or JSON, and then go into Visual Studio, you can, I’ll just remove the current class which is in there [INAUDIBLE]. That’s enough. Think I nuked one too many, put that one back in. And then what you can do is go to the Edit menu: Edit > Paste Special > Paste JSON As Classes. So you’ve got JSON in memory, and you paste as classes, and it generates a C# class from your JSON object.
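
The same "typed object instead of raw JSON" idea can be sketched in Python with dataclasses. This is an analogy to the C# approach described, not the code from the video. The field names (`utterances`, `text`, `intent`, `model_features`, `name`, `words`) are assumptions about the layout of a LUIS export, and the sample document is invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str
    intent: str


@dataclass
class PhraseList:
    name: str
    words: List[str]


@dataclass
class LuisApp:
    utterances: List[Utterance]
    phrase_lists: List[PhraseList]


def parse_luis_export(raw: str) -> LuisApp:
    """Map an exported LUIS JSON document (assumed layout) onto typed objects."""
    doc = json.loads(raw)
    utterances = [Utterance(u["text"], u["intent"]) for u in doc.get("utterances", [])]
    phrase_lists = [
        # Assumption: each phrase list stores its words as one comma-separated string.
        PhraseList(f["name"], [w.strip() for w in f["words"].split(",") if w.strip()])
        for f in doc.get("model_features", [])
    ]
    return LuisApp(utterances, phrase_lists)


# Invented sample in the assumed export shape.
SAMPLE = """
{
  "utterances": [
    {"text": "have we worked with the partner Microsoft before", "intent": "FindPartner"}
  ],
  "model_features": [
    {"name": "Partners", "words": "Microsoft,Misco,Satra"}
  ]
}
"""
app = parse_luis_export(SAMPLE)
```

With the typed object in hand, attribute access such as `app.phrase_lists[0].words` replaces string-keyed navigation of the raw JSON.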

>> I’ve always used a website, I can’t think what it’s called though, to do that. Same thing.

>> And then what I do is just use Newtonsoft Json.NET to convert my JSON, the file which I’ve downloaded, into a solid object. And then you have all the access. It’s much easier to work with, I found, than navigating the JSON thing. So ultimately what it does is it grabs all of those utterances, grabs all of those skills, all the partners, and it replaces partners in the correct places. And I end up with a little output text file which contains all the things that someone might say to me.
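
A minimal sketch of that expansion step follows, written in Python rather than the C# used in the video. How the original tool decided where to substitute is not described, so this version makes an assumption: if an utterance contains any word from a phrase list, one line is emitted per word in that list. The function names and sample data are hypothetical.

```python
import re
from typing import Dict, List


def expand_utterance(text: str, phrase_lists: Dict[str, List[str]]) -> List[str]:
    """Return one variant of `text` per word of the first phrase list it mentions."""
    for words in phrase_lists.values():
        for word in words:
            # Match the word as a whole token, ignoring case.
            pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
            if pattern.search(text):
                # A function replacement avoids backslash handling in the substitute.
                return [pattern.sub(lambda _m, w=other: w, text, count=1) for other in words]
    return [text]  # no phrase-list word found: keep the utterance unchanged


def build_language_data(utterances: List[str], phrase_lists: Dict[str, List[str]]) -> List[str]:
    """Expand every utterance and drop duplicate lines, preserving order."""
    lines: List[str] = []
    for text in utterances:
        for line in expand_utterance(text, phrase_lists):
            if line not in lines:
                lines.append(line)
    return lines


lines = build_language_data(
    [
        "have we worked with the partner Microsoft before",
        "does anyone in the company have skills with C#",
    ],
    {
        "partners": ["Microsoft", "Misco", "Satra"],
        "skills": ["C#", "ASP.NET", "chess"],
    },
)
```

Joining `lines` with newlines gives a plain text file with one sentence per line, the kind of language data file uploaded in the walkthrough.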

>> So if I have a look [INAUDIBLE].

>> In the sorta context of how we would actually use the language.

>> It’s really smart.

>> I don’t know

>> If it’s smart or not.

>> Right, it’s not super clever in terms of the code, but the idea is smart that you’ve kind of gone through the phrase list and generated the [INAUDIBLE] from the phrase list.

>> Yes, it seems to work anyway. It definitely helped improve the speech recognition quality. I don’t know if I’d have gotten the same results if I had just uploaded all the partners and all of that.

>> Yes. But when you look at the docs, they seem to suggest that you should upload it in this kind of sentence format, so it was a nice easy way of generating that stuff anyway. So we’ve called the Bing API; now, the proof is in the pudding. Let’s call the CRIS API, and you call it the same way. First you get a token using your subscription key; once you’ve got the token, you then call one of the URLs that you got when you deployed CRIS.

>> Okay, so now I get the subscription key and I’ll go into Postman over here. And in Postman I’ll pass in the subscription key in one of my headers. That will ultimately give me my token. Obviously, you wouldn’t be doing this in Postman in a real application.

>> No, but it’s fine.

>> Just to show how it works. And then I’ll add that as my bearer authentication token to call this CRIS URL. And then I can upload the file. So I’m gonna choose this Skill.wav file. And I’m gonna send that over to the URL. And now, if I look at the response back, does anyone in the company have skills with ASP.NET. And it’s actually got it right.

>> Cool.

>> It’s got ASP.NET right, it knows that that’s a skill. And then my partner example. Send that over. Have we worked with a partner? Does anyone work with the partner Misco before? So it’s recognizing the name Misco and it’s put it correctly together in a conversation. And those are exactly the same [INAUDIBLE] that I sent to the free service. So in this instance it’s definitely improved the speech recognition. So that’s kind of it. That’s how you can use CRIS, how you can call CRIS, how you can customize CRIS. And if you don’t have loads of conversational information, you can generate it from your LUIS models. And I guess would the scenario be easier if telephone’s in the car? That’s a very specific acoustic model as well [CROSSTALK]
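
One common way to put a number on "improved the speech recognition" is word error rate (WER): the word-level edit distance between what was said and what was recognized, divided by the number of words said. The video does not use this metric; the sketch below is an illustration that applies it to the generic Bing Speech result quoted earlier in the transcript.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                cur[j - 1] + 1,      # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


spoken = "have we worked with the partner Misco before"
generic = "have we worked with the pond on Misco before"  # generic result quoted earlier
generic_wer = word_error_rate(spoken, generic)  # 2 edits over 8 words = 0.25
```

A result that matches the spoken sentence exactly scores 0.0, so comparing the two services on the same WAV files gives a simple before-and-after figure.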

>> Yeah, I think that’s the next stage. We were just playing with this and we got really good results from it, so it seemed to work really well. We also built a Xamarin application which had a message, we implemented a [INAUDIBLE] channel effectively in it, and then a button which they press. It saves the WAV file on the iPhone. It uploads the WAV file to the CRIS service, which converts it into text. We then use that text and we pass it on to the Bot Framework. You end up with this really cool speech to text scenario. It really works. It really improved the speech recognition. So if you’ve got an example of an application you want to build and current speech recognition stuff doesn’t work because it’s just too generalized for you, this is a way of creating a customized speech model, and you may even be able to customize not just the language data for domain-specific language but also the acoustic models as well. It’s very, very cool.

>> Very cool.

>> And that’s it for another week of Web Hack Wednesdays, we’ll see you next week.



Martin and Martin look at the Custom Speech Service and how it was used in a bot project that Martin Beeby worked on with a company called Beezy. The Custom Speech service allows developers to train a speech recognition engine to specific language data and acoustic models to improve accuracy when used within a website or app.
