Data Collection And Management For Teaching Machines To Hear At Audio Analytic - Episode 139

Summary

We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they are faced with in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what you are building at Audio Analytic?
    • What was your motivation for building an AI platform for sound recognition?
  • What are some of the ways that your platform is being used?
  • What are the unique challenges that you have faced in working with arbitrary sound data?
  • How do you handle the collection and labelling of the source data that you rely on for building your models?
    • Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
    • How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
  • What are the challenges of building an embeddable AI model?
    • How do you manage the update cycle for models once they are deployed?
  • How do you handle the difficulty of identifying relevant audio and dealing with literal noise in the input data?
  • What rights and ownership challenges do you face in the collection of source data?
  • What was your design process for constructing a pipeline for the audio data that you need to process?
  • Can you describe how your overall data management system is architected?
    • How has that architecture evolved since you first began building and using it?
  • A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
  • What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
  • How do you address variability in the duration of source samples in the processing pipeline?
  • How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
  • What are the limitations of the model in dealing with complex and layered audio environments?
    • How has the testing and evaluation of your model fed back into your strategies for collecting source data?
  • What are some of the weirdest or most unusual sounds that you have worked with?
  • What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
  • What do you have planned for the future of the company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things, that's the number nine seven followed by "things", to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's l-i-n-o-d-e, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey, and today I'm interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music. So Chris, can you start by introducing yourself?
Chris Mitchell
0:01:56
Yeah, sure. So, Chris, CEO and founder of Audio Analytic.
Tobias Macey
0:02:01
And Thomas, how about you?
Thomas Le Cornu
0:02:03
Hi. Yeah. So, Tom, data engineering lead at Audio Analytic.
Tobias Macey
0:02:08
And so going back to you, Chris, do you remember how you first got involved in the area of data management?
Chris Mitchell
0:02:13
Ah, well, yes, I did a PhD in audio classification. So I suppose that's the place where you can say I got my start in it, largely sort of dealing with all of the fun constraints of academic research, which is sort of smaller datasets than you'd like, but still larger than you're used to dealing with, as well as all of the fun challenges of building the technology and doing the fundamental science as well. So
Tobias Macey
0:02:40
that's where I got my start. And, Tom, do you remember how you first got involved in data management?
Thomas Le Cornu
0:02:44
Yeah, similar to Chris, it was during my PhD. I was working with different data sets and, you know, just dealing with them more on disk and stuff, and then moving to work at Ballard Institute and working with computer vision and realizing that kind of having massive datasets just on the file system is not great, and then moving to Audio Analytic where you see a new way of doing things.
Tobias Macey
0:03:06
Yeah. And so in terms of Audio Analytic, can you give a bit of a description of what it is that you're building there, and what was your motivation for building an AI platform for sound recognition and getting the business started?
Chris Mitchell
0:03:18
Yeah, sure.
0:03:21
So, I'd done some research in the field and found that people weren't tackling sounds. There was a lot of work going on in the speech field, there was a lot of work going on in the music field, and obviously the broader classification fields, so image and text, etc. But sound itself has its own set of unique challenges. So, in comparison to, say, speech, you don't have language models to work with, so you can't constrain the acoustic patterns you're looking for in that sort of way. And you have very large open-set data sorts of problems. So obviously, for the sounds that you're looking to detect, you also have to try and differentiate them from the large number of other sounds that can happen in the world, that can happen at any point. Sounds are relatively random in that respect. So what I was interested in is, could you make a sound recognition system that could capture a broad sense of hearing? So that's normally around a range of target sounds to be detected, whether it be safety and security target sounds such as glass break or smoke alarms or car alarms going off, whether they be sort of health and wellbeing sounds of coughing, sneezing, that sort of thing, whether it be entertainment sounds, or whether it be communication-related sounds. You can start looking at this world of sounds and then you can imagine, what could you do from a product design perspective if products had the sense of hearing, whether that be mobile phones, headphones, whether it be smart speakers, or whether it be smart home? Giving them that sense of hearing means that those devices can respond more naturally, in the way you and I would do if those things were happening around us, and then they can take intelligent action. So that was the sort of the motivation for it. At a personal level, the motivation for it is I just like machines that make strange noises. So it's quite a natural extension for me to like machines that can classify those noises into various different classes, and then give the outcome. So quite a bit of a visceral, personal
Tobias Macey
0:05:33
love of sound. And you touched a little bit on some of the contexts in which your product is being used. But can you give a bit of a taste of the types of use cases that it's intended to empower and some of the ways that it's actually being employed?
Chris Mitchell
0:05:48
Yeah, so we do it by device, that's probably the easiest way. So if we take a device like a smart speaker, and you want to be able to turn it into a sort of home security device, and you want to know if somebody's breaking into the house from the sound of the windows being broken as somebody enters the property, then you're listening out for that sound. There are four different major types of glass: laminate, plate, wired, tempered, in different sizes, different thicknesses, obviously breaking with different implements. So you're very quickly into a large data management problem, and that's just dealing with the target sound alone, let alone all of the non-target sounds, such as, I don't know, if you've got cats knocking things off the work surfaces in kitchens, that may be confusable with the types of sound you're trying to detect. So that's the sort of smart speaker side, on what's called audio event detection, that's detecting specific sounds, in that case glass windows being broken. If we move on from event detection to something like scene detection, this isn't a single sound, this is a sort of a combination of sounds, or think of it as soundscape detection. And that might be around detecting whether it sounds like somewhere similar to a train station or to,
Tobias Macey
0:07:07
you know, a coffee shop, and sort of whether it's a physical scene in that case, or whether it's an acoustic scene, so whether it sounds calm, whether it sounds lively or not, or indeed whether it sounds like it's inside or outside. Those would be examples of acoustic and physical scene detection. And both of those sit under what's called sound recognition, which is the field in which the company leads. And it seems that at least the majority of the use cases that you're discussing now are more consumer oriented, for people to be able to take advantage of some of this intelligence to enhance their sense of well being or get some sort of feedback about their environment. I'm wondering if you've also experimented at all with use in industrial contexts, where particular types of sound might be indicative of some type of imminent failure, in terms of structural issues or issues with manufacturing, or, you know, maybe in mining, where certain sounds might be indicators of some type of physical risk. I'm wondering if that's something that you've looked at at all, or something that you're intending to branch out into.
Chris Mitchell
0:08:11
We obviously looked, when we started, at a whole range of different applications of the technology; it's sort of a foundational technology. And in that respect, yes, we looked at what the area you described might be, for me it would be called predictive maintenance or something of that nature. The commercial activity of the company is largely focused towards consumer electronics; it's where we've had the most success commercially, wide-scale, you know, adoption. So that's the bulk of the commercial effort, and that then obviously translates into the thrust of the sounds we're detecting. Most of this world can be described by breaking it down in terms of the type of sounds, so obviously the sounds you'd get in a production plant, I think is the example you'd use, would be very different than the sorts of sounds you and I would care about in our house, or if we're out and about on the street. They do end up being very different sounds, and we capture that in terms of the taxonomy that we use to structure our data.
Tobias Macey
0:09:14
And my understanding of the way that you actually deploy your product is that it's an embeddable AI model that other companies can license and include within their own products. So I'm wondering what types of challenges that poses in terms of the deployment mechanism and the types of interfaces that you provide to those companies to be able to take advantage of your technology and just issues in terms of updating the model definition if there are any changes or enhancements that you make to it.
Chris Mitchell
0:09:44
So, yeah, for sound recognition, because inherently sound goes hand in hand with privacy concerns, for obvious reasons, we believe that sound recognition under a large range of use cases is best done at the edge of the network. You know, obviously all the audio data can stay there for the means of classification. You don't need to transmit it off the device, which gives you cost and other economic benefits and scalability benefits. In terms of the more on the con side of that approach, you don't get to update the models as quickly as you would if it was a, you know, SaaS-based model or something like that. We don't tend to see that being much of a commercial issue. You know, most of the firmware now on consumer electronics devices is updated reasonably regularly, and we also know that when the customers do want to get that updated, it's something they can easily push out to the end users. On the general point of the challenges it faces, it means that you need to know quite a lot about the subject matter variability that you're trying to detect. So that's where the quality of the datasets comes in. Generally, consumers don't tolerate lots of failures out of classification systems, and especially not around the fault-intolerant sorts of aspects of security or safety, where, you know, if I told you, Tobias, your house has been broken into now because I heard a window being broken, and you rushed home, if you're not already there, which hopefully you are given the circumstances, but if you did rush home and find out it was a false alarm, you're not going to tolerate many of those. So you actually want the models to be pretty well structured and understanding most of the variability they're going to come in contact with. Otherwise, the overall value proposition doesn't work very well. So that sort of aligns with this notion of being edge-based in the large number of use cases that apply to sound recognition.
Tobias Macey
0:11:47
And given the fact that you're building these AI models and everyone knows that it's garbage in garbage out and you highlighted the fact that you have to ensure a high amount of quality in the input data. And I'm wondering what are some of the unique challenges that you are facing in terms of being able to collect and label and create a taxonomy around these arbitrary sounds and being able to ensure that you can correlate them with some sort of meaningful event?
Chris Mitchell
0:12:16
It's a great question. I mean, I'll give you the broad brushstrokes, and then Tom deals with this day to day. On filling out that taxonomy, you'll see part of the great job he does is around sort of doing what we call sound expansion. So on the taxonomy side, we break things down at the top level into three parts: there's anthropophony, geophony and biophony. And there are sort of 700 label types that we're dealing with on a daily basis inside systems, where a label type would be something like a glass window being broken or a smoke alarm going off. So this is a large set of classes you're dealing with, on the target and the non-target side at the same time. Just before Tom answers the sort of practical problem, on the AI side there's also a range of specialized things you need to do. So if you take an off-the-shelf speech recognition system, the acoustic model is designed for our voice box, and clearly a large number of the things we deal with aren't produced by humans, and an even larger number aren't produced by humans using their mouths. So, you know, there's quite a lot of issues around the acoustic model side. And then, as I said earlier on, the language model that the speech recognition companies rely on very heavily, it does quite a lot of the heavy lifting in correcting the errors made by the acoustic model. Clearly, when somebody breaks a window, in the example I'm using, it's not trying to speak to you in any structured way, so you don't have that language model. So there are also fundamental AI things you need to solve before you even start, which you can only do with good quality data. So you need to get the garbage in, garbage out bit sorted, and then you can start on the AI principles off the back of that, and the sort of out-of-the-box techniques don't work. In terms of the day to day stuff, Tom, you're best placed to sort of explain some of the challenges we face there and the tools we use to overcome them.
Thomas Le Cornu
0:14:15
Yeah, sure. So, I mean, for most of the stuff we work on, like a data collection, you can consider it as, you know, a project-type approach. So we'll be given some notion of a sound that we want to work on, and we'll attack it in different stages. So the first part will be kind of considering the spec that we want to develop for it. So perhaps for certain sounds we're more strict around certain criteria, and then for other sounds less so. And that spec is really going to be important in kind of guiding us through the rest of the process, in terms of how we go about the data collection, and right through to the labeling and the metadata associated with it. So you can consider that, for example, with something perhaps like a dog bark sound: you consider all of the different degrees of variation that dogs exhibit, if you like. You have small dogs or big dogs, different breeds, and age will affect the sound that the dog will make when it barks, and other sorts of factors. And so for each sound that we work on, we have to do this massive sort of brainstorm exercise around making sure that actually the data we're going to gather is going to be valuable. I guess in a similar way, with speech recognition, when you're designing data sets, you have to sort of make sure that they have all of the different sorts of phonemes, or whatever it may be, that you're interested in. So then we'll develop this plan for the data that we actually want to collect, and then there'll probably be a stage where we then have to essentially source this data. Unlike some sounds, like perhaps a smoke alarm type sound where you can just buy the alarms, for a dog, you know, you can't just get a dog off the shelf, as it were; you have to source these dogs in some way. And so we'll reach out to the volunteers that we have. We have a global volunteer network, and so we can draw on those people to provide us the sources for these sounds. So we're able to just email these people, and for some amount of money in return, they'll be able to provide us with a dog in that situation. So we'll line up a lot of people to help do the recordings, and in a situation like that, we'll probably go into our anechoic chambers at our sound labs, or semi-anechoic chambers, and make sure we record the sounds, you know, to a really good standard, as you said about the garbage in, garbage out. But of course, you can consider there are a few, again, bits of variation there that you've really got to focus on, and one of the areas will be the different channels, or the microphones if you like. But you can also consider, for example, something like with a dog, you know, it's going to be running around and moving about, so that presents its own sort of challenges. And in my experience working at the company, no one sound is really easy; there are always challenges that you don't expect when you first start working on these things, that you kind of have to overcome to be able to do a good job of those sorts of things. And then, you know, we've spent a load of time gathering a load of really, really fantastic data, and then you have to process this data and label this data.
And, you know, the labeling is a really important aspect of the sound recognition problem, you know, deciding how you want to do the labeling: do you make sure it's extremely sort of fine-grained, or maybe more coarse, or do you use weak labels and so on. And then once we've kind of gone through that stage, we can make sure we get it into Alexandria, our data platform for storing all of this data. And, you know, I think every stage in this, really just the data aspect of the pipeline, each part has its own challenges, and you have to get a lot of things right to make sure that the data that you can present onwards to the machine learning teams is of the best standard that you can really get.
Chris Mitchell
0:18:01
To add to what Tom said, because I know, Tom, you've got what, some 15 million pieces of audio data in Alexandria, 200 million pieces of metadata, and 700 label types or so. I think, Tobias, one of the interesting things to realize is that labeling audio data, as opposed to speech, presents its own set of challenges, as Tom talked about. So if you take something as basic as baby cry, and Tobias, I don't know if you're a dad or you've got kids or anything, but if you sit down a bunch of people and you say, here's the recording, tell me when the baby starts crying and when it stops, most people will disagree with each other. They'll agree in the main, but at the edges, when you try to do, I don't know, ten-millisecond labeling accuracy around when a baby started crying, you'll get into debates like: is it crying, is it grumbling? You know, all the things that you would do as a parent start to come out, because that audio is just less exact, whereas in a speech world it's a lot more exact. You can take a typical person off the street and say, label the words and when they started, and there'll be very little disagreement in comparison. And that gives Tom and his team some sort of fundamental challenges with labeling up these sorts of things, which I know they spend a whole bunch of time on: just figuring out how do you label a new type of sound? What is that sound? And when does it start and when does it stop?
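To make the strong versus weak labelling distinction from the discussion above concrete, here is a minimal sketch; the class, the field names, and the 250 ms agreement tolerance are illustrative assumptions, not Audio Analytic's actual schema or thresholds.

```python
# Hypothetical sketch (not Audio Analytic's actual schema): "strong" labels with
# onset/offset times versus "weak" clip-level labels, plus a tolerance when comparing
# two annotators' onset times for the same clip.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SoundLabel:
    clip_id: str
    label_type: str                    # e.g. "baby_cry", one of ~700 label types
    onset_s: Optional[float] = None    # None for weak (clip-level) labels
    offset_s: Optional[float] = None

def onsets_agree(a: SoundLabel, b: SoundLabel, tolerance_s: float = 0.25) -> bool:
    """Two annotators 'agree' if their onsets fall within a tolerance window."""
    if a.onset_s is None or b.onset_s is None:
        return a.label_type == b.label_type  # weak labels: only the class can agree
    return a.label_type == b.label_type and abs(a.onset_s - b.onset_s) <= tolerance_s

# Annotator 1 hears crying start at 3.10s; annotator 2 decides the grumbling before
# 3.40s doesn't count. With a 250 ms tolerance these two labels do not agree.
print(onsets_agree(SoundLabel("c1", "baby_cry", 3.10, 9.80),
                   SoundLabel("c1", "baby_cry", 3.40, 9.75)))
```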
Tobias Macey
0:19:27
And the metadata aspect, too, is interesting, because with things like textual records or structured data, it's easy to associate the metadata with the record at the time that it's being created. And with image data, there's the standard of the Exif tags, and I know that, for instance, with MP3 they've got ID3 tags. But I'm wondering if there is any sort of useful standard that you can use for embedding the metadata with the records, or what your approach is for being able to effectively associate that information with the actual audio segment and ensure that it propagates through your system in conjunction with the audio, so that they're easy to relate to one another.
Chris Mitchell
0:20:02
So we have a whole subject, we call it audio provenance, and it's a whole subject matter for us internally. The two examples you've used, let's take image data: if I showed you three pictures of a toy dog and one of a real dog, you'd very quickly be able to identify, with no prior information, that's the toy dog and those are the real dogs. Audio is much more complicated than that. We're very much attuned as humans to sort of fill in the blanks, and so I could play you three recordings of smoke alarms and say one of those is a fake smoke alarm, and I guarantee you'd be very bad at telling me which one was the fake and which one is the real one. So if, for example, you scraped a bunch of audio files off the internet, you'd be straight into that garbage in, garbage out principle. By doing that high quality data collection inside those semi-anechoic environments, it means we're there when the subject matter variability is explored. And then, you're right, that sort of chain of evidence, if you will, has to be passed all the way through the pipeline,
Tobias Macey
0:21:09
right through the, you know, data collection, processing, labeling, augmentation, training, evaluation, you know, even sometimes data compression and deployment levels, so that you know that it's doing a good job. In terms of frameworks for doing that, no, there are no off-the-shelf frameworks; that's a completely new area in itself. And yeah, with audio data, there's definitely a huge degree of variability along a number of different axes, where, as you said, you've got your anechoic chambers for being able to isolate the sound to the specific piece that you're trying to collect. But then out in the real world, that's going to often be overlaid with whatever the other background noise is, whether it's the, you know, hum of your washing machine in the next room or the sounds of engines going by outside, and then being able to isolate that sound. And then for your volunteers who are contributing the audio that you're using for this collection process, I imagine that there's variability in terms of the quality of the microphones that they're using, the sample rates that they're collecting the audio at, the specifics of the audio format that it's being collected in, and the length of the segments. I'm wondering how you approach being able to try and encapsulate all of that variability and be able to standardize it in some way for being able to feed it through your model training process. In general, right, we think about it in terms of subject matter variability and channel variability, with channel variability split into two parts, which is sort of acoustic coupling variability, which would be the environment you're in, the acoustics of it, is it a reverberant environment, is it in the bathroom or is it, you know, sort of in the hallway, and then you've got the actual device channel variability, which includes the microphone and all of the rest of the parts of the audio subsystem before the input audio is received by ai3, which is the inference engine that we run to do the high quality sound recognition we do on the device. Tom, in terms of the challenges in all that, if you want to pick that piece up,
Thomas Le Cornu
0:23:12
Yeah, sure. So, I mean, yes, you're absolutely right in terms of, you know, it's a massive amount of variability to try and encapsulate. But consider all the things you've talked about, and then sort of draw it back to just one single file. And for one single file, as you said, you'll have the particular sampling rate, the particular bit depth, the particular channel it was recorded on, a bunch of settings around, you know, the device you're using to record or the sound card. Then you'd have stuff like the room it was in; perhaps, like you say, is there some sound in the background that's going on at the same time? Obviously, this will be in situations where we've recorded in situ versus, say, in the anechoic chamber. And then, yeah, exactly, the source variation: the gender of the dog, the age of the dog, all these sorts of things. And so in terms of capturing that data, I mean, we've got tools that sort of help us, I would say, structure the collection around this, but you can consider it's still quite a manual effort. You still have to kind of match up, when a volunteer comes in, you have to put in the name of the dog in a particular record and sort of make sure that that's stored along with the correct file. Now, when you consider, say, going back to the dog example, if you record one particular dog bark, you might be recording it with, you know, several tens of devices at the same time. So you need to make sure that the information around the dog is kind of propagated to all of those devices, but then the information about the individual devices is kept specific to the devices. And so, you know, we do have a pairing of a chunk of metadata with each individual sound file, and you can imagine that those numbers grow pretty fast in terms of, yeah, how many elements of metadata we have a record
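As a rough illustration of the metadata pairing Tom describes, here is one way session-level metadata (the dog, the room) might be propagated to every per-device file while channel-specific details stay with each device; the field names are hypothetical, not Audio Analytic's actual metadata model.

```python
# Illustrative sketch only: pair session-level metadata with per-device recordings so
# the shared fields propagate to every file while channel-specific fields stay with
# each device. All field names here are hypothetical.
import json

session = {
    "session_id": "dogbark-2020-03-14-007",
    "label_type": "dog_bark",
    "subject": {"breed": "beagle", "age_years": 4, "size": "medium"},
    "environment": {"room": "semi_anechoic_chamber_1"},
}

devices = [
    {"device_id": "mic_array_a", "sample_rate_hz": 48000, "bit_depth": 24, "format": "wav"},
    {"device_id": "smart_speaker_b", "sample_rate_hz": 16000, "bit_depth": 16, "format": "flac"},
]

# Each audio file gets the shared session metadata plus its own channel metadata.
records = [
    {**session,
     "channel": dev,
     "audio_path": f"{session['session_id']}_{dev['device_id']}.{dev['format']}"}
    for dev in devices
]
print(json.dumps(records[0], indent=2))
```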
Tobias Macey
0:25:07
of. And in terms of the taxonomy that you're building for being able to track and categorize these different audio segments, what was your approach for structuring the initial taxonomy? And how has it had to evolve, and what are the assumptions that have been challenged in the process of building and growing that taxonomy, for being able to make that information useful in some sort of structural or hierarchical way?
Chris Mitchell
0:25:33
So the taxonomy is structured on what's called an active principle, which is why you have anthropophony, biophony and geophony at the top level: things, obviously, caused by humans, caused by geography, if that makes sense, and caused by biology. And then it cascades down from there. The active principle is a fundamental one. It was a specific taxonomy principle we came up with, because obviously something needs to cause those sounds in the environment, so using that as a fundamental building block means that you're not going to go far wrong. In terms of the second part of your question, there are tons of things that we've, I suppose, effectively learnt that we didn't know we were going to have to learn. One of my favorite examples of that is not realizing that sometimes the world of sound conspires for you, and sometimes it conspires somewhat against you. So there is a smoke alarm, I think it's the third or fourth most popular selling smoke alarm in North America, and it sounds identical to a bird species in the south of France. Now, I'm pretty sure that that bird species hasn't evolved to mimic the smoke alarm, but that sort of thing is then presented to the machine learning engineers, saying, well, these things sound pretty much identical to humans, but you need to separate them out. Otherwise people are being told that their smoke alarms are going off when in fact it's just the bird that they keep in their living room that happens to sound identical to this North American smoke alarm, which the engineers solved. But those sorts of interesting quirks of, I suppose, fate, if you will, are fascinating to experience, although they do give Tom and the rest of the team, I'm sure, sleepless nights of worry as they try and figure out how to best collect the data and best separate it out.
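A toy sketch of a cause-based taxonomy of this shape, with anthropophony, biophony and geophony at the top level cascading down to label types; the lower levels are invented examples, not Audio Analytic's actual 700-label taxonomy.

```python
# Toy illustration of a cause-based taxonomy. The groups and labels below are invented.
taxonomy = {
    "anthropophony": {              # sounds caused by humans and human artefacts
        "alarms": ["smoke_alarm", "car_alarm"],
        "domestic": ["glass_break", "door_knock"],
        "vocal": ["baby_cry", "cough", "sneeze"],
    },
    "biophony": {                   # sounds caused by other living things
        "animals": ["dog_bark", "bird_song"],
    },
    "geophony": {                   # sounds caused by the physical environment
        "weather": ["rain", "wind"],
    },
}

def path_for(label: str):
    """Return the taxonomy path for a label type, e.g. glass_break."""
    for top, groups in taxonomy.items():
        for group, labels in groups.items():
            if label in labels:
                return f"{top}/{group}/{label}"
    return None

print(path_for("glass_break"))  # anthropophony/domestic/glass_break
```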
Tobias Macey
0:27:35
And another element of this problem space is that a lot of the tools that have been built for being able to work with and process large volumes of data are generally oriented around textual and structured data. So I'm wondering what you've been able to use in terms of off-the-shelf components for being able to actually process and build these models, and how much of it you've had to custom build in house, specific to your use case and your problem domain.
Thomas Le Cornu
0:28:03
Sure, yeah. So, I mean, you're absolutely right. A lot of the tools, when it comes to the audio world, say audio and then kind of bridging a bit into machine learning, are really kind of for speech recognition type applications, or kind of music oriented stuff. So you might have, for example, you know, some tools around transcription that sort of aid with transcriptions for speech recognition type problems, but obviously, as Chris has touched on a few times, you know, sound isn't speech. And the other way, you can consider there'll be, you know, even software like Audacity, absolutely fantastic at doing the job it does, but again it's specific to music, in this case kind of recording and music production and that sort of thing. And so often, as the problem that we're trying to solve is quite specific and is difficult to do, there is an element of having to roll your own, or enhance if you can; of course, it would be easier and more time-beneficial to do it that way and enhance. But if you have to roll your own, then, you know, for a lot of the stuff that we work on, we do kind of have to roll our own. I mean, consider the example of recording with many devices at one time: there isn't a magical start button that allows every single device to be started at the same time and stopped at the same time, because, well, actually, if there was, we'd like to hear about it, because that would be extremely useful. But you have to solve it. Yeah, you then have to make sure that, if there are different offsets at the beginning of the files, the sounds occur in the same part of the audio across the many different devices. And so we've developed techniques to handle that problem, you know, ourselves. And yeah, I guess, you know, right through the pipeline, we have got a lot of stuff that is bespoke to us at Audio Analytic, to just help solve those problems that aren't quite the same as in other areas.
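One generic way to solve the "no magical start button" alignment problem Tom mentions is to estimate the relative offset between two devices' recordings of the same event by cross-correlation and then trim accordingly; this is a common technique, not necessarily the one Audio Analytic developed.

```python
# Minimal sketch: estimate the start offset between two devices' recordings of the same
# event via cross-correlation, then trim so the event lines up. Synthetic signals only.
import numpy as np

def estimate_offset_samples(ref: np.ndarray, other: np.ndarray) -> int:
    """Lag (in samples) by which `other` must be shifted to best align with `ref`."""
    corr = np.correlate(other, ref, mode="full")
    return int(np.argmax(corr) - (len(ref) - 1))

# Synthetic example: the same burst appears 480 samples later on device B.
rng = np.random.default_rng(0)
event = rng.standard_normal(2000)
dev_a = np.concatenate([np.zeros(1000), event, np.zeros(1000)])
dev_b = np.concatenate([np.zeros(1480), event, np.zeros(520)])

lag = estimate_offset_samples(dev_a, dev_b)
aligned_b = dev_b[lag:] if lag > 0 else np.pad(dev_b, (abs(lag), 0))
print(lag)  # ~480 samples, i.e. 10 ms at a 48 kHz sample rate
```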
Tobias Macey
0:29:58
And can you dig into a bit of how the actual data pipeline and data management is architected, and the ways that you work with it for being able to train and build and deploy the models that you're working on?
Thomas Le Cornu
0:30:10
Yeah, sure. So, I mean, in terms of, you know, the whole pipeline of what we do at Audio Analytic, you can consider it as a standard machine learning delivery pipeline. So we go through from the data collection right at the beginning, through to the deployment of the models at the end, and there are many, many stages in between. They're focused around, say, Alexandria, which is our, you know, massive database. The early stages of the pipeline will focus around the data collection side, as I've touched on, and the processing, like with all these many different devices and many different formats that the audio comes in; some devices will give it to you as one format and others will give it to you as another format. And then the next stage will be the labeling part, and that's where you kind of marry up all of the metadata and the labels and the sound, and get it into Alexandria, so it can then be made available for the machine learning teams. And then, you know, there'll be a stage in between there where you've got to consider as well, it's not just, you know, a bunch of audio data in a data set; a data set on its own doesn't necessarily give you that much. You've then got to say, right, let's split it into those machine learning sets, where we need to make sure that, you know, ultimately the stuff that goes in that testing set is representative of all the real world situations it's going to be in, and then you need to make sure you've got the same kind of distributions throughout. And so that's, you know, another very important aspect of making sure that you gather enough variation all across the board, so that when you chop it up into these little sets, you've got it in the right places. And that's the point at which the machine learning team will take this stuff, and so they'll apply techniques like data augmentation, you know, to further increase the volumes. I mean, you know, machine learning models nowadays are so data hungry, you have to apply these techniques, and then look into the training. You've got lots of different model evaluation type procedures going on, and then you have a more formal evaluation stage, then kind of deciding on what sort of models you're interested in, or what trade-offs you're making, I guess, would be a better way to structure that. You're then looking at trying to compress these things. As Chris said, you know, we run on the edge; we can't have massive models that take, you know, tons of computation. They need to be really small, they need to be really, really fast. You know, we're not the only stuff on these devices, right? There'll be other sorts of functionality that these devices will have, and we're just one part of the stack, if you like. Yeah. And then that kind of bleeds into the deployment side of things.
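As a sketch of one of the splitting concerns Tom raises, the snippet below partitions recordings into train, validation, and test sets by session so that, for example, the same dog never appears on both sides of the split; the ratios and field names are illustrative assumptions, not Audio Analytic's pipeline.

```python
# Rough sketch: split files into train/val/test by session so the same source (e.g. the
# same dog) never leaks across sets. Ratios and field names are illustrative.
import random
from collections import defaultdict

def split_by_session(records, ratios=(0.7, 0.15, 0.15), seed=42):
    """records: list of dicts with a 'session_id' key; returns (train, val, test)."""
    by_session = defaultdict(list)
    for r in records:
        by_session[r["session_id"]].append(r)
    sessions = sorted(by_session)
    random.Random(seed).shuffle(sessions)
    n = len(sessions)
    cut1, cut2 = int(n * ratios[0]), int(n * (ratios[0] + ratios[1]))
    pick = lambda keys: [r for s in keys for r in by_session[s]]
    return pick(sessions[:cut1]), pick(sessions[cut1:cut2]), pick(sessions[cut2:])

records = [{"session_id": f"dog_{i:03d}", "audio_path": f"dog_{i:03d}_{d}.wav"}
           for i in range(20) for d in ("mic_a", "mic_b")]
train, val, test = split_by_session(records)
print(len(train), len(val), len(test))  # 28 6 6 files, with no session shared
```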
Chris Mitchell
0:32:39
In terms of optimizing that pipeline, in that sort of evaluation stage, one of the interesting things we've done recently is around something called the polyphonic sound detection score. We found that, like with any machine learning challenge or problem, broadly, you need to optimize for the right criteria, and the generic methods that were being used, just borrowed from machine learning, were not optimizing the systems appropriately, or the pipelines and everything else. So we released a bunch of GitHub code for this polyphonic sound detection score that's now being used. I think it's published at ICASSP this year, and it's being used as part of the DCASE competition, which is sort of the benchmarking world, the community standard, if you will, of sound recognition. So it's great to see the discipline grow out of its infancy into those more developed areas and move into what we start to call second generation sound recognition
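The released code is on GitHub; as a conceptual sketch only (not the released implementation), the core idea behind a PSDS-style metric is to score a detector across many operating points and summarise the area under the true-positive rate versus effective false-positive rate curve, rather than scoring a single threshold. The numbers below are made up.

```python
# Conceptual sketch of a PSDS-like summary: integrate TPR over effective false positives
# per hour across operating points, normalised by the maximum eFPR considered.
import numpy as np

# (effective FPs per hour, true-positive rate) at several detection thresholds
operating_points = [(0.0, 0.00), (5.0, 0.62), (10.0, 0.71), (30.0, 0.80), (100.0, 0.86)]

def psds_like_score(points, max_efpr=100.0):
    """Normalised area under the TPR vs eFPR curve up to max_efpr (trapezoidal rule)."""
    pts = sorted(points)
    xs = np.array([min(x, max_efpr) for x, _ in pts])
    ys = np.array([y for _, y in pts])
    return float(np.trapz(ys, xs) / max_efpr)

print(round(psds_like_score(operating_points), 3))
```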
Tobias Macey
0:33:57
And in terms of the models and the deployment of them, I'm wondering if you just deploy one model that works generically across all the different sound categories that you have collected, or if you train these models for specific deployment targets, where you have one that's specifically focused around security, where it has things like glass break or the sounds of, you know, a door being hammered on, and then you have a different model that you deploy that's focused on things like detecting coughs and sneezes and sniffles for health related environments.
Chris Mitchell
0:34:29
It's a great question. So we tend to think of the sound profiles, as we call them, which is one or a collection of sounds, broken up per device, because the value propositions tend to align at the device level. And then of course, you can run multiple sound profiles together. In terms of those configurations of those underlying models, it really comes down to the individual use cases that are applicable on the devices, but we try and make sure that the end set of sound profiles we're delivering is optimized for that set of use cases for that set of devices.
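A hypothetical configuration sketch of sound profiles bundled per device category, per the explanation above; the profile and sound names are illustrative, not Audio Analytic's product catalogue.

```python
# Hypothetical sketch: each device category ships only the sound profiles (collections
# of target sounds) its use cases need. Names are invented for illustration.
DEVICE_PROFILES = {
    "smart_speaker": {
        "home_security": ["glass_break", "smoke_alarm", "co_alarm", "dog_bark"],
        "wellbeing": ["baby_cry"],
    },
    "true_wireless_earbuds": {
        "awareness": ["car_horn", "siren", "bicycle_bell"],
    },
}

def sounds_for(device: str) -> list:
    """Union of target sounds across all profiles enabled for a device category."""
    profiles = DEVICE_PROFILES.get(device, {})
    return sorted({s for sounds in profiles.values() for s in sounds})

print(sounds_for("smart_speaker"))
```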
Tobias Macey
0:35:09
I imagine that helps, too, with restricting the size of the actual deployed model, rather than trying to fit everything into one thing and then compress that down to a size that's runnable on an embedded device; by training the model for a specific use case, you reduce the overall scope of what it needs to be able to handle, and thereby the size of the model that's being deployed.
Chris Mitchell
0:35:30
Yes, size is somewhat proportional to the number of sounds that you add, especially when you're dealing with a smaller number of sounds. As it grows, that sort of relationship becomes less distinct. In terms of, of course, if there are sounds that you will never come across and never need to detect on a device category, it would seem a bit silly from a computational point of view to burden that device with looking for those sounds. So yes, it definitely helps with that notion of structuring a sense of hearing in line with what needs to be heard for those device categories, a sort of just-enough type approach. That means that you have that good trade-off between computational smallness, or sort of resource size, and that sense of hearing. In fact, we did a recent set of demos at the Consumer Electronics Show this year in Las Vegas, in our private demo suites, which showed that we were capable of running a sense of hearing right down onto an M0+ processor, which is the smallest grade of processor that Arm do, so really showing that you can push that sense of hearing down onto incredibly
Tobias Macey
0:36:46
small processors. And then in terms of being able to actually evaluate and test the models that you're working with, I'm wondering what you have found as far as their capability of operating in noisy and complex and layered audio environments, and how the overall testing of the model has fed back into your strategies for collecting the source data.
Chris Mitchell
0:37:08
Well, I'll take the top piece, and then Tom, if you want to relate it back to the source data piece. So in terms of that feedback, generally, because we've got the world's largest collection of data for this area, we have a high degree of certainty with the models we're providing to the marketplace already. You know, we have large amounts of, say, 24/7 recordings, large amounts of environment recordings, and obviously large amounts of target sounds in environments. So we typically find that we're, you know, pretty good in our guesses of what the performance will be for a new sound profile that we're producing. In terms of the sort of things we learn, going back to that example of things that you just can't predict, you know, I was using the example of the bird in the south of France and the North American smoke alarm; that is something beyond the wit of man to sit in a room and figure out. You're only going to get that sort of insight from the actual field deployments; our technology is deployed in some 160 countries worldwide, so we've got a very good sense of the sorts of problems that are faced on a worldwide scale. In terms of how that feeds back, obviously it feeds back into, do we need more data in a certain area? But Tom, you're probably best placed to pick back up that feedback loop to the beginning of the pipeline, the data collection piece.
Thomas Le Cornu
0:38:32
Yeah, I mean, like Chris says, it's beyond the wit of man to try and think of all these things. I mean, when we start these projects, we make sure we sit in a room and think about all the stuff that's going to try and bite us in the backside, and you just never think of everything; there are just too many bizarre things. And your head is so focused on one aspect of looking at the problem that, you know, you'll be completely blindsided by another, even though you spent all that time trying to figure it out. So yeah, the delivery of these products, you know, is very iterative. You'll develop something and, you know, get it deployed, and then you realize you have these sorts of issues, and then, I think, often, going back to data collection is one of the important ways to address these problems. I think you can consider as well, for example, you know, talking about how you actually do the evaluations: as Chris touched on, the 24/7 sort of use case. For a lot of our products, we just have absolutely tons and tons of data that is just that kind of use case; it's like a product in a room, and it's just recording all the time, and we'll have, you know, many different examples of that. And so hopefully you try and identify some problems early on in your evaluation, in your product development cycle, by doing tests like that. And in terms of the actual problems, that's obviously, again, focusing more on the false positive area of the evaluation. And then, in terms of focusing more on, you know, how well we actually do with detecting the sounds we're interested in, we mentioned the anechoic chamber earlier, and that really allows us to have a sort of green screen for sound. You know, like you said, you get this layering of background sounds and so on, and really that's a good way of going about it. Again, let's go back to the dog bark analogy: if you want to know whether a particular model is going to detect a dog bark in a particular room with, say, tile flooring, and it's a really reverberant environment, and someone's hoovering in the background, and, you know, whatever else, maybe the device is five meters away or something, you have to literally test for that exact scenario. And so, you know, by having these green screens, we can test that. And it's looking at this stuff iteratively, evaluating, you know, often, and then feeding it back in and seeing where you can make improvements.
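A minimal sketch of the "green screen for sound" idea: mix a cleanly recorded target with a background bed at a chosen signal-to-noise ratio to synthesise a test scenario. A real pipeline would also convolve with a room impulse response for the reverberant room; the parameters here are illustrative assumptions.

```python
# Minimal sketch: scale a background bed so the mixture has a requested target-to-
# background SNR, then add it to a cleanly recorded target. Signals are synthetic.
import numpy as np

def mix_at_snr(target: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Return target + scaled background, at the requested SNR in dB."""
    background = background[: len(target)]
    p_target = np.mean(target ** 2)
    p_background = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(p_target / (p_background * 10 ** (snr_db / 10)))
    return target + gain * background

rng = np.random.default_rng(1)
bark = rng.standard_normal(16000)          # stand-in for 1 s of target audio at 16 kHz
hoover = rng.standard_normal(16000) * 0.3  # stand-in for the background bed
noisy_scene = mix_at_snr(bark, hoover, snr_db=0.0)  # 0 dB: target and background equally loud
```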
Chris Mitchell
0:41:10
What we find, though, I think, is that the experience we now have with doing these things means that internal iteration speeds up and speeds up and speeds up. So we're producing sounds at an increasing rate of knots, and those sounds are being produced in their first iterations internally at higher quality, just because obviously we're starting from a higher place. So we've learned a lot, and that iteration cycle means that we can now iterate internally very quickly, so that when we do release that product out into the marketplace, the customer can have high assurance that it's already working from a sound recognition perspective, and not thinking, I'm sort of getting a bit of a product, but I'm going to have to feed back data to improve it, because clearly that's
Tobias Macey
0:41:54
not going to be acceptable to their customers. And in terms of the audio that you're working with, what are some of the most interesting or unusual or strange sounds that you've had to try and collect and categorize?
Chris Mitchell
0:42:06
Well, I'll do three stories, as they're strange, just sort of strange experiences of capturing the data. So to do that one first: we did some gunshot recordings, and we oddly enough chose to do them in the UK, and machine guns anywhere in the world are not easy to come by, but in the UK they're particularly challenging to come by. We managed to arrange a set of machine guns to be recorded; one of them was an Uzi submachine gun. And we had a rental van to go down to, there are only two civilian sort of automatic ranges in the UK, and we had a rental van to go down there. I remember sitting there with a guy, and he laid out the guns, and he explained that we had to move our rental van, and I said, well, why? It seems to be nowhere near the targets. And he said, well, the Uzi doesn't so much fire bullets as sort of direct them vaguely in that area. So you want to move it. I'm pretty sure that would have caused us a whole range of fun, around at least the deposit on the rental, let alone the explanation to the police that I'm sure would have come. So that's on the sort of strange data collection experience side. In terms of strange sounds, Tom has a good one too. Tom, what's your favorite strange sound that you've done the data collection for so far, that you're able to talk about, obviously?
Thomas Le Cornu
0:43:31
Yeah, for sure, for sure. Well, yeah, again, trying to think of individual sounds: I mean, I guess, you know, you come across odd sounds when you've got 24/7 data; you might come across stuff like geese honking, which is always quite an interesting one. In terms of weirdness in terms of collection, for me, I guess it'll be the outdoor glass break collection we did, where, I guess, perhaps similar weirdness to Chris's thing, but, you know, essentially we were tasked with just being on the street and smashing people's windows, of course with permission and everything else. And we would pay people; we were saying that, you know, we'll give you a good amount of money so you can replace your windows, but we'll break them for you. And it was a really, really bizarre experience. I mean, you know, you're literally in this residential area, and we'd put on fluorescent jackets so we kind of looked like we were meant to be there, with all these different microphones and all these booms and stuff and all these wires going around, and someone smashing these windows with a sledgehammer and this sort of stuff. Yeah. So, you know, you kind of work for a very interesting company when you're tasked with something like that. But yeah, that's definitely weird.
Chris Mitchell
0:44:50
The one I'll talk about is we did a car alarm data collection exercise, and we'd done the sort of in-car fitted models, but we were looking at the retrofit car alarms, and obviously wanted to know what they sound like when they're fitted into the car. So we had a board in the back of a car with these car alarms on it, and we'd collected most of our data, but there was one challenge left, which was to go to a very reverberant real environment and collect them in there, because we just wanted some, you know, sort of high quality reverberant recordings. And I remember sitting in a multi-storey car park with a car, and these things are very loud as well, so you want to go a bit of a distance away, because obviously the car alarm is unlikely to go off while people are physically sitting in the car, which would dampen the sound. So we're sitting away from this car with a wire sort of trailing from the car, with a big button that sets off all the car alarms. And then somebody pointed out this might look a bit strange to the security guards watching us on the inevitable CCTV cameras, whether this was some bizarre Italian Job kind of blow-the-doors-off type piece or not was not entirely clear, and then realizing we'd better finish our experiment up rather quickly before we caused anybody a whole bunch of hassle. I think that one gets interesting because the recording of audio itself presents a whole range of challenges from a data privacy, data ethics perspective. So if you try and do audio recordings in public, there's a whole set of rules that govern that. Obviously, there are increasing rules, and a direction of travel, around what data you should be able to use in machine learning models, both legally and ethically, you know, from GDPR and the things that are going through the various stages of legislation. And even if you do something basic, like you want to record a park, when you try and square that with the various laws, you need to get permission to do that. But it's a public park, so who do you get permission from? You know, you're drawn into really quite fundamental challenges if you want to make sure that not only is your data good today to be training off, but, as the rules change and evolve and society understands what it wants to do with machine learning, that data you base your models on doesn't start to be eaten away at, so that you can't include it, or it's decided that you didn't have the right permissions, or the traceability around the data and what's gone into it isn't clear. You know, you don't want to fall foul of any of those things. So the effort we go to, to make sure that we've got that complete chain of evidence around our data, is quite extreme, and that's been ingrained in us from day one. It's always been expensive and time consuming to do, but that choice has paid off quite substantially with the direction that the machine learning world has gone. And what are some of the other interesting or unexpected or challenging lessons that you've learned in the process of building out the technology and business aspects of Audio Analytic? On the business side, we license per sound, per device, and the closest thing in the speech world would be, presumably, licensing either wake words, if that was the equivalent, or languages.
So there weren't any directly comparable models, because those two are clearly different in various ways. In terms of what followed, we've had to go out and explain to people why sound detection is extremely valuable if you're designing products, and why it gives that very intuitive sense of the device behaving as you would expect it to. But that's been done from the ground up, and it was a new area for product owners to experience, which means that how they judge how we should license it, how they judge the commercial value, where the value is realised, and things like that, has been part of that journey. And it's been fascinating to experience how people perceive, and how they think, the world of sound should be monetised, I suppose.
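To make the chain-of-evidence idea Chris describes more concrete, here is a minimal sketch of the kind of per-recording provenance record a data collection pipeline might keep. The field names and values are purely illustrative assumptions, not Audio Analytic's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecordingProvenance:
    """Illustrative 'chain of evidence' record for one captured audio sample.

    Hypothetical fields only; not Audio Analytic's actual schema.
    """
    file_id: str
    sound_label: str            # e.g. "glass_break"
    recorded_at: str            # ISO 8601 timestamp of the capture
    location_description: str   # where the capture session took place
    consent_reference: str      # pointer to the signed permission / release form
    license_terms: str          # usage rights granted for model training
    equipment: List[str] = field(default_factory=list)
    processing_steps: List[str] = field(default_factory=list)  # what was done to the raw audio

# Hypothetical record for one sample from an outdoor glass-break session.
record = RecordingProvenance(
    file_id="rec-000123",
    sound_label="glass_break",
    recorded_at="2019-06-14T10:32:00Z",
    location_description="residential street, outdoor capture session",
    consent_reference="consent-forms/homeowner-042.pdf",
    license_terms="training and evaluation of sound recognition models",
    equipment=["boom microphone", "handheld recorder"],
    processing_steps=["trimmed silence", "labelled onset/offset"],
)
```

Keeping this kind of record alongside every file is one way to answer, years later, whether a given sample can still legitimately sit underneath a trained model.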
Tobias Macey
0:49:03
And so as you continue to build out the technical and business capacity for the company, what are some of the plans that you have for the future, either in terms of new problem spaces or use cases that you're looking to reach into, or enhancements to your existing processes of data collection and machine learning?
Chris Mitchell
0:49:23
Our roadmap is largely organised around which sounds for which use cases, and we constantly add to that as we expand our ability to differentiate sounds, and obviously as the target devices and their capabilities expand. In terms of future stuff, we did a blog post recently on our website — if you go to audioanalytic.com — on what we call context systems. Context systems look to not only detect individual sounds, but to draw larger inferences from them. The example used in that blog post was: it sounds like you're leaving the house. So, Tobias, if someone else in your house is leaving and you've not reminded them to get teabags, or whatever other small items you forgot to mention, you'll be sitting in your living room, able to hear it if it's within earshot, and you'll know what leaving the house sounds like and what preparing to leave the house sounds like. But it's not a single sound — it's a collection of sounds in a certain sequence, put into a broader context. That sort of higher-level reasoning requires even more sophisticated approaches to sound recognition, and it forms some of the foundations of what we mean by second generation. First-generation sound recognition systems cover a small range of sounds, typically safety and security applications, and typically trigger events through to the end consumer — so there might be push events to a mobile phone: I've heard somebody break into your house, I've heard your smoke alarm going off, that sort of thing. Second generation is about many more sounds, covering not just safety and security but entertainment, health and wellbeing, and the introduction of these context systems that start to use that fundamental understanding of the individual sounds, or scenes, the device is hearing, and build higher-level inferences on top of them. That's an exciting world to see unfold, as that sense of hearing becomes more and more real — in line with what you and I would call the sense of hearing — so these devices can be more and more intelligent.

Tobias Macey
And I'm also wondering, given the fact that you have collected this large volume of data, and it's well labelled and highly valuable for building your models, and that's what you're licensing, whether you've also gone down the path of licensing out the actual source data for other people who are looking to take advantage of it for other use cases.

Chris Mitchell
No — commercially, people typically ask for the models. We've got, as you heard, Alexandria, which is the world's largest collection of its kind: about 15 million audio files, 700 label types, and about 200 million pieces of metadata, which helps steer where you put your research from an AI perspective. Coming back to the points I mentioned about the fundamental architectural blocks from a machine learning point of view: you just can't use something like a language model or a speech acoustic model — that would materially impact your performance, and it means that if you just took off-the-shelf machine learning technology, your performance wouldn't be good enough for the vast majority of sound recognition tasks. So we've used that data to steer our research and come up with our own inference engines that are specifically designed for sound recognition. So those two pieces tend to go hand in hand.
Because even if you have the data, you'd still need to direct all of that research effort to get to a point where you have a well functioning, you know, sort of second generation sound recognition system. So generally people want to license the output of that and take advantage of both of those pieces.
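As a rough illustration of the kind of higher-level inference a context system might draw from a sequence of individually recognised sounds, here is a minimal sketch. The sound labels, the "leaving the house" pattern, and the matching logic are hypothetical assumptions made for illustration, not Audio Analytic's actual approach.

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    label: str        # output of a per-sound recognizer, e.g. "keys_jangling"
    timestamp: float  # seconds since the start of the audio stream

# Hypothetical pattern: these detections must occur in order, within 120 seconds.
LEAVING_HOUSE_PATTERN = ["keys_jangling", "footsteps", "door_open", "door_close"]

def matches_pattern(events, pattern, window_s=120.0):
    """Return True if the pattern labels appear in order within the time window."""
    idx = 0
    start_time = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.label == pattern[idx]:
            if start_time is None:
                start_time = event.timestamp
            if event.timestamp - start_time > window_s:
                return False
            idx += 1
            if idx == len(pattern):
                return True
    return False

# Example stream of detections from a first-generation, per-sound recognizer.
stream = [
    SoundEvent("keys_jangling", 10.0),
    SoundEvent("footsteps", 25.0),
    SoundEvent("door_open", 40.0),
    SoundEvent("door_close", 44.0),
]
if matches_pattern(stream, LEAVING_HOUSE_PATTERN):
    print("Context inference: it sounds like someone is leaving the house")
```

A production system would of course reason probabilistically over many noisy detections rather than matching a fixed sequence, but the sketch captures the shift from recognising single sounds to inferring what is happening from their combination and order.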
Tobias Macey
0:53:41
Are there any other aspects of the work that you're doing at Audio Analytic, or the overall space of sound detection, that we didn't discuss that you'd like to cover before we close out the show?
Chris Mitchell
0:53:51
No, I think you're good. Apart from that, I'd love to send you over some bizarre audio files — we'll see if we can find you some — just because they're fun.
Tobias Macey
0:54:03
Well, maybe we can intersperse some of those throughout this conversation just to give people something to entertain themselves with. For anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Chris Mitchell
0:54:26
For data management, my answer is obviously going to be biased because we're solely focused on audio, so most of my concerns are around the audio side. I think there's a realisation that the machine learning frameworks have come on, the data pipeline side has come on, and the data management side has come on generically, but now we're into the specialisms, and the specialisms require us to extend what are hopefully extensible frameworks. I'd love to see more of these frameworks becoming more extensible and more modular, so we only have to roll our own pieces in the areas where that makes sense. A lot of the time we come across tools that just don't work well together, so even though there are good bits of functionality in tool A and good bits of functionality in tool B, you still need to specialise them to the task at hand, and those two tools aren't as flexible as you'd like. So if there's one area: recognising that each individual task within machine learning has its own set of challenges, acknowledging that at a fundamental tool level, and baking it into the architecture of tools that could go across the industry would be incredibly valuable.
Tobias Macey
0:55:41
And Thomas, how about you?
Thomas Le Cornu
0:55:43
Yeah, I'd just echo Chris's answer, to be honest — that's my view too. As soon as you get into something that's a bit more niche than the mainstream applications for a lot of these tools, you just have to start rolling your own. And I'd say that applies right across the spectrum of wherever you apply machine learning, really — that's been my experience elsewhere as well.
Tobias Macey
0:56:04
All right, well, thank you both very much for taking the time today to join me and discuss the work that you're doing at Audio Analytic, and for sharing some of your interesting stories about collecting this audio data. It's definitely a very interesting use case and an interesting problem domain that you're working in, so I appreciate all of the time and effort you've put into it, and the time that you've spent sharing your experiences with me. I hope you enjoy the rest of your day.
Chris Mitchell
0:56:26
Great, Tobias. Thanks for having us on the show — it's been great. I look forward to listening to future episodes as you go forward.
Thomas Le Cornu
0:56:34
Yeah, cheers, Tobias. Thanks so much.
Tobias Macey
0:56:42
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.