Summary
Industrial applications are one of the primary adopters of Internet of Things (IoT) technologies, with business critical operations being informed by data collected across a fleet of sensors. Vopak is a business that manages storage and distribution of a variety of liquids that are critical to the modern world, and they have recently launched a new platform to gain more utility from their industrial sensors. In this episode Mário Pereira shares the system design that he and his team have developed for collecting and analyzing sensor data, and how they have split the data processing and business logic responsibilities between physical terminals and edge locations, and centralized storage and compute.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Mário Pereira about building a data management system for globally distributed IoT sensors at Vopak
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Vopak is and what kinds of information you rely on to power the business?
- What kinds of sensors and edge devices are you using?
- What kinds of consistency or variance do you have between sensors across your locations?
- How much computing power and storage space do you place at the edge?
- What level of pre-processing/filtering is being done at the edge and how do you decide what information needs to be centralized?
- What are some examples of decision-making that happens at the edge?
- Can you describe the platform architecture that you have built for collecting and processing sensor data?
- What was your process for selecting and evaluating the various components?
- How much tolerance do you have for missed messages/dropped data?
- How long are your data retention periods and what are the factors that influence that policy?
- What are some statistics related to the volume, variety, and velocity of your data?
- What are the end-to-end latency requirements for different segments of your data?
- What kinds of analysis are you performing on the collected data?
- What are some of the potential ramifications of failures in your system? (e.g. spills, explosions, spoilage, contamination, revenue loss, etc.)
- What are some of the scaling issues that you have experienced as you brought your system online?
- How have you been managing the decision making prior to implementing these technology solutions?
- What are the new capabilities and business processes that are enabled by this new platform?
- What are the most interesting, innovative, or unexpected ways that you have seen your data capabilities applied?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an IoT collection and aggregation platform at global scale?
- What do you have planned for the future of your IoT system?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Vopak
- Swinging Door Compression Algorithm
- IoT Greengrass
- OPC UA IoT protocol
- MongoDB
- AWS Kinesis
- AWS Batch
- AWS IoT SiteWise Edge
- Boston Dynamics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Mário Pereira about building a data management system for globally distributed IoT sensors at Vopak. So, Mário, can you start by introducing yourself?
[00:02:07] Unknown:
Hi. First, thank you for having me. So my name is Mário. I'm from Portugal, but I'm currently living in the Netherlands. I work as a senior engineer at Vopak.
[00:02:16] Unknown:
And do you remember how you first got started working in data?
[00:02:19] Unknown:
So first, I started in tech startups. I worked at OutSystems, it's a low-code platform, and then I worked at MessageBird, it's a Twilio competitor. And I worked mostly on the cloud data infrastructure part, and now I've started working on the edge data infrastructure. So it's around 10 years that I've been doing this, but mostly using cloud technology.
[00:02:44] Unknown:
So the company that you're at now is Vopac, and I'm wondering if you can just describe a bit about what that business is and the types of information that you rely on to be able to power the business.
[00:02:55] Unknown:
So Vopak is the leading independent liquid storage company. It's a Dutch company, 1 of the oldest Dutch companies; it's around 400 years old. The business, what they do: Vopak stores vital products, so everything that is vital for mankind. It could be oil and gas, industrial chemicals. Currently, Vopak has 47, sorry, 73 locations. When I say locations, it's plants, facilities, terminals, which store the oil and gas, in 23 countries. The information that powers the business is the status of the liquid inside of the tanks, the flow meters, the state of the pumps, the trucks, the information that is inside of the truck.
So all this information that is on-site powers the operational part of the business, the supply chain of the business, and maintains the operability of the business itself, the facility itself. So what we do at Vopak, and I will explain later on, is how we ingest this data and transform this data at the edge in a way that powers the business.
[00:04:07] Unknown:
And you mentioned that you've got a few different categories of locations and different types of information that you're collecting. I'm wondering if you can talk through the different types of sensors and edge devices that you're relying on, and what types of consistency you might have within a particular category of location, and the degree of variance that you're dealing with across your overall fleet of sensors, and the sort of matrix of complexity that you're dealing with?
[00:04:33] Unknown:
So we have 73 locations. Currently live in production, we are at almost 20, 18 right now. And every location is different, totally different, with different sensors, different edge devices. So currently, what we do at the edge, we ingest the data from flow meter sensors, so industrial sensors, not your normal sensors. Flow meters, so we know the speed of the liquid; temperature; the level of the tank; weighbridges, so we can weigh the trucks; license plate readers, so we can ingest the data from license plates.
This is funny: from drones. We have drones that inspect the tanks. We have robots that inspect the tanks, cameras, in this case image cameras. Everything that has a data point, everything that has a sensor, we grab it, we ingest it. And then this data is processed at the edge before sending to the cloud, regardless of the variance. What happens is that, for example, we have a tank, and we are pumping the liquid inside of the tank. You don't want to send all data to the cloud. Maybe you only want when it starts, at the middle, then at the end, or you need to let it get stable, flat, right? In this case, we use what they call a dead band, and on some other sensors, we use what they call the swinging door compression algorithm.
What happens is that this algorithm cuts values, to some degree, that are in the middle and which you don't need. So, for example, we are pumping, and the value is constantly changing. We only keep some of these values, the values that really change, because it can go to a value of 6 digits, and you don't want all of this big value, so we cut it. This is all done at the edge. What we also do at the edge with the data is something like this, but it's complex event processing: based on some data change, we trigger another change.
So, for example, if a truck arrives at a weighbridge, automatically we get the license plate, and, automatically, we get the weight of the truck. And then, based on the billing, we are able to open the gate or not for the truck to come in. So there is a lot that we do at the edge before sending to the cloud.
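To make the dead band and swinging door ideas concrete, here is a minimal Python sketch of both filters. This is an illustrative reading of the techniques as described, not Vopak's actual edge code; the `(timestamp, value)` tuple format, parameter names, and thresholds are assumptions.

```python
def deadband(samples, band):
    """Yield a sample only when it moves more than `band` from the last kept value."""
    last = None
    for t, v in samples:
        if last is None or abs(v - last) > band:
            last = v
            yield t, v

def swinging_door(samples, dev):
    """Swinging door trending: keep only the points needed to reconstruct
    the series within +/- dev. Minimal sketch; assumes strictly increasing
    timestamps, and a real implementation would also flush the final point."""
    samples = iter(samples)
    archived = next(samples, None)   # last point we decided to keep
    if archived is None:
        return
    yield archived
    prev = archived
    slope_hi, slope_lo = float("inf"), float("-inf")  # the two "doors"
    for t, v in samples:
        dt = t - archived[0]
        slope_hi = min(slope_hi, (v + dev - archived[1]) / dt)
        slope_lo = max(slope_lo, (v - dev - archived[1]) / dt)
        if slope_lo > slope_hi:      # doors crossed: previous point must be kept
            yield prev
            archived = prev
            dt = t - archived[0]
            slope_hi = (v + dev - archived[1]) / dt
            slope_lo = (v - dev - archived[1]) / dt
        prev = (t, v)
```

For a steadily changing tank level, for example, both filters drop most of the intermediate readings and keep only the points where the trend actually changes.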
[00:07:16] Unknown:
As far as the selection of sensors and the deployment of sensors, I'm wondering how much control or oversight you have as the person who's responsible for dealing with the data that they're collecting versus just having to deal with accepting whatever the terminal manager happens to decide is the most economical or, you know, best option for them to be able to install and maintain in their physical locations?
[00:07:43] Unknown:
In this case, it really depends on which sensor it is, where it is, and which type of terminal. For example, and this is public, we recently invested in a company that provides the sensors, industrial sensors, and the reason is because in the industrial field, in our field, a lot of the sensors need to have specific certifications, for example the ATEX certification, because these are places which have hazardous conditions. The majority of the time, as the edge team, what we ask, where we have some say, is that we would like to have a sensor that connects to a DCS. And then from the DCS, we would like to have the data from that sensor.
DCS, it's a common term from the industrial world; it's a distributed control system. Sometimes we're able to say, hey, if this is a new terminal, bear in mind... but changing these sensors is not something that you do every day, okay? You are not gonna build terminals every month. We are talking about facilities that are 30, 40 years old, and so when you build it, you build it to last. Some of our terminals are, like, a 100 years old, you know? Of course, you constantly upgrade your sensors and your terminals, but it's not something that you upgrade or change every day. Yeah, regarding the selection of the sensors, that's what we do in this case.
[00:09:14] Unknown:
And then as far as the general capacity for computation and storage, both within the sensors and at the edge locations where those sensors are operating, I'm wondering if you can talk, at least in terms of orders of magnitude, about what kind of compute and storage and processing capacity you have to be able to push more of your logic to that edge location so that it doesn't all have to be centralized.
[00:09:38] Unknown:
Currently, in general, we are processing 65,000,000 events per day at the edge, and in the cloud, we are storing around 8,000,000,000 events. For the computing of this, I can already tell you, we are using IoT Greengrass from Amazon, and IoT Greengrass from Amazon allows us to run Lambdas, which you'd run in the cloud, on prem, inside of a Raspberry Pi. So for the computing power of this, you don't need a really big machine, okay? You can run this on a Raspberry Pi. From a computing power perspective, it's a very low spec computer that we have there. Currently, we run it on a physical server in a rack, and that's because of compliance.
Once again, because of the ATEX certification, I cannot just put the server or the Raspberry Pi inside of a tank or near a tank, because, you know, it can explode. So it's a very low spec computer which processes all that information. And this is because we use AWS Lambda at the edge, so on-site in this case, because we use AWS IoT Greengrass.
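As a rough illustration of what running Lambdas at the edge looks like, here is a minimal Greengrass v1 style Python handler; a sketch under assumptions, not Vopak's code. The topic name and payload shape are hypothetical, and `greengrasssdk` is the SDK Greengrass exposes to Lambdas for publishing to the local core.

```python
import json

import greengrasssdk

# Client for the local Greengrass core's message router.
iot = greengrasssdk.client("iot-data")

def handler(event, context):
    # Greengrass hands JSON payloads to Python Lambdas already deserialized;
    # guard anyway in case a raw string arrives.
    reading = json.loads(event) if isinstance(event, str) else event
    # ... dead-band filtering / enrichment would happen here ...
    iot.publish(topic="terminal/normalized", payload=json.dumps(reading))
```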
[00:10:48] Unknown:
As far as the processing that you're doing at the edge with that Greengrass service, what are some of the types of filtering and preprocessing and logic that you are pushing out to that edge location in order to determine what events and what aggregate data you actually want to centralize?
[00:11:06] Unknown:
To give a bit more information on that question: in order to ingest the data, first, we use OPC UA. OPC UA is a machine-to-machine industrial protocol. So what we do, we connect to an OPC UA server, we have a client, and we ingest this data. The data that comes in is very hard to understand from an IT, software engineering, BI, or data science perspective. For example, the temperature from a tank, as raw data, comes in as a tag like CV337.CV37. So what we do at the edge, besides the filtering with the dead band, as I told you before, is we add context information.
For example, and this depends on the terminal, by the way: which type of sensor it is, which type of measurement it is, which unit of measurement it is. So we have much more information, and then we have some business-related information, like the specific location of the sensor, and timestamps. We have the server timestamp, we have the sensor timestamp, we have the ingestion timestamp, the transformation timestamp. So at the edge, besides filtering, we are adding information, transforming the information at the edge. So, giving already a little bit of the architecture: we have IoT Greengrass, and then we have 3 Lambdas.
Okay? So 1 Lambda runs the OPC UA client, which ingests the data. This data, data that is very hard to understand, depending on the terminal, can come on change, so every time there is a new value, we get it, or on request, on batch; for example, we can call the OPC UA server every minute or every second and see if the value changed or not. Then we have a Lambda that transforms the data, that gives the context: which type of sensor it is, in which location, which unit of measurement it is using, the method that is being used to measure, okay? Because a tank, it's like a glass of water, right? It could be half full or half empty, and this can change on the tank; the sensor could be measuring if the tank is half full or half empty. And then we add this information, as I was telling you. We transform this information, and then we have a Lambda that stores the data locally.
And the normalization Lambda, transformation Lambda is what we call it, sends the data to the cloud. In the cloud, we store it in MongoDB, but going a little bit back, we have AWS Kinesis that streams the data to an S3 bucket for the BI and data science teams, and we have AWS Batch, which batches the data every day for other departments to use. So the complexity of things, the transformation, as you asked, is done at the edge, and filtered at the edge, using the normalization Lambda. The ingest Lambda ingests everything.
The normalization Lambda transforms, and then we have 1 that stores locally in case we need it; for example, if the Internet fails, we are able to cache. By the way, I didn't mention it, but we are able to cache the information for around 1 hour. But if the Internet fails for a longer period, we may need to go to the local data storage. So this data is then, as I told you, sent to the cloud through the normalization Lambda, regarding your question on transformation and filtering.
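To illustrate the contextualization step, here is a hedged sketch of what a normalization Lambda like this might do: map a cryptic OPC UA tag onto an event enriched with sensor type, unit, location, and the several timestamps mentioned above. Every field, tag, and site name is an illustrative assumption.

```python
from datetime import datetime, timezone

# Hypothetical context table; in practice this kind of information comes
# from the terminal's configuration, not from code.
TAG_CONTEXT = {
    "CV337": {
        "sensor_type": "temperature",
        "unit": "degC",
        "measurement_method": "half-full",
        "asset": "tank-12",
        "site": "terminal-example",
    },
}

def normalize(raw):
    """Turn a raw OPC UA sample into a contextualized event (sketch)."""
    ctx = TAG_CONTEXT.get(raw["tag"])
    if ctx is None:
        return None  # unknown tag: a real pipeline would dead-letter this
    return {
        "value": raw["value"],
        "sensor_timestamp": raw["source_ts"],     # from the device
        "server_timestamp": raw["server_ts"],     # from the OPC UA server
        "ingestion_timestamp": raw["ingest_ts"],  # when the client read it
        "transformation_timestamp": datetime.now(timezone.utc).isoformat(),
        **ctx,
    }
```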
[00:14:49] Unknown:
To the point of the network failures and the potential for not being able to receive data in the central location for, you know, up to an hour or more. I'm wondering what the tolerance is for those types of latencies and some of the ways that you're able to mitigate issues due to that lack of central coordination by pushing more of the logic to the edge for any types of decision making that has to happen in a more real time fashion?
[00:15:19] Unknown:
So as I told you, we can cache for 1 hour. As a team, we try to keep it to a maximum of 50 minutes of data; we shouldn't lose more than 50 minutes of data, okay? So we cache 1 hour of data. If everything goes down in a location, then even the industrial system goes down, and that's very hard, because these industrial systems are made to keep running forever, okay? They are very redundant, and our system is very redundant too. We have 2 ingest instances; in case 1 fails, we can use the other 1. Even if that fails, there are safety mechanisms. We currently only ingest; we don't write down, so we don't give instructions yet to the sensors. Okay?
Maybe later we will give instructions to the sensors, but today we only ingest. So if there is a failure, everything fails on-site, not only us. What I'm trying to say is, it's very hard to have something that really, really, really fails. For the Internet, we have a double connection. So, yeah, when we built it, we built it thinking about scalability and high availability.
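The local store plus roughly 1 hour of caching amounts to a store-and-forward buffer. Here is a minimal sketch under assumed semantics: an in-memory buffer and a `send` callable that raises `ConnectionError` while the uplink is down.

```python
import collections
import time

class StoreAndForward:
    """Buffer events while the uplink is down; replay in order once it returns."""

    def __init__(self, send, max_age_s=3600):
        self.send = send              # pushes one event to the cloud
        self.max_age_s = max_age_s    # ~1 hour cache window
        self.buffer = collections.deque()

    def publish(self, event):
        self.buffer.append((time.time(), event))
        # Evict anything older than the cache window, oldest first.
        while self.buffer and time.time() - self.buffer[0][0] > self.max_age_s:
            self.buffer.popleft()
        self._flush()

    def _flush(self):
        while self.buffer:
            _, event = self.buffer[0]
            try:
                self.send(event)
            except ConnectionError:
                return                # uplink still down; keep buffering
            self.buffer.popleft()
```

Longer outages fall through to the local data store, as described above.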
[00:16:31] Unknown:
In terms of the retention periods that you have, I'm curious what the utility is of this sensor data as you move beyond the time horizons of maybe an hour or a day and into weeks, months, years, and what types of longitudinal analysis you are looking to be able to do over those longer time periods versus just saying, you know, this particular category of sensor data is useless to me past 5 minutes?
[00:17:01] Unknown:
So at the edge, we store for 2 months, and the main reason is because if we lose data or something, we are able to fetch it again, or if a local application wants to access it in case of emergency, it's there. In the cloud, as I told you, it goes to MongoDB, okay? And then we have an API on top of it. The majority of this data, it goes... I didn't mention this, but Vopak is going through a massive digital transformation, and the reason that edge, this product, was built is because during the digital transformation, we found out that we store more than oil and gas and other liquids.
We store vital data too, okay? That is very important for us and for customers and partners. So the API provides data like end-of-day reporting, or give me the current value right now. And this data, the majority of the time, is called by our ERP systems, CRM systems. Even our customers, and this is 1 of the good value adds, are currently able to see the state of their tanks. We have a mobile app, a Power App, where the customer can see the data; not only the people on-site, the operations, but the customers too. So they can see the status, what is happening at the terminal. And then the longer-lived data, we store because we have a data science team, okay, and a machine learning team besides the BI team. So what happens is we store it in the bucket, and, of course, on the bucket we have a lifecycle, okay? And this lifecycle is coordinated with the data science team for when they train their data models. Currently, they do predictive maintenance, and, for example, we do analytics over alarms and events.
Alarms and events is a type of communication protocol that exists in the industrial field or domain. For example, if you are having a leak, it triggers a data point, or a pump is too busy, that triggers a data point, and then we do analytics over this. Our data science team does asset performance optimization, in order to understand why an asset doesn't work as well. We do loading and offloading optimizations. We have these jetties and these arms that are used to pump the liquid, and we do optimizations on them. And this data, of course, needs to be stored for many, many years. You know, when you train your data model, you need a lot of data, and good data, to train it. For example, for the maintenance, you need, for example, vibration.
Vibration in the spectrum is 1 of the things that everyone uses; it's easier to detect. But for that, you need a lot of data. So, currently, we are storing all data in S3 buckets for that, but bear in mind that we started building this in 2018, so we only have data since 2018. And probably, to have more confidence in our data models, we're gonna need even more data, from more terminals, for more years.
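The coordinated bucket lifecycle he mentions can be expressed as an S3 lifecycle configuration. Below is a sketch using boto3; the bucket name, prefix, and transition tiers are illustrative assumptions, not Vopak's actual policy.

```python
import boto3

s3 = boto3.client("s3")

# Keep recent data hot for BI, then tier older training data to cheaper
# storage classes instead of deleting it (models need years of history).
s3.put_bucket_lifecycle_configuration(
    Bucket="sensor-events-example",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-sensor-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```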
[00:20:12] Unknown:
You mentioned that this overall effort of building this full end to end IoT platform is a fairly recent exercise, and I'm curious. You know, you mentioned that some of these terminals have been in operation for on the order of a 100 years, and I'm sure that, maybe not for the full 100 years, but at least on the order of decades, they've had some sort of electronic sensors, and they've been doing something with that. And I'm wondering what the existing situation looked like before you started this project and what some of the evaluation and planning looked like as you were deciding what is an optimal architecture given the existing physical investments that we have and the constraints that we have as far as access, power supply, and reliability, you know, at these various locations, and just what that evaluation and planning looked like for the technical selection and architecture design, and how you have been thinking about being able to reverse engineer this end to end system on top of physical limitations that have been in existence since before you became involved?
[00:21:19] Unknown:
What happened here is that the industrial IoT, not only IoT, but more precisely industrial IoT, the field is very fragmented, okay? You have a lot of companies that are building industrial IoT platforms: ABB, Amazon, Yokogawa, Honeywell. It's so fragmented currently. You have 1 company that is very good at ingesting. You have another company that is very good at transforming at the edge, but is not so good at ingesting. You have another company that is very good at doing the predictive maintenance. For example, you have FogHorn, which is very good at the edge, doing machine learning at the edge, AI at the edge. During the selection period, we went through multiple suppliers and providers that were already being used on the terminal, and we gave them, like, a problem: we have this problem, and we want to solve it. It's like an MVP.
We gave it to AWS, and to AWS partners, in this case all these companies, for example, PTC with Kepware, Siemens. We gave them this problem and said, hey, show us how we can solve this; do a demo, a prototype. And then they all came up with it, and then we did an evaluation. Which 1 fits our enterprise architecture? Which 1 fits the current knowledge, and the application landscape, we already have in house? So there was this process of selection, okay? We didn't just jump to 1. It took, like, 6 months to go through the selection phase, to proof of concept, to testing in the field, and we didn't do, like, a big test. We tested very small, like, with 1 sensor or 2, and then we saw, okay, how much can this scale?
I didn't mention this, but we have a very small team. I was lucky to be hiring at a time when it is very hard to hire good people: I hired 3 very good senior engineers who helped me build this platform. So, we have a very small team. We did the design, the solution design, what we wanted, and then we went through all the suppliers to see which 1 was better for us at the time. But bear in mind, we are already built, we are at 18, and we are constantly still looking at other platforms. You know, we never stop evaluating. For example, for the people that are listening to this: we built our own product, right? But Amazon now has IoT SiteWise Edge, which is, in the end, the same as what we built, which is funny, because while we built it, we were providing feedback to Amazon all the time: hey, we are building this.
Maybe you want this, and now they recently did the same. But it's not only Amazon. Google, I think, has something similar. Microsoft has something similar in Azure. And we are still checking: what happens if we go to SiteWise Edge? What would we need to upgrade? What would we need to change? Yeah. So that was a little bit of our process, and it's a continuous process, you know? We're constantly still building the platform. We didn't build it and say, okay, it's done. No. We are still going step by step.
[00:24:32] Unknown:
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Given the fact that you are still in this phase of proving out the solution and scaling it up, I'm wondering what the criteria for success are and when you can determine that, yes, this is exactly what we're looking for, we're just going to continue investing in this structure and strategy, and we're just going to scale it out across our entire infrastructure, or, to the contrary, saying, okay, this isn't quite meeting our expectations, we actually need to stop and rethink and maybe tweak the formula a little bit before we try to hit that full scale company wide adoption?
[00:25:51] Unknown:
The product itself was a success, you know? Because as a company, we were not able to see this data, to get this data; it was in silos. Our team is even like a use case inside of the company, because the company is not a tech company, yeah? We wrote the code, we built the product, we automated the deployment. It's quite funny when I say, okay, we are a non-tech house, and we do microservices, Lambdas, and we have code coverage of 99.5%, and we have unit tests, regression tests, smoke tests, static code analysis, everything inside of CI/CD, and everything is automated, plus product monitoring and logging and tracing. It's quite funny when we say we did that. So when I say the product itself, it's a success, and we are already scaling it.
But as I probably mentioned before, we are always looking to see if there is something even better; we're not just gonna stop. And this product is already giving a lot of insights and information to other departments. It's quite nice to see that now we are doing digital twins, where we create a model of our terminal in 3D, and people can see in real time the liquid flowing and the temperature. It's quite nice, besides the other initiatives, because it's something that you can see yourself. Or being on the terminal, and you see people using smart glasses, and they are able to see, in real time, through the smart glasses, the data of the assets, of the tanks, because they are getting the information from us. So there are a lot of things where you already say, okay, it was a success.
We are already scaling it. Of course, as you said, there is tweaking. Sometimes we start noticing that there is a limit in things; every technology has a limit. So when we start seeing, okay, we need to filter the data even more, or maybe we need to deploy multiple Greengrasses in a location, of course, there is always, as I told you, some tweaking.
[00:28:01] Unknown:
In terms of those scaling issues, as you are starting to bring more terminals online and starting to maybe experience a greater degree of heterogeneity across the different sensors and the availability of data, I'm wondering how you have been addressing that and some of the complexities that you're dealing with as you continue to expand on the usage of this architecture that you've designed.
[00:28:25] Unknown:
1 of the challenges, and this is the big challenge, is the contextualization at our scale, okay? Grabbing the information from the sensor itself is easy, because of OPC UA; it's very standardized, and our suppliers have it, and the configuration of it is standardized already. It's the contextualization, because every terminal is different. It's not the technology; the majority of the time, it's a process mindset that you need to change, and communication. The challenge is the contextualization of the data: so you know that that really is a flow meter, and that is the type of measurement, and that information is correct, okay? So the data quality of it, let's say, the trustworthiness of the data.
That, when scaling, is the biggest problem in this case, because you need to communicate very well with the people on-site, with engineers, and you need to explain, hey, I need this information, and why I need it. So you need to adapt very easily to the people that you talk to. And not only that, but expectations, you know? When scaling, I think the biggest challenge is mindsets: people knowing that this data, this type of data with this context, can do this and this and this; the adoption and the data quality and the context, I think. Yeah. Not the technical part, because the technical part itself, in our case, once again, I was very lucky: we built with a mindset of automation from the beginning.
So the more terminals we onboard, the work itself is the same: we grab the data and transform the data with the correct information. So it's about giving the correct context at the edge.
[00:30:17] Unknown:
As far as the quality aspect of the data, I imagine that that's quite challenging: understanding what is the lowest common denominator of information that I can expect across all of these different sensors, and then being able to also take advantage of the variance, where maybe 1 sensor, say it's a flow meter, measures in terms of liters per second, and then another 1 is gallons per minute, and just being able to make sure that units are correct and that you're able to aggregate the information and try to maintain fidelity and not just kind of bring everything to the coarsest level.
[00:30:55] Unknown:
That's the biggest challenge right now. Currently, with our automation process, I know it sounds very strange, but the terminal on-site gives us a file with all the context information. We then have a CI/CD pipeline that transforms the file and sends it automatically downstream to the edge part. And we have some validation rules that validate the file that the engineer on-site gives to us, so we have some kind of control on the validation of that data. What happens if the data is wrong? Then we do a data migration, right?
A data migration to fix that, yeah. If it's in the cloud, as I told you, we are storing 8,000,000,000 records, so doing the change is then a little bit more problematic, you know, because of the scale of the amount of data; it takes a little bit more time to change it. I think when you work at this scale, people are already used to it. So yeah.
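As a hedged sketch of what those validation rules might look like, here is a small check over a parsed context file; the required fields and allowed units are invented for illustration.

```python
ALLOWED_UNITS = {"degC", "bar", "m3/h", "kg", "mm"}
REQUIRED_FIELDS = {"tag", "sensor_type", "unit", "asset", "site"}

def validate_context_file(rows):
    """Return a list of problems; the CI/CD pipeline fails if it is non-empty."""
    errors = []
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            errors.append(f"row {i}: missing fields {sorted(missing)}")
        elif row["unit"] not in ALLOWED_UNITS:
            errors.append(f"row {i}: unknown unit {row['unit']!r}")
    return errors
```

Catching a wrong unit or a missing asset here is far cheaper than migrating billions of records after the fact.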
[00:31:57] Unknown:
And then another interesting challenge of dealing with IoT data, and maybe juxtaposing that against what a lot of people who are working in sort of standard corporate data ecosystems are thinking about, is things like lineage. I'm wondering how that translates for an IoT perspective, where you say, you know, this data point or this aggregate set of data points is coming from this category of sensors in this physical location, this is the trip that it took from the edge location to our central storage, these are the transformations that were applied to it, and just managing data lineage and sort of data cataloging at that scale, where you're dealing with, you know, thousands or hundreds of thousands of sensors in dozens of different physical locations and some innumerable number of pipelines that are going to be processing the data for different machine learning or analytical use cases.
[00:32:50] Unknown:
The data governance itself, it's quite funny, because I was listening to this podcast the other day regarding Monte Carlo and data quality monitoring. We kind of built our own Monte Carlo, where we built SLOs and SLIs over the governance and the time, for example, the processing time, okay? So currently, from the sensor to the cloud, it takes about a second, so ingesting, processing, and storing. Our monitoring system monitors that time. We try to keep it under 1 minute; the max is 1 minute. If it's higher than that, then someone will wake up. And the 50 minutes, as I told you. For the data quality, we monitor, let's say, if we have a flow meter and suddenly the flow meter is measured in kilos, and then we get, like, 2,000,000,000, let's say not 2,000,000,000, but 100,000 outliers like that, we have monitors that trigger an incident.
Luckily, once again, our system is very robust. In 3, sorry, 4 years, I was only woken up at night 3 times. We don't have a lot of incidents till now. Let's see what happens when we are live in 72 terminals, but for now, in 18, we didn't have any problems yet. We are building the strategy, the data governance monitoring for it, in order to prepare us for the future. Yeah, it's a little bit of a leap of faith, and we'll see what we're gonna get next.
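A minimal sketch of the latency SLO check described here; the event fields and the `page_on_call` hook are assumptions standing in for the real monitoring system.

```python
def page_on_call(message):
    """Hypothetical hook into the incident system."""
    print("ALERT:", message)

def check_latency_slo(events, threshold_s=60):
    """Flag events whose sensor-to-storage time exceeds the 1-minute SLO."""
    breaches = [e for e in events if e["stored_ts"] - e["sensor_ts"] > threshold_s]
    if breaches:
        page_on_call(f"{len(breaches)} events exceeded the {threshold_s}s latency SLO")
```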
[00:34:20] Unknown:
And then going back to what you were saying as far as some of the business users looking at real time aggregate flow information, you know, from a central terminal or terminal managers walking around with smart glasses to be able to see, you know, these are some of the predictive maintenance things that I need to be considering. What are some of the types of new capabilities and business processes that are enabled by being able to actually collect and aggregate all of this information in near real time and some of the new kind of business use cases that are yet to be considered?
[00:34:55] Unknown:
As I told you, the customer is now able to access real-time information about the tanks. We can even provide it to our partners and suppliers, okay? Because Vopak is something like a man in the middle in the distribution of liquids: we store, and then we give it to someone else sometimes. So from a business perspective, now the customer and partners can see that information. From an operational perspective, as I told you, now the terminal manager can use smart glasses, autonomous robots, like the Boston Dynamics ones, for asset inspection, so we don't need to send people inside to see what is happening.
Because if there is a problem in the tank or the flow meter or the pump, the operations will stop, and it means that the customer doesn't get the liquid. And because it's a vital product, it can happen, it never happened, but it can happen, that we stop a country. For example, if we look at when Colonial Pipeline was hacked in the United States, like, half of the country stopped, you know? So with all these things, we need to avoid stopping as much as possible.
So our data is 1 of the key pillars of the digital transformation of Vopak. With this data, you then create all these initiatives that help the customer and the company.
[00:36:19] Unknown:
Continuing on this discussion of the types of impact that a failure in the physical plant can have and some of the interaction with the sensor data. I'm wondering what are some of the other types of ramifications that might happen due to failures in the data collection and analysis or the physical failures that can be reflected or have an impact on the sensor data that you're processing?
[00:36:46] Unknown:
As I told you, we only read right now; we don't write down to the systems, we just ingest. So if there is a failure, it's not because of us. In the case of a failure, we don't get any data. Of course, then the person on-site clicks on the app and doesn't get the actual information. However, they are still on-site, you know? They have this control system, imagine something like a power plant's control room. They have this control system, and I'm sure the person will have an alert on their industrial system, or they already have someone looking at it.
Because if there is a problem with the sensor, or we don't get the data, that's why we use OPC UA: if there is any problem with the sensor, OPC UA itself gives a status code. And our data quality monitoring system monitors the status code, okay? OPC UA has a lot of status codes; once again, it's an industrial protocol. The status code indicates a type of problem in the sensor, and we monitor that. And based on that, we can notify the local team: hey, if you are not looking, and probably they're already looking at it, this is happening with this sensor. So we can already do that on behalf of the operations teams on-site.
We can already monitor for them: hey, this is happening with this sensor. So with the disruption itself, of course, they will not have the most recent data, but because of our monitoring system, we can already notify the people on-site so that they can fix the problem. This data will not go to the cloud, by the way, because of the status code. It's 1 of the rules of the transformation: we only send to the cloud the tags, in this case OPC UA tags, the data points, with the status code Good. So if it's not Good, we're not gonna send it to the cloud, and our monitoring system alerts the people on-site, or asks the people on-site, what is happening with this.
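A sketch of that routing rule follows. The numeric value of the Good status code (0x00000000) comes from the OPC UA specification; the record shape and the helper functions are hypothetical.

```python
GOOD = 0x00000000  # OPC UA status code meaning "Good" quality

def send_to_cloud(sample):
    """Hypothetical uplink call (Kinesis, MQTT, ...)."""

def notify_site_team(sample):
    """Hypothetical local alerting call for the on-site operations team."""

def route(sample):
    # Only Good-quality samples are forwarded to the cloud; anything else
    # stays local and raises an alert for the people on-site.
    if sample["status_code"] == GOOD:
        send_to_cloud(sample)
    else:
        notify_site_team(sample)
```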
[00:39:00] Unknown:
And in terms of the uses of the system that you're building and the data that you're providing, what are some of the most interesting or innovative or unexpected applications of it that you've seen?
[00:39:12] Unknown:
A lot, to be honest. At the beginning, I was not expecting the smart glasses, the digital twin, the analytics over alarms and events. We have 1 project, which we are still starting to work on, it's in the early stage, that is a little bit like alarms and events, but where we can predict, or try to predict, when an accident can happen and in which situations, but it needs a lot of data. And, I don't know, from a software engineer perspective: a lot of software engineers, probably, build a feature and send it to production, and they never see anyone using it. Maybe people at, like, Google see people using it. But in my case, when I go to terminals and I see people using the smart glasses now, and seeing that dashboard, and I can see the impact, it's like, well, I was never expecting this to happen. Or, for example, the energy management project; you can see it in Forbes, we talked about that.
Reducing the consumption of energy, for example, for me, was quite mind blowing, because I didn't expect such an impact on the company itself, you know? When we built the system, I was not expecting it to become, like, a key pillar of the company itself. It's a little bit hard to say which 1 it was, because there are so many initiatives taking place because of the data that we are able to provide. Yeah, it's hard to select 1. Maybe the drones, the drone inspection, is nice. I was not expecting that.
[00:40:49] Unknown:
In your own experience of designing and implementing and evolving this platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:00] Unknown:
I think, when we were building at the beginning, early stage, we started very small, and it was the expectations: we started with 1 or 2 sensors, and people said, okay, then we can already start doing machine learning. No. At our scale, maybe it was the operational part of it that I was not expecting, like, for example, fixing data, right? As I told you, data migration was something that we were not expecting, from the beginning, to be so challenging, you know? Or even giving context to the data was something we were not used to, or expecting to be so hard. So in the design, while designing it and then building it, if it was now, maybe I would look at the transformation Lambda, the normalization Lambda, more carefully. But this is now, right? And this can change, right?
We focused a little bit more at the beginning on the ingesting, because you need to give value. You need to give value to the business as fast as possible, and they want the data, so let's ingest it. We focused a lot on ingesting the data, understanding how the OPC UA protocol works, how the Modbus protocol works, how all the industrial side of things works. And then when it got to the parts of transformation and normalization of the data, we didn't focus so much on it, you know? Yeah.
[00:42:21] Unknown:
In terms of the near to medium term future of the work that you're doing, what are some of the things that you have planned or particular areas that you're excited to dig into?
[00:42:31] Unknown:
I think the capability to run machine learning at the edge. Greengrass allows that; they call it machine learning inference. We are doing proofs of concept with it, even in the energy management part. For example, we see that a pump is consuming too much energy, and we do peak shaving. It shouldn't be consuming so much energy if it's not gonna be used for 2 days or 3 days, but you need to warm up the pump in order for it to work. We send an instruction to the operator saying, hey, based on previous history and based on future schedules, this pump doesn't need to be on. And the operator can say yes or no, and we shut down the pump.
So that's something that we are currently evaluating and testing: the capability to run the machine learning, the AI part, at the edge, with the help of our data team. And, by the way, our data team is doing very good, awesome work, because I'm saying what excites us or what we're gonna do next, but the majority of the time, it's what the business wants next, you know? And for that, you need to teach. We are even creating, like, a data academy in house, where we teach the business people on-site, managers and directors, what they can do with the data.
And based on that, they are the ones that can give us ideas, you know? And then, okay, yeah, we can build it, so let's do it. We need the business and the people on-site to help us. You know, without them, we are not gonna solve anything.
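As a sketch of that human-in-the-loop recommendation: if a pump has no scheduled use within a look-ahead window, suggest shutting it down and let the operator decide. The schedule format, the 2-day window, and the function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

def recommend_shutdown(pump_id, schedule, lookahead=timedelta(days=2)):
    """Return a suggestion string, or None if the pump should stay on.

    `schedule` maps pump ids to timezone-aware datetimes of future scheduled uses.
    """
    now = datetime.now(timezone.utc)
    next_use = min((s for s in schedule.get(pump_id, []) if s > now), default=None)
    if next_use is None or next_use - now > lookahead:
        return f"Pump {pump_id}: no use scheduled within {lookahead}; suggest shutdown"
    return None  # the operator always confirms before anything is switched off
```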
[00:44:02] Unknown:
Are there any other aspects of the systems that you have designed and implemented or the overall complexities of dealing with IoT data at distributed scale or just the overall space of building and managing these complex data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:44:21] Unknown:
1 thing, it seems, that someone who wants to build these IoT systems, or any distributed system, or these hybrid systems, needs to think about first is deployment and testing. Of course, you need to give value, but an IoT system is very hard to test, right? Because there are many factors that can change. So testing, and automated testing, is very important. Automation is a very important topic for the IoT field, okay? Yeah. If you control the sensors, let's say if you are a company that builds your own sensors, you don't have that problem, right? But if you are someone that ingests data from different sensors, you really need that. Yeah.
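To make the testing point concrete, here is a pytest-style sketch: feed a synthetic sensor trace through an edge filter and assert on what survives. The filter and trace are illustrative, not Vopak's test suite.

```python
def deadband(samples, band):
    """Same dead-band filter sketched earlier in the conversation."""
    last = None
    for t, v in samples:
        if last is None or abs(v - last) > band:
            last = v
            yield t, v

def test_deadband_drops_noise():
    trace = [(0, 20.0), (1, 20.05), (2, 20.4), (3, 20.41)]
    kept = list(deadband(trace, band=0.3))
    assert kept == [(0, 20.0), (2, 20.4)]
```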
[00:45:07] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:22] Unknown:
From a technical perspective, it's the fragmentation in the IoT sector, at least at the edge, okay? It's good that companies focus on 1 problem, but when we were evaluating, it's very hard if you're gonna need to maintain, like, 3 or 4 applications at the edge. It's very hard to find an application that does everything where the cost is still not very high, you know what I mean? I can have 4 applications that are not very expensive and are easy to maintain and scale, but in the long run, I will need to hire 20 engineers to maintain them. So, yeah, the fragmentation.
From a people perspective, it's the management of the data when you are deploying worldwide, as in our case. This is not a sprint; it's a marathon, you know? You're not gonna start ingesting and processing this data in 1 day, and everyone that you're gonna deal with is gonna have different expectations for it. So I think that's the biggest gap, because you're gonna come in as an engineer, a software engineer, full of ideas, and on the other side, you don't know what the expectations of the people are. I think the gap is a gap in acknowledgment.
[00:46:40] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Vopak. It's definitely a very interesting problem space and an interesting design challenge that you've been working through. So I appreciate all of the time and energy that you have put into that and the time that you've taken to share your experiences with me and the audience. Thank you again for that, and I hope you enjoy the rest of your day. Thank you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Mário Pereira and Vopak
Types of Sensors and Edge Devices
Control and Oversight of Sensor Deployment
Computation and Storage at the Edge
Filtering and Preprocessing at the Edge
Network Failures and Latency Tolerance
Retention Periods and Longitudinal Analysis
Evaluation and Planning for IoT Platform
Criteria for Success and Scaling
Challenges in Scaling and Contextualization
Data Lineage and Governance
New Business Capabilities and Use Cases
Unexpected Applications and Lessons Learned
Future Plans and Machine Learning at the Edge
Final Thoughts and Closing Remarks