Decoupling Data Operations From Data Infrastructure Using Nexla

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion.

That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data.

By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end to end metadata,

Databand lets you identify data quality issues and their root causes from a single dashboard.

With Databand, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives.

Go to dataengineeringpodcast.com/databand

today to sign up for a free 30 day trial and to take control of your data quality.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E. And get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macey. And today, I'm interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across business boundaries.

So, Saket, can you start by introducing yourself? Hi. This is Saket. I'm cofounder, CEO at Nexla.

I'm an engineer by background who eventually went towards the business

side, and I've been in the

world of data for almost 12 years now. And how about you, Avinash?

Hi. I'm Avinash, founder, CTO here at Nexla. I worked with Saket on his previous startup, which was in the mobile ad tech space.

And before that, I spent a bunch of years building data systems in financial services: risk management platforms, trading systems, etcetera.

Great to be here. And going back to you, Saket, do you remember how you first got involved in data management?

Yeah. Yeah. Absolutely. So as I said, you know, I was an engineer by background

through several companies.

But then in 2009, I started a company for, you know, building a mobile ad server, one of the early ones at the time. And that company, which is called Mobsmith,

we ended up building a pretty complex data system. As you know, Tobias,

there's a lot of data in the advertising space. That's why you've seen Google, Facebook, and these companies build a lot of data technology.

Right? We ended up building a system that would process nearly 300 billion records of data every day. We're doing some very

machine learning on that for pricing ad auctions and so on. Ended up creating a huge data platform as part of this advertising startup that I was doing. That's what got us into the data space.

Yeah. And, Avinash, how about you?

I was building risk management and trading systems on Wall Street. We founded this ad tech company, and we were very early users of most of the current state-of-the-art technologies: very early users of Kafka, very early users of Spark. And then we saw this problem from very close quarters: every time a non-engineer came to us with a requirement, we were like, oh, we need to put this at the back of the queue, and it will take us a few months to do this. This is where we thought, how about we build something which makes it easy enough for a non-engineer to build a data solution, with sort of the rigor of an engineering system.

In terms of the Nexla platform, you mentioned that it's aiming to help reduce some of that time delay and backlog in terms of being able to incorporate new requests and new datasets. And I'm wondering if you can just give a bit of an overview about what it is that you've built there and some of the story behind how you got it started and why you've decided that this is the problem that you wanna focus on right now?

So

as I said, you know, there's a lot of data in advertising.

For Avinash and me, for both of us as well as our other cofounders, we felt that the data challenge that we worked on in advertising was super interesting. We were not as enamored by the advertising and the media side of our business. Right?

And, you know, I had come into this world of data by way of the compute side, having worked at NVIDIA. And when we came together and were looking at this at our last company, we felt like, look, there's a lot of data in here.

We end up getting data from, you know, hundreds of different partners and companies, and clearly we have gotten a lot of value applying that towards machine learning. So the thinking that took us towards Nexla was that this is a problem, and this is back in 2014, for setting context. Right? So in 2014, we're thinking, well, we think more and more companies will use data in more efficient ways, but it will be really hard for them. There's a lot of technology that has to go in, and that started to seed the idea of Nexla inside of us, which is, hey, companies will have data in their organization. They will have data that they will get from their ecosystem. It'll be hard to work with. More and more people need to work with data, however. So how do we make that happen? How do we take all this complexity and growing complexity of data? And how do we make it possible for someone who is not an engineer to be eventually able to work with it and say, this is what I want to do with my data. And that was the core behind starting Nexla.

In terms of the sort of overall problem space that you're tackling, you mentioned that one of the problems is being able to ask questions of the data that exists within the organization.

And I'm wondering how you are approaching that problem, some of the scope of the problem that you're taking on at Nexla, and some of the ways that Nexla might integrate into an existing data platform.

Sure. So

when we posed the question, you know, how can data be used by more people and in more applications, it led us to 4 core problems that we felt had to be solved. Okay?

And as I said, you know, we wanted the non-technical user to be able to do this. The first was that we had to break through this connector barrier. You know, there are so many systems to connect to, so many formats, different velocities of data. How can we connect to newer and newer systems without having to go and write new connectors?

So that led to a piece of our technology which we call a universal connector. We can talk more about that. So that was one problem. The other problem was that once we can connect to the data, how do we present it to the user in a way that they can understand?

And that led to the concept of Nexsets as a logical data unit.

Something that is consistent no matter where the data comes from. Right? Then the other problem that we also looked at: as I mentioned, we were also thinking about data that's going between companies and their ecosystem of, you know, their customers, their partners, their vendors, and so on. So we thought that if data has to move across companies, it will have to move across clouds. It will have to move from on-prem to cloud. And therefore, the architecture should be such that it can support those scenarios. Right?

And then the fourth thing was, look, I mean, we want the data user, the person who understands the data, to be empowered, but data challenges are very diverse. So we will need to have a way to collaborate. So one of the other problems was collaboration: collaboration all the way from someone who is an expert in the data to someone who's an expert in the data system. And that led to the concept of no-code capabilities in Nexla, to the low-code capabilities, and the full-on developer platform, which has, you know, APIs and SDKs and CLIs and everything.

As far as somebody who has an existing data platform and they're looking to be able to simplify some of their data operations and take advantage of what you're building at Nexla, what are some of the components of a sort of common data stack that somebody might replace with Nexla? So I'm thinking in terms of maybe some of the open source components that people are familiar with, such as Spark or Kafka or the data warehouse, or just sort of where Nexla sits in that overall domain.

At a very high level, I would say when it comes to technologies like Kafka, Spark, and so on, you know, we don't replace them. We actually make them more usable by people, more easily adopted. There are sort of 4 things that we solve for companies. Right? One is the integration aspect, which is multimodal integration: ELT integration, ETL, you know, reverse ETL, streaming integration, API integration. So we look at all these 4 modes of integration, and we present a solution for that, something that, you know, can be done in a user interface, click through, and so on. When the actual execution happens, we are able to leverage some of these technologies as appropriate, like Kafka or Spark or some of our homegrown engines as well. So in that way, you know, we can sit on top of or benefit from the tech you already have or the know-how you have, but make it more accessible to more users.

By the way, I would note that, you know, we never think of it as a replacement. Whenever it comes to technology, especially in data, there are so many new problems coming that you don't usually go back and rip and replace things. It's more about, you know, solving the new problems. Right? So integration is one part. The other part is the preparation of data. How do we transform, enrich, filter, validate, you know, the data so that it becomes ready to use? You know, integration means it's reading or writing it. So preparation is the next part of that. Since we are in the flow of data, we do enable error management, data quarantine, notifications, and alerting as far as the flow of data is concerned and what we expect from that. And then, you know, the fourth piece that we touch, in addition to integration, preparation, and monitoring, is the discovery. We sort of auto-generate a catalog from our Nexsets, which we'll talk about, where users can find, hey, you know, where's the data I'm looking for? Is it the right one? And then, you know, access and use.

And a point to add here is these are all pluggable components. So it's not that you have to use the entire thing. You can sort of choose to use a part of this. For example, if you already have written some code in Python,

but you don't have the integration layer to where you want to fetch this data from, or you have a new integration layer that you want to fetch data from, you could use Nexla for the integration piece, but you could use your prewritten transformations in Python to transform your data and then push it out. Similarly, if you have a new customer requirement, someone who wants to move from GCP to Azure, you have all the logic that you want on GCP, but you don't have a Synapse connector. You could probably use Nexla to just push data to Synapse. Again, you use your existing thing, but you are able to get the additive thing from Nexla in a matter of 2 or 3 clicks. So we have very much been used with existing monitoring systems, with existing catalog systems, where we end up supplementing them or feeding information into those tools and making them more capable.

In terms of the Nexsets that you mentioned,

one of the use cases that is highlighted on your website is the ability to publish those Nexsets for creating a data exchange between businesses, which is something that I've seen spoken about in a few different places. It doesn't seem to be as widespread as one might think, largely because companies are using their data assets as part of their competitive advantage, and so they try to keep them proprietary.

And so I'm wondering what you see as some of the use cases for those data exchanges, some of the ways that you are approaching that challenge of mitigating the decrease in competitive advantage that businesses might have by exposing their data, and some of the types of datasets that organizations are more willing to either make public or engage with directly with other organizations to be able to share between them.

First, a quick definition of Nexsets, you know, for those who are listening to this: Nexsets are these logical data units. They don't have data, but they become a way to access and use data. And so these Nexsets have the concept of sharing and collaboration built into them. This happens within the team, of course. You know, I can create or design a Nexset which is PII-compliant data and give access to a team to use that, but this has great applications, as you said, Tobias,

across companies.

Where do we see that? When two companies work together, as you can imagine, there is always some flow of data going between them. Right? We see this with some of our customers. For example, you know, one of our customers is Instacart. You know, for them to deliver groceries, they will need to know that, hey, does this, you know, Safeway or Albertsons or whatever, do they have the products available, and can I go pick it up? Can I list it? Right? So there's clearly data that Instacart is using that's coming from that, you know, merchant partner of theirs. Right? So there are these cases where we work and, you know, we enable that sort of flow of data across the companies.

That's one part of that. The second part of it is a community aspect. You know, we've seen people like, hey, I've worked on this health care data which was publicly available. I've integrated with it. I've massaged it. You know, it's ready to use and I can share it. I can make it public and anybody can use that. So there's that part of it as well. And then there is a flexibility part to it. Right? You know, it's not unusual for companies to put files on FTP or create an API so that they can give data to somebody else.

What happens though is that, you know, there's a two-way integration challenge. You have to generate that data into a file and push it to an FTP server or create an API, and the other party now has to integrate with that. You know, they have to fetch that data and run a process for it. So these Nexsets are enabling this, where once you give somebody access, you give them the choice. You know, do you wanna consume that data into a file, into an API, into a database, and not have to integrate at all, like, you know, just a direct sort of pipe in that way. So there is a lot of power in that. We do see that these use cases are growing, and there are some real benefits for the companies and their customers.

Yeah. As soon as you started to discuss the data sharing aspect, I was going to bring up FTP. So thank you for beating me to the point on that.

It's been there for, what, 30, 40, 50 years, and it's going strong. Right? And yeah.

I just filled out a form for an integration request which said, do you want to put it on our FTP server or should we pull it from your FTP server? So it's as recent as yesterday. But to the other point, I think the aspect of how companies share data: they probably give you a username and password, and then you forget about it. How do you know what all integrations are currently connected to an FTP server or an API?

It's very hard with today's systems. You sort of go and look at your FTP server logs: when did these users last log in? Or you look deep inside your API logs: which users are connected currently?

But with a system like Nexsets, you can just go to a screen and see, these are the things that have access, and when was the data last pulled. So the monitoring aspect of data exchange is also something that I think prevents companies from connecting multiple different data systems, because you are scared: how will I know who is using my data? So having that aspect is also useful.

I've seen horror stories of, you know, somebody changed the column name between, you know, an insurance company and an automotive company or whatever, and the whole thing came down because of that. Right?

So, yeah, it's a big problem sometimes.

And so the more modern analog to the FTP server is the capabilities that systems like BigQuery or Snowflake offer with being able to share either tables or datasets from your data warehouse. And so I'm wondering if you can give a bit of a comparison to what you're building with the next sets and some of the capabilities that they provide over and above those systems.

Yeah. I think the shared tables in these systems like BigQuery and Snowflake are a great way to share data. What the Nexsets are helping companies do is, one, you know, bring the data into BigQuery or into Snowflake. Sometimes it's not already there. That's definitely a value add. But the other thing we have to note, which is even more important, is that once the data is in a warehouse, you have become a batch or a static system.

So in use cases where the data needs to be going kind of real time from one system to another, these Nexsets, which, you know, as I said, have multiple sort of execution engines underneath, can continue to move the data in real time. There are many use cases. We see, you know, ecommerce shipment tracking as one example where the data needs to be, you know, real time.

When you share a table from Snowflake or BigQuery, you still need to do a lift and shift to where you want to eventually use it. So you might use it in a machine learning algorithm, but, again, you need to extract the table, push it out, build a process around it, and all of those things. With Nexsets, you're solving the problems at both ends of the ecosystem.

On one side, you can have a Nexset which can then push data into these things. On the other side, you can have a Nexset which can then be referenced in a machine learning model. So you're solving both of those things, but, yes, these things do coexist. So we have customers who would share a table in Snowflake, but then we'll have another customer who just consumes the data from the table with a Nexset on top of it.

I've actually had cases of companies come to me and say, hey, we are not, say, a current customer of x or y data sharing tool, but our customers are asking us to bring the data there. Can you give us a mechanism to move the data from, you know, our database, our lake, or wherever into that system so we can share? So, yeah, there's definitely an integration need in some of these cases.

All of this capability of being able to deliver the data from one system to the other or to do the integration is being covered by the term DataOps, or data operations, that's gaining ground.

And that's one of the keywords that you're targeting in terms of the positioning of what you're building at Nexla. And so I'm wondering if you can give your version of what DataOps involves and some of the elements that need to be implemented to be able to actually deliver on that promise.

We were one of the early ones in 2016, 2017 talking about DataOps. Our thinking about operations was that, you know, the purpose of operations is to bring scale. You know, not in terms of data volume necessarily, but in terms of, you know, how things are run. How are more people working with data? Are there more datasets in more applications?

So we took that sort of point of view of operations when using the word operations along with data.

So our approach to the operations aspect was that you have to have a certain degree of automation, right, in how you connect with data, how you flow data, you know, how multiple people use it together. That's how you'll scale your team. Right? So those were the sort of core principles that we started with on the data operations side. The resulting application for the user was that, oh, I need to integrate data. I need to prepare it. I need to monitor that. So we think of the operations as a layer on top of these fundamental data functions, right, that allows the team to scale.

In terms of the actual architectural and implementation details of Nexla and some of the ways that you're able to build these Nexsets, I'm wondering if you can give an overview of some of the technical aspects.

So the fundamental premise behind Nexla was, on one hand, you have a lot of variety of data and increasing complexity. On the other hand, you have a user who shouldn't have to write code. Right? So we figured that in between, there has to be tech that can take all that variety of data, different systems, connectors, and so on, and present the data to the user in a useful fashion. And that was the Nexset. Right?

So these Nexsets, the way they start functioning is, first of all, they start with observing the data. So the connector is putting the data out, you know, and the Nexset, that element, is observing the data and is figuring out what the schema is. We'll get a schema from, you know, the first set of records, but it'll continuously keep looking at it. So the first aspect is, what's the schema of the data? And for that, observe it over time and determine if the schema is changing or evolving, or if the data that is coming now should actually be a new Nexset. So that's one aspect of the thing.

The second aspect of the thing then is, like, what time does data come? How much data do we see? You know, those are some of the metadata that we observe about that as well. Then when we go into the data itself, we say, oh, the data has an attribute called price, and it has been changing like this. So we're computing characteristics of the data. Right? So these are some of the elements that come in. The sort of going term for that is metadata intelligence, and, you know, people are calling it a data fabric architecture. But, you know, those are all the terminologies. Let's stay on the tech itself. So that's what happens, and that sort of feeds into the concept of the Nexset. Right?

Now the Nexset also knows where the data is coming from. So for example, oh, this record was line 500 in that file. Right? So that information is also being gathered. Right? And then it starts to say, okay, I know enough about the data to present it to the user in a UI that they can look at. But at this time, it doesn't matter to the Nexset that the data came from a stream or it came from an API or it came from a file.

It's immaterial where it came from. It can be presented to the user in a common interface.
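As an illustration of the schema observation described above, here is a minimal sketch in Python. It is not Nexla's implementation: the function names `infer_schema` and `diff_schema`, the use of Python type names as the schema, and the sample records are all assumptions made for this example. It derives a schema from a first set of records and then checks a later record for drift, without caring whether the records came from a file, an API, or a stream.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]
Schema = Dict[str, str]  # attribute name -> observed Python type name


def infer_schema(records: Iterable[Record]) -> Schema:
    """Build a schema from a sample of records (hypothetical helper)."""
    schema: Schema = {}
    for record in records:
        for key, value in record.items():
            observed = type(value).__name__
            # An attribute seen with conflicting types is marked "mixed".
            if schema.get(key, observed) != observed:
                schema[key] = "mixed"
            else:
                schema[key] = observed
    return schema


def diff_schema(known: Schema, record: Record) -> Dict[str, List[str]]:
    """Report how one incoming record deviates from the known schema."""
    added = sorted(k for k in record if k not in known)
    missing = sorted(k for k in known if k not in record)
    retyped = sorted(
        k for k, v in record.items()
        if k in known and known[k] != type(v).__name__
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# The source of the records is irrelevant here: file, API, or stream.
sample = [
    {"sku": "A1", "price": 9.99},
    {"sku": "B2", "price": 12.50},
]
schema = infer_schema(sample)
drift = diff_schema(schema, {"sku": "C3", "price": "14.00", "currency": "USD"})
```

In this sketch, a non-empty `added` or `retyped` list is the kind of signal that could prompt a decision between evolving the existing schema and treating the data as a new dataset; how that decision is actually made is not covered in the conversation.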

And then the Nexsets, on top of that, say, well, who can access it? You know, what is the history? You know, how has the schema changed over time? Which user has modified it? And then along with that, they keep documentation.

Right? Hey, what's the description of this Nexset? Has the user created an annotation to be able to say, oh, this attribute is actually coming from a web form? So those are the other things that come in and become part of the Nexset.

And then as the data is flowing through, you know, the Nexsets are not doing anything. You know, they're like an incomplete electronic circuit. Right? So it's not flowing anything yet. You know, it's observing, it's sampling, but nothing is going. Once you connect the Nexset to a data system, right, then it starts to materialize the data there. So you point a Nexset to a database and it says, okay, it's going to be an ETL flow, let's say. Or if you want to get an API to the data, it can surface that, or, you know, wherever you wanna use the data, it can materialize. But it knows enough about the data to now convert the format, for example.

It also knows enough about the data to now run some validations on it. Right? Oh, this field should look like this, or, you know, let the user define their validations.
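To make the idea of user-defined validations concrete, here is a small Python sketch. Everything in it is hypothetical: the `run_validations` function, the rule names, and the record layout are invented for illustration and are not taken from Nexla. Records that fail any rule are set aside in a quarantine list, along the lines of the data quarantine mentioned earlier, instead of being dropped or passed along.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]
Rule = Callable[[Record], bool]


def run_validations(
    records: Iterable[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Dict[str, Any]]]:
    """Split records into those that pass every rule and those that do not."""
    passed: List[Record] = []
    quarantined: List[Dict[str, Any]] = []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            # Keep the record and the reasons so it can be reviewed later.
            quarantined.append({"record": record, "failed_rules": failed})
        else:
            passed.append(record)
    return passed, quarantined


# Hypothetical user-defined rules for a product feed.
rules: Dict[str, Rule] = {
    "price_is_positive": lambda r: isinstance(r.get("price"), (int, float))
    and r["price"] > 0,
    "sku_present": lambda r: bool(r.get("sku")),
}
records = [{"sku": "A1", "price": 9.99}, {"sku": "", "price": -1}]
passed, quarantined = run_validations(records, rules)
```

Keeping the failed rule names next to each quarantined record is one design choice among several; it makes it possible to reprocess records after a rule or an upstream source is fixed.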

And then when it sees those validations fail, then it throws the data record into a quarantine area, where there's a separate sort of processing that you can do. So the Nexset is this logical entity in the middle, and then it sort of, you know, acts like the data router when you eventually end up using it.

And then there's a complexity on the whole runtime aspect of this. Right? How do you run a Nexset which might have trillions of records behind it, and also a Nexset which might have a few hundred records, on the same system? How do you separate out your execution engines for a user who is waiting on the other side to receive a response within 100 milliseconds versus a user who is expecting a batch of data to be dropped in his data warehouse once in the night? Right? So that itself is challenging. You present the same interface to the user, but under the hood, you can have different execution engines running in parallel.

Yeah. But that's the execution part of it. And at the Nexla level, you're just looking at it in the UI. You are making your decisions. You're designing stuff. And when it's, you know, done, then the execution, you know, knows, like, from all that information, what to do and which execution engine is right and all that stuff.
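The choice of execution engine described above can be pictured with a short Python sketch. This is only an assumption-laden illustration: the `FlowProfile` fields, the `Engine` names, and the one-minute threshold are made up for the example, and the conversation does not say how Nexla actually makes this decision.

```python
from dataclasses import dataclass
from enum import Enum


class Engine(Enum):
    REQUEST_RESPONSE = "request_response"  # a caller is blocked on the answer
    STREAMING = "streaming"                # continuous, low-latency movement
    BATCH = "batch"                        # periodic bulk delivery


@dataclass(frozen=True)
class FlowProfile:
    """Hypothetical facts known about a flow once it has been designed."""
    caller_is_waiting: bool  # e.g. an API call expecting a reply
    max_latency_ms: int      # how stale the delivered data may be


def choose_engine(profile: FlowProfile) -> Engine:
    """Pick an engine from the flow's profile (thresholds are assumptions)."""
    if profile.caller_is_waiting:
        return Engine.REQUEST_RESPONSE
    if profile.max_latency_ms <= 60_000:  # assumed cutoff: one minute
        return Engine.STREAMING
    return Engine.BATCH


api_lookup = choose_engine(FlowProfile(caller_is_waiting=True, max_latency_ms=100))
nightly_load = choose_engine(
    FlowProfile(caller_is_waiting=False, max_latency_ms=24 * 60 * 60 * 1000)
)
```

The point of the sketch is only that the user-facing definition stays the same while a separate decision step maps it onto whichever engine fits the latency and delivery expectations.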

And so in terms of the actual technical details of executing on the plans that the Nexset generates, is that something that you're performing on behalf of the end user, where you have your own infrastructure and data platform architecture for being able to do all this processing? Or is this something where you provide the sort of orchestration and implementation layer, and then the customer has the actual underlying infrastructure that just executes on the plan that you generate for them?

So for the execution part, which is kind of the data backplane, if you will, the processing stuff,

that is something that we provide as a SaaS hosted service. But we run multiple instances. There's an instance in Amazon. There's an instance in GCP, you know, an instance in Europe. So depending on what you need with the data and where it needs to be processed, you know, we provide that execution engine, which is all the scaling up and scaling down of the containers, you know, the underlying, you know, Kafka, Spark, real-time engines, whatever it needs. It's all packaged and included.

But we also do allow our customers to run Nexla on their own. So, you know, they may need this for high-value data on premise that's not connected to the Internet. So they can run this whole thing internally. So they can choose to run their own data backplane, and we provide them with, you know, the necessary Terraform templates to execute and run that. The end execution is sort of distributed in a federated fashion.

Given the level of flexibility that you're targeting and the level of sophistication that you're trying to offer and be able to abstract away some of the technical complexities that underlie it, I'm wondering what are some of the engineering challenges that have been most challenging and that have been most critical to your continued success.

There have been a lot of engineering challenges. I'll list a few. I'm sure Avinash has many, many more. It's never ending, by the way. You solve a challenge, you get more. The first challenge: we felt like, hey, if you're working with data coming from outside the company, we don't know what the API format, you know, and schema will all be. So the first challenge on the engineering side was connectors.

How do we end up in a place where we are not creating new connectors? We don't want our customers to wait for us to write a connector. That was one engineering challenge. We ended up creating an abstraction layer so that we have 4 main connectors with, you know, a bunch of bells and whistles. But everything like, you know, authentication and retries and probing a connector, managing credentials, you know, all of those layers are abstracted on top of that. So that was one sort of part that was a big engineering challenge.

This multimodal data processing that Avinash mentioned is another engineering challenge. Right? So think of Nexla. Let's say you have a database as a source and you're reading data from it and you're pushing the data into an API. Right? In a normal flow, it will probably run in Kafka. If you reverse the flow, if you call the Nexset and say, hey, give me the data, you know, as an API call, you make the API call and you're waiting for the data to come back in, say, 10 or 15 milliseconds. That doesn't go on Kafka. That's a real-time engine. Right? So this exact same flow that you had designed in Nexla, you can access in 2 different ways, and it will run in 2 different ways. So having that sort of multimodal data processing engine, and we recently started to offer bring your own data engine. Now if you have created some mechanism of processing data, you can bring it in, and we give you the interface to run that. So that's another engineering challenge.

And I think the third one, which is my favorite, is, you know, the whole infrastructure is very dynamically orchestrated. It's all, you know, container-based architecture, but these containers come up at the right time to bring the data, process it, go away, and all that stuff. And the reason it's my favorite is, by the way, when we started building Nexla, for the first 6 months, I was like, when can I run the first data flow? And they're like, no, we're building orchestration. And we did invest a lot in building that orchestration piece.
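The connector abstraction mentioned earlier in this answer, where authentication, retries, and credential management are layered over a small set of base connectors, could look roughly like the following Python sketch. All names here (`BaseConnector`, `ManagedConnector`, `FlakyApiConnector`, the in-memory credential store) are hypothetical and chosen for illustration; they are not Nexla's API.

```python
import time
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]
Credentials = Dict[str, str]


class BaseConnector(ABC):
    """A generic connector; concrete sources only implement read()."""

    @abstractmethod
    def read(self, credentials: Credentials) -> List[Record]:
        """Fetch records using already-resolved credentials."""


class ManagedConnector:
    """Wraps any BaseConnector with credential lookup and retries."""

    def __init__(
        self,
        inner: BaseConnector,
        credential_store: Dict[str, Credentials],
        credential_id: str,
        max_attempts: int = 3,
        backoff_seconds: float = 0.0,
    ) -> None:
        self._inner = inner
        self._store = credential_store
        self._credential_id = credential_id
        self._max_attempts = max_attempts
        self._backoff_seconds = backoff_seconds

    def read(self) -> List[Record]:
        credentials = self._store[self._credential_id]
        last_error: Optional[Exception] = None
        for attempt in range(1, self._max_attempts + 1):
            try:
                return self._inner.read(credentials)
            except ConnectionError as error:
                last_error = error
                time.sleep(self._backoff_seconds * attempt)  # linear backoff
        raise RuntimeError(
            f"gave up after {self._max_attempts} attempts"
        ) from last_error


class FlakyApiConnector(BaseConnector):
    """Demo source that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int) -> None:
        self._remaining_failures = failures

    def read(self, credentials: Credentials) -> List[Record]:
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise ConnectionError("temporary outage")
        return [{"id": 1, "fetched_with": credentials["token"]}]


store = {"partner-api": {"token": "demo-token"}}
connector = ManagedConnector(FlakyApiConnector(failures=2), store, "partner-api")
records = connector.read()
```

Because the retry and credential logic lives in the wrapper, a new source in this sketch only needs a `read` method, which is one way to avoid writing a full connector for every system.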

You want to simplify the view for an end user, but then there are a lot of things that happen at the back of the system to make it simple. So Saket talked about this dynamic orchestration. You can do this dynamic orchestration in, like, 4 different systems.

Everybody talks about Kubernetes now, but the land of Docker, Kubernetes, Mesos, all of these have evolved over a period of time. You might want to run this in a managed service. You might want to run a few things within a warehouse, for example, because it's best suited. How you make that switch with a simple interface on the top is super challenging.

At the same time, users have done sort of fantastic things with Nexsets. We have some users who have pointed it at PDF files and gotten a Nexset out of it. This is something that we were not originally thinking about, but people have done that by using some sort of OCR in between. So yeah. I mean, what we started with was clearly a simple concept of what we can do with Nexsets. But with these fundamental building blocks, people have sort of used our platform as different Lego blocks and built things that we didn't imagine in the first place.

We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it?

But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time?

Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census

today to get a free 14 day trial.

In terms of that multimodal execution engine, the first thing that comes to mind is being able to ensure consistency in the processing between those different modes of operation, where in a batch situation, I want to be able to ensure that I'm doing the transformations in the same way that I would if I were doing it in a streaming situation, which is sort of the canonical problem of, you know, the Lambda architecture or the Kappa architecture, this division of batch versus streaming.

But then there's also the question of real time that you mentioned, of, I wanna be able to hit this API call and have it give me the records from the database at the time that I request it without having to wait for it to propagate through Kafka or deal with change data capture to be able to replicate that information. And so I'm wondering how you're approaching that complexity of being able to ensure the consistency of logic in these different ways of executing and on these different substrates.

I'll add 1 quick note on that, and I think Avinash can give you a lot of the gory details. But there's a key part, that the processing of the data happens in exactly the same code base that actually runs in multiple places. You know? That module can run, you know, in a Kafka stream. That module can run, you know, in your Spark. That module is run, you know, by our servers as a microservice. Right? So that gives us the consistency. There is only 1 place. We're not trying to replicate the logic in multiple different systems. That's extremely hard to achieve.

Yeah. I mean, that's the fundamental reason, because

when a user is coming to Nexla and then creating a Nexset, or Nexla is auto deriving a Nexset, we are able to keep what makes the Nexset totally separate from where the Nexset gets executed. And the separation of concerns between the definition of a Nexset and the execution of a Nexset is what gives it this power of being able to run in different places.

So, I mean, if tomorrow there is a new flashy execution engine that comes, we should be able to take the same Nexset and bridge the gap with that execution engine without really changing anything on the Nexset itself. But as a user, you have told us, here is where my data sits. This is what I wanna do with it. This is where I wanna use it. Right?

How you execute that, you know, does depend on a few things. Like, I want it real time. I want this or that, but it shouldn't matter to the user and, you know, it stays agnostic.
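
To make that separation concrete, here is a minimal sketch in Python. It is not Nexla's code or API. The function names (`normalize_order`, `run_batch`, `run_stream`, `run_on_demand`) and the record shape are hypothetical, chosen only to illustrate one definition of the logic being handed to several execution modes.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Transform = Callable[[Record], Record]


def normalize_order(record: Record) -> Record:
    """The single definition of the logic. It knows nothing about where it runs."""
    return {
        "order_id": str(record["id"]),
        "total_cents": int(round(float(record["total"]) * 100)),
    }


def run_batch(transform: Transform, records: List[Record]) -> List[Record]:
    """Batch mode: process a complete list of records in one pass."""
    return [transform(r) for r in records]


def run_stream(transform: Transform, records: Iterable[Record]) -> Iterator[Record]:
    """Streaming mode: emit each transformed record as it arrives."""
    for r in records:
        yield transform(r)


def run_on_demand(transform: Transform, record: Record) -> Record:
    """Request/response mode: transform one record at call time."""
    return transform(record)


# Hypothetical sample data, used only to exercise the three runners.
rows = [{"id": 1, "total": "19.99"}, {"id": 2, "total": "5"}]
batch_out = run_batch(normalize_order, rows)
stream_out = list(run_stream(normalize_order, iter(rows)))
single_out = run_on_demand(normalize_order, rows[0])
```

Because every runner calls the same `normalize_order`, the batch, streaming, and on-demand results agree by construction, and a new runner can be added without touching the transform.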

And then in terms of the ideas that you had going into this business and this problem space, of, this is how I'm going to approach the problem, these are the technical aspects that I'm going to have to cover, this is how the sort of overall workflow and end user experience is going to look: what are some of the ways that that initial design and architecture has changed or evolved? And what were some of the early assumptions that you had going into this that have been either challenged or invalidated as you have built out the platform?

I think when we started, we thought that streaming would be the be all and end all. You know, it will stream. It will be micro batching and do all of that.

We did come across real time use cases pretty early on, and then we also came across use cases where streaming could be an overhead that was not necessary to bear. So I would say that, you know, we started from that streaming engine core, and then we became a multimodal engine over time. You know, our abstractions helped us get there, but that was definitely a big change of assumption.

We always thought of data as who is, you know, producing the data, who's using the data. And I think, you know, that concept has also evolved quite a bit within our product, just, you know, giving more flexibility and not just thinking about the source and destination, but thinking about the Nexset as, like, hey, that's an entity that you can use, and you choose how you wanna use it. Or things like, hey, if I have a Nexset, then can I find it? You know, does it serve, like, a search purpose?

So, you know, that piece also evolved quite a bit when our initial thought was more on sources and destinations.

And, I mean, 1 other thing is that having a platform which is pluggable at different places has huge benefits.

So the way we thought about some of these things was, what are the points where we want the platform to be open? Can somebody use the underlying infrastructure and bring their own connector? Can somebody bring in their own transformation logic? Can somebody bring in their own validation logic? All of these things, when we started thinking about this, we were like, oh, let's try to open up these points.

But what has happened is, over a period of time, the points where we have opened this up, by our APIs, by our command line interface, have made the platform more and more powerful.

Yeah. Actually, we started with the inspiration from, sort of, Excel. A lot of built in functions that anybody can use, but advanced users can create macros and add their own custom menus.

So we thought that no code and low code would be a great combination to solve everything.

But eventually, what happened was we ended up opening up all our APIs.

We ended up creating a command line interface, which we had never thought we'd do in the early days. But our own ops team said, oh, hey. We want a command line interface, and then we opened it up to our customers. So there's definitely things that came along. And I think that's what makes the data space exciting, to be honest. Right? I mean, you think you have done some amazing work and then up comes a new challenge the next day because a customer asked you for something and you're like, yeah. You're right. You know? We should have that, or it makes sense. And then you're on to the next 1. So yeah.

Going to the sort of automated integration and some of the abilities that you're exposing through these Nexsets for being able to say, this is my source dataset, this is how I want it to end up, and then being able to generate the transformations or generate the logic so that the user doesn't have to handle that. That's where a lot of the challenge comes in for data practitioners: I have the source data. I need to be able to transform it in such a way that I'm either not introducing my own sort of bias or my own interpretation of this information, or I'm not going to create a lossy transformation that prevents me from being able to actually answer the question that I was aiming for. And so I'm wondering what you have found to be some of the useful heuristics for being able to interrogate the source system to say, what is the data? How is it stored? What is the schema? How am I going to be able to actually generate these transformations to answer the question that the end user is asking, or present the data in the way that they're looking to, you know, send it on to another system, and just some of the ways to be able to actually encapsulate and hide some of that complexity from the end user?

Actually, on that point, we again go back to the fact that within an organization, there'll be people of different skill levels. Right?

So, you know, when we think of our connector, there's a very advanced mode where you can configure many, many things into it and literally connect to anything, but that converts into a template. Right? So if I have configured that, hey. This API is OAuth 2 and it is 3 legged OAuth and it has a pagination of this style. And if you call this API and pass its data to that API, then we'll get this result.

If I have, you know, put all of that in and configured that in Nexla, that can become something that's prepackaged for the next person. They can just come and click on a button and go with it. So from a heuristic, I think more of it also comes from just, like, how will people work together. Right? And how will the person who knows a problem solve it once, so that it then becomes a solution forever. Right? You know, at some point, we'll make this community capable as well, so you can share what you have discovered or built in the broader community. So that's a big part of the heuristic that comes in.

You know, from a very technical perspective, I think there are definitely things that we have seen as far as, like, schema management and stuff. Like, what can you do when you start seeing more data and you see slight changes in the structure or the color of the attributes and, you know, stuff like that. And even simple unions of schemas can give you some very interesting results. Of course, we do more advanced matching of schemas and overlaps and stuff. Some simple heuristics can go a long way.

And then another interesting aspect of these logical units of data that you're building with the Nexsets is the idea of being able to compose them together or do further refinements of an existing dataset.

And I'm wondering how that plays out in terms of being able to say, I've got this Nexset over here where I'm pulling data from a Postgres database and presenting it as a, you know, Parquet file in S3, and I've got this other 1 where I'm pulling data from, you know, a Kafka stream, and I am presenting it as an API. And I now wanna take those 2, and I wanna put them together and get a new Nexset out of that.

There are a couple of very interesting concepts there. The core concept is that a Nexset plus code or logic creates a new Nexset. So you take a Nexset, you transform it, the outcome is a new Nexset. You could give somebody access to that. Then they could do their own transformations, filters, enrichments, whatever, and get a new 1 and a new 1 and a new 1. So that's kind of 1 sort of aspect, which is linear.
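
As a rough illustration of that "Nexset plus logic creates a new Nexset" idea, here is a small Python sketch. It is not Nexla's implementation or API. The `Dataset` class, its `derive` and `lineage` methods, and the sample product records are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, object]
Logic = Callable[[Iterable[Record]], Iterable[Record]]


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a logical dataset: a name, records, and a parent."""
    name: str
    records: Tuple[Record, ...]
    parent: Optional["Dataset"] = None

    def derive(self, name: str, logic: Logic) -> "Dataset":
        """Dataset + logic -> new dataset. The original is left untouched."""
        return Dataset(name=name, records=tuple(logic(self.records)), parent=self)

    def lineage(self) -> List[str]:
        """Names from the root dataset down to this one."""
        names: List[str] = []
        node: Optional["Dataset"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# Hypothetical sample data for the linear chain: filter, then enrich.
products = Dataset(
    "products",
    (
        {"sku": "a", "stock": 3, "price": 10.0},
        {"sku": "b", "stock": 0, "price": 4.0},
    ),
)
in_stock = products.derive(
    "in_stock", lambda rows: (r for r in rows if r["stock"] > 0)
)
with_tax = in_stock.derive(
    "with_tax",
    lambda rows: ({**r, "price_with_tax": round(r["price"] * 1.1, 2)} for r in rows),
)
```

Each `derive` call returns a new object that remembers its parent, so any intermediate dataset can be handed to someone else to continue the chain.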

The second aspect is, like, okay. Combine 2 or 3 Nexsets together. We see this all the time with some of our customers. Take the product Nexset. Take the inventory Nexset. Take the price Nexset by store. Combine that all into a single entry. There is a concept of looking up from 1 Nexset to another. You know, just like in Excel, you have, like, a VLOOKUP, for example. You can do lookups across Nexsets. We have the underlying mechanics to say, hey. If the data is a stream, then appropriately cache it so you can do very fast lookups on that. We'll soon be introducing this concept of, like, even SQL joins and stuff like that on that. So there will be more things coming, but, yes, it does make it possible for someone to say that in order to get the outcome that I'm looking for, I can create more derivatives or at any stage pass on to somebody else who might be the right person to do it.

And I think this is fundamentally possible because of having an open entity which is composable. Right? So as you said, if you get data from Postgres and you get data from S3, currently, if you have to do it, you'll end up writing a lot of integration code and exposing something on top of this. But when you do this with Nexla, you're able to get 2 Nexsets, join those 2 Nexsets, and you have, like, 7 different ways of accessing that Nexset out of the box.

We had a very interesting use case. It was a freight broker that was getting, you know, emails from drivers, and that email would have an attachment of, you know, their driver's license or insurance policy.

And their goal was, hey. Can we validate this driver and onboard them onto the platform? Right? So they made these emails sources in our system. And the first Nexset that it would detect from the email would have the attachment, you know, sort of pointing to that, and then it would detect some tables inside the email if they have that and stuff like that. Right?

That Nexset, you know, then was enriched by making calls to some, you know, OCR systems to say, okay. From that image, what other information can be grabbed? So the resulting Nexset after that sort of processing or the enrichment stage with, you know, some external services had a lot more data from those documents. And then they had the logic to say, oh, is the driver's license valid? Is the insurance policy large enough? Or whatever criteria they had.

And then the outcome of that Nexset was piped as an API into their system. So they were able to say, okay. From now, we are able to onboard, like, 80 plus percent of the drivers automatically.

And then where it didn't meet that confidence score from the result, they could pipe the outcome of that Nexset out into a spreadsheet that ends up on the desk of, you know, an ops person who will manually approve or disapprove. Right? So we find people come up with some extremely compelling and interesting applications by composing these Nexsets together along with other third party applications or systems they have.

And then for the situation where you might have 2 different Nexsets and you only want sort of half of the information that each of them is presenting so that you can combine that in a different way for a third Nexset. Do you have a way of being able to

sort of optimize the execution so that you don't have to run both source Nexsets to their completion and then take that output as input to the third 1, where you can say, okay. Well, I see that I'm pulling data from Postgres and S3 for this Nexset over here. And over here, I'm pulling data from Kafka and then an API, but I actually only want the data from S3 and the API. So I'm only going to make those calls for this third Nexset when people request it.

So the execution plan of the Nexset is tweakable. So you can say, I want this Nexset to run in execution engine A. I want to run another Nexset in execution engine B. There is a windowing concept that you can apply on top of a stream, for example, in this case, and be able to get a Nexset almost immediately out. Right? So from a perspective of the underlying streaming and the real time architecture,

you are able to do things like these fairly easily.

I would say that, you know, for some of these use cases that you mentioned, sometimes we also recommend to people that, hey. It may be okay to materialize a Nexset into some system, then use it, and then make that a source and, again, make a new Nexset. So I'll give you an example. Right? You get data from 3 different systems and you're trying to do something complex and you're like, you know, you don't wanna force fit a solution. Right? So you can say, okay. These 3 Nexsets, I'm gonna materialize, say, in a database. Right? Or data, you know. And then I'll do some, you know, ELT sort of transforms in there or whatever, and then make that outcome a Nexset and then take it further downstream. So problem solving in computer science and software is always like, you can solve the problem in 5 different ways. You know, 1 of the ways is as Avinash is saying, and I'm like, you know, maybe that will be too complicated for a user to figure out. And sometimes you can break it down into a few linear steps and do that. So yeah. There's never 1 right answer in some of these use cases.

Yes. The correct answer in computer science is always, it depends.

It depends. Yeah. Yeah. People ask me, like, oh, is this the modern data stack?

I'm like, no. There's no 1 solution that fits, you know. Yes.

You can do ELT and it makes a lot of sense, but, you know, no. It's not an answer for everything. I remember the question people used to ask, like, hey. Is Java the programming language that will be there? And then everything else will be gone.

And, of course, over time, like, no. More programming languages will come, and it depends. It depends what problem you're solving. So I would say the same thing with data solutions as well.

The capabilities and the opportunity there are definitely very impressive, and you are being very ambitious in terms of the types of problems that you're trying to solve and how you're trying to make this overall space of data complexity more palatable and easy to consume.

And I'm wondering what are some of the sort of grand hopes and ambitions that you have for the Nexla platform and some of the potential that it presents to organizations for being able to build these data exchanges that are sort of more flexible and composable?

Yeah. I mean, at a very, very base level, our goal was always to present sort of multiple possible frameworks that people can use to build these data solutions.

That was 1 part of it. The second part of it was, like, how can you go from an idea to something working very quickly? You know, I have seen that large companies can spend, you know, months or years figuring out the framework and never get to an output. So we were always about, like, how can you get to do that faster?

Get the data ready to use in the system where you wanna do that. Right? So our big hope in general is that, you know, by taking a little bit of an approach where we step back and we said, hey. It makes sense to have certain abstractions in the data flow, and those abstractions can apply in multiple places. We think that will make life easier for many users. Right? You don't want 1 system to discover the data, then have another system to integrate it, then have a third system to prepare it, then have a fourth system to monitor it, then have a fifth system to catalog it. I mean, people have done that and it becomes extremely hard. So, like, what's a good baseline that can cover and solve a set of problems for people? But then an open design so they can plug in their favorite tools and pieces in different places. So, again, we go back to that sort of approach. It's a framework, and we present it to the people who are using it. And, yeah, we do hope that this does become or introduce a collaborative way of working with data where people of different skill levels are using it, because I can see the future in my kids, who are already learning to code at age 11 and, you know, looking at data in new ways. And, you know, the future is everybody's working with data. And that's only possible if we build better, more robust, more usable data systems.

And in terms of the ways that you've seen the Nexla platform used and the Nexsets that have been created, what are some of the most interesting or innovative or unexpected applications of the technology that you've seen?

Using that, you know, email attachment is a very interesting 1.

We had a large financial institution use, you know, S-1 filings, which are publicly available, as a data source. And our system could take care of, you know, fetching the data, handling the rate limits, and all of that stuff, and even detecting some tables in those documents.

And then they could focus on, you know, their machine learning service that could take that unstructured document and spit out more sort of, you know, identifiable entities or features for them. Right? And we felt like, oh, that's a very interesting way to use it because it's accelerating them and their road map and reducing their dependency on self engineering.

So, yeah, we do routinely come across some very interesting use cases.

Business use cases like a shipping broker, multiple ecommerce companies working with different shipping brokers.

Each 1 of them has built their tracking systems in a slightly different way. Now for an ecommerce company to integrate with all of those is going to be a nightmare. It's probably a 5 year project. But when you have something like Nexla there,

you get Nexsets on the input side, you get Nexsets on the output side, and you handle all of this with a single API call irrespective of who the data is coming from, who the data is going to. So things like these, we didn't originally intend that this would be used in this particular way, but when we see this, it's like, oh, it makes all the sense. You're able to get 1 Nexset on the input, another Nexset on the output. With 1 API call, you're solving a business problem in a matter of days for things that would have taken you years to do.

And in your own experience of building the technology and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of creating Nexla?

We actually came in to starting Nexla with a lot of data experience, but we've been surprised by some of the challenges. I'll give you a non technical challenge that we sort of solved, and maybe Avinash can give a technical 1.

1 of the things is that, as repeat entrepreneurs, we wanted to make sure that we build a product that works. You know, we build a product that actually solves a business problem, that people are willing to use and pay for, and do that in a very efficient way. Right? So we ended up creating a company which started with a seed round of funding and then became profitable on that seed round of funding before going on to take, you know, further sort of money. But we wanted to make sure that we're not in this sort of cycle of, you know, keep raising more money and hope at some point we'll find the right use case. So that was a big challenge for us, you know, getting it done in practice, while in theory, it sounds great.

The technical challenge is

when you start a company, you want to build it on the latest and the greatest technology that's out there. Right? So when we were thinking about this, we were like, okay.

Let's pull out Kubernetes. Let's do Spark, Kafka, whatever is out there. And then you talk to the first 5 customers and they're like, oh, we use an AS/400.

Now, I have an AS/400, or do you have this thing on z/OS?

So just sort of marrying those 2 worlds, like, yes, we are using the latest and greatest stuff, but it needs to work with the technology people have built in the last 50 years, is itself challenging. Imagine getting a Nexset from an AS/400. What does that look like? Right? So I think that aspect of it is something that we did not start with, but we

saw that the platform can be used for all these cases as well.

So that happened with our initial thinking about everything should be streaming in real time. Right? I mean, you're like, oh, why not? Like, everybody would want streaming at real time. That's great. But you go down to the company, they're like, no. I don't need streaming or real time for most of my data. It's fine. So, you know, you can do everything in that way, but how do you still keep it compute efficient, cost efficient, and everything? Right? So yeah. Sometimes it's good to start with the latest tech and try to be ahead of the market, but you gotta meet at the right spot.

So how much RPG code did you have to write?

We figured out how to sort of keep it as segregated as possible.

So let's say it's less than 1% of the whole thing.

I have some background working with an AS/400 as well, where in the first job that I had in tech, I was the sysadmin, and 1 of the machines that I was responsible for keeping running was an AS/400. So...

No. I was amazed at that implementation when I saw that happen. I mean, you know, running in the cloud, peering in with an on prem system, and, you know, that connector getting done extremely fast was like, okay. You know, through many, many months of iterations on that connector architecture, we had finally gotten it right, where we could connect to a completely new system with, you know, relatively little work.

And so for people who are looking to improve the overall operability of their data and streamline some of their processes, what are some of the cases where Nexla might be the wrong choice?

We would like Nexla to be the right choice whenever you are working with data, of course.

We always clarify to people that, look, we're not a BI or an AI tool. Our job is to bring the data to you. There are many aspects of it. Right? I mean, data is a very broad space. Okay? For example, as I mentioned, we do monitor the data, you know, as we're sitting in the flow. We're not going to say that we're gonna solve all your data monitoring challenges. What we do best is we generate so many signals that you can consume in your Datadog or your favorite monitoring tool and do it there. So our goal obviously is we don't wanna reinvent anything. You know, if something is there, we want you to connect with it. We want to power it. Right? So we certainly, you know, clarify that part. You know, as I mentioned, we do run SaaS but also offer, like, on prem options. So if you're running on prem and you wanna do this 1 time data migration, I don't know if you really want to bring in a system and get the security approval to deploy it for a 1 time use case. So there are definitely cases where, you know, you have to think about, hey. Is this technically the right solution? And then you have to think about, business wise, does this make sense? But I think we cover a very broad set of use cases extremely efficiently for companies.

In terms of your plans for the near to medium term of the platform, what are some of the new capabilities that you're targeting or some of the projects that you're particularly excited to work on?

We're expanding on this concept of templates, where a user, you know, who's more technical or has some specific know how defines something, you know, defines the template of what is my schema, what's my validation, or what's the connector, or what's the Nexset and stuff like that. And then other people use it, so we're expanding on that. We'll be bringing in more community capabilities, like, you know, how can you do this, share your knowledge in the company, but also outside.

The concept of data exchange, which you asked about early on, that's not something that's, you know, formally offered as a product. You'll see that. Embedding Nexla into your own product, that's something that some companies have started to do using our APIs. They embed those capabilities into their own products, you know, as a single button sometimes or whatever. That's another area. So yeah. And as we keep working with more enterprises and more sectors, we do see that the challenges are very similar. You know, we have had an education company use us. You know, they need to work with data across the school districts and tech platforms. We have seen marketing companies use us. We have seen finance, ecommerce.

So, you know, we'll definitely

hope that, you know, we can keep scaling up our capabilities across these sectors and their unique needs.

And so are there any other aspects of the Nexla platform and the technology and the problem space that you're working in that we didn't discuss yet that you'd like to cover before we close out the show?

Yeah. I think we covered a pretty good set. You know, there's a lot to the data challenges. I think we're taking our bite out of getting the data ready to use to the user.

I think, you know, ask us in 6 months, and I'm sure new surprises will have come up and new things will have been done here.

Well, for anybody who wants to get in touch and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I think that I'm super happy that, a, people are starting to focus back on data tools. I think, you know, 3, 4 years ago, people were like, hey. This has been a solved problem. Companies have been there in this space for 30 years. And, yes, data tooling has been around for more than 30 years. Right? Starting with GoldenGate and Informatica. Right?

So first of all, a lot of stuff happening. Super exciting. The gap that we still see out there comes down to that operational scale. How will data be used in more applications and by more people? And I think I'm getting more and more interested in the people aspect of it because that's very challenging.

What Avinash said, right, like, how do you present all this complexity in a way that people can use it? And it's a hard challenge. You know, that's what the technology challenge usually is. How do you take something very complex and make it easy? How do you make driving easy into a self driving thing? You know, some people have that as well. Right? So that's certainly a big challenge. The other part that, you know, we think is the missing gap is that usually when people say it's easy to use, it comes with, oh, it's not gonna be powerful enough. Right? So where we find the gap is, how can tools cover that continuum? Right? Be easy to use, but still give the flexibility and the power so that people can bring complex problems to that. So that's another area where I think more tooling will come, and no code, low code,

developer tools will probably merge together.

So there's a lot of closed tooling that's available in the data space today. You have tools to do a lot of things, but you can't sort of interact between those tools very easily.

And that's where we think having an open platform is super helpful, but also having some of these concepts that we have added with Nexsets and what you can do with those Nexsets, that makes a very powerful combination.

Sort of plug it in with something

where you deem it's right, but also have these broad base capabilities that you produce.

Yeah. That's actually a very important point. I think in his last company, where we worked together, Avinash actually designed a common standard for ad creative so that, you know, multiple companies can come and collaborate together.

Will there be a common data standard in terms of how data flows, you know, across companies? We are very close to that problem, and we think, you know, and we hope that does happen over time. You know? Yes. Open up the signals from your tools. We do that through APIs. We publish a JSON manifest of everything in Nexla.

So, yeah, we hope that we'll ultimately help the end customer use all these tools together in an efficient way.

Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing with Nexla. It's definitely a very interesting problem and an interesting solution that you're working on. Definitely appreciate all the time and energy you're putting into helping to make data operations and just data management more accessible to more people. So thank you for all of your time on that, and I hope you enjoy the rest of your day.

Thank you for having us here. It was really exciting to chat about this, and we'll share more adventures and stories the next time we reconnect.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
