Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.
Interview
Andy Eschbacher From Carto
- What are the challenges associated with storing geospatial data?
- What are some of the common misconceptions that people have about working with geospatial data?
Contact Info
- andy-esch on GitHub
- @MrEPhysics on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Todd Blaschka From TigerGraph
- What are graph databases and how do they differ from relational engines?
- What are some of the common difficulties that people have when dealing with graph algorithms?
- How does data modeling for graph databases differ from relational stores?
Contact Info
- @toddblaschka on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey. And last week, I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part, you'll hear from Andy Eschbacher of CARTO about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.
[00:01:05] Unknown:
My name is Andy Eschbacher. I'm a data scientist on the research team at CARTO. I work on problems for clients, largely with spatial data, and we solve their problems on CARTO's platform. I also build tools to help data scientists use CARTO more natively in a data science environment. One of the products that we have is called CARTOframes. It's a Python package for using CARTO in a Python environment, and it shines in Jupyter Notebooks. And we also work with partners who provide us with spatial data that they either want advice on or they want to surface on our platform for consumption by our clients.
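As a rough sketch of the workflow he's describing (hedged: the account URL, API key, and table and column names below are placeholders, and this assumes the CartoContext interface that cartoframes exposed around the time of this interview):

```python
import numpy as np
from cartoframes import CartoContext

# Hypothetical account URL and API key
cc = CartoContext(base_url='https://your-account.carto.com',
                  api_key='your-api-key')

# Pull a hosted table down into a pandas DataFrame for local munging
df = cc.read('boston_foot_traffic')

# ...ordinary pandas work in the notebook...
df['log_visits'] = np.log(df['visit_count'].clip(lower=1))

# Push the result back up to CARTO for mapping and further analysis
cc.write(df, 'boston_foot_traffic_clean', overwrite=True)
```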
[00:01:47] Unknown:
And a lot of times dealing with geospatial data can be challenging just because it's often either misunderstood how to work with it best, or you might not necessarily have the right tooling or techniques to source it properly, label it appropriately, and then make use of it at the destination. So I don't know if you want to talk a bit about some of the solutions that you've built for storing that data and surfacing some of that capability in your platform.
[00:02:16] Unknown:
Yeah. As most people know, with data science outputs, 80% of the work is just getting the data into a good form. One way that we're trying to handle that is, we don't want data scientists to necessarily have to work on CARTO's web platform to do that. We wanted to build a product like CARTOframes where they can work in Python pandas, or whatever tool they wanna use in a Python environment, so that they can do all of that data munging and so on, and then push the data up to CARTO and maybe do some additional data processing off of that. Other ways that we have for data analysis and data processing in CARTO: we have analysis chains, where you can apply a series of analyses that are all related to each other.
And if the underlying data source changes, the analysis bubbles up. So maybe you wanna do a filter, then you wanna do some sort of cluster analysis, and then you wanna visualize the output of that. CARTO will handle the various nodes of that and then surface the output that you're interested in. At the endpoint of that you get an interactive map, and you can add widgets to that map and then do filtering off of the widgets to tell the story that you're interested in telling.
[00:03:32] Unknown:
And do you find that there are certain common misconceptions or mistakes that people fall into when they're dealing with geospatial data? Yeah. Data quality is a big problem in geospatial data.
[00:03:44] Unknown:
I think that, if you're working with GPS data, the margins of error, depending on the quality of your data, can be 10 meters, or they can be 50 meters or even greater than that. So trying to get your data into a proper form, and making informed decisions off of the data, can be very challenging. There's a lot of nuance, and very much an art, in getting geospatial data into the correct form. A lot of the time, with data like GPS data, you wanna aggregate it to a small boundary. Sometimes you wanna do, like, a hex grid, or, like, a 100-meter square. And then you're telling the story of, like, what was the summary of, maybe, foot traffic data in this location at this time of day, on this day of the week.
There are obviously margins of error that go along with that. But I think with any analysis that somebody does, communicating the errors is especially difficult, as is trying to convince whoever is looking at it that this is the output that you have. For communicating the errors on a map, I think there's not a good standard way of doing that right now. And so by having additional components that you can add, like the widgets, for doing filtering or communicating things like residuals, we're hoping to help take away some of the certainty that people have about visualizing spatial data, and really look at what the residuals tell you in an interactive map application.
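To make the aggregation idea concrete, here is a minimal, self-contained sketch (plain pandas, not CARTO's actual implementation; the data is invented) that bins GPS pings into roughly 100-meter cells and summarizes them by time of day:

```python
import pandas as pd

# Toy GPS pings: latitude, longitude, and hour of day (illustrative data)
pings = pd.DataFrame({
    'lat': [42.3601, 42.3605, 42.3612, 42.3604],
    'lon': [-71.0589, -71.0591, -71.0580, -71.0588],
    'hour': [10, 10, 22, 22],
})

# ~100 m expressed in degrees of latitude; longitude spacing varies with
# latitude, so a production system would project to meters first
CELL = 100 / 111_320  # degrees per ~100 m

pings['cell_lat'] = (pings['lat'] // CELL) * CELL
pings['cell_lon'] = (pings['lon'] // CELL) * CELL

# Summarize: how many pings per cell per hour of day
summary = (pings.groupby(['cell_lat', 'cell_lon', 'hour'])
                .size()
                .reset_index(name='ping_count'))
print(summary)
```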
[00:05:24] Unknown:
And as far as collecting the geospatial data, it can come from myriad sources, you know, whether it be somebody's smartphone or IoT sensors or just a home address or an IP address. So I don't know if you have any particular techniques, or issues that you've faced, in terms of collecting that data and storing it in a manner that can be unified and addressed in a uniform and tractable way?
[00:05:49] Unknown:
Yeah. That's something that we're working really hard on right now. We definitely don't have it solved yet. So the GPS data is especially messy. In the data cleaning, we found we have to throw out about 80% of the data because there are various inconsistencies. Maybe you have a duplicate ping at the same point, or maybe you have two pings so far apart that they'd imply faster-than-light travel. And so you have all of these conditions that you have to look for in the data cleaning process, and you end up losing about 80% of the data. And then for the 20% that you have, you wanna make sure that's representative of the population that you're looking at as well. And each dataset's different. So it's about spending a lot of time with the data, trying to understand its nuances and quirks, and speaking to the people who provided it so that you can understand where they're coming from and how they collected it. One of the quirks is that a device might not sync for a couple of days, and so the timestamps might not be as accurate as you think they are.
In cities, there's a lot of scattering off of buildings, and so you don't really know very well where the person was. And so there, one of the things you can do is snap to a road network, so that you can make the assumption that this person probably was on the road, or, if they were in a building, that the building wasn't very far from where they were. And then, since each dataset's different, there's combining those datasets to give representative samples. So if you're interested in this footfall problem, say, how many people walk on this sidewalk at 10 PM on a Sunday, being able to pull in multiple sources, use techniques to make good predictions of what the actual footfall was, and then validate that against some known sources is the process that we're going through right now. We definitely haven't solved it. It's a hard problem.
But, yeah, that's one of the core problems that we're working on on our team right now.
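As an illustration of the "physically impossible travel" check described above (a sketch under assumed column names, not CARTO's actual pipeline): compute the implied speed between consecutive pings for each device, and drop pings whose implied speed is implausible.

```python
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    r = 6_371_000
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def drop_impossible_pings(df, max_speed_ms=70.0):
    """Drop pings implying travel faster than max_speed_ms (~250 km/h).

    Expects columns: device_id, timestamp (datetime64), lat, lon.
    """
    df = df.sort_values(['device_id', 'timestamp'])
    g = df.groupby('device_id')
    dist = haversine_m(g['lat'].shift(), g['lon'].shift(), df['lat'], df['lon'])
    dt = (df['timestamp'] - g['timestamp'].shift()).dt.total_seconds()
    speed = dist / dt  # inf/NaN for zero-dt duplicates, filtered out below
    # keep the first ping of each device (NaN dt) and plausible speeds
    return df[dt.isna() | (speed <= max_speed_ms)]
```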
[00:07:54] Unknown:
And there are a lot of different standards for storing and processing geographical information, whether that's GeoJSON or raster files or vector files or, you know, maybe some proprietary schemas. So are there any particular technologies that you've settled on for trying to unify that storage?
[00:08:14] Unknown:
So CARTO is pretty agnostic as to data storage. One thing that we rely heavily on is the open source community around geospatial. So ogr2ogr is a great tool for transferring data between formats. When people pull their data into CARTO, it's in a PostgreSQL database that has PostGIS on it, so CARTO stores the data in a database. That data transformation step is really hard, so we built tools like CARTOframes so that people can do that ETL process for themselves. We support almost all of the major formats for data, both for input and for output as well.
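For context, a typical ogr2ogr invocation for this kind of format translation looks something like the following (a hypothetical example: the file, table, and connection details are placeholders), here wrapped in Python to keep one language throughout:

```python
import subprocess

# Hypothetical example: convert a shapefile and load it into a
# PostGIS-enabled PostgreSQL database
subprocess.run([
    'ogr2ogr',
    '-f', 'PostgreSQL',                          # output driver
    'PG:host=localhost dbname=geo user=carto',   # destination connection
    'neighborhoods.shp',                         # source dataset
    '-nln', 'neighborhoods',                     # destination table name
    '-t_srs', 'EPSG:4326',                       # reproject to WGS84
], check=True)
```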
[00:08:53] Unknown:
And what are some of the most interesting or innovative uses of the CARTO platform that you've seen?
[00:09:00] Unknown:
One thing I'm especially excited about right now is that we just released a React library, so you can build an application using pieces of CARTO, and then you can use our vector library and other tools, like turf.js, to do some of the spatial processing on the client. And then once you do that spatial processing on the client, you can send it up to CARTO and do more spatial processing, maybe more data-intensive processing, on the server. So the number of tools coming out that allow you to build these applications, taking advantage of different parts of the tech stack, and to iterate quickly over building these projects, is especially exciting right now. What you can do with vectors is really exciting.
Being able to rapidly build prototypes or dashboards with React, being able to do server-side applications, being able to build applications with Python using CARTOframes. One example that we have that pulls a lot of these things together: we do an ETL process using CARTOframes. We're using open data off of Bigbelly trash cans here in Boston, which report the status of how much trash is in them. And we're building a model to make predictions, based off of footfall and other demographic information in Boston, on how often these should be emptied. So the process for doing that was: we did the ETL with CARTOframes, and we sent that table to CARTO.
We built a dashboard using CARTO VL and Mapbox GL. We did, like I said before, some of the spatial processing using turf.js in the browser. We visualized the data using a React library called react-vis. And then we built a special PostgreSQL extension that allows us to store the output of a model. So we trained a model based off of a number of features, we stored the model on CARTO in a table, and then, once we got new data later, we referenced that model to make new predictions. And then we also extracted feature importances to communicate the results to the user. So being able to not only visualize the data, but to analyze it and then give the model summary to the person, all in one dashboard, is something that's really exciting to us. One of the main products that we have at CARTO is Builder, but by having all of these components that you can put together, you can build your own version of what you want Builder to be.
And for us on the data science team, we want to build dashboards that allow users to extract all of the power they can from very specific analyses. So the one I was talking about before is just a gradient boosting regressor dashboard, but we also wanna do things for optimization so we can build a custom application for a client.
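As a rough sketch of that "store the model in a table" pattern (illustrative only: this is not CARTO's actual PostgreSQL extension, and the connection string, table layout, and features are invented):

```python
import pickle

import numpy as np
import psycopg2
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder training data standing in for footfall/demographic features
X_train = np.random.rand(200, 3)
y_train = np.random.rand(200)
model = GradientBoostingRegressor().fit(X_train, y_train)

conn = psycopg2.connect('dbname=carto user=analyst')  # hypothetical DSN
with conn, conn.cursor() as cur:
    # A simple table holding serialized models by name
    cur.execute(
        "CREATE TABLE IF NOT EXISTS models (name TEXT PRIMARY KEY, blob BYTEA)"
    )
    cur.execute(
        "INSERT INTO models (name, blob) VALUES (%s, %s) "
        "ON CONFLICT (name) DO UPDATE SET blob = EXCLUDED.blob",
        ('trash_fill_gbr', psycopg2.Binary(pickle.dumps(model))),
    )

# Later, when new data arrives: load the model back and score it
with conn, conn.cursor() as cur:
    cur.execute("SELECT blob FROM models WHERE name = %s", ('trash_fill_gbr',))
    restored = pickle.loads(bytes(cur.fetchone()[0]))

X_new = np.random.rand(5, 3)  # placeholder fresh observations
print(restored.predict(X_new))
```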
[00:12:03] Unknown:
And as one last question, I'm wondering if you have any particular perspective on what you see as being some of the biggest difficulties, or the biggest gap, in the available tooling or technology for data management?
[00:12:16] Unknown:
I think that there's a constant tension between having a ton of data and visualizing all of that data. There are some compelling tools out there, like MapD, to visualize this data and do some basic analysis on it. I think the value for the person looking at the map isn't necessarily in looking at all of the data, but in looking at samples of the data and then making informed decisions on how to better analyze it. So I still think there is a tension between visualizing all of the data and trying to figure out what the intermediary is between visualizing all of your billion points or just visualizing the summary and making decisions off of that. That's the answer I have for right now.
[00:13:05] Unknown:
I'm here with Todd Blaschka from TigerGraph. So, Todd, could you start by introducing yourself? Yeah. Hello. I'm Todd Blaschka, the chief operating officer for TigerGraph. And so TigerGraph is a graph database company, so I don't know if you wanna talk a bit about graph databases in general, what they are, why they're special, and why somebody might wanna use them over a standard relational database.
[00:13:29] Unknown:
That's a great starting point. Graph databases have been around for almost 20 years. It's all based on graph theory, which is all about finding the interconnectedness, and the meaning, of interconnected data. When you think about relational databases, relational databases require joins to find meaning between the data. And that is a very challenging thing for relational databases to do, even though they're called relational databases. A graph database is a natural way of thinking, a way to naturally connect the dots. Think of Kevin Bacon and six degrees of separation.
What's the connection? What's the relationship? That's what graph databases do. The first technology in the graph space came out about 17 years ago, a technology called Neo4j. So they have been the pioneers in this space, and they've done a wonderful job getting graph databases into corporations. And then over time, especially in the last couple of years with the scale of big data, meaning large datasets, the requirements on graph have been to scale to handle very large datasets, to provide real-time updates, and to get a lot of data in there very fast. So from a scalability and real-time standpoint, that's where customers are demanding graph databases. And we're seeing a lot more adoption in the enterprise because they're able to apply it to many different use cases.
[00:14:55] Unknown:
And when people are dealing with relational databases, it's pretty easy to understand that the data just goes in rows, and there are multiple columns in each row. So when you're modeling for a graph database, is there a different type of thought process that needs to go into how you store the data, and the types of data that get stored in various nodes? Can you store data as part of the edge connections? I'm just wondering what the data modeling process looks like. Excellent question. So from a data modeling standpoint,
[00:15:21] Unknown:
we often will take the data from relational databases and other data sources and really map it out. So an account is an account, a user is a user, but then you're looking at what relationships are attached to those nodes or entities. For example, if it's a user, if it's Tobias, well, what device is Tobias using to make a payment? Is it a phone? Is it an iPhone? What is the transaction device? All of these attributes of the device you want to be able to capture and analyze. And when you think about that in aggregate, how can that enable better customer service by letting you know more about the behavior of your users as a whole? And that's where, coming from the modeling side of graph, we're able to bring the data in and then provide Graph Studio, which is a visual front end for exploring the graph and looking at the relationships. What is the closest relationship, across multiple hops, between this user and that user, or this customer and that customer?
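To make that modeling concrete, a schema along those lines might look like the following GSQL sketch (the vertex, edge, and graph names are hypothetical, and it's submitted here through the community pyTigerGraph client as one way to run GSQL from Python):

```python
import pyTigerGraph as tg

# Hypothetical connection details
conn = tg.TigerGraphConnection(host='https://your-instance',
                               username='tigergraph', password='secret')

# What would be rows plus join tables in a relational store becomes
# vertices (entities) and edges (relationships), with attributes on both
print(conn.gsql('''
    CREATE VERTEX Person (PRIMARY_ID id STRING, name STRING)
    CREATE VERTEX Device (PRIMARY_ID id STRING, kind STRING)
    CREATE VERTEX Payment (PRIMARY_ID id STRING, amount FLOAT, made_at DATETIME)
    CREATE UNDIRECTED EDGE uses (FROM Person, TO Device, first_seen DATETIME)
    CREATE DIRECTED EDGE made (FROM Person, TO Payment)
    CREATE DIRECTED EDGE via (FROM Payment, TO Device)
    CREATE GRAPH Payments (Person, Device, Payment, uses, made, via)
'''))
```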
[00:16:19] Unknown:
And so one of the things that you mentioned was that scaling graph databases is one of the current concerns, and I'm assuming it has some technical difficulties associated with it. So how do you manage that at TigerGraph?
[00:16:32] Unknown:
TigerGraph was developed from the ground up to be natively parallel. Our founders come from the world of MPP, massively parallel processing, such as Teradata and Twitter, which is actually just a big graph system. And what we did is build from the ground up to be able to scale out. So when you look at first-generation graph, it was all running locally on one machine. Second generation started applying the back end of Cassandra, so really it's Cassandra storage, but you don't have the performance when you add that layer in there, and it's not a native graph. So for the third generation, we developed what we call native parallel graph technology to build a graph engine from scratch. We're using C++, and as part of that architecture, we're able to apply parallel processing to loading the data on a continuous basis, which has also been a very big challenge in the graph space. Just loading data in, in this day and age, shouldn't be an issue, but it is an issue when you talk to customers. If they have terabytes of data, it may take them 15 or 20 days to load it. That's not gonna help your business when you're just trying to get the data loaded. The second piece is the parallel computation. Every single node within the TigerGraph system is not only a storage unit but also a computational unit. So we're applying a process of SQL-like programming where you're able to do functions and build off each one of those nodes for an initial amount of computation. Everything's running in parallel, able to update in real time as you are using the graph. That brings in a whole other level of use cases, where you can provide real-time recommendation engines that look at not only what the shopper has done historically, but also what they're looking at today. If I'm looking at beach towels, I may also be interested in beach chairs or umbrellas. So all of a sudden, you can provide a lot more relevant information to a shopper while they're browsing. And this is where graph is connecting the relationships, and also able to bring back this information in sub-second query response times, so that you can develop a much tighter relationship with your customer, in this ecommerce example.
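For a flavor of what that continuous loading looks like in GSQL (a hypothetical job against the sketch schema above; the file path and names are placeholders):

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(host='https://your-instance', graphname='Payments',
                               username='tigergraph', password='secret')

# A declarative loading job maps CSV columns to vertex attributes; the
# engine parallelizes the actual load across the cluster
print(conn.gsql('''
    USE GRAPH Payments
    CREATE LOADING JOB load_people FOR GRAPH Payments {
        DEFINE FILENAME people_file = "/data/people.csv";
        LOAD people_file TO VERTEX Person VALUES ($0, $1)
            USING header="true", separator=",";
    }
    RUN LOADING JOB load_people
'''))
```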
[00:18:42] Unknown:
And would you use something like TigerGraph, or a graph database in general, as the primary data source, or would it generally be used in conjunction with a relational store or other types of data storage?
[00:18:52] Unknown:
Great question. Oftentimes it is being used as part of a system where customers may be doing machine learning, or developing some AI, and as part of their existing system they have data stored in other data stores such as HDFS, and they wanna be able to extract that to generate new features that may go into machine learning. That's where the data would come into TigerGraph, and that becomes a great aspect for data science professionals: being able to explore the data and look at where the attributes and connections are, in order to extract new features that they can bring into machine learning. And so that becomes part of the whole process. The data can then come out and go into supervised or unsupervised learning, for example.
One customer in the telecommunications space is using us to help address real-time scamming on the phones. Think about getting phone calls from people you don't know. Instead of just looking at what the phone number is, you look at the behavior of that number. Is it all one-way calls? Are there any callbacks? You start looking at these kinds of connections. Are there other people you call who are also getting calls from this number? This gives more insight into the relationships in the data, and that's helping this customer extract new machine learning features they can feed in so that they can reduce the number of fraudulent and spamming calls. Because the really sophisticated phishers or spammers are the harder ones to find; they know how to skip the easy tests. So you're looking for the relationships and the behaviors of the calls in order to support that, and to update that in real time, so that every 2 hours they're updating their system to try to reduce the number of fraudulent and scamming calls their customers are experiencing. Do the math: they have 500 million subscribers today. 500 million subscribers, 8 calls apiece, and you're dealing with billions of events every single day. That's big data. That's where TigerGraph can help.
[00:20:54] Unknown:
And what have you found to be some of the most difficult aspects of scaling and parallelizing graph data? And do you support anything along the lines of distributed transactions for that sort of query capability?
[00:21:06] Unknown:
Yeah. Great question. From a distributed standpoint, we've architected the technology so it can scale out horizontally. So we have customers that may be running on 20 commodity machines, handling, you know, terabytes and terabytes of data, billions of events. One of our largest customers is Alipay; think of it as PayPal, but it's part of Alibaba and it's 10 times bigger. They have 1 trillion vertices in their graph, so that's trillion with a t, the largest graph in the world, and they're handling billions of events per day. So they're scaling it out across 20-plus commodity machines, and that's all acting as one graph database because of our distributed technology.
[00:21:53] Unknown:
And for somebody who wanted to get started with using TigerGraph, is it something that they would install on premises, or is it a hosted platform? And what does the workflow look like for getting up and running with it? Getting up and running is very easy. It is a software product that you can
[00:22:08] Unknown:
download and run locally, or you can run in the cloud of your choice. We are not a hosted provider. It's not a SaaS service, but it's very easy to get up and running. What we find with customers is, if they know the dataset that they wanna bring in, it's very easy with our Graph Studio UI to create the schema, then load the data, then explore the graph, and then generate queries using our GSQL, where the G stands for graph: a SQL-like query language for finding insights in the data. And we have prebuilt queries in there where they can look at shortest path, PageRank, ways to help them understand their data and jump-start their exploration of their graph.
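As a hedged sketch of what querying looks like once the graph is loaded (the query name and parameters here are hypothetical; `runInstalledQuery` is the pyTigerGraph call for invoking a GSQL query that has already been written and installed):

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(host='https://your-instance', graphname='Payments',
                               username='tigergraph', password='secret')

# Invoke an installed GSQL query, e.g. a shortest-path or PageRank-style
# exploration of the graph, with runtime parameters
results = conn.runInstalledQuery('shared_devices',
                                 params={'person': 'tobias', 'max_hops': 3})
print(results)
```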
[00:22:50] Unknown:
And have you found that there are any common difficulties or misconceptions about how to work with and analyze graph data, as opposed to relational or document-oriented systems?
[00:23:00] Unknown:
With graph, the challenge is understanding what graph is. Most people have been trained on relational databases: you think of everything as columns and rows, and then you're trying to find and build these joins. With graph, joins are natural. So there's a first learning curve of just understanding how to think about it differently, but you're actually thinking about how you normally do things today. It's a natural way of thinking about it. And with document stores or time series, those are dimensions, dimensions that you can apply in a graph database. So time series could be one attribute.
Location can be another. So there are many different attributes you can apply. Once someone starts understanding the thought process behind graph, they start finding new ways to use it. I mean, some of our customers have told us, you know, what we thought was impossible is now possible, because we can understand, make sense of, and get insight into data we haven't been able to get to before.
[00:23:58] Unknown:
And are there any other aspects of graph databases or TigerGraph that you think we should discuss a bit more?
[00:24:04] Unknown:
With graph databases, my recommendation is go out and explore. Test some graph databases. If you have a business case, or you have a question you wanna have answered, that is the best way to start, because then you can look at how you bring the data together. An example: I just met with a financial services firm with what sounds like a simple question: I want to know how many wire transfers have a phone number attached to them. The law requires that there must be a phone number attached to a wire transfer. But if you have multiple divisions in the bank, and you have 70,000 wire transfers being done every single day, how do you find that information across 4 or 5 different divisions where data is stored in silos? It sounds like a very simple question, but for a relational database, it's a very hard thing to do. With graph, it's natural. So start with the question you wanna have answered, then apply it and, you know, try TigerGraph.
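As a sketch of how that wire-transfer question might be phrased in GSQL (an invented schema and query, not the firm's actual system, again run through the pyTigerGraph client):

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(host='https://your-instance', graphname='Transfers',
                               username='tigergraph', password='secret')

# Count wire transfers with no attached phone number, regardless of
# which division's silo the records originally came from
print(conn.gsql('''
    USE GRAPH Transfers
    CREATE QUERY transfers_missing_phone() FOR GRAPH Transfers {
        SumAccum<INT> @@missing;
        transfers = {WireTransfer.*};
        checked = SELECT t FROM transfers:t
                  WHERE t.outdegree("has_phone") == 0
                  ACCUM @@missing += 1;
        PRINT @@missing;
    }
    INSTALL QUERY transfers_missing_phone
'''))
print(conn.runInstalledQuery('transfers_missing_phone'))
```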
[00:25:02] Unknown:
And as a final question, what do you see as being the biggest challenge or gap in the available tooling or technology for data management that's currently
[00:25:10] Unknown:
available? The biggest gap we see is aligning it to business cases. Technology has advanced, with so much tooling out there. And when you look at the big data space, there's been so much movement about putting data in and moving it from one place to another, but without asking what the business-benefit question is. And that's where I think, you know, the data scientists are struggling: there's all this tooling, and they're trying to figure it out, but how does it really translate into value for an organization, or for your project, to be able to make sense of it? Well, thank you for that. And,
[00:25:46] Unknown:
for anybody who wants to follow up with you or TigerGraph, what would be the best way for them to do that? www.tigergraph.com.
[00:25:54] Unknown:
Come there, try out the product. Thank you very much. Hey. Thank you,
[00:26:05] Unknown:
Tobias.
Introduction and Conference Overview
Managing Geospatial Data with CARTO
Challenges in Collecting and Storing Geospatial Data
Innovative Uses of CARTO Platform
Introduction to TigerGraph and Graph Databases
Scaling and Parallelizing Graph Data
Getting Started with TigerGraph
Exploring Graph Databases