Summary
Many of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs, allowing efficient and accurate analysis of the relationships they encode. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform, it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.
- Your host is Tobias Macey and today I’m interviewing Jon Herke about TigerGraph, a distributed native graph database
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what TigerGraph is and the story behind it?
- What are some of the core use cases that you are focused on supporting?
- How has TigerGraph changed over the past 4 years since I spoke with Todd Blaschka at the Open Data Science Conference?
- How has the ecosystem of graph databases changed in usage and design in recent years?
- What are some of the persistent areas of confusion or misinformation that you encounter when explaining graph databases and TigerGraph to potential users?
- The tagline on your website says that TigerGraph is "The Only Scalable Graph Database for the Enterprise". Can you unpack that claim and explain what is necessary for a graph database to be suitable for enterprise use?
- What are some of the application and system architectures that you typically see for end-users of TigerGraph? (e.g. polyglot persistence, etc.)
- What are the cases where TigerGraph should be the system of record as opposed to an optimization option for addressing highly connected data?
- What are the data modeling considerations that end-users should be thinking of when planning their storage structures in TigerGraph?
- What are the most interesting, innovative, or unexpected ways that you have seen TigerGraph used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on TigerGraph?
- When is TigerGraph the wrong choice?
- What do you have planned for the future of TigerGraph?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Jon Herke about TigerGraph, a distributed native graph database and how you can use it for your applications. So, Jon, can you start by introducing yourself?
[00:01:38] Unknown:
Yeah. My name is Jon Herke. I'm a developer evangelist at TigerGraph. I've been here for about two years. To give my full background: I was a computer networking engineer in the military for 8 years. Then I worked at a giant company called Optum, part of UnitedHealth Group. I was a networking engineer there until I got really bored; it was a lot of copy and paste and configuration changes. I explored something new, which was being an entrepreneur in residence at this healthcare company, building startups. And in about August 2016, we started seeing emerging technologies really disrupting other industries, and we didn't want to get left behind. So we broke off of the startup incubator and created a new division solely focused on emerging technologies. If you think about when the Internet came out, people were asking: what's this thing called the Internet? Do we need this thing called a web page? I don't think we need a web page. A web page is a fad. These are the same questions that are getting asked about today's technology: what is blockchain? Do we need blockchain? Is blockchain a fad? What is graph? What is AI, computer vision, natural language processing, machine learning, quantum computing, the Internet of Things? So we worked hand in hand with the business units to apply these emerging technologies in the healthcare space. Then I got pulled into doing a lot of things in the community around graph, since my background was networking engineering.
I went really heavy into graph around 2018 and started to build the world's largest healthcare graph at that time, and started to do different events in the community. Then I got a call from Todd. He says, hey, can you do what you're doing, but do it for us? Interface with developers, help developers out, activate developers, engage developers, grow a community.
[00:03:16] Unknown:
And do you remember how you first got started working in the space of data, and what it is about data that has kept you interested and engaged?
[00:03:24] Unknown:
Yeah. So I guess I took the nontraditional route. You know, I went from networking to building companies to ultimately getting into data. I think the thing that's interesting about data is that it's an area to explore, to understand things that might not be totally obvious when you first look. So to me, it's the creativeness of identifying a problem and then trying to find an answer to that problem. There are some really interesting things from my history, and I'm pausing because my mom got diagnosed with cancer. You know, I was a database guru at Optum. I had a wealth of information available to me.
So the question was: what can I do as an engineer of data? What I could do is look at people that are similar to my mother: maybe the same preconditions, maybe the same age of 39, let's just say, female, on XYZ medication, has had this and that in the past. If I can run something to find similar patients and people like her, then I can identify maybe a better drug treatment path and a route to give her an opportunity at life. And that's just one story, but there are many stories like that. How to improve the healthcare system, how to improve people's lives.
So that's really what data means to me, is making a difference, making an impact, finding things that you wouldn't be able to find at scale. That's sort of been my journey.
[00:04:51] Unknown:
Sorry to hear about that. Thank you for sharing, and I'm glad that you're able to do something to help in that. And so that brings us now to what you're doing at TigerGraph. You mentioned that you ended up there to help grow the community around it, and I'm wondering if you can describe a bit more about what it is that the business is building and some of the types of community effort that you're doing to help grow the ecosystem around it and engage with developers.
[00:05:22] Unknown:
As a developer, when I was working at UnitedHealth Group Optum, one of the things that was very hard for me was that there wasn't enough tooling around the ecosystem for integration. There was no tooling for the more traditional DevOps things as far as building, constructing, automating, and deploying your solutions. There weren't syntax highlighters. There weren't all these different tools that you sort of take for granted, because the technology had just come out. There also wasn't a community. You had the ability to read the docs, which are phenomenal, but it's hard to troubleshoot small things: as a developer, you don't want to bother the TigerGraph engineer that developed the product about how to do a simple SELECT statement. That's something you probably don't want to go and ask the core engineer, but you might want to ask the community.
So when I got hired, the first thing I was trying to do was build that foundation, build a place for the community members to go to, including the forums and a Discord group. After that, it was a lot about building the foundational assets. There were people in our community already working on syntax highlighters, build tools, Gradle, automation, and deployments. So we'd work together in the community and pair up with developers to build some of these core integration components for the product itself.
[00:06:39] Unknown:
I actually had a conversation on this podcast about 4 years ago with Todd Blaschka, who you mentioned is the person who brought you into TigerGraph. We had a fairly brief conversation about what the core offering is, and I'm wondering if you can talk through some of the ways that the project and the product have evolved over those past 4 years.
[00:07:03] Unknown:
Yeah. I would say that since 4 years ago, the community and the ecosystem have grown. The product itself has grown in the sense that it's being challenged by all of our customers at TigerGraph in ways that weren't imagined before. So in some cases, instead of a hundred vertex types in their schema, they would define a couple hundred. Or instead of just having one schema, they'd have a multitude of schema changes over time. These sorts of edge cases, where there are lots of developers integrating with the product, push the product beyond where it was.
Beyond that, there are a lot of security things that were put in place: vertex-level access control, the ability to manage user groups across the different graph solutions, and more recently a real focus on scaling, so deploying and scaling using auto elastic scaling with Kubernetes.
[00:07:57] Unknown:
In that same time period, the overall ecosystem around graphs and their applications, particularly for machine learning use cases, has been growing and scaling. And I'm wondering how you've seen that ecosystem change in terms of the usage, application designs, and systems integrations where graph problems are being applied, and that actually require this core graph engine to be something that is natively available rather than an abstraction layer added on top of something like a relational database.
[00:08:34] Unknown:
Going back to my experience at Optum: let's say you have 40 to 60 billion vertices, and you have 100 billion edges. In the healthcare system, you have patients, you have claims, you have Rx claims; there are a bunch of different types of claims. You have providers' information, you have nurses' notes, you have calls, you have all of these different touch points. So when your data is super complex and you're trying to find insights in real time, this is where TigerGraph really shines. In 50 milliseconds, we could pull all of this data together in real time. So if you have a lot of different joins throughout your solution, you might want to look at graph databases.
So going back to having complex data that you're trying to retrieve in real time, there might be insights around your data that don't exist traditionally. You're looking at the shape of the graph, the relationships between different data elements, to derive new features. I guess the simplest example might be if you have in your data a person named John, another person called Larry, and a relationship called father. If you want to infer the relationship to the father's father, that would be the grandfather. So you can look at the patterns of the relationships to infer different elements in your graph that you can then use for your machine learning models.
You can also run graph data science algorithms to derive new features, alongside the traditional features. You can use the relationships between different data elements, and you can use the derived outputs from these graph-based algorithms, to then train your models.
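To make that concrete, here is a minimal sketch of the grandfather inference as a two-hop GSQL traversal, issued through the community pyTigerGraph client. Everything here (the host, the FamilyGraph graph, the Person vertex type, and the HAS_FATHER edge) is a hypothetical schema invented for illustration, not a real deployment.

```python
import pyTigerGraph as tg

# Hypothetical connection details, for illustration only.
conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",
    graphname="FamilyGraph",
    username="tigergraph",
    password="password",
)

# A two-hop GSQL query: follow the (assumed) HAS_FATHER edge twice,
# so the second hop lands on the grandfather.
conn.gsql("""
USE GRAPH FamilyGraph
CREATE QUERY infer_grandfather(VERTEX<Person> child) FOR GRAPH FamilyGraph {
  Start = {child};
  // First hop: the child's father.
  Fathers = SELECT f FROM Start:c -(HAS_FATHER:e1)-> Person:f;
  // Second hop: the father's father, i.e. the grandfather.
  Grandfathers = SELECT g FROM Fathers:f -(HAS_FATHER:e2)-> Person:g;
  PRINT Grandfathers;
}
INSTALL QUERY infer_grandfather
""")
```

Once installed, the output of a query like this can itself be emitted as a derived feature for a downstream model, which is the pattern described above.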
[00:10:09] Unknown:
As you're working with end users and potential customers of TigerGraph, what are some of the points of confusion, misconceptions, or misinformation that you encounter, and what kinds of education do you find necessary to help them make the best use of the capabilities that something like TigerGraph offers, and to shift their overall design process to incorporate graph algorithms and graph concepts into the way that they're approaching a problem?
[00:10:43] Unknown:
Yeah. I suppose there are some design differences when you're designing the graph. There might be things that you want to pull out. For example, if we talk about patients again: you could store the attribute of their sex on the patient vertex. However, if you're going to be doing a lot of searches on that particular attribute, you might want to break it out into its own vertex type. Then, when you're accessing the graph, let's say you have 50 million patients, you don't have to read through every single patient in the database. You can start with, let's say, female, traverse the edges that are associated with that attribute called female, and find the patients. So instead of reading through the whole database, you navigate the edges to get to the related elements, which is probably not a concept that is regularly thought about in traditional database design.
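As a rough sketch of that modeling choice, assuming a hypothetical patient graph and the pyTigerGraph client: the sex value becomes its own vertex type connected to patients, so a search seeds at one Sex vertex and fans out rather than scanning every patient.

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    username="tigergraph",
    password="password",
)

# Instead of keeping sex only as an attribute on Patient (which forces a scan
# to filter on it), break it out as its own vertex type. A query can then seed
# at the single "female" Sex vertex and traverse HAS_SEX edges to reach only
# the matching patients. All type names here are hypothetical.
conn.gsql("""
CREATE VERTEX Patient (PRIMARY_ID patient_id STRING, name STRING, age INT)
CREATE VERTEX Sex (PRIMARY_ID value STRING)
CREATE UNDIRECTED EDGE HAS_SEX (FROM Patient, TO Sex)
CREATE GRAPH PatientGraph (Patient, Sex, HAS_SEX)
""")
```

With 50 million patients, that turns a full attribute scan into a single-vertex lookup plus an edge fan-out.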
[00:11:47] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. On the website, the main tagline says that TigerGraph is the only scalable graph database for the enterprise. And I'm wondering if you can just unpack that a bit and explain what pieces are necessary for a graph database to be suitable for enterprise use, and some of the capabilities that are required as you scale up on usage and organizational complexity.
[00:12:45] Unknown:
The biggest, most important thing for me as a customer was the ability to do deep-link analytics, doing the traversals, to be able to find insights in a relatively short amount of time. Security is also one of the most important aspects from an enterprise perspective. How can you define security down to not only the graph level but the vertex level, and then have different user-group security based on your background and why you're accessing what data? So I'd say from a security perspective, TigerGraph has had a major focus there. But more important is the way that the database itself was built. It was built in C++. It was built to be massively parallel processing. It was built to do computations at the vertex level.
It was built to scale horizontally. So one of the criteria, when we were looking at the existing graph market back when I was at Optum in UHG, was what can scale beyond just vertical scaling, what can scale horizontally. As our data grows, we can also grow our cluster. When you're scaling horizontally, there are a lot of complications that can arise, especially if your solution wasn't designed to scale horizontally. For example, you could have many different databases with the data residing in each separate cluster, and then as an end user, you need to identify how to get that data out, and you have to write a query that goes to a specific machine to pull that data out. I think one thing that TigerGraph did well was designing it with the end user in mind to simplify all that. So instead of having to understand exactly where the data is, how to get access to the data, and then writing queries in different ways,
[00:14:30] Unknown:
you can essentially write a query, and all of that is handled by the TigerGraph platform itself. Yeah. And I know that with graph databases in particular, being able to scale horizontally can be challenging because you need to understand where and how to partition the graph, particularly if you have super nodes. And I'm wondering how much of that is exposed to the end user and how much is able to be pushed down into the core engine, so that the end user can just write their graph data and not have to think about what the partition structures are going to look like as you scale out horizontally, and how much of that you need to do in terms of upfront design and how much you're able to
[00:15:10] Unknown:
just defer to the core engine to handle for you. As an end user, that was something that was very nice. I didn't have to worry about how to partition the data, where the data resides, or on what node across what machine the data is persisting. When I write an algorithm or a query, how to access that data, that was all removed for me as an end user. So as an end user, I could just focus on what I need to understand about the data itself. I don't have to worry about the complexities under the hood. There is the ability to go in there and make changes to some of that logic as well, but as a user, you don't have to understand all the complexities under the hood of TigerGraph.
[00:15:49] Unknown:
In terms of the modeling aspect, with relational databases, engineers are used to being able to start with a particular structure and then create a migration to add or modify tables or change columns, etcetera. And I'm wondering what the equivalent process looks like when you're designing a graph structure, where maybe you start off with a certain core set of objects that you want to model. So in the patient example that we've been using, I want to be able to model a person that has attributes of name, age, gender, and geography, and then maybe I also want to add in another core object of a medical care facility, which is going to have its own attributes of geographic location, staff, and the sorts of facilities that are available to it. So maybe this one has an X-ray department, whereas this one has a radiology department. Just how are you able to mutate and modify and expand the graph structures as you dig deeper into a problem space?
[00:16:52] Unknown:
One of the features of TigerGraph is the schema change job. Making changes in your graph solution means writing a simple block of code that goes in and alters the graph. So as an end user, you don't have to do much other than say, hey, I have this new use case, I have this new data, I have this new way of looking at the data itself, and I need to reimagine the model. You don't have to drop all the data, only the data that's related to the modification or change. It's very easy for the user to go in there and create a modification with very little cost as far as maintaining the solution.
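Here is what that might look like for the host's earlier example of adding a care facility: a hedged sketch of a GSQL schema change job, reusing the hypothetical PatientGraph from the earlier sketch.

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    username="tigergraph",
    password="password",
)

# A schema change job adds the new vertex and edge types in place; existing
# Patient data stays put. All names here are hypothetical.
conn.gsql("""
USE GRAPH PatientGraph
CREATE SCHEMA_CHANGE JOB add_facility FOR GRAPH PatientGraph {
  ADD VERTEX Facility (PRIMARY_ID facility_id STRING, name STRING, location STRING);
  ADD UNDIRECTED EDGE TREATED_AT (FROM Patient, TO Facility);
}
RUN SCHEMA_CHANGE JOB add_facility
""")
```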
[00:17:32] Unknown:
In terms of the ecosystem that exists around TigerGraph specifically, but also graph problems in general, I'm curious what you've seen in the way of tooling or educational material that's available. For example, application development frameworks typically ship with an object relational mapper for translating between your program logic and the relational database engine. I'm wondering what you've seen in similar cases for graph engines, for being able to incorporate them into application designs. I know that the machine learning community has been investing a lot in building graph algorithms into their different machine learning and deep learning libraries, and I'm curious what you've seen as far as integrated support for working with something like TigerGraph as the storage and computation engine for powering those machine learning applications.
[00:18:28] Unknown:
Yeah. So we have talked about an ORM in our community quite a bit, and about creating one; we just haven't gotten to the point of building one. I think one of the biggest enablers for people that are building on top of TigerGraph is that every single query or piece of logic that you write inside your graph solution is compiled down into a REST endpoint. So you can instantly call the logic you wrote. If you have input parameters, you can pass those in. In some cases, you might have one query that has a subquery of some function, and you can call the subfunctions and then retrieve the information. We have a GraphQL connector as well.
We have different connectors based on the tech stack that you're using. I would say the most important part is the focus that TigerGraph has had on the REST services in their product offering. There are a lot of different things that you can do: if you want to retrieve the metadata, you can call a REST endpoint. If you want to upsert data, you can call a REST endpoint. If you want to call a query, of course, you can call a REST endpoint. There are a lot of different actions that you can interface with from a REST perspective, which makes it really easy to integrate with.
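As one example of what that REST surface looks like from client code, here is a brief sketch using the community pyTigerGraph library, which wraps those endpoints; the graph, vertex, and query names are the hypothetical ones from the earlier sketches.

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    graphname="PatientGraph",
    username="tigergraph",
    password="password",
)
# Depending on the deployment, you may need an auth token first,
# e.g. conn.getToken(<secret>).

# Each call below wraps one of the REST endpoints described above.
print(conn.getSchema())  # retrieve the metadata

# Upsert data: create or update a Patient vertex by primary id.
conn.upsertVertex("Patient", "p001", {"name": "Jane", "age": 39})

# Call an installed query through its generated REST endpoint
# ("similar_patients" is a hypothetical installed query).
results = conn.runInstalledQuery("similar_patients", params={"p": "p001"})
print(results)
```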
[00:19:34] Unknown:
Another question that comes up a lot when you're dealing with specialized storage engines is the question of polyglot persistence versus using a particular engine as the system of record. And I'm wondering what you see as the decision path that people go through when they're figuring out: do I want TigerGraph to be my system of record, where I'm going to interact primarily with my data, versus using TigerGraph as an optimization for a specific subset of problems, where maybe they're using a Postgres or a MySQL as the system of record for the application, and then they're either replicating some of that data or storing a subset of their data into TigerGraph to optimize for those graph algorithms in the cases where that makes sense?
[00:20:21] Unknown:
I think it's a little bit of both. I see both use cases. It depends on what it is that they're trying to do. So, for example, if you have an application that needs to pull together information from many different sources in real time, what you can do is set up streaming from the downstream source systems to TigerGraph and pull that data together in real time. As soon as it's entered into TigerGraph, all the relationships are built, and then you're traversing the graph to extract that information.
[00:20:51] Unknown:
So when end users are deciding that they want to incorporate a graph system as part of their overall architecture and set of capabilities, I'm curious what you have seen as some of the overall system architectures or supporting systems that they will build to work alongside TigerGraph, either for feeding data into it or for querying from it, and some of the types of use cases and applications that are built on top of it.
[00:21:24] Unknown:
Yeah. Architecture-wise, the majority of the use cases use Kafka to stream data in directly. There are some nontraditional data sources that drop files into a certain zone, where you pull and extract that data to import into TigerGraph. But I would say streaming through Kafka is the primary way. As far as the rest of the architecture, in at least some of our use cases, we used Jenkins as the orchestration service. We built our own Gradle plugin to execute scripts in certain ways and to create logic within the code itself to reproduce the graph-based solution that we wrote. So as soon as we created different scripts for the healthcare solution, we would send them to GitHub, and that would trigger a bunch of different jobs that would run through our unit tests and then deploy to a test server.
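For the file-drop ingestion path described above, a declarative GSQL loading job maps file columns onto the schema; here is a hedged sketch against the hypothetical PatientGraph (Kafka loaders are declared with a similar job definition).

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    graphname="PatientGraph",
    username="tigergraph",
    password="password",
)

# Declare a loading job that maps CSV columns onto the Patient vertex.
conn.gsql("""
USE GRAPH PatientGraph
CREATE LOADING JOB load_patients FOR GRAPH PatientGraph {
  DEFINE FILENAME patient_file;
  LOAD patient_file
    TO VERTEX Patient VALUES ($"patient_id", $"name", $"age")
    USING header="true", separator=",";
}
""")

# Push a local file through the job via the REST API.
conn.runLoadingJobWithFile("patients.csv", "patient_file", "load_patients")
```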
As far as monitoring, ELK is a stack that TigerGraph supports out of the box; there's a Filebeat configuration file that's generated for your logging. As an end user, you want to understand your queries and optimize them, and understand the memory that's taken up when you're running them. So oftentimes you're going to look at the different logs through either the ELK stack or Datadog, which was another platform that we had used as well.
[00:22:43] Unknown:
In terms of the community aspect, which is where you've been spending a lot of your time and which was sort of your prime directive when you joined the company, I'm curious what you have seen as far as overall feedback, and some of the growth in interest and investment by the community in building up different tooling and design patterns around TigerGraph.
[00:23:08] Unknown:
When developers are able to contribute and put their time into building something open source, it's not something that we take for granted, and so we want to highlight and encourage the community and help them out with building whatever it is, from a syntax highlighter, to maybe an ELK stack configuration, to a full stack example, to how to use Plotly Dash and integrate it into your project as a data scientist. So we see a lot of activity where developers are building different applications, different tooling, different ETL tools for pulling data out of some XYZ system and pushing it in. One example of that is Node-RED, which is a nontraditional orchestration service; it's really designed and built for the Internet of Things. But when you put a node in there, you can pull data from, say, Twitter, send it to AWS's natural language processing solutions, and then put it into TigerGraph.
Those are some of the things that the community is actively building: tooling around TigerGraph. From a use case perspective, knowledge graph systems are really popular. So we might have some source data. For example, during COVID-19, there were a bunch of articles published around COVID itself. One of the challenges is that you have a bunch of text; how do you want to process it? Maybe run it through a machine learning model to do entity extraction, then model those entities inside of TigerGraph, and then traverse the graph to find and derive different papers that are related to certain topics.
Another big thing around that is concept maps. If you have related concepts, even though the entity extracted is a certain word, there might be correlated words with that entity. So we see a lot of things around knowledge graphs. We also see a lot of use cases around supply chain. Say you want to track a part being created and then follow that part all the way to the end product, which is a car. What happens if a boat gets stuck in a canal and disrupts your whole supply chain? What happens if this huge global crisis that's going on affects your supply chain? So we're seeing a lot of different supply chain use cases. Other use cases, because you can do deep-link traversals, are in the crypto space. Cryptos are pretty popular right now, and so we see a lot of people in the community building transaction tracing, where they can look at address to address to address and do deep-link tracking. There are also things around geospatial analysis and calculating distances. If you think about a map, with each road intersection as a vertex and each road as an edge, there's a lot you can do with logistics, maps, and geolocation.
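For the road-network idea, the usual shape is to call a shortest-path algorithm from the GSQL Graph Data Science Library as an installed query. The sketch below assumes the library's single-source shortest path query has been installed; the algorithm name, the parameter names, and the Intersection/ROAD_SEGMENT schema are all assumptions to verify against the installed version.

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    graphname="RoadGraph",                    # hypothetical road-network graph
    username="tigergraph",
    password="password",
)

# Intersections as vertices, road segments as weighted edges, then a
# single-source shortest path. Query and parameter names are assumptions
# based on the GSQL Graph Data Science Library; check your install.
results = conn.runInstalledQuery(
    "tg_shortest_ss_pos_wt",
    params={
        "source": ("intersection_42", "Intersection"),  # (vertex id, vertex type)
        "v_type": "Intersection",
        "e_type": "ROAD_SEGMENT",
        "wt_attr": "distance",
    },
)
print(results)
```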
[00:25:58] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. Another interesting aspect of any data storage system, but particularly one that is for a specialized data model, is the question of being able to build up data sharing and collaboration communities, which is an area that companies like Snowflake have been investing in. And I'm wondering what kinds of capabilities TigerGraph has for being able to build, share, expose, and collaborate on different data structures to grow community in that way. To be able to say: here's the dataset that I've built based on translating OpenStreetMap data into a set of vertices and edges that consists of all of the road networks in the continental United States, for example, and then being able to hand it off. And okay, somebody else has actually done that for Canada, so now we've got an interconnect, and now we want to be able to ask: what is the optimal path to get from, say, Dallas to Ottawa?
[00:27:54] Unknown:
What are the companies doing together to achieve this? Right now, I think the biggest difference between the graph space and the more traditional relational space is that in the relational world you have SQL, a standardized language in which you can communicate with all your database solutions. One thing that's happening right now is standardization. The same committee that got together to create the SQL standard is now creating a new standard called GQL. Once that comes out, I believe in the next year and a half to two years, there will be more of a standardized language and a standardized way to interact with database solutions. I think once we have that standardized approach of interacting with databases, we'll have more shared resources between different companies.
Now, there are a couple of different committees out there. There's the LDBC, the Linked Data Benchmark Council, which is looking at standardizing how you measure the speed, maybe not the accuracy, but the speed, at which you can derive information based off certain questions that you want to ask your data. Right now they basically have a social media dataset that all of the graph database companies can use as a standardized way to measure the speed at which you can answer the questions it asks of the social media data. They're also looking at adding additional data sources. One of the ones we're contributing as a shared data source is a Synthea-based medical graph.
I come from the healthcare space again, and one of the things is that you don't want to give everybody healthcare data to explore. So we are using a synthetic healthcare dataset, modeling it, and providing ways to look at that data just as you would if you were a healthcare provider. There are other synthetic-based solutions, but I think that's somewhere all the different companies could play nicely together: coming up with not only a standardized language for interfacing with the database, but also some Northwind-type examples of how you can actually look at or query the data.
[00:29:56] Unknown:
In terms of the applications that you've seen users build on top of TigerGraph, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:30:07] Unknown:
The most interesting to me, coming from the healthcare space, maybe not the most interesting to everybody else, was learning about customers doing energy grid management systems. They're recalibrating the energy grid in real time. I thought that was really neat: they're able to compute the power consumption of all the different transformers from this route to that route. To me that was a little bit nontraditional, but because it's an analytical engine, you're able to compute that while you're doing the traversal. So I think that was the most interesting use case that I ran into when I was exploring TigerGraph.
[00:30:41] Unknown:
In your experience of helping to build and grow the community around TigerGraph and working with the graph engine and in that whole ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:30:55] Unknown:
The most challenging thing is that it's a new language, right? When you have to learn a new language, it's a little bit tough. One thing that is really nice about the GSQL language of TigerGraph is that it follows a similar syntax to SQL. So if you're a SQL developer coming into a brand new language, on day one you can basically read the code itself. But I would say learning a new language is probably one of the toughest parts for any of the graph solutions or graph companies out there right now. I think that's where graph adoption will pick up once there's a standardized language. I also think one of the most interesting things coming into the graph space is learning how to design and how to traverse, and how the optimization of different actions within your query logic affects not just whether you achieve the results but how quickly. Just a small modification of your query could impact the performance a hundredfold. For example, if you have a for-each loop on the outside and your SELECT statement inside the loop, that could hinder performance, even though it does what you're trying to do, versus factoring it out as a function: you write a subquery, and then you're able to call that subquery as a function.
Being able to think about how to write queries has been very interesting, and query performance is something that is sometimes difficult for people to grasp. We do have a lot of great material out there about schema practices and best designs, but it does come up a lot that someone is doing something very inefficient as far as traversal, and there might be a lot of memory being consumed on that traversal, which can cause issues as well.
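To illustrate the loop-versus-set point, here is a hedged GSQL sketch (via pyTigerGraph, against the hypothetical PatientGraph with an assumed HAS_CLAIM edge and Claim vertex type): the first query re-traverses the graph once per patient inside a FOREACH loop, while the second does a single set-based traversal from all patients at once.

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    graphname="PatientGraph",
    username="tigergraph",
    password="password",
)

conn.gsql("""
USE GRAPH PatientGraph
// Slow shape: a SELECT inside a FOREACH -- one traversal per patient.
CREATE QUERY count_claims_slow() FOR GRAPH PatientGraph {
  SumAccum<INT> @@total;
  ListAccum<VERTEX<Patient>> @@patients;
  Seed = {Patient.*};
  Seed = SELECT p FROM Seed:p ACCUM @@patients += p;
  FOREACH p IN @@patients DO
    One = {p};
    C = SELECT c FROM One:s -(HAS_CLAIM:e)- Claim:c ACCUM @@total += 1;
  END;
  PRINT @@total;
}
// Fast shape: one set-based SELECT over all patients at once.
CREATE QUERY count_claims_fast() FOR GRAPH PatientGraph {
  SumAccum<INT> @@total;
  Seed = {Patient.*};
  C = SELECT c FROM Seed:s -(HAS_CLAIM:e)- Claim:c ACCUM @@total += 1;
  PRINT @@total;
}
""")
```

Both shapes produce the same count; the difference is how many times the engine has to enter the graph, which is the hundredfold effect described above.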
[00:32:41] Unknown:
On that question of query optimization, is there something analogous to the sort of EXPLAIN or ANALYZE query that will show you what the query plan is going to be based on the specific statement that you entered, so you can understand: okay, because I did this SELECT statement inside of a for-each loop, it's actually going to greatly impact the performance of my query versus if I were to do this as a function or a subquery?
[00:33:09] Unknown:
Yeah. Unfortunately, there isn't a query planner integrated into the product, so I don't have the exciting answer of "yes, we have that currently." There are some other features, though, including the visual query builder. Let's say you're not writing the query itself; you're just an analyst, and you're not very comfortable with the query language. What you can do is use the visual query builder, which is a no-code solution where you basically draw what you want to do. So it's not a query planner that will walk through execution, but what it will do is take what you're trying to derive out of your graph solution and then write an optimized query based off what you're trying to extract.
[00:33:50] Unknown:
And so what are the cases where TigerGraph is the wrong choice, and maybe people are better suited with just an abstraction layer over a relational dataset, or a different graph engine, or actually sticking with tabular datasets and not trying to get involved with graph and incorporate that into their problem?
[00:34:16] Unknown:
I think if you have a traditional system that's just transactional and you're just retrieving things and there are no connections, you should stay with what you're currently working with. But if you have highly interconnected data, you need to retrieve data across different source systems and pull it back in real time, and there's data constantly being streamed to your solution, so the data is dynamic and not just static; if you have those problems where you're running a bunch of joins, sort of step-by-step joins, trying to retrieve this data, I think that's when you want to use TigerGraph.
Of course, you could use it to read just a single vertex, but where it's really powerful is when it's combining a bunch of different data elements together, and you're trying to find insights inside of that data.
[00:35:02] Unknown:
In terms of the future direction of TigerGraph, what are some of the areas of focus for the near to medium term, any particular community engagements that you're excited to be involved with, or some of the things that people can look forward to in the months and years to come?
[00:35:21] Unknown:
Currently on the roadmap is, again, the elasticity of scaling: being able to scale up and scale down on deployment is really important. Right now we're also building out more data connectors and product-supported ecosystem components so you can integrate with core products. Integrating with different cloud service provider tooling is also really important right now. The other exciting thing is that we're doing a $1,000,000 challenge. I think that's a pretty exciting concept, in which we're asking the participants of the challenge to come with their background, their skills, and their domain knowledge, and to solve some of the toughest challenges.
That's actually wrapping up in April, and the winners will be announced in May. The $1,000,000 challenge is really about making the world a better place, and we're using it not only to activate our community, but to engage the community and drive brand awareness. And we plan to do many other activities in the future, including other hackathons and other challenges that will get the community excited, get developers excited, and get developers to understand how to use the technology.
[00:36:30] Unknown:
Are there any other aspects of the work that you're doing at TigerGraph or the overall space of graph applications and graph storage that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think something we probably didn't touch upon
[00:36:45] Unknown:
is the language itself. So GSQL is specifically designed to be Turing complete, which means you can put complex logic inside your query solutions. What that also means is that, for our graph data science library, if you want to optimize, create, change, or modify an algorithm, you can do that within the language itself.
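As a small taste of that expressiveness, here is a hedged sketch of a bounded breadth-first search written as a plain GSQL query: per-vertex accumulators, a WHILE loop, and edge types passed in as a parameter, which is the general shape many of the data science library algorithms take. The schema is the same hypothetical one as in the earlier sketches.

```python
import pyTigerGraph as tg

conn = tg.TigerGraphConnection(
    host="https://my-instance.i.tgcloud.io",  # hypothetical instance
    graphname="PatientGraph",
    username="tigergraph",
    password="password",
)

conn.gsql("""
USE GRAPH PatientGraph
CREATE QUERY bfs_levels(VERTEX src, SET<STRING> e_types, INT max_depth)
FOR GRAPH PatientGraph {
  OrAccum @visited;       // per-vertex flag: has this vertex been reached?
  SumAccum<INT> @level;   // per-vertex BFS depth
  INT depth = 0;

  Frontier = {src};
  Frontier = SELECT s FROM Frontier:s ACCUM s.@visited += true;

  // Expand the frontier one hop at a time, up to max_depth levels.
  WHILE Frontier.size() > 0 LIMIT max_depth DO
    depth = depth + 1;
    Frontier = SELECT t FROM Frontier:s -(e_types:e)- :t
               WHERE t.@visited == false
               ACCUM t.@visited += true
               POST-ACCUM t.@level = depth;
  END;
  PRINT depth;
}
""")
```

Because the whole algorithm lives in the language, tweaking it (weighting, early exit, extra per-vertex state) is a matter of editing the query rather than working around a fixed built-in.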
[00:37:10] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:37:24] Unknown:
The biggest gap, from an enterprise perspective, was metadata management. Every source system is different, and the types of data are different. Date-time formats are the most annoying thing I ever ran into; everyone has a different date-time format. I think having more standardization around the data itself, and maybe around the logic of the data, would help the end users who are building solutions on top of these different data sources. That was the most challenging thing I ran into from a data management perspective: the different data types. So I think that's one area where I'd love to see some really interesting, creative solutions. And one day, hopefully, there will be no data cleaning. There'll just be magic, and data will be cleaned.
One day, there'll be consistency. One day, types will just be normalized. One day, everything will be done for you automatically, so you can just focus on the analytical
[00:38:28] Unknown:
insights that you're working on, versus always being in the data. Thank you very much for taking the time today to join me and share the work that you've been doing at TigerGraph. It's definitely a very interesting product and problem space, and it's great to see different companies investing in graph analytics capabilities, because it's absolutely necessary as we scale up the volumes and types of data we're working with, given the natural interconnectedness that exists in the universe. So I appreciate all of the time and energy that you've been putting into helping grow the community and ecosystem around that, and I hope you enjoy the rest of your day. Thank you so much.
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Jon Herke and His Background
Journey into Data and Personal Motivation
Role at TigerGraph and Community Building
Evolution of TigerGraph and Product Features
Misconceptions and Design Considerations in Graph Databases
Enterprise Scalability and Security in TigerGraph
Modeling and Modifying Graph Structures
Tooling and Integration for Machine Learning
System Architectures and Use Cases
Community Contributions and Use Cases
Standardization and Data Sharing in Graph Databases
Interesting Applications of TigerGraph
Challenges and Lessons Learned
When to Use TigerGraph
Future Directions and Community Engagement
GSQL Language and Graph Data Science
Biggest Gap in Data Management Tools
Closing Remarks and Thank You