StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar - Episode 132

Summary

There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants that has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real-time messaging systems with built-in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well-engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.

Tidy Data is a monitoring platform to help you monitor your data pipeline. Custom in-house solutions are costly, laborious, and fragile. Replacing them with Tidy Data’s consistent managed data ops platform will solve these issues. Monitor your data pipeline like you monitor your website. It’s like pingdom for data. No credit card required to sign up. Go to dataengineeringpodcast.com/tidydata today to get started with their free tier.


Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real-time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
  • Your host is Tobias Macey and today I’m interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what Pulsar is?
    • How did you get involved with the project?
  • What is Pulsar’s role in the lifecycle of data and where does it fit in the overall ecosystem of data tools?
  • How has the Pulsar project evolved or changed over the past 2 years?
    • How has the overall state of the ecosystem influenced the direction that Pulsar has taken?
  • One of the critical elements in the success of a piece of technology is the ecosystem that grows around it. How has the community responded to Pulsar, and what are some of the barriers to adoption?
    • How are you and other project leaders addressing those barriers?
  • You were a co-founder at Streamlio, which was built on top of Pulsar, and now you have founded StreamNative to offer Pulsar as a service. What did you learn from your time at Streamlio that has been most helpful in your current endeavor?
    • How would you characterize your relationship with the project and community in each role?
  • What motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general?
    • Why is streaming data such an important capability?
    • How have projects such as Kafka and Pulsar impacted the broader software and data landscape?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen Pulsar used?
  • When is Pulsar the wrong choice?
  • What do you have planned for the future of StreamNative?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real-time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required. Your host is Tobias Macey, and today I'm interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative. So Sijie, can you start by introducing yourself?
Sijie Guo
0:01:30
Hi, everyone. Thank you for having me on the Data Engineering Podcast. My name is Sijie Guo, and I'm currently the CEO and co-founder of StreamNative. StreamNative is a San Francisco-based startup, and we provide a cloud-native event streaming platform powered by Pulsar. We also offer a fully managed Pulsar service on different public clouds, and the managed service can run either in our cloud account or in the customer's account. So yeah, thank you for having me here.
Tobias Macey
0:02:04
And do you remember how you first got involved in the area of data management?
Sijie Guo
0:02:07
Yeah, so I started my journey working on distributed cluster file systems. About 10 years ago, Hadoop was gaining traction in China, and I was among the first set of contributors there who contributed to Hadoop, HBase, and Hive. I was part of the initial team that built the Tencent data warehouse based on Hive, and Tencent is one of the largest internet companies in China. After working on the Tencent data warehouse, I moved my career to Yahoo, and that's where I got involved in a lot of the development of BookKeeper and later on Pulsar. That got me into the whole messaging and streaming space.
Tobias Macey
0:02:59
And the separation of the storage from the broker in Pulsar is definitely one of the things that I find most interesting about it from the architectural perspective. I know that BookKeeper is being used for a number of other systems as well. For people who are interested in more of the background and early days of Pulsar and some of the architectural principles, I did interview a couple of the other core committers to the project a couple of years ago, so we'll put a link to that in the show notes. And for anybody who hasn't listened to that, can you just give a bit more of an overview about what Pulsar is and how you first got involved with the project?
Sijie Guo
0:03:34
So Pulsar, we usually use one sentence to describe what Pulsar is. We usually say it's a flexible pub/sub messaging system that is backed by durable log storage. The first part of the sentence describes what kind of capability is provided by Pulsar: it's a pub/sub messaging system, so you can use Pulsar as a normal messaging system, like you would use Kafka, RabbitMQ, or ActiveMQ. The second half of the sentence tells how Pulsar is different from many other messaging systems: it's backed by durable log storage, and that durable log storage is the BookKeeper project you mentioned. I was one of the first engineers involved in the BookKeeper project. BookKeeper was originally started at Yahoo Research, and it was designed to address the high availability issue of the HDFS NameNode. The core replication mechanism was abstracted out of the distributed consensus algorithm used by ZooKeeper, and then it evolved into distributed log storage that you can use for building many different systems. At that point, I think maybe 10 years ago, we tried to build the first pub/sub messaging system based on ZooKeeper, which was called Hedwig. That project has since been deprecated, but it set the foundation for the whole architecture of Pulsar and of many other followers in this space, which is separating the broker serving layer from the message storage. So you have two separate layers that you can scale independently, which also improves high availability and failover time. I actually wrote a bunch of articles a few years ago about the advantages of this layered architecture and segment-centric storage, so feel free to check out those articles on the internet.
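The layered design described in this answer can be pictured with a small toy model. This is only a sketch under stated assumptions: the class names `StorageLayer` and `ServingLayer`, the round-robin placement, and the node names are all hypothetical and are not Pulsar or BookKeeper code. It only illustrates why keeping brokers separate from segment storage lets each layer change on its own.

```python
from dataclasses import dataclass, field


@dataclass
class StorageLayer:
    """Toy storage layer: each topic's log is split into segments placed on nodes."""
    nodes: list
    replicas: int = 2
    placement: dict = field(default_factory=dict)  # (topic, segment_id) -> node names
    cursor: int = 0

    def add_segment(self, topic, segment_id):
        # Hypothetical policy: pick `replicas` nodes round-robin for the new segment.
        chosen = [self.nodes[(self.cursor + i) % len(self.nodes)]
                  for i in range(self.replicas)]
        self.cursor += 1
        self.placement[(topic, segment_id)] = chosen
        return chosen


@dataclass
class ServingLayer:
    """Toy serving layer: brokers only own topics, they hold no data."""
    brokers: list
    owner: dict = field(default_factory=dict)  # topic -> broker name

    def assign(self, topic):
        self.owner[topic] = self.brokers[len(self.owner) % len(self.brokers)]
        return self.owner[topic]

    def fail(self, broker):
        # Failover only moves ownership; stored segments are untouched.
        self.brokers.remove(broker)
        for topic, current in self.owner.items():
            if current == broker:
                self.owner[topic] = self.brokers[0]


storage = StorageLayer(nodes=["bookie-1", "bookie-2"])
serving = ServingLayer(brokers=["broker-a", "broker-b"])

serving.assign("orders")
storage.add_segment("orders", 0)

# Scale storage: old segments stay where they are, new segments can use the new node.
storage.nodes.append("bookie-3")
storage.add_segment("orders", 1)

# Lose a broker: ownership moves, no data is copied.
serving.fail("broker-a")
```

In this toy, adding `bookie-3` never rewrites the placement of segment 0, and failing `broker-a` changes only the ownership table, which is the independence between layers that the answer describes.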
Tobias Macey
0:06:00
And in terms of the overall lifecycle of data, where does Pulsar fit in the overall ecosystem of the different data tools? I know that it is sometimes compared to Kafka, and it may also be used in conjunction with or instead of things like Spark Streaming or Flink. I'm wondering if you could just give a bit more of a picture of the different ways that Pulsar is being used and some of the use cases that it's optimized for.
Sijie Guo
0:06:25
So I think to get started I will try to clarify a bit about what capabilities Pulsar provides. As I said, Pulsar was originally a flexible pub/sub messaging system, so it offers all the capabilities of a messaging system. But after Pulsar was incubated in the Apache Software Foundation for about two years, it evolved into more of a messaging plus streaming system, what we usually call a cloud-native event streaming platform. What does that mean? The core abstraction within Pulsar is a distributed log. It's event streams, and it can be used for storing infinite streams of events. So with Pulsar you're able to ingest events into topics, you're able to keep the events for a longer duration based on your retention policy, and you're able to use different data processing tools: you can integrate with Spark and Flink to do unified data processing, and you can use Presto or Hive to do interactive queries. We also introduced Pulsar Functions to do lightweight computation. With that being said, on the role in the whole ecosystem: first, since it provides the ingestion capability for people to ingest data into Pulsar, you can use it as a messaging system to connect the services within your whole data infrastructure, so it becomes a kind of integration platform. And since we provide the capability for storing events for a longer duration, you can use it as stream storage. In my opinion, it has evolved to become a kind of streaming database, because it provides schema, so you can treat those event streams as structured event streams. And when we did the integration with Flink, we actually mapped these topics into tables in the Flink catalog, so you're able to use those data processing engines to query and process the data.
So in short, the summary is that it's a messaging platform that you can use for data ingestion, and I would say it's stream storage that you can use for data processing. So that is kind of the idea.
Tobias Macey
0:08:59
The functions capability that's built into the Pulsar project is definitely one of the things that I think makes it stand out in terms of the ways that it can be used, because I know that, for instance, Kafka has support for Kafka Streams and the Kafka Connect plugins, for which Pulsar has Pulsar IO as its analog, but it seems that the functions capability is a bit more tightly integrated into the capabilities of Pulsar. So I'm wondering if you could talk a bit to that and some of the other capabilities of Pulsar that make it stand out from some of the other options that people might consider for this durable pub/sub use case.
Sijie Guo
0:09:40
Yeah, so the Pulsar project has evolved and changed over the past two years, and functions is definitely one of the most attractive features that a lot of people love to use. Functions is basically a very lightweight computing, I would say event processing, framework that brings the whole serverless idea into event streaming. You can write event processing logic using the language you like: if you are a Java developer you can write functions in Java, and you can write functions using Python or using the Go language. You can write the functions as you like, and you don't need to learn a new framework. For every engineer, the first thing you know is how to write a function, so this reduces the barrier for people who want to add processing capability to an existing pub/sub messaging system. One of the reasons is that, I would say, about 50% of the workload that a messaging system is used for is connecting services within an infrastructure. To provide an easy way for people to write that logic, functions are definitely the simplest way, because you don't have extra dependencies; you can just write a function as you want and submit it. That is a bit different from a traditional data processing engine, in that we are focusing more on lightweight computing use cases such as ETL, transformation, routing, prediction, and maybe simple aggregation. So that is functions. Besides functions, Pulsar has added many features in the past two years, and I can share some of them. Another one is tiered storage. Tiered storage provides the ability to extend the core storage capability provided by BookKeeper into much cheaper storage systems like S3, GCS, Azure Blob Storage, or even HDFS on-prem. This allows you to keep the data in the system in the form of an infinite event stream.
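The "just write a function" idea from this answer can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's own example: the event format (a JSON string with `device` and `temp_c` fields) is an assumption made up for the sketch, and the code is only exercised locally here rather than deployed to a cluster.

```python
import json


def process(input):
    """Transform one event: parse a JSON reading and add a Fahrenheit field."""
    reading = json.loads(input)
    # Lightweight per-event transformation, the kind of ETL step described above.
    reading["temp_f"] = round(reading["temp_c"] * 9 / 5 + 32, 1)
    return json.dumps(reading, sort_keys=True)


# Local dry run over two hypothetical events, one call per message.
events = [
    '{"device": "d1", "temp_c": 21.5}',
    '{"device": "d2", "temp_c": 30.0}',
]
results = [process(event) for event in events]
```

The point of the shape is that the logic has no framework dependency: it takes one message in and returns one message out, so it can be tested as an ordinary function before it is wired to any topic.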
So you don't need to dump the data out of your messaging system and into some other storage format. And by providing tiered storage, you are able to keep data for a much longer duration, which provides a unified abstraction of your data, called the infinite event stream. When you integrate this data model with Flink, you can create a unified data processing stack, and that is the whole idea behind it. I can call out some other features, like Key_Shared subscription, which is an interesting one, and also the protocol handler, which allows Pulsar to plug in different messaging protocols. Those kinds of features are driven by use cases and by adoption in the community.
Tobias Macey
0:13:02
And what are some of the other characteristics of the community that has grown up around Pulsar that you would see as being distinct from some of the other streaming systems that are being used by people?
Sijie Guo
0:13:18
So in the past few years, Pulsar has been community driven and use case driven, and that has been very successful. Most of the adoption of Pulsar comes from three main categories. One is existing RabbitMQ and ActiveMQ users. That comes more from building out applications, and it drives a lot of development of messaging-oriented features like TTL, dead letter topics, and scheduled or delayed messages. Those are features more commonly seen in traditional message queuing systems. The second category is driven more by data processing use cases, like integrating with Flink and integrating with Spark, and that introduced a lot of features like Key_Shared subscription, tiered storage, and offloading, to provide an efficient way for your data processing engine to process the events within Pulsar. The third category comes from IoT use cases, and I would say that led to the creation of Pulsar Functions. It drives a lot of development around bringing serverless or lightweight computing features into Pulsar. And that's how the community, the whole Pulsar team and the Pulsar PMC, materializes the whole project as a product.
Tobias Macey
0:14:51
And over the past two years, I know that some of the features that have been incorporated since the last time I talked about this project on the podcast are things like the functions workers, and I believe that the integrated SQL layer is new as well. I'm wondering if you can just talk about how the overall growth of the data ecosystem and the focus on streaming as a core architectural principle of these systems has influenced some of the product direction and feature decisions that have gone into Pulsar in that time period.
Sijie Guo
0:15:23
Are you asking about the kind of streaming capabilities that have influenced the overall functionality development of Pulsar?
Tobias Macey
0:15:30
Yeah, just sort of how some of the recent trends in the overall data industry have influenced the decisions around Pulsar and the direction that it's taken in the past two years since I last had it on the podcast.
Sijie Guo
0:15:42
We have observed two trends when helping people adopt Pulsar. One trend is happening more in the data processing area, especially with the rise in adoption of Flink, as well as Spark, which are able to do both stream and batch processing. We find that the increase in use cases like machine learning and deep learning creates a challenge for the existing data processing stack: you need a processing engine that is able to do both batch and stream processing, because those use cases need not only the historical data but also the real-time data, and they need to combine both historical and real-time data in one data processing engine. Flink and Spark already do a great job of providing an abstract API and a unified processing engine, but there's a lack of a data management system that is able to provide a unified data system for those engines to efficiently process that data. Because the core abstraction provided by Pulsar is an infinite event stream, that led us to the creation of things like tiered storage, which is able to support this unified data processing stack. So that is the first trend. The second trend that we have observed is that with the rise of IoT use cases and connected cars, you will see a lot of edge data centers and many devices. The events or data from those devices are collected at the edge, but the edge doesn't have enough resources for people to process those events. Hence you need to provide a lightweight computing engine so that people can easily write functions to process those events at the edge. So these edge-oriented, IoT-oriented use cases have become the main driving factor in the creation and also the adoption of Pulsar Functions.
Tobias Macey
0:18:05
One of the critical elements of the success of any piece of technology, particularly open source, is the rate of adoption by users and the overall ecosystem that grows up around it. I'm wondering if you can talk a bit about how the user community has responded to Pulsar, some of the barriers to adoption that have existed, and the work being done to drive those down.
Sijie Guo
0:18:28
So Pulsar graduated around late 2018, and 2019 has been a very wonderful year for Pulsar. Just a couple of metrics: the number of stars has already doubled, we have seen the users in the Slack channel grow from about 500 to close to 1,700 right now, and we see contributors going from around 70 to about 250 right now. So from different metrics we see the community has doubled or even tripled. On the adoption side, we have seen crazy adoption in 2019, and we see this happening in Asia, North America, and Europe. In Asia, one of the largest internet companies, Tencent, is going all in on Pulsar. Basically, the whole billing platform is now built on Pulsar. What that means is every transaction, every purchase that happens in Tencent's products goes to Pulsar first, and it has been processing tens of billions of transactions every day. In North America, we also see Pulsar being adopted in different industries, and we have a whole "Powered By" page for people to check out. We also did a user survey, and the PMC recently published an end-of-2019 survey report to disclose the current state of adoption, how people use Pulsar, and what their plans are to grow Pulsar usage in the coming year. So feel free to check that out. It's available on the Pulsar website, or you can go to the StreamNative website to download the user report.
Tobias Macey
0:20:33
And I know that the Kafka ecosystem has grown up quite a bit because of the fact that it was one of the first movers in this space. So a lot of the existing systems that might integrate with a streaming system already have capabilities for working with Kafka. One of the projects that you and some of your collaborators rolled out recently is an implementation of the Kafka protocol running on top of Pulsar, so I'm wondering if you can talk a bit about how that's implemented, how that fits into the overall architecture of Pulsar itself, and what you think are going to be some of the benefits of that to the Pulsar community.
Sijie Guo
0:21:10
Yeah, I think that is an interesting question. As I mentioned in the previous question, we had a very wonderful 2019, but there are still some barriers for people adopting Pulsar, because there are already existing messaging systems like Kafka, as you mentioned, as well as RabbitMQ and ActiveMQ, which are written against standard messaging protocols like AMQP. So we have been thinking about how to reduce the barrier for people to use Pulsar and enjoy all the features provided by Pulsar, like multi-tenancy, tiered storage, and functions. The first attempt we made, which was also tried by OVHcloud, was implementing a proxy. That is what people commonly do when they want to adapt a new system to an existing system: they write a proxy and some logic to translate the wire frames from one messaging protocol to the other. But we found that is not a natural way to do it, and there's a bunch of overhead and challenges. So we stepped back and thought about what the real value provided by Pulsar is. As I mentioned before, Pulsar is actually event stream storage. The core abstraction provided by Pulsar is an infinite event stream, in other words a distributed log. And Kafka is built around a similar abstraction; it is also a distributed log. So we found there's a lot of similarity between Pulsar and Kafka, and we thought maybe what we should do is make Pulsar a reliable and scalable event stream storage and allow developers to customize their own messaging protocols. First, this would help people create adapters to fit into the existing messaging ecosystem.
Second, it would allow developers to innovate on messaging protocols by leveraging the fundamental advantages provided by Pulsar. So we introduced a framework within Pulsar called the protocol handler. The protocol handler provides a way for an implementation of a messaging protocol to interact with the event stream storage of Pulsar, and this led to the creation of Kafka-on-Pulsar, which uses the protocol handler framework to implement the Kafka protocol. It is a plugin, so you can download the plugin and install it on your existing Pulsar cluster, and your Pulsar broker is then able to speak the Kafka protocol. With this capability, you don't need to change any code in your existing Kafka application or service. You can just point your Kafka application or service at a Pulsar cluster, and then you're good to go. We did the work in cooperation with OVHcloud, and right now Tencent is also trying out Kafka-on-Pulsar, as they're going to make Pulsar their fundamental messaging infrastructure. We expect this will help grow the community and reduce the barrier for people to try out Pulsar. I did a webinar with Pierre, who is the tech lead at OVHcloud, a couple of weeks ago. The video is available on the StreamNative website as well as the YouTube channel, so people who are interested in Kafka can feel free to check it out.
Tobias Macey
0:25:02
And because of the fact that you have this protocol handler layer in Pulsar, and it opens up the possibility of adding new protocols, I'm wondering if there's any work being done to integrate with the OpenMessaging specification that's being put forward as a common standard for different messaging systems to be able to interoperate more easily.
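The protocol handler idea discussed in this part of the conversation can be pictured with a toy model: several protocol front ends, one shared event log. Everything below is a hypothetical sketch. The classes `EventLog`, `KafkaLikeHandler`, `MqttLikeHandler`, and `Broker`, and the request shapes they accept, are invented for illustration and are not the Pulsar plugin interface or any real wire protocol.

```python
class EventLog:
    """Shared storage: one append-only list of payloads per topic."""

    def __init__(self):
        self.topics = {}

    def append(self, topic, payload):
        entries = self.topics.setdefault(topic, [])
        entries.append(payload)
        return len(entries) - 1  # offset of the new entry


class KafkaLikeHandler:
    name = "kafka-like"

    def handle(self, log, request):
        # Hypothetical request shape: {"api": "produce", "topic": ..., "value": ...}
        if request["api"] != "produce":
            raise ValueError("unsupported api")
        return log.append(request["topic"], request["value"])


class MqttLikeHandler:
    name = "mqtt-like"

    def handle(self, log, request):
        # Hypothetical request shape: ("PUBLISH", topic, payload)
        verb, topic, payload = request
        if verb != "PUBLISH":
            raise ValueError("unsupported verb")
        return log.append(topic, payload)


class Broker:
    """Toy broker: handlers are loaded as plugins and share one log."""

    def __init__(self, log):
        self.log = log
        self.handlers = {}

    def load(self, handler):
        self.handlers[handler.name] = handler

    def dispatch(self, protocol, request):
        return self.handlers[protocol].handle(self.log, request)


log = EventLog()
broker = Broker(log)
broker.load(KafkaLikeHandler())
broker.load(MqttLikeHandler())

first = broker.dispatch("kafka-like", {"api": "produce", "topic": "sensors", "value": b"21.5"})
second = broker.dispatch("mqtt-like", ("PUBLISH", "sensors", b"22.0"))
```

Both requests land in the same `sensors` log at consecutive offsets, which is the contrast with a proxy: the handlers translate a protocol directly into operations on the shared storage instead of re-encoding one wire protocol into another.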
Sijie Guo
0:25:33
Yeah. So right now, what we have been working on is integrating with two other popular messaging protocols. One is AMQP and the other is MQTT. AMQP is very popular in traditional messaging workloads, and MQTT is popular in IoT messaging workloads. We hope that this will simplify a lot of use cases that are moving from existing traditional message queuing workloads or from IoT messaging. That is one effort that we are doing now, and it's also in cooperation with China Mobile. I think the interesting thing about doing this in open source is that we work with a lot of end users to deliver what they need and to serve the best use cases. Going back to the OpenMessaging protocol, I was actually involved in the initial creation of OpenMessaging. Right now, I think the OpenMessaging standard is still an API-level standard. It doesn't get into the wire protocol layer. So we are still pushing that effort, and if there's any OpenMessaging wire protocol coming out, we should be able to support it very quickly.
Tobias Macey
0:27:07
And another interesting aspect of Pulsar and its relation to Kafka is that there is a decent amount of overlap in terms of the use cases that it provides for, and both projects are still very active and have large and growing communities. I'm wondering what you have seen as being some of the ideas that are being passed back and forth and some of the lessons that are being learned from each other's communities and each other's technical implementations.
Sijie Guo
0:27:34
Based on my experience helping people adopt Pulsar, I see that Pulsar is commonly used by two categories of users. One comes more from data pipelines and data processing, where Kafka is mainly used. The other category comes more from online services or event-driven workflows, where people are more often using ActiveMQ or a traditional message queuing system. What I've seen is that adoption can happen either way. People can come from traditional message queuing and look into Pulsar because it's more scalable than a traditional message queuing system. Other use cases come more from Kafka, and in the Kafka world most of the pain points are operational, especially when you want to operate multiple clusters or scale beyond a certain point. So the adoption of Pulsar comes from these two different categories. But I see a trend: when people adopt Pulsar for their online business use cases, they start pushing Pulsar into data pipelines and data processing, and if people adopt Pulsar for data processing, they might push it into their online services. I do see Pulsar merging these two different ecosystems, and this also leads to enhancements in the development of both ecosystems. Put that way, Pulsar is also learning from different ecosystems how to address the issues that have been seen in existing systems, and I see people in the Kafka community also looking into how to adopt the features and architectural advantages provided by Pulsar. For example, I see the Kafka community has been talking about tiered storage for a while.
And the tiered storage idea was originally brought in by Pulsar. So I see these two communities still growing in their own ways, but continuing to learn from each other. That is my take on this question.
Tobias Macey
0:30:20
And then in terms of your involvement with Pulsar, you mentioned that you've been working on it for quite some time. You were one of the co-founders of Streamlio, which was one of the early companies built around Pulsar, driving it forward in terms of its development and growing the ecosystem. And now you have founded StreamNative as a company to build a managed service for Pulsar and its own distribution. I'm wondering if you can talk a bit about some of the lessons that you've learned from Streamlio that have been most helpful in your current endeavor, and how you characterize your relationship with the project and the community at each of those stages of your career.
Sijie Guo
0:30:58
I think, first of all, working at Streamlio, helping people adopt Pulsar, and seeing the project grow has been a very, very wonderful journey, and I have learned a lot of lessons from that experience. Based on those lessons and also the experience of running StreamNative, especially in 2019, we have been really focusing on helping people adopt Pulsar and on growing the community. The first and, I would say, most important lesson that I have learned is to find the project-community fit. What that means is you need to find out why people need Pulsar and how Pulsar can address people's pain points, and that has to be done by working with the early adopters. And I would say that sometimes you need to work with the large internet companies, because they have the influence and they have the scale to help you verify that Pulsar is able to support everything from small scale to large scale and across different industries. The last thing, which I find super important, is to find the position of Pulsar in the whole ecosystem. I really liked the question you asked earlier: what is the role of Pulsar in the whole lifecycle of data management? I think that is the most important lesson I have learned while running my own company: you need to fit into the whole ecosystem. If you look at the past year, we have been doing a lot of integrations with Flink and with Spark, because that is the fit for Pulsar in the whole big data ecosystem. Pulsar is the messaging system, so you're able to get the data into the system. It is also a stream store that is able to keep data for a much longer duration. That is the advantage provided by Pulsar. And in order for people to be aware of Pulsar, you need to do the integrations with the bigger ecosystem.
With that kind of experience, we are moving into a product-led growth strategy that is mostly focused on learning from customers and also from community users what their requirements and use cases are, and how we can incorporate those requirements and use cases into developing the project as well as adding features to the whole product. So those are the most important lessons I have learned in the past. Your second question was, since I'm a vendor in this market, how I characterize my relationship with the project and the community in each role. I think the most important thing I want to raise here is that, as a project running in the Apache Software Foundation, we have to work in the Apache way. What that means is that everyone in the project and the community wears multiple hats. Taking myself as an example, I'm an individual who acts as a PMC member and a committer for Pulsar and BookKeeper, so I have to give independent opinions from a PMC and committer perspective, because when I'm talking to community users I'm representing the Apache Software Foundation. At the same time, I'm also a vendor, the owner of a company. So what we do is try our best to help people adopt Pulsar, more from a partnership and cooperation perspective, because we believe we have to grow the community in order to grow any business built around Pulsar. By helping people adopt Pulsar, we get a lot of use cases, and we can incorporate those requirements into developing Pulsar. That, in return, helps grow the community around Pulsar. So we play multiple roles in the community, and we develop those relationships in a collaborative way, making sure the main focus of the project is on growing adoption and making sure people are able to use Pulsar in different industries.
Tobias Macey
0:35:52
And I know that one of the ways that you're helping to drive that adoption is by being a spokesperson for the community. I know that you release biweekly notes on what's been happening within the community, and you also have a StreamNative distribution of Pulsar to help simplify some of the operational characteristics of the system. So I'm wondering if you can dig a bit more into what's included in that distribution, and some of the work that you're doing to help simplify the operability of the platform, given that it has so many different moving pieces.
Sijie Guo
0:36:27
So from a StreamNative perspective, we provide the StreamNative Platform, which is powered by Apache Pulsar. The main difference between the StreamNative Platform and Pulsar is basically that we provide a lot of operations-related tools to simplify running Pulsar in different environments, mostly focused on Kubernetes environments. So we provide charts, we provide Go-based administration tools, and we provide Pulsar Manager. We also offer an enhanced version of the Grafana dashboards so people can really understand what's going on in the platform. That is the main focus for the first version of the StreamNative Platform. Besides that, we also bundle Kafka-on-Pulsar natively into the platform, so people who want to use Kafka and Pulsar can download the StreamNative Platform and get started easily. At this moment, the StreamNative Platform is purely a community edition, so everyone is free to use it. We might develop some enterprise, more closed-source features in the future, but we haven't decided yet. Our main focus is still on developing our cloud service and providing managed services in the cloud.
Tobias Macey
0:38:01
And I'm wondering what motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general, because a significant portion of your career has been focused around this project and this problem domain. So I'm wondering what is keeping you interested and motivated throughout?
Sijie Guo
0:38:18
Yeah. So as I mentioned, I started my career about 10 years ago, and I saw how Hadoop and that kind of infrastructure technology can grow and influence a whole industry. Basically, the growth of the economy in China, especially the whole internet industry, is kind of due to the whole big data ecosystem. So I have seen how a technology can influence a whole industry. Then I moved from Yahoo to Twitter, and Twitter is kind of a messaging platform for the whole internet. You can think of Twitter as one of the first companies to use a lot of streaming technology. So I got into this space, and I saw how streaming technology can be used to help an enterprise like Twitter become very successful. I want that technology, and that streaming mindset, to be delivered to more industries to help them be successful. And we see that some of the existing technologies didn't address this in a very good way. There are still some shortcomings and drawbacks. So we want to use our experience and the technology that we have been developing to help more industries and more enterprises enjoy the power of streaming technology. That's what is driving me into this space and why I dedicate my energy to it.
Tobias Macey
0:40:07
And one of the interesting impacts of projects such as Pulsar and Kafka, and the overall focus on streaming data as a core component of a lot of these data systems, is that the overall design is starting to leak out into other areas of software and technology. I'm wondering if you could talk about some of the ways that you have seen streaming data become an important core competency in different technology industries, and the ways that projects such as Kafka and Pulsar are impacting the broader software and data landscape and the ways that applications are architected.
Sijie Guo
0:40:43
So in terms of streaming, our thinking comes from seeing that the software usage pattern has been shifting within the enterprise. Initially, enterprises were mostly building out their software in a database-centric way. You might build a team of people working with a database, put in a data layer, and provide some kind of interface for people to query. That created the whole database ecosystem, and that evolved into the big data, batch processing ecosystem. But the use cases and requirements have been shifting toward more event-driven workflows. Events can be generated from different sources. For example, when you browse a web page and click on it, you are generating different click events. Those events can be used by the enterprise to analyze user behavior, do better targeting and better marketing, and provide better services. So we see the use cases shifting toward more event-driven or streaming-driven ones, and that means the whole software architecture of an enterprise has been shifting to an event-driven architecture or event-driven workflow. In that way, the mindset shifts from processing static data sets to processing dynamically changing data streams.
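The shift from static data sets to changing streams can be illustrated with a small sketch. This is only an illustration in plain Python, not Pulsar code: the click events, their field names, and both function names are made up for the example. The batch version needs the whole data set up front, while the streaming version keeps running state and emits an updated result for every event.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical click events; the field names are illustrative only.
CLICKS: List[dict] = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/pricing"},
    {"user": "a", "page": "/pricing"},
]


def batch_page_counts(events: List[dict]) -> Dict[str, int]:
    """Batch style: the whole static data set is available up front."""
    return dict(Counter(event["page"] for event in events))


def streaming_page_counts(events: Iterable[dict]) -> Iterator[Dict[str, int]]:
    """Streaming style: update running state and emit a snapshot per event."""
    counts: Counter = Counter()
    for event in events:
        counts[event["page"]] += 1
        yield dict(counts)
```

Once every event has been seen, the last streaming snapshot matches the batch result; the difference is that intermediate results are available while the data is still arriving.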
So once you have this mindset shift, you need new tools and new capabilities, and that creates the whole messaging ecosystem, the streaming ecosystem, and the streaming tool chain. We see that this tool chain and ecosystem have helped industries from internet to finance to retail, and maybe IoT, become very successful. But they can also be very useful when applied to other, more traditional industries, and that is why streaming data is playing a very important role in current enterprise software architecture.
Tobias Macey
0:43:12
I'm wondering what you have seen as being some of the most interesting or innovative or unexpected ways that you've seen Pulsar used, and the applications of streaming data.
Sijie Guo
0:43:22
Yeah. So I think the common impression that the industry has of Pulsar is that it is basically a message queue. We have Pulsar users in China at a company called BestPay, which is the third largest payment company in China, and the use case there is very interesting: they use Pulsar for their real-time risk control pipeline. In the traditional data processing stack, people usually use a lambda architecture. You build out a batch layer using HDFS or Hive, you build out a speed layer using Kafka and Storm, and you combine the two into a lambda architecture. At BestPay, they basically tried to get rid of these two layers and move to a unified data processing stack. On the storage layer, they standardized on Pulsar as the source of truth, so basically they put everything into Pulsar. That gives you a common, I would say, central place that keeps all these event streams, both the historical data and the real-time data. And they standardized the computing engine on Spark, so they can do Spark Structured Streaming and also batch processing in Spark. They reduced the number of systems from four to two. With this kind of consolidation, they are shifting the whole capability of Pulsar beyond messaging, toward being more of a streaming data warehouse. So that is the most interesting and innovative way of using Pulsar. I'm also excited about this use case because it is real-time risk control, which sits in the core business pipeline and really delivers significant impact to the business logic. So that is the most interesting use case I've seen of how Pulsar has been used.
Tobias Macey
0:45:38
And for people who are adopting Pulsar, what are some of the edge cases or design elements that they are most challenged by in terms of figuring out how to architect and design their own solutions and use cases around Pulsar?
Sijie Guo
0:45:55
So I think one of the common questions I have received in the community comes especially from the event sourcing perspective. People have the impression that Pulsar is able to keep event data, and that you can keep the events for a longer duration by leveraging tiered storage. But people are looking for more from a point lookup perspective: they want to use Pulsar as storage for point lookups. Pulsar is mainly designed for streaming workloads. In other words, it is designed to be scan-based. You're streaming data, you're able to process the streaming data in sequence, and you may want to rewind your data processing job to an earlier point and rescan the data. So Pulsar was designed for scan-oriented workloads, not point lookups. I think that is the common mistake, the common misuse of Pulsar, that I would like people to be aware of before making any design on Pulsar.
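The difference between scan-oriented access and point lookup can be sketched with a toy append-only log. This is a minimal illustration in plain Python under stated assumptions, not Pulsar's API: the class `AppendOnlyLog` and its methods are hypothetical names invented for the example. Reading sequentially from any position (including rewinding) is cheap, while finding the latest value for one key means walking the whole log because there is no index.

```python
from typing import Iterator, List, Optional, Tuple


class AppendOnlyLog:
    """Toy stand-in for a topic: an ordered list of (key, value) events."""

    def __init__(self) -> None:
        self._events: List[Tuple[str, str]] = []

    def append(self, key: str, value: str) -> int:
        """Add an event and return its position in the log."""
        self._events.append((key, value))
        return len(self._events) - 1

    def scan(self, start: int = 0) -> Iterator[Tuple[int, str, str]]:
        """Sequential read from a position; rewinding means a smaller start."""
        for pos in range(start, len(self._events)):
            key, value = self._events[pos]
            yield pos, key, value

    def lookup_latest(self, key: str) -> Optional[str]:
        """Point lookup by key: with no index, this degrades to a full scan."""
        latest: Optional[str] = None
        for _, event_key, value in self.scan():
            if event_key == key:
                latest = value
        return latest
```

In this sketch `scan(start)` touches only the events from `start` onward, whereas `lookup_latest` reads every event for every query, which is why a workload dominated by lookups is usually better served by a keyed store fed from the stream.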
Tobias Macey
0:47:07
And for people who are evaluating Pulsar or considering it as a component of their architectures, what are the cases where Pulsar is the wrong choice and they might be better served by either an entirely different approach or a different set of tooling?
Sijie Guo
0:47:22
So the point lookup case that I just described is, for sure, the first wrong choice. At this moment, I don't think Pulsar is capable of doing those operations yet. That might change in the future, who knows, but at this moment doing point lookups on the Pulsar side is the wrong choice. The second one is a pattern I see coming up a lot: Pulsar is able to support millions of topics, so people end up trying to map devices or users to individual topics, and they try to grow the number of topics beyond millions, maybe to tens of millions. Based on the current Pulsar implementation, that is still a bad design. It is better to re-architect in a way that reduces the number of topics used by a single application. Pulsar can definitely support millions, but not tens of millions, and operating millions of topics is still a bit challenging. So people need to be trained on how to use Pulsar topics and how to leverage all the cool features provided by Pulsar. The last one, I would say, is that Pulsar also provides non-persistent topics, a non-persistent capability. In order to use that capability, people have to understand what the delivery guarantees or dispatching guarantees are, to make sure they are not surprised by the guarantees provided by non-persistent topics.
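One way to keep the topic count bounded, instead of creating one topic per device or user, is to hash each identifier into a fixed set of topics. The following is a minimal sketch of that idea and an assumption of this write-up, not a recommendation from the conversation: the function name, the bucket count of 64, and the topic prefix are all hypothetical, and only the `persistent://tenant/namespace/topic` naming shape follows Pulsar's convention.

```python
import hashlib


def topic_for_device(
    device_id: str,
    num_topics: int = 64,  # hypothetical upper bound on topics for this app
    prefix: str = "persistent://public/default/devices",  # illustrative name
) -> str:
    """Map a device id to one of num_topics topics using a stable hash."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_topics
    return f"{prefix}-{bucket}"
```

Because the hash is stable, a given device always lands on the same topic, so the number of topics stays at most `num_topics` no matter how many devices exist. A consumer that cares about one device would then filter by the device id carried in each message, for example as the message key.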
So those are the three common misuse cases I have seen in the community.
Tobias Macey
0:49:18
And as you look to the future of the business for StreamNative and the Pulsar project and community, what do you have planned and what are your goals?
Sijie Guo
0:49:29
Yeah, so in terms of the services and products provided by StreamNative, we are fully focused on developing StreamNative Cloud, which is a fully managed Pulsar service running on public cloud. We want to give people a first-hand, very smooth experience so they can get started with Pulsar easily. That is the product side. On the project side, as I mentioned before, Pulsar has evolved beyond a pub/sub messaging system. You have three main capabilities: you're able to connect and ingest data into Pulsar, you're able to store the data, and you're able to process the data. On the ingestion side, we want to integrate with more messaging protocols so people are able to integrate with their existing messaging applications. On the storage side, we want to do more with offloading to tiered storage by bringing in some additional data-processing-oriented capabilities, like columnar storage, bringing in indexes, and being able to leverage topic compaction. Those functionalities can help Pulsar provide better performance for a unified data processing story. On the processing side, we want to improve Pulsar Functions by introducing an orchestration framework to combine multiple functions into a pipeline, so people can write a simple function pipeline to chain multiple functions together. We're also looking into integrating with WebAssembly, which makes it easy to support functions in different languages. And in terms of integrating with Flink, we have already made Pulsar a source and a sink for both Flink and Spark, and we made Pulsar a catalog for Flink as well. The next step is how we want to deal with state management, and state management applies both to Pulsar Functions and to the Flink integration. So there are a lot of things to do regarding state management.
Tobias Macey
0:51:55
Are there any other aspects of the Pulsar platform and its community and the ecosystem that's growing up around it, or the work that you're doing at StreamNative, that we didn't discuss that you would like to cover before we close out the show?
Sijie Guo
0:52:04
So the Pulsar community was going to hold the first-ever Pulsar user conference, Pulsar Summit, in April, and due to the escalation of the coronavirus we pushed the user conference to August. But at this moment, the organizers are also exploring a different approach: holding a purely virtual conference for Pulsar Summit. So follow us on Twitter and we'll keep everyone posted. We are very excited, and we are very confident that we will be able to hold a virtual conference and show more Pulsar-oriented use cases to the broader community.
Tobias Macey
0:52:48
Well, for anybody who does want to follow along with you for that, or get in touch and see the other work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I would just like to ask a final question: what do you see as being the biggest gap in the tooling or technology that's available for data management today?
Sijie Guo
0:53:05
I think the biggest gap is that right now, in the whole data management space, or the whole big data ecosystem, there are still many, many components in the whole pipeline. The ability to group different types of systems together and provide a uniform operations and management experience, or in other words the ability to trace an event as it goes from the data source all the way to the data in analytics or a data warehouse, is something I don't see good tooling for. I would like to see more effort happening in this space.
Tobias Macey
0:53:52
Well, thank you very much for taking the time today to join me and share your experience working on Pulsar and building a business around it. It's definitely a very interesting tool and one that I've been exploring for my own purposes. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day.
Sijie Guo
0:54:09
Thank you for having me here. It's my pleasure to share all the experience and knowledge around this project as well as the company. If you want to learn more about Pulsar, or about streaming technology in general, you can find me on Slack or Twitter.
Tobias Macey
0:54:32
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.