Summary
The data used in financial markets is time-oriented and multidimensional, which makes it difficult to manage in either relational or time-series databases. To make this information more manageable, the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data
Interview
- Introduction
- How did you get involved in the area of data management?
- What was your motivation for creating MarketStore?
- What are the characteristics of financial time series data that make it challenging to manage?
- What are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available?
- With MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?
- What is the worst case scenario if there is a total failure in the data store?
- What guards have you built to prevent such a situation from occurring?
- Since MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable?
- What were the most challenging aspects of building MarketStore and integrating it into the rest of your systems?
- What was the motivation for open sourcing the code?
- What is the next planned major feature for MarketStore, and what use-case is it aiming to support?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- MarketStore
- Alpaca
- IBM
- DB2
- GreenPlum
- Algorithmic Trading
- Backtesting
- OHLC (Open-High-Low-Close)
- HDF5
- Golang
- C++
- Timeseries Database List
- InfluxDB
- JSONRPC
- Slait
- CircleCI
- GDAX
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And for complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, Datadog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you'll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new t-shirt.
And go to dataengineeringpodcast.com
[00:01:05] Unknown:
to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial time-series data. So, Christopher, could you start by introducing yourself?
[00:01:21] Unknown:
Hi. Yeah. My name's Chris. I work for Alpaca as lead software engineer, and I'm one of the developers on MarketStore. Hi. This is Hitoshi Harada.
[00:01:32] Unknown:
I'm the CTO of Alpaca and cofounder of the company.
[00:01:36] Unknown:
And starting with you again, Christopher, how did you first get involved in the area of data management?
[00:01:41] Unknown:
Well, I actually come from a background in aerospace engineering and slowly made my way into data management, working for IBM on the DB2 replication team for a while. And then, yeah, I made my way here and started working on MarketStore almost from day one after I started working with Alpaca. And, Hitoshi, how did you get involved? Yeah. So at this company, Alpaca, and in this
[00:02:05] Unknown:
fast-moving application space of financial trading, I saw the need for a large-volume time-series database designed specifically for this particular domain. And so I pitched the idea to Chris and the other developers,
[00:02:22] Unknown:
and that's how we started. He's being a little bit modest, but he's also a pretty strong contributor to Postgres open source, so he has a strong background in databases and storage. Right, so maybe I should mention that as well. Yeah. So,
[00:02:35] Unknown:
I started my database career path by contributing code to the open source PostgreSQL community, where I was a major contributor around the 8.4 and 9.0 releases, working on window functions and so on. And then I moved to the US from Japan to join this company called Greenplum, where I served as a main architect of the Greenplum database, touching every piece of that huge database software: query processing, distributed transactions, replication, and so on. So that's my background, and that's how I got this idea of building a new database from scratch. Yeah. Generally,
[00:03:21] Unknown:
the best practice is not to build your own database, but if you have that much domain knowledge and experience on the team, then I can see that you can fairly easily ignore that advice, which it seems that you did when you built MarketStore. So, you mentioned that one of the reasons for building it was to be able to handle financial data. I'm wondering if you can talk a bit about the decision behind actually getting the project started, and what the characteristics of the types of data you're using are that made it such a challenge to manage.
[00:03:54] Unknown:
Yeah. So in the beginning, I'll give you a little background about what we do. We work predominantly in the algo trading space, where customers write algos on our platform and need to very quickly backtest those algorithms. In the beginning it was only currency data, and a backtest would cover probably 10 years of those 1-minute OHLCV bars for a given algorithm. At first we were just storing that in HDF5 files and querying through that, and it was taking way too long. Then there was kind of an in-between step, before MarketStore, of trying to query it through a Python implementation, since most of our other stuff was built in Python, and that was chewing through so much memory and still not giving us the performance we wanted. And then we said, okay, we need to start something from scratch here to deliver, for example, 10 years of 1-minute bar data in less than a second, to be able to run these backtests and then also do live trading, things like that, in the algo trading space.
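To put that one-second target in perspective, here is some back-of-the-envelope arithmetic; the trading-calendar figures below are standard approximations, not numbers from the episode:

```python
# Rough size of one symbol's 10-year minute-bar history.
TRADING_DAYS_PER_YEAR = 252   # US equities; FX and crypto trade closer to 365 days
MINUTES_PER_SESSION = 390     # 9:30-16:00 Eastern

bars = 10 * TRADING_DAYS_PER_YEAR * MINUTES_PER_SESSION
bytes_per_bar = 28            # 8-byte epoch + four 4-byte OHLC floats + 4-byte volume
print(f"{bars:,} bars, ~{bars * bytes_per_bar / 1e6:.0f} MB per symbol")
# -> 982,800 bars, ~28 MB: easily scanned from disk in well under a second
# if the bars are stored contiguously, which is the bet MarketStore makes
```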
[00:05:04] Unknown:
Yeah, and so, you mentioned that. So, yeah, it's not a good idea to start a database product from scratch in any case. Mhmm. And that's also the case here. But from my experience and background, I knew how hard it is to build and deliver a production-quality database from scratch. I also knew that if you focus on one thing and just do a good job for this particular use case, I could see how it could be possible to do it that way. And also, most of the existing traditional databases have, for performance reasons, been written in C or similarly low-level languages that are high cost in terms of development.
But we chose the Go language to achieve the best performance,
[00:05:58] Unknown:
and we built the whole database from scratch in a very quick way. And did you already have a lot of experienced Go developers on the team, or was that something that you ramped up on as you were working on the project? I mean, for me, I hadn't written any Go before I started working here. You know, I come from a C++ background,
[00:06:18] Unknown:
but I really took to it. I think, Hitoshi, you had a little bit of experience before. Yeah. So, Go
[00:06:25] Unknown:
was something I saw evolving all the time, and I liked the simplicity of the language. At that time lots of people were starting to use Go in different areas, mostly coming from the web application layer, but I personally saw the potential of the Go language for this kind of low-layer, middleware type of use case, and Go fit our requirements well.
[00:06:59] Unknown:
And in terms of the actual data that you're storing in MarketStore, I'm wondering if you can give a sense of what the shape of the data looks like, and why you decided to build your own storage layer rather than relying on any existing time-series database or document store, or anything along those lines?
[00:07:19] Unknown:
So predominantly, what we're storing are open, high, low, close, volume bars. We also do some storage of ticks that come across the wire and then aggregate those as well. When we looked at building the storage layer, it was all about optimizing it for our specific use case. We wanted the database to be very focused on what we were using it for, because it just made it much cleaner to work with in, again, the algo trading space that we're in: quickly load the data into a pandas DataFrame, do some deep learning or testing on it, or whatever else the user wants to do with it.
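Concretely, that query-to-DataFrame path looks something like the sketch below with the open source pymarketstore client, following the project's README; the server address, symbol, and available data are assumptions here:

```python
# Querying minute bars straight into pandas (a sketch per the pymarketstore README).
import pymarketstore as pymkts

client = pymkts.Client('http://localhost:5993/rpc')   # MarketStore's default endpoint
params = pymkts.Params('BTC', '1Min', 'OHLCV', limit=1000)  # symbol/timeframe/attributes
reply = client.query(params)

df = reply.first().df()   # decode the binary reply directly into a pandas DataFrame
print(df.tail())          # timestamped OHLCV bars, ready for backtesting or ML
```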
[00:08:02] Unknown:
Yeah. And, of course, the time-series database wasn't new at all, and isn't today; there are many, many other time-series databases. The main thing we bumped into is that most time-series databases are designed for general-purpose use cases, especially with the coming age of IoT and sensor data. Some of the products focus on sensor data; some focus on more metrics-type data, like monitoring data for IT systems and so on. And typically their storage format is more like JSON, or a binary-serialized version of that unstructured data. When it comes to this financial market data, it's pretty structured, as Chris just mentioned.
And so the format has to be more optimal. That way you don't have to worry much about having hundreds and hundreds of distributed nodes. The data doesn't fit in the main memory of a normal modern machine, but it's definitely something you can easily put on disk and read at optimal speed. We looked at different products before jumping into this building-from-scratch phase, including InfluxDB and other evolving time-series databases, and including Hadoop as well. But it made more sense for us to build something optimized just for this data. The challenging aspect, not just of the building, but of this particular domain, is that the typical use case requires both low-latency queries and deep, long-history, high-volume queries. What that means is, as Chris briefly mentioned, the application is mainly around algo trading, and that does both backtesting and live trading.
When it comes to backtesting, of course, you need a long history, and the faster you can query, the faster you can see the results. The existing backtest frameworks we looked at, using some of the other data formats, were kind of slow because of these data issues. And once you're satisfied with the backtest results, the algo needs to run against the real-time market, and it needs to react quickly to market changes. Traditional database best practice says you should separate those different requirements: for the big analytical use cases you should have something like Hadoop, and for the rapid queries over the freshest data you should have something else, like an in-memory distributed service. But instead we chose to focus on just this particular area and provide the same interface, same functionality, same experience for both.
Having this one dedicated database server for this particular financial market data was one of the challenges that we saw.
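To make the structured-versus-serialized point concrete, here is a small numpy sketch of a fixed-width binary record layout of the kind being described; it illustrates the general idea, not MarketStore's actual on-disk format:

```python
# Fixed-width records vs. serialized JSON for structured market data
# (a sketch of the general idea, not MarketStore's actual file format).
import json
import numpy as np

bar_dtype = np.dtype([
    ('epoch', 'i8'),                  # seconds since the Unix epoch
    ('open', 'f4'), ('high', 'f4'),
    ('low', 'f4'), ('close', 'f4'),
    ('volume', 'i4'),
])

bars = np.zeros(100_000, dtype=bar_dtype)     # 100k minute bars, 28 bytes each
bars.tofile('/tmp/bars.bin')                  # one contiguous, seekable write

# The same bar as JSON is several times larger and must be parsed field by field:
record = {'epoch': 1515974400, 'open': 100.0, 'high': 101.5,
          'low': 99.8, 'close': 101.1, 'volume': 12500}
print(bar_dtype.itemsize, 'bytes fixed-width vs', len(json.dumps(record)), 'bytes as JSON')

back = np.fromfile('/tmp/bars.bin', dtype=bar_dtype)  # reads back with zero parsing
```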
[00:11:40] Unknown:
And also in terms of the feature set, one of the things that I found interesting was that the specific time series have different time zones associated with them based on which markets the data is coming from, which I imagine would be difficult to support in any of the other existing time-series databases. You also mentioned needing to maintain a long history, whereas a lot of the metrics-oriented time-series databases in particular have a feature where they automatically compact the data after it exceeds a particular time horizon. You definitely don't want that here, because of the need to do granular analysis of an algorithm across various time horizons into the past for backtesting.
[00:12:25] Unknown:
Exactly. Yeah. And not just different markets, but also different securities and things you want to store. You know, there is no market close on currency data, or on crypto data too now. And then all of our US equity data needs to be on the Eastern time zone and aggregated to that time frame. And then there's also after-hours data that we want to include, but that doesn't get included in the aggregated 1-day bar for that day. So there were a lot of unique challenges in that finance realm that we felt needed a specialized approach.
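As an illustration of the session-aware aggregation being described, here is a hedged pandas sketch that rolls timezone-converted minute bars up to a regular-hours daily bar; the 9:30-16:00 Eastern window is the standard US equity session, and the function and column names are hypothetical:

```python
# Sketch: aggregate 1-minute US equity bars to a regular-session daily bar,
# keeping after-hours trades out of the daily OHLCV.
import pandas as pd

def daily_bars(minute_bars: pd.DataFrame) -> pd.DataFrame:
    """minute_bars: UTC tz-aware DatetimeIndex with open/high/low/close/volume columns."""
    local = minute_bars.tz_convert('America/New_York')
    session = local.between_time('09:30', '15:59')   # drop pre-market and after-hours bars
    return session.resample('1D').agg({
        'open': 'first',   # first trade of the session
        'high': 'max',
        'low': 'min',
        'close': 'last',   # last trade of the regular session
        'volume': 'sum',
    }).dropna()            # non-trading days resample to all-NaN rows
```

For currencies or crypto, which never close, you would skip the between_time filter and resample in whatever time zone the consumer expects.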
[00:13:05] Unknown:
And you said too that when you were first getting started with building the Alpaca data stack, you were originally using HDF5 files for doing the analysis. So I'm wondering how the overall workflow changed from what you were using then to what it is now that you have MarketStore available as that single source of data for those financial markets?
[00:13:29] Unknown:
Yeah. I mean, it's kind of night and day. Now we just have a MarketStore running, or a couple of MarketStores running, and they're just openly available. You know, we have a dev environment one as well as our production, and they're openly available for query at any time. So anybody can now just say, okay, I need to backtest over 10 years of data right now, in my little IPython client, with this little algorithm that I came up with, and you're able to do that really quickly. It's all queried via JSON-RPC. So, yeah, it has made things much, much faster. There's no moving the data around: oh, I have an HDF5 file over here, and I need to bring it over here and parse it and all this stuff. Yeah. Like, before MarketStore,
[00:14:14] Unknown:
we used to have lots and lots of these HDF5 files here and there, locally as well as in S3, and download them every time they were needed. And also, it was very hard: one of the churning parts with HDF5 files was to keep updating, keep appending the new data at the end as time goes on, and I found that the DataFrame and HDF5 file formats were not designed for that. Today, with MarketStore, it's very quick to update, and we don't notice any performance issues at all. Also, the main client application is a Python-based application using the PyData framework, with DataFrames, and also using the machine learning stuff. MarketStore returns most of the data in a binary format, over JSON-RPC in a way, but also in MessagePack, the binary protocol, and sends the most optimal C-array-style data directly to the client, and the client just decodes it to a DataFrame. So from the application perspective, you barely notice any difference from just reading an HDF5 file; it's actually even faster to read it from MarketStore.
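And for the append path that was painful with HDF5, here is a hedged sketch of what writing new bars looks like with the same pymarketstore client; the 'Epoch' field and the symbol/timeframe/attribute-group key follow the pattern in the project's README, but treat the specifics as illustrative:

```python
# Appending a freshly aggregated bar (a sketch based on the pymarketstore README).
import numpy as np
import pymarketstore as pymkts

client = pymkts.Client('http://localhost:5993/rpc')

new_bar = np.array(
    [(1515974400, 102.0, 102.5, 101.8, 102.3, 5000)],   # epoch, O, H, L, C, V
    dtype=[('Epoch', 'i8'), ('Open', 'f4'), ('High', 'f4'),
           ('Low', 'f4'), ('Close', 'f4'), ('Volume', 'i4')],
)
client.write(new_bar, 'AAPL/1Min/OHLCV')   # appends to the symbol/timeframe bucket
```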
[00:15:41] Unknown:
And we've talked a bit about the fact that the actual financial data that's being loaded into MarketStore is coming from external services. So I'm wondering what mechanisms you have for maintaining an up-to-date record of that data as it is generated, and how you ensure that any queries executed at a given point in time are using the most up-to-date data available.
[00:16:06] Unknown:
Yeah. That's been a challenge. And part of the reason that we built MarketStore, too, was that these upstream providers weren't delivering the performance that we needed to query from them for our use case. So, you know, we would have liked to just say: hey, okay, you guys have the data, we'll pay you, and then we can query our 10 years of that data anytime we want. Well, that's not the case; it's going to take, like, 30 minutes sometimes to query that data, and it just didn't make sense. So instead, we keep our history stored in the database, and then we're constantly appending new records from the upstream. And, unfortunately, many of these providers don't have the most modern of APIs.
So, you know, we end up having to just poll a lot of them. Some of them are now adding streaming services, but it's a slow process. We wanted to keep MarketStore just as the database itself, so we added a plugin interface to be able to have kind of add-ons for things that people want to do with it. One of those add-ons we created is called the Slait plugin, which is actually a plugin that connects to another open source project that we worked on, an in-memory cache called Slait.
And so we are constantly polling from this upstream, and then it goes into this in-memory cache, where we're also keeping real-time quotes, because besides the historical data that we store in MarketStore, we need to provide real-time quotes to our clients too, and a snapshot, as quickly as possible, of the current market price of any given security. That data is then streamed from there, through the plugin, onto disk throughout the day. So it's a pretty big issue that we've spent a lot of time working on, trying to get that flow as clean as possible.
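The poll-buffer-flush flow described above can be sketched generically; everything below is hypothetical scaffolding, with Slait playing the buffer role in Alpaca's actual setup:

```python
# Hypothetical shape of the upstream ingestion loop: poll a provider, buffer
# quotes in memory (Slait's role here), and flush to the store in batches.
import time
from collections import deque

def ingest_loop(fetch_latest, flush_to_store, interval_sec=1.0, batch_size=100):
    """fetch_latest() -> iterable of new bars; flush_to_store(bars) persists them."""
    buffer = deque()            # real-time readers can be served from this buffer
    while True:
        buffer.extend(fetch_latest())
        if len(buffer) >= batch_size:
            flush_to_store(list(buffer))   # write through to disk in one batch
            buffer.clear()
        time.sleep(interval_sec)           # upstreams here are poll-only, per the episode
```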
[00:18:15] Unknown:
And given the fact that the majority, if not all, of the data being stored in MarketStore is actually coming from these external sources, what is the impact if the instance that's hosting MarketStore has a catastrophic failure? Do you have any guards in place to ensure some sort of point-in-time recovery, or would it just be a matter of building a new instance and then reloading the data from those external sources?
[00:18:39] Unknown:
Yeah. So right now, in production, we're actually running two completely separate instances, with separate cloud deployments and completely mirrored datasets. The challenge then is keeping those in sync, and we're constantly monitoring that. And, you know, if one goes down, we have the other as backup.
[00:19:02] Unknown:
Yes. As a production system, we definitely have to have this redundancy, and we have some management and operational tools that maintain the data and check cross-consistency as well. And having Slait in between is also one of our best practices for keeping MarketStore up and running. I said it's kind of a cache, but you can also think of it as a lightweight version of Kafka; it's a message bus. Because the data comes from an external source, we pull it just once and keep it in between, in Slait. And one of the good features of Slait is that you can also query a short amount of the latest history, not just consume the real-time messages.
So in case one MarketStore fails, we just bring it back, and it catches up to the latest history through Slait. At the moment, MarketStore itself doesn't have any built-in replication or disaster-recovery kind of feature that most databases have. Rather, we just keep polling and make sure that it can catch up within a short amount of time after a failure. But MarketStore itself does have a write-ahead log, mainly for performance reasons, because when you're writing into different branches of the partitioned time series you can see performance overhead, so the writes go through the engine's WAL.
And this write-ahead log can also help with durability and transaction cost as well. So if a quick panic or crash type of failure happens, MarketStore can consistently recover on its own. And if it lags in terms of data, it can catch up with the help of Slait.
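For readers unfamiliar with the technique, here is a minimal write-ahead-log sketch showing the durability idea: make the log record durable before the main write proceeds, then replay the log after a crash. This is a concept sketch in Python, not MarketStore's Go implementation:

```python
# Minimal write-ahead log: fsync the intent record first, replay after a crash.
import json
import os

class TinyWAL:
    def __init__(self, path):
        self.f = open(path, 'ab')

    def log(self, record: dict):
        self.f.write((json.dumps(record) + '\n').encode())
        self.f.flush()
        os.fsync(self.f.fileno())   # durable before the data files are touched

def replay(path):
    """On restart, return logged records so missed writes can be re-applied."""
    with open(path, 'rb') as f:
        return [json.loads(line) for line in f if line.strip()]
```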
[00:21:27] Unknown:
And since MarketStore is being used for querying and analyzing data having to do with financial markets, and there are potentially large quantities of money being staked on the results of that analysis, what kinds of mechanisms do you have in place, whether it's unit testing at the code level or comparative testing against the raw data sources? How do you ensure that the operations being performed in MarketStore return accurate and repeatable results?
[00:21:54] Unknown:
Yeah. I mean, of course, it starts at the low level with unit testing. We try to be as diligent as possible and be that, you know, test-driven-development kind of shop. But we also use individual instances of MarketStore in our own integration testing with our systems, all the time. So, like, on CircleCI, we'll spin up a Slait, a MarketStore, and then our application that runs those tests with the data delivered from MarketStore. So we're constantly testing how the data is being delivered, and if that data isn't exactly as it's supposed to be, then our algorithm
[00:22:36] Unknown:
backtests will come out with a different result, and we'll know that right away. And also, from a security perspective, what mechanisms do you have in place to ensure that the data in MarketStore isn't being tampered with by some malicious actor?
[00:22:49] Unknown:
As of right now, we don't have any security kind of add-ins for MarketStore. What we're doing right now is completely internal; it's not exposed to the outside for query. We're proxying it through our own server.
[00:23:04] Unknown:
And what have been some of the most challenging aspects, whether technical or otherwise, of building and maintaining the MarketStore project and integrating it into the rest of your systems?
[00:23:15] Unknown:
I mean, it's been great and successful, at least within our system. For us, one of the challenges is that we are not selling this product to any customers; it's part of the strong backbone for our trading platform. Unlike other companies who build a company around a database product, the challenge is how we can keep up this open source project and provide good value to the rest of the community. Sometimes the entire engineering team needs to focus on our delivery, and that doesn't include MarketStore itself.
Right? So, how do we keep improving the product, and if someone needs some help when using MarketStore, how can we allocate engineering resources to help them? We don't have a clear answer yet, but maybe it's about building a strong community around it and talking about it at meetups and so on. But that's one of the things we are seeing.
[00:24:30] Unknown:
And you mentioned there, too, the fact that the project is open source, which I don't know if we had mentioned earlier. So I'm wondering if you can talk a bit about your reasoning for releasing the project as open source, and whether any particular cleanup was needed to be able to release the project publicly.
[00:24:48] Unknown:
The history is, we started building from scratch just for our internal use case, and we made it proprietary at the beginning. But as we progressed and worked with many other types of people, including those upstream data source providers and similar companies trying to build applications in this trading space, we constantly saw the same needs, and we wondered how they were handling them. I heard some of the traditional brokerage companies just build this similar type of workload using a SQL database, which is kind of ridiculous to me.
So, yeah, one of the things we compared was the existing brokers' APIs in terms of data, streaming data, and queries. What was it? MarketStore returned the data for the last year at the minute level in 20 milliseconds or something, and their API returned it in 4 minutes or so. But, anyway, what was the point? Why did we open source it? Right. So, again, this is the same path that everyone goes down. Even after we open sourced it, I was contacted by one of my old friends who is also personally doing algo trading. He mentioned that this is great; he'd had to build a smaller version of this by himself, and so on. And for us, it's not just giving to the community. Of course, open sourcing gives us marketing as a company, but also development resources and contributions. We totally welcome it if someone wants to contribute, and we want to see more use cases, what kinds of things people want, so we can constantly keep improving the product. Again, from my traditional relational database product experience, at this point in time, a database being open source is a natural thing. Oracle and those proprietary databases also don't make sense to users anymore.
They need to see what's inside, even if it's paid software. Open sourcing, having contributions, seeing more use cases, and listening to the users is a very important thing. Yeah. I mean,
[00:27:27] Unknown:
if somebody opens an issue on our GitHub repo for MarketStore, it gives us insight into what these people are using it for, maybe something that we haven't thought of before, and maybe that's something we can incorporate into our product going forward in the algo trading space. It's almost like user testing, sort of by proxy.
[00:27:49] Unknown:
And what are some of the upcoming features that you have planned for MarketStore, and are there any particular use cases that those features are aimed at?
[00:27:58] Unknown:
Sure. Yeah. Actually, when it was a proprietary database, we already had this feature, and then we decided to strip it out for the open source release. As we mentioned before, we do backtesting on these algorithms, but there are also people who want to live test and actually run and place trades live. So in order for an algorithm to maintain correct state, not only of its holdings and portfolio, but also of the market data itself and what the securities are doing throughout the day, we wanted to add some push capabilities to the system and allow clients to subscribe either to the data that's being directly written or to the aggregated data that is being created inside MarketStore.
So if a person's running an algorithm, they can be receiving that 1-minute bar, that 5-minute bar, that 1-hour bar as soon as it's there, and not have to poll or keep querying and all that stuff. Yeah. Particularly in the area of financial trading,
[00:28:56] Unknown:
being able to have the lowest latency possible for analyzing and then executing those trades is, I can see, immensely valuable for the people interacting with the service. And not just on the client side for retrieving the data, but also in MarketStore obtaining the data from the actual markets themselves, being able to maintain that high-throughput, low-latency data stream would be immensely useful.
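Since this push interface was still on the roadmap at the time of this conversation, the sketch below is purely hypothetical: a subscriber built with the websocket-client library, where the endpoint and subscription payload are invented for illustration:

```python
# Hypothetical subscriber for the planned push feature; endpoint and message
# shapes are assumptions, not a documented MarketStore API.
import json
import websocket   # pip install websocket-client

def on_message(ws, message):
    bar = json.loads(message)        # e.g. a completed 1-minute or 5-minute bar
    print(bar)                       # hand off to the live algorithm here

def on_open(ws):
    ws.send(json.dumps({'subscribe': 'AAPL/1Min/OHLCV'}))   # assumed payload shape

ws = websocket.WebSocketApp('ws://localhost:5993/ws',       # assumed endpoint
                            on_open=on_open, on_message=on_message)
ws.run_forever()
```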
[00:29:23] Unknown:
And are there any other topics that we didn't talk about yet that you think we should cover?
[00:29:28] Unknown:
No, not much on MarketStore itself. I just wanted to add that, on top of those good things about MarketStore, one of the reasons that we open sourced it is Alpaca's vision and mission. We are seeing, especially in the retail trading space, that the institutions have become very hardware-oriented in their trading, and they have a lot of arms and tools. But when it comes to the retail trading space, we constantly see a lack of access to high-quality, high-performance data, a good toolset, and sophisticated AI technology. And that's why we are building this. Of course, providing the data store is not the only thing, but it is in line with what we want to achieve, which is to say we want to provide the technology and power to retail investors. If they're smart enough and they're good at trading, they should be able to compete with the rest of the world, including institutions.
So, again, we want to see more individual people and developers use MarketStore, touch it and see how fast it is, find it useful or not useful, and we'll just keep solving those main problems as a company. And so
[00:31:00] Unknown:
for anybody who wants to get in touch with either of you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, wondering if you can share your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:31:18] Unknown:
I mean, it's great seeing so many different databases coming up. I quickly looked through your blog and your series of podcasts, and it's great to see that even in the time-series space there are many, new and old. And we are happy that we've kind of joined this list as one of those databases. Yeah.
[00:31:47] Unknown:
I mean, in terms of a gap, one of the things that we were looking to fill, getting back to why we open sourced it, is this idea of breaking the barriers to entry for people. We didn't want someone to have to be a database or SQL expert or anything like that to be able to just say: okay, I'm going to grab some crypto data from GDAX or something, I'm going to write it to disk, and then I'm going to just press a button, run MarketStore, and start writing some Python. A lot of the other systems out there require you to have this background knowledge of how to even interact with them. And part of it is, sure, we did write something that was specialized.
But it also made it easy for somebody who just has a little bit of Python background, or maybe very little Python background, to get started on a project where they think they can make themselves a little bit of money, compete with the big boys at Goldman Sachs or whatever else, and not need to have that background that's often required in the data storage space.
[00:33:02] Unknown:
Well, I want to thank you both for taking the time out of your day to join me and talk about the work that you're doing at Alpaca and on MarketStore. It definitely looks like an interesting project and one that I'm sure will see a lot of uptake. So thank you both for that, and I hope you enjoy the rest of your day. Appreciate it. Thank you, Tobias. Thank
[00:33:19] Unknown:
you.
Introduction to MarketStore
Building a Specialized Database for Financial Data
Optimizing Data Storage for Algo Trading
Handling Time Zones and Data Compaction
Maintaining Up-to-Date Records
Ensuring Data Accuracy and Security
Challenges and Open Source Contributions
Upcoming Features and Use Cases
Empowering Retail Investors
Biggest Gaps in Data Management Technology