Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Counsil. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions
- How did you get involved in the area of data management?
- Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
- What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
- Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
- Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accomodate changing trends in usage and technologies for big data and advanced analytics?
- For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
- How does HPCC Systems manage persistence and scalability?
- What are the integration points available for extending and enhancing the HPCC Systems platform?
- What is involved in deploying and managing a production installation of HPCC Systems?
- The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
- How does the Thor engine manage data transformation and manipulation?
- What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
- For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
- How are you using the HPCC Systems platform in your work at LexisNexis?
- Despite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
- How is the HPCC Systems project governed, and what is your approach to sustainability?
- What are some of the additional capabilities that are only available in the enterprise distribution?
- When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
- What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
- What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
- What do you have planned for the future of HPCC Systems?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- HPCC Systems
- LexisNexis Risk Solutions
- Risk Management
- Oracle DB
- Data Lake
- ECL IDE