When we launched Tresata in 2011, the company was centered on one massive bet – that Hadoop would become the de rigueur data processing and analytics stack in the world. Three and a half years later, that bet may seem prescient to some, but to us it is simply a huge confirmation of our beliefs.
Not only have we successfully launched a range of software products that run entirely in Hadoop; our success as an enterprise software company also continues to remind the traditional stack vendors that Hadoop is not just the world’s most powerful compute technology but also the future of how software will be built, sold and delivered.
In three years, Hadoop has not only become the darling of the tech world, accepted by every tech company worth mentioning as a platform that is here to stay (even though they continue to undermine Hadoop by calling it a storage platform); it has also been approved for use in every large enterprise in every industry sector by every enterprise architect we know.
What we still do not see is a raft of enterprise software companies that have used Hadoop as an end-to-end computational software platform. That is something we have done quite successfully, making us definitely one of the first (if not THE first) full-stack software product companies building end-to-end analytics applications in Hadoop.
So, in the spirit of all things open source, we decided to publish which parts of the HDFS ecosystem constitute our stack – the very stack we build our software products on. We realize the HDFS/Hadoop ecosystem is incredibly rich, fast-paced and rapidly evolving; what we share below are the pieces we have placed our bets on, and why. That is not to say other Hadoop projects are not interesting and valuable. We simply chose not to leverage them, for a host of reasons, most of them commercial or technical. We have listed the projects we chose in the order of the functions typically needed to deliver analytics software. That ordering also aids comparison with the traditional tech stack (not the right thing to do, but a lot of people still think in that stack) and at the same time proves that the traditional stack is dead… or, said nicely, irrelevant.
We use the Hadoop 1 API for basic filesystem (HDFS) interaction, configuration and some low-level (highly optimized) Map-Reduce jobs. We restricted ourselves to the “old” mapred API since it is source- and bytecode-compatible across a very wide range of Hadoop 1.x and 2.x distributions (compile once, run anywhere).
Give Back – YES, over the years we have contributed a variety of major and minor fixes
Scala is our primary development language. It is concise, elegant, functional, runs on the JVM, and has excellent Java interoperability.
Give Back – NO, not much wrong with Scala
We use Avro containers as our primary storage format on HDFS, alongside bar-delimited text files. The main attractions of Avro are its resilience to changes in the data schema (added, renamed and removed columns), its type safety, and its compactness. Parquet is also on our radar, but we haven’t had time to look at it yet.
Give Back – YES… we have made minor fixes and improvements
Akka 2.2.x/Spray 1.2.x
Our long-lived services are built on top of Akka and Spray. We considered Play instead of Spray but liked Spray better because it is more lightweight and focuses on APIs rather than websites. Spray definitely has a learning curve (as do all the Scala DSLs that go a bit crazy… think SBT), but once you are used to it and have some templates to work from, it is great. We now plan to use Akka beyond the highly concurrent serving layer for services and APIs: Akka is also suitable for certain business logic and algorithms. In particular, we are interested in using Akka together with HBase for algorithms that require random access.
Give Back – NO… not yet
Scalding 2.11.x/Cascading 2.x API
Scalding/Cascading is an abstraction over mapred 1.x that lets you focus on your data transformation logic while it takes care of compiling that logic into Map-Reduce jobs for you. We started with Cascading 1.2 in Java but quickly shifted to Scalding and Scala. The bulk of our Map-Reduce codebase is written in Scalding. To get everything out of Scalding you sometimes need to drop down into Cascading (for some Fields or Tuple logic, or the incidental Function or Aggregator).
Give Back – YES… we became Chris Wensel’s online buddies for a while… love you, Chris.
Algebird is an algebra library for Scala open-sourced by Twitter (by the same team that wrote Scalding). Being fans of the “Twitter Big Data Stack”, we use Algebird heavily in our workflows. Algebird is primarily used for writing efficient reduce algorithms that can run inside distributed computation frameworks such as Map-Reduce and Spark.
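To make the idea of an “efficient reduce algorithm” concrete, here is a minimal sketch in plain Scala. It does not use Algebird itself: the `Monoid` trait below is a local stand-in that only mirrors the `zero`/`plus` idea, and `MonoidSketch`, `countAndSum`, `reducePartition` and `distributedMean` are hypothetical names made up for this illustration. The point it tries to show is that when the combine step is associative, each partition can be reduced on its own and the partial results merged afterwards.

```scala
// Standalone sketch: this Monoid trait is defined here for illustration and
// only mirrors the zero/plus idea; it is not imported from Algebird.
object MonoidSketch {
  trait Monoid[T] {
    def zero: T
    def plus(left: T, right: T): T
  }

  // (count, sum) pairs combine associatively, so a mean can be assembled
  // from partial results computed independently on different partitions.
  val countAndSum: Monoid[(Long, Double)] = new Monoid[(Long, Double)] {
    val zero: (Long, Double) = (0L, 0.0)
    def plus(l: (Long, Double), r: (Long, Double)): (Long, Double) =
      (l._1 + r._1, l._2 + r._2)
  }

  // Reduce the values of one partition locally (the role a combiner plays).
  def reducePartition[T](values: Seq[T], m: Monoid[T]): T =
    values.foldLeft(m.zero)(m.plus)

  // Hypothetical driver: split the input into chunks, reduce each chunk on
  // its own, then merge the partial results with the same monoid.
  def distributedMean(xs: Seq[Double], partitions: Int): Double = {
    val size = math.max(1, math.ceil(xs.size.toDouble / partitions).toInt)
    val partials = xs.grouped(size).map { part =>
      reducePartition(part.map(x => (1L, x)), countAndSum)
    }.toSeq
    val (n, sum) = reducePartition(partials, countAndSum)
    if (n == 0) 0.0 else sum / n
  }
}
```

Because only `zero` and `plus` are required, the same `reducePartition` could be reused for other aggregates (sets, maxima, sketches) by supplying a different monoid instance.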
Give Back – YES… we contribute fairly regularly to this project
Spark is a fast in-memory cluster computing framework that integrates well with HDFS (and lately also with YARN). Although Spark is primarily intended for distributed in-memory batch computations, its model is generic enough that it can also work on datasets that do not fit in memory, by spilling to disk or even working entirely on disk. Spark has an elegant Scala API with many powerful operations that Map-Reduce lacks. Most notably, Spark supports co-location of datasets, allowing for very efficient (map-side) joins.
We actually started out with Spark about a year ago as a potential Map-Reduce replacement, focusing on very large datasets, and we have one of the first certified applications on it in the market. However, it turned out to be somewhat early days for that (we had many OOMs and lost executors), so we now focus on using Spark for fast in-memory algorithms on somewhat smaller (but still large) datasets.
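Why co-location makes joins cheap can be sketched without Spark at all. The example below uses only plain Scala collections, and every name in it (`CoPartitionedJoinSketch`, `partitionOf`, `partition`, `joinCoPartitioned`) is hypothetical. It assumes a simple hash partitioner: if both datasets are split with the same function, matching keys are guaranteed to sit in the same partition index, so each pair of partitions can be joined locally.

```scala
// Standalone sketch in plain Scala collections; it does not use Spark's API.
object CoPartitionedJoinSketch {
  // Hypothetical partitioner: the same key always maps to the same partition.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val mod = key.hashCode() % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Split a keyed dataset into numPartitions buckets using partitionOf.
  def partition[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val byPartition = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => byPartition.getOrElse(i, Seq.empty[(K, V)]))
  }

  // Join two datasets that were partitioned the same way. Partition i on the
  // left is only matched against partition i on the right, so no record has
  // to move between partitions to complete the join.
  def joinCoPartitioned[K, A, B](
      left: Vector[Seq[(K, A)]],
      right: Vector[Seq[(K, B)]]): Seq[(K, (A, B))] = {
    require(left.size == right.size, "both sides must use the same partitioning")
    left.zip(right).flatMap { case (l, r) =>
      val lookup = r.groupBy(_._1) // per-partition index of the right side
      for {
        (k, a) <- l
        (_, b) <- lookup.getOrElse(k, Seq.empty[(K, B)])
      } yield (k, (a, b))
    }
  }
}
```

In Spark itself, the corresponding move is, roughly, to partition both pair RDDs with the same `Partitioner` (for example via `partitionBy` with a `HashPartitioner`) before calling `join`; the sketch above only illustrates the principle, not Spark's implementation.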
Give Back – YES… minor fixes and integration improvements
We initially didn’t care for YARN too much, since most of our clients were happy running services side-by-side and resource allocation across many services wasn’t their biggest worry. We also found YARN to be complex and bulky. However, YARN is clearly here to stay, and we will all have to run applications inside it. The good news is that with careful dependency management we can make our mapred 1.x, Scalding/Cascading and Spark applications run unchanged (and without recompilation) on YARN as well as on the standalone alternatives (MR1 and Spark standalone). We think that is pretty cool.
Give Back – NO… not yet
HBase 0.94.x and 0.96.x
Our story for HBase is similar to the one for YARN: we didn’t care for it too much at first (we actually liked Cassandra better, preferring availability over consistency in our designs). But HBase was on many of our clients’ roadmaps and integrates very well with the Hadoop stack, so we started using it as our primary random-access store, typically with a REST API serving layer in front of it so that our customers can even use it for real-time data access applications.
As you probably figured out by now, we like to compile our applications once and run them against as many different Hadoop distros as possible.
Give Back – NO… not yet
We just don’t care for it that much – or what we think of Hive
Hive’s design is such that it basically does not expose a stable Java API. Its only API seems to be the end-user (Hive SQL) API. This walled-garden design is not something we appreciate. Hive also comes with a boatload of (transitive) dependencies.
We initially tried to integrate with Hive (merely as a data access layer, never for computations) but recently dropped support for Hive integration altogether. Maybe things have improved somewhat in Hive land now that HCatalog provides a usable and stable Java API (still with all those dependencies), but it’s too little too late for us, as we have moved on, and I suspect so have many others (and if they haven’t they should).
Note that it’s still possible for our clients to load results from Tresata applications into Hive, and we can read many formats produced by Hive; we just don’t try to automate this integration from within our applications anymore. Spark seems to have interesting SQL support. The Spark developers initially also tried to integrate with Hive and dropped that, I suspect for reasons similar to ours. Now they have their own SQL engine on top of Spark, which seems very neat. Cascading also has its own SQL engine (Lingual) that is worth looking into.
Give Back – really… you need to ask that question?
1. No proprietary software stack products
2. 100% Hadoop inside… or nothing
3. Full-stack software solution – with Tresata analytics software, clients get the answers they want and do not need any SQL engines (fast or slow), relational DBs or Hadoop DBs, proprietary storage systems, or anything else that promises to make Hadoop better with proprietary Hadoop software
We hope this quick discussion gave you a peek into the true power of what the Hadoop ecosystem is enabling – a new class of enterprise software systems that are automating the discovery and delivery of intelligence and completely destroying traditional software products as a result. Why pay hundreds of millions for a database when for tens of millions you can get intelligence applications from Tresata!