One of the interesting subplots of “Hadoop Strata World” aka “Cassandra Denial World” was the battle for mindshare over who the HDCIC (Head Dremel Clone In Charge) is.
The two contenders are https://github.com/cloudera/impala and Apache Drill (though it’s hard to call Drill a contender, since it is nothing but a proposal). Both are attempting to produce the best Dremel clone.
If I had to sum up Dremel, I would sum it up in one sentence: “Make protobufs query-able.” Dremel does some clever stuff to break row-oriented protobufs into a columnar form and store it in BigTable (for optimization on the read side, as a more real-time query system).
I had not read the Dremel paper in a while, but after all the hoopla it was getting I decided to read it again. I came to a funny conclusion:
There is another open source Dremel implementation.
Like many of my “taco bell” exploits, my system does not require anything other than Hadoop and Hive. I know, I’m sorry, my system does not have 17 components and it’s not 5,000,000 lines of source code. So it’s not sexy; I won’t have 50 people in my company retweeting and calling it the most innovative thing in the last thousand years. I will not be unveiling it at a super huge conference with rap stars or anything like that.
Anyway here it is:
1) You write protobuf to a sequence file
2) You make a hive table declaring what protobuf objects are in the file
3) The protobuf is converted to nested struct types
4) You query the hive table using familiar hive syntax and lateral views
Let us see how this works by making some protobuf objects. Cars make great examples.
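Something like this, say (a hypothetical schema for illustration; the actual messages from my code are not shown here):

```protobuf
// Illustrative proto2 schema. A repeated nested message is exactly the
// kind of structure that becomes an array<struct<...>> on the Hive side.
message Wheel {
  optional int32 radius = 1;
  optional string brand = 2;
}

message Car {
  optional string make = 1;
  optional string model = 2;
  repeated Wheel wheels = 3;
}
```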
Now write those to a sequence file.
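A minimal sketch of this step, assuming a protobuf-generated Car class and the classic Hadoop SequenceFile writer API (the names and path here are illustrative, not lifted from my actual code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

// Serialize each Car protobuf and append it as the value of a
// SequenceFile record. 'cars' is some Iterable<Car> you built above.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path p = new Path("/tmp/cars.seq");
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, p, NullWritable.class, BytesWritable.class);
for (Car car : cars) {
  byte[] bytes = car.toByteArray();   // standard protobuf serialization
  writer.append(NullWritable.get(), new BytesWritable(bytes));
}
writer.close();
```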
Then we create a hive table on top of this sequence file.
client.execute("create table " + table + " "
client.execute("load data local inpath '" + p.toString() + "' into table " + table + " ");
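For reference, the full DDL (truncated above) would look roughly like this — the column types and the SerDe class name here are hypothetical, depending on which hive-protobuf build you are using:

```sql
-- Hypothetical sketch of the create-table statement.
CREATE TABLE cars (
  car STRUCT<make:STRING, model:STRING,
             wheels:ARRAY<STRUCT<radius:INT, brand:STRING>>>)
ROW FORMAT SERDE 'com.example.ProtobufSerDe'   -- hypothetical SerDe class
STORED AS SEQUENCEFILE;
```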
Then we query it.
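For example, assuming a cars table whose car column is a struct with a repeated wheels field (all names hypothetical), a lateral view flattens the nested array:

```sql
-- One row per wheel: explode the repeated field with a lateral view.
SELECT t.car.make, w.wheel.radius
FROM cars t
LATERAL VIEW explode(t.car.wheels) w AS wheel;
```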
Wow look at that! A system to query protobuf objects directly!
Why is this nice? Well. It works. You can query it. You do not need a stack of 40 components to make it work. It’s not a work in progress.
Dremel and its clones “normalize” the object on insert by writing it to multiple tables / column families / what have you. Hive-protobuf takes the other approach: we write the entire row as-is to Hadoop. This is a trade-off we are all familiar with — we gain insertion speed at the cost of query speed. Both approaches have merit, and possibly both can be used together.
The inspiration for hive-protobuf came from the hive-avro work recently added to Hive trunk, which one could argue is another Dremel clone, since it lets you query complex structured Avro docs.