November 29th, 2012

One of the interesting subplots of “Hadoop Strata World” aka “Cassandra Denial World” was the battle for mindshare over who the HDCIC (Head Dremel Clone In Charge) is.

The two contenders are and Apache Drill (of course its hard to say Drill is a contender since it is nothing but a proposal) Both are attempting to produce the best Dremel clone

If I had to sum up Dremel, I would sum it up in one sentence “Make protobufs query-able”. Dremel does some clever stuff to break row orientated protobufs into a columnar form and store it in BigTable (for optimization on the read side as a more real time query system)

I had not recalled reading the dremel paper in a while, but after all the hoopla it was getting I decided to read it again. I came to a funny conclusion:

There is another open source Dremel implementation.

Like many of my “taco bell” exploits, my system does not require anything other then Hadoop and Hive. I know, I’m sorry, my system does not have 17 components and it’s not 5,000,000 lines of source code. So its not sexy, I won’t have 50 people in my company retweeting and calling it the most innovative thing in the last thousand years. I will not be unveiling it at a super huge conference with rap stars or anything like that.

Anyway here it is:

1) You write protobuf to a sequence file
2) You make a hive table declaring what protobuf objects are in the file
3) The protobuf is converted to nested struct types
4) You query the hive table using familiar hive syntax and lateral views

Let us see how this works by making some protobuf objects. Cars make great examples.

message TireMaker {
  optional string maker = 1;
  optional int32 price = 2;
message Tire {
  optional int32 tirePressure = 1;
  optional TireMaker tireMaker = 2;
message Accessory {
  optional string name = 1;
  optional string value = 2;
message Car {
  repeated Tire tires = 1;
  repeated Accessory accessories = 2;


Now write those to a sequence file.

SequenceFile.Writer w = SequenceFile.createWriter(this.getFileSystem(),
            new Configuration(), p, BytesWritable.class, BytesWritable.class);
    TireMaker.Builder maker = TireMaker.newBuilder();
    Car.Builder car = Car.newBuilder();
    Tire.Builder tire = Tire.newBuilder();
    BytesWritable key = new BytesWritable();
    BytesWritable value = new BytesWritable();
    ByteArrayOutputStream s = new ByteArrayOutputStream();;
    ByteArrayOutputStream t = new ByteArrayOutputStream();
    key.set(s.toByteArray(), 0, s.size());
    value.set(t.toByteArray(), 0, t.size());
    w.append(key, value);

Then we create a hive table on top of this sequence file.

    client.execute(“create table “+table+” “

            + ” ROW FORMAT SERDE ‘” + ProtobufDeserializer.class.getName() +“‘”
            + ” WITH SERDEPROPERTIES (‘KEY_SERIALIZE_CLASS’='” +Ex.Car.class.getName()
            + “‘ )”
            + ” STORED AS INPUTFORMAT ‘” +KVAsVSeqFileBinaryInputFormat.class.getName() + “‘”
            + ” OUTPUTFORMAT ‘” + SequenceFileOutputFormat.class.getName() +“‘”);

    client.execute(“load data local inpath ‘” + p.toString() + “‘ into table “+table+” “);

Then we query it.

    client.execute(“SELECT key FROM “+table);
    List<String> results = client.fetchAll();
    String expected=“{\”accessoriescount\”:0,\”accessorieslist\”:[],\”tirescount\”:1,\”tireslist\”:[{\”tiremaker\”:{\”maker\”:\”badyear\”,\”price\”:null},\”tirepressure\”:null}]}”;

Wow look at that! A system to query protobuf objects directly!

Why is this nice? Well. It works. You can query it. You do not need a stack of 40 components to make it work. It’s not a work in progress.

Dremel and its clones “normalize” the object on insert by writing it to multiple tables/ column families/what have you. Hive-protobuf takes the other approach. We write the entire row as is to Hadoop. This is a trade off we are all familiar with. We are trading insertion speed for query speed. Both approaches have merit, possibly both can be used together.

The inspiration for hive-protobuf came from the hive-avro work recently added to hive trunk. Which one could argue is another Dremel clone since it lets you query complex structured avro docs.