“Using the Cassandra Bulk Loader, Updated” was created by Yuki Morishita, Apache Cassandra Committer.
We introduced sstableloader back in 0.8.1 to bulk load data into Cassandra. When it was first introduced, we wrote a blog post about its usage, along with how to generate SSTables to bulk load.
Now that Cassandra 2.1.0 has been released, bulk loading has evolved since that old blog post. Let’s see how the changes make our lives easier than before.
The specific changes are:
- sstableloader no longer participates in gossip membership to get schema and ring information. Instead, it just contacts one of the nodes in the cluster and asks for it. This allows you to bulk load from the same machine where Cassandra is running, since it no longer listens on the same port as Cassandra.
- Internally, the streaming protocol has been redesigned. You can stream data more efficiently than before.
- CQLSSTableWriter is introduced (CASSANDRA-5894). You can now create SSTables using familiar CQL.
In the old post, we showed two scenarios where sstableloader is used. Let’s see how the changes work in those scenarios.
I use Apache Cassandra 2.1.0 throughout this example, from the cluster to running sstableloader.
Example 1 – Loading existing SSTables
sstableloader itself has not changed much, but because it has to contact a node to get the schema for loading SSTables, you have to specify the address(es) of the node(s) when you invoke it.
So for example, you want to bulk load to
As you can see, some stats are printed out after the bulk load.
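The invocation described in this example can be sketched as follows. This is a minimal sketch, not the exact command from the run above: it assumes the -d (--nodes) option of sstableloader for the initial hosts, and the helper name SSTableLoaderCommand, the addresses, and the directory path are all hypothetical. sstableloader takes the keyspace and table names from the last two components of the directory path, so the directory is assumed to end in keyspace/table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assembles the argument list for an sstableloader run.
class SSTableLoaderCommand {

    // initialHosts: nodes to contact for schema and ring information.
    // sstableDir: directory holding the SSTables, assumed to end in keyspace/table.
    static List<String> build(List<String> initialHosts, String sstableDir) {
        if (initialHosts.isEmpty()) {
            throw new IllegalArgumentException("at least one initial host is required");
        }
        List<String> argv = new ArrayList<>();
        argv.add("sstableloader");
        argv.add("-d"); // -d / --nodes takes a comma-separated host list
        argv.add(String.join(",", initialHosts));
        argv.add(sstableDir);
        return argv;
    }

    public static void main(String[] args) {
        // Placeholder addresses and path; replace them with your own.
        List<String> argv = build(
                Arrays.asList("192.0.2.10", "192.0.2.11"),
                "/path/to/my_keyspace/my_table");
        System.out.println(String.join(" ", argv));
        // To actually run it, something like:
        // new ProcessBuilder(argv).inheritIO().start().waitFor();
    }
}
```

Listing more than one host is optional. A single reachable node is enough for sstableloader to fetch the schema and ring information.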
Example 2 – Loading external data
In the old post, we had an example that creates SSTables from CSV using UnsortedSimpleSSTableWriter and uses sstableloader to load them into a Cassandra cluster.
The schema there was created with Thrift, and it has a simple, flat table structure.
For this updated post, let’s do a more complex scenario with the new CQLSSTableWriter.
We will use real data from Yahoo! Finance to load historical stock prices as a time series.
If we take a look at the CSV file for Yahoo! (YHOO), it has 7 fields in it.
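To make those seven fields concrete, here is a minimal sketch of parsing one line of such a CSV into typed values. It assumes the column order Date, Open, High, Low, Close, Volume, Adj Close, with an ISO-formatted date. The class name PriceLine is hypothetical, and the sample line in main is made up for illustration rather than taken from real price data.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Hypothetical holder for one line of the historical prices CSV.
class PriceLine {
    final LocalDate date;
    final BigDecimal open;
    final BigDecimal high;
    final BigDecimal low;
    final BigDecimal close;
    final long volume;
    final BigDecimal adjClose;

    PriceLine(LocalDate date, BigDecimal open, BigDecimal high, BigDecimal low,
              BigDecimal close, long volume, BigDecimal adjClose) {
        this.date = date;
        this.open = open;
        this.high = high;
        this.low = low;
        this.close = close;
        this.volume = volume;
        this.adjClose = adjClose;
    }

    // Assumed column order: Date,Open,High,Low,Close,Volume,Adj Close
    static PriceLine parse(String line) {
        String[] f = line.trim().split(",");
        if (f.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + f.length);
        }
        return new PriceLine(
                LocalDate.parse(f[0].trim()),   // ISO date, e.g. 2014-01-02
                new BigDecimal(f[1].trim()),
                new BigDecimal(f[2].trim()),
                new BigDecimal(f[3].trim()),
                new BigDecimal(f[4].trim()),
                Long.parseLong(f[5].trim()),
                new BigDecimal(f[6].trim()));
    }

    public static void main(String[] args) {
        // Made-up sample line, not real market data.
        PriceLine p = parse("2014-01-02,10.50,11.00,10.25,10.75,1000,10.75");
        System.out.println(p.date + " close=" + p.close + " volume=" + p.volume);
    }
}
```

Prices are kept as BigDecimal rather than double so that values such as 10.50 keep their scale exactly as written in the file.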
Let’s use the ticker symbol as our partition key and the ‘Date’ field as the clustering key.
So the schema looks like:
CREATE TABLE historical_prices (
PRIMARY KEY (ticker, date)
) WITH CLUSTERING ORDER BY (date DESC);
We use CLUSTERING ORDER BY to query recent data easily.
Generating SSTable using CQLSSTableWriter
How do you bulk load data into such a schema? If you choose to use UnsortedSimpleSSTableWriter as we did in the old post, you have to manually construct each cell of a complex type to fit your CQL3 schema. This requires deep knowledge of how CQL3 works internally.
With CQLSSTableWriter, all you need is the DDL for the table you want to bulk load and an INSERT statement to insert data into it.
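As a rough illustration of those two inputs, the sketch below builds a DDL string and a matching INSERT statement with one bind marker per column. Several things here are assumptions rather than details from this post: the keyspace name quote, the value column names, and their CQL types. Only the table name, the primary key, and the clustering order follow the schema above. The class name BulkLoadStatements is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that produces the two statements CQLSSTableWriter is given.
class BulkLoadStatements {
    // The keyspace "quote" is an assumed name; the table name must be fully qualified.
    static final String TABLE = "quote.historical_prices";

    // Assumed columns and types; insertion order is preserved by LinkedHashMap.
    static Map<String, String> columns() {
        Map<String, String> c = new LinkedHashMap<>();
        c.put("ticker", "ascii");
        c.put("date", "timestamp");
        c.put("open", "decimal");
        c.put("high", "decimal");
        c.put("low", "decimal");
        c.put("close", "decimal");
        c.put("volume", "bigint");
        c.put("adj_close", "decimal");
        return c;
    }

    static String ddl() {
        List<String> defs = new ArrayList<>();
        for (Map.Entry<String, String> e : columns().entrySet()) {
            defs.add(e.getKey() + " " + e.getValue());
        }
        defs.add("PRIMARY KEY (ticker, date)");
        return "CREATE TABLE " + TABLE + " (" + String.join(", ", defs)
                + ") WITH CLUSTERING ORDER BY (date DESC)";
    }

    static String insert() {
        List<String> names = new ArrayList<>(columns().keySet());
        // One "?" bind marker per column, in the same order as the names.
        String markers = String.join(", ", Collections.nCopies(names.size(), "?"));
        return "INSERT INTO " + TABLE + " (" + String.join(", ", names)
                + ") VALUES (" + markers + ")";
    }

    public static void main(String[] args) {
        System.out.println(ddl());
        System.out.println(insert());
        // With the Cassandra jars on the classpath, these strings would be passed to
        // CQLSSTableWriter.builder().inDirectory(dir).forTable(ddl()).using(insert()).build(),
        // followed by addRow(...) for each record and close() at the end.
        // That part is not executed in this sketch.
    }
}
```

The values given to each addRow call have to line up with the bind markers, so keeping the column list in one place avoids the INSERT and the row values drifting apart.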
You can see the complete example on my GitHub.
After generating the SSTables, you can just use sstableloader to load them into the target cluster as described before.
There are still some limitations in CQLSSTableWriter: for example, you cannot use it in parallel, and user-defined types are not supported yet.
But we keep improving it, so stay tuned to Apache JIRA.
SSTable generation and bulk loading have both improved over the past releases. There are many new features available to make your life easier.
Start experimenting yourself today!