
The Django Cassandra Engine – the Cassandra backend for Django

August 22, 2014

By Rafał Furmański, Python Developer at Opera Software International

Django Cassandra Engine is a simple database wrapper for Django that provides a very easy way to connect to and interact with a Cassandra database, just as every Django developer is used to. It uses cqlengine, which is currently the best Cassandra object mapper for Python.
In short, this backend is everything you need to start using Cassandra and Django together.

 

Django + Cassandra
Django is the most mature and popular web framework for Python, with a great community and a huge number of pluggable apps and extensions; I was almost always able to find an interesting package for whatever I needed. The first thing I did when I found out that we’d be using Cassandra in our new project was to search for an appropriate database backend for Django. I was really surprised to find there were none, so I decided to write one myself, with the goal of making it very simple and easy to use. After some research into the available Python clients, I chose cqlengine over Pycassa and the official DataStax Python driver, mainly because Pycassa does not support CQL and cqlengine was under very active development at the time.

 

NoSQL databases don’t have official support in Django, which I think is unfortunate, because they’re very popular and widely used these days. Projects like django-nonrel try to fill this gap by adding support for MongoDB or Google App Engine. In my opinion Cassandra is a great database and deserves at least basic Django support, too. My wrapper is an answer to that, and it will soon be battle-tested.

 

Learning Cassandra
I have only a few months’ experience with Cassandra, and I know there is probably a long road ahead of me before I know it well. We’re about to launch two big projects in the very near future, so I hope to experience its outstanding performance and high availability first-hand. What I can say after these couple of months is that Cassandra is fairly easy to learn and deploy. It also comes with a great management and monitoring tool called OpsCenter, which I like a lot!

 

Starting with the engine
First you have to install it, preferably using pip (pip install django-cassandra-engine). Then add ‘django_cassandra_engine’ to INSTALLED_APPS in settings.py and configure the DATABASES setting as you’re used to. After that, define some models, run the ‘syncdb’ management command, and you’re done! Everything is well documented on the wrapper’s site as well as cqlengine’s.
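The setup described above can be sketched as a settings.py fragment. This is illustrative only: the keyspace name and host are placeholders, and the exact option names should be checked against the wrapper’s documentation.

```python
# settings.py (illustrative fragment -- keyspace and host are
# placeholders; consult the django-cassandra-engine docs for the
# authoritative option names)

INSTALLED_APPS = [
    'django_cassandra_engine',   # enables the Cassandra backend
    # ... your other apps ...
]

DATABASES = {
    'default': {
        'ENGINE': 'django_cassandra_engine',  # the backend module
        'NAME': 'my_keyspace',                # a keyspace, not a DB name
        'HOST': '127.0.0.1',                  # a Cassandra contact point
    }
}
```

With this in place, defining models and running ‘syncdb’ proceeds as described above.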

 

If you would like to see a “must have” feature or you find any bug in my wrapper, just contact me and I’ll try to implement or fix it as soon as possible. I encourage all of you to learn Cassandra and not be afraid to use it, especially with Python and Django.

A series on Cassandra – Part 1: Getting rid of the SQL mentality

August 20, 2014

By Websudos

At Websudos, Cassandra and DataStax Enterprise are core technologies in our area of competence, and moreover some of our favorite technologies in existence. There aren’t many things that can easily give you 100k writes per second, but if that’s what you are looking for, you are in the right place.

We have gotten to learn a lot about Cassandra by using it in multiple projects, and especially while writing the Scala DSL for it, our very own phantom.

From its humble beginnings as an afternoon project, phantom has become the de facto standard for Scala-based adopters of Cassandra, with full support for a type-safe CQL flavor, testing automation, automation of service discovery with ZooKeeper, and many other cool features.

In the series of posts to follow, we share some of our Cassandra experience and help you transition from a traditional SQL background to a NoSQL mindset, in an example-driven series of do’s and don’ts. Cassandra is an incredible combination of power and simplicity, and we’d love to show you all the fine points, and possibly challenge some of our own beliefs while we’re at it.

 

The top 5 mindset changes in Cassandra adoption

1. Normalization of data is not necessary, duplication is ok.

In Cassandra, data is meant to be completely de-normalized. Duplication of data is not only possible, it’s encouraged, and often the only way to model the SQL equivalent. Writing is considered an extremely cheap operation, so whether you do 10 writes per second or 100k writes per second at capacity, Cassandra will offer you horizontal scalability at the remarkable rate of about 7k writes per second per node (as measured by Netflix). More details on performance can be found in this Planet Cassandra post or the famous Netflix benchmarking test, where at 288 nodes in a cluster the number of writes approached 1.1 million per second (no, that is not a typo).

 

2. Query expressiveness is traded for performance.

If you were to compare CQL to the traditional SQL standard, you would notice an incredible decrease in complexity. We get to feel that pain first-hand with our new upcoming reactive SQL DSL, morpheus. CQL is very similar to SQL in terms of keywords, so it’s also a great, familiar-looking place for most engineers, if not all. However, you can no longer compare and search for matches on arbitrary columns, you have no joins, and barely any complexity will ever exist in your queries. There is no such thing as a stored procedure.

What you get instead is a really simple and concise query language that is both apparently limited and yet wonderfully empowering. For those who agree simplicity is a good thing, CQL definitely fits the bill, and while I’m sure some may argue more complexity is necessary, the language is expressive enough for anything you could possibly want, and it takes about two hours to master in its entirety. That’s a pretty cool use of an afternoon.

 

3. In Cassandra, you need to plan your queries in advance.

Thanks to newly emerging technologies like MongoDB, and also to the traditional relational model, most people have the query-by-any-column bias deeply ingrained in their thought process. You’d think it’s trivial to query by any field in your database, but that’s not really how things work in Cassandra. The performance boost has to come from somewhere, and it comes from Cassandra’s ability to do extremely little work to retrieve the data for your beloved queries.

Let’s take the very simple use case below, where you have a table of people, and the fields are id and firstName. More than simple enough for a textbook default. Now say you want to query things by their id. Quite simple: make id the “Partition Key” and, done, you can query by id. A “Partition Key” is the way rows are allocated inside Cassandra storage. It has some interesting properties, which we will cover in a later post. For now, just know that the first part of the “Primary Key” is the “Partition Key”.

And the Phantom DSL equivalent:

 

Now, you want to query your People table by a first name, with a simple query as follows:

 

It looks quite simple and straightforward, a great reminder of the SQL equivalent SELECT * FROM People WHERE firstName = ‘whatever’. However, this is not possible in Cassandra, and phantom won’t even let you compile it. Why? Because firstName cannot be serialized to form the primary key hash. In other words, it’s not an index or even part of an index; it’s just “stuff to store” as far as Cassandra is concerned, not “stuff to query”. Cassandra has no way to re-create, from the firstName alone, the hash of the row where your data lives. Simple as that.

You really have to think of Cassandra as one giant and overpowered java.util.HashMap when you want to build indexes. I hope the Cassandra team doesn’t hold this against us, but it’s a good way to simplify. What does a HashMap do in essence? Jump to reference. That’s exactly what Cassandra does, and although it has some clever ways to build that “jump-to-reference” or the hash, such as Compound keys or Composite keys, it’s still a single hash per match model.
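The HashMap analogy can be made concrete with a plain Python dict (a toy sketch, not real Cassandra code): a lookup by key is a single jump to a reference, while “querying” by a stored value forces a scan over every entry.

```python
# Toy model of the "giant HashMap" view of Cassandra: the
# partition-key hash -> row jump is just a dict lookup.
people = {
    "id-1": {"firstName": "Alice"},
    "id-2": {"firstName": "Bob"},
}

# Query by primary key: one jump straight to the row.
row = people["id-1"]

# "Query" by a non-key column: there is nothing to jump to, so
# every entry must be scanned -- exactly the work Cassandra
# refuses to do for you without ALLOW FILTERING.
matches = [pid for pid, p in people.items() if p["firstName"] == "Bob"]
```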

That’s why you plan in advance, and you think really hard about how to simplify and make as few columns as possible part of your Primary Key. You will need to produce the full Primary Key every time you want to query, as otherwise the Murmur3 hash used by default obviously cannot be reproduced.

A few tricks of the trade.

Put as few columns in an index as possible: the more columns you have, the less flexibility you get, and the harder it becomes to keep producing all of the Primary Key data at query time. The value of any column that is part of a Primary Key cannot be updated either, so you are “stuck” with whatever you write.

Complex querying is expensive and the secret is simplicity. You can only do very simple things at a very large scale. The full page stored procedures in SQL are bad old memories in Cassandra. Querying several tables at one time to fetch your data and composing Futures to achieve that is quite common.

Avoid secondary indexes. Somewhat like SQL, Cassandra gives you an index column, but this was implemented more as a marketing decision than as a technical reality. Steer clear of secondary indexes for anything remotely performance critical.

Querying those often requires the dreaded ALLOW FILTERING, which means getting the right matches will be done by Cassandra, but in memory, at query time. You can see how this gets really messy after the first few thousand records. Simply ENABLE TRACING at query time and you can witness the scale of the damage yourself.

 

4. Duplicate data and maintain consistency at application level

“Ok, so indexing and querying by random columns is difficult, but I just want to query by firstName.” There is a very simple solution to that problem, data duplication.

Basically, you create another table where the column you’d like to query by is the primary key, and the “other column” or columns hold the piece of data they relate to in the original table.
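The two-table layout can be sketched with plain Python structures standing in for the tables (illustrative only; the real tables would be defined in CQL or through phantom): one map keyed by id, and a duplicate map keyed by firstName that points back at ids.

```python
# Original table: the partition key is the person's id.
people = {
    "id-1": {"firstName": "Alice"},
}

# Duplicate "index" table: the partition key is firstName, and the
# stored value is the id it relates to in the People table.
people_by_first_name = {
    "Alice": "id-1",
}

# Querying by firstName is now two cheap key lookups instead of an
# impossible non-key query.
person = people[people_by_first_name["Alice"]]
```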

In this example, we relate the firstName to the id of the person in the People table. The CQL looks like this:

 

And the Phantom DSL equivalent:

 

Now what you can do is quite nice. If you are using Scala with phantom, you can use Futures and compose them to achieve consistency at application level, and the same goes for any client-side application capable of async execution.

The example below is intentionally verbose, but you can of course “for yield”, or if you are particularly trusting of your network you can do parallel writes. In the case below, we wait for the first write to complete before initiating a second. In the same pattern, you sync up every subsequent operation, with a few bumps along the way.
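The chained-write pattern can be sketched in Python with `concurrent.futures`. The two `insert_*` functions and the dicts are hypothetical stand-ins for your client’s async writes against the two tables, not a real driver API.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for the two tables; in a real application the
# insert_* functions below would be async driver calls.
people = {}
people_by_first_name = {}

def insert_person(pid, first_name):
    people[pid] = {"firstName": first_name}

def insert_person_by_name(first_name, pid):
    people_by_first_name[first_name] = pid

with ThreadPoolExecutor(max_workers=2) as pool:
    # Wait for the first write to complete before initiating the
    # second, so the "index" row never points at a missing person.
    first = pool.submit(insert_person, "id-1", "Alice")
    first.result()  # block until write 1 has finished
    second = pool.submit(insert_person_by_name, "Alice", "id-1")
    second.result()
```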

Deletes now need to run side by side for consistency purposes, and updates to the PeopleByFirstName table are actually an INSERT followed by a DELETE, as you can no longer update the firstName in that table: it’s part of the primary key now, or more specifically it’s the partition key. But with any decent client this is remarkably simple and surprisingly satisfying.
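That insert-then-delete “update” can be sketched as follows, again with dicts standing in for the two tables (hypothetical names, not phantom code):

```python
# Dicts standing in for the two tables.
people = {"id-1": {"firstName": "Alice"}}
people_by_first_name = {"Alice": "id-1"}

def update_first_name(pid, old_name, new_name):
    """Rename a person while keeping both tables consistent.

    In the index table the name is the partition key, so it cannot be
    updated in place: insert the new row, then delete the old one.
    """
    people[pid]["firstName"] = new_name
    people_by_first_name[new_name] = pid   # INSERT the new index row
    del people_by_first_name[old_name]     # DELETE the old index row

update_first_name("id-1", "Alice", "Alicia")
```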

 

 

5. Consistency is important

Now that you’ve come a long way in your journey to CQL, it’s time to divest yourself completely of the SQL performance limitations. Your local Postgres is well capable of taking things to 20 million records while giving you decent sort performance and query capability, all on an average machine with no fancy gear required.

But if you are doing “serious business”, you didn’t waste all this time just for fun. That’s where the last of the big issues comes into play: consistency. More detailed information is available here, but if you are looking for a rule of thumb, data that is required to be immediately available lends itself to high consistency levels such as LOCAL_QUORUM or ALL. If you expect real-time API calls over the writes you make, set the consistency high into the sky.

Don’t be afraid to pay the price of consistency, even if Cassandra has to run around a bit more under the hood to ensure it. It’s often well worth the time cost when you come across things like large discrepancies between nodes, where some nodes still have the data while others have successfully performed a delete (the so-called “zombies”). Cassandra advertises the model of eventual consistency, with “tunable consistency” bundled, but the default consistency level is often not enough.

Coming back to performance costs, sometimes it’s not worth paying for the extra network round trips and wait time, as at large scale it may cost quite a lot of money. But the rule of thumb is again simple and rewarding: if you are dealing with analytical data, reports, and things you are going to process in Hadoop or Spark, and you’re happy to get the results at a later point in time, you can ease off and save yourself a buck. That generally reduces the workload on the clusters enough to keep the P&L statements looking great, and with the spare cash you can get yourself a DataStax Enterprise license and a whole lot of really cool features, many of which we will cover in this series.

 

THE END

That marks the end of our introduction to Apache Cassandra. We look forward to your feedback and comments and we hope you’ve found it interesting! Stay tuned for more in this series.

“A series on Cassandra – Part 1: Getting rid of the SQL mentality” was written by Websudos.co.uk.

Ilya Beyer, CTO at nScaled
"We could not get the same performance as we achieved in Cassandra for our use-cases."

nScaled is a disaster recovery and backup-as-a-service provider. We have multiple data centers worldwide, and we’ve developed a cloud-based, hub-and-spoke software platform that performs fully automated backup and disaster recovery.

When it comes to file backup, our solution offers highly efficient distributed storage. That’s where Cassandra comes in.

 

Selecting Cassandra

A couple of years ago, we decided to build our own cloud-based file system capable of performing fast writes and reads, auto-versioning, retention, de-duplication, WORM (Write Once Read Many) compliance, and many other functions.

Predominantly, we were looking for super-fast writes for very large incoming payloads; in the backup world, 99% of your operations are writes. You write a lot, and you read occasionally. Another important factor was ease of use. We tried OpenStack Swift, but it was hard to manage. We then tried Riak and HBase, and we could not get the same performance as we achieved in Cassandra for our use-cases.

Cassandra at nScaled

We started all the way back with Cassandra 1.0, and have since upgraded to 1.2. We have our own data center in Dallas, where our Cassandra cluster is deployed. Today the cluster consists of 75 nodes.

As mentioned earlier, we wanted to build de-duplication for our Cloud Archived File System: we want to make sure that identical file parts are not stored twice, even though multiple files may be using the same blob. As a side effect of achieving de-duplication, for every potential write operation we actually have to perform some reads. We purposely designed our schema to keep file-chunk metadata on small SSD disks, while the actual blobs reside on large SATA drives.

This is where we give Cassandra some additional help, by storing small metadata on flash disks, in order to perform fast random reads.

Today, Cloud Archived File System offers automatic encryption, auto-versioning, direct file streaming, partial or full restore, mirroring, and many different retention policy types. We also had to build a retention policy lock for customers’ storage partitions that are marked WORM compliant: a user cannot reduce the retention timespan once storage is marked WORM compliant, they can only increase it.
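The retention-lock rule described above (a WORM-compliant partition’s retention can only ever be extended) can be sketched as follows. The class and method names are hypothetical, not nScaled’s actual code:

```python
class StoragePartition:
    """Toy model of the WORM retention lock described above."""

    def __init__(self, retention_days, worm_compliant=False):
        self.retention_days = retention_days
        self.worm_compliant = worm_compliant

    def set_retention(self, days):
        # Once marked WORM compliant, retention may only increase.
        if self.worm_compliant and days < self.retention_days:
            raise ValueError("WORM lock: retention can only be extended")
        self.retention_days = days

p = StoragePartition(retention_days=30, worm_compliant=True)
p.set_retention(90)   # extending the timespan is allowed
```

Attempting `p.set_retention(10)` on this WORM-locked partition would raise, which is exactly the lock behavior a user-facing retention API needs to enforce.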

Today, we’re storing close to 108 terabytes in Cassandra, growing at a rate of 3-5 terabytes a month. With the addition of new customers, the growth will naturally become even bigger.

Performance-wise, we have no complaints with Cassandra. The system works well and as expected. We knew that writes were going to be tremendously fast, and writes make up 99% of our total operations.

Lessons learned

Setting up our data model was a very iterative process for us. We were learning Cassandra as we were defining requirements, and we had to massage our data model carefully. It took some time to define all the use-cases so that we could design an appropriate schema. Knowing your use-cases and requirements before you start modeling is highly important.

We wish we had known that purging of backed-up data would be far more active. One thing we didn’t predict was that our customers would be far more aggressive with file retention due to budget constraints. As our customers’ data grew, they reached their purchased entitlements and started saying, “I understand, but I don’t want to be paying for more data”, and applying more aggressive retention policies. Our retention-enforcement engine then starts marking data with tombstones, and leveled compaction in Cassandra 1.2 is not very efficient when it comes to vacating space. We really look forward to moving to Cassandra 2.1, where compaction will do a better job of cleaning up tombstone-marked columns for our Cloud Archived File System.

Multi-Datacenter Cassandra on 32 Raspberry Pi’s

August 20, 2014

By Brandon Van Ryswyk and Daniel Chin

Here at DataStax, my fellow intern Daniel Chin and I built a 32-node DataStax Enterprise cluster running on Raspberry Pi’s! We are showcasing the always-on, fault-tolerant nature of Cassandra by letting anybody take down an entire data center with the press of a Big Red Button in our lobby.

lobby_wide

Being able to withstand a data center going down is not just an edge case; it is an absolute necessity for the highly available applications Cassandra powers. While the cloud is far more flexible for production use, nothing beats a big shiny hardware display for a demo.

Our main goal for this project was to take the abstract concept of fault tolerance and make it something you can see in action and interact with. We built upon the work of Travis Price and the DataStax Sales Engineering team, who pioneered using Raspberry Pi’s to demonstrate Cassandra.

cluster_closeup_final

The Build Process:

The Hardware:

 We wanted the overall display to look clean and professional enough to be appropriate for the lobby at our headquarters, but expose enough of the technology to be a compelling and unique demo.

As DataStax is a software company, fabricating hardware came with a unique set of challenges. (“Hey, do we have a shop vac?” “No.”) I ended up drawing on my experience with Solidworks (a popular CAD program) from high school FIRST robotics to design all of the acrylic, and had it cut using a laser at a local machine shop. The assorted mounting hardware and the pedestal were sourced from McMaster Carr.

lots_of_pis

The Electronics:

Each Pi is running at its factory clock settings, and is completely unmodified. To avoid latency problems and to ensure our Pi’s stayed online, we transitioned off of WiFi and used ethernet cables and switches instead.

To get power to each Pi, we use micro USB cables that are connected to five port USB hubs that are then plugged in to two power strips, one for each data center. This makes it easy to set up, and doesn’t require building any custom power distribution rails.

Our large red button is connected to an Arduino that actuates a power relay to cut AC power to the network switch for Datacenter RED. The Arduino provides timing control, and makes the button inoperable during the network outage.

daniel_assembling

The Software:

The cluster is set up as a two-datacenter DSE 4.5 cluster, with OpsCenter 5.0 running to show the status of all the nodes. As we expected, running a high-performance enterprise database on a computer with a single-core 700 MHz processor and 512MB of RAM is not trivial.

We are using vnodes, and have throttled all of the cassandra.yaml values to the lowest intensity we can to squeeze C* to within our hardware constraints.

With Cassandra running, each Pi has 8 to 11 megabytes of free RAM.

For reference, our documentation currently recommends 16 cores and 24GB of RAM for a production system.

cables_wall

While this cluster won’t be setting speed benchmarks any time soon, we hope that it gets people excited about Cassandra and its incredible always on capabilities!

pedestal_pic_teaser

To read more about the multi-datacenter capabilities of Apache Cassandra, check out “Data Replication in NoSQL Databases Explained”.

“Multi-Datacenter Cassandra on 32 Raspberry Pi’s” was created by Brandon Van Ryswyk, Software Development Intern at DataStax & Computer Science Student at UC Berkeley, and Daniel Chin, Software Development Intern at DataStax & Electrical Engineering, Computer Science, and Mathematics Student at UC Berkeley.

Cassandra Day Seattle 2014 Video/Slides: High Throughput Analytics with Cassandra and Azure

August 18, 2014


MetricsHub is a monitoring and scalability service for the public cloud, allowing customers to gather large amounts of data and to analyze and act on it in real time. Taking advantage of Cassandra’s rapid ingestion rates and the elastic scale of Azure, MetricsHub analyzes billions of data points every day to reduce cost and improve availability for its customers.

Charles Lamanna currently works on the Windows Azure monitoring team to define the next generation of cloud monitoring and management. Charles was responsible for technical and business areas at MetricsHub. He was a member of the founding team and developed the company from the idea stage, to revenue, and then to exit. MetricsHub was acquired by Microsoft on March 4th, 2013, and the premium MetricsHub product was offered as a no-charge service following the acquisition.


Be sure to check out all of the sessions from Cassandra Day Seattle on the Cassandra Day Seattle 2014 YouTube playlist.
