Blog

Cassandra South Bay Meetup Slides: Social Media Security Company Nexgate’s Success Relies on Apache Cassandra

April 23, 2014

For this meetup we were excited to be joined by two members of the Nexgate team: Rich Sutton, CTO, and Harold Nguyen, Senior Data Scientist.

What You’ll Learn

The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we’ve ever scanned, forever, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This is how DataStax and Cassandra make it easy.

Rich Sutton, CTO at Nexgate
Rich Sutton is the CTO at Nexgate, where he leads all product development. Rich has been in the security industry for almost 20 years, building everything from desktop software to cloud services to network appliances.

Harold Nguyen, Senior Data Scientist at Nexgate
Harold Nguyen, Senior Data Scientist at Nexgate, is responsible for the product’s classification system, including natural language processing and text-based machine learning algorithms. Prior to his role, Harold was a Data Scientist at Barracuda Networks responsible for web filtering and email spam classification. He received his Ph.D. in experimental particle physics working at the Large Hadron Collider, and has a decade of experience in analysis of large datasets.


NoSQL Cassandra Week Starts on Monday April 28th

April 23, 2014

Release of 10+ high-quality tech talks from Cassandra Day Silicon Valley 2014. Tips from engineers at top startups using Apache Cassandra. Exclusive NoSQL and Cassandra-related events.

This is just a taste of what you can expect from Cassandra Week. From April 28th – May 4th, our friends at Hakka Labs are giving you access to cutting-edge online content, as well as invitations to in-person events and job opportunities for Cassandra developers and administrators.

Here’s what you can look forward to:

  • Use cases from engineers at Netflix, Spotify, Hulu, Eventbrite, and Ooyala
  • 10+ tech talks from Cassandra Day and engineering community meetups
  • Advanced tutorials about how to deploy, scale, and troubleshoot Apache Cassandra
  • Job postings from top companies
  • Invitations to events designed for developers

Sign up on the Cassandra Week virtual event page to stay in the loop and receive notifications for upcoming content releases.

Looking to get started with Apache Cassandra? Check out DataStax’s free virtual Cassandra training for Java developers or the Getting Started page to take Cassandra for a test run.


Infor’s PeopleAnswers Helps Hire Global Talent at Global Scale, with Apache Cassandra Multi-Datacenter Replication

April 21, 2014

“Ultimately it was clear Cassandra was the right choice, given its proven scalability, its explicit design for the multi-datacenter use case, and its developer-friendly CQL model.” 

- Darrell Burgan, Chief Architect at PeopleAnswers

Infor

Infor creates beautiful ERP software that organizations of all kinds use to run their business processes. We are the third-largest ERP software company in the world. Infor’s portfolio includes a very broad range of applications targeting all kinds of business horizontals and verticals. If you have a business need, it is likely Infor has an application well suited to meet it.

One of Infor’s products is PeopleAnswers, which is a science-based talent management platform that companies use to help them improve the quality of their hiring processes, both in terms of reducing costs, as well as in terms of improving the performance of the people who are selected.

I am the chief architect of the PeopleAnswers product, so my answers here are primarily about PeopleAnswers, although Cassandra’s role within Infor is likely to grow.


PeopleAnswers

Imagine a big corporation with tens of thousands of employees, in which hiring is going on all the time, both due to growth and due to employee turnover. For an organization like this, the hiring process can be a major cost. Further, organizations are keenly focused on ensuring they retain the right people, to maximize the performance of the company. For jobs where there are tens of thousands of active applicants at any time, and where the number of applicants might exceed openings ten to one, administering the hiring process can be difficult. Our product makes managing the hiring process easy.

The hiring process is also typically very subjective. How does a company know it is hiring the right people? Our software gives the hiring manager a rational way of determining which candidates are most likely to be the best performing candidates for any particular job. We employ a scientific employee assessment that measures a candidate’s “behavioral DNA” against the highest and lowest performers in each job. This provides a predictor of a candidate’s future performance, and gives hiring managers a solid basis for figuring out which candidates to consider first.


A need to scale

Put simply, we need a database that is designed to work globally, across data centers, and provide the kind of global scale that our application demands. Relational databases are great and will always serve the need within the data center, but they do not scale well to this level, which naturally led us to the NoSQL world. We evaluated quite a few NoSQL products, like MongoDB, HBase, Couchbase, and several others.


Cassandra for a global company

Ultimately it was clear Cassandra was the right choice, given its proven scalability, its explicit design for the multi-datacenter use case, and its developer-friendly CQL model. Cassandra is important to Infor because Infor is a global company. We have data centers and customers around the world who access our cloud-based products 24 hours a day, every day of the year. If our systems are unavailable, their business stops operating.

We need a database product that is capable of scaling to this level, that can handle the global distributed database scenario, and that never goes down. There are very few products at this level of any kind, and in our view Cassandra is the leader among them.

Deployment

We’re currently using Cassandra 1.2 and have been using it for nearly a year in production. We plan to upgrade to Cassandra 2.x in the next few months. Cassandra serves as the basis of our persistence tier for that data which must span multiple data centers.

The PeopleAnswers Cassandra cluster is small in absolute terms but growing rapidly. Our current plans have us placing as many as eight nodes per virtual data center, clustered across as many data centers as needed.


Getting started

My advice to people new to Cassandra is to take an incremental approach. Cassandra is really easy to get up and running, but it has deep configurability, and the sheer number of configuration options can be bewildering at first. Pick a small use case and use it in a production setting. Then as you grow comfortable with the differences between Cassandra and the relational technology you might be used to, grow your usage of Cassandra to match.

The other advice I’d give is to have developers think at the CQL level, and let your Cassandra DBAs be the experts about the low-level structure. CQL is a really powerful tool, one that no other NoSQL database matches. Just like with relational databases, developers can (and should) think about the data from the perspective of the problem they are trying to solve, and can consult with DBAs to optimize table and query design to take advantage of Cassandra’s unique capabilities.


Joining the community

The community has been stellar. It is very active and enthusiastic, and DataStax is doing a great job of encouraging it to grow. We do our part by hosting the local meetups for Cassandra users!


DataStax Community Edition 1.2.16 and 2.0.7 now available

April 21, 2014

DataStax Community Edition 1.2.16 and 2.0.7, which include Apache Cassandra 1.2.16 and 2.0.7 respectively, are now available on the Planet Cassandra Downloads page. Here are the changes for DataStax Community Edition 1.2.16: CHANGES.txt
Here are the changes for DataStax Community Edition 2.0.7: CHANGES.txt


Python Driver Overview Using Twissandra

April 17, 2014

“Python Driver Overview Using Twissandra” was created by Lyuben Todorov, Software Engineer at DataStax.



Twissandra, a Twitter clone that uses Cassandra for storage, has had a makeover to use the new python driver. This allowed the clone to switch from the Thrift API to CQL3 over the native protocol. Let’s go through some examples of using the python driver, taken from the updated Twissandra code.

Twissandra Datamodel Overview

Twissandra is composed of six tables that store users, tweets, tweet order (for both a user’s own tweets and their timeline) and who users follow (and are followed by). Since Cassandra has no joins, the tables are partially denormalised: data is duplicated at write time so that reads can be served from a single table, trading extra writes for more performant reads.

Twissandra ER Diagram

The users table simply stores usernames and passwords:
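
A minimal version of that table might look like this (a sketch; the column names are assumptions based on the description):

```cql
CREATE TABLE users (
    username text PRIMARY KEY,
    password text
);
```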

Tracking latest tweets

Tweets are stored in a simple table where the primary key is a UUID column, ensuring the tweet’s uniqueness. We don’t track when the tweet was added in this table as that’s handled by the user’s timeline (see the userline and timeline table creation below).
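
A sketch of the tweets table as described (names assumed):

```cql
CREATE TABLE tweets (
    tweet_id uuid PRIMARY KEY,  -- uniqueness comes from the UUID
    username text,
    body text
);
```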

TimeUUIDs are used for tracking the time of the tweet and for ensuring uniqueness in the primary key, as they are composed of a timestamp plus additional uniquifying components. This allows us to retrieve unique tweets by time and also tracks when the tweet was added. Cassandra sorts the timeline and userline tables by the clustering key, time. Since the aim is to retrieve the latest tweets first, WITH CLUSTERING ORDER BY (time DESC) is added to the table creation statements to invert the default sort order.
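
The userline and timeline tables might therefore be created like this (a sketch; column names are assumptions):

```cql
CREATE TABLE userline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);

CREATE TABLE timeline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);
```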

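Because TimeUUIDs are version 1 UUIDs, Python’s standard uuid module can generate compatible values without touching the driver; a quick stdlib-only illustration:

```python
import uuid

# uuid1() builds a version-1 (time-based) UUID from a 60-bit timestamp
# plus node and clock-sequence components that guard against collisions.
first = uuid.uuid1()
second = uuid.uuid1()

assert first.version == 1
assert second != first
assert second.time >= first.time  # the embedded timestamps order them
```
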
Because the username is the partition key, we can easily select the most recent tweets for a specific user. The LIMIT clause can then be added to enforce a limit on how many tweets are retrieved:
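
The query might look like this (a sketch; the username and the page size of 40 are illustrative):

```cql
SELECT tweet_id, time
FROM userline
WHERE username = 'jsmith'
LIMIT 40;
```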

An important note: the data model presented here is only partially denormalised. Denormalising the tweets table completely into the timeline and userline tables would improve query time by letting us read the tweet content directly from them, instead of requiring a second set of SELECTs to retrieve the body of each tweet.

Tracking Followers

The followers table allows for retrieval of the users that are following you. The friends table allows for retrieval of the users that you follow. The primary key for both tables is a composite key. This matters because the first component of the composite key, the partition key, decides how data is split around the cluster: one set of replicas will store all the data for a specific user. The second component is the clustering key, which determines the order in which data is stored on disk. Although the ordering itself isn’t important for either table, the clustering key means all rows for a particular user are stored contiguously on disk, which optimises reading a user’s friends or followers by allowing a sequential disk read.
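
A sketch of the two tables (column names are assumptions based on the description):

```cql
CREATE TABLE followers (
    username text,   -- partition key: the user being followed
    follower text,   -- clustering key: stored contiguously per user
    PRIMARY KEY (username, follower)
);

CREATE TABLE friends (
    username text,   -- partition key: the user doing the following
    friend text,
    PRIMARY KEY (username, friend)
);
```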

To retrieve all the followers or friends for a specific user, the username is added to the WHERE clause just like in SQL. Something worth noting is that we can use the username in the WHERE clause because it’s part of the primary key.
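
For example (the username value is illustrative):

```cql
SELECT follower FROM followers WHERE username = 'jsmith';
SELECT friend   FROM friends   WHERE username = 'jsmith';
```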


Setting up a connection

To connect to Cassandra we first import the driver’s Cluster class. The next step is to create a cluster and a session. We then supply the list of IPs for nodes in the cluster and tell the session what keyspace to connect to. Note that sessions automatically manage a pool of connections so they should be long-lived and re-used for multiple requests.
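
A sketch of those steps (the contact-point IPs and keyspace name are assumptions; this needs a live cluster to run):

```python
from cassandra.cluster import Cluster

# Contact points: any live nodes; the driver discovers the rest of the ring.
cluster = Cluster(['10.0.0.1', '10.0.0.2'])

# Sessions pool connections internally, so create one and re-use it.
session = cluster.connect('twissandra')
```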


Some CRUD

The various things that Twitter can do, whether it’s inserting a tweet, retrieving your followers, updating your password or unfollowing someone, are examples of create, read, update and delete (CRUD) operations that can be carried out on Cassandra.

Tweeting – Create

Adding tweets is done via Twissandra’s save_tweet function, where four kinds of queries are carried out:

  1. Insert the tweet
  2. Update the current user’s userline with the tweet_id
  3. Update the public userline with the tweet_id
  4. Update the timelines of all of the user’s followers with the tweet_id

Inserting a message into Twissandra

Inserting the tweet message is as simple as supplying the username, the message, and a generated UUID. Note: if we didn’t need to save the UUID for use in later inserts, it could have been created with the uuid() function available in Cassandra 2.0. For a full list of CQL3 functions take a look at the DataStax docs.
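
A sketch of the insert (assumes a connected session and that username and body are already in scope; table and column names are assumptions):

```python
import uuid

tweet_id = uuid.uuid4()  # saved so the later userline/timeline inserts can reference it
session.execute(
    "INSERT INTO tweets (tweet_id, username, body) VALUES (%s, %s, %s)",
    (tweet_id, username, body))
```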

Adding to the user’s and public userlines requires a username, the tweet’s ID and a time uuid:
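
Roughly like this (a sketch; '!PUBLIC!' stands in for whatever key Twissandra uses for the public userline):

```python
import uuid

now = uuid.uuid1()  # TimeUUID marking when the tweet was posted
for line_owner in (username, '!PUBLIC!'):
    session.execute(
        "INSERT INTO userline (username, time, tweet_id) VALUES (%s, %s, %s)",
        (line_owner, now, tweet_id))
```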

Finally to complete the tweeting process, the tweet has to be inserted into each one of your follower’s timelines. This requires the username of the follower, the tweet’s creation time in the form of a Time UUID and the tweet’s ID in the form of a UUID.
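
Something like the following (a sketch; follower_usernames is assumed to come from a query on the followers table):

```python
for follower in follower_usernames:
    session.execute(
        "INSERT INTO timeline (username, time, tweet_id) VALUES (%s, %s, %s)",
        (follower, now, tweet_id))
```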

Retrieving Tweets – Read

Retrieving tweets is done using one of two functions in Twissandra: get_timeline and get_userline, both of which delegate to _get_line. To carry out the query we need a username, a tweet starting time and the number of tweets to fetch. Since we don’t want to fetch the entire feed, the range of tweets to retrieve is selected first.
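
A sketch of the first-page query (one extra row is fetched so we can later tell whether another page exists; assumes a connected session):

```python
query = "SELECT tweet_id, time FROM userline WHERE username = %s LIMIT %s"
rows = list(session.execute(query, (username, count + 1)))
```
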

Retrieving messages from Twissandra

If we need to start our page further back than the latest tweets, the less-than predicate, time < %s, can be used to retrieve older tweets in the timeline.
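
The paged variant of the query might look like this (a sketch; start_time is the TimeUUID marker saved from the previous page):

```python
query = ("SELECT tweet_id, time FROM userline "
         "WHERE username = %s AND time < %s LIMIT %s")
rows = list(session.execute(query, (username, start_time, count + 1)))
```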

Because we page through the timeline rather than retrieving it all in a single query, we also check whether we have reached the end of the timeline and, if not, store a marker telling us where the next page should start.
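
The end-of-page logic is plain Python; a self-contained sketch with stand-in rows (real code would use the rows returned by the driver):

```python
from collections import namedtuple

Row = namedtuple('Row', 'tweet_id time')       # stand-in for driver result rows
count = 2                                      # requested page size
rows = [Row(1, 300), Row(2, 200), Row(3, 100)] # count + 1 rows came back

next_start = None
if len(rows) > count:            # the extra row means another page exists
    next_start = rows[-1].time   # marker: start the next query from here
    rows = rows[:count]          # trim the page to the requested size

assert next_start == 100
assert len(rows) == 2
```
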

Once the array of tweet IDs is retrieved, they are used to fetch the actual tweets.

Queries are sometimes executed using session.execute and at other times session.execute_async is used instead. The difference is that execute blocks until a response arrives, while execute_async immediately returns a “future”, letting the application send multiple queries concurrently without waiting; as a result there is no guarantee on the order in which responses arrive. The returned ResponseFuture can be used to verify a query’s success for both serial and concurrent queries; on failure, an exception is raised.
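
Fetching the tweet bodies concurrently with execute_async might look like this (a sketch against a live session; tweet_ids is the list retrieved from the userline):

```python
futures = [
    session.execute_async(
        "SELECT username, body FROM tweets WHERE tweet_id = %s", (tid,))
    for tid in tweet_ids]

# result() blocks until that query's response arrives (or raises on failure),
# but all of the requests are already in flight concurrently.
tweets = [future.result()[0] for future in futures]
```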

Changing Password – Update

Updates and inserts behave almost identically in Cassandra: both blindly overwrite existing (or non-existing) data. Twissandra doesn’t use UPDATE statements, but for completeness here is a theoretical example of updating a password:
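
For example (values illustrative):

```cql
UPDATE users SET password = 'n3w_p4ss' WHERE username = 'jsmith';
```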

Unfollowing – Delete

Removing a user from your feed requires two queries, since in CQL3 there are no foreign keys to enforce relationships between the friends and followers tables. The first query removes the user from your feed, while the second tells them you are no longer following them. Note, however, that this user’s prior tweets won’t be deleted from your timeline.
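
The two deletes might look like this (usernames illustrative; here jsmith unfollows jdoe):

```cql
-- remove jdoe from jsmith's feed
DELETE FROM friends WHERE username = 'jsmith' AND friend = 'jdoe';

-- tell jdoe that jsmith is no longer a follower
DELETE FROM followers WHERE username = 'jdoe' AND follower = 'jsmith';
```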


Enhancing Twissandra With New Cassandra Features

Modelling in Cassandra frequently requires denormalisation, as there is no joining of tables. Denormalisation can be summed up as adding redundant data to tables in order to optimise read performance. A frequent use case in the relational model, users with multiple email addresses, is usually modelled with a user table and an email table in a one-to-many relationship. Cassandra’s alternative is CQL3 collections, where a column can store a list, set or map of values. If Twissandra’s user table also needed each user’s email address (see example below) and allowed for more than one, the set collection could be used to store them.
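
A sketch of a users table extended with a set of emails (column names and values are illustrative):

```cql
CREATE TABLE users (
    username text PRIMARY KEY,
    password text,
    emails set<text>
);

-- adding an address to the set
UPDATE users SET emails = emails + {'jsmith@example.com'}
WHERE username = 'jsmith';
```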

Light Weight Transactions

Lightweight transactions (LWT) are another piece of functionality added to satisfy commonly used patterns that require strong consistency, for example the need to ensure that a username is unique before allowing someone to register it. LWT aren’t available in the python driver’s initial release but are on their way in the new 2.0 release. Here is an example of what inserting a username would look like using an LWT from cqlsh: we execute the INSERT as usual, but append IF NOT EXISTS.
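
From cqlsh, roughly (values illustrative):

```cql
INSERT INTO users (username, password)
VALUES ('jsmith', 'ch@ngem3')
IF NOT EXISTS;
```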

LWT can also be used to verify that a row exists by appending IF EXISTS to the end of the query:
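
For example:

```cql
UPDATE users SET password = 'ch@ngem3b' WHERE username = 'jsmith' IF EXISTS;
```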
