Blog

Infor’s PeopleAnswers Helps Hire Global Talent at Global Scale, with Apache Cassandra Multi-Datacenter Replication

April 21, 2014

By Infor
 

“Ultimately it was clear Cassandra was the right choice, given its proven scalability, its explicit design for the multi-datacenter use case, and its developer-friendly CQL model.” 

- Darrell Burgan, Chief Architect at PeopleAnswers


 

Infor

Infor creates beautiful ERP software that organizations of all kinds use to run their business processes. We are the third-largest ERP software company in the world. Infor has a very broad range of applications in its portfolio, targeting all kinds of horizontals and verticals. If you have a business need, it is likely Infor has an application well suited to meet it.

One of Infor’s products is PeopleAnswers, a science-based talent management platform that companies use to improve the quality of their hiring processes, both by reducing costs and by improving the performance of the people they select.

I am the chief architect of the PeopleAnswers product, so my answers here are primarily about PeopleAnswers, although Cassandra’s role within Infor is likely to grow.

 

PeopleAnswers

Imagine a big corporation with tens of thousands of employees, in which hiring is going on all the time, both due to growth and due to employee turnover. For an organization like this, the hiring process can be a major cost. Further, organizations are keenly focused on ensuring they retain the right people, to maximize the performance of the company. For jobs where there are tens of thousands of active applicants at any time, and where the number of applicants might exceed openings ten to one, administering the hiring process can be difficult. Our product makes managing the hiring process easy.

The hiring process is also typically very subjective. How does a company know it is hiring the right people? Our software gives the hiring manager a rational way of determining which candidates are most likely to be the best performing candidates for any particular job. We employ a scientific employee assessment that measures a candidate’s “behavioral DNA” against the highest and lowest performers in each job. This provides a predictor of a candidate’s future performance, and gives hiring managers a solid basis for figuring out which candidates to consider first.

 

A need to scale

Put simply, we need a database that is designed to work globally, across data centers, and provide the kind of global scale that our application demands. Relational databases are great and will always serve a need within the data center, but they do not scale well to this level, which naturally led us to the NoSQL world. We evaluated quite a few NoSQL products, such as MongoDB, HBase, and Couchbase, among several others.

 

Cassandra for a global company

Ultimately it was clear Cassandra was the right choice, given its proven scalability, its explicit design for the multi-datacenter use case, and its developer-friendly CQL model. Cassandra is important to Infor because Infor is a global company. We have data centers and customers around the world, who access our cloud-based products 24 hours a day, every day of the year. If our systems are unavailable, their business stops operating.

We need a database product that is capable of scaling to this level, that can handle the global distributed database scenario, and that never goes down. There are very few products at this level of any kind, and in our view Cassandra is the leader among them.

Deployment

We’re currently using Cassandra 1.2 and have been using it for nearly a year in production. We plan to upgrade to Cassandra 2.x in the next few months. Cassandra serves as the basis of our persistence tier for that data which must span multiple data centers.

The PeopleAnswers Cassandra cluster is small in absolute terms but growing rapidly. Our current plans have us placing as many as eight nodes per virtual data center, clustered across as many data centers as needed.

 

Getting started

My advice to people new to Cassandra is to take an incremental approach. Cassandra is really easy to get up and running, but it has deep configurability, and the sheer number of configuration options can be bewildering at first. Pick a small use case and use it in a production setting. Then as you grow comfortable with the differences between Cassandra and the relational technology you might be used to, grow your usage of Cassandra to match.

The other advice I’d give is to have developers think at the CQL level, and let your Cassandra DBAs be the experts about the low-level structure. CQL is a really powerful tool, one that no other NoSQL database matches. Just like with relational databases, developers can (and should) think about the data from the perspective of the problem they are trying to solve, and can consult with DBAs to optimize table and query design to take advantage of Cassandra’s unique capabilities.

 

Joining the community

The community has been stellar. It is very active and enthusiastic, and DataStax is doing a great job of encouraging it to grow. We do our part by hosting the local meetups for Cassandra users!


DataStax Community Edition 1.2.16 and 2.0.7 now available

April 21, 2014


DataStax Community Edition 1.2.16 and 2.0.7, which include Apache Cassandra 1.2.16 and 2.0.7 respectively, are now available on the Planet Cassandra Downloads page. Here are the changes for DataStax Community Edition 1.2.16: CHANGES.txt
Here are the changes for DataStax Community Edition 2.0.7: CHANGES.txt


Python Driver Overview Using Twissandra

April 17, 2014


 

“Python Driver Overview Using Twissandra” was created by Lyuben Todorov, Software Engineer at DataStax.



Twissandra, a Twitter clone that uses Cassandra for storage, has had a makeover to use the new Python driver. This allowed the clone to switch from the Thrift API to CQL3 over the native protocol. Let’s go through some examples of using the Python driver, taken from the updated Twissandra code.

Twissandra Datamodel Overview

Twissandra is composed of six tables that store users, tweets, tweet order (for both the user’s own line and their timeline), and who users follow (and are followed by). Since we can’t use joins in Cassandra, the tables are partially denormalized to allow the necessary flexibility, meaning there are more writes in order to make reads more performant.

(Figure: Twissandra ER diagram)

The users table simply stores usernames and passwords:
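A sketch of that table’s DDL (close to, though not necessarily identical to, the Twissandra repository’s exact schema):

```sql
CREATE TABLE users (
    username text PRIMARY KEY,
    password text
);
```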

Tracking latest tweets

Tweets are stored in a simple table where the primary key is a UUID column, ensuring the tweet’s uniqueness. We don’t track when the tweet was added in this table as that’s handled by the user’s timeline (see the userline and timeline table creation below).

TimeUUIDs are used for tracking the time of the tweet and for ensuring uniqueness in the primary key, as they are composed of a random component and a timestamp. This lets us retrieve unique tweets by time and also track when each tweet was added. Cassandra sorts the timeline and userline by the clustering key, time. Since the aim is to retrieve the latest tweets first, WITH CLUSTERING ORDER BY (time DESC) is added to the table creation statements to invert the sorting.
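A sketch of those table definitions (an approximation of the Twissandra schema, not necessarily its exact DDL):

```sql
CREATE TABLE tweets (
    tweet_id uuid PRIMARY KEY,  -- random UUID keeps each tweet unique
    username text,
    body text
);

-- userline and timeline share the same shape; time is a TimeUUID, and
-- the clustering order is inverted so the newest tweets come first.
CREATE TABLE userline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);

CREATE TABLE timeline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);
```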

Because the username is the partition key, we can easily select the most recent tweets for a specific user. The LIMIT clause can then be added to enforce a limit on how many tweets are retrieved:
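For example (the username value here is hypothetical):

```sql
SELECT time, tweet_id
FROM userline
WHERE username = 'jadams'
LIMIT 40;
```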

An important note: the data model presented here is only partially denormalized. Denormalizing the tweets table completely into the timeline and userline tables would improve query time by letting us read the tweet content directly from them, instead of requiring a second set of SELECTs to retrieve it.

Tracking Followers
The followers table allows for retrieval of the users that are following you. The friends table allows for retrieval of the users that you follow. The primary key for both tables is a composite key. This is important because the first component of the composite key, the partition key, determines how data is split around the cluster: one set of replicas will store all the data for a specific user. The second component is the clustering key, which is used to store data in a particular order on disk. Although the ordering itself isn’t important for either table, the clustering key means all rows for a particular user are stored contiguously on disk, which optimizes reading a user’s friends or followers by allowing a sequential disk read.

To retrieve all the followers or friends for a specific user, the username is added to the WHERE clause just like in SQL. Something worth noting is that we can use the username in the WHERE clause because it’s part of the primary key.
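A sketch of the two tables and such a query (again an approximation of the Twissandra schema, with a hypothetical username):

```sql
CREATE TABLE friends (
    username text,     -- partition key: the user doing the following
    friend text,       -- clustering key: a user they follow
    since timestamp,
    PRIMARY KEY (username, friend)
);

CREATE TABLE followers (
    username text,     -- partition key: the user being followed
    follower text,     -- clustering key: a user following them
    since timestamp,
    PRIMARY KEY (username, follower)
);

SELECT friend FROM friends WHERE username = 'jadams';
```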

 

Setting up a connection

To connect to Cassandra we first import the driver’s Cluster class. The next step is to create a cluster and a session. We then supply the list of IPs for nodes in the cluster and tell the session what keyspace to connect to. Note that sessions automatically manage a pool of connections so they should be long-lived and re-used for multiple requests.
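A minimal sketch of that setup, assuming the DataStax cassandra-driver package, a node listening on localhost, and a keyspace named twissandra. The connection is wrapped in a function (with a lazy import) so nothing connects at import time:

```python
def connect(contact_points=("127.0.0.1",), keyspace="twissandra"):
    """Create a long-lived Session bound to the given keyspace."""
    from cassandra.cluster import Cluster  # lazy import: driver optional here
    cluster = Cluster(list(contact_points))
    # Sessions pool connections internally, so create one and reuse it
    # across requests rather than opening a session per query.
    return cluster.connect(keyspace)
```

In application code you would call connect() once at startup and pass the resulting session around.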

 

Some CRUD

The various things that Twitter can do, whether it’s inserting a tweet, retrieving your followers, updating your password, or unfollowing someone, are examples of the create, read, update, and delete operations that can be carried out against Cassandra.

Tweeting – Create
Adding tweets is done via Twissandra’s save_tweet function where four kinds of queries are carried out:

  1. Insert the tweet
  2. Update the current user’s userline with the tweet_id
  3. Update the public userline with the tweet_id
  4. Update the timelines of all of the user’s followers with the tweet_id

Inserting a message into Twissandra

Inserting the tweet message is as simple as supplying the username and the message, and generating a UUID. Note that if we didn’t need to save the UUID for use in later inserts, it could have been created with the uuid() function available in Cassandra 2.0. For a full list of CQL3 functions, take a look at the DataStax docs.
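As a sketch (the statement text and column names are assumptions rather than the exact Twissandra code), generating the id client-side and preparing the insert looks like:

```python
import uuid

# Generate the tweet's id on the client so the same value can be reused
# in the userline/timeline inserts that follow.
tweet_id = uuid.uuid4()

INSERT_TWEET = """
    INSERT INTO tweets (tweet_id, username, body)
    VALUES (%s, %s, %s)
"""
# The insert itself would then be run through the driver, e.g.:
# session.execute(INSERT_TWEET, (tweet_id, username, body))
```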

Adding to the user’s and public userlines requires a username, the tweet’s ID, and a TimeUUID:

Finally, to complete the tweeting process, the tweet has to be inserted into each of your followers’ timelines. This requires the follower’s username, the tweet’s creation time in the form of a TimeUUID, and the tweet’s ID in the form of a UUID.
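A sketch of those line inserts (table and column names are assumptions; in Python, version-1 UUIDs play the role of TimeUUIDs since they embed a timestamp):

```python
import uuid

time_id = uuid.uuid1()   # TimeUUID equivalent: timestamp + node component
tweet_id = uuid.uuid4()  # the tweet being linked

INSERT_LINE = """
    INSERT INTO {table} (username, time, tweet_id)
    VALUES (%s, %s, %s)
"""
# One insert per line the tweet appears on, e.g. (names hypothetical):
# session.execute(INSERT_LINE.format(table="userline"), (author, time_id, tweet_id))
# session.execute(INSERT_LINE.format(table="timeline"), (follower, time_id, tweet_id))
```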

Retrieving Tweets – Read

Retrieving tweets is done using one of two functions in Twissandra: get_timeline and get_userline, both of which delegate to _get_line. Whether we’re retrieving all of our own tweets or all of someone else’s, _get_line does the work. To carry out the query we need a username, a tweet starting time, and the number of tweets to fetch. Since we don’t want to fetch the entire feed, the range of tweets to retrieve is selected first.
Retrieving messages from Twissandra

If we need to start our page further back than the latest tweets, the less-than predicate, time < %s, can be used to retrieve tweets further back in the timeline.
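Such a paging query might look roughly like this (the %s markers are parameters bound by the driver; the LIMIT value is illustrative):

```sql
SELECT time, tweet_id
FROM timeline
WHERE username = %s AND time < %s
LIMIT 40;
```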

Again, because we want to page through the timeline rather than retrieving all of it in a single query, we want to check if we reached the end of the timeline, and if not to store a marker to tell us where to start the page during the next query.

Once the array of tweet IDs is retrieved, they are used to fetch the actual tweets.

Queries are sometimes executed using session.execute, and other times session.execute_async is used instead. The difference is that execute waits for a response before returning, whilst execute_async returns a “future”, so it can send multiple messages concurrently without waiting for responses; as a result, there is no guarantee on the order of the responses. The returned ResponseFuture can be used to verify the query’s success for both serial and concurrent queries; on failure, an exception is raised.
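A sketch of that fan-out pattern (the helper name is hypothetical): fire every request first, then gather the results. Calling result() blocks until that response arrives and re-raises any query failure.

```python
def fetch_concurrently(session, query, ids):
    """Issue one query per id without waiting in between, then collect."""
    # Each execute_async call returns a future immediately.
    futures = [session.execute_async(query, (i,)) for i in ids]
    # result() blocks until that particular response is available.
    return [f.result() for f in futures]
```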

Changing Password – Update
Updates and inserts have mostly identical behavior in Cassandra: both blindly overwrite existing (or non-existing) data. Twissandra doesn’t use UPDATE statements, but for completeness here is a theoretical example of updating a password:
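Such a statement might look like this (the username and password values are of course hypothetical):

```sql
UPDATE users
SET password = 'new-and-longer-password'
WHERE username = 'jadams';
```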

Unfollowing – Delete

Removing a user from your feed requires two queries, since in CQL3 there are no foreign keys to enforce relationships between the friends and followers tables. The first query removes the user from your feed, while the second tells them you are no longer following them. Prior tweets from this user won’t, however, be deleted from your timeline.
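A sketch of the two deletes, with hypothetical usernames ('jadams' unfollowing 'thudson'):

```sql
-- Remove thudson from jadams' friends (the users jadams follows)...
DELETE FROM friends WHERE username = 'jadams' AND friend = 'thudson';
-- ...and record that jadams is no longer one of thudson's followers.
DELETE FROM followers WHERE username = 'thudson' AND follower = 'jadams';
```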

 

Enhancing Twissandra With New Cassandra Features

Modelling in Cassandra frequently requires denormalization, as there is no joining of tables. Denormalization can be summed up as adding redundant data to tables in order to optimize read performance. The frequent relational use case of users with multiple email addresses is usually modelled by creating a user table and an email table in a one-to-many relationship. Cassandra’s alternative is CQL3 collections, where a column can store a list, set, or map of fields. If Twissandra’s user table also required each user’s email address (see the example below) and allowed for more than one, the set collection could be used to store them.
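For instance (the emails column and addresses are hypothetical additions, not part of the real Twissandra schema):

```sql
CREATE TABLE users (
    username text PRIMARY KEY,
    password text,
    emails set<text>   -- multiple addresses per user, no second table
);

-- Adding an address merges it into the existing set:
UPDATE users
SET emails = emails + {'jadams@example.com'}
WHERE username = 'jadams';
```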

Light Weight Transactions

Lightweight transactions (LWT) are another piece of functionality added to satisfy commonly used patterns that require strong consistency, for example the need to ensure that a username is unique before allowing someone to register it. LWT aren’t available in version 1.0 of the Python driver but are on their way in the new 2.0 release. Here is what inserting a username looks like using an LWT from cqlsh. We execute the INSERT as usual, but also append IF NOT EXISTS:
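For example (hypothetical values; if the row already exists, the insert is rejected and the existing row is returned along with [applied] = False):

```sql
INSERT INTO users (username, password)
VALUES ('jadams', 'secret')
IF NOT EXISTS;
```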

LWT can also be used to verify that a row exists by appending IF EXISTS to the end of the query:
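For example, a conditional delete (again with a hypothetical username):

```sql
DELETE FROM users
WHERE username = 'jadams'
IF EXISTS;
```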


Cassandra 2.0 Support for DataStax C# Driver

April 16, 2014


We’re glad to release today the first beta of our C# Driver 2.0, which supports Apache Cassandra 2.0 and DataStax Enterprise 4.0 while remaining fully compatible with Cassandra 1.2 and the DSE versions relying on it. This driver is intended to be aligned with the feature set of our Java Driver 2.0.

In practice this means that C# developers can now enjoy:

We have several other improvements and changes to come in the coming weeks as we iterate through several beta versions:

  • Task based API
  • Using some interfaces instead of classes in the API, to make it easier to mock every part of the driver
  • Automatic paging, a feature introduced in Cassandra 2.0; it’s not part of this first beta but will be included in the next one

This new C# Driver 2.0.0-beta1 is now available on NuGet. Feel free to give it a try!

 


UPDATE: Cassandra Migration Yields Insane (10x) Performance Improvements at Rekko

April 16, 2014


“Cassandra Migration Yields Insane Performance Improvements” was created by Robert Thanh Parker, CEO and Founder at Rekko.

Team,

Wanted to let you guys know that I posted the message on our FB page. It’s amazing stuff and something the team has worked incredibly hard on, both in refactoring our code and in architecting and optimizing our infrastructure over Cassandra. More importantly, this update solves two remaining issues:

(1) From time to time we’d see spikes in load time which, while unusual, resulted from a key update lock in Mongo. The new infrastructure removes this 100%, and performance has been completely smooth since the update. Average server response times are under 5ms for our most complex campaign delivery requests; previously they were closer to 50ms, with spikes that were much higher.

(2) Data manageability. We collect inordinate amounts of data (much more than most analytics providers), which complicates managing visitors as we scale and makes performance an ongoing challenge. The new infrastructure largely solves this, and more importantly does so at a cost structure that will continue to allow us to deliver increasingly sophisticated technology at lower costs. That is a key tenet of our long-term vision of bringing our technology to every online piece of real estate in the world. Big step there.

(Screenshot: New Relic, April 15)

Here’s the quick post:

New Update: New Rekko Big Data Engine updated. Core services go live and we’re 10x faster overnight!

Speed. Speed. Speed.
We were fast before, but now we’ve concluded a major infrastructure refactoring. Our vision is to make accessible and automate big data personalization for small and medium sized businesses. This is a huge and crucial update towards our goal.

Lowering the Cost/Customer
Driving down the cost of providing enterprise level technology and services such that SMBs can EASILY leverage them is the most essential step in taking this technology mainstream. The first HD plasma TV I saw cost $29,999. The one I just bought cost significantly less. Our first Rekko customers paid $42k/month, our new ones pay a world less.

The Migration.
After a period of running Cassandra DB simultaneously with Mongo DB, the team completed the majority of our migration last night – we’re now completely live on Cassandra. While there are some small portions of infrastructure that will continue to use Mongo, almost everything material is now migrated.

The Results.
To summarize, the slowest response times on Cassandra (for real-time profiling and campaign delivery) average more than 10x better than the fastest we had on the Mongo DB code and infrastructure. We’re now able to intelligently deliver dynamic, tailored content to a visitor’s browser in less than 1/200th of a second. Consistently, and without spikes.

This is our last infrastructure step prior to rolling out…

#GoBigorGoHome

Some days, we just smile because we know the hard work pushed us three steps forward. Today is one of those days…

Best,
Parker
Founder, Rekko

If you’re interested in learning more about migrating from MongoDB to Apache Cassandra, visit the MongoDB to Cassandra Migration page for resources and how-to’s.
