April 8th, 2014


New York Times
 

“Simplicity has helped us make nyt⨍aбrik global, reliable, fast, and efficient. It scales up and down on a minute’s notice to meet demand.”

- Michael Laing, Systems Architect at The New York Times

 

The New York Times

The New York Times is the premier news organization and has a worldwide online presence.

I am the systems architect of the nyt⨍aбrik platform that supports the online functions.

 

NY Times’ messaging + update system

nyt⨍aбrik is a “chat” system for things — our client and partner devices, our systems and services.

It is simple and tries to do only a few tasks very well: connect millions of devices to our services, route billions of messages quickly and efficiently, and remember every message.

Simplicity has helped us make nyt⨍aбrik global, reliable, fast, and efficient. It scales up and down on a minute’s notice to meet demand.

 

Choosing Cassandra

Open source, multi-region support, scalability, and reliability/availability were the primary criteria. We originally used DynamoDB but switched because it is not multi-region. Riak similarly does not have an open source multi-region capability. We also liked the look of CQL3 and the asynchronous protocol.

Our volumes vary widely, which requires scalability. The news must get out, which requires consistent speed and reliability/availability. Our messaging architecture is flat and wide, and we wanted a cache to match.

 

Cassandra for cache

Cassandra is the global caching layer for nyt⨍aбrik. We use Cassandra 2.0.6.

If a service wants to deliver a message to a user device but that device is not connected, we cache the message for later delivery. Many clients want to receive the latest versions of certain kinds of messages ("breaking news", for example) whenever they connect. These messages are cached as well and served on connection based upon client preference (also cached).
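The store-and-forward behavior described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class `MessageCache`, its method names, and the `"breaking-news"` kind label are hypothetical, and a plain Python dictionary stands in for Cassandra. It is not the nyt⨍aбrik implementation.

```python
from collections import defaultdict


class MessageCache:
    """Toy in-memory stand-in for the cache described above (hypothetical)."""

    def __init__(self):
        self.connected = set()               # device ids currently online
        self.pending = defaultdict(list)     # device id -> undelivered messages
        self.latest = {}                     # message kind -> most recent message
        self.preferences = defaultdict(set)  # device id -> kinds wanted on connect

    def publish(self, kind, body):
        """Remember the latest message of a kind, e.g. 'breaking-news'."""
        self.latest[kind] = body

    def send(self, device_id, body):
        """Deliver now if the device is online, otherwise cache for later."""
        if device_id in self.connected:
            return [body]
        self.pending[device_id].append(body)
        return []

    def connect(self, device_id):
        """Return the cached backlog plus the latest message of each preferred kind."""
        self.connected.add(device_id)
        backlog = self.pending.pop(device_id, [])
        wanted = [self.latest[kind]
                  for kind in sorted(self.preferences[device_id])
                  if kind in self.latest]
        return backlog + wanted


cache = MessageCache()
cache.preferences["phone-1"].add("breaking-news")
cache.publish("breaking-news", "Headline A")
cache.send("phone-1", "Your digest is ready")   # device is offline, so this is cached
print(cache.connect("phone-1"))                 # ['Your digest is ready', 'Headline A']
```

Keeping the preferences in the same store as the messages mirrors the point in the text that client preferences are cached too.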

The cache is useful for analyzing the messages that flow through nyt⨍aбrik. And often we want to retrieve a certain message or see who read it.

 

Deployment

We currently run a small cluster, as small as we can get away with: 12 nodes in production across 6 zones in 2 AWS regions, Oregon and Dublin. Our volumes through nyt⨍aбrik are small: 10 to 100 million messages per day. The messages are small too, typically 1 to 5 KB; large message bodies are pushed to S3 / CloudFront and passed by reference. All messages have a TTL, 3 days by default, and we never do explicit deletes. We will be adding more regions, and as we start gathering events and providing more client messaging services, volumes will grow rapidly.
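Two of the ideas above, expiry by TTL instead of explicit deletes and passing large bodies by reference, can be sketched as follows. Everything named here is an assumption made for illustration: the 5 KB inline cutoff, the `messages` table and its columns, the helper names, and the dictionary standing in for S3 / CloudFront. Only the 3-day default TTL comes from the text.

```python
import hashlib
import time

DEFAULT_TTL_SECONDS = 3 * 24 * 60 * 60   # 3 days, the default stated above
INLINE_LIMIT_BYTES = 5 * 1024            # assumed cutoff; the article gives no exact threshold

# Hypothetical CQL3 statement; the table and column names are invented for illustration.
INSERT_CQL = (
    "INSERT INTO messages (device_id, message_id, body) "
    f"VALUES (?, ?, ?) USING TTL {DEFAULT_TTL_SECONDS}"
)


def make_envelope(body, blob_store, ttl=DEFAULT_TTL_SECONDS, now=None):
    """Build a cacheable record: small bodies inline, large ones by reference."""
    now = time.time() if now is None else now
    envelope = {"expires_at": now + ttl}
    if len(body) <= INLINE_LIMIT_BYTES:
        envelope["body"] = body
    else:
        key = hashlib.sha256(body).hexdigest()
        blob_store[key] = body           # stand-in for an upload to S3 / CloudFront
        envelope["body_ref"] = key
    return envelope


def is_live(envelope, now=None):
    """Expired records are simply ignored; nothing is deleted explicitly."""
    now = time.time() if now is None else now
    return now < envelope["expires_at"]
```

With a TTL on every write, the database expires old rows on its own, so the application never needs to issue a delete.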

 

Words of wisdom

Start with the latest release of Cassandra 2 — it is sufficiently stable now.

Use CQL3. Try not to get confused by the old terminology many people still use.

Learn the physical structures, read path, write path, etc. so you can design high performing tables that support your operations consistently and well.
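As one illustration of why the physical layout matters, here is a toy model of a CQL3 table whose partition key groups a device's rows together and whose clustering column keeps them ordered by time. The schema, the table name `messages_by_device`, and the class below are hypothetical and are not taken from nyt⨍aбrik. The model stores rows in ascending order and reads from the newest end, which approximates a descending clustering order.

```python
import bisect
from collections import defaultdict

# Hypothetical CQL3 schema; all names are invented for illustration.
SCHEMA_CQL = """
CREATE TABLE messages_by_device (
    device_id text,
    sent_at   timestamp,
    body      blob,
    PRIMARY KEY (device_id, sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC)
"""


class PartitionedTable:
    """Toy model: one sorted list of rows per partition key."""

    def __init__(self):
        # device_id -> [(sent_at, body), ...], kept sorted ascending by sent_at
        self.partitions = defaultdict(list)

    def insert(self, device_id, sent_at, body):
        """Place the row in its partition, preserving clustering order."""
        bisect.insort(self.partitions[device_id], (sent_at, body))

    def latest(self, device_id, limit):
        """Read the newest rows of a single partition; no other partition is touched."""
        rows = self.partitions.get(device_id, [])
        return rows[::-1][:limit]


table = PartitionedTable()
table.insert("phone-1", 2, "second")
table.insert("phone-1", 1, "first")
table.insert("phone-1", 3, "third")
print(table.latest("phone-1", 2))   # [(3, 'third'), (2, 'second')]
```

A query shaped like `latest` stays inside one partition and reads rows in stored order, which is the kind of access pattern a table design should aim for.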

 

Community

The community has been good. Very active; very responsive. Take the time to read the JIRA issues so you know what “features” to avoid for now.

 

P.S.

Open source is great. See you at OSCON in Portland, OR.