Redwood City, CA: DataStax Enterprise Analytics – Apache Spark and Cassandra
Date(s): January 29, 2015 - January 30, 2015
Time: All Day
Description: This course provides a comprehensive foundation for data analytics with Spark and Cassandra using DataStax Enterprise. It covers Spark and Cassandra integration, architecture and deployment, Spark Core concepts and operations on generic and key-value datasets, and real-life scenarios and recipes for processing, analyzing, and running analytics on Cassandra data with Spark.
Length: 2 days
Prerequisites: Prior experience with Apache Cassandra (CASCOR), Scala, and Linux
Audience: Data scientists, data management and business intelligence professionals
Environment: Pre-configured DSE Analytics EC2 cluster, related tooling, and exercise files (everything provided by DataStax)

Learning Objectives

Introduction to Data Analytics with Cassandra and Spark

  • Introduce big data analytics with Cassandra and Spark
  • Set up, configure, start, and tune DSE Cassandra and Spark
  • Survey Cassandra and Spark tools
  • Lab 1: Working with the DSE Analytics cluster

Spark Essentials

  • Introduce the Resilient Distributed Dataset
  • Describe the main features of the Spark-Cassandra Connector
  • Use basic operations on RDDs
  • Use shared variables
  • Describe lineage graphs, lazy evaluation, and persistence
  • Lab 2: Working with Spark and Cassandra RDDs

Operations on Key-Value Pair RDDs

  • Introduce Pair RDDs
  • Use aggregation, grouping, and sorting
  • Use joins, intersection, union, and difference
  • Understand and control partitioning
  • Lab 3: Using Spark Pair RDDs to Join and Aggregate Data in Cassandra

Spark Applications: Cassandra Data Processing, Analysis, and Analytics

  • Create and run standalone Spark applications
  • Use Spark for data processing
  • Use Spark for data analysis and data analytics
  • Lab 4: Implementing a Collaborative Filtering Approach to Music Recommendations Using Spark and Cassandra