Apache Cassandra Lunch #65: Spark Cassandra Connector Pushdown

6/27/2022

Reading time:3

Apache Cassandra Lunch #65: Spark Cassandra Connector Pushdown - Business Platform Team

This resource is based on an article originally published here.

In Apache Cassandra Lunch #65: Spark Cassandra Connector Pushdown, we discussed Spark predicate pushdown in the context of the Spark Cassandra connector. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Predicate Pushdown

Predicate Pushdown in SparkSQL

In general, predicate pushdown is an optimization that SparkSQL applies to a query that interacts with an external datasource. It involves pushing down filtering to be done by the underlying data store being queried, rather than doing that processing in Spark. It applies to several types of external data sources including Cassandra, whether that connection is via the Spark Cassandra connector or Datastax Enterprise. We can also use the optimization when connecting to JDBC data sources like when using the PostgreSQL driver. It also applies when connecting to Hive data sources and when reading from Parquet data sources.

Predicate pushdown happens when a query going ultimately to an external data source includes a valid predicate. A predicate is a clause that is part of the query that selects a specific subset of the data in a table based on the values in each row. A query with no predicates returns all of the data within a particular table. Any time we formulate a query that includes where or filter steps working on data from an external database, predicate pushdown may occur.

This works both for spark sql queries made directly that include a where clause, but it also includes Spark dataframes loaded from external sources that have filters or where methods applied to them. Since Spark uses lazy evaluation, loading a dataframe from an external source for one query does not confer any benefit to future queries unless we cache the dataframe containing all the data, so that it can be queried and filtered later. Loading only the data we need into Spark confers a performance benefit since we don’t need to load the entire table in and the filtering doesn’t need to take place using Spark resources.

Requirement for Cassandra

The rules for predicate pushdown are different in Cassandra for partition key columns and clustering columns. Spark can push partition key columns down as long as we include all of the partition key columns in the filter, and no more than one equivalence predicate exists per column. Although you can use an IN clause to specify multiple restrictions for a particular column. For clustering key columns, only the last predicate in the filter can be a non-equivalence predicate. If there is more than one predicate for a column, the predicates cannot be equivalence predicates. Queries that fail to follow these rules get their associated tables loaded fully into Spark and then are filtered down using Spark resources.

We can use the explain method to see how a Spark query evolves from the provided query to the optimized plan. In these explanations, we see a PushedFilters section when predicate pushdown occurs.

Pushdown in Spark Cassandra Connector

The Spark Cassandra connector also has the ability to push down other types of query. Older versions of the Spark Cassandra connector included specific functions like cassandraCount to perform pushed-down versions of count queries. The alternative would be to load the table into Spark, potentially filter that dataframe, and then use the count method to get the desired value. Now, this same sequence triggers predicate pushdown and the query only returns the individual counts that Spark puts together.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!

Posted in Data & Analytics, Events | Comments Off on Apache Cassandra Lunch #65: Spark Cassandra Connector Pushdown

Related Articles

sstable

cassandra

spark

Spark and Cassandra’s SSTable loader

11/1/2024

analytics

cassandra

spark

GitHub - apache/cassandra-analytics: Apache cassandra

9/4/2024

cassandra

event.driven

spark

Build an Event-Driven Architecture with Apache Kafka, Apache Spark, and Apache Cassandra

8/3/2024

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

flink

beam

dataflow

• Google Dataflow - Awesome-Astra

5/10/2023

data.modeling

cassandra

spark

Dealing with Large Spark Partitions

2/17/2023

cassandra

spark

kafka

Apache Cassandra Lunch #84: Data & Analytics Platform: Cassandra, Spark, Kafka

11/4/2022

cassandra

spark

Can Spark Applications Coexist with NoSQL Databases? | Capital One

11/4/2022

proxy

cassandra

spark

Migrate to Azure Managed Instance for Apache Cassandra using Apache Spark

8/18/2022

Explore Further

sparksql

sstable

cassandra

spark

Spark and Cassandra’s SSTable loader

11/1/2024

analytics

cassandra

spark

GitHub - apache/cassandra-analytics: Apache cassandra

9/4/2024

cassandra

event.driven

spark

Build an Event-Driven Architecture with Apache Kafka, Apache Spark, and Apache Cassandra

8/3/2024

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

cassandra.lunch

stargate

cassandra.lunch

cassandra

Apache Cassandra Lunch #87: Cassandra.api, Astra, and Stargate - Business Platform Team

7/8/2022

cqlsh

cassandra.lunch

cassandra

Apache Cassandra Lunch #77: Connect to DataStax Astra via Standalone CQLSH - Business Platform Team

7/2/2022

datastax

cassandra.basics

cassandra.lunch

Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker - Business Platform Team

6/29/2022

cassandra.basics

cassandra.lunch

cassandra

Cassandra Lunch #70: Basics of Apache Cassandra - Business Platform Team

6/27/2022

cassandra

acid

open.source

cassandra

GitHub - pmcfadin/awesome-accord: Repository of all kinds of things to help you get up and running with ACID transactions on Apache Cassandra®

1/16/2025

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

migration

proxy

cassandra

GitHub - datastax/cql-proxy: A client-side CQL proxy/sidecar.

11/1/2024

Predicate Pushdown

Predicate Pushdown in SparkSQL

Requirement for Cassandra

Pushdown in Spark Cassandra Connector

Cassandra.Link

Become part of our

growing community!

Planet Cassandra is a service for the Apache Cassandra® user community to share with each other. From tutorials and guides, to discussions and updates, we're here to help you get the most out of Cassandra. Connect with us and become part of our growing community today.

Get Involved with Planet Cassandra!

We believe that the power of the Planet Cassandra community lies in the contributions of its members. Do you have content, articles, videos, or use cases you want to share with the world?