Cassandra introduced SBR (Spark Bulk Reader) with CEP-28. I have a couple of questions to understand whether it is a good candidate for my use case.
My use case: I have a service (say S1) that uses Cassandra for persistence. I want to export all of its data and join it with the data of another service, S2. The goal is to find the records in S2 that are no longer referenced in S1 and delete them from S2. This means my export of S1 (Cassandra) must contain all the data present in its Cassandra persistence. I can tolerate missing a change made in S1 in the last few days (say 10 days), because I can always take the backup of S2 10 days earlier than that of S1. However, I cannot tolerate any other kind of missing data in the backup of S1, e.g. a missing record that was created 6 months ago, or data corruption. Such a miss from S1 would lead to data loss in S2, since a record that is still referenced would be deleted.
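To make the requirement concrete, here is a minimal sketch of the anti-join described above, using plain Python sets rather than Spark. The function name `find_orphans` and the sample ids are hypothetical and only illustrate why a record missing from the S1 export is dangerous.

```python
def find_orphans(s2_snapshot_ids, s1_referenced_ids):
    """Return the S2 record ids that no S1 row references (an anti-join)."""
    return sorted(set(s2_snapshot_ids) - set(s1_referenced_ids))


# Hypothetical data: the S2 snapshot is taken earlier than the S1 export.
s2_snapshot_ids = ["a", "b", "c", "d"]

# A complete S1 export references "a" and "c".
complete_export = ["a", "c"]
print(find_orphans(s2_snapshot_ids, complete_export))    # ['b', 'd']

# If the export silently dropped the old S1 row that references "c",
# "c" is wrongly reported as an orphan and would be deleted from S2.
incomplete_export = ["a"]
print(find_orphans(s2_snapshot_ids, incomplete_export))  # ['b', 'c', 'd']
```

The second call shows the failure mode in question: an incomplete export does not raise an error, it just enlarges the delete list.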
My questions:
- Is SBR a good option for exporting data from service S1? CEP-28 mentions "analytics workloads" as the motivation, which makes me wonder whether it is a good fit for my use case.
- I could not find out whether SBR is still in beta. Can someone confirm its current state and point me to an official announcement?
- The approach in CEP-28 says that it uses a sidecar and a library. Does that mean that changes made to the main Cassandra process may leave (or may already have left) the sidecar/library incompatible with some version of Cassandra, which could produce an incorrect backup for me and therefore a possibility of data loss?