Question
We have a Cassandra schema with more than 50 columns, and we are inserting data into it from multiple data sources after transforming it with Spark (DataFrames, not RDDs).
We are running into an issue with many tombstones because our data is sparse.
We already tried spark.cassandra.output.ignoreNulls=true,
but it is not working. What would be the right configuration to avoid writing null values to Cassandra?
I am using Zeppelin to run my Spark code and push the data to C*.
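For context, the attempt that did not work was presumably setting the flag at the session level rather than on the writer; a hypothetical sketch of that in Scala (spark is the session Zeppelin provides, and transformedData, table_name, keyspace_name are the same placeholders as in the answer below):

    // Hypothetical: set the connector flag globally on the Spark session.
    // In this scenario the flag did not take effect on the DataFrame write.
    spark.conf.set("spark.cassandra.output.ignoreNulls", "true")

    transformedData.write
      .format("org.apache.spark.sql.cassandra")
      .mode("append")
      .options(Map("table" -> table_name, "keyspace" -> keyspace_name))
      .save()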
Answer 1:
Figured out the solution to this.
There is a hint in the documentation at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md, under the "Setting Connector Specific Options on Datasets" section.
The exact code looks like this:
transformedData.write
  .format("org.apache.spark.sql.cassandra")
  .option("header", "false")
  .option("spark.cassandra.output.ignoreNulls", "true")
  .mode("append")
  .options(Map("table" -> table_name, "keyspace" -> keyspace_name))
  .save()
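The key point is that spark.cassandra.output.ignoreNulls must be passed as an option on the writer itself. With it set to true, the connector treats null columns as unset in the generated inserts instead of writing null cells, so Cassandra records no tombstones for them. One caveat: an insert can then no longer overwrite an existing non-null value with null.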
Source: https://stackoverflow.com/questions/57659876/ignore-nulls-with-data-frame-using-spark-datastax-connector