Question
We have a Cassandra schema with more than 50 columns, and we are inserting data into it from multiple data sources after transforming it with Spark (DataFrames, not RDDs).
We are running into an issue with many tombstones because our data is sparse.
We already tried spark.cassandra.output.ignoreNulls=true,
but it is not working. What would be the right configuration to avoid writing null values to Cassandra?
I am using Zeppelin to run my Spark code and push the data to C*.
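For context, the attempt that did not work was presumably setting the flag at the session level rather than on the writer; a hypothetical sketch of that in Scala (spark is the session Zeppelin provides, and transformedData, table_name, keyspace_name are the same placeholders as in the answer below):

    // Hypothetical: set the connector flag globally on the Spark session.
    // In this scenario the flag did not take effect on the DataFrame write.
    spark.conf.set("spark.cassandra.output.ignoreNulls", "true")

    transformedData.write
      .format("org.apache.spark.sql.cassandra")
      .mode("append")
      .options(Map("table" -> table_name, "keyspace" -> keyspace_name))
      .save()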
Answer 1:
Figured out the solution to this.
There is a hint in the documentation at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md, under the "Setting Connector Specific Options on Datasets" section.
The exact code looks like this:
transformedData.write
  .format("org.apache.spark.sql.cassandra")
  .option("header", "false")
  .option("spark.cassandra.output.ignoreNulls", "true")
  .mode("append")
  .options(Map("table" -> table_name, "keyspace" -> keyspace_name))
  .save()
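The key point is that spark.cassandra.output.ignoreNulls must be passed as an option on the writer itself. With it set to true, the connector treats null columns as unset in the generated inserts instead of writing null cells, so Cassandra records no tombstones for them. One caveat: an insert can then no longer overwrite an existing non-null value with null.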
Source: https://stackoverflow.com/questions/57659876/ignore-nulls-with-data-frame-using-spark-datastax-connector