Apache Spark Dataframe - Load data from nth line of a CSV file

I would like to process a huge order CSV file (5 GB) that has some metadata rows at the start of the file. The header columns are in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,".

m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100

Is it possible to skip a specified number of rows when loading the file and still use the 'inferSchema' option for the Dataset?

Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("/home/user/data/20170326.csv");

Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?

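You can try setting the "comment" option to "m", which tells the CSV reader to skip every line that begins with the character "m". The "h," row is then the first line that is read, so with "header" set to "true" the column names are taken from it.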

Dataset<Row> df = spark.read()
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("comment", "m")
          .load("/home/user/data/20170326.csv");
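
To see which lines survive the "comment" setting, here is a small plain-Java sketch that applies the same rule to a shortened copy of the sample rows from the question. It does not use Spark, and it only approximates what the CSV reader does. The class name CommentSkipSketch and the helper dropCommentLines are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CommentSkipSketch {

    // Hypothetical helper: keeps only the lines that do not start with the
    // comment character, a simplified imitation of the "comment" option.
    public static List<String> dropCommentLines(List<String> lines, char comment) {
        return lines.stream()
                .filter(line -> line.isEmpty() || line.charAt(0) != comment)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened version of the sample file from the question.
        List<String> sample = Arrays.asList(
                "m,Version,v1.0",
                "m,Type,xx",
                "h,Col1,Col2,Col3",
                "m,Mandatory,Optional,Optional",
                "d,Val1,Val2,Val3");

        List<String> kept = dropCommentLines(sample, 'm');

        // With header=true, the first surviving line supplies the column names.
        System.out.println("header: " + kept.get(0));
        System.out.println("data:   " + kept.subList(1, kept.size()));
    }
}
```

Note that the marker column is still there: the first column ends up being named "h" and contains "d" in every data row. If it is not needed, it can be removed afterwards with df.drop("h").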

Source: https://stackoverflow.com/questions/43029020/apache-spark-dataframe-load-data-from-nth-line-of-a-csv-file

Tags

apache-spark

apache-spark-sql

spark-dataframe

apache-spark-2.0
