Apache Spark throws NullPointerException when encountering missing feature
I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file:

```
x0,x1,x2,x3
asd2s,1e1e,1.1,0
asd2s,1e1e,0.1,0
,1e3e,1.2,0
bd34t,1e1e,5.1,1
asd2s,1e3e,0.2,0
bd34t,1e2e,4.3,1
```

where I have one missing value for 'x0'. First, I read the features from the csv file into a DataFrame using pyspark_csv (https://github.com/seahboonsiew/pyspark-csv), then index x0 with StringIndexer:

```python
import pyspark_csv as pycsv
from pyspark.ml.feature import StringIndexer

sc.addPyFile('pyspark_csv.py')

features = pycsv.csvToDataFrame(sqlCtx, sc.textFile('tmp.csv'))
indexer =
```
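One workaround I have been considering (a sketch only, not tested against my Spark setup): since StringIndexer appears to choke on the null produced by the empty 'x0' field, I could substitute a placeholder token into the raw CSV lines before parsing. The `fill_missing_x0` helper and the `__MISSING__` placeholder below are my own hypothetical names, shown here in plain Python on the same data:

```python
# Sketch of a workaround: replace an empty first field (x0) with a
# placeholder string before the CSV is parsed, so StringIndexer never
# sees a null. Names here (fill_missing_x0, __MISSING__) are illustrative.
raw = """x0,x1,x2,x3
asd2s,1e1e,1.1,0
asd2s,1e1e,0.1,0
,1e3e,1.2,0
bd34t,1e1e,5.1,1
asd2s,1e3e,0.2,0
bd34t,1e2e,4.3,1"""

def fill_missing_x0(line, placeholder="__MISSING__"):
    # If the first field (x0) is empty, substitute the placeholder.
    fields = line.split(",")
    if fields[0] == "":
        fields[0] = placeholder
    return ",".join(fields)

cleaned = [fill_missing_x0(l) for l in raw.splitlines()]
print(cleaned[3])  # the row that originally had an empty x0
```

In Spark this could presumably be applied as `sc.textFile('tmp.csv').map(fill_missing_x0)` before handing the RDD to `csvToDataFrame`, though I would still prefer a way to make StringIndexer tolerate nulls directly.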