I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I'm trying to replace the null or invalid values present in a column.
To replace the null values with a given string, I've used the fill function in Spark's Java API. It accepts the replacement value and a sequence of column names. Here is how I implemented it:
List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data=data.na().fill(word, colSeq);
You can use the .na.fill function (it is defined in org.apache.spark.sql.DataFrameNaFunctions).
Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame
You choose the columns and the value that should replace null (the numeric overloads also replace NaN).
In your case it will be something like:
val df2 = df.na.fill("a", Seq("Name"))
.na.fill("a2", Seq("Place"))
You can use DataFrame.na.fill() to replace null with some value.
To update several columns at once, pass a map from column name to replacement value:
val map = Map("Name" -> "a", "Place" -> "a2")
df.na.fill(map).show()
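If you also want to replace bad records, you first need to identify them, because fill only handles null (and NaN) values. You can do this by matching the column against a pattern, for example a regular expression with the rlike function or a SQL pattern with the like function.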
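The validate-first idea can be sketched per value in plain Java, without Spark: a value is kept only when it is non-null and fully matches a validity pattern, otherwise a default is substituted. The class name ValueCleaner, the pattern, and the replacement string are all hypothetical and chosen for illustration; which values count as "bad" depends on your data.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: models the per-value rule
// "keep the value if it is valid, otherwise substitute a default".
public class ValueCleaner {
    private final Pattern valid;
    private final String replacement;

    public ValueCleaner(String validRegex, String replacement) {
        this.valid = Pattern.compile(validRegex);
        this.replacement = replacement;
    }

    // Returns the replacement for null values and for values that do not
    // fully match the validity pattern; returns the input unchanged otherwise.
    public String clean(String value) {
        if (value == null || !valid.matcher(value).matches()) {
            return replacement;
        }
        return value;
    }

    public static void main(String[] args) {
        // Assumed rule: a name consists of letters and spaces only.
        ValueCleaner names = new ValueCleaner("[A-Za-z ]+", "a");
        System.out.println(names.clean("Alice")); // valid, kept as-is
        System.out.println(names.clean(null));    // null, replaced
        System.out.println(names.clean("#N/A"));  // invalid, replaced
    }
}
```

In Spark the same rule would be applied to a whole column rather than to single values, for example by combining a Column.rlike check with functions.when(...).otherwise(...); that part is not shown here.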
You'll want to use the fill(String value, String[] columns) method, available through your dataframe's na() functions, which replaces null values in the given columns with the value you specify.
So if you already know the value that you want to replace Null with...:
String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);
You can do the same for the rest of your columns.