I have a SparkSQL DataFrame.
Some entries in this data are empty, but they don't behave like NULL or NA. How could I remove them? Any ideas?
In R I can easily remove them, but I haven't found a way to do that in SparkR.
It is not the nicest workaround, but if you cast the column to string, missing values are stored as "NaN" and you can filter them out. A short example:
# Cast column b to string: missing values become the literal "NaN"
testFrame <- createDataFrame(sqlContext, data.frame(a = c(1, 2, 3), b = c(1, NA, 3)))
testFrame$c <- cast(testFrame$b, "string")
# Keep only rows whose string value is not "NaN", then drop the helper column
resultFrame <- collect(filter(testFrame, testFrame$c != "NaN"))
resultFrame$c <- NULL
This drops every row whose value in column b is missing.
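If you need the same trick for several columns, the steps can be wrapped in a small helper. This is just a sketch of the workaround above; dropMissingRows and nanFlag are hypothetical names:

# Hypothetical helper: cast the target column to string, filter out "NaN",
# collect locally, then remove the temporary flag column
dropMissingRows <- function(df, colName) {
  df$nanFlag <- cast(df[[colName]], "string")
  localDF <- collect(filter(df, df$nanFlag != "NaN"))
  localDF$nanFlag <- NULL  # plain data.frame column removal after collect
  localDF
}

dropMissingRows(testFrame, "b")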
SparkR's Column class provides a long list of useful methods, including isNull and isNotNull:
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
  Id Age
1  1  21
2  2  18
3  3  30
4  4  NA
> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA
Please keep in mind that there is no distinction between NA and NaN in SparkR.
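If you want to see how your installation treats the two, a quick check like the following makes the behavior explicit (a sketch; the column name a is arbitrary):

# Put an explicit NA and an explicit NaN side by side
mixed_local <- data.frame(a = c(1, NA, NaN))
mixed <- createDataFrame(sqlContext, mixed_local)
# With no NA/NaN distinction, both missing rows behave the same here
collect(filter(mixed, isNull(mixed$a)))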
If you prefer operations on the whole data frame, there is a set of NA functions, including fillna and dropna:
> fillna(people, 99) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
4  4  99
> dropna(people) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
Both can be restricted to a subset of columns via the cols argument (see the sketch after the example below), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null columns with minNonNulls:
> people_with_names_local <- data.frame(
Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>
> dropna(people_with_names, minNonNulls = 2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
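To restrict either function to particular columns, pass cols; a sketch with the same data, with the expected effects noted in comments:

# Consider only Age: row 4 (Age = NA) is dropped,
# while row 2 (missing Name but present Age) is kept
dropna(people_with_names, cols = "Age") %>% head()

# fillna can be limited the same way: replace missing Ages with 99
# and leave the Name column untouched
fillna(people_with_names, 99, cols = "Age") %>% head()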