Spark Scala - java.util.NoSuchElementException & Data Cleaning

Submitted by 白昼怎懂夜的黑 on 2019-12-19 18:36:57

Question


I have had a similar problem before, but this time I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores for e-mails. Sometimes sentiment() crashes on an input (perhaps it is too long, or perhaps it contains an unexpected character). It does not tell me that it failed on those instances; it just returns the Column sentiment('email). The failure only surfaces when I try to show() beyond a certain point or save() my data frame: I then get a java.util.NoSuchElementException, presumably because sentiment() returned nothing for that row.

My initial code loads the data and applies sentiment() as shown in the spark-corenlp API:

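import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.databricks.spark.corenlp.functions._  // provides sentiment()
import sqlContext.implicits._                    // enables the 'column symbol syntax

// Schema of the input table
val customSchema = StructType(Array(
  StructField("contactId", StringType, true),
  StructField("email", StringType, true)
))

// Load dataframe
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")         // Delimiter is tab
  .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
  .schema(customSchema)              // Schema of the table
  .load("emails")                    // Input file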


// Add sentiment analysis output to dataframe
val sent = df.select('contactId, sentiment('email).as('sentiment))

I tried to filter out null, NaN, and out-of-range values:

import org.apache.spark.sql.functions.col

val sentFiltered = sent
  .filter('sentiment.isNotNull)
  .filter(!'sentiment.isNaN)
  .filter(col("sentiment").between(0, 4))

I even tried to do it with a SQL query:

sent.registerTempTable("sent")
val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")
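
One possible reason these filters do not help, offered as an assumption rather than a confirmed diagnosis: if the exception is thrown inside sentiment() while a row is being evaluated, no value (not even null) is ever produced for that row, so a later filter never gets the chance to drop it. The plain-Scala sketch below imitates this with a hypothetical `score` function standing in for sentiment(); it is not the spark-corenlp implementation, and the sample inputs are made up.

```scala
import scala.util.Try

object LazyFilterDemo {
  // Hypothetical stand-in for sentiment(): fails on empty input,
  // because .head is called on an empty list of words.
  def score(text: String): Int =
    text.split("\\s+").toList.filter(_.nonEmpty).map(_.length % 5).head

  // Made-up sample inputs; the second one has no words at all.
  val emails = Seq("hello there", "", "thanks a lot")

  // Lazy pipeline: nothing is computed until the result is forced,
  // much like a DataFrame before show() or save().
  val filtered = emails.view.map(score).filter(s => s >= 0 && s <= 4)

  // Forcing the pipeline surfaces the failure raised inside score();
  // the range filter never sees a value for the failing input.
  val outcome: Try[List[Int]] = Try(filtered.toList)
}
```

Defining `filtered` succeeds without error, just as defining `sent` does. The exception appears only when the pipeline is forced.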

I don't know which input is making spark-corenlp crash. How can I find out? Failing that, how can I filter these non-existent values out of col("sentiment")? Or should I catch the exception and skip the row? Is that even possible?
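
As a rough illustration of the "catch the exception and skip the row" idea, the sketch below wraps a per-row scoring function in scala.util.Try so that a failure becomes None instead of aborting the whole computation, and keeps the failing inputs for inspection. Here `score`, `safe`, and the sample inputs are hypothetical names and data introduced for the example; `score` only mimics a function that throws on some inputs and is not the spark-corenlp implementation.

```scala
import scala.util.{Failure, Try}

object SafeScoring {
  // Hypothetical stand-in for the real scoring call; throws on empty input.
  def score(text: String): Int =
    text.split("\\s+").toList.filter(_.nonEmpty).map(_.length % 5).head

  // Wrap any String => Int so that an exception becomes None.
  def safe(f: String => Int): String => Option[Int] =
    text => Try(f(text)).toOption

  // Made-up sample inputs; the second one triggers the failure.
  val emails = Seq("hello there", "", "thanks a lot")

  // Keep each input next to its outcome so failing inputs can be inspected.
  val attempts: Seq[(String, Try[Int])] = emails.map(e => e -> Try(score(e)))

  // Inputs whose scoring threw an exception.
  val failedInputs: Seq[String] = attempts.collect { case (e, Failure(_)) => e }

  // Scores for the inputs that succeeded; failed rows are silently dropped.
  val scores: Seq[Int] = emails.flatMap(e => safe(score)(e))
}
```

In Spark, the same pattern would mean registering the wrapped function as a UDF that returns Option[Int], which Spark represents as a nullable column, so an isNotNull filter could then drop the failed rows. Whether the scoring logic behind sentiment() can be called in a form that allows this wrapping is not established by the question.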

Source: https://stackoverflow.com/questions/38230285/spark-scala-java-util-nosuchelementexception-data-cleaning
