Question
I am loading a CSV file with 1 million records using PySpark, but I am getting this error: TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000)
I checked whether any record in the file has more than 1,000,000 characters, but none does; the maximum record length in my file is 850 characters.
Please help....
CODE SNIPPET:
input_df = spark.read.format('com.databricks.spark.csv').option("delimiter","\001").option("quote",u"\u0000").load(INPUT_PATH)
input_df.write.mode('overwrite').format('orc').save(TARGET_LOC)
SAMPLE DATA
A B C
-- -- --
a xyz"a 123
b pqr 456
c ABC"z 789
Answer 1:
You can change the parser setting that limits the number of characters per column by using
option("maxCharsPerColumn", "-1")
so the read should now work as:
spark.read.format('com.databricks.spark.csv').option("delimiter","\001").option("quote",u"\u0000").option("maxCharsPerColumn", "-1").load(INPUT_PATH)
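Putting that option together with the ORC write from the question, a minimal end-to-end sketch might look like the following (the SparkSession setup and the two path variables are assumptions standing in for your actual environment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_orc").getOrCreate()

INPUT_PATH = "/path/to/input.csv"   # hypothetical input location
TARGET_LOC = "/path/to/output_orc"  # hypothetical output location

# maxCharsPerColumn=-1 removes the parser's per-column character limit entirely
input_df = (spark.read.format('com.databricks.spark.csv')
            .option("delimiter", "\001")
            .option("quote", u"\u0000")
            .option("maxCharsPerColumn", "-1")
            .load(INPUT_PATH))

input_df.write.mode('overwrite').format('orc').save(TARGET_LOC)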
Otherwise, you can also try switching the parser library:
.option("parserLib", "commons")
Source: https://stackoverflow.com/questions/49108541/pyspark-textparsingexception-while-loading-a-file