Read SAS sas7bdat data with Spark

Submitted by 与世无争的帅哥 on 2020-06-16 11:53:06

Question


I have a SAS table that I am trying to read with Spark. I tried to use https://github.com/saurfang/spark-sas7bdat, but I couldn't get it to work.

Here is the code:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.github.saurfang.sas.spark").load("my_table.sas7bdat")

It returns this error:

Py4JJavaError: An error occurred while calling o878.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)

Caused by: java.lang.ClassNotFoundException: com.github.saurfang.sas.spark.DefaultSource
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)...

Any ideas?


Answer 1:


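It looks like the package was not loaded: the ClassNotFoundException means the data source class is not on Spark's classpath. You have to pass --packages saurfang:spark-sas7bdat:2.0.0-s_2.10 when running spark-submit or pyspark, choosing the release whose Scala suffix (s_2.10 here) matches your Spark build. See: https://spark-packages.org/package/saurfang/spark-sas7bdat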

You could also download the JAR file from that page, and run your pyspark or spark-submit command with --jars /path/to/jar
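As a rough sketch of the same idea from inside a Python script, the package coordinate can also be supplied through the spark.jars.packages setting instead of the command line. The helper names sas7bdat_coordinate and build_session are made up for illustration, and the release and Scala version are simply the ones quoted in this answer; adjust them to your own Spark build.

```python
def sas7bdat_coordinate(release, scala_version):
    """Build the coordinate string that --packages expects (hypothetical helper)."""
    return f"saurfang:spark-sas7bdat:{release}-s_{scala_version}"


def build_session(coordinate):
    """Create a SparkSession that pulls in the package (hypothetical helper)."""
    # Imported lazily so the coordinate helper works without pyspark installed.
    from pyspark.sql import SparkSession

    # spark.jars.packages is read when the JVM starts, so this must run
    # before any SparkSession or SparkContext exists in the process.
    return (
        SparkSession.builder
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


# Values taken from the answer above; they are not verified against your cluster.
coordinate = sas7bdat_coordinate("2.0.0", "2.10")
print(coordinate)  # saurfang:spark-sas7bdat:2.0.0-s_2.10

# With pyspark available, the read from the question would then look like:
# spark = build_session(coordinate)
# df = spark.read.format("com.github.saurfang.sas.spark").load("my_table.sas7bdat")
```

If a session is already running (for example in a notebook), the setting is likely to be ignored, in which case the --packages or --jars flags are the safer route.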




Answer 2:


I tried the two methods above, but they didn't work for me: the data frame was not accessible even for df.count(), which threw an error. My data frame was 5768 x 6432.

Solution: convert the sas7bdat file into a flat file (CSV or txt) with a delimiter of your choice. I used txt with a pipe delimiter because my data could contain commas.
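The answer does not say which tool was used for the conversion, so the following is only a minimal sketch of the output side: writing rows to a pipe-delimited text file with Python's csv module. The function name write_pipe_delimited and the sample rows are invented for illustration; in practice the header and rows would come from whatever SAS reader or export you use.

```python
import csv
import os
import tempfile


def write_pipe_delimited(header, rows, out_path):
    """Write a header and rows as pipe-delimited text (hypothetical helper)."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        # Fields containing commas need no quoting because "|" is the delimiter;
        # a field containing "|" itself would be wrapped in double quotes.
        writer = csv.writer(f, delimiter="|", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


# Made-up sample data standing in for the exported SAS table.
out_path = os.path.join(tempfile.mkdtemp(), "SAS_DATA.txt")
write_pipe_delimited(["name", "city"], [["Smith, John", "Paris"]], out_path)

with open(out_path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

The comma inside "Smith, John" stays in a single field, which is the reason given for choosing the pipe delimiter.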

Read the sas7bdat file and use it to get the schema:

df = spark.read.format("com.github.saurfang.sas.spark").load("PATH/SAS_DATA.sas7bdat")
vartype = df.schema

Now pass this schema when reading the txt file:

df2 = spark.read.format('csv').option('header','True').option('delimiter','|').schema(vartype).load("path/SAS_DATA.txt")

This works for me.



Source: https://stackoverflow.com/questions/51949414/read-sas-sas7bdat-data-with-spark
