Question
I am working on an Azure Data Lake. I want to access Cassandra from my PySpark script. I tried:
> pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Ivy Default Cache set to: /home/opnf/.ivy2/cache
The jars for the packages stored in: /home/opnf/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.5.5.0-157/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
anguenot#pyspark-cassandra added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found anguenot#pyspark-cassandra;0.7.0 in spark-packages
found com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 in central
found org.joda#joda-convert;1.2 in central
found commons-beanutils#commons-beanutils;1.9.3 in central
found commons-collections#commons-collections;3.2.2 in central
found com.twitter#jsr166e;1.1.0 in central
found io.netty#netty-all;4.0.33.Final in central
found joda-time#joda-time;2.3 in central
found org.scala-lang#scala-reflect;2.11.8 in central
found net.razorvine#pyrolite;4.10 in central
found net.razorvine#serpent;1.12 in central
:: resolution report :: resolve 710ms :: artifacts dl 33ms
:: modules in use:
anguenot#pyspark-cassandra;0.7.0 from spark-packages in [default]
com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 from central in [default]
com.twitter#jsr166e;1.1.0 from central in [default]
commons-beanutils#commons-beanutils;1.9.3 from central in [default]
commons-collections#commons-collections;3.2.2 from central in [default]
io.netty#netty-all;4.0.33.Final from central in [default]
joda-time#joda-time;2.3 from central in [default]
net.razorvine#pyrolite;4.10 from central in [default]
net.razorvine#serpent;1.12 from central in [default]
org.joda#joda-convert;1.2 from central in [default]
org.scala-lang#scala-reflect;2.11.8 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 11 | 0 | 0 | 0 || 11 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 11 already retrieved (0kB/40ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/anguenot_pyspark-cassandra-0.7.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.datastax.spark_spark-cassandra-connector_2.11-2.0.6.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_pyrolite-4.10.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.joda_joda-convert-1.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/joda-time_joda-time-2.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_serpent-1.12.jar added multiple times to distributed cache.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.2.2.5.5.0-157
/_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>> import pyspark_cassandra
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
Apparently, there is no problem during the loading, but in the end I still cannot import the package. What could be the reason?
Answer 1:
The use of the package is a little different from what the documentation describes.
There is no need to import the package. Instead, to read a table into a DataFrame, use:
sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()
To write, use:
df.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="my_table", keyspace="my_keyspace")\
    .save()
(With mode('overwrite'), you may also have to add .option('confirm.truncate', True).)
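Putting the read and write snippets together, here is a minimal sketch of how the two calls could be wrapped as helpers. The function names (read_cassandra, write_cassandra) and the table/keyspace arguments are hypothetical conveniences, not part of the connector's API; only the format string, the options keys, and confirm.truncate come from the connector itself:

```python
# Data source name registered by the DataStax Spark Cassandra Connector
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_cassandra(spark, table, keyspace):
    """Load a Cassandra table as a DataFrame (hypothetical helper)."""
    return (spark.read
            .format(CASSANDRA_FORMAT)
            .options(table=table, keyspace=keyspace)
            .load())


def write_cassandra(df, table, keyspace, mode="append"):
    """Write a DataFrame to a Cassandra table (hypothetical helper).

    With mode="overwrite" the connector refuses to truncate the target
    table unless confirm.truncate is set, so add it in that case.
    """
    writer = (df.write
              .format(CASSANDRA_FORMAT)
              .mode(mode)
              .options(table=table, keyspace=keyspace))
    if mode == "overwrite":
        writer = writer.option("confirm.truncate", True)
    writer.save()
```

Note that no import of pyspark_cassandra is needed anywhere: the --packages flag only has to put the connector jar on the classpath, and the generic DataFrame reader/writer API does the rest.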
Source: https://stackoverflow.com/questions/49878798/access-cassandra-from-pyspark