Question
I am working on an Azure Data Lake. I want to access Cassandra from my PySpark script. I tried:
> pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Ivy Default Cache set to: /home/opnf/.ivy2/cache
The jars for the packages stored in: /home/opnf/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.5.5.0-157/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
anguenot#pyspark-cassandra added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found anguenot#pyspark-cassandra;0.7.0 in spark-packages
found com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 in central
found org.joda#joda-convert;1.2 in central
found commons-beanutils#commons-beanutils;1.9.3 in central
found commons-collections#commons-collections;3.2.2 in central
found com.twitter#jsr166e;1.1.0 in central
found io.netty#netty-all;4.0.33.Final in central
found joda-time#joda-time;2.3 in central
found org.scala-lang#scala-reflect;2.11.8 in central
found net.razorvine#pyrolite;4.10 in central
found net.razorvine#serpent;1.12 in central
:: resolution report :: resolve 710ms :: artifacts dl 33ms
:: modules in use:
anguenot#pyspark-cassandra;0.7.0 from spark-packages in [default]
com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 from central in [default]
com.twitter#jsr166e;1.1.0 from central in [default]
commons-beanutils#commons-beanutils;1.9.3 from central in [default]
commons-collections#commons-collections;3.2.2 from central in [default]
io.netty#netty-all;4.0.33.Final from central in [default]
joda-time#joda-time;2.3 from central in [default]
net.razorvine#pyrolite;4.10 from central in [default]
net.razorvine#serpent;1.12 from central in [default]
org.joda#joda-convert;1.2 from central in [default]
org.scala-lang#scala-reflect;2.11.8 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 11 | 0 | 0 | 0 || 11 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 11 already retrieved (0kB/40ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/anguenot_pyspark-cassandra-0.7.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.datastax.spark_spark-cassandra-connector_2.11-2.0.6.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_pyrolite-4.10.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.joda_joda-convert-1.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/joda-time_joda-time-2.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_serpent-1.12.jar added multiple times to distributed cache.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.2.2.5.5.0-157
/_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>> import pyspark_cassandra
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
Apparently, there is no problem during the loading, but in the end I still cannot import the package. What could be the reason?
Answer 1:
The use of the package is a little different from what the documentation describes.
There is no need to import the package. Instead, to read a table into a DataFrame, use:
sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()
To write, use:
df.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="my_table", keyspace="my_keyspace")\
    .save()
(With mode('overwrite'), you may also have to add .option('confirm.truncate', True).)
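Putting the read and write snippets together, here is a minimal sketch of how the two calls could be wrapped as helpers. The function names (read_cassandra, write_cassandra) and the table/keyspace arguments are hypothetical conveniences, not part of the connector's API; only the format string, the options keys, and confirm.truncate come from the connector itself:

```python
# Data source name registered by the DataStax Spark Cassandra Connector
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_cassandra(spark, table, keyspace):
    """Load a Cassandra table as a DataFrame (hypothetical helper)."""
    return (spark.read
            .format(CASSANDRA_FORMAT)
            .options(table=table, keyspace=keyspace)
            .load())


def write_cassandra(df, table, keyspace, mode="append"):
    """Write a DataFrame to a Cassandra table (hypothetical helper).

    With mode="overwrite" the connector refuses to truncate the target
    table unless confirm.truncate is set, so add it in that case.
    """
    writer = (df.write
              .format(CASSANDRA_FORMAT)
              .mode(mode)
              .options(table=table, keyspace=keyspace))
    if mode == "overwrite":
        writer = writer.option("confirm.truncate", True)
    writer.save()
```

Note that no import of pyspark_cassandra is needed anywhere: the --packages flag only has to put the connector jar on the classpath, and the generic DataFrame reader/writer API does the rest.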
Source: https://stackoverflow.com/questions/49878798/access-cassandra-from-pyspark