Add an external Python library in PySpark

I'm using PySpark (1.6) and I want to use the databricks:spark-csv library. I've tried several approaches without success.

1- First, I added a jar I had downloaded and ran:

pyspark --jars THE_NAME_OF_THE_JAR
df = sqlContext.read.format('com.databricks:spark-csv').options(header='true', inferschema='true').load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv')

But I got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/", line 813, in __call__
  File "/usr/hdp/", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks:spark-csv. Please find packages at
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at py4j.reflection.MethodInvoker.invoke(
    at py4j.reflection.ReflectionEngine.invoke(
    at py4j.Gateway.invoke(
    at py4j.commands.AbstractCommand.invokeMethod(
    at py4j.commands.CallCommand.execute(
Caused by: java.lang.ClassNotFoundException: com.databricks:spark-csv.DefaultSource
    at java.lang.ClassLoader.loadClass(
    at java.lang.ClassLoader.loadClass(
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    ... 14 more

2- Second, I downloaded a zip file of the library and ran:

/bin/pyspark --py-files
df = sqlContext.read.format('com.databricks:spark-csv').options(header='true', inferschema='true').load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv')

But I got the same error.

3- Third, I tried:

pyspark --packages com.databricks:spark-csv_2.11:1.5.0

But that doesn't work either. I got this:

Python 2.7.13 |Anaconda 4.3.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: and
Ivy Default Cache set to: /home/F18076/.ivy2/cache
The jars for the packages stored in: /home/F18076/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]


Spark 1.6 includes the spark-csv module, so you don't need any external libraries.


Actually, as far as I remember, you just need to put the jar file in the folder from which you run pyspark. Then you can run your code:

df = (sqlContext.read.format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv'))

So download the jar file first. When I worked with Apache Spark 1.6.1, I used this version: spark-csv_2.10-1.4.0.jar, because that Spark build uses Scala 2.10.
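
One thing worth checking in the question's code: the string passed to format() is a data source name, not the Maven coordinate used to fetch the jar. Spark appends .DefaultSource to that name and tries to load the resulting class, which is why the traceback mentions com.databricks:spark-csv.DefaultSource. A minimal sketch, assuming the jar is already on the classpath; load_csv is a hypothetical helper, not part of Spark or spark-csv:

```python
# Data source name understood by format(); it is the Scala package of spark-csv,
# written with dots. The colon form (group:artifact) is only for --packages.
SPARK_CSV_SOURCE = 'com.databricks.spark.csv'


def load_csv(sql_context, path):
    """Read a CSV file that has a header row through spark-csv (hypothetical helper)."""
    return (sql_context.read
            .format(SPARK_CSV_SOURCE)
            .options(header='true', inferSchema='true')
            .load(path))
```

In the pyspark 1.6 shell, this would be called as load_csv(sqlContext, '/path/to/file.csv').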


For me, using Spark 1.6.3, the following works:

pyspark --packages com.databricks:spark-csv_2.10:1.5.0

After running the above, the console output includes:

com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.10;1.5.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central

Note that unless you specifically built Spark 1.x against Scala 2.11 (and you would know if you did), you need to use spark-csv_2.10:1.5.0, not spark-csv_2.11:1.5.0.
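
The artifact name encodes the Scala binary version, so the coordinate can be derived from it. A small sketch; spark_csv_coordinate is a hypothetical helper, and the default version 1.5.0 is simply the one used above:

```python
def spark_csv_coordinate(scala_binary_version, version='1.5.0'):
    """Build the group:artifact:version string expected by --packages."""
    return 'com.databricks:spark-csv_{0}:{1}'.format(scala_binary_version, version)


# A Spark 1.x build that was not specifically compiled for Scala 2.11 uses 2.10.
print('pyspark --packages ' + spark_csv_coordinate('2.10'))
```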

If you don't want to have to add --packages com.databricks:spark-csv_2.10:1.5.0 every time you invoke pyspark, you can also configure the packages in $SPARK_HOME/conf/spark-defaults.conf (you may need to create the file if you've never set anything there before) by adding the following:

spark.jars.packages               com.databricks:spark-csv_2.10:1.5.0
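
If you prefer to script that change, here is a minimal sketch that appends the setting only when the key is not already present. ensure_packages_setting is a hypothetical helper, and it assumes the file uses whitespace-separated keys and values as in the line above:

```python
import os


def ensure_packages_setting(conf_path, coordinate):
    """Append spark.jars.packages to conf_path unless the key is already set.

    Returns True if the file was modified, False otherwise.
    Simplification: only whitespace-separated "key value" lines are recognised.
    """
    key = 'spark.jars.packages'
    content = ''
    if os.path.exists(conf_path):
        with open(conf_path) as f:
            content = f.read()
    for line in content.splitlines():
        fields = line.split()
        if fields and fields[0] == key:
            return False  # already configured; leave the existing value alone
    with open(conf_path, 'a') as f:
        # Avoid gluing the new entry onto an unterminated last line.
        if content and not content.endswith('\n'):
            f.write('\n')
        f.write('{0}    {1}\n'.format(key, coordinate))
    return True
```

On a real installation it might be called as ensure_packages_setting(os.path.join(os.environ['SPARK_HOME'], 'conf', 'spark-defaults.conf'), 'com.databricks:spark-csv_2.10:1.5.0').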

Finally, it used to be the case that with older versions of Spark 1.x (I think 1.4 and 1.5 at least), you could just set the environment variable PYSPARK_SUBMIT_ARGS, e.g.:

export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.5.0 pyspark-shell"

and then invoking pyspark would add the desired dependencies automatically. However, this no longer works in Spark 1.6.3.

None of this is necessary for Spark 2.x, as spark-csv has been inlined into Spark 2.
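
For comparison, a sketch of the Spark 2.x equivalent, assuming a SparkSession named spark as provided by the Spark 2.x pyspark shell; read_csv_spark2 is a hypothetical wrapper:

```python
def read_csv_spark2(spark, path):
    """Read a CSV file with the built-in Spark 2.x reader; no --packages needed."""
    return spark.read.csv(path, header=True, inferSchema=True)
```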

