I want to use matplotlib.bblpath or shapely.geometry libraries in pyspark.
When I try to import either of them, I get the error below:
>>> from
Is this on a standalone machine (i.e. a laptop/desktop) or in a cluster environment (e.g. AWS EMR)?
If it's on your laptop/desktop, pip install shapely should work just fine. You may need to check your environment variables to see which Python environment pyspark uses by default. For example, if you typically use Python 3 but pyspark runs on Python 2, then a shapely installed for Python 3 would not be available in pyspark.
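To check which interpreter pyspark is actually using, something like the following can be pasted into the pyspark shell. It is only a sketch and assumes Python 3.4+ (for importlib.util.find_spec); describe_python_env is a hypothetical helper name, and PYSPARK_PYTHON is the environment variable Spark reads to pick the Python executable.

```python
import importlib.util
import os
import sys


def describe_python_env(module_name="shapely"):
    """Report the running interpreter and whether module_name can be imported.

    module_name should be a top-level package name such as "shapely".
    """
    return {
        # Full path of the interpreter executing this code
        "executable": sys.executable,
        # Major.minor version, e.g. "3.8"
        "version": "{}.{}".format(*sys.version_info[:2]),
        # None when the variable is not set
        "pyspark_python": os.environ.get("PYSPARK_PYTHON"),
        # True if the package is visible to this interpreter
        "module_found": importlib.util.find_spec(module_name) is not None,
    }
```

Calling describe_python_env() in both your normal Python prompt and the pyspark shell and comparing the "executable" values should show whether the two are using different environments.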
In a cluster environment such as AWS EMR, you can try:
    import os

    def myfun(x):
        # Runs on whichever worker node processes this element
        os.system("pip install shapely")
        return x

    rdd = sc.parallelize([1, 2, 3, 4])  # assuming 4 worker nodes
    # Have the workers run the install command; Spark does not guarantee
    # that each node receives exactly one element
    rdd.map(myfun).collect()
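A variation on the same idea, as a rough sketch: run pip through the interpreter the worker is already using (sys.executable -m pip), so the package lands in the right environment, and return each worker's hostname and exit code so you can see where the install ran. The names pip_command and install_on_partition and the run parameter are hypothetical, introduced only for illustration.

```python
import socket
import subprocess
import sys


def pip_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def install_on_partition(rows, package="shapely", run=subprocess.call):
    """Install package once for this partition and yield (hostname, exit code).

    rows is the partition iterator Spark passes in; it is not consumed.
    run is a hypothetical hook so the command runner can be swapped out.
    """
    exit_code = run(pip_command(package))
    yield socket.gethostname(), exit_code


# Usage inside a pyspark shell, where `sc` already exists:
# results = sc.parallelize(range(4), 4).mapPartitions(install_on_partition).collect()
# print(results)  # one (hostname, exit code) pair per partition
```

As with the map version, Spark decides where partitions run, so the returned hostnames are worth checking to confirm that every node was actually reached.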
"I know the module isn't present, but I want to know how can these packages be brought to my pyspark libraries."
On EMR, if you want pyspark to come preconfigured with whatever other libraries and settings you need, you can use a bootstrap action to make those adjustments when the cluster starts. Aside from that, you can't "add" a library to pyspark itself without compiling Spark in Scala (which would be a pain to do if you're not savvy with SBT).