How do I get Python libraries in pyspark?

Asked by 闹比i on 2020-12-08 09:02

I want to use matplotlib.bblpath or shapely.geometry libraries in pyspark.

When I try to import any of them I get the below error:

    >>> from shapely.geometry import Polygon
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named shapely.geometry


        
4 Answers
  •  不思量自难忘° · 2020-12-08 09:20

    Is this on standalone (i.e. laptop/desktop) or in a cluster environment (e.g. AWS EMR)?

    1. If on your laptop/desktop, pip install shapely should work just fine. You may need to check your environment variables for your default Python environment(s). For example, if you typically use Python 3 but use Python 2 for pyspark, shapely would not be available to pyspark (a quick way to check which interpreters are actually in play is sketched after this list).

    2. If in a cluster environment such as AWS EMR, you can try:

      import os

      def myfun(x):
          # runs on whichever executor picks up this task
          os.system("pip install shapely")
          return x

      rdd = sc.parallelize([1, 2, 3, 4], 4)  # 4 partitions, assuming 4 worker nodes
      rdd.map(myfun).collect()  # forces the install to run out on the executors
      
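    For point 1, a quick way to see which interpreters are actually in play is the minimal sketch below (assuming a running pyspark shell, where sc is already defined). It prints the Python executable used by the driver and by the executors; a mismatch would explain why a module imports in your regular shell but not in pyspark.

      import sys

      print(sys.executable)  # interpreter the driver is running on

      # interpreters the executors are running on
      # (the 2-element RDD here is just for illustration)
      print(sc.parallelize([0, 1]).map(lambda _: sys.executable).distinct().collect())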

    "I know the module isn't present, but I want to know how can these packages be brought to my pyspark libraries."

    On EMR, if you want pyspark to be pre-prepared with whatever other libraries and configurations you need, you can use a bootstrap action to make those adjustments when the cluster launches (a sketch follows below). Aside from that, you can't "add" a library to pyspark without compiling Spark in Scala (which would be a pain to do if you're not savvy with SBT).
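    A bootstrap action is just a script in S3 that EMR runs on every node while the cluster is starting. Here is a minimal sketch of wiring one up with boto3; the bucket path, instance types, region, and release label are hypothetical placeholders, and the referenced script would be something like a one-line sudo pip install shapely.

      import boto3

      emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

      emr.run_job_flow(
          Name="pyspark-with-shapely",
          ReleaseLabel="emr-5.30.0",         # example release label
          Applications=[{"Name": "Spark"}],
          Instances={
              "MasterInstanceType": "m5.xlarge",
              "SlaveInstanceType": "m5.xlarge",
              "InstanceCount": 4,
              "KeepJobFlowAliveWhenNoSteps": True,
          },
          # the script must already exist in S3; this path is a placeholder
          BootstrapActions=[{
              "Name": "install python deps",
              "ScriptBootstrapAction": {"Path": "s3://my-bucket/install-deps.sh"},
          }],
          JobFlowRole="EMR_EC2_DefaultRole",
          ServiceRole="EMR_DefaultRole",
      )

    Every node runs the script before Spark starts, so the libraries are importable from the very first pyspark session rather than being installed mid-job.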
