How do I get Python libraries in pyspark?

闹比i 2020-12-08 09:02

I want to use matplotlib.bblpath or shapely.geometry libraries in pyspark.

When I try to import any of them I get the below error:

>>> from          


        
4 Answers
  •  借酒劲吻你
    2020-12-08 09:36

    This is how I got it working on our AWS EMR cluster (it should work the same way on any other cluster). I created the following shell script and ran it as a bootstrap action:

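    #!/bin/bash
    # Shapely depends on the GEOS C library, so build and install GEOS first.
    wget http://download.osgeo.org/geos/geos-3.5.0.tar.bz2
    tar jxf geos-3.5.0.tar.bz2
    cd geos-3.5.0 && ./configure --prefix=$HOME/geos-bin && make && make install
    # Copy the GEOS shared libraries to a system path (this assumes $HOME is /home/hadoop).
    sudo cp /home/hadoop/geos-bin/lib/* /usr/lib
    # Register the library directories with the dynamic linker and refresh its cache.
    sudo /bin/sh -c 'echo "/usr/lib" >> /etc/ld.so.conf'
    sudo /bin/sh -c 'echo "/usr/lib/local" >> /etc/ld.so.conf'
    sudo /sbin/ldconfig
    # Make the library path available in future shells of the hadoop user.
    sudo /bin/sh -c 'echo -e "\nexport LD_LIBRARY_PATH=/usr/lib" >> /home/hadoop/.bashrc'
    source /home/hadoop/.bashrc
    # Install the Python packages now that GEOS is in place.
    sudo pip install shapely
    echo "Shapely installation complete"
    pip install https://pypi.python.org/packages/74/84/fa80c5e92854c7456b591f6e797c5be18315994afd3ef16a58694e1b5eb1/Geohash-1.0.tar.gz
    exit 0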
    

    Note: Instead of running it as a bootstrap action, you can run this script independently on every node in the cluster. I have tested both scenarios.
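
    Whichever way the script is run, it can help to confirm that the library is importable on the worker nodes and not only on the driver. The following is a minimal sketch of such a check; the helper name `check_import` is hypothetical, and the commented usage line assumes a live `SparkContext` named `sc`.

```python
import importlib
import socket


def check_import(module_name):
    """Try to import a module and report the result for the current host.

    Returns a tuple (hostname, module_name, ok).
    """
    try:
        importlib.import_module(module_name)
        ok = True
    except ImportError:
        ok = False
    return (socket.gethostname(), module_name, ok)


if __name__ == '__main__':
    # Local check on the machine running this script.
    print(check_import('json'))
    # Assumed usage on a cluster, with an existing SparkContext `sc`:
    # sc.parallelize(range(100), 100) \
    #   .map(lambda _: check_import('shapely')).distinct().collect()
```

    Spreading the check over many partitions makes it likely, though not guaranteed, that every executor host reports back at least once.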

    The following sample PySpark and Shapely code (a Spark SQL UDF) confirms that the commands above work as expected:

    Python 2.7.10 (default, Dec  8 2015, 18:25:23) 
    [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
          /_/
    
    Using Python version 2.7.10 (default, Dec  8 2015 18:25:23)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>> from pyspark.sql.functions import udf
    >>> from pyspark.sql.types import StringType
    >>> from shapely.wkt import loads as load_wkt
    >>> def parse_region(region):
    ...     from shapely.wkt import loads as load_wkt
    ...     reverse_coordinate = lambda coord: ' '.join(reversed(coord.split(':')))
    ...     coordinate_list = map(reverse_coordinate, region.split(', '))
    ...     if coordinate_list[0] != coordinate_list[-1]:
    ...         coordinate_list.append(coordinate_list[0])
    ...     return str(load_wkt('POLYGON ((%s))' % ','.join(coordinate_list)).wkt)
    ... 
    >>> udf_parse_region=udf(parse_region, StringType())
    16/09/06 22:18:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
    16/09/06 22:18:34 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    >>> df = sqlContext.sql('select id, bounds from  limit 10')
    >>> df2 = df.withColumn('bounds1', udf_parse_region('bounds'))
    >>> df2.first()
    Row(id=u'0089d43a-1b42-4fba-80d6-dda2552ee08e', bounds=u'33.42838509594465:-119.0533447265625, 33.39170168789402:-119.0203857421875, 33.29992542601392:-119.0478515625', bounds1=u'POLYGON ((-119.0533447265625 33.42838509594465, -119.0203857421875 33.39170168789402, -119.0478515625 33.29992542601392, -119.0533447265625 33.42838509594465))')
    >>> 
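
    The session above runs on Python 2.7, where `map` returns a list. On Python 3, `map` returns an iterator, so indexing `coordinate_list[0]` would fail. Below is a minimal Python 3 sketch of the same coordinate handling. The function name `bounds_to_wkt` is hypothetical, and the sketch only builds the WKT string; it does not validate the geometry with Shapely.

```python
def bounds_to_wkt(region):
    """Convert 'lat:lon, lat:lon, ...' into a closed WKT POLYGON string."""
    # Swap each 'lat:lon' pair into 'lon lat', the order used in the WKT output above.
    # A list comprehension is used so the result can be indexed on Python 3.
    coords = [' '.join(reversed(pair.split(':'))) for pair in region.split(', ')]
    # Close the ring if the last point differs from the first.
    if coords[0] != coords[-1]:
        coords.append(coords[0])
    return 'POLYGON ((%s))' % ', '.join(coords)


if __name__ == '__main__':
    # Made-up sample input in the same 'lat:lon' format as the bounds column.
    print(bounds_to_wkt('1:2, 3:4, 5:6'))
```

    To reproduce the session above, the returned string can be passed through `load_wkt(...)` from `shapely.wkt` and the function registered with `udf(..., StringType())`, as in the original code.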
    

    Thanks, Hussain Bohra
