pyspark

pyspark in IPython notebook raises Py4JNetworkError

断了今生、忘了曾经 submitted on 2020-01-02 04:54:06

Question: I was using an IPython notebook to run PySpark by just adding the following to the notebook:

```python
import os
os.chdir('../data_files')
import sys
import pandas as pd
%pylab inline
from IPython.display import Image

os.environ['SPARK_HOME'] = "spark-1.3.1-bin-hadoop2.6"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.8.2.1-src.zip'))
from
```
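The excerpt cuts off at its final import. A minimal sketch of how such a notebook setup typically continues, assuming a local Spark 1.x install at the SPARK_HOME shown above (the names and settings below are illustrative, not from the original question):

```python
from pyspark import SparkConf, SparkContext

# Create a local SparkContext for the notebook session. If a Py4JNetworkError
# is raised here, the Python side could not talk to the JVM gateway, so the
# notebook server's console output is a common place to look for the Java error.
conf = SparkConf().setMaster("local[*]").setAppName("notebook")
sc = SparkContext(conf=conf)
print(sc.version)
```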

How to copy and convert parquet files to csv

两盒软妹~` submitted on 2020-01-02 04:45:27

Question: I have access to an HDFS file system and can see parquet files with `hadoop fs -ls /user/foo`. How can I copy those parquet files to my local system and convert them to CSV so I can use them? The files should be simple text files with a number of fields per row.

Answer 1: Try

```
var df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
```

Relevant API documentation: pyspark.sql.DataFrameReader.parquet, pyspark.sql.DataFrameWriter.csv. Both /path/to/infile.parquet and /path/to
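A hedged PySpark equivalent of the answer's snippet (the paths are placeholders; writing to a file:// URI assumes the writing task runs somewhere that local directory is reachable, and coalesce(1) is only sensible when the data fits on one machine):

```python
# Sketch: read the parquet data from HDFS and write it back out as CSV with a header.
df = spark.read.parquet("hdfs:///user/foo/infile.parquet")
(df.coalesce(1)               # single output file instead of one part-file per partition
   .write
   .option("header", "true")
   .csv("file:///tmp/outfile_csv"))
```

Another route is to copy the files down first with hadoop fs -get and convert them locally, which avoids writing through Spark altogether.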

PySpark: creating a timestamp column

本小妞迷上赌 submitted on 2020-01-02 02:55:27

Question: I am using Spark 2.1.0 and am not able to create a timestamp column in PySpark. I am using the code snippet below:

```python
df = df.withColumn('Age', lit(datetime.now()))
```

I am getting "AssertionError: col should be Column". Please help.

Answer 1: Assuming you have the dataframe from your code snippet and you want the same timestamp for all your rows, let me first create a dummy dataframe:

```python
>>> dict = [{'name': 'Alice', 'age': 1}, {'name': 'Again', 'age': 2}]
>>> df = spark.createDataFrame(dict)
>>> import time
>>> import
```
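Passing a Python datetime object straight to lit() is what trips the "col should be Column" assertion shown above in Spark 2.1. A hedged sketch of two common workarounds (not necessarily what the truncated answer goes on to do):

```python
from datetime import datetime
from pyspark.sql.functions import current_timestamp, lit

# Option 1: let Spark stamp every row with the query's current timestamp.
df = df.withColumn('Age', current_timestamp())

# Option 2: pass the Python datetime as a string literal and cast it to timestamp.
df = df.withColumn('Age', lit(datetime.now().strftime('%Y-%m-%d %H:%M:%S')).cast('timestamp'))
```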

SparkUI for pyspark - corresponding line of code for each stage?

懵懂的女人 submitted on 2020-01-02 02:55:09

Question: I have a PySpark program running on an AWS cluster and I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike Scala or Java Spark programs, where the UI shows which line of code each stage corresponds to, I can't tell which stage corresponds to which line of my PySpark code. Is there a way to figure out which stage corresponds to which line of the PySpark code? Thanks!

Source: https://stackoverflow.com/questions/38315344/sparkui-for-pyspark
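No answer is included in this excerpt. One hedged workaround (my assumption, not something the question confirms) is to tag jobs from the driver so the Spark UI shows a readable description next to the stages each section of PySpark code triggers:

```python
# Sketch: label the jobs produced by each section of code so the UI's job and
# stage pages carry a recognizable description instead of only the internal
# call site. The pipeline below is a placeholder.
sc.setJobGroup("load", "read input and drop incomplete rows")
clean = spark.read.parquet("hdfs:///data/in").dropna()
clean.count()                            # runs under the "load" group

sc.setJobGroup("aggregate", "count rows per key")
clean.groupBy("key").count().collect()   # runs under the "aggregate" group
```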

psutil in Apache Spark

╄→гoц情女王★ submitted on 2020-01-02 02:28:28

Question: I'm using PySpark 1.5.2. After I issue the command `.collect()`, I get the UserWarning "Please install psutil to have better support with spilling". Why is this warning shown? How can I install psutil?

Answer 1:

```
pip install psutil
```

If you need to install specifically for Python 2 or 3, try using pip2 or pip3; it works for both major versions. Here is the PyPI package for psutil.

Answer 2: You can clone or download the psutil project from the following link: https://github.com/giampaolo/psutil.git then run setup.py
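Since the warning is raised by PySpark's shuffle code, which runs in the worker Python processes, psutil generally needs to be installed on every executor node, not only on the machine where pip is run. A hedged sketch for checking that across a cluster (names are illustrative):

```python
# Sketch: run a tiny job that tries to import psutil inside each executor's
# Python process and reports the result back to the driver.
def check_psutil(_):
    try:
        import psutil
        yield "psutil " + psutil.__version__
    except ImportError:
        yield "psutil missing"

print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .mapPartitions(check_psutil)
        .distinct()
        .collect())
```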

Load a parquet file and keep the same number of HDFS partitions

只谈情不闲聊 submitted on 2020-01-02 00:32:09

Question: I have a parquet file /df saved in HDFS with 120 partitions; the size of each partition on HDFS is around 43.5 M. Total size:

```
hdfs dfs -du -s -h /df
5.1 G  15.3 G  /df

hdfs dfs -du -h /df
43.6 M  130.7 M  /df/pid=0
43.5 M  130.5 M  /df/pid=1
...
43.6 M  130.9 M  /df/pid=119
```

I want to load that file into Spark and keep the same number of partitions. However, Spark automatically loads the file into 60 partitions:

```python
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
```

HDFS settings: 'parquet
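No answer is included in this excerpt. A hedged explanation consistent with the numbers above (my reading, not a confirmed one): the DataFrame reader packs input files into splits of at most spark.sql.files.maxPartitionBytes (128 MB by default), and two ~43.5 M files plus their open cost fit under that limit while three do not, which yields 120 / 2 = 60 partitions. A sketch of two ways to get 120 partitions back:

```python
# Sketch 1: shrink the split size so only one ~43.5 M file fits per partition
# (a value between one and two file sizes; 64 MB is an illustrative choice).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
df = spark.read.parquet('df')
print(df.rdd.getNumPartitions())   # should now report 120

# Sketch 2: repartition explicitly after the read (costs an extra shuffle).
df = spark.read.parquet('df').repartition(120)
```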

Spark in Python: creating an RDD by loading binary data with numpy.fromfile

对着背影说爱祢 submitted on 2020-01-01 19:54:30

Question: The Spark Python API currently has limited support for loading large binary data files, so I tried to get numpy.fromfile to help me out. I first got a list of the filenames I'd like to load, e.g.:

```python
In [9]: filenames
Out[9]: ['A0000.dat', 'A0001.dat', 'A0002.dat', 'A0003.dat', 'A0004.dat']
```

I can load these files without problems with a crude iterative unionization:

```python
for i in range(len(filenames)):
    rdd = sc.parallelize([np.fromfile(filenames[i], dtype="int16", count=-1, sep='')])
    if i == 0:
        allRdd =
```
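The excerpt stops mid-loop. A hedged sketch of a way to build one RDD from many binary files without unioning on the driver (it assumes either that every executor can read the files at the same path, or that the files are on HDFS for the binaryFiles variant):

```python
import numpy as np

# Variant 1: distribute the filenames and let each task decode its own file.
rdd = (sc.parallelize(filenames, len(filenames))
         .map(lambda f: np.fromfile(f, dtype="int16", count=-1, sep='')))

# Variant 2: read raw bytes with binaryFiles and decode them with
# numpy.frombuffer, which avoids any local-filesystem assumption on executors.
rdd = (sc.binaryFiles("hdfs:///data/A*.dat")
         .map(lambda name_bytes: np.frombuffer(name_bytes[1], dtype="int16")))
```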

Expand array-of-structs into columns in PySpark

ⅰ亾dé卋堺 submitted on 2020-01-01 19:45:08

Question: I have a Spark dataframe, originating from Google Analytics, that looks like the following:

```
id   customDimensions (Array<Struct>)
100  [ {"index": 1, "value": "Earth"}, {"index": 2, "value": "Europe"} ]
101  [ {"index": 1, "value": "Mars"} ]
```

I also have a "custom dimensions metadata" dataframe that looks like this:

```
index  name
1      planet
2      continent
```

I'd like to use the indexes in the metadata df in order to expand my custom dimensions into columns. The result should look like the following:

```
id   planet
```
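The expected result is cut off in the excerpt. One hedged way to do the expansion (my sketch; it assumes the two dataframes are named df and meta) is to explode the struct array, join each index to its metadata name, and pivot the names into columns:

```python
from pyspark.sql import functions as F

# One row per (id, index, value) after exploding the array of structs.
exploded = (df
    .select("id", F.explode("customDimensions").alias("cd"))
    .select("id", F.col("cd.index").alias("index"), F.col("cd.value").alias("value")))

# Attach the readable dimension name, then pivot names into one column per dimension.
result = (exploded
    .join(meta, on="index", how="left")
    .groupBy("id")
    .pivot("name")
    .agg(F.first("value")))

result.show()   # expected: one row per id with 'planet' and 'continent' columns
```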

PySpark reduceByKey causes out of memory

百般思念 submitted on 2020-01-01 19:38:11

Question: I'm trying to run a job in YARN mode that processes a large amount of data (2 TB) read from Google Cloud Storage. My pipeline works just fine with 10 GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy. Here is the rest of the pipeline:

```python
input.groupByKey()\
    [...] processing on sorted groups for each key shard
    .mapPartitions(sendPartition)\
    .map(mergeShardsbyKey)
    .reduceByKey(lambda list1, list2: list1 + list2).take(10)
```
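No answer is included in this excerpt. One hedged observation (an assumption on my part, not a confirmed fix): concatenating Python lists inside reduceByKey gathers every value of a key onto a single executor, much like groupByKey, so memory use grows with the hottest key; passing an explicit, larger partition count at least spreads the 2 TB shuffle over smaller reduce tasks. The name merged below is illustrative.

```python
# Sketch: give the wide shuffle an explicit partition count so each reduce task
# holds a smaller slice of the data. Keys whose concatenated lists exceed
# executor memory can still fail.
merged = (input.groupByKey()
               # [...] processing on sorted groups for each key shard (elided in the question)
               .mapPartitions(sendPartition)
               .map(mergeShardsbyKey)
               .reduceByKey(lambda l1, l2: l1 + l2, numPartitions=4096))
print(merged.take(10))
```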
