pyspark

pyspark in IPython notebook raises Py4JNetworkError

断了今生、忘了曾经 submitted on 2020-01-02 04:54:06

Question: I was using an IPython notebook to run PySpark by just adding the following to the notebook:

```python
import os
os.chdir('../data_files')
import sys
import pandas as pd
%pylab inline
from IPython.display import Image

os.environ['SPARK_HOME'] = "spark-1.3.1-bin-hadoop2.6"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.8.2.1-src.zip'))
from
```
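The excerpt cuts off at its final import. A minimal sketch of how such a notebook setup typically continues, assuming a local Spark 1.x install at the SPARK_HOME shown above (the names and settings below are illustrative, not from the original question):

```python
from pyspark import SparkConf, SparkContext

# Create a local SparkContext for the notebook session. If a Py4JNetworkError
# is raised here, the Python side could not talk to the JVM gateway, so the
# notebook server's console output is a common place to look for the Java error.
conf = SparkConf().setMaster("local[*]").setAppName("notebook")
sc = SparkContext(conf=conf)
print(sc.version)
```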

How to copy and convert parquet files to csv

两盒软妹~` submitted on 2020-01-02 04:45:27

Question: I have access to an HDFS file system and can see parquet files with `hadoop fs -ls /user/foo`. How can I copy those parquet files to my local system and convert them to CSV so I can use them? The files should be simple text files with a number of fields per row.

Answer 1: Try

```
var df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
```

Relevant API documentation: pyspark.sql.DataFrameReader.parquet, pyspark.sql.DataFrameWriter.csv. Both /path/to/infile.parquet and /path/to
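A hedged PySpark equivalent of the answer's snippet (the paths are placeholders; writing to a file:// URI assumes the writing task runs somewhere that local directory is reachable, and coalesce(1) is only sensible when the data fits on one machine):

```python
# Sketch: read the parquet data from HDFS and write it back out as CSV with a header.
df = spark.read.parquet("hdfs:///user/foo/infile.parquet")
(df.coalesce(1)               # single output file instead of one part-file per partition
   .write
   .option("header", "true")
   .csv("file:///tmp/outfile_csv"))
```

Another route is to copy the files down first with hadoop fs -get and convert them locally, which avoids writing through Spark altogether.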

PySpark: creating a timestamp column

本小妞迷上赌 submitted on 2020-01-02 02:55:27

Question: I am using Spark 2.1.0 and am not able to create a timestamp column in PySpark. I am using the code snippet below:

```python
df = df.withColumn('Age', lit(datetime.now()))
```

I am getting "AssertionError: col should be Column". Please help.

Answer 1: Assuming you have the dataframe from your code snippet and you want the same timestamp for all your rows, let me first create a dummy dataframe:

```python
>>> dict = [{'name': 'Alice', 'age': 1}, {'name': 'Again', 'age': 2}]
>>> df = spark.createDataFrame(dict)
>>> import time
>>> import
```
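Passing a Python datetime object straight to lit() is what trips the "col should be Column" assertion shown above in Spark 2.1. A hedged sketch of two common workarounds (not necessarily what the truncated answer goes on to do):

```python
from datetime import datetime
from pyspark.sql.functions import current_timestamp, lit

# Option 1: let Spark stamp every row with the query's current timestamp.
df = df.withColumn('Age', current_timestamp())

# Option 2: pass the Python datetime as a string literal and cast it to timestamp.
df = df.withColumn('Age', lit(datetime.now().strftime('%Y-%m-%d %H:%M:%S')).cast('timestamp'))
```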

SparkUI for pyspark - corresponding line of code for each stage?

懵懂的女人 submitted on 2020-01-02 02:55:09

Question: I have a PySpark program running on an AWS cluster and I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike Scala or Java Spark programs, where the UI shows which line of code each stage corresponds to, I can't tell which stage corresponds to which line of my PySpark code. Is there a way to figure out which stage corresponds to which line of the PySpark code? Thanks!

Source: https://stackoverflow.com/questions/38315344/sparkui-for-pyspark
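No answer is included in this excerpt. One hedged workaround (my assumption, not something the question confirms) is to tag jobs from the driver so the Spark UI shows a readable description next to the stages each section of PySpark code triggers:

```python
# Sketch: label the jobs produced by each section of code so the UI's job and
# stage pages carry a recognizable description instead of only the internal
# call site. The pipeline below is a placeholder.
sc.setJobGroup("load", "read input and drop incomplete rows")
clean = spark.read.parquet("hdfs:///data/in").dropna()
clean.count()                            # runs under the "load" group

sc.setJobGroup("aggregate", "count rows per key")
clean.groupBy("key").count().collect()   # runs under the "aggregate" group
```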

psutil in Apache Spark

╄→гoц情女王★ submitted on 2020-01-02 02:28:28

Question: I'm using PySpark 1.5.2. After I issue the command `.collect()`, I get the UserWarning "Please install psutil to have better support with spilling". Why is this warning shown? How can I install psutil?

Answer 1:

```
pip install psutil
```

If you need to install specifically for Python 2 or 3, try using pip2 or pip3; it works for both major versions. Here is the PyPI package for psutil.

Answer 2: You can clone or download the psutil project from the following link: https://github.com/giampaolo/psutil.git then run setup.py
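Since the warning is raised by PySpark's shuffle code, which runs in the worker Python processes, psutil generally needs to be installed on every executor node, not only on the machine where pip is run. A hedged sketch for checking that across a cluster (names are illustrative):

```python
# Sketch: run a tiny job that tries to import psutil inside each executor's
# Python process and reports the result back to the driver.
def check_psutil(_):
    try:
        import psutil
        yield "psutil " + psutil.__version__
    except ImportError:
        yield "psutil missing"

print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .mapPartitions(check_psutil)
        .distinct()
        .collect())
```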

Load a parquet file and keep the same number of HDFS partitions

只谈情不闲聊 submitted on 2020-01-02 00:32:09

Question: I have a parquet file /df saved in HDFS with 120 partitions; the size of each partition on HDFS is around 43.5 M. Total size:

```
hdfs dfs -du -s -h /df
5.1 G  15.3 G  /df

hdfs dfs -du -h /df
43.6 M  130.7 M  /df/pid=0
43.5 M  130.5 M  /df/pid=1
...
43.6 M  130.9 M  /df/pid=119
```

I want to load that file into Spark and keep the same number of partitions. However, Spark automatically loads the file into 60 partitions:

```python
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
```

HDFS settings: 'parquet
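No answer is included in this excerpt. A hedged explanation consistent with the numbers above (my reading, not a confirmed one): the DataFrame reader packs input files into splits of at most spark.sql.files.maxPartitionBytes (128 MB by default), and two ~43.5 M files plus their open cost fit under that limit while three do not, which yields 120 / 2 = 60 partitions. A sketch of two ways to get 120 partitions back:

```python
# Sketch 1: shrink the split size so only one ~43.5 M file fits per partition
# (a value between one and two file sizes; 64 MB is an illustrative choice).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
df = spark.read.parquet('df')
print(df.rdd.getNumPartitions())   # should now report 120

# Sketch 2: repartition explicitly after the read (costs an extra shuffle).
df = spark.read.parquet('df').repartition(120)
```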

Spark in Python: creating an RDD by loading binary data with numpy.fromfile

对着背影说爱祢 submitted on 2020-01-01 19:54:30

Question: The Spark Python API currently has limited support for loading large binary data files, so I tried to get numpy.fromfile to help me out. I first got a list of the filenames I'd like to load, e.g.:

```python
In [9]: filenames
Out[9]: ['A0000.dat', 'A0001.dat', 'A0002.dat', 'A0003.dat', 'A0004.dat']
```

I can load these files without problems with a crude iterative unionization:

```python
for i in range(len(filenames)):
    rdd = sc.parallelize([np.fromfile(filenames[i], dtype="int16", count=-1, sep='')])
    if i == 0:
        allRdd =
```
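The excerpt stops mid-loop. A hedged sketch of a way to build one RDD from many binary files without unioning on the driver (it assumes either that every executor can read the files at the same path, or that the files are on HDFS for the binaryFiles variant):

```python
import numpy as np

# Variant 1: distribute the filenames and let each task decode its own file.
rdd = (sc.parallelize(filenames, len(filenames))
         .map(lambda f: np.fromfile(f, dtype="int16", count=-1, sep='')))

# Variant 2: read raw bytes with binaryFiles and decode them with
# numpy.frombuffer, which avoids any local-filesystem assumption on executors.
rdd = (sc.binaryFiles("hdfs:///data/A*.dat")
         .map(lambda name_bytes: np.frombuffer(name_bytes[1], dtype="int16")))
```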

Expand array-of-structs into columns in PySpark

ⅰ亾dé卋堺 submitted on 2020-01-01 19:45:08

Question: I have a Spark dataframe, originating from Google Analytics, that looks like the following:

```
id   customDimensions (Array<Struct>)
100  [ {"index": 1, "value": "Earth"}, {"index": 2, "value": "Europe"} ]
101  [ {"index": 1, "value": "Mars"} ]
```

I also have a "custom dimensions metadata" dataframe that looks like this:

```
index  name
1      planet
2      continent
```

I'd like to use the indexes in the metadata df in order to expand my custom dimensions into columns. The result should look like the following:

```
id   planet
```
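The expected result is cut off in the excerpt. One hedged way to do the expansion (my sketch; it assumes the two dataframes are named df and meta) is to explode the struct array, join each index to its metadata name, and pivot the names into columns:

```python
from pyspark.sql import functions as F

# One row per (id, index, value) after exploding the array of structs.
exploded = (df
    .select("id", F.explode("customDimensions").alias("cd"))
    .select("id", F.col("cd.index").alias("index"), F.col("cd.value").alias("value")))

# Attach the readable dimension name, then pivot names into one column per dimension.
result = (exploded
    .join(meta, on="index", how="left")
    .groupBy("id")
    .pivot("name")
    .agg(F.first("value")))

result.show()   # expected: one row per id with 'planet' and 'continent' columns
```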

PySpark reduceByKey causes out of memory

百般思念 submitted on 2020-01-01 19:38:11

Question: I'm trying to run a job in YARN mode that processes a large amount of data (2 TB) read from Google Cloud Storage. My pipeline works just fine with 10 GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy. Here is the rest of the pipeline:

```python
input.groupByKey()\
    [...] processing on sorted groups for each key shard
    .mapPartitions(sendPartition)\
    .map(mergeShardsbyKey)
    .reduceByKey(lambda list1, list2: list1 + list2).take(10)
```
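No answer is included in this excerpt. One hedged observation (an assumption on my part, not a confirmed fix): concatenating Python lists inside reduceByKey gathers every value of a key onto a single executor, much like groupByKey, so memory use grows with the hottest key; passing an explicit, larger partition count at least spreads the 2 TB shuffle over smaller reduce tasks. The name merged below is illustrative.

```python
# Sketch: give the wide shuffle an explicit partition count so each reduce task
# holds a smaller slice of the data. Keys whose concatenated lists exceed
# executor memory can still fail.
merged = (input.groupByKey()
               # [...] processing on sorted groups for each key shard (elided in the question)
               .mapPartitions(sendPartition)
               .map(mergeShardsbyKey)
               .reduceByKey(lambda l1, l2: l1 + l2, numPartitions=4096))
print(merged.take(10))
```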
