pyspark

How to explode multiple columns of a dataframe in pyspark

Posted by 前提是你 on 2019-12-20 14:55:11
Question: I have a dataframe whose columns contain lists, similar to the following; the lists are not all the same length.

Name   Age   Subjects                    Grades
[Bob]  [16]  [Maths,Physics,Chemistry]   [A,B,C]

I want to explode the dataframe so that I get the following output:

Name  Age  Subjects   Grades
Bob   16   Maths      A
Bob   16   Physics    B
Bob   16   Chemistry  C

How can I achieve this?

Answer 1: This works:

import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sql.createDataFrame(
    [(['Bob'],
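A minimal sketch of one common approach (not the answer's exact code, which is cut off above): on Spark 2.4+, arrays_zip pairs up Subjects and Grades element by element and explode turns the pairs into rows. The single sample row is the one from the question.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(["Bob"], [16], ["Maths", "Physics", "Chemistry"], ["A", "B", "C"])],
    ["Name", "Age", "Subjects", "Grades"],
)

result = (
    df.withColumn("Name", F.explode("Name"))            # single-element lists -> scalars
      .withColumn("Age", F.explode("Age"))
      .withColumn("zipped", F.explode(F.arrays_zip("Subjects", "Grades")))
      .select("Name", "Age",
              F.col("zipped.Subjects").alias("Subjects"),
              F.col("zipped.Grades").alias("Grades"))
)
result.show()   # Bob/16 with Maths/A, Physics/B, Chemistry/C as separate rows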

Access to Spark from Flask app

Posted by 混江龙づ霸主 on 2019-12-20 10:44:39
Question: I wrote a simple Flask app to pass some data to Spark. The script works in an IPython Notebook, but not when I try to run it on its own server. I don't think the Spark context is running within the script. How do I get Spark working in the following example?

from flask import Flask, request
from pyspark import SparkConf, SparkContext

app = Flask(__name__)
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext
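A minimal sketch of one way this is commonly resolved (the accepted answer is cut off above): launch the Flask script with spark-submit so the Spark runtime is on the path, and create the SparkContext once, lazily, rather than at import time. The /count route is just an illustrative endpoint.

from flask import Flask, request
from pyspark import SparkConf, SparkContext

app = Flask(__name__)
sc = None   # created on first use

def get_spark_context():
    global sc
    if sc is None:
        conf = (SparkConf()
                .setMaster("local")
                .setAppName("SparkContext1")
                .set("spark.executor.memory", "1g"))
        sc = SparkContext(conf=conf)
    return sc

@app.route("/count")
def count():
    # Trivial job to confirm Spark is reachable from the web app.
    rdd = get_spark_context().parallelize(range(100))
    return str(rdd.count())

if __name__ == "__main__":
    # Launch with e.g.:  spark-submit your_flask_script.py
    app.run(host="0.0.0.0", port=5000)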

Spark SQL Row_number() PartitionBy Sort Desc

Posted by 不羁岁月 on 2019-12-20 10:36:07
Question: I've successfully created a row_number() partitionBy in Spark using Window, but I would like to sort it descending rather than the default ascending. Here is my working code:

from pyspark import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window

data_cooccur.select("driver", "also_item", "unit_count",
    F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count")).alias("rowNum")).show()

That gives me this
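A minimal sketch of the usual fix: order the window by the column descending, e.g. F.desc("unit_count") or F.col("unit_count").desc(). Note that in current PySpark the function is F.row_number(); F.rowNumber() exists only in very old releases. The rows below are an illustrative stand-in for data_cooccur.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
data_cooccur = spark.createDataFrame(
    [("d1", "x", 5), ("d1", "y", 9), ("d2", "z", 3)],
    ["driver", "also_item", "unit_count"],
)

# Rank items per driver from the highest unit_count down.
w = Window.partitionBy("driver").orderBy(F.desc("unit_count"))
data_cooccur.select(
    "driver", "also_item", "unit_count",
    F.row_number().over(w).alias("rowNum"),
).show()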

Spark: Convert column of string to an array

Posted by 扶醉桌前 on 2019-12-20 09:42:36
Question: How do I convert a column that has been read as a string into a column of arrays? That is, convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+

to:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

Please share
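A minimal PySpark sketch of the usual approach (the question itself is shown in Scala, but this page is about pyspark): split the string on the comma and cast the result to an array of longs.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
test = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])

# "2,3" -> ["2", "3"] -> [2, 3]
test1 = test.withColumn("b", F.split(F.col("b"), ",").cast("array<long>"))
test1.printSchema()
test1.show()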

GroupBy and concat array columns pyspark

Posted by 谁说我不能喝 on 2019-12-20 09:38:41
Question: I have this data frame:

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

and I would like to convert it into the following df:

+-----+------------------+
|store|            values|
+-----+------------------+
|    1|[1, 2, 3, 4, 5, 6]|
|    2|            [2, 3]|
+-----+------------------+

I did this:

from pyspark.sql import functions as F
df.groupBy("store")
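A minimal sketch of one way to finish this (the question's own attempt is cut off above): on Spark 2.4+, collect the per-store arrays with collect_list, then flatten the resulting array of arrays into one list.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])],
    ["store", "values"],
)

# collect_list gives [[1,2,3],[4,5,6]] per store; flatten merges the inner arrays
# (element order within a group is not guaranteed).
result = df.groupBy("store").agg(F.flatten(F.collect_list("values")).alias("values"))
result.show(truncate=False)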

Extract document-topic matrix from Pyspark LDA Model

Posted by 生来就可爱ヽ(ⅴ<●) on 2019-12-20 09:24:04
Question: I have successfully trained an LDA model in Spark via the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)

This works completely fine, but I now need the document-topic matrix for the LDA model; as far as I can tell, all I can get is the word-topic matrix, using model.topicsMatrix(). Is there some way to get the document-topic matrix from the LDA model, and if not, is there an alternative method (other than implementing LDA from scratch) in Spark to run an LDA
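A minimal sketch of one alternative, using the DataFrame-based API (pyspark.ml) instead of the question's pyspark.mllib: there, the fitted model's transform() adds a topicDistribution column, which is effectively the document-topic matrix. The toy corpus below is purely illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
docs = spark.createDataFrame(
    [(0, ["spark", "rdd", "dataframe"]), (1, ["topic", "model", "lda"])],
    ["id", "tokens"],
)

# Turn token lists into count vectors, then fit LDA on them.
cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(docs)
corpus = cv_model.transform(docs)

lda_model = LDA(k=2, maxIter=10).fit(corpus)

# One topic-distribution vector per document.
doc_topics = lda_model.transform(corpus).select("id", "topicDistribution")
doc_topics.show(truncate=False)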

How do I read a parquet in PySpark written from Spark?

Posted by 我是研究僧i on 2019-12-20 09:06:11
Question: I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to Parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")

I then go to my Python notebook to read in the data:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d:/
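A minimal sketch of the usual remedy: tell Spark explicitly that the data is Parquet when reading it back, with read.parquet() or read.format("parquet").load(). The swift2d path is the (partially masked) one from the question and is only a placeholder here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the files as Parquet explicitly instead of relying on format inference.
df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
# Equivalent:
# df = spark.read.format("parquet").load("swift2d://xxxx.keystone/commentClusters.parquet")
df.printSchema()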

PySpark groupByKey returning pyspark.resultiterable.ResultIterable

Posted by £可爱£侵袭症+ on 2019-12-20 08:27:33
Question: I am trying to figure out why my groupByKey is returning the following:

[(0, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a210>),
 (1, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a4d0>),
 (2, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a390>),
 (3, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a290>),
 (4, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a450>),
 (5, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a350>)
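A minimal sketch of what is happening: groupByKey() deliberately returns a lazy ResultIterable per key rather than a list; wrap it with mapValues(list) to materialize the groups (or reach for reduceByKey if you only need an aggregate). The sample pairs below are illustrative.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([(0, "a"), (0, "b"), (1, "c"), (1, "d")])

# Materialize each group's ResultIterable into a plain Python list.
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())   # e.g. [(0, ['a', 'b']), (1, ['c', 'd'])]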

Spark rdd write in global list

Posted by 丶灬走出姿态 on 2019-12-20 07:42:59
Question: How do I write to a global list from an RDD?

Li = []

def fn(value):
    if value == 4:
        Li.append(1)

rdd.mapValues(lambda x: fn(x))

When I try to print Li the result is: []. What I'm trying to do is modify another global list, Li1, while transforming the RDD object. However, when I do this I always end up with an empty list; Li1 is never modified.

Answer 1: The reason Li is still [] after executing mapValues is that Spark serializes the fn function (and all global variables that it
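A minimal sketch contrasting the broken pattern with a working alternative: each executor gets its own serialized copy of Li, so appends made on the workers never reach the driver; express the computation as a transformation and collect the result (or use an accumulator) instead. The sample RDD below is illustrative.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([("a", 4), ("b", 2), ("c", 4)])

# Broken pattern: mutating a driver-side global inside a transformation.
Li = []

def fn(value):
    if value == 4:
        Li.append(1)   # mutates the executor's copy of Li, not the driver's
    return value

rdd.mapValues(fn).count()
print(Li)              # still [] on the driver

# Working alternative: build the list as a transformation and collect it.
Li1 = rdd.filter(lambda kv: kv[1] == 4).map(lambda kv: 1).collect()
print(Li1)             # [1, 1]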

Problems with pySpark Columnsimilarities

Posted by ↘锁芯ラ on 2019-12-20 07:23:33
Question: tl;dr How do I use pySpark to compare the similarity of rows? I have a numpy array in which I would like to compare each row's similarity to every other row:

print(pdArray)
# [[ 0.  1.  0. ...,  0.  0.  0.]
#  [ 0.  0.  3. ...,  0.  0.  0.]
#  [ 0.  0.  0. ...,  0.  0.  7.]
#  ...,
#  [ 5.  0.  0. ...,  0.  1.  0.]
#  [ 0.  6.  0. ...,  0.  0.  3.]
#  [ 0.  0.  0. ...,  2.  0.  0.]]

Using scipy I can compute cosine similarities as follows:

pyspark.__version__   # '2.2.0'
from sklearn.metrics.pairwise import cosine_similarity
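A minimal sketch of the usual trick with pyspark.mllib: RowMatrix.columnSimilarities() computes cosine similarity between columns, so to compare the rows of the array, feed the RowMatrix the transpose. The random array below just stands in for pdArray.

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext.getOrCreate()
pdArray = np.random.rand(6, 10)   # illustrative stand-in for the question's array

# Rows of pdArray become columns of the RowMatrix.
mat = RowMatrix(sc.parallelize(pdArray.T.tolist()))

# Upper-triangular CoordinateMatrix of pairwise cosine similarities.
sims = mat.columnSimilarities()
for entry in sims.entries.take(5):
    print(entry.i, entry.j, entry.value)   # similarity of row i and row j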