pyspark

PySpark Overwrite added sc.addPyFile

六眼飞鱼酱① submitted on 2020-01-02 21:58:03
Question: I have these 2 files saved under this path:

C:\code\sample1\main.py

```python
def method():
    return "this is sample method 1"
```

C:\code\sample2\main.py

```python
def method():
    return "this is sample method 2"
```

and then I run this:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

sc.addPyFile("~/code/sample1/main.py")
main1 = __import__("main")
print(main1.method())  # this is sample method 1

sc.addPyFile("~/code/sample2/main.py")  # Error
```

The error is …
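
The excerpt is cut off before the error itself, but the conflict comes from shipping two different files that are both named main.py. A minimal sketch of one possible workaround, assuming the files can be copied to distinct names (main_v1.py and main_v2.py are hypothetical):

```python
# Sketch: avoid the name collision by giving each shipped module a unique
# file name, then import each one explicitly. File names are hypothetical
# copies of the originals, not from the question.
import importlib
from pyspark import SparkContext

sc = SparkContext()

sc.addPyFile("~/code/sample1/main_v1.py")  # copy of sample1/main.py
sc.addPyFile("~/code/sample2/main_v2.py")  # copy of sample2/main.py

main1 = importlib.import_module("main_v1")
main2 = importlib.import_module("main_v2")

print(main1.method())  # this is sample method 1
print(main2.method())  # this is sample method 2
```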

How to vectorize json data for KMeans?

杀马特。学长 韩版系。学妹 submitted on 2020-01-02 19:47:25
Question: I have a number of questions and choices which users are going to answer. They have a format like this:

```
question_id, text, choices
```

For each user I store the answered questions and the selected choice as a JSON document in MongoDB:

```json
{"user_id": "", "question_answers": [{"question_id": "choice_id", ...}]}
```

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices of questions, but I need to convert my user data to some vector numbers like the …
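
The excerpt is truncated, but getting from answer records to K-Means input is typically a one-hot encoding problem. A sketch, assuming each user's answers can be flattened into "question_id:choice_id" strings (all names and values below are illustrative, not from the question):

```python
# Sketch: one-hot encode question/choice pairs into sparse vectors,
# then cluster the vectors with KMeans.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

users = spark.createDataFrame(
    [("u1", ["q1:c2", "q2:c1"]),
     ("u2", ["q1:c2", "q2:c3"])],
    ["user_id", "answers"],
)

# Each distinct question/choice pair becomes one dimension of the vector.
cv = CountVectorizer(inputCol="answers", outputCol="features", binary=True)
vectorized = cv.fit(users).transform(users)

kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
clusters = kmeans.fit(vectorized).transform(vectorized)
clusters.select("user_id", "cluster").show()
```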

percentage count per group and pivot with pyspark

妖精的绣舞 submitted on 2020-01-02 19:41:48
Question: I have a dataframe with columns from and to. Both are country codes, and they show the starting country and the destination country.

```
+----+---+
|from| to|
+----+---+
|  TR| tr|
|  TR| tr|
|  TR| tr|
|  TR| gr|
|  ES| tr|
|  GR| tr|
|  CZ| it|
|  LU| it|
|  AR| it|
|  DE| it|
|  IT| it|
|  IT| it|
|  US| it|
|  GR| fr|
```

Is there a way to get a dataframe that shows the percentage of each destination country per country of origin, with a column for every destination country code? The percentage must be out of the total …
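
The excerpt is cut off at the denominator, but the pivot itself is straightforward. A sketch that computes percentages out of each origin's row total (if the intended denominator is the overall total, swap in a global count; sample data abbreviated):

```python
# Sketch: groupBy + pivot to count destinations per origin, then divide
# each count by the row total to get percentages.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("TR", "tr"), ("TR", "tr"), ("TR", "gr"), ("ES", "tr"), ("IT", "it")],
    ["from", "to"],
)

counts = df.groupBy("from").pivot("to").count().na.fill(0)

dest_cols = [c for c in counts.columns if c != "from"]
total = sum(F.col(c) for c in dest_cols)  # per-origin row total
pct = counts.select(
    "from",
    *[(F.col(c) / total * 100).alias(c) for c in dest_cols],
)
pct.show()
```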

pyspark use dataframe inside udf

北城以北 submitted on 2020-01-02 18:38:20
Question: I have two dataframes. df1:

```
+---+---+----------+
|  n|val| distances|
+---+---+----------+
|  1|  1|0.27308652|
|  2|  1|0.24969208|
|  3|  1|0.21314497|
+---+---+----------+
```

and df2:

```
+---+---+----------+
| x1| x2|         w|
+---+---+----------+
|  1|  2|0.03103427|
|  1|  4|0.19012526|
|  1| 10|0.26805446|
|  1|  8|0.26825935|
+---+---+----------+
```

I want to add a new column to df1 called gamma, which will contain the sum of the w values from df2 when df1.n == df2.x1 OR df1.n == df2.x2. I tried to use a udf, but …
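
A udf cannot reference a second DataFrame, so the usual alternative is a join on the OR condition followed by an aggregation. A sketch under the schemas shown above:

```python
# Sketch: replace the udf with a left join on the OR condition,
# then sum w per original row of df1.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [(1, 1, 0.27308652), (2, 1, 0.24969208), (3, 1, 0.21314497)],
    ["n", "val", "distances"],
)
df2 = spark.createDataFrame(
    [(1, 2, 0.03103427), (1, 4, 0.19012526),
     (1, 10, 0.26805446), (1, 8, 0.26825935)],
    ["x1", "x2", "w"],
)

gamma = (
    df1.join(df2, (df1.n == df2.x1) | (df1.n == df2.x2), "left")
       .groupBy("n", "val", "distances")
       .agg(F.sum("w").alias("gamma"))
)
gamma.show()
```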

Cannot create Dataframe in PySpark

夙愿已清 submitted on 2020-01-02 09:40:12
Question: I want to create a DataFrame in PySpark with the following code:

```python
from pyspark.sql import *
from pyspark.sql.types import *

temp = Row("DESC", "ID")
temp1 = temp('Description1323', 123)
print temp1

schema = StructType([StructField("DESC", StringType(), False),
                     StructField("ID", IntegerType(), False)])
df = spark.createDataFrame(temp1, schema)
```

But I am receiving the following error:

```
TypeError: StructType can not accept object 'Description1323' in type <type 'str'>
```

What's wrong with my code?

Answer 1: …
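
The answer text is truncated, but a common cause of this error is that createDataFrame expects a collection of rows rather than a single Row. A sketch of the corrected code (rewritten in Python 3 syntax):

```python
# Sketch: wrap the single Row in a list so createDataFrame receives
# a collection of rows matching the schema.
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

temp = Row("DESC", "ID")
temp1 = temp('Description1323', 123)

schema = StructType([StructField("DESC", StringType(), False),
                     StructField("ID", IntegerType(), False)])

df = spark.createDataFrame([temp1], schema)  # note the list
df.show()
```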

Split column of list into multiple columns in the same PySpark dataframe

只愿长相守 submitted on 2020-01-02 07:19:47
Question: I have the following dataframe which contains 2 columns: the 1st column has column names, and the 2nd column has lists of values.

```
+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12…
```
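
The excerpt is truncated, but splitting an array column into one column per element is usually done with getItem. A sketch, assuming the Quantile lists share a known length (3 here for brevity; the q0..q2 names and sample values are illustrative):

```python
# Sketch: expand an array column into separate columns with getItem.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("rent", [4000.0, 4500.0, 5000.0]),
     ("bedroom_count", [1.0, 1.0, 2.0])],
    ["Column", "Quantile"],
)

n = 3  # assumed known list length
split_df = df.select(
    "Column",
    *[F.col("Quantile").getItem(i).alias(f"q{i}") for i in range(n)],
)
split_df.show()
```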

spark importing data from oracle - java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

百般思念 submitted on 2020-01-02 06:10:29
Question: While trying to read data from an Oracle database using Spark on AWS EMR, I am getting this error message: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver. Can someone let me know if anyone has faced this issue and how they resolved it?

```
pyspark --driver-class-path /home/hadoop/ojdbc7.jar --jars /home/hadoop/ojdbc7.jar
```

```python
from pyspark import SparkContext, HiveContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").options…
```
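
The snippet is truncated at .options. Besides shipping ojdbc7.jar via --jars, naming the driver class explicitly in the JDBC options is a common way past this ClassNotFoundException. A sketch with placeholder connection details (host, port, service name, table, and credentials are all assumptions):

```python
# Sketch: a JDBC read with the Oracle driver class set explicitly.
# URL, table, and credentials are placeholder values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
    .option("dbtable", "schema.table_name")
    .option("user", "username")
    .option("password", "password")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)
df.show()
```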

PySpark: Add a new column with a tuple created from columns

送分小仙女□ submitted on 2020-01-02 05:28:08
Question: Here I have a dataframe created as follows:

```python
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
                           ["Id","V1","V2","V3"])
```

It looks like:

```
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
|  a|  5|  R|  X|
|  b|  7|  G|  S|
|  c|  8|  G|  S|
+---+---+---+---+
```

I'm looking to add a column that is a tuple consisting of V1, V2, V3. The result should look like:

```
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
|  a|  5|  R|  X|(5,R,X)|
|  b|  7|  G|  S|(7,G,S)|
|  c|  8|…
```
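
Spark has no tuple column type, but a struct column is the closest native equivalent and is built with pyspark.sql.functions.struct. A sketch (note the rendered value looks like {5, R, X} rather than the (5,R,X) shown above):

```python
# Sketch: build a struct ("tuple-like") column from existing columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 5, 'R', 'X'), ('b', 7, 'G', 'S'), ('c', 8, 'G', 'S')],
    ["Id", "V1", "V2", "V3"],
)

df = df.withColumn("V_tuple", F.struct("V1", "V2", "V3"))
df.show(truncate=False)
```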