pyspark

PySpark Overwrite added sc.addPyFile

六眼飞鱼酱① submitted on 2020-01-02 21:58:03
Question: I have these 2 files saved under this path:

C:\code\sample1\main.py

```python
def method():
    return "this is sample method 1"
```

C:\code\sample2\main.py

```python
def method():
    return "this is sample method 2"
```

and then I run this:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

sc.addPyFile("~/code/sample1/main.py")
main1 = __import__("main")
print(main1.method())  # this is sample method 1

sc.addPyFile("~/code/sample2/main.py")  # Error
```

The error is …
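
The excerpt is cut off before the error itself, but the conflict comes from shipping two different files that are both named main.py. A minimal sketch of one possible workaround, assuming the files can be copied to distinct names (main_v1.py and main_v2.py are hypothetical):

```python
# Sketch: avoid the name collision by giving each shipped module a unique
# file name, then import each one explicitly. File names are hypothetical
# copies of the originals, not from the question.
import importlib
from pyspark import SparkContext

sc = SparkContext()

sc.addPyFile("~/code/sample1/main_v1.py")  # copy of sample1/main.py
sc.addPyFile("~/code/sample2/main_v2.py")  # copy of sample2/main.py

main1 = importlib.import_module("main_v1")
main2 = importlib.import_module("main_v2")

print(main1.method())  # this is sample method 1
print(main2.method())  # this is sample method 2
```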

How to vectorize json data for KMeans?

杀马特。学长 韩版系。学妹 submitted on 2020-01-02 19:47:25
Question: I have a number of questions and choices which users are going to answer. They have a format like this:

```
question_id, text, choices
```

For each user I store the answered questions and the selected choice as a JSON document in MongoDB:

```json
{"user_id": "", "question_answers": [{"question_id": "choice_id", ...}]}
```

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices of questions, but I need to convert my user data to some vector numbers like the …
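
The excerpt is truncated, but getting from answer records to K-Means input is typically a one-hot encoding problem. A sketch, assuming each user's answers can be flattened into "question_id:choice_id" strings (all names and values below are illustrative, not from the question):

```python
# Sketch: one-hot encode question/choice pairs into sparse vectors,
# then cluster the vectors with KMeans.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

users = spark.createDataFrame(
    [("u1", ["q1:c2", "q2:c1"]),
     ("u2", ["q1:c2", "q2:c3"])],
    ["user_id", "answers"],
)

# Each distinct question/choice pair becomes one dimension of the vector.
cv = CountVectorizer(inputCol="answers", outputCol="features", binary=True)
vectorized = cv.fit(users).transform(users)

kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
clusters = kmeans.fit(vectorized).transform(vectorized)
clusters.select("user_id", "cluster").show()
```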

percentage count per group and pivot with pyspark

妖精的绣舞 submitted on 2020-01-02 19:41:48
Question: I have a dataframe with columns from and to. Both are country codes, and they show the starting country and the destination country.

```
+----+---+
|from| to|
+----+---+
|  TR| tr|
|  TR| tr|
|  TR| tr|
|  TR| gr|
|  ES| tr|
|  GR| tr|
|  CZ| it|
|  LU| it|
|  AR| it|
|  DE| it|
|  IT| it|
|  IT| it|
|  US| it|
|  GR| fr|
```

Is there a way to get a dataframe that shows the percentage of each destination country per country of origin, with a column for every destination country code? The percentage must be out of the total …
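
The excerpt is cut off at the denominator, but the pivot itself is straightforward. A sketch that computes percentages out of each origin's row total (if the intended denominator is the overall total, swap in a global count; sample data abbreviated):

```python
# Sketch: groupBy + pivot to count destinations per origin, then divide
# each count by the row total to get percentages.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("TR", "tr"), ("TR", "tr"), ("TR", "gr"), ("ES", "tr"), ("IT", "it")],
    ["from", "to"],
)

counts = df.groupBy("from").pivot("to").count().na.fill(0)

dest_cols = [c for c in counts.columns if c != "from"]
total = sum(F.col(c) for c in dest_cols)  # per-origin row total
pct = counts.select(
    "from",
    *[(F.col(c) / total * 100).alias(c) for c in dest_cols],
)
pct.show()
```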

pyspark use dataframe inside udf

北城以北 submitted on 2020-01-02 18:38:20
Question: I have two dataframes. df1:

```
+---+---+----------+
|  n|val| distances|
+---+---+----------+
|  1|  1|0.27308652|
|  2|  1|0.24969208|
|  3|  1|0.21314497|
+---+---+----------+
```

and df2:

```
+---+---+----------+
| x1| x2|         w|
+---+---+----------+
|  1|  2|0.03103427|
|  1|  4|0.19012526|
|  1| 10|0.26805446|
|  1|  8|0.26825935|
+---+---+----------+
```

I want to add a new column to df1 called gamma, which will contain the sum of the w values from df2 when df1.n == df2.x1 OR df1.n == df2.x2. I tried to use a udf, but …
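
A udf cannot reference a second DataFrame, so the usual alternative is a join on the OR condition followed by an aggregation. A sketch under the schemas shown above:

```python
# Sketch: replace the udf with a left join on the OR condition,
# then sum w per original row of df1.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [(1, 1, 0.27308652), (2, 1, 0.24969208), (3, 1, 0.21314497)],
    ["n", "val", "distances"],
)
df2 = spark.createDataFrame(
    [(1, 2, 0.03103427), (1, 4, 0.19012526),
     (1, 10, 0.26805446), (1, 8, 0.26825935)],
    ["x1", "x2", "w"],
)

gamma = (
    df1.join(df2, (df1.n == df2.x1) | (df1.n == df2.x2), "left")
       .groupBy("n", "val", "distances")
       .agg(F.sum("w").alias("gamma"))
)
gamma.show()
```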

Cannot create Dataframe in PySpark

夙愿已清 submitted on 2020-01-02 09:40:12
Question: I want to create a DataFrame in PySpark with the following code:

```python
from pyspark.sql import *
from pyspark.sql.types import *

temp = Row("DESC", "ID")
temp1 = temp('Description1323', 123)
print temp1

schema = StructType([StructField("DESC", StringType(), False),
                     StructField("ID", IntegerType(), False)])
df = spark.createDataFrame(temp1, schema)
```

But I am receiving the following error:

```
TypeError: StructType can not accept object 'Description1323' in type <type 'str'>
```

What's wrong with my code?

Answer 1: …
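
The answer text is truncated, but a common cause of this error is that createDataFrame expects a collection of rows rather than a single Row. A sketch of the corrected code (rewritten in Python 3 syntax):

```python
# Sketch: wrap the single Row in a list so createDataFrame receives
# a collection of rows matching the schema.
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

temp = Row("DESC", "ID")
temp1 = temp('Description1323', 123)

schema = StructType([StructField("DESC", StringType(), False),
                     StructField("ID", IntegerType(), False)])

df = spark.createDataFrame([temp1], schema)  # note the list
df.show()
```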

Split column of list into multiple columns in the same PySpark dataframe

只愿长相守 submitted on 2020-01-02 07:19:47
Question: I have the following dataframe which contains 2 columns: the 1st column has column names, and the 2nd column has lists of values.

```
+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12…
```
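
The excerpt is truncated, but splitting an array column into one column per element is usually done with getItem. A sketch, assuming the Quantile lists share a known length (3 here for brevity; the q0..q2 names and sample values are illustrative):

```python
# Sketch: expand an array column into separate columns with getItem.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("rent", [4000.0, 4500.0, 5000.0]),
     ("bedroom_count", [1.0, 1.0, 2.0])],
    ["Column", "Quantile"],
)

n = 3  # assumed known list length
split_df = df.select(
    "Column",
    *[F.col("Quantile").getItem(i).alias(f"q{i}") for i in range(n)],
)
split_df.show()
```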

spark importing data from oracle - java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

百般思念 submitted on 2020-01-02 06:10:29
Question: While trying to read data from an Oracle database using Spark on AWS EMR, I am getting this error message: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver. Can someone let me know if anyone has faced this issue and how they resolved it?

```
pyspark --driver-class-path /home/hadoop/ojdbc7.jar --jars /home/hadoop/ojdbc7.jar
```

```python
from pyspark import SparkContext, HiveContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").options…
```
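
The snippet is truncated at .options. Besides shipping ojdbc7.jar via --jars, naming the driver class explicitly in the JDBC options is a common way past this ClassNotFoundException. A sketch with placeholder connection details (host, port, service name, table, and credentials are all assumptions):

```python
# Sketch: a JDBC read with the Oracle driver class set explicitly.
# URL, table, and credentials are placeholder values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
    .option("dbtable", "schema.table_name")
    .option("user", "username")
    .option("password", "password")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)
df.show()
```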

PySpark: Add a new column with a tuple created from columns

送分小仙女□ submitted on 2020-01-02 05:28:08
Question: Here I have a dataframe created as follows:

```python
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
                           ["Id","V1","V2","V3"])
```

It looks like:

```
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
|  a|  5|  R|  X|
|  b|  7|  G|  S|
|  c|  8|  G|  S|
+---+---+---+---+
```

I'm looking to add a column that is a tuple consisting of V1, V2, V3. The result should look like:

```
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
|  a|  5|  R|  X|(5,R,X)|
|  b|  7|  G|  S|(7,G,S)|
|  c|  8|…
```
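
Spark has no tuple column type, but a struct column is the closest native equivalent and is built with pyspark.sql.functions.struct. A sketch (note the rendered value looks like {5, R, X} rather than the (5,R,X) shown above):

```python
# Sketch: build a struct ("tuple-like") column from existing columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 5, 'R', 'X'), ('b', 7, 'G', 'S'), ('c', 8, 'G', 'S')],
    ["Id", "V1", "V2", "V3"],
)

df = df.withColumn("V_tuple", F.struct("V1", "V2", "V3"))
df.show(truncate=False)
```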