pyspark

Add column from one dataframe to another WITHOUT JOIN

寵の児 submitted on 2019-12-20 06:38:10
Question: Referring to here, which recommends a join to append a column from one table to another. I have indeed been using this method, but I have now hit limits with a huge number of tables and rows. Let's say I have a dataframe of M features (id, salary, age, etc.):

+----+--------+------------+--------------+
| id | salary | age        | zone         | ....
+----+--------+------------+--------------+

I have performed certain operations on each feature to arrive at something like this:

+----+--------+------------+--------------+-
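
A minimal sketch of a join-free direction, assuming each derived feature can be computed row by row from the existing columns (the dataframe and the derivations below are illustrative, not from the original question): withColumn adds the new columns in place, so no per-feature dataframes have to be joined back on id.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 50000, 34, "A"), (2, 62000, 41, "B")],
        ["id", "salary", "age", "zone"])

    # Derive new features directly on the same dataframe instead of producing a
    # separate dataframe per feature and joining it back on id.
    df = (df
          .withColumn("age_bucket", F.when(F.col("age") < 40, "young").otherwise("senior"))
          .withColumn("salary_k", F.col("salary") / 1000))
    df.show()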

Pyspark with Elasticsearch

我的梦境 submitted on 2019-12-20 06:31:03
Question: I'm using Pyspark with Elasticsearch. I've noticed that when you create an RDD, it doesn't get executed prior to any collecting, counting or any other 'final' operation. Is there a way to execute and cache the transformed RDD, as I use the transformed RDD's result for other things as well? Answer 1: As I said in the comment section, all transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base
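
A rough sketch of the usual caching pattern (the Elasticsearch read itself is assumed to be configured elsewhere through the elasticsearch-hadoop connector, so a parallelized collection stands in for it here): cache() marks the RDD for reuse, and any action such as count() forces the lazy pipeline to run once.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Stand-in for the RDD read from Elasticsearch; the real read would come from
    # the elasticsearch-hadoop connector and is assumed to be configured elsewhere.
    es_rdd = sc.parallelize([("doc1", 10), ("doc2", 20), ("doc3", 30)])

    transformed = es_rdd.map(lambda kv: (kv[0], kv[1] * 2))
    transformed.cache()     # mark the RDD for reuse; persist() with a StorageLevel also works
    transformed.count()     # any action triggers the lazy pipeline and populates the cache

    # Later operations reuse the cached data instead of re-reading from Elasticsearch.
    total = transformed.map(lambda kv: kv[1]).sum()
    sample = transformed.take(2)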

In Pyspark HiveContext what is the equivalent of SQL OFFSET?

亡梦爱人 submitted on 2019-12-20 06:15:23
Question: Or, a more specific question would be: how can I process large amounts of data that do not fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to get all the data, but offset doesn't seem to be valid within hiveContext. What is the alternative usually used to achieve this goal? For some context, the pyspark code starts with:

    from pyspark.sql import HiveContext
    hiveContext = HiveContext(sc)
    hiveContext.sql("select ...
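
One commonly used alternative (a sketch under assumptions, not taken from the original thread) is to number the rows with a window function and filter out one range at a time, which emulates LIMIT/OFFSET; the table name and the id ordering column below are hypothetical.

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, col

    df = hiveContext.sql("select * from some_table")     # hypothetical table
    w = Window.orderBy(col("id"))                        # assumes a column giving a stable order
    numbered = df.withColumn("rn", row_number().over(w))

    page_size, offset = 10, 10
    page = (numbered
            .filter((col("rn") > offset) & (col("rn") <= offset + page_size))
            .drop("rn"))
    page.show()

Note that a window with no partitionBy pulls all rows into a single partition, so for data that truly does not fit in memory it is usually better to keep the full dataframe distributed and process it with mapPartitions/foreachPartition or write the results out, rather than paging.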

PySpark reduceByKey aggregation after collect_list on a column

梦想的初衷 submitted on 2019-12-20 06:00:44
Question: I want to take the following example to do my aggregation according to 'states' collected by collect_list.

example code:

    states = sc.parallelize(["TX","TX","CA","TX","CA"])
    states.map(lambda x:(x,1)).reduceByKey(operator.add).collect()
    #printed output: [('TX', 3), ('CA', 2)]

my code:

    from pyspark import SparkContext,SparkConf
    from pyspark.sql.session import SparkSession
    from pyspark.sql.functions import collect_list
    import operator
    conf = SparkConf().setMaster("local")
    conf = conf.setAppName(
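
A sketch of one way to combine collect_list with a reduceByKey-style count (the id/state dataframe below is made up for illustration): collect the values per key, explode them back out, and reduce on the resulting RDD exactly as in the word-count example above.

    import operator
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import collect_list, explode

    spark = SparkSession.builder.master("local").appName("collect_list_agg").getOrCreate()
    df = spark.createDataFrame(
        [(1, "TX"), (1, "TX"), (2, "CA"), (2, "TX"), (3, "CA")],
        ["id", "state"])

    # Gather the states per id, then flatten them again and count per state on the RDD,
    # mirroring the word-count style of the example above.
    collected = df.groupBy("id").agg(collect_list("state").alias("states"))
    counts = (collected
              .select(explode("states").alias("state"))
              .rdd.map(lambda row: (row.state, 1))
              .reduceByKey(operator.add)
              .collect())
    print(counts)   # e.g. [('TX', 3), ('CA', 2)]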

SPARK SQL fails if there is no specified partition path available

匆匆过客 submitted on 2019-12-20 05:57:11
Question: I am using the Hive Metastore in EMR. I am able to query the table manually through HiveSQL. But when I use the same table in a Spark job, it says Input path does not exist: s3:// Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://.... I have deleted the above partition path in s3://.., but the table still works in Hive without dropping the partition at the table level; it does not work in pyspark, however. Here is my full code:

    from pyspark import SparkContext,
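
Two workarounds that are often suggested for this situation, sketched here under assumptions rather than taken from the original thread (verify the config key against your Spark version; the database, table, and partition spec are hypothetical): have Spark skip partitions whose paths no longer exist, or remove the stale partition from the metastore.

    from pyspark.sql import SparkSession

    # Option 1: ask Spark to verify partition paths and skip missing ones
    # (spark.sql.hive.verifyPartitionPath; check that your Spark version supports it).
    spark = (SparkSession.builder
             .enableHiveSupport()
             .config("spark.sql.hive.verifyPartitionPath", "true")
             .getOrCreate())

    # Option 2: drop the stale partition so the metastore no longer points at the
    # deleted S3 prefix. Database, table, and partition spec below are hypothetical.
    spark.sql("ALTER TABLE my_db.my_table DROP IF EXISTS PARTITION (dt='2019-12-01')")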

How to save a spark dataframe as a text file without Rows in pyspark?

◇◆丶佛笑我妖孽 submitted on 2019-12-20 05:22:00
Question: I have a dataframe "df" with the columns ['name', 'age']. I saved the dataframe using df.rdd.saveAsTextFile("..") to save it as an rdd. I loaded the saved file, and collect() then gives me the following result.

    a = sc.textFile("\mee\sample")
    a.collect()

Output:
[u"Row(name=u'Alice', age=1)", u"Row(name=u'Alice', age=2)", u"Row(name=u'Joe', age=3)"]

This is not an rdd of Rows.

    a.map(lambda g:g.age).collect()
    AttributeError: 'unicode' object has no attribute 'age'

Is there any way to save the
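
The Row(...) wrappers end up in the file because saveAsTextFile writes the string form of each Row. A sketch of two ways around it (output paths are hypothetical): flatten each Row to a delimited string before saving, or use the DataFrame writer directly.

    # Flatten each Row to plain delimited text before saving...
    (df.rdd
       .map(lambda row: ",".join(str(c) for c in row))
       .saveAsTextFile("/tmp/sample_text"))            # hypothetical output path

    # ...or skip the RDD entirely and let the DataFrame writer produce CSV.
    df.write.mode("overwrite").csv("/tmp/sample_csv")  # hypothetical output path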

How can I pass datasets between %pyspark interpreter and %python interpreters in Zeppelin?

时光毁灭记忆、已成空白 submitted on 2019-12-20 05:14:18
Question: I'm writing code in which I fetch a dataset using an internal library and the %pyspark interpreter. However, I am unable to pass the dataset to the %python interpreter. I tried using string variables and that works fine, but for the dataset I'm using the following code to put it in the Zeppelin context: z.put("input_data",input_data), and it throws the following error: AttributeError: 'DataFrame' object has no attribute '_get_object_id'. Can you please tell me how I can do this? Thanks in
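
The error comes from z.put() trying to hand the PySpark DataFrame to the JVM as if it were a Java object. One workaround, sketched here as an assumption rather than taken from the original thread, is to hand the data over through a shared file: write it out from the %pyspark paragraph and read it back as a pandas DataFrame in the %python paragraph (the path is hypothetical, and pandas needs pyarrow or fastparquet installed to read Parquet).

    # %pyspark paragraph
    input_data.write.mode("overwrite").parquet("/tmp/input_data_parquet")

    # %python paragraph
    import pandas as pd
    input_data_pd = pd.read_parquet("/tmp/input_data_parquet")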

replace column values in spark dataframe based on dictionary similar to np.where

一个人想着一个人 submitted on 2019-12-20 04:48:00
Question: My data frame looks like -

    no  city      amount
    1   Kenora    56%
    2   Sudbury   23%
    3   Kenora    71%
    4   Sudbury   41%
    5   Kenora    33%
    6   Niagara   22%
    7   Hamilton  88%

It consists of 92M records. I want my data frame to look like -

    no  city      amount  new_city
    1   Kenora    56%     X
    2   Niagara   23%     X
    3   Kenora    71%     X
    4   Sudbury   41%     Sudbury
    5   Ottawa    33%     Ottawa
    6   Niagara   22%     X
    7   Hamilton  88%     Hamilton

Using python I can manage it (using np.where), but I am not getting any results in pyspark. Any help? Here is what I have done so far -

    #create dictionary city_dict
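
The closest PySpark equivalent of np.where for this case is a when/otherwise expression; a sketch with an assumed keep-list standing in for city_dict: keep the city name when it is in the list, otherwise fall back to "X".

    from pyspark.sql import functions as F

    # Hypothetical stand-in for city_dict: cities to keep; everything else becomes "X".
    keep_cities = ["Sudbury", "Ottawa", "Hamilton"]

    df = df.withColumn(
        "new_city",
        F.when(F.col("city").isin(keep_cities), F.col("city")).otherwise(F.lit("X")))

For a genuine key-to-value dictionary lookup, a map column built with F.create_map from the dictionary items works in the same join-free way.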

Pyspark - ValueError: could not convert string to float / invalid literal for float()

一曲冷凌霜 submitted on 2019-12-20 04:38:19
Question: I am trying to use data from a spark dataframe as the input for my k-means model. However, I keep getting errors (see the section after the code). My spark dataframe looks like this (and has around 1M rows):

    ID  col1  col2  Latitude  Longitude
    13  ...   ...   22.2      13.5
    62  ...   ...   21.4      13.8
    24  ...   ...   21.8      14.1
    71  ...   ...   28.9      18.0
    ... ...   ...   ....      ....

Here is my code:

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors
    df = spark.read.csv("file.csv")
    spark_rdd = df.rdd.map
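
This error typically means the columns were read as strings: spark.read.csv defaults every column to StringType and, without header=True, also treats the header line as a data row, so float() trips over values like "Latitude". A sketch of one way to avoid it (the column names come from the question; the rest is assumed): read with an inferred schema, assemble the coordinates into a features vector, and fit KMeans on that.

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # header=True keeps the header out of the data; inferSchema=True gives numeric types.
    df = spark.read.csv("file.csv", header=True, inferSchema=True)

    assembler = VectorAssembler(inputCols=["Latitude", "Longitude"], outputCol="features")
    features = assembler.transform(df).select("features")

    model = KMeans(k=5, seed=1).fit(features)   # k is an arbitrary illustrative choice
    centers = model.clusterCenters()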

How to write streaming dataframe to PostgreSQL?

回眸只為那壹抹淺笑 submitted on 2019-12-20 04:36:30
Question: I have a streaming dataframe that I am trying to write into a database. There is documentation for writing an rdd or df into Postgres, but I am unable to find examples or documentation on how it is done in Structured Streaming. I have read the documentation https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch, but I couldn't understand where I would create a jdbc connection and how I would write it to the database.

    def foreach_batch_function(df,
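
A minimal sketch of the foreachBatch pattern from the linked guide, assuming Spark 2.4+, the PostgreSQL JDBC driver on the classpath, and hypothetical connection details: each micro-batch arrives as a plain DataFrame, so the ordinary JDBC writer can be used inside the function.

    def foreach_batch_function(batch_df, batch_id):
        # Each micro-batch is a regular DataFrame, so the batch JDBC writer applies.
        (batch_df.write
            .format("jdbc")
            .option("url", "jdbc:postgresql://localhost:5432/mydb")   # hypothetical URL
            .option("dbtable", "public.my_table")                     # hypothetical table
            .option("user", "postgres")                               # hypothetical credentials
            .option("password", "secret")
            .option("driver", "org.postgresql.Driver")
            .mode("append")
            .save())

    query = (streaming_df.writeStream            # streaming_df: your streaming dataframe
             .foreachBatch(foreach_batch_function)
             .start())
    query.awaitTermination()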