pyspark-sql

How to transform DataFrame per one column to create two new columns in pyspark?

我的未来我决定 submitted on 2019-12-06 15:27:03
I have a dataframe "x", In which their are two columns "x1" and "x2" x1(status) x2 kv,true 45 bm,true 65 mp,true 75 kv,null 450 bm,null 550 mp,null 650 I want to convert this dataframe into a format in which data is filtered according to its status and value x1 true null kv 45 450 bm 65 550 mp 75 650 Is there a way to do this, I am using pyspark datadrame Mariusz Yes, there is a way. First split the first column by , using split function, then split this dataframe into two dataframes (using where twice) and simply join this new dataframes on first column.. In Spark API for Scala it'd be as

How to use foreach sink in pyspark?

你离开我真会死。 submitted on 2019-12-06 15:09:13
How can I use foreach in Python Spark Structured Streaming to trigger operations on the output?

    query = wordCounts\
        .writeStream\
        .outputMode('update')\
        .foreach(func)\
        .start()

    def func():
        ops(wordCounts)

TL;DR: It was not possible to use the foreach method in pyspark before Spark 2.4. Quoting the official documentation of Spark Structured Streaming (highlighting mine): "The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java." Support for the foreach sink in Python was added in Spark 2.4.0 and the documentation has been updated: http:
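
A minimal sketch of the Python foreach sink available since Spark 2.4, assuming wordCounts is the streaming DataFrame from the question; the row handler here is only illustrative:

    # In Spark 2.4+, foreach accepts a function that is called once per output row
    def process_row(row):
        # arbitrary per-row side effect, e.g. writing to an external store
        print(row.asDict())

    query = wordCounts.writeStream \
        .outputMode("update") \
        .foreach(process_row) \
        .start()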

How to convert int64 datatype columns of parquet file to timestamp in SparkSQL data frame?

扶醉桌前 submitted on 2019-12-06 13:54:27
Question: My DataFrame looks like this:

    +----------------+----+
    |   Business_Date|Code|
    +----------------+----+
    |1539129600000000| BSD|
    |1539129600000000| BTN|
    |1539129600000000| BVI|
    |1539129600000000| BWP|
    |1539129600000000| BYB|
    +----------------+----+

I want to convert the Business_Date column from bigint to a timestamp value while loading the data into a Hive table. How can I do this?

Answer 1: You can use pyspark.sql.functions.from_unixtime(), which converts the number of
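
A sketch of that conversion, under the assumption that Business_Date holds microseconds since the Unix epoch (1539129600000000 would correspond to 2018-10-10), so the value is divided by 1,000,000 before from_unixtime, which expects seconds:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "Business_Date",
        F.from_unixtime((F.col("Business_Date") / 1000000).cast("long")).cast("timestamp"),
    )
    df.printSchema()  # Business_Date is now a timestamp column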

Difference between Caching mechanism in Spark SQL

∥☆過路亽.° submitted on 2019-12-06 13:32:47
I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following code snippets?

Method 1:

    cache table test_cache AS
    select a, b, c from x inner join y on x.a = y.a;

Method 2:

    create temporary view test_cache AS
    select a, b, c from x inner join y on x.a = y.a;
    cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results the very first time the temp view is created in Method 2, or will it wait until an action such as collect is applied to it? In Spark SQL there is a difference in caching if you use directly SQL or you use
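
A sketch of the two methods run through spark.sql (assuming a SparkSession named spark); note that the SQL CACHE TABLE statement is eager by default, while CACHE LAZY TABLE defers materialization, which is the key contrast with the lazy DataFrame cache() call:

    # Method 1: cache-as-select; eager, so the join runs immediately
    spark.sql("""
        CACHE TABLE test_cache AS
        SELECT a, b, c FROM x INNER JOIN y ON x.a = y.a
    """)

    # Method 2: create the view first, then cache it; CACHE TABLE is still eager,
    # but CACHE LAZY TABLE would defer the work until the first query that uses it
    spark.sql("""
        CREATE TEMPORARY VIEW test_cache2 AS
        SELECT a, b, c FROM x INNER JOIN y ON x.a = y.a
    """)
    spark.sql("CACHE TABLE test_cache2")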

How to overwrite the rdd saveAsPickleFile(path) if file already exist in pyspark?

女生的网名这么多〃 submitted on 2019-12-06 09:03:41
How can I overwrite the RDD output objects at an existing path when saving?

test1:

    975078|56691|2.000|20171001_926_570_1322
    975078|42993|1.690|20171001_926_570_1322
    975078|46462|2.000|20171001_926_570_1322
    975078|87815|1.000|20171001_926_570_1322

    from pyspark.sql import Row

    rdd = sc.textFile('/home/administrator/work/test1') \
        .map(lambda x: x.split("|")[:4]) \
        .map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2])))
    rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")

The first time it saves properly. Now I removed one line from the input file and saved the RDD to the same location, and it
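
A sketch of one common workaround: the RDD API has no overwrite mode, so the existing output directory is deleted through the Hadoop FileSystem API before saving again. This relies on the py4j gateway (sc._jsc / sc._jvm), which is not a public PySpark API, and reuses the path and rdd from the question:

    path = "/home/administrator/work/foobar_seq1"

    hadoop_conf = sc._jsc.hadoopConfiguration()
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    target = sc._jvm.org.apache.hadoop.fs.Path(path)
    if fs.exists(target):
        fs.delete(target, True)  # recursive delete of the old output

    rdd.coalesce(1).saveAsPickleFile(path)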

Pyspark - how to backfill a DataFrame?

杀马特。学长 韩版系。学妹 submitted on 2019-12-06 07:32:15
How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame? The pyspark dataframe has the pyspark.sql.DataFrame.fillna method, however there is no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data:

    import pandas as pd
    index = pd.date_range('2017-01-01', '2017-01-05')
    data = [1, 2, 3, None, 5]
    df = pd.DataFrame({'data': data}, index=index)

Giving:

    Out[1]:
                data
    2017-01-01   1.0
    2017-01-02   2.0
    2017-01-03   3.0
    2017-01-04   NaN
    2017-01-05   5.0

Backfill the dataframe:

    df = df.fillna(method='bfill')
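
A window-based backfill sketch, assuming a Spark DataFrame sdf with a 'date' column that defines the ordering and a nullable 'data' column to fill:

    from pyspark.sql import functions as F, Window

    # For each row, look from the current row to the end of the ordered frame and
    # take the first non-null value; rows that already have data keep their value.
    w = Window.orderBy("date").rowsBetween(0, Window.unboundedFollowing)

    sdf = sdf.withColumn("data", F.first("data", ignorenulls=True).over(w))

Without a partitionBy this window pulls everything into a single partition, so in practice you would add a partitioning key if the data allows it.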

Create a dataframe from a list in pyspark.sql

本小妞迷上赌 submitted on 2019-12-06 04:19:37
Question: I am totally lost in a weird situation. I have a list li:

    li = example_data.map(lambda x: get_labeled_prediction(w, x)).collect()
    print li, type(li)

The output is:

    [(0.0, 59.0), (0.0, 51.0), (0.0, 81.0), (0.0, 8.0), (0.0, 86.0), (0.0, 86.0), (0.0, 60.0), (0.0, 54.0), (0.0, 54.0), (0.0, 84.0)] <type 'list'>

When I try to create a dataframe from this list:

    m = sqlContext.createDataFrame(l, ["prediction", "label"])

it throws the error message:

    TypeError Traceback (most recent call last)
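
A minimal sketch of createDataFrame from a list of tuples, assuming the sqlContext from the question. One common cause of this TypeError (an assumption here, since the traceback above is cut off) is that the collected values are numpy scalar types rather than built-in floats, so the sketch casts them first:

    li = [(0.0, 59.0), (0.0, 51.0), (0.0, 81.0)]

    # Cast to built-in floats in case the elements are numpy.float64 values
    clean = [(float(p), float(l)) for p, l in li]

    m = sqlContext.createDataFrame(clean, ["prediction", "label"])
    m.show()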

How to add sparse vectors after group by, using Spark SQL?

余生颓废 submitted on 2019-12-06 04:02:54
Question: I am building a news recommendation system and I need to build a table of users and the news they read. My raw data looks like this:

    001436800277225 ["9161492","9161787","9378531"]
    009092130698762 ["9394697"]
    010003000431538 ["9394697","9426473","9428530"]
    010156461231357 ["9350394","9414181"]
    010216216021063 ["9173862","9247870"]
    010720006581483 ["9018786"]
    011199797794333 ["9017977","9091134","9142852","9325464","9331913"]
    011337201765123 ["9161294","9198693"]
    011414545455156 ["9168185","9178348"
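
A sketch of summing sparse vectors after a groupBy, assuming the news ids have already been encoded into a SparseVector column named "features" alongside a "user_id" column (these names are illustrative, not from the question). Spark has no built-in "+" for SparseVector, so a small UDF merges the index/value pairs:

    from collections import defaultdict
    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql import functions as F

    def add_sparse(vectors):
        # Merge the index -> value maps of all vectors in the group
        size = vectors[0].size
        acc = defaultdict(float)
        for v in vectors:
            for i, val in zip(v.indices, v.values):
                acc[int(i)] += float(val)
        return SparseVector(size, dict(acc))

    sum_sparse = F.udf(add_sparse, VectorUDT())

    result = (df.groupBy("user_id")
                .agg(F.collect_list("features").alias("vecs"))
                .withColumn("features", sum_sparse("vecs"))
                .drop("vecs"))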

How to transform JSON strings in columns of dataframe in PySpark?

孤人 submitted on 2019-12-06 03:04:51
I have a pyspark dataframe as shown below:

    +--------------------+---+
    |                 _c0|_c1|
    +--------------------+---+
    |     {"object":"F...|  0|
    |     {"object":"F...|  1|
    |     {"object":"F...|  2|
    |     {"object":"E...|  3|
    |     {"object":"F...|  4|
    |     {"object":"F...|  5|
    |     {"object":"F...|  6|
    |     {"object":"S...|  7|
    |     {"object":"F...|  8|
    +--------------------+---+

The column _c0 contains a string in dictionary form:

    '{"object":"F","time":"2019-07-18T15:08:16.143Z","values":[0.22124142944812775,0.2147877812385559,0.16713131964206696,0.3102800250053406,0.31872493028640747,0.3366488814353943,0.25324496626853943,0.14537988603115082,0.12684473395347595,0
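
A sketch using from_json to parse the JSON strings, assuming the keys visible in the snippet ("object", "time", "values"); the time field is kept as a string here and can be cast to a timestamp separately:

    from pyspark.sql import functions as F, types as T

    schema = T.StructType([
        T.StructField("object", T.StringType()),
        T.StructField("time", T.StringType()),
        T.StructField("values", T.ArrayType(T.DoubleType())),
    ])

    parsed = (df.withColumn("parsed", F.from_json("_c0", schema))
                .select("_c1", "parsed.*"))
    parsed.show(truncate=False)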

Using Python's reduce() to join multiple PySpark DataFrames

我是研究僧i submitted on 2019-12-05 20:02:53
Does anyone know why using Python 3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than iteratively joining the same DataFrames with a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

    def join_dataframes(list_of_join_columns, left_df, right_df):
        return left_df.join(right_df, on=list_of_join_columns)

    joined_df = functools.reduce(
        functools.partial(join_dataframes, list_of_join_columns),
        list_of_dataframes,
    )

whereas this one doesn't:

    joined_df = list_of_dataframes[0]
    joined_df.cache()
    for right_df in
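
For reference, a sketch of the iterative pattern being contrasted (the original snippet is cut off above, so the loop body here is an assumption); the names mirror the question:

    joined_df = list_of_dataframes[0]
    joined_df.cache()
    for right_df in list_of_dataframes[1:]:
        joined_df = joined_df.join(right_df, on=list_of_join_columns)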