pyspark

PySpark: adding columns from a list

非 Y 不嫁゛ Submitted on 2021-02-08 07:38:35
Question: I have a dataframe and would like to add columns to it based on values from a list. The list of values will vary from 3 to 50 entries. I'm new to PySpark and I'm trying to append these values as new (empty) columns to my df. I've seen recommended code for adding one column to a dataframe, but not for adding multiple columns from a list. mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName',
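A minimal sketch of one way to do this, assuming the new columns should simply start out as empty (null) string columns; the sample dataframe below is made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('add-columns').getOrCreate()

# Illustrative dataframe; the asker's real df would be used instead.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId']

# Append each name in the list as a new, empty (null) column.
for col_name in mylist:
    df = df.withColumn(col_name, F.lit(None).cast(StringType()))

df.printSchema()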

Parsing JSON object with large number of unique keys (not a list of objects) using PySpark

大兔子大兔子 Submitted on 2021-02-08 07:21:15
Question: I'm currently dealing with the following source data in a JSON file:

{ "unique_key_1": { "some_value_1": 1, "some_value_2": 2 }, "unique_key_2": { "some_value_1": 2, "some_value_2": 3 }, "unique_key_3": { "some_value_1": 2, "some_value_2": 1 } ... }

Note that the source data is effectively one large dictionary with lots of unique keys. It is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark: key | some
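One possible approach (a sketch under assumptions, not necessarily the best option for very large files): let Spark read the whole object as a single row, convert that row into a map of key -> struct, and explode the map into rows. The file name data.json and the integer value types are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, StructType, StructField, LongType

spark = SparkSession.builder.appName('json-dict').getOrCreate()

# The file is a single JSON object, so multiLine mode is needed; it becomes one row
# whose columns are the unique keys.
raw = spark.read.option('multiLine', 'true').json('data.json')

value_schema = StructType([
    StructField('some_value_1', LongType()),
    StructField('some_value_2', LongType()),
])

# Re-serialize the single row to JSON and parse it back as a map of key -> values.
as_map = F.from_json(F.to_json(F.struct(*raw.columns)), MapType(StringType(), value_schema))

df = (raw
      .select(F.explode(as_map).alias('key', 'vals'))
      .select('key', 'vals.some_value_1', 'vals.some_value_2'))

df.show()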

Spark computation for suggesting new friendships

南笙酒味 Submitted on 2021-02-08 07:17:41
Question: I'm using Spark for fun and to learn new things about MapReduce. I'm trying to write a program that suggests new friendships (i.e., a sort of recommendation system). A friendship between two individuals is suggested if they are not yet connected and have many friends in common. The friendship text file has a structure similar to the following:

1 2,4,11,12,15
2 1,3,4,5,9,10
3 2,5,11,15,20,21
4 1,2,3
5 2,3,4,15,16
...

where the syntax is: ID_SRC1<TAB>ID_DST1,ID_DST2,... .
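A rough RDD-based sketch of the common-friends idea (the file path is an assumption, and this is only one of several possible MapReduce formulations): for each user, every pair of that user's friends shares at least one common friend, so emitting and counting those pairs, then removing pairs that are already friends, yields candidate suggestions ranked by the number of common friends.

from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('friend-suggestions').getOrCreate()
sc = spark.sparkContext

def parse(line):
    user, friends = line.split('\t')
    return int(user), [int(f) for f in friends.split(',')]

adj = sc.textFile('friendships.txt').map(parse)  # assumed path

# Existing friendships as unordered pairs, so they can be excluded later.
existing = adj.flatMap(lambda uf: [(tuple(sorted((uf[0], f))), None) for f in uf[1]]).distinct()

# Every two friends of the same user share at least that user as a common friend.
candidate_pairs = adj.flatMap(lambda uf: [tuple(sorted(p)) for p in combinations(uf[1], 2)])

# Count common friends per pair, drop pairs that are already connected, rank by count.
suggestions = (candidate_pairs
               .map(lambda pair: (pair, 1))
               .reduceByKey(lambda a, b: a + b)
               .subtractByKey(existing)
               .sortBy(lambda kv: -kv[1]))

print(suggestions.take(10))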

Remove duplicates from PySpark array column

こ雲淡風輕ζ Submitted on 2021-02-08 06:48:50
Question: I have a PySpark DataFrame that contains an ArrayType(StringType()) column. This column contains duplicate strings inside the array, which I need to remove. For example, one row entry could look like [milk, bread, milk, toast]. Let's say my dataframe is named df and my column is named arraycol. I need something like: df = df.withColumn("arraycol_without_dupes", F.remove_dupes_from_array("arraycol")) My intuition was that there exists a simple solution to this, but after browsing stackoverflow
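For Spark 2.4 and later there is a built-in function that does exactly this; a small sketch, with a made-up one-row dataframe matching the question's example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('dedupe-array').getOrCreate()

df = spark.createDataFrame([(['milk', 'bread', 'milk', 'toast'],)], ['arraycol'])

# array_distinct (Spark 2.4+) removes duplicate elements from an array column.
df = df.withColumn('arraycol_without_dupes', F.array_distinct('arraycol'))
df.show(truncate=False)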

Pyspark: Calculate streak of consecutive observations

走远了吗. Submitted on 2021-02-08 06:44:26
Question: I have a Spark (2.4.0) data frame with a column that has just two values (either 0 or 1). I need to calculate the streak of consecutive 0s and 1s in this data, resetting the streak to zero if the value changes. An example:

from pyspark.sql import (SparkSession, Window)
from pyspark.sql.functions import (to_date, row_number, lead, col)

spark = SparkSession.builder.appName('test').getOrCreate()

# Create dataframe
df = spark.createDataFrame([
    ('2018-01-01', 'John', 0, 0),
    ('2018-01-01', 'Paul
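One common window-function approach is sketched below; the sample data and column names are assumptions, since the question's example is cut off. The idea is to mark where the value changes, turn the running sum of those change markers into a group id, and count rows within each group.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('streaks').getOrCreate()

df = spark.createDataFrame([
    ('John', 1, 0), ('John', 2, 0), ('John', 3, 1),
    ('John', 4, 1), ('John', 5, 1), ('John', 6, 0),
], ['name', 'day', 'value'])

w = Window.partitionBy('name').orderBy('day')

# A new group starts whenever the value differs from the previous row's value.
change = (F.col('value') != F.lag('value').over(w)).cast('int')
df = df.withColumn('grp', F.sum(F.coalesce(change, F.lit(0))).over(w))

# The streak is the (1-based) row count so far within each run of identical values.
streak_w = Window.partitionBy('name', 'grp').orderBy('day')
df = df.withColumn('streak', F.row_number().over(streak_w))

df.orderBy('name', 'day').show()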

JSON file to PySpark dataframe

99封情书 Submitted on 2021-02-08 06:14:09
Question: I'm trying to work with a JSON file in a Spark (PySpark) environment. Problem: unable to convert the JSON to the expected format in a PySpark DataFrame. 1st input data set: https://health.data.ny.gov/api/views/cnih-y5dw/rows.json In this file the metadata is defined at the start of the file under the tag "meta", followed by the data under the tag "data". FYI, steps taken to get the data from the web into Spark:
1. I downloaded the file to my local drive
2. then pushed it to HDFS - from there I'm reading it into Spark
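A sketch of one way to flatten this kind of file, assuming the usual Socrata rows.json layout in which column names live under meta.view.columns and the rows under data; the HDFS path is an assumption:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('rows-json').getOrCreate()

# The whole file is a single JSON object, so it must be read in multiLine mode.
raw = spark.read.option('multiLine', 'true').json('hdfs:///tmp/rows.json')

# Column names are described in the metadata block.
meta_rows = (raw.select(F.explode('meta.view.columns').alias('c'))
                .select('c.name')
                .collect())
col_names = [r['name'] for r in meta_rows]

# Each element of "data" is one row, stored as an array of values.
data = raw.select(F.explode('data').alias('row'))
df = data.select(*[F.col('row')[i].alias(col_names[i]) for i in range(len(col_names))])

df.show(5)

Depending on how Spark infers the schema of the mixed-type "data" arrays, the resulting values may come back as strings and need casting to the types described in the metadata afterwards.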

PySpark - compare single list of integers to column of lists

隐身守侯 Submitted on 2021-02-08 05:44:26
Question: I'm trying to check which entries in a Spark dataframe (a column containing lists) contain the largest number of values from a given list. The best approach I've come up with is iterating over the dataframe with rdd.foreach() and comparing the given list to every entry using Python's set1.intersection(set2). My question is: does Spark have any built-in functionality for this, so that iterating with .foreach could be avoided? Thanks for any help! P.S. my dataframe looks like this: +-------------+-------------
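For Spark 2.4+, there is a column-level way to do this without foreach, using array_intersect and size; the dataframe and the query list below are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('list-overlap').getOrCreate()

df = spark.createDataFrame(
    [(1, [1, 2, 3]), (2, [2, 4, 6]), (3, [7, 8, 9])],
    ['id', 'values'])

query = [2, 3, 4]  # the given list to compare against

# Count how many of the query values appear in each row's array, then rank rows.
query_col = F.array(*[F.lit(v) for v in query])
df = df.withColumn('overlap', F.size(F.array_intersect('values', query_col)))
df.orderBy(F.desc('overlap')).show()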