pyspark

PySpark: adding columns from a list

非 Y 不嫁゛ Submitted on 2021-02-08 07:38:35
Question: I have a dataframe and would like to add columns to it based on values from a list. The list of values will vary from 3 to 50 entries. I'm new to PySpark and I'm trying to append these values as new (empty) columns to my df. I've seen recommended code for adding one column to a dataframe, but not for adding multiple columns from a list. mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName',
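A minimal sketch of one way to do this, assuming the new columns should simply start out as empty (null) string columns; the sample dataframe below is made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('add-columns').getOrCreate()

# Illustrative dataframe; the asker's real df would be used instead.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId']

# Append each name in the list as a new, empty (null) column.
for col_name in mylist:
    df = df.withColumn(col_name, F.lit(None).cast(StringType()))

df.printSchema()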

Parsing JSON object with large number of unique keys (not a list of objects) using PySpark

大兔子大兔子 Submitted on 2021-02-08 07:21:15
Question: I'm currently dealing with the following source data in a JSON file:

{ "unique_key_1": { "some_value_1": 1, "some_value_2": 2 }, "unique_key_2": { "some_value_1": 2, "some_value_2": 3 }, "unique_key_3": { "some_value_1": 2, "some_value_2": 1 } ... }

Note that the source data is effectively one large dictionary with lots of unique keys. It is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark: key | some
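One possible approach (a sketch under assumptions, not necessarily the best option for very large files): let Spark read the whole object as a single row, convert that row into a map of key -> struct, and explode the map into rows. The file name data.json and the integer value types are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, StructType, StructField, LongType

spark = SparkSession.builder.appName('json-dict').getOrCreate()

# The file is a single JSON object, so multiLine mode is needed; it becomes one row
# whose columns are the unique keys.
raw = spark.read.option('multiLine', 'true').json('data.json')

value_schema = StructType([
    StructField('some_value_1', LongType()),
    StructField('some_value_2', LongType()),
])

# Re-serialize the single row to JSON and parse it back as a map of key -> values.
as_map = F.from_json(F.to_json(F.struct(*raw.columns)), MapType(StringType(), value_schema))

df = (raw
      .select(F.explode(as_map).alias('key', 'vals'))
      .select('key', 'vals.some_value_1', 'vals.some_value_2'))

df.show()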

Spark computation for suggesting new friendships

南笙酒味 Submitted on 2021-02-08 07:17:41
Question: I'm using Spark for fun and to learn new things about MapReduce. I'm trying to write a program that suggests new friendships (i.e., a sort of recommendation system). A friendship between two individuals is suggested if they are not yet connected and have many friends in common. The friendship text file has a structure similar to the following:

1 2,4,11,12,15
2 1,3,4,5,9,10
3 2,5,11,15,20,21
4 1,2,3
5 2,3,4,15,16
...

where the syntax is: ID_SRC1<TAB>ID_DST1,ID_DST2,... .
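A rough RDD-based sketch of the common-friends idea (the file path is an assumption, and this is only one of several possible MapReduce formulations): for each user, every pair of that user's friends shares at least one common friend, so emitting and counting those pairs, then removing pairs that are already friends, yields candidate suggestions ranked by the number of common friends.

from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('friend-suggestions').getOrCreate()
sc = spark.sparkContext

def parse(line):
    user, friends = line.split('\t')
    return int(user), [int(f) for f in friends.split(',')]

adj = sc.textFile('friendships.txt').map(parse)  # assumed path

# Existing friendships as unordered pairs, so they can be excluded later.
existing = adj.flatMap(lambda uf: [(tuple(sorted((uf[0], f))), None) for f in uf[1]]).distinct()

# Every two friends of the same user share at least that user as a common friend.
candidate_pairs = adj.flatMap(lambda uf: [tuple(sorted(p)) for p in combinations(uf[1], 2)])

# Count common friends per pair, drop pairs that are already connected, rank by count.
suggestions = (candidate_pairs
               .map(lambda pair: (pair, 1))
               .reduceByKey(lambda a, b: a + b)
               .subtractByKey(existing)
               .sortBy(lambda kv: -kv[1]))

print(suggestions.take(10))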

Remove duplicates from PySpark array column

こ雲淡風輕ζ Submitted on 2021-02-08 06:48:50
Question: I have a PySpark DataFrame that contains an ArrayType(StringType()) column. This column contains duplicate strings inside the array, which I need to remove. For example, one row entry could look like [milk, bread, milk, toast]. Let's say my dataframe is named df and my column is named arraycol. I need something like: df = df.withColumn("arraycol_without_dupes", F.remove_dupes_from_array("arraycol")) My intuition was that there exists a simple solution to this, but after browsing stackoverflow
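For Spark 2.4 and later there is a built-in function that does exactly this; a small sketch, with a made-up one-row dataframe matching the question's example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('dedupe-array').getOrCreate()

df = spark.createDataFrame([(['milk', 'bread', 'milk', 'toast'],)], ['arraycol'])

# array_distinct (Spark 2.4+) removes duplicate elements from an array column.
df = df.withColumn('arraycol_without_dupes', F.array_distinct('arraycol'))
df.show(truncate=False)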

Pyspark: Calculate streak of consecutive observations

走远了吗. Submitted on 2021-02-08 06:44:26
Question: I have a Spark (2.4.0) data frame with a column that has just two values (either 0 or 1). I need to calculate the streak of consecutive 0s and 1s in this data, resetting the streak to zero if the value changes. An example:

from pyspark.sql import (SparkSession, Window)
from pyspark.sql.functions import (to_date, row_number, lead, col)

spark = SparkSession.builder.appName('test').getOrCreate()

# Create dataframe
df = spark.createDataFrame([
    ('2018-01-01', 'John', 0, 0),
    ('2018-01-01', 'Paul
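One common window-function approach is sketched below; the sample data and column names are assumptions, since the question's example is cut off. The idea is to mark where the value changes, turn the running sum of those change markers into a group id, and count rows within each group.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('streaks').getOrCreate()

df = spark.createDataFrame([
    ('John', 1, 0), ('John', 2, 0), ('John', 3, 1),
    ('John', 4, 1), ('John', 5, 1), ('John', 6, 0),
], ['name', 'day', 'value'])

w = Window.partitionBy('name').orderBy('day')

# A new group starts whenever the value differs from the previous row's value.
change = (F.col('value') != F.lag('value').over(w)).cast('int')
df = df.withColumn('grp', F.sum(F.coalesce(change, F.lit(0))).over(w))

# The streak is the (1-based) row count so far within each run of identical values.
streak_w = Window.partitionBy('name', 'grp').orderBy('day')
df = df.withColumn('streak', F.row_number().over(streak_w))

df.orderBy('name', 'day').show()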

JSON file to PySpark dataframe

99封情书 Submitted on 2021-02-08 06:14:09
Question: I'm trying to work with a JSON file in a Spark (PySpark) environment. Problem: unable to convert the JSON to the expected format in a PySpark DataFrame. 1st input data set: https://health.data.ny.gov/api/views/cnih-y5dw/rows.json In this file the metadata is defined at the start of the file under the tag "meta", followed by the data under the tag "data". FYI, steps taken to get the data from the web into Spark:
1. I downloaded the file to my local drive
2. then pushed it to HDFS - from there I'm reading it into Spark
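A sketch of one way to flatten this kind of file, assuming the usual Socrata rows.json layout in which column names live under meta.view.columns and the rows under data; the HDFS path is an assumption:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('rows-json').getOrCreate()

# The whole file is a single JSON object, so it must be read in multiLine mode.
raw = spark.read.option('multiLine', 'true').json('hdfs:///tmp/rows.json')

# Column names are described in the metadata block.
meta_rows = (raw.select(F.explode('meta.view.columns').alias('c'))
                .select('c.name')
                .collect())
col_names = [r['name'] for r in meta_rows]

# Each element of "data" is one row, stored as an array of values.
data = raw.select(F.explode('data').alias('row'))
df = data.select(*[F.col('row')[i].alias(col_names[i]) for i in range(len(col_names))])

df.show(5)

Depending on how Spark infers the schema of the mixed-type "data" arrays, the resulting values may come back as strings and need casting to the types described in the metadata afterwards.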

PySpark - compare single list of integers to column of lists

隐身守侯 Submitted on 2021-02-08 05:44:26
Question: I'm trying to check which entries in a Spark dataframe (a column containing lists) contain the largest number of values from a given list. The best approach I've come up with is iterating over the dataframe with rdd.foreach() and comparing the given list to every entry using Python's set1.intersection(set2). My question is: does Spark have any built-in functionality for this, so that iterating with .foreach could be avoided? Thanks for any help! P.S. my dataframe looks like this: +-------------+-------------
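For Spark 2.4+, there is a column-level way to do this without foreach, using array_intersect and size; the dataframe and the query list below are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('list-overlap').getOrCreate()

df = spark.createDataFrame(
    [(1, [1, 2, 3]), (2, [2, 4, 6]), (3, [7, 8, 9])],
    ['id', 'values'])

query = [2, 3, 4]  # the given list to compare against

# Count how many of the query values appear in each row's array, then rank rows.
query_col = F.array(*[F.lit(v) for v in query])
df = df.withColumn('overlap', F.size(F.array_intersect('values', query_col)))
df.orderBy(F.desc('overlap')).show()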