apache-spark

SPARK - Joining 2 dataframes on values in an array

非 Y 不嫁゛ submitted on 2021-02-08 06:40:54
Question: I can't find an easy and elegant solution to this one. I have a df1 with this column:

 |-- guitars: array (nullable = true)
 |    |-- element: long (containsNull = true)

I have a df2 made of guitars, with an id matching the longs in my df1:

root
 |-- guitarId: long (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- type: string (nullable = true)

I want to join my two dfs, obviously, and instead of having an array of long, I want an array of structs.
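
One way to get there (a minimal sketch, not from the post, assuming PySpark and a hypothetical df1 key column called "df1_id"): explode the guitars array, join on the guitar id, then collect the matched rows back into an array of structs.

from pyspark.sql import functions as F

# Explode the array so each guitar id becomes its own row
# ("df1_id" below is a hypothetical column identifying a df1 record).
exploded = df1.withColumn("guitarId", F.explode("guitars"))

# Join on the guitar id, then rebuild one row per df1 record with an
# array of structs instead of an array of longs.
joined = (exploded
    .join(df2, "guitarId", "left")
    .groupBy("df1_id")
    .agg(F.collect_list(F.struct("guitarId", "make", "model", "type"))
          .alias("guitars")))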

Count number of words in each sentence with Spark DataFrames

女生的网名这么多〃 submitted on 2021-02-08 06:25:17
Question: I have a Spark DataFrame where each row has a review.

+--------------------+
|          reviewText|
+--------------------+
|Spiritually and m...|
|This is one my mu...|
|This book provide...|
|I first read THE ...|
+--------------------+

I have tried:

SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('reviewText')))
SplitSentences = SplitSentences.select(SplitSentences.split_sent)

Then I created the function:

def word_count(text):
    return len(text.split())

wordcount_udf = udf(lambda x:
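
A minimal sketch (not from the post) of how the truncated snippet might be completed, plus a UDF-free alternative; it assumes the goal is a word count per review in the reviewText column.

from pyspark.sql import functions as F
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

def word_count(text):
    return len(text.split())

# Wrap the plain Python function in a UDF and apply it to each review.
wordcount_udf = udf(lambda x: word_count(x), IntegerType())
counted = df.withColumn("word_count", wordcount_udf(col("reviewText")))

# Built-in alternative that avoids a Python UDF altogether:
counted_builtin = df.withColumn("word_count", F.size(F.split(col("reviewText"), r"\s+")))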

Spark: write JSON to several files from DataFrame based on separation by column value

北慕城南 submitted on 2021-02-08 06:24:35
Question: Suppose I have this DataFrame (df):

user  food         affinity
'u1'  'pizza'      5
'u1'  'broccoli'   3
'u1'  'ice cream'  4
'u2'  'pizza'      1
'u2'  'broccoli'   3
'u2'  'ice cream'  1

Namely, each user has a certain (computed) affinity to a series of foods. The DataFrame is built from several … What I need to do is create a JSON file for each user, with their affinities. For instance, for user 'u1', I want a file containing

[
  {'food': 'pizza', 'affinity': 5},
  {'food': 'broccoli', 'affinity': 3},
  {
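
One possible approach (a sketch under assumptions, not from the post): write the DataFrame partitioned by user, so each user's rows land in their own directory of JSON files; the output paths below are hypothetical.

from pyspark.sql import functions as F

# Each user gets its own directory (user=u1/, user=u2/, ...) holding that
# user's rows as JSON lines: {"food": "pizza", "affinity": 5} and so on.
df.write.partitionBy("user").mode("overwrite").json("/tmp/user_affinities")

# If a single JSON array per user is wanted instead, collect the pairs first:
per_user = df.groupBy("user").agg(
    F.collect_list(F.struct("food", "affinity")).alias("affinities"))
per_user.write.partitionBy("user").mode("overwrite").json("/tmp/user_affinities_single")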

JSON file to PySpark DataFrame

99封情书 submitted on 2021-02-08 06:14:09
Question: I'm trying to work with a JSON file in a Spark (PySpark) environment.

Problem: unable to convert the JSON to the expected format in a PySpark DataFrame.

1st input data set: https://health.data.ny.gov/api/views/cnih-y5dw/rows.json

In this file the metadata is defined at the start of the file under the "meta" tag, followed by the data under the "data" tag.

FYI, steps taken to get the data from the web to the local drive:
1. I've downloaded the file to my local drive
2. then pushed it to HDFS - from there I'm reading it into Spark
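
A sketch of one way to load such a file (not from the post; it assumes the "meta"/"data" layout described above and a hypothetical HDFS path): read the whole file as a single multi-line JSON document, then explode the "data" array into rows. Column names would still have to be pulled out of the "meta" section separately.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The file is one big JSON object (not JSON Lines), so multiLine is needed.
raw = spark.read.option("multiLine", True).json("hdfs:///path/to/rows.json")

# "data" holds an array of records; explode it so each element becomes a row.
rows = raw.select(F.explode("data").alias("record"))
rows.printSchema()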

How to integrate Spark and Kafka for direct stream

元气小坏坏 submitted on 2021-02-08 05:58:16
Question: I am having difficulties creating a basic Spark Streaming application. Right now, I am trying it on my local machine. I have done the following setup:

- Set up Zookeeper
- Set up Kafka (version: kafka_2.10-0.9.0.1)
- Created a topic using the command below:

kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

- Started a producer and a consumer in two different cmd terminals using the commands below:

Producer:

kafka-console-producer.bat --broker-list localhost:9092
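
A minimal sketch of a direct stream against such a local broker (not from the post; it assumes a Spark build with the spark-streaming-kafka-0-8 integration available, which is the one that talks to a 0.9 broker, and the "test" topic created above):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaDirectStream")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Direct (receiver-less) stream: Spark reads offsets straight from the broker.
stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})

stream.map(lambda kv: kv[1]).pprint()  # print each batch's message values

ssc.start()
ssc.awaitTermination()

The script would be submitted with the Kafka integration on the classpath, e.g. spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:<spark-version> app.py (the exact coordinates depend on the Spark and Scala versions in use).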

PySpark - compare single list of integers to column of lists

隐身守侯 submitted on 2021-02-08 05:44:26
Question: I'm trying to check which entries in a Spark DataFrame (a column of lists) contain the largest number of values from a given list. The best approach I've come up with is iterating over the DataFrame with rdd.foreach() and comparing a given list to every entry using Python's set1.intersection(set2). My question is: does Spark have any built-in functionality for this, so that iterating with .foreach could be avoided? Thanks for any help!

P.S. my DataFrame looks like this:

+-------------+-------------
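
A possible built-in route (a sketch, not from the post; it needs Spark 2.4+ for array_intersect and assumes a hypothetical array column named "values"):

from pyspark.sql import functions as F

given = [1, 3, 7]  # example reference list

# Per row, count how many elements of the given list appear in the array column.
with_overlap = df.withColumn(
    "overlap",
    F.size(F.array_intersect(F.col("values"),
                             F.array(*[F.lit(v) for v in given]))))

# Entries with the largest overlap come first.
with_overlap.orderBy(F.col("overlap").desc()).show()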

Converting String RDD to Int RDD

こ雲淡風輕ζ submitted on 2021-02-08 05:38:42
Question: I am new to Scala. I want to know whether, when processing large datasets with Scala in Spark, it is possible to read the data as an Int RDD instead of a String RDD. I tried the below:

val intArr = sc
  .textFile("Downloads/data/train.csv")
  .map(line => line.split(","))
  .map(_.toInt)

But I am getting the error:

error: value toInt is not a member of Array[String]

I need to convert to an Int RDD because down the line I need to do the below:

val vectors = intArr.map(p => Vectors.dense(p))

which requires the type to be integer
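
One possible fix (a sketch, not from the post): split returns an Array[String], so the numeric conversion has to be mapped over each element of that array; note also that Vectors.dense expects Double values rather than Int.

import org.apache.spark.mllib.linalg.Vectors

// Convert each field of each line, not the whole Array[String] at once.
val numArr = sc
  .textFile("Downloads/data/train.csv")
  .map(_.split(",").map(_.toDouble))   // RDD[Array[Double]]

val vectors = numArr.map(p => Vectors.dense(p))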