apache-spark

Spark: Extract dataframe from logical plan

雨燕双飞 submitted on 2021-01-03 06:29:44
Question: This line of code converts a DataFrame into a logical plan:

val logical = df.queryExecution.logical

Can we do the opposite, i.e. extract from the logical plan the DataFrames that were used?

Answer 1: The Dataset companion object has a method

def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan)

so if you have a logical plan, you can turn it back into a DataFrame by calling Dataset.ofRows(sparkSession, logical).

Source: https://stackoverflow.com/questions/43763246/spark-extract-dataframe-from-logical
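
A minimal end-to-end sketch of the round trip described in the answer is shown below. Note that the Dataset companion object is private[sql] in most Spark versions, so the code assumes it is compiled inside the org.apache.spark.sql package; the example data, app name, and master are made up for illustration.

```scala
// Round trip: DataFrame -> logical plan -> DataFrame via Dataset.ofRows.
// The Dataset companion object is private[sql], hence the package declaration.
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object LogicalPlanRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ofRows-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // DataFrame -> logical plan
    val logical: LogicalPlan = df.queryExecution.logical

    // logical plan -> DataFrame
    val rebuilt: DataFrame = Dataset.ofRows(spark, logical)
    rebuilt.show()

    spark.stop()
  }
}
```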

Spark Streaming reach dataframe columns and add new column looking up to Redis

跟風遠走 submitted on 2021-01-01 17:49:09
Question: In my previous question (Spark Structured Streaming dynamic lookup with Redis), I managed to reach Redis with mapPartitions, thanks to https://stackoverflow.com/users/689676/fe2s. I tried to use mapPartitions, but I could not solve one point: how can I access each row's columns in the code below while iterating? I want to enrich every row against the lookup fields kept in Redis. I found something like this, but how can I read the DataFrame columns and add a new column by looking them up in Redis?
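
The code the question refers to does not appear in the excerpt, so the following is only a minimal sketch of the general pattern: read each row's columns inside mapPartitions and attach a looked-up value as a new column. The Event and EnrichedEvent case classes and the lookupInRedis helper are placeholders, not the poster's code or the real Redis client call from the linked answer.

```scala
// Sketch: per-partition enrichment, with a placeholder standing in for the
// actual Redis lookup used in the linked answer.
import org.apache.spark.sql.{Dataset, SparkSession}

case class Event(id: String, value: Double)                            // assumed input columns
case class EnrichedEvent(id: String, value: Double, category: String)  // input plus the new column

object RedisEnrichmentSketch {
  // Placeholder lookup; the real job would call a Redis client opened
  // once per partition (and closed when the partition is done).
  def lookupInRedis(key: String): String = s"category-for-$key"

  def enrich(spark: SparkSession, events: Dataset[Event]): Dataset[EnrichedEvent] = {
    import spark.implicits._
    events.mapPartitions { rows =>
      rows.map { e =>
        // Each row's columns are reachable as plain fields while iterating,
        // and the looked-up value becomes the added column.
        EnrichedEvent(e.id, e.value, lookupInRedis(e.id))
      }
    }
  }
}
```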

How do I import classes from one or more local .jar files into a Spark/Scala Notebook?

隐身守侯 submitted on 2021-01-01 08:13:35
Question: I am struggling to load classes from JARs into my Scala-Spark kernel Jupyter notebook. I have JARs at this location: /home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/ with contents listed as follows:

-rwx------ 1 hadoop hadoop 7170 Sep 11 20:54 BaggedPoint.scala
-rw-rw-r-- 1 hadoop hadoop 186719 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1.jar
-rw-rw-r-- 1 hadoop hadoop 1482 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1-javadoc.jar
-rw-rw-r-- 1 hadoop hadoop 20252 Sep 11
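
The excerpt stops mid-listing and does not include the accepted answer, so the block below is only a sketch of one common way to make a local JAR visible to a Spark Scala session: pass it through the standard spark.jars setting when the session is built. The app name and the commented import are assumptions; whether a given notebook kernel also puts the JAR on the driver's classpath depends on the setup (Toree, for example, has its own %AddJar mechanism).

```scala
// Sketch: register a local JAR via spark.jars when building the session.
import org.apache.spark.sql.SparkSession

val jarPath =
  "/home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/" +
    "isolation-forest_2.3.0_2.11-1.0.1.jar"

val spark = SparkSession.builder()
  .appName("notebook-with-local-jar")
  .config("spark.jars", jarPath)   // ship the JAR with the application
  .getOrCreate()

// Once the JAR is on the classpath, its classes can be imported as usual,
// e.g. (class name assumed from the library, not confirmed by the excerpt):
// import com.linkedin.relevance.isolationforest.IsolationForest
```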

How to yield pandas dataframe rows to spark dataframe

泄露秘密 submitted on 2021-01-01 08:10:36
Question: Hi, I'm doing a transformation. I have created a generator some_function(iter) that yields Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn transformed rows from a pandas DataFrame into an RDD and then into a Spark DataFrame, but I'm getting errors. (I must use pandas to transform the data, as there is a large amount of legacy code.) Input Spark DataFrame:

respond_sdf.show()
+-------------------------------------------------------------------+
|content                                                            |
+----------------------------------------------------------

Using tensorflow.keras model in pyspark UDF generates a pickle error

最后都变了- submitted on 2021-01-01 07:02:47
Question: I would like to use a tensorflow.keras model in a PySpark pandas_udf. However, I get a pickle error when the model is serialized before being sent to the workers. I am not sure I am using the best method for what I want, so I will show a minimal but complete example. Packages: tensorflow-2.2.0 (but the error is triggered by all previous versions too) and pyspark-2.4.5. The import statements are:

import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential

Apache Spark’s Structured Streaming with Google PubSub

为君一笑 submitted on 2021-01-01 04:14:32
Question: I'm using a Spark DStream to pull and process data from Google Pub/Sub. I'm looking for a way to move to Structured Streaming while still using Pub/Sub. I should also mention that my messages are Snappy-compressed in Pub/Sub. I found this issue, which claims that using Pub/Sub with Structured Streaming is not supported. Has anyone encountered this problem? Is it possible to implement a custom Receiver to read the data from Pub/Sub? Thanks.

Answer 1: The feature request you referenced is still accurate:

How to load tar.gz files in streaming datasets?

二次信任 submitted on 2021-01-01 03:51:56
Question: I would like to stream from tar-gzip (tgz) files that contain my actual CSV data. I already managed to do Structured Streaming with Spark 2.2 when my data comes in as plain CSV files, but in fact the data arrives as gzipped CSV files. Is there a way for the trigger run by Structured Streaming to decompress the files before handling the CSV stream? The code I use to process the files is this:

val schema = Encoders.product[RawData].schema
val trackerData = spark
  .readStream
  .option(
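
The excerpt cuts off inside the readStream options, so the block below is only a reconstruction of that baseline CSV stream under assumed options, paths, and RawData fields. One relevant detail: Spark's CSV source reads plain gzip files (*.csv.gz) transparently through the Hadoop codecs, but it does not unpack tar archives, so true .tgz files would need to be extracted (or repacked as .csv.gz) before they land in the watched directory.

```scala
// Sketch of the baseline streaming CSV reader the question starts from.
import org.apache.spark.sql.{Encoders, SparkSession}

case class RawData(id: String, timestamp: Long, value: Double) // fields assumed, not from the question

object TgzCsvStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-stream-sketch")
      .getOrCreate()

    val schema = Encoders.product[RawData].schema

    val trackerData = spark
      .readStream
      .option("header", "false")       // option assumed; the original snippet is truncated here
      .schema(schema)
      .csv("/data/incoming/*.csv.gz")  // assumed watch path; works for gzip, not for tar archives

    val query = trackerData.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```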