apache-spark

Spark: Extract dataframe from logical plan

雨燕双飞 submitted on 2021-01-03 06:29:44
Question: This line of code converts a DataFrame into a logical plan:

val logical = df.queryExecution.logical

Can we do the opposite, i.e. extract from the logical plan the DataFrames that were used?

Answer 1: The Dataset companion object has a method

def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan)

so if you have a logical plan, you can turn it back into a DataFrame by calling Dataset.ofRows(sparkSession, logical).

Source: https://stackoverflow.com/questions/43763246/spark-extract-dataframe-from-logical
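
A minimal end-to-end sketch of the round trip described in the answer is shown below. Note that the Dataset companion object is private[sql] in most Spark versions, so the code assumes it is compiled inside the org.apache.spark.sql package; the example data, app name, and master are made up for illustration.

```scala
// Round trip: DataFrame -> logical plan -> DataFrame via Dataset.ofRows.
// The Dataset companion object is private[sql], hence the package declaration.
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object LogicalPlanRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ofRows-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // DataFrame -> logical plan
    val logical: LogicalPlan = df.queryExecution.logical

    // logical plan -> DataFrame
    val rebuilt: DataFrame = Dataset.ofRows(spark, logical)
    rebuilt.show()

    spark.stop()
  }
}
```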

Spark Streaming reach dataframe columns and add new column looking up to Redis

跟風遠走 submitted on 2021-01-01 17:49:09
Question: In my previous question (Spark Structured Streaming dynamic lookup with Redis), I managed to reach Redis with mapPartitions, thanks to https://stackoverflow.com/users/689676/fe2s. I tried to use mapPartitions, but I could not solve one point: how can I access each row's columns in the code below while iterating? I want to enrich every row against the lookup fields kept in Redis. I found something like this, but how can I read the DataFrame columns and add a new column by looking them up in Redis?
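
The code the question refers to does not appear in the excerpt, so the following is only a minimal sketch of the general pattern: read each row's columns inside mapPartitions and attach a looked-up value as a new column. The Event and EnrichedEvent case classes and the lookupInRedis helper are placeholders, not the poster's code or the real Redis client call from the linked answer.

```scala
// Sketch: per-partition enrichment, with a placeholder standing in for the
// actual Redis lookup used in the linked answer.
import org.apache.spark.sql.{Dataset, SparkSession}

case class Event(id: String, value: Double)                            // assumed input columns
case class EnrichedEvent(id: String, value: Double, category: String)  // input plus the new column

object RedisEnrichmentSketch {
  // Placeholder lookup; the real job would call a Redis client opened
  // once per partition (and closed when the partition is done).
  def lookupInRedis(key: String): String = s"category-for-$key"

  def enrich(spark: SparkSession, events: Dataset[Event]): Dataset[EnrichedEvent] = {
    import spark.implicits._
    events.mapPartitions { rows =>
      rows.map { e =>
        // Each row's columns are reachable as plain fields while iterating,
        // and the looked-up value becomes the added column.
        EnrichedEvent(e.id, e.value, lookupInRedis(e.id))
      }
    }
  }
}
```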

How do I import classes from one or more local .jar files into a Spark/Scala Notebook?

隐身守侯 submitted on 2021-01-01 08:13:35
Question: I am struggling to load classes from JARs into my Scala-Spark kernel Jupyter notebook. I have JARs at this location: /home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/ with contents listed as follows:

-rwx------ 1 hadoop hadoop 7170 Sep 11 20:54 BaggedPoint.scala
-rw-rw-r-- 1 hadoop hadoop 186719 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1.jar
-rw-rw-r-- 1 hadoop hadoop 1482 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1-javadoc.jar
-rw-rw-r-- 1 hadoop hadoop 20252 Sep 11
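
The excerpt stops mid-listing and does not include the accepted answer, so the block below is only a sketch of one common way to make a local JAR visible to a Spark Scala session: pass it through the standard spark.jars setting when the session is built. The app name and the commented import are assumptions; whether a given notebook kernel also puts the JAR on the driver's classpath depends on the setup (Toree, for example, has its own %AddJar mechanism).

```scala
// Sketch: register a local JAR via spark.jars when building the session.
import org.apache.spark.sql.SparkSession

val jarPath =
  "/home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/" +
    "isolation-forest_2.3.0_2.11-1.0.1.jar"

val spark = SparkSession.builder()
  .appName("notebook-with-local-jar")
  .config("spark.jars", jarPath)   // ship the JAR with the application
  .getOrCreate()

// Once the JAR is on the classpath, its classes can be imported as usual,
// e.g. (class name assumed from the library, not confirmed by the excerpt):
// import com.linkedin.relevance.isolationforest.IsolationForest
```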

How to yield pandas dataframe rows to spark dataframe

泄露秘密 submitted on 2021-01-01 08:10:36
Question: Hi, I'm doing a transformation. I have created a generator some_function(iter) that yields Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn transformed rows from a pandas DataFrame into an RDD and then into a Spark DataFrame, but I'm getting errors. (I must use pandas to transform the data, as there is a large amount of legacy code.) Input Spark DataFrame:

respond_sdf.show()
+-------------------------------------------------------------------+
|content                                                            |
+----------------------------------------------------------

Using tensorflow.keras model in pyspark UDF generates a pickle error

最后都变了- submitted on 2021-01-01 07:02:47
Question: I would like to use a tensorflow.keras model in a PySpark pandas_udf. However, I get a pickle error when the model is serialized before being sent to the workers. I am not sure I am using the best method for what I want, so I will show a minimal but complete example. Packages: tensorflow-2.2.0 (but the error is triggered by all previous versions too) and pyspark-2.4.5. The import statements are:

import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential

Apache Spark’s Structured Streaming with Google PubSub

为君一笑 submitted on 2021-01-01 04:14:32
Question: I'm using a Spark DStream to pull and process data from Google Pub/Sub. I'm looking for a way to move to Structured Streaming while still using Pub/Sub. I should also mention that my messages are Snappy-compressed in Pub/Sub. I found this issue, which claims that using Pub/Sub with Structured Streaming is not supported. Has anyone encountered this problem? Is it possible to implement a custom Receiver to read the data from Pub/Sub? Thanks.

Answer 1: The feature request you referenced is still accurate:

How to load tar.gz files in streaming datasets?

二次信任 submitted on 2021-01-01 03:51:56
Question: I would like to stream from tar-gzip (tgz) files that contain my actual CSV data. I already managed to do Structured Streaming with Spark 2.2 when my data comes in as plain CSV files, but in fact the data arrives as gzipped CSV files. Is there a way for the trigger run by Structured Streaming to decompress the files before handling the CSV stream? The code I use to process the files is this:

val schema = Encoders.product[RawData].schema
val trackerData = spark
  .readStream
  .option(
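
The excerpt cuts off inside the readStream options, so the block below is only a reconstruction of that baseline CSV stream under assumed options, paths, and RawData fields. One relevant detail: Spark's CSV source reads plain gzip files (*.csv.gz) transparently through the Hadoop codecs, but it does not unpack tar archives, so true .tgz files would need to be extracted (or repacked as .csv.gz) before they land in the watched directory.

```scala
// Sketch of the baseline streaming CSV reader the question starts from.
import org.apache.spark.sql.{Encoders, SparkSession}

case class RawData(id: String, timestamp: Long, value: Double) // fields assumed, not from the question

object TgzCsvStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-stream-sketch")
      .getOrCreate()

    val schema = Encoders.product[RawData].schema

    val trackerData = spark
      .readStream
      .option("header", "false")       // option assumed; the original snippet is truncated here
      .schema(schema)
      .csv("/data/incoming/*.csv.gz")  // assumed watch path; works for gzip, not for tar archives

    val query = trackerData.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```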