pyspark

Writing Spark Structured Streaming data into Cassandra

Submitted by £可爱£侵袭症+ on 2019-12-29 09:21:46
Question: I want to write Structured Streaming data into Cassandra using the PySpark API. My data flow looks like this: NiFi -> Kafka -> Spark Structured Streaming -> Cassandra. I have tried the following:

query = df.writeStream\
    .format("org.apache.spark.sql.cassandra")\
    .option("keyspace", "demo")\
    .option("table", "test")\
    .start()

but I get the error message: "org.apache.spark.sql.cassandra" does not support streaming write. I have also tried another approach [Source - DSE 6.0 Administrator Guide]: query =
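
A workaround that is not part of the excerpt above: since Spark 2.4, a foreachBatch sink can hand each micro-batch to the batch Cassandra writer, sidestepping the "does not support streaming write" error. A minimal sketch, assuming the spark-cassandra-connector package is on the classpath, that df is the streaming DataFrame read from Kafka, and that the keyspace/table names from the question apply:

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch arrives as a plain (non-streaming) DataFrame, so the
    # batch Cassandra source can be used even though it rejects streaming writes.
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "demo") \
        .option("table", "test") \
        .mode("append") \
        .save()

query = df.writeStream \
    .foreachBatch(write_to_cassandra) \
    .outputMode("append") \
    .start()
query.awaitTermination()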

Filter PySpark DataFrame by checking if string appears in column

Submitted by 此生再无相见时 on 2019-12-29 09:03:57
Question: I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a JSON file. Part of the schema is shown below:

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. So whether this author is the first author listed in authors or the nth, the row should be included if their name appears. So something along
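
One common way to do this, not taken from the excerpt: pyspark.sql.functions.array_contains for an exact element match, or explode for a substring match. A sketch assuming the column is named authors as in the schema above and that the author name below is a made-up placeholder:

from pyspark.sql import functions as F

target = "Jane Doe"  # hypothetical author name

# Keep rows whose authors array contains the exact name, wherever it appears.
filtered = df.filter(F.array_contains(F.col("authors"), target))

# If a substring match inside each element is needed instead, explode first
# (more expensive, and distinct() removes rows duplicated by the explode):
matched = (df.withColumn("author", F.explode("authors"))
             .filter(F.col("author").contains(target))
             .drop("author")
             .distinct())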

What hashing function does Spark use for HashingTF and how do I duplicate it?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-29 08:46:16
Question: Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms. 1) What function does it use to do the hashing? 2) How can I achieve the same hashed value from Python? 3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?

Answer 1: If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:

def indexOf(self, term):
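
To make the truncated answer concrete without reproducing it: in the old pyspark.mllib HashingTF the bucket is just Python's built-in hash() modulo numFeatures, so the index of a single term can be reproduced directly; the newer pyspark.ml HashingTF hashes with MurmurHash3 on the JVM side instead, so this sketch does not apply to it. A sketch of the mllib behaviour:

def term_index(term, num_features=1 << 20):
    # 1 << 20 is the default feature count of pyspark.mllib.feature.HashingTF.
    # Same arithmetic as its indexOf(): built-in hash() modulo the feature count.
    # On Python 3, set PYTHONHASHSEED so string hashes are stable across processes.
    return hash(term) % num_features

print(term_index("spark"))  # bucket for a single term, no term frequencies involved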

How to print the decision path / rules used to predict sample of a specific row in PySpark?

Submitted by 不打扰是莪最后的温柔 on 2019-12-29 07:47:13
Question: How do I print the decision path of a specific sample in a Spark DataFrame? Spark version: 2.3.1. The code below prints the decision path of the whole model; how can I make it print the decision path of a specific sample, for example the row where tagvalue ball equals 2?

import pyspark.sql.functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import DataFrame
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import
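
The excerpt is cut off, but one way to trace a single row is to walk the fitted tree through py4j. This is a sketch under several assumptions, none of them from the excerpt: model is a fitted DecisionTreeClassificationModel, features is that row's assembled feature vector (a pyspark.ml.linalg Vector produced by the same VectorAssembler used for training), the tree contains only continuous splits, and _call_java is an internal, non-public hook that may change between versions:

def print_decision_path(model, features):
    node = model._call_java("rootNode")  # org.apache.spark.ml.tree.Node on the JVM
    while node.getClass().getSimpleName() == "InternalNode":
        split = node.split()
        idx = split.featureIndex()
        thr = split.threshold()          # continuous splits only
        if features[idx] <= thr:
            print("feature %d (%s) <= %s -> left" % (idx, features[idx], thr))
            node = node.leftChild()
        else:
            print("feature %d (%s) >  %s -> right" % (idx, features[idx], thr))
            node = node.rightChild()
    print("leaf prediction:", node.prediction())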

spark reading data from mysql in parallel

Submitted by こ雲淡風輕ζ on 2019-12-29 03:33:11
Question: I'm trying to read data from MySQL and write it back to a Parquet file in S3 with specific partitions, as follows:

df = sqlContext.read.format('jdbc')\
    .options(driver='com.mysql.jdbc.Driver',
             url="""jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>""",
             dbtable='tbl',
             numPartitions=4)\
    .load()
df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])

My problem is that it opens only one connection to
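
Not part of the excerpt, but the usual explanation: numPartitions on its own does not split a JDBC read; Spark also needs partitionColumn, lowerBound and upperBound to know how to slice the query. A sketch in which the column name and bounds are made up:

df = (sqlContext.read.format('jdbc')
      .options(driver='com.mysql.jdbc.Driver',
               url='jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>',
               dbtable='tbl',
               partitionColumn='id',  # hypothetical numeric column to split on
               lowerBound='1',        # rough min of id
               upperBound='1000000',  # rough max of id (stride hints, not filters)
               numPartitions=4)
      .load())
# Spark now opens 4 connections, each reading one slice of the id range.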

pyspark: rolling average using timeseries data

Submitted by 烈酒焚心 on 2019-12-29 03:15:27
Question: I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week. Here's an example:

%pyspark
import datetime
from pyspark.sql import functions as F
df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                      (13, "2017-03-11T12:27:18+00:00"),
                      (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars",
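
A commonly suggested approach that the excerpt does not show: order a window by the timestamp cast to Unix seconds and use rangeBetween to cover the trailing seven days. A sketch in which ts_col stands in for whatever the truncated toDF(...) call names the timestamp column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df2 = df1.withColumn("ts_seconds", F.col("ts_col").cast("timestamp").cast("long"))

seven_days = 7 * 86400  # rangeBetween counts in the units of the ORDER BY expression (seconds here)
w = Window.orderBy("ts_seconds").rangeBetween(-seven_days, 0)

result = df2.withColumn("weekly_avg_dollars", F.avg("dollars").over(w))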

Locally reading S3 files through Spark (or better: pyspark)

Submitted by Deadly on 2019-12-28 12:14:28
Question: I want to read an S3 file from my (local) machine, through Spark (PySpark, really). Now, I keep getting authentication errors like:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or
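
One setup that commonly works for local runs, offered as a sketch rather than the asker's eventual answer: pull in a hadoop-aws build that matches the local Hadoop version, switch from s3n to the newer s3a filesystem, and put the credentials in the Hadoop configuration (the key values and path below are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local-s3-read")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")  # must match local Hadoop
         .getOrCreate())

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")      # placeholder
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")  # placeholder

df = spark.read.csv("s3a://my-bucket/path/to/file.csv", header=True)  # hypothetical path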

Enable case sensitivity for spark.sql globally

Submitted by ≡放荡痞女 on 2019-12-28 06:59:20
Question: The option spark.sql.caseSensitive controls whether column names etc. should be case sensitive. It can be set, e.g., by spark_session.sql('set spark.sql.caseSensitive=true') and is false by default. It does not seem possible to enable it globally in $SPARK_HOME/conf/spark-defaults.conf with spark.sql.caseSensitive: True, though. Is that intended, or is there some other file for setting SQL options? The source also states that enabling this at all is highly discouraged. What
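
Not from the excerpt, just a hedged note on how this option is usually set outside a per-query SET statement (whether spark-defaults.conf is honoured for this key may depend on the Spark version; the conventional file format is a whitespace-separated key and value):

# $SPARK_HOME/conf/spark-defaults.conf (conventional form):
# spark.sql.caseSensitive true

# Per application, at session construction time:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.sql.caseSensitive", "true")
         .getOrCreate())

# Or at runtime on an existing session:
spark.conf.set("spark.sql.caseSensitive", "true")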

PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

Submitted by 痴心易碎 on 2019-12-28 06:51:46
Question: I have a concept I hope you can help to clarify: what is the difference between the following three ways of referring to a column in a PySpark DataFrame? I know different situations need different forms, but I'm not sure why.

df.col : e.g. F.count(df.col)
df['col'] : e.g. df['col'] == 0
F.col('col') : e.g. df.filter(F.col('col').isNull())

Thanks a lot!

Answer 1: In most practical applications, there is almost no difference. However, they are implemented by calls to different underlying functions (source)
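
The answer is cut off above; as a small illustration of where the three forms diverge in practice (a sketch, not the rest of that answer, and the column names here are made up):

from pyspark.sql import functions as F

# All three select the same column of df:
df.select(df.age)        # attribute access; breaks if the name clashes with a DataFrame attribute or contains spaces
df.select(df['age'])     # item access; works for any column name, still bound to this particular df
df.select(F.col('age'))  # unbound reference; resolved at analysis time, no DataFrame object needed

# F.col is convenient when no DataFrame variable is in scope, e.g. mid-chain:
(df.groupBy('dept')
   .agg(F.avg('salary').alias('avg_salary'))
   .filter(F.col('avg_salary') > 50000))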