pyspark

Writing Spark Structured Streaming data into Cassandra

Submitted by £可爱£侵袭症+ on 2019-12-29 09:21:46
Question: I want to write Structured Streaming data into Cassandra using the PySpark API. My data flow looks like this: NiFi -> Kafka -> Spark Structured Streaming -> Cassandra. I have tried the following:

query = df.writeStream\
    .format("org.apache.spark.sql.cassandra")\
    .option("keyspace", "demo")\
    .option("table", "test")\
    .start()

but I get the error message: "org.apache.spark.sql.cassandra" does not support streaming write. I have also tried another approach [Source - DSE 6.0 Administrator Guide]: query =
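
A workaround that is not part of the excerpt above: since Spark 2.4, a foreachBatch sink can hand each micro-batch to the batch Cassandra writer, sidestepping the "does not support streaming write" error. A minimal sketch, assuming the spark-cassandra-connector package is on the classpath, that df is the streaming DataFrame read from Kafka, and that the keyspace/table names from the question apply:

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch arrives as a plain (non-streaming) DataFrame, so the
    # batch Cassandra source can be used even though it rejects streaming writes.
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "demo") \
        .option("table", "test") \
        .mode("append") \
        .save()

query = df.writeStream \
    .foreachBatch(write_to_cassandra) \
    .outputMode("append") \
    .start()
query.awaitTermination()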

Filter PySpark DataFrame by checking if string appears in column

Submitted by 此生再无相见时 on 2019-12-29 09:03:57
Question: I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a JSON file. Part of the schema is shown below:

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. So whether this author is the first author listed in authors or the nth, the row should be included if their name appears. So something along
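
One common way to do this, not taken from the excerpt: pyspark.sql.functions.array_contains for an exact element match, or explode for a substring match. A sketch assuming the column is named authors as in the schema above and that the author name below is a made-up placeholder:

from pyspark.sql import functions as F

target = "Jane Doe"  # hypothetical author name

# Keep rows whose authors array contains the exact name, wherever it appears.
filtered = df.filter(F.array_contains(F.col("authors"), target))

# If a substring match inside each element is needed instead, explode first
# (more expensive, and distinct() removes rows duplicated by the explode):
matched = (df.withColumn("author", F.explode("authors"))
             .filter(F.col("author").contains(target))
             .drop("author")
             .distinct())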

What hashing function does Spark use for HashingTF and how do I duplicate it?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-29 08:46:16
Question: Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms. 1) What function does it use to do the hashing? 2) How can I achieve the same hashed value from Python? 3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?

Answer 1: If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:

def indexOf(self, term):
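
To make the truncated answer concrete without reproducing it: in the old pyspark.mllib HashingTF the bucket is just Python's built-in hash() modulo numFeatures, so the index of a single term can be reproduced directly; the newer pyspark.ml HashingTF hashes with MurmurHash3 on the JVM side instead, so this sketch does not apply to it. A sketch of the mllib behaviour:

def term_index(term, num_features=1 << 20):
    # 1 << 20 is the default feature count of pyspark.mllib.feature.HashingTF.
    # Same arithmetic as its indexOf(): built-in hash() modulo the feature count.
    # On Python 3, set PYTHONHASHSEED so string hashes are stable across processes.
    return hash(term) % num_features

print(term_index("spark"))  # bucket for a single term, no term frequencies involved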

How to print the decision path / rules used to predict sample of a specific row in PySpark?

Submitted by 不打扰是莪最后的温柔 on 2019-12-29 07:47:13
Question: How do I print the decision path of a specific sample in a Spark DataFrame? Spark version: 2.3.1. The code below prints the decision path of the whole model; how can I make it print the decision path of a specific sample, for example the row where tagvalue ball equals 2?

import pyspark.sql.functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import DataFrame
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import
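
The excerpt is cut off, but one way to trace a single row is to walk the fitted tree through py4j. This is a sketch under several assumptions, none of them from the excerpt: model is a fitted DecisionTreeClassificationModel, features is that row's assembled feature vector (a pyspark.ml.linalg Vector produced by the same VectorAssembler used for training), the tree contains only continuous splits, and _call_java is an internal, non-public hook that may change between versions:

def print_decision_path(model, features):
    node = model._call_java("rootNode")  # org.apache.spark.ml.tree.Node on the JVM
    while node.getClass().getSimpleName() == "InternalNode":
        split = node.split()
        idx = split.featureIndex()
        thr = split.threshold()          # continuous splits only
        if features[idx] <= thr:
            print("feature %d (%s) <= %s -> left" % (idx, features[idx], thr))
            node = node.leftChild()
        else:
            print("feature %d (%s) >  %s -> right" % (idx, features[idx], thr))
            node = node.rightChild()
    print("leaf prediction:", node.prediction())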

spark reading data from mysql in parallel

Submitted by こ雲淡風輕ζ on 2019-12-29 03:33:11
Question: I'm trying to read data from MySQL and write it back to a Parquet file in S3 with specific partitions, as follows:

df = sqlContext.read.format('jdbc')\
    .options(driver='com.mysql.jdbc.Driver',
             url="""jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>""",
             dbtable='tbl',
             numPartitions=4)\
    .load()
df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])

My problem is that it opens only one connection to
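
Not part of the excerpt, but the usual explanation: numPartitions on its own does not split a JDBC read; Spark also needs partitionColumn, lowerBound and upperBound to know how to slice the query. A sketch in which the column name and bounds are made up:

df = (sqlContext.read.format('jdbc')
      .options(driver='com.mysql.jdbc.Driver',
               url='jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>',
               dbtable='tbl',
               partitionColumn='id',  # hypothetical numeric column to split on
               lowerBound='1',        # rough min of id
               upperBound='1000000',  # rough max of id (stride hints, not filters)
               numPartitions=4)
      .load())
# Spark now opens 4 connections, each reading one slice of the id range.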

pyspark: rolling average using timeseries data

Submitted by 烈酒焚心 on 2019-12-29 03:15:27
Question: I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week. Here's an example:

%pyspark
import datetime
from pyspark.sql import functions as F
df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                      (13, "2017-03-11T12:27:18+00:00"),
                      (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars",
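
A commonly suggested approach that the excerpt does not show: order a window by the timestamp cast to Unix seconds and use rangeBetween to cover the trailing seven days. A sketch in which ts_col stands in for whatever the truncated toDF(...) call names the timestamp column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df2 = df1.withColumn("ts_seconds", F.col("ts_col").cast("timestamp").cast("long"))

seven_days = 7 * 86400  # rangeBetween counts in the units of the ORDER BY expression (seconds here)
w = Window.orderBy("ts_seconds").rangeBetween(-seven_days, 0)

result = df2.withColumn("weekly_avg_dollars", F.avg("dollars").over(w))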

Locally reading S3 files through Spark (or better: pyspark)

Submitted by Deadly on 2019-12-28 12:14:28
Question: I want to read an S3 file from my (local) machine, through Spark (PySpark, really). Now, I keep getting authentication errors like:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or
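
One setup that commonly works for local runs, offered as a sketch rather than the asker's eventual answer: pull in a hadoop-aws build that matches the local Hadoop version, switch from s3n to the newer s3a filesystem, and put the credentials in the Hadoop configuration (the key values and path below are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local-s3-read")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")  # must match local Hadoop
         .getOrCreate())

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")      # placeholder
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")  # placeholder

df = spark.read.csv("s3a://my-bucket/path/to/file.csv", header=True)  # hypothetical path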

Enable case sensitivity for spark.sql globally

Submitted by ≡放荡痞女 on 2019-12-28 06:59:20
Question: The option spark.sql.caseSensitive controls whether column names etc. should be case sensitive. It can be set, e.g., by spark_session.sql('set spark.sql.caseSensitive=true') and is false by default. It does not seem possible to enable it globally in $SPARK_HOME/conf/spark-defaults.conf with spark.sql.caseSensitive: True, though. Is that intended, or is there some other file for setting SQL options? The source also states that enabling this at all is highly discouraged. What
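
Not from the excerpt, just a hedged note on how this option is usually set outside a per-query SET statement (whether spark-defaults.conf is honoured for this key may depend on the Spark version; the conventional file format is a whitespace-separated key and value):

# $SPARK_HOME/conf/spark-defaults.conf (conventional form):
# spark.sql.caseSensitive true

# Per application, at session construction time:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.sql.caseSensitive", "true")
         .getOrCreate())

# Or at runtime on an existing session:
spark.conf.set("spark.sql.caseSensitive", "true")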

PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

Submitted by 痴心易碎 on 2019-12-28 06:51:46
Question: I have a concept I hope you can help to clarify: what is the difference between the following three ways of referring to a column in a PySpark DataFrame? I know different situations need different forms, but I'm not sure why.

df.col : e.g. F.count(df.col)
df['col'] : e.g. df['col'] == 0
F.col('col') : e.g. df.filter(F.col('col').isNull())

Thanks a lot!

Answer 1: In most practical applications, there is almost no difference. However, they are implemented by calls to different underlying functions (source)
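
The answer is cut off above; as a small illustration of where the three forms diverge in practice (a sketch, not the rest of that answer, and the column names here are made up):

from pyspark.sql import functions as F

# All three select the same column of df:
df.select(df.age)        # attribute access; breaks if the name clashes with a DataFrame attribute or contains spaces
df.select(df['age'])     # item access; works for any column name, still bound to this particular df
df.select(F.col('age'))  # unbound reference; resolved at analysis time, no DataFrame object needed

# F.col is convenient when no DataFrame variable is in scope, e.g. mid-chain:
(df.groupBy('dept')
   .agg(F.avg('salary').alias('avg_salary'))
   .filter(F.col('avg_salary') > 50000))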