pyspark

Spark Streaming not reading from Kafka topics

孤街醉人 submitted on 2021-02-20 02:49:25
Question: I have set up Kafka and Spark on Ubuntu. I am trying to read Kafka topics through Spark Streaming using pyspark (Jupyter notebook). Spark is neither reading the data nor throwing any error. The Kafka producer and consumer communicate with each other fine in the terminal. Kafka is configured with 3 partitions on ports 9092, 9093 and 9094, and messages are being stored in the Kafka topics. Now I want to read them through Spark Streaming, but I am not sure what I am missing. I have also searched the internet, but
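A frequent culprit is launching pyspark without the Kafka connector package, or pointing at the wrong bootstrap servers. A minimal sketch, assuming the topic is called my-topic and the brokers from the question listen on ports 9092-9094 (both names are placeholders; match the package coordinates to your Spark and Scala versions):

# Start the notebook/shell with the Kafka source on the classpath, e.g.
#   pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-read-sketch").getOrCreate()

# A one-off batch read of everything in the topic is easier to debug than a stream
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094")
      .option("subscribe", "my-topic")            # hypothetical topic name
      .option("startingOffsets", "earliest")
      .load())

# Key and value arrive as binary, so cast them to strings to inspect the payload
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)

If this batch read returns rows, switching spark.read to spark.readStream and adding a writeStream sink gives the streaming version.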

Pyspark Schema for Json file

你。 submitted on 2021-02-19 08:14:06
Question: I am trying to read a complex JSON file into a Spark dataframe. Spark recognizes the schema but mistakes one field for a string; that field happens to be an empty array. (I am not sure why it is inferred as String type when it has to be an array type.) Below is a sample of what I am expecting: arrayfield:[{"name":"somename"},{"address":"someadress"}] Right now the data is as below: arrayfield:[] What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can input a schema while reading
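A sketch of supplying that schema explicitly, using the field names from the sample above (the file path and the nullability flags are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Declare arrayfield as an array of structs so an empty [] is not inferred as a string
schema = StructType([
    StructField("arrayfield", ArrayType(StructType([
        StructField("name", StringType(), True),
        StructField("address", StringType(), True),
    ])), True),
])

df = spark.read.schema(schema).json("path/to/file.json")  # placeholder path
df.select("arrayfield.name").show()  # no longer fails when the array is empty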

java.lang.OutOfMemoryError in rdd.collect() when all memory setting are set to huge

被刻印的时光 ゝ submitted on 2021-02-19 08:09:05
Question: I run the following Python script with spark-submit:

r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)
r_labeled = r.map(f_0).flatMap(f_1)
r_labeled.map(lambda x: x[3]).collect()

It fails with java.lang.OutOfMemoryError, specifically on the collect() action of the last line:

java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at java.io
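collect() materialises the entire result in the driver JVM, so raising memory settings only delays the failure on large data. A hedged sketch of alternatives, reusing r_labeled and the f_0/f_1 functions from the question (the output path is a placeholder):

# Option 1: look at a sample on the driver instead of the full dataset
preview = r_labeled.map(lambda x: x[3]).take(100)

# Option 2: pull results back one partition at a time instead of one giant buffer
for value in r_labeled.map(lambda x: x[3]).toLocalIterator():
    pass  # process each value incrementally

# Option 3: keep the data distributed and write it out rather than collecting it
r_labeled.map(lambda x: x[3]).saveAsTextFile("hdfs:///tmp/labeled_col3")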

Check for empty row within spark dataframe?

梦想与她 submitted on 2021-02-19 07:55:06
Question: I am running over several CSV files, trying to do some checks, and for one file I am getting a NullPointerException; I suspect there are some empty rows. So I am running the following, and for some reason it gives me an OK output:

check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

I am missing
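This check can also be done without a UDF; a sketch using built-in column functions, assuming the df and the sf alias for pyspark.sql.functions from the snippet above:

from functools import reduce
import pyspark.sql.functions as sf

# Rows in which every column is NULL (completely empty rows)
all_null = reduce(lambda a, b: a & b, [sf.col(c).isNull() for c in df.columns])
df.filter(all_null).show()

# Rows with at least one NULL column, often the actual cause of a NullPointerException
any_null = reduce(lambda a, b: a | b, [sf.col(c).isNull() for c in df.columns])
df.filter(any_null).show()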

Pyspark: weighted average by a column

孤者浪人 submitted on 2021-02-19 07:39:47
Question: For example, I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
    .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

and I can obtain the customer-region order count matrix by

overall_stat = test.groupBy("customerid").agg(count("orderid"))\
    .withColumnRenamed("count
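For the weighted average in the title, the usual pattern is sum(value * weight) / sum(weight) per group. A sketch over the test dataframe above; the weight column is a placeholder (the sample data has no explicit weight), so swap in whichever column should carry the weights:

import pyspark.sql.functions as sf

weighted = (test
    .withColumn("weight", sf.lit(1.0))  # placeholder: every row weighted equally
    .groupBy("location")
    .agg((sf.sum(sf.col("price") * sf.col("weight")) / sf.sum("weight"))
         .alias("weighted_avg_price")))
weighted.show()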

Pyspark Dataframe get unique elements from column with string as list of elements

我的未来我决定 submitted on 2021-02-19 07:34:05
Question: I have a dataframe (created by loading from multiple blobs in Azure) in which one column is a list of IDs. Now I want a list of the unique IDs from this entire column. Here is an example:

df -
| col1 | col2 | col3  |
| "a"  | "b"  |"[q,r]"|
| "c"  | "f"  |"[s,r]"|

Here is my expected response: resp = [q, r, s]. Any idea how to get there? My current approach is to convert the strings in col3 to Python lists and then maybe flatten them out somehow, but so far I have not been able to do so. I
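A sketch of one way to do it, assuming col3 really holds strings of the form "[q,r]" as in the example: strip the brackets, split on commas, explode to one ID per row, and collect the distinct values.

import pyspark.sql.functions as sf

ids = (df
    .withColumn("id", sf.explode(
        sf.split(sf.regexp_replace("col3", r"[\[\]]", ""), ",")))
    .select(sf.trim(sf.col("id")).alias("id"))
    .distinct())

resp = [row["id"] for row in ids.collect()]  # e.g. ['q', 'r', 's']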

How do I prevent pyspark from interpreting commas as a delimiter in a csv field having JSON object as its value

↘锁芯ラ submitted on 2021-02-19 05:31:36
Question: I am trying to read a comma-delimited CSV file using pyspark version 2.4.5 and Databricks' spark-csv module. One of the fields in the CSV file has a JSON object as its value. The contents of the CSV are as below:

test.csv
header_col_1, header_col_2, header_col_3
one, two, three
one, {"key1":"value1","key2":"value2","key3":"value3","key4":"value4"}, three

Other solutions I found defined read options such as "escape": '"' and 'delimiter': ",". This does not seem to be working, as the commas in
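The quote and escape options only help if the JSON cell is itself wrapped in double quotes in the file (with embedded quotes escaped or doubled); if it is not, the producer of the CSV has to quote that field, since the reader cannot tell JSON commas from delimiters. A sketch under that assumption, using standard DataFrameReader CSV options:

df = (spark.read
      .option("header", True)
      .option("delimiter", ",")
      .option("quote", '"')   # the JSON cell must be wrapped in double quotes in the file
      .option("escape", '"')  # embedded "" pairs inside the cell are then unescaped
      .csv("test.csv"))

df.show(truncate=False)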

PySpark takeOrdered Multiple Fields (Ascending and Descending)

余生长醉 submitted on 2021-02-19 05:20:26
Question: The takeOrdered method of pyspark.RDD gets the N elements from an RDD ordered in ascending order, or as specified by the optional key function, as described in pyspark.RDD.takeOrdered. The documentation shows the following example with one key:

>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]

Is it also possible to define more keys, e.g. x, y, z, for data that has 3 columns? The keys should use different orders, such as x = asc, y = desc, z = asc. That
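For numeric columns, a composite key tuple that negates the descending field does the job; a sketch with made-up 3-column tuples (x, y, z), noting that the negation trick only works for numeric values:

rdd = sc.parallelize([
    (1, 5, 3),
    (1, 9, 2),
    (2, 1, 7),
    (1, 9, 1),
])

# x ascending, y descending, z ascending
result = rdd.takeOrdered(4, key=lambda t: (t[0], -t[1], t[2]))
# [(1, 9, 1), (1, 9, 2), (1, 5, 3), (2, 1, 7)]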
