pyspark

Spark Streaming not reading from Kafka topics

孤街醉人 submitted on 2021-02-20 02:49:25
Question: I have set up Kafka and Spark on Ubuntu. I am trying to read Kafka topics through Spark Streaming using pyspark (Jupyter notebook). Spark is neither reading the data nor throwing any error. The Kafka producer and consumer communicate with each other fine in the terminal. Kafka is configured with 3 partitions on ports 9092, 9093 and 9094, and messages are being stored in the Kafka topics. Now I want to read them through Spark Streaming, but I am not sure what I am missing. I have also searched the internet, but
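A frequent culprit is launching pyspark without the Kafka connector package, or pointing at the wrong bootstrap servers. A minimal sketch, assuming the topic is called my-topic and the brokers from the question listen on ports 9092-9094 (both names are placeholders; match the package coordinates to your Spark and Scala versions):

# Start the notebook/shell with the Kafka source on the classpath, e.g.
#   pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-read-sketch").getOrCreate()

# A one-off batch read of everything in the topic is easier to debug than a stream
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094")
      .option("subscribe", "my-topic")            # hypothetical topic name
      .option("startingOffsets", "earliest")
      .load())

# Key and value arrive as binary, so cast them to strings to inspect the payload
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)

If this batch read returns rows, switching spark.read to spark.readStream and adding a writeStream sink gives the streaming version.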

Pyspark Schema for Json file

你。 submitted on 2021-02-19 08:14:06
Question: I am trying to read a complex JSON file into a Spark dataframe. Spark recognizes the schema but mistakes one field for a string; that field happens to be an empty array. (I am not sure why it is inferred as String type when it has to be an array type.) Below is a sample of what I am expecting: arrayfield:[{"name":"somename"},{"address":"someadress"}] Right now the data is as below: arrayfield:[] What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can input a schema while reading
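A sketch of supplying that schema explicitly, using the field names from the sample above (the file path and the nullability flags are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Declare arrayfield as an array of structs so an empty [] is not inferred as a string
schema = StructType([
    StructField("arrayfield", ArrayType(StructType([
        StructField("name", StringType(), True),
        StructField("address", StringType(), True),
    ])), True),
])

df = spark.read.schema(schema).json("path/to/file.json")  # placeholder path
df.select("arrayfield.name").show()  # no longer fails when the array is empty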

java.lang.OutOfMemoryError in rdd.collect() when all memory setting are set to huge

被刻印的时光 ゝ submitted on 2021-02-19 08:09:05
Question: I run the following Python script with spark-submit:

r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)
r_labeled = r.map(f_0).flatMap(f_1)
r_labeled.map(lambda x: x[3]).collect()

It fails with java.lang.OutOfMemoryError, specifically on the collect() action of the last line:

java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at java.io
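collect() materialises the entire result in the driver JVM, so raising memory settings only delays the failure on large data. A hedged sketch of alternatives, reusing r_labeled and the f_0/f_1 functions from the question (the output path is a placeholder):

# Option 1: look at a sample on the driver instead of the full dataset
preview = r_labeled.map(lambda x: x[3]).take(100)

# Option 2: pull results back one partition at a time instead of one giant buffer
for value in r_labeled.map(lambda x: x[3]).toLocalIterator():
    pass  # process each value incrementally

# Option 3: keep the data distributed and write it out rather than collecting it
r_labeled.map(lambda x: x[3]).saveAsTextFile("hdfs:///tmp/labeled_col3")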

Check for empty row within spark dataframe?

梦想与她 submitted on 2021-02-19 07:55:06
Question: I am running over several CSV files, trying to do some checks, and for one file I am getting a NullPointerException; I suspect there are some empty rows. So I am running the following, and for some reason it gives me an OK output:

check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

I am missing
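This check can also be done without a UDF; a sketch using built-in column functions, assuming the df and the sf alias for pyspark.sql.functions from the snippet above:

from functools import reduce
import pyspark.sql.functions as sf

# Rows in which every column is NULL (completely empty rows)
all_null = reduce(lambda a, b: a & b, [sf.col(c).isNull() for c in df.columns])
df.filter(all_null).show()

# Rows with at least one NULL column, often the actual cause of a NullPointerException
any_null = reduce(lambda a, b: a | b, [sf.col(c).isNull() for c in df.columns])
df.filter(any_null).show()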

Pyspark: weighted average by a column

孤者浪人 submitted on 2021-02-19 07:39:47
Question: For example, I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
    .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

and I can obtain the customer-region order count matrix by

overall_stat = test.groupBy("customerid").agg(count("orderid"))\
    .withColumnRenamed("count
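For the weighted average in the title, the usual pattern is sum(value * weight) / sum(weight) per group. A sketch over the test dataframe above; the weight column is a placeholder (the sample data has no explicit weight), so swap in whichever column should carry the weights:

import pyspark.sql.functions as sf

weighted = (test
    .withColumn("weight", sf.lit(1.0))  # placeholder: every row weighted equally
    .groupBy("location")
    .agg((sf.sum(sf.col("price") * sf.col("weight")) / sf.sum("weight"))
         .alias("weighted_avg_price")))
weighted.show()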

Pyspark Dataframe get unique elements from column with string as list of elements

我的未来我决定 submitted on 2021-02-19 07:34:05
Question: I have a dataframe (created by loading from multiple blobs in Azure) in which one column is a list of IDs. Now I want a list of the unique IDs from this entire column. Here is an example:

df -
| col1 | col2 | col3  |
| "a"  | "b"  |"[q,r]"|
| "c"  | "f"  |"[s,r]"|

Here is my expected response: resp = [q, r, s]. Any idea how to get there? My current approach is to convert the strings in col3 to Python lists and then maybe flatten them out somehow, but so far I have not been able to do so. I
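A sketch of one way to do it, assuming col3 really holds strings of the form "[q,r]" as in the example: strip the brackets, split on commas, explode to one ID per row, and collect the distinct values.

import pyspark.sql.functions as sf

ids = (df
    .withColumn("id", sf.explode(
        sf.split(sf.regexp_replace("col3", r"[\[\]]", ""), ",")))
    .select(sf.trim(sf.col("id")).alias("id"))
    .distinct())

resp = [row["id"] for row in ids.collect()]  # e.g. ['q', 'r', 's']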

How do I prevent pyspark from interpreting commas as a delimiter in a csv field having JSON object as its value

↘锁芯ラ submitted on 2021-02-19 05:31:36
Question: I am trying to read a comma-delimited CSV file using pyspark version 2.4.5 and Databricks' spark-csv module. One of the fields in the CSV file has a JSON object as its value. The contents of the CSV are as below:

test.csv
header_col_1, header_col_2, header_col_3
one, two, three
one, {"key1":"value1","key2":"value2","key3":"value3","key4":"value4"}, three

Other solutions I found defined read options such as "escape": '"' and 'delimiter': ",". This does not seem to be working, as the commas in
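The quote and escape options only help if the JSON cell is itself wrapped in double quotes in the file (with embedded quotes escaped or doubled); if it is not, the producer of the CSV has to quote that field, since the reader cannot tell JSON commas from delimiters. A sketch under that assumption, using standard DataFrameReader CSV options:

df = (spark.read
      .option("header", True)
      .option("delimiter", ",")
      .option("quote", '"')   # the JSON cell must be wrapped in double quotes in the file
      .option("escape", '"')  # embedded "" pairs inside the cell are then unescaped
      .csv("test.csv"))

df.show(truncate=False)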

PySpark takeOrdered Multiple Fields (Ascending and Descending)

余生长醉 submitted on 2021-02-19 05:20:26
Question: The takeOrdered method of pyspark.RDD gets the N elements from an RDD ordered in ascending order, or as specified by the optional key function, as described in pyspark.RDD.takeOrdered. The documentation shows the following example with one key:

>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]

Is it also possible to define more keys, e.g. x, y, z, for data that has 3 columns? The keys should use different orders, such as x = asc, y = desc, z = asc. That
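For numeric columns, a composite key tuple that negates the descending field does the job; a sketch with made-up 3-column tuples (x, y, z), noting that the negation trick only works for numeric values:

rdd = sc.parallelize([
    (1, 5, 3),
    (1, 9, 2),
    (2, 1, 7),
    (1, 9, 1),
])

# x ascending, y descending, z ascending
result = rdd.takeOrdered(4, key=lambda t: (t[0], -t[1], t[2]))
# [(1, 9, 1), (1, 9, 2), (1, 5, 3), (2, 1, 7)]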
