pyspark

Find number of rows in a given week in PySpark

孤者浪人 submitted on 2020-01-16 05:36:08

Question: I have a PySpark dataframe, a small portion of which is given below:

+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5
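A minimal sketch of one way to count rows per week (an assumption, not the answer recorded in this thread; the sample rows are re-typed from the question, and "week" is taken as the ISO week number): extract the year and week number from the timestamp and group on them.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("rows-per-week").getOrCreate()

# Sample rows re-typed from the question; the timestamp strings are parsed into TimestampType.
df = spark.createDataFrame(
    [("name1", "type1", "2012-01-10 00:00:00", 11),
     ("name1", "type1", "2012-01-10 00:00:10", 14),
     ("name5", "type1", "2012-01-17 00:01:00", 5)],
    ["name", "type", "timestamp", "score"],
).withColumn("timestamp", F.to_timestamp("timestamp"))

# One output row per (year, week) with the number of records falling in that week.
weekly_counts = (df.groupBy(F.year("timestamp").alias("year"),
                            F.weekofyear("timestamp").alias("week"))
                   .count())
weekly_counts.show()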

Spark map is only one task while it should be parallel (PySpark)

时光总嘲笑我的痴心妄想 submitted on 2020-01-16 03:56:09

Question: I have an RDD with around 7M entries, each with 10 normalized coordinates. I also have a number of centers, and I'm trying to map every entry to the closest (Euclidean distance) center. The problem is that this only generates one task, which means it is not parallelizing. This is the form:

def doSomething(point, centers):
    for center in centers.value:
        if distance(point, center) < 1:
            return center
    return None

preppedData.map(lambda x: doSomething(x, centers)).take(5)

The preppedData RDD is cached
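A sketch of the usual explanation (an assumption, not the answer recorded in this thread): take(5) only computes as many partitions as it needs to produce five elements, so it often runs as a single task even when the RDD has many partitions, whereas a full action such as count() launches one task per partition. The toy data, distance function, and names below are hypothetical.

import math
from pyspark import SparkContext

sc = SparkContext(appName="parallel-map-sketch")

# Hypothetical toy data: 1000 points with 10 coordinates each, spread over 8 partitions.
points = sc.parallelize([[float(i % 10)] * 10 for i in range(1000)], numSlices=8)
centers = sc.broadcast([[0.0] * 10, [5.0] * 10])

def distance(p, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

def closest_center(point, centers_bc):
    for center in centers_bc.value:
        if distance(point, center) < 1:
            return center
    return None

print(points.getNumPartitions())                      # 8 -> up to 8 tasks can run at once
mapped = points.map(lambda x: closest_center(x, centers))
print(mapped.count())                                 # a full action touches every partition
print(mapped.take(5))                                 # take() may evaluate only one partition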

Distributed for loop in pyspark dataframe

大城市里の小女人 submitted on 2020-01-15 12:17:28

Question: Context: My company is on Spark 2.2, so it's not possible to use pandas_udf for distributed column processing. I have dataframes that contain thousands of columns (features) and millions of records.

df = spark.createDataFrame(
    [(1, "AB", 100, 200, 1),
     (2, "AC", 150, 200, 2),
     (3, "AD", 80, 150, 0)],
    ["Id", "Region", "Salary", "HouseHoldIncome", "NumChild"])

I would like to compute certain summaries and statistics on each column in a parallel manner and wonder what is the best way to achieve this.

#The point is
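A sketch of one common pattern for this (an assumption, not the answer recorded in this thread; the column names come from the question's example): build every aggregate expression up front and pass them all to a single agg() call, so the statistics for all columns are computed in one distributed job instead of a Python loop of separate jobs.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("column-stats-sketch").getOrCreate()

# Example frame re-typed from the question.
df = spark.createDataFrame(
    [(1, "AB", 100, 200, 1), (2, "AC", 150, 200, 2), (3, "AD", 80, 150, 0)],
    ["Id", "Region", "Salary", "HouseHoldIncome", "NumChild"])

numeric_cols = ["Salary", "HouseHoldIncome", "NumChild"]

# Build all aggregate expressions first, then run them in a single distributed job.
exprs = []
for c in numeric_cols:
    exprs += [F.mean(c).alias(c + "_mean"),
              F.stddev(c).alias(c + "_stddev"),
              F.min(c).alias(c + "_min"),
              F.max(c).alias(c + "_max")]

df.agg(*exprs).show()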

Kafka Stream to Spark Stream python

青春壹個敷衍的年華 submitted on 2020-01-15 12:15:08

Question: We have a Kafka stream which uses Avro. I need to connect it to a Spark stream. I use the code below, as Lev G suggested.

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}, valueDecoder=MessageSerializer.decode_message)

I got the error below when I executed it through spark-submit.

2018-10-09 10:49:27 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 12 for reason Container marked as failed: container_1537396420651_0008_01_000013 on
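For context, valueDecoder is just a function from the raw message bytes to a Python object. A sketch of a hand-written decoder (assumptions: a fastavro dependency, a hypothetical schema file, plain schemaless Avro payloads with no Confluent schema-registry framing, and placeholder topic/broker names; this is not the thread's resolution):

import io

import fastavro
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-avro-dstream-sketch")
ssc = StreamingContext(sc, 10)

# Hypothetical writer schema file shipped with the job.
schema = fastavro.schema.load_schema("value_schema.avsc")

def avro_decoder(raw_bytes):
    # valueDecoder receives the raw message bytes (or None) and may return any Python object.
    if raw_bytes is None:
        return None
    return fastavro.schemaless_reader(io.BytesIO(raw_bytes), schema)

kvs = KafkaUtils.createDirectStream(
    ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"},
    valueDecoder=avro_decoder)

kvs.map(lambda kv: kv[1]).pprint()
ssc.start()
ssc.awaitTermination()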

pyspark change day in datetime column

穿精又带淫゛_ submitted on 2020-01-15 10:53:39

Question: What is wrong with this code, which tries to change the day of a datetime column?

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
import datetime

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
rdd = sc.parallelize([('a', datetime.datetime(2014, 1, 9, 0, 0)),
                      ('b', datetime.datetime(2014, 1, 27, 0, 0)),
                      ('c', datetime.datetime(2014, 1, 31, 0, 0))])
testdf = sqlcontext.createDataFrame(rdd, ["id", "date"])
print(testdf.show())
print
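A sketch of two common ways to change the day component (an assumption about the goal, since the excerpt is cut off before the failing expression): trunc() for setting the date to the first of its month, or a small UDF that calls datetime.replace(day=...).

import datetime

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes

sc = pyspark.SparkContext(appName="change-day-sketch")
sqlcontext = pyspark.SQLContext(sc)

testdf = sqlcontext.createDataFrame(
    [('a', datetime.datetime(2014, 1, 9, 0, 0)),
     ('b', datetime.datetime(2014, 1, 27, 0, 0)),
     ('c', datetime.datetime(2014, 1, 31, 0, 0))],
    ["id", "date"])

# Option 1: truncate to the first day of the month (returns a DateType column).
testdf.withColumn("first_of_month", sf.trunc("date", "month")).show()

# Option 2: a UDF that replaces the day while keeping the rest of the timestamp.
set_day = sf.udf(lambda d: d.replace(day=1) if d else None, sparktypes.TimestampType())
testdf.withColumn("date_day1", set_day("date")).show()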

Use Spark fileoutputcommitter.algorithm.version=2 with AWS Glue

 ̄綄美尐妖づ submitted on 2020-01-15 10:37:10

Question: I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Is it possible to use this configuration with AWS Glue?

Answer 1: Option 1: Glue uses a Spark context, so you can set Hadoop configuration on AWS Glue as well, since internally a dynamic frame is a kind of dataframe.

sc._jsc.hadoopConfiguration().set("mykey", "myvalue")

I think you need to add the corresponding class as well, like this:

sc._jsc
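A sketch of how that idea fits into a Glue job (assumptions: a standard Glue PySpark job skeleton; the configuration key name is the usual Hadoop one and is not confirmed by the thread excerpt):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Reach the Hadoop configuration through the Glue job's SparkContext and set the
# FileOutputCommitter algorithm version there.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.algorithm.version", "2")

# Verify that the setting took effect.
print(sc._jsc.hadoopConfiguration()
        .get("mapreduce.fileoutputcommitter.algorithm.version"))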

Remove duplicate rows, regardless of new information -PySpark

纵然是瞬间 submitted on 2020-01-15 10:15:39

Question: Say I have a dataframe like so:

ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm
4   imgix.com/lks032m
4   imgix.com/903248

I'd like to end up with:

ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm

Even though that causes me to lose 2 links for ID = 4, I don't care. Is there a simple way to do this in python/pyspark?

Answer 1:
Group by on col('ID').
Use collect_list with agg to aggregate the list.
Call getItem(0)
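A sketch that follows those steps (the rows are re-typed from the question). dropDuplicates(["ID"]) is a shorter alternative; note that without an explicit ordering, neither approach guarantees which ID = 4 row survives.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "imgix.com/20830dk"), (2, "imgix.com/202398pwe"), (3, "imgix.com/lvw0923dk"),
     (4, "imgix.com/082kldcm"), (4, "imgix.com/lks032m"), (4, "imgix.com/903248")],
    ["ID", "Media"])

# Group by ID, collect the Media values, and keep only the first item of each list.
deduped = (df.groupBy(F.col("ID"))
             .agg(F.collect_list("Media").alias("medias"))
             .withColumn("Media", F.col("medias").getItem(0))
             .drop("medias"))
deduped.show(truncate=False)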

Reading avro messages from Kafka in spark streaming/structured streaming

Deadly submitted on 2020-01-15 10:07:09

Question: I am using pyspark for the first time.

Spark version: 2.3.0
Kafka version: 2.2.0

I have a Kafka producer which sends nested data in Avro format, and I am trying to write code in Spark Streaming / Structured Streaming in pyspark which will deserialize the Avro coming from Kafka into a dataframe, do transformations, and write it in Parquet format to S3. I was able to find Avro converters in Spark/Scala, but support in pyspark has not yet been added. How do I convert the same in pyspark? Thanks.

Answer 1:
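A sketch of one workaround (assumptions: fastavro available on the executors, a hypothetical schema file, schemaless Avro payloads without schema-registry framing, and placeholder broker/topic/S3 names; this is not the thread's answer): since Spark 2.3's PySpark has no built-in Avro reader for Kafka values, decode the binary "value" column with a Python UDF and continue from the decoded result.

import io
import json

import fastavro
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("kafka-avro-structured-sketch").getOrCreate()

# Hypothetical writer schema file shipped with the job.
schema = fastavro.schema.load_schema("value_schema.avsc")

def decode_avro(raw):
    # The Kafka source exposes "value" as binary; decode it and return a JSON string.
    record = fastavro.schemaless_reader(io.BytesIO(bytes(raw)), schema)
    return json.dumps(record)

decode_avro_udf = F.udf(decode_avro)          # default return type is StringType

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "my_topic")                     # placeholder topic
          .load())

decoded = stream.select(decode_avro_udf(F.col("value")).alias("json_value"))

query = (decoded.writeStream.format("parquet")
         .option("path", "s3a://my-bucket/output/")            # placeholder S3 path
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
         .start())
query.awaitTermination()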