pyspark

Find number of rows in a given week in PySpark

孤者浪人 submitted on 2020-01-16 05:36:08

Question: I have a PySpark dataframe, a small portion of which is given below:

+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5
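A minimal sketch of one way to count rows per week (an assumption, not the answer recorded in this thread; the sample rows are re-typed from the question, and "week" is taken as the ISO week number): extract the year and week number from the timestamp and group on them.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("rows-per-week").getOrCreate()

# Sample rows re-typed from the question; the timestamp strings are parsed into TimestampType.
df = spark.createDataFrame(
    [("name1", "type1", "2012-01-10 00:00:00", 11),
     ("name1", "type1", "2012-01-10 00:00:10", 14),
     ("name5", "type1", "2012-01-17 00:01:00", 5)],
    ["name", "type", "timestamp", "score"],
).withColumn("timestamp", F.to_timestamp("timestamp"))

# One output row per (year, week) with the number of records falling in that week.
weekly_counts = (df.groupBy(F.year("timestamp").alias("year"),
                            F.weekofyear("timestamp").alias("week"))
                   .count())
weekly_counts.show()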

Spark map is only one task while it should be parallel (PySpark)

时光总嘲笑我的痴心妄想 submitted on 2020-01-16 03:56:09

Question: I have an RDD with around 7M entries, each with 10 normalized coordinates. I also have a number of centers, and I'm trying to map every entry to the closest (Euclidean distance) center. The problem is that this only generates one task, which means it is not parallelizing. This is the form:

def doSomething(point, centers):
    for center in centers.value:
        if distance(point, center) < 1:
            return center
    return None

preppedData.map(lambda x: doSomething(x, centers)).take(5)

The preppedData RDD is cached
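A sketch of the usual explanation (an assumption, not the answer recorded in this thread): take(5) only computes as many partitions as it needs to produce five elements, so it often runs as a single task even when the RDD has many partitions, whereas a full action such as count() launches one task per partition. The toy data, distance function, and names below are hypothetical.

import math
from pyspark import SparkContext

sc = SparkContext(appName="parallel-map-sketch")

# Hypothetical toy data: 1000 points with 10 coordinates each, spread over 8 partitions.
points = sc.parallelize([[float(i % 10)] * 10 for i in range(1000)], numSlices=8)
centers = sc.broadcast([[0.0] * 10, [5.0] * 10])

def distance(p, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

def closest_center(point, centers_bc):
    for center in centers_bc.value:
        if distance(point, center) < 1:
            return center
    return None

print(points.getNumPartitions())                      # 8 -> up to 8 tasks can run at once
mapped = points.map(lambda x: closest_center(x, centers))
print(mapped.count())                                 # a full action touches every partition
print(mapped.take(5))                                 # take() may evaluate only one partition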

Distributed for loop in pyspark dataframe

大城市里の小女人 submitted on 2020-01-15 12:17:28

Question: Context: My company is on Spark 2.2, so it's not possible to use pandas_udf for distributed column processing. I have dataframes that contain thousands of columns (features) and millions of records.

df = spark.createDataFrame(
    [(1, "AB", 100, 200, 1),
     (2, "AC", 150, 200, 2),
     (3, "AD", 80, 150, 0)],
    ["Id", "Region", "Salary", "HouseHoldIncome", "NumChild"])

I would like to compute certain summaries and statistics on each column in a parallel manner and wonder what is the best way to achieve this.

#The point is
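A sketch of one common pattern for this (an assumption, not the answer recorded in this thread; the column names come from the question's example): build every aggregate expression up front and pass them all to a single agg() call, so the statistics for all columns are computed in one distributed job instead of a Python loop of separate jobs.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("column-stats-sketch").getOrCreate()

# Example frame re-typed from the question.
df = spark.createDataFrame(
    [(1, "AB", 100, 200, 1), (2, "AC", 150, 200, 2), (3, "AD", 80, 150, 0)],
    ["Id", "Region", "Salary", "HouseHoldIncome", "NumChild"])

numeric_cols = ["Salary", "HouseHoldIncome", "NumChild"]

# Build all aggregate expressions first, then run them in a single distributed job.
exprs = []
for c in numeric_cols:
    exprs += [F.mean(c).alias(c + "_mean"),
              F.stddev(c).alias(c + "_stddev"),
              F.min(c).alias(c + "_min"),
              F.max(c).alias(c + "_max")]

df.agg(*exprs).show()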

Kafka Stream to Spark Stream python

青春壹個敷衍的年華 submitted on 2020-01-15 12:15:08

Question: We have a Kafka stream which uses Avro. I need to connect it to a Spark stream. I use the code below, as Lev G suggested.

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}, valueDecoder=MessageSerializer.decode_message)

I got the error below when I executed it through spark-submit.

2018-10-09 10:49:27 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 12 for reason Container marked as failed: container_1537396420651_0008_01_000013 on
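For context, valueDecoder is just a function from the raw message bytes to a Python object. A sketch of a hand-written decoder (assumptions: a fastavro dependency, a hypothetical schema file, plain schemaless Avro payloads with no Confluent schema-registry framing, and placeholder topic/broker names; this is not the thread's resolution):

import io

import fastavro
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-avro-dstream-sketch")
ssc = StreamingContext(sc, 10)

# Hypothetical writer schema file shipped with the job.
schema = fastavro.schema.load_schema("value_schema.avsc")

def avro_decoder(raw_bytes):
    # valueDecoder receives the raw message bytes (or None) and may return any Python object.
    if raw_bytes is None:
        return None
    return fastavro.schemaless_reader(io.BytesIO(raw_bytes), schema)

kvs = KafkaUtils.createDirectStream(
    ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"},
    valueDecoder=avro_decoder)

kvs.map(lambda kv: kv[1]).pprint()
ssc.start()
ssc.awaitTermination()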

pyspark change day in datetime column

穿精又带淫゛_ submitted on 2020-01-15 10:53:39

Question: What is wrong with this code, which tries to change the day of a datetime column?

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
import datetime

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
rdd = sc.parallelize([('a', datetime.datetime(2014, 1, 9, 0, 0)),
                      ('b', datetime.datetime(2014, 1, 27, 0, 0)),
                      ('c', datetime.datetime(2014, 1, 31, 0, 0))])
testdf = sqlcontext.createDataFrame(rdd, ["id", "date"])
print(testdf.show())
print
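A sketch of two common ways to change the day component (an assumption about the goal, since the excerpt is cut off before the failing expression): trunc() for setting the date to the first of its month, or a small UDF that calls datetime.replace(day=...).

import datetime

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes

sc = pyspark.SparkContext(appName="change-day-sketch")
sqlcontext = pyspark.SQLContext(sc)

testdf = sqlcontext.createDataFrame(
    [('a', datetime.datetime(2014, 1, 9, 0, 0)),
     ('b', datetime.datetime(2014, 1, 27, 0, 0)),
     ('c', datetime.datetime(2014, 1, 31, 0, 0))],
    ["id", "date"])

# Option 1: truncate to the first day of the month (returns a DateType column).
testdf.withColumn("first_of_month", sf.trunc("date", "month")).show()

# Option 2: a UDF that replaces the day while keeping the rest of the timestamp.
set_day = sf.udf(lambda d: d.replace(day=1) if d else None, sparktypes.TimestampType())
testdf.withColumn("date_day1", set_day("date")).show()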

Use Spark fileoutputcommitter.algorithm.version=2 with AWS Glue

 ̄綄美尐妖づ submitted on 2020-01-15 10:37:10

Question: I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Is it possible to use this configuration with AWS Glue?

Answer 1: Option 1: Glue uses a Spark context, so you can set Hadoop configuration on AWS Glue as well, since internally a dynamic frame is a kind of dataframe.

sc._jsc.hadoopConfiguration().set("mykey", "myvalue")

I think you need to add the corresponding class as well, like this:

sc._jsc
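A sketch of how that idea fits into a Glue job (assumptions: a standard Glue PySpark job skeleton; the configuration key name is the usual Hadoop one and is not confirmed by the thread excerpt):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Reach the Hadoop configuration through the Glue job's SparkContext and set the
# FileOutputCommitter algorithm version there.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.algorithm.version", "2")

# Verify that the setting took effect.
print(sc._jsc.hadoopConfiguration()
        .get("mapreduce.fileoutputcommitter.algorithm.version"))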

Remove duplicate rows, regardless of new information -PySpark

纵然是瞬间 submitted on 2020-01-15 10:15:39

Question: Say I have a dataframe like so:

ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm
4   imgix.com/lks032m
4   imgix.com/903248

I'd like to end up with:

ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm

Even though that causes me to lose 2 links for ID = 4, I don't care. Is there a simple way to do this in python/pyspark?

Answer 1:
Group by on col('ID').
Use collect_list with agg to aggregate the list.
Call getItem(0)
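A sketch that follows those steps (the rows are re-typed from the question). dropDuplicates(["ID"]) is a shorter alternative; note that without an explicit ordering, neither approach guarantees which ID = 4 row survives.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "imgix.com/20830dk"), (2, "imgix.com/202398pwe"), (3, "imgix.com/lvw0923dk"),
     (4, "imgix.com/082kldcm"), (4, "imgix.com/lks032m"), (4, "imgix.com/903248")],
    ["ID", "Media"])

# Group by ID, collect the Media values, and keep only the first item of each list.
deduped = (df.groupBy(F.col("ID"))
             .agg(F.collect_list("Media").alias("medias"))
             .withColumn("Media", F.col("medias").getItem(0))
             .drop("medias"))
deduped.show(truncate=False)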

Reading avro messages from Kafka in spark streaming/structured streaming

Deadly submitted on 2020-01-15 10:07:09

Question: I am using pyspark for the first time.

Spark version: 2.3.0
Kafka version: 2.2.0

I have a Kafka producer which sends nested data in Avro format, and I am trying to write code in Spark Streaming / Structured Streaming in pyspark which will deserialize the Avro coming from Kafka into a dataframe, do transformations, and write it in Parquet format to S3. I was able to find Avro converters in Spark/Scala, but support in pyspark has not yet been added. How do I convert the same in pyspark? Thanks.

Answer 1:
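A sketch of one workaround (assumptions: fastavro available on the executors, a hypothetical schema file, schemaless Avro payloads without schema-registry framing, and placeholder broker/topic/S3 names; this is not the thread's answer): since Spark 2.3's PySpark has no built-in Avro reader for Kafka values, decode the binary "value" column with a Python UDF and continue from the decoded result.

import io
import json

import fastavro
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("kafka-avro-structured-sketch").getOrCreate()

# Hypothetical writer schema file shipped with the job.
schema = fastavro.schema.load_schema("value_schema.avsc")

def decode_avro(raw):
    # The Kafka source exposes "value" as binary; decode it and return a JSON string.
    record = fastavro.schemaless_reader(io.BytesIO(bytes(raw)), schema)
    return json.dumps(record)

decode_avro_udf = F.udf(decode_avro)          # default return type is StringType

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "my_topic")                     # placeholder topic
          .load())

decoded = stream.select(decode_avro_udf(F.col("value")).alias("json_value"))

query = (decoded.writeStream.format("parquet")
         .option("path", "s3a://my-bucket/output/")            # placeholder S3 path
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
         .start())
query.awaitTermination()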