pyspark

spark-submit - cannot pickle class in package, but can pickle 'same' class in root folder

天涯浪子 submitted on 2019-12-24 02:18:58
Question: In my Python-based Spark task 'main.py', I reference a protobuf-generated class 'a_pb2.py'. If I place all files in the root directory, like

/
- main.py
- a_pb2.py

zip a_pb2.py into 'proto.zip', and then run

spark-submit --py-files=proto.zip main.py

everything runs as expected. However, if I move the protobuf classes to a package, organizing my files like

/
- main.py
- /protofiles
  - __init__.py
  - a_pb2.py

zip /protofiles into 'proto.zip', and then run

spark-submit --py-files=proto.zip main.py
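A packaging detail that often explains this symptom: the zip has to contain the package directory itself (protofiles/__init__.py, protofiles/a_pb2.py), not just the loose files inside it. Below is a minimal sketch of that layout with the import kept inside the function that runs on the executors; the message class name is a placeholder, not taken from the question.

```python
# Minimal sketch, assuming proto.zip was built from the project root so that it
# contains protofiles/__init__.py and protofiles/a_pb2.py
# (e.g. `zip -r proto.zip protofiles` run from the project root).
from pyspark import SparkContext

sc = SparkContext()
sc.addPyFile("proto.zip")          # equivalent to passing --py-files=proto.zip

def decode(raw_bytes):
    # Importing inside the function keeps the generated module out of the
    # closure that Spark has to pickle and ship to the workers.
    from protofiles import a_pb2
    msg = a_pb2.A()                # hypothetical message class name
    msg.ParseFromString(raw_bytes)
    return msg.SerializeToString()

decoded = sc.parallelize([b""]).map(decode).collect()
```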

Rename columns with special characters in python or Pyspark dataframe

孤人 submitted on 2019-12-24 02:17:13
Question: I have a data frame in Python/PySpark. The column names contain special characters such as dots (.), spaces, parentheses () and braces {}. I want to rename the columns so that dots and spaces are replaced with underscores, and so that () and {} are removed from the names. I have done this:

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))

With this I was able to replace the dots and spaces with underscores, but I am unable to do the
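For reference, a minimal sketch of one way to apply both rules in a single helper; the column names below are made up for the example:

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["col.one", "col (two)", "col {three}"])

def clean(name):
    name = re.sub(r"[.\s]+", "_", name)   # dots and whitespace -> underscore
    return re.sub(r"[(){}]", "", name)    # drop parentheses and braces

df1 = df.toDF(*[clean(c) for c in df.columns])
print(df1.columns)                        # ['col_one', 'col_two', 'col_three']
```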

How to write numpy arrays directly to s3 in a deep learning application backed by spark

笑着哭i submitted on 2019-12-24 02:04:53
Question: We are generating ~10k NumPy arrays using Keras, and finally we have to save those arrays as .npy files to S3. The problem is that, to save to S3 inside Spark's map function, we have to create an intermediate file. Instead of creating intermediate files, we want to stream the arrays directly to S3. I used the "Cottoncandy" library, but it is not working inside the Spark map function and throws this error:

pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock
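One commonly suggested workaround, not tied to Cottoncandy: serialize each array into an in-memory buffer and upload it with a boto3 client created inside the executor, so nothing holding a thread lock ever has to be pickled. A sketch, with the bucket and key names as placeholders:

```python
import io
import numpy as np
from pyspark import SparkContext

sc = SparkContext()

def save_partition(rows):
    import boto3                      # create the client on the executor so it is never pickled
    s3 = boto3.client("s3")
    for key, arr in rows:
        buf = io.BytesIO()
        np.save(buf, arr)             # write the .npy bytes to memory instead of a temp file
        s3.put_object(Bucket="my-bucket",             # placeholder bucket
                      Key="arrays/%s.npy" % key,
                      Body=buf.getvalue())
    return iter([])

pairs = sc.parallelize([("a", np.zeros((2, 2))), ("b", np.ones((3,)))])
pairs.mapPartitions(save_partition).count()
```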

How to add multiple columns in Apache Spark

非 Y 不嫁゛ submitted on 2019-12-24 02:02:11
Question: Here is my input data, with four columns and space as the delimiter. I want to add the second and third columns and print the result:

sachin 200 10 2
sachin 900 20 2
sachin 500 30 3
Raju 400 40 4
Mike 100 50 5
Raju 50 60 6

My code is midway:

from pyspark import SparkContext
sc = SparkContext()

def getLineInfo(lines):
    spLine = lines.split(' ')
    name = str(spLine[0])
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    cash3 = int(spLine[3])
    return (name, cash, cash2)

myFile = sc.textFile("D:\PYSK
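A minimal sketch of one way to finish this; the input path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext()

def getLineInfo(line):
    spLine = line.split(' ')
    name = spLine[0]
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    return (name, cash + cash2)       # sum of the 2nd and 3rd columns

myFile = sc.textFile("input.txt")     # placeholder path
for name, total in myFile.map(getLineInfo).collect():
    print(name, total)
```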

how to use nextval() in a postgres jdbc driver for pyspark?

时光总嘲笑我的痴心妄想 submitted on 2019-12-24 01:41:35
Question: I have a table named "mytable" in Postgres with two columns, id (bigint) and value (varchar(255)). id gets its value from a sequence via nextval('my_sequence'). A PySpark application takes a dataframe and uses the Postgres JDBC jar (postgresql-42.1.4.jar) to insert the dataframe into "mytable". I'm creating the id column using df.withColumn('id', lit("nextval('my_sequence')")), but Postgres interprets the column as 'character varying'. I can see that there are ways for calling Postgres
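One workaround that sidesteps the type problem entirely: drop the id column from the dataframe and let Postgres fill it in, assuming id has DEFAULT nextval('my_sequence') defined on the database side. The connection details below are placeholders, and df is the dataframe from the question:

```python
(df.drop("id")
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # placeholder connection string
   .option("dbtable", "mytable")
   .option("user", "dbuser")                              # placeholder credentials
   .option("password", "secret")
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())
```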

Pyspark string pattern from columns values and regexp expression

…衆ロ難τιáo~ submitted on 2019-12-24 01:13:22
Question: Hi, I have a dataframe with 2 columns:

+----------------------------------------+----------+
| Text                                   | Key_word |
+----------------------------------------+----------+
| First random text tree cheese cat      | tree     |
| Second random text apple pie three     | text     |
| Third random text burger food brain    | brain    |
| Fourth random text nothing thing chips | random   |
+----------------------------------------+----------+

I want to generate a 3rd column with the word appearing before the key_word from the
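A minimal sketch of one approach, using a Python UDF so the regex can be built per row from the Key_word value; df is the dataframe from the question and the new column name is arbitrary:

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def word_before(text, key):
    # Capture the word immediately preceding the keyword, if there is one.
    m = re.search(r"(\w+)\s+" + re.escape(key), text)
    return m.group(1) if m else None

df = df.withColumn("Before_word", word_before("Text", "Key_word"))
```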

Selecting empty array values from a Spark DataFrame

一世执手 submitted on 2019-12-24 00:59:56
Question: Given a DataFrame with the following rows:

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]

I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row). For example, I might expect this code to work:

df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()

I have two problems: how to combine where clauses with and, but more importantly...
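A minimal sketch of the usual approach: array columns have no isEmpty() method, but size() can stand in for it, and |/& combine the conditions inside a single where:

```python
from pyspark.sql import functions as F

# Keep every row except those where col2, col3 and col4 are all empty (the 3rd row).
non_empty = df.where(
    (F.size("col2") > 0) | (F.size("col3") > 0) | (F.size("col4") > 0)
)
non_empty.collect()
```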

How to use regexp_extract on an array of strings

那年仲夏 submitted on 2019-12-24 00:58:47
Question: I have a PySpark DataFrame that contains 4 columns. I want to extract some strings from one column whose type is array of strings. I used the regexp_extract function, but it returned an error because regexp_extract only accepts strings. Example dataframe:

id | last_name | age | Identificator
------------------------------------------------------------------
12 | AA        | 23  | "[""AZE","POI","76759","T86420","ADAPT"]"
------------------------------------------------------------------
24 | BB |
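One way around the string-only restriction is to explode the array first, run regexp_extract on each element, and group the matches back per id. A sketch, where df is the dataframe from the question and the pattern is only an illustration, not the asker's real one:

```python
from pyspark.sql import functions as F

exploded = df.withColumn("item", F.explode("Identificator"))
matched = (exploded
           .withColumn("code", F.regexp_extract("item", r"\d+", 0))   # illustrative pattern
           .where(F.col("code") != "")
           .groupBy("id")
           .agg(F.collect_list("code").alias("codes")))
```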

Spark Mongo connector, MongoShardedPartitioner does not work

冷暖自知 submitted on 2019-12-24 00:56:19
Question: For testing purposes, I have configured a 4-node cluster; each node has a Spark worker and a MongoDB shard. These are the details:

- Four Debian 9 servers (named visa0, visa1, visa2, visa3)
- Spark (v2.4.0) cluster on 4 nodes (visa1: master, visa0..3: slaves)
- MongoDB (v3.2.11) sharded cluster with 4 nodes (config server replica set on visa1..3, mongos on visa1, shard servers: visa0..3)

I'm using the MongoDB Spark connector installed with "spark-shell --packages org.mongodb.spark:mongo-spark
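For context, a minimal read sketch with the partitioner selected explicitly; the option keys follow the mongo-spark 2.x configuration docs, and the URI, database, collection and shard key below are placeholders rather than values from the question:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-sharded-read")
         .config("spark.mongodb.input.uri", "mongodb://visa1:27017/mydb.mycoll")
         .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
         .config("spark.mongodb.input.partitionerOptions.shardkey", "_id")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
print(df.rdd.getNumPartitions())   # inspect how the collection was split into partitions
```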

pyspark sql query : count distinct values with conditions

半城伤御伤魂 submitted on 2019-12-24 00:55:24
Question: I have a dataframe as below:

+-----------+------------+-------------+-----------+
| id_doctor | id_patient | consumption | type_drug |
+-----------+------------+-------------+-----------+
| d1        | p1         | 12.0        | bhd       |
| d1        | p2         | 10.0        | lsd       |
| d1        | p1         | 6.0         | bhd       |
| d1        | p1         | 14.0        | carboxyl  |
| d2        | p1         | 12.0        | bhd       |
| d2        | p1         | 13.0        | bhd       |
| d2        | p2         | 12.0        | lsd       |
| d2        | p1         | 6.0         | bhd       |
| d2        | p2         | 12.0        | bhd       |
+-----------+------------+-------------+-----------+

I want to count distinct
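For reference, a minimal sketch of the usual trick: wrap the value in when() so non-matching rows become null, which countDistinct then ignores. The condition below assumes the goal is counting distinct patients per doctor for one drug type; df is the dataframe from the question:

```python
from pyspark.sql import functions as F

result = (df.groupBy("id_doctor")
            .agg(F.countDistinct(
                F.when(F.col("type_drug") == "bhd", F.col("id_patient"))
            ).alias("bhd_patients")))
result.show()
```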