pyspark

spark-submit - cannot pickle class in package, but can pickle 'same' class in root folder

天涯浪子 submitted on 2019-12-24 02:18:58
Question: In my Python-based Spark task 'main.py', I reference a protobuf-generated class 'a_pb2.py'. If I place all files in the root directory, like

/
- main.py
- a_pb2.py

zip a_pb2.py into 'proto.zip', and then run

spark-submit --py-files=proto.zip main.py

everything runs as expected. However, if I move the protobuf classes to a package, organizing my files like

/
- main.py
- /protofiles
  - __init__.py
  - a_pb2.py

zip /protofiles into 'proto.zip', and then run

spark-submit --py-files=proto.zip main.py
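A packaging detail that often explains this symptom: the zip has to contain the package directory itself (protofiles/__init__.py, protofiles/a_pb2.py), not just the loose files inside it. Below is a minimal sketch of that layout with the import kept inside the function that runs on the executors; the message class name is a placeholder, not taken from the question.

```python
# Minimal sketch, assuming proto.zip was built from the project root so that it
# contains protofiles/__init__.py and protofiles/a_pb2.py
# (e.g. `zip -r proto.zip protofiles` run from the project root).
from pyspark import SparkContext

sc = SparkContext()
sc.addPyFile("proto.zip")          # equivalent to passing --py-files=proto.zip

def decode(raw_bytes):
    # Importing inside the function keeps the generated module out of the
    # closure that Spark has to pickle and ship to the workers.
    from protofiles import a_pb2
    msg = a_pb2.A()                # hypothetical message class name
    msg.ParseFromString(raw_bytes)
    return msg.SerializeToString()

decoded = sc.parallelize([b""]).map(decode).collect()
```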

Rename columns with special characters in python or Pyspark dataframe

孤人 submitted on 2019-12-24 02:17:13
Question: I have a data frame in Python/PySpark. The column names contain special characters such as dots (.), spaces, parentheses () and braces {}. I want to rename the columns so that dots and spaces are replaced with underscores, and so that () and {} are removed from the names. I have done this:

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))

With this I was able to replace the dots and spaces with underscores, but I am unable to do the
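For reference, a minimal sketch of one way to apply both rules in a single helper; the column names below are made up for the example:

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["col.one", "col (two)", "col {three}"])

def clean(name):
    name = re.sub(r"[.\s]+", "_", name)   # dots and whitespace -> underscore
    return re.sub(r"[(){}]", "", name)    # drop parentheses and braces

df1 = df.toDF(*[clean(c) for c in df.columns])
print(df1.columns)                        # ['col_one', 'col_two', 'col_three']
```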

How to write numpy arrays directly to s3 in a deep learning application backed by spark

笑着哭i submitted on 2019-12-24 02:04:53
Question: We are generating ~10k NumPy arrays using Keras, and finally we have to save those arrays as .npy files to S3. The problem is that, to save to S3 inside Spark's map function, we have to create an intermediate file. Instead of creating intermediate files, we want to stream the arrays directly to S3. I used the "Cottoncandy" library, but it is not working inside the Spark map function and throws this error:

pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock
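One commonly suggested workaround, not tied to Cottoncandy: serialize each array into an in-memory buffer and upload it with a boto3 client created inside the executor, so nothing holding a thread lock ever has to be pickled. A sketch, with the bucket and key names as placeholders:

```python
import io
import numpy as np
from pyspark import SparkContext

sc = SparkContext()

def save_partition(rows):
    import boto3                      # create the client on the executor so it is never pickled
    s3 = boto3.client("s3")
    for key, arr in rows:
        buf = io.BytesIO()
        np.save(buf, arr)             # write the .npy bytes to memory instead of a temp file
        s3.put_object(Bucket="my-bucket",             # placeholder bucket
                      Key="arrays/%s.npy" % key,
                      Body=buf.getvalue())
    return iter([])

pairs = sc.parallelize([("a", np.zeros((2, 2))), ("b", np.ones((3,)))])
pairs.mapPartitions(save_partition).count()
```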

How to add multiple columns in Apache Spark

非 Y 不嫁゛ submitted on 2019-12-24 02:02:11
Question: Here is my input data, with four columns and space as the delimiter. I want to add the second and third columns and print the result:

sachin 200 10 2
sachin 900 20 2
sachin 500 30 3
Raju 400 40 4
Mike 100 50 5
Raju 50 60 6

My code is midway:

from pyspark import SparkContext
sc = SparkContext()

def getLineInfo(lines):
    spLine = lines.split(' ')
    name = str(spLine[0])
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    cash3 = int(spLine[3])
    return (name, cash, cash2)

myFile = sc.textFile("D:\PYSK
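A minimal sketch of one way to finish this; the input path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext()

def getLineInfo(line):
    spLine = line.split(' ')
    name = spLine[0]
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    return (name, cash + cash2)       # sum of the 2nd and 3rd columns

myFile = sc.textFile("input.txt")     # placeholder path
for name, total in myFile.map(getLineInfo).collect():
    print(name, total)
```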

how to use nextval() in a postgres jdbc driver for pyspark?

时光总嘲笑我的痴心妄想 submitted on 2019-12-24 01:41:35
Question: I have a table named "mytable" in Postgres with two columns, id (bigint) and value (varchar(255)). id gets its value from a sequence via nextval('my_sequence'). A PySpark application takes a dataframe and uses the Postgres JDBC jar (postgresql-42.1.4.jar) to insert the dataframe into "mytable". I'm creating the id column using df.withColumn('id', lit("nextval('my_sequence')")), but Postgres interprets the column as 'character varying'. I can see that there are ways for calling Postgres
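One workaround that sidesteps the type problem entirely: drop the id column from the dataframe and let Postgres fill it in, assuming id has DEFAULT nextval('my_sequence') defined on the database side. The connection details below are placeholders, and df is the dataframe from the question:

```python
(df.drop("id")
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # placeholder connection string
   .option("dbtable", "mytable")
   .option("user", "dbuser")                              # placeholder credentials
   .option("password", "secret")
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())
```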

Pyspark string pattern from columns values and regexp expression

…衆ロ難τιáo~ submitted on 2019-12-24 01:13:22
Question: Hi, I have a dataframe with 2 columns:

+----------------------------------------+----------+
| Text                                   | Key_word |
+----------------------------------------+----------+
| First random text tree cheese cat      | tree     |
| Second random text apple pie three     | text     |
| Third random text burger food brain    | brain    |
| Fourth random text nothing thing chips | random   |
+----------------------------------------+----------+

I want to generate a 3rd column with the word appearing before the key_word from the
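A minimal sketch of one approach, using a Python UDF so the regex can be built per row from the Key_word value; df is the dataframe from the question and the new column name is arbitrary:

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def word_before(text, key):
    # Capture the word immediately preceding the keyword, if there is one.
    m = re.search(r"(\w+)\s+" + re.escape(key), text)
    return m.group(1) if m else None

df = df.withColumn("Before_word", word_before("Text", "Key_word"))
```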

Selecting empty array values from a Spark DataFrame

一世执手 submitted on 2019-12-24 00:59:56
Question: Given a DataFrame with the following rows:

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]

I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row). For example, I might expect this code to work:

df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()

I have two problems: how to combine where clauses with and, but more importantly...
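A minimal sketch of the usual approach: array columns have no isEmpty() method, but size() can stand in for it, and |/& combine the conditions inside a single where:

```python
from pyspark.sql import functions as F

# Keep every row except those where col2, col3 and col4 are all empty (the 3rd row).
non_empty = df.where(
    (F.size("col2") > 0) | (F.size("col3") > 0) | (F.size("col4") > 0)
)
non_empty.collect()
```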

How to use regexp_extract on an array of strings

那年仲夏 submitted on 2019-12-24 00:58:47
Question: I have a PySpark DataFrame that contains 4 columns. I want to extract some strings from one column whose type is array of strings. I used the regexp_extract function, but it returned an error because regexp_extract only accepts strings. Example dataframe:

id | last_name | age | Identificator
------------------------------------------------------------------
12 | AA        | 23  | "[""AZE","POI","76759","T86420","ADAPT"]"
------------------------------------------------------------------
24 | BB |
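One way around the string-only restriction is to explode the array first, run regexp_extract on each element, and group the matches back per id. A sketch, where df is the dataframe from the question and the pattern is only an illustration, not the asker's real one:

```python
from pyspark.sql import functions as F

exploded = df.withColumn("item", F.explode("Identificator"))
matched = (exploded
           .withColumn("code", F.regexp_extract("item", r"\d+", 0))   # illustrative pattern
           .where(F.col("code") != "")
           .groupBy("id")
           .agg(F.collect_list("code").alias("codes")))
```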

Spark Mongo connector, MongoShardedPartitioner does not work

冷暖自知 submitted on 2019-12-24 00:56:19
Question: For testing purposes, I have configured a 4-node cluster; each node has a Spark worker and a MongoDB shard. These are the details:

- Four Debian 9 servers (named visa0, visa1, visa2, visa3)
- Spark (v2.4.0) cluster on 4 nodes (visa1: master, visa0..3: slaves)
- MongoDB (v3.2.11) sharded cluster with 4 nodes (config server replica set on visa1..3, mongos on visa1, shard servers: visa0..3)

I'm using the MongoDB Spark connector installed with "spark-shell --packages org.mongodb.spark:mongo-spark
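For context, a minimal read sketch with the partitioner selected explicitly; the option keys follow the mongo-spark 2.x configuration docs, and the URI, database, collection and shard key below are placeholders rather than values from the question:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-sharded-read")
         .config("spark.mongodb.input.uri", "mongodb://visa1:27017/mydb.mycoll")
         .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
         .config("spark.mongodb.input.partitionerOptions.shardkey", "_id")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
print(df.rdd.getNumPartitions())   # inspect how the collection was split into partitions
```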

pyspark sql query : count distinct values with conditions

半城伤御伤魂 submitted on 2019-12-24 00:55:24
Question: I have a dataframe as below:

+-----------+------------+-------------+-----------+
| id_doctor | id_patient | consumption | type_drug |
+-----------+------------+-------------+-----------+
| d1        | p1         | 12.0        | bhd       |
| d1        | p2         | 10.0        | lsd       |
| d1        | p1         | 6.0         | bhd       |
| d1        | p1         | 14.0        | carboxyl  |
| d2        | p1         | 12.0        | bhd       |
| d2        | p1         | 13.0        | bhd       |
| d2        | p2         | 12.0        | lsd       |
| d2        | p1         | 6.0         | bhd       |
| d2        | p2         | 12.0        | bhd       |
+-----------+------------+-------------+-----------+

I want to count distinct
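For reference, a minimal sketch of the usual trick: wrap the value in when() so non-matching rows become null, which countDistinct then ignores. The condition below assumes the goal is counting distinct patients per doctor for one drug type; df is the dataframe from the question:

```python
from pyspark.sql import functions as F

result = (df.groupBy("id_doctor")
            .agg(F.countDistinct(
                F.when(F.col("type_drug") == "bhd", F.col("id_patient"))
            ).alias("bhd_patients")))
result.show()
```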