pyspark

pyspark. zip arrays in a dataframe

Submitted by ぐ巨炮叔叔 on 2021-02-10 09:31:13
Question: I have the following PySpark DataFrame:

+------+----------------+
|    id|            data|
+------+----------------+
|     1|    [10, 11, 12]|
|     2|    [20, 21, 22]|
|     3|    [30, 31, 32]|
+------+----------------+

In the end, I want to have the following DataFrame:

+--------+----------------------------------+
|      id|                              data|
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+

In order to do this, first I extract the data arrays as …
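
The excerpt is cut off, but here is a sketch of one way to get from the first frame to the second (not necessarily the poster's approach): collapse all rows into a single row with collect_list, then transpose the nested array with Spark 2.4+ higher-order functions. Note that collect_list does not guarantee row order, so add an ordering step if it matters.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("zip-arrays").getOrCreate()

df = spark.createDataFrame(
    [(1, [10, 11, 12]), (2, [20, 21, 22]), (3, [30, 31, 32])],
    ["id", "data"],
)

result = (
    df.groupBy()  # a single group: collapse every row into one
      .agg(F.collect_list("id").alias("id"),
           F.collect_list("data").alias("data"))
      # transpose the collected array of arrays: element i of the result
      # gathers element i of every inner array
      .withColumn("data", F.expr(
          "transform(sequence(0, size(data[0]) - 1),"
          " i -> transform(data, x -> x[i]))"))
)
result.show(truncate=False)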

How to parse and transform json string from spark data frame rows in pyspark

Submitted by 为君一笑 on 2021-02-10 07:57:07
Question: How do I parse and transform a JSON string from Spark DataFrame rows in PySpark? I'm looking for help with how to: (1) parse the JSON string into a JSON struct (output 1), and (2) transform the JSON string into columns a, b, and id (output 2). Background: via an API I receive JSON strings with a large number of rows (jstr1, jstr2, ...), which are saved to a Spark df. I could read the schema for each row separately, but that is not a solution: it is very slow because there are so many rows. Each jstr has the same schema; the columns/keys a …
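
Since every jstr shares one schema, a common pattern is to infer the schema once from a single sample row and apply it to the whole column with from_json. A sketch with made-up jstr values, since the real ones are truncated above:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("parse-json").getOrCreate()

# hypothetical stand-ins for the truncated jstr1, jstr2, ...
jstr1 = '{"id": 1, "a": "x1", "b": "y1"}'
jstr2 = '{"id": 2, "a": "x2", "b": "y2"}'
df = spark.createDataFrame([(jstr1,), (jstr2,)], ["json_str"])

# infer the schema once, from a single sample row
schema = spark.read.json(df.limit(1).rdd.map(lambda r: r.json_str)).schema

# output 1: the JSON string parsed into a struct column
parsed = df.withColumn("parsed", F.from_json("json_str", schema))

# output 2: the struct flattened into top-level columns a, b, id
parsed.select("parsed.*").show()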

How to open a file which is stored in HDFS in pySpark using with open

Submitted by 本秂侑毒 on 2021-02-10 06:37:27
Question: How do I open a file which is stored in HDFS? Here the input file is from HDFS. If I give the file as below, I won't be able to open it, and it will show "file not found":

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    with open("/user/sachinkerala6174/inData/movieStat") as f:
        for line in f:
            fields = line.split("|")
            mID = fields[0]
            mName = fields[1]
            movieNames[int(mID)] = mName
    return movieNames

nameDict …
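
Python's built-in open() only sees the driver's local filesystem, which is why the HDFS path comes back as "file not found". One workaround, sketched under the assumption that the file is small enough to collect to the driver, is to read it through Spark itself:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieNames(path):
    movieNames = {}
    # sc.textFile resolves the path against the cluster's default
    # filesystem (HDFS here); collect() pulls the lines to the driver
    for line in sc.textFile(path).collect():
        fields = line.split("|")
        movieNames[int(fields[0])] = fields[1]
    return movieNames

nameDict = getMovieNames("/user/sachinkerala6174/inData/movieStat")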

Unsupported Array error when reading JDBC source in (Py)Spark?

Submitted by 烈酒焚心 on 2021-02-10 06:27:50
Question: I am trying to convert a PostgreSQL DB to a DataFrame. Following is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"
connectionProperties = {
    "user": " ",
    "password": " ",
    "driver": "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name") …
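
The excerpt stops before the error itself, but the title points at a known limitation: Spark's JDBC reader cannot map some PostgreSQL array types. A common workaround, sketched here with hypothetical names my_table and arr_col, is to cast the array column to text inside the pushdown subquery and split it back into an array on the Spark side:

import pyspark.sql.functions as F

# cast the unsupported array column to text so the JDBC reader sees a string
query = "(SELECT id, arr_col::text AS arr_col FROM my_table) q"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)

# PostgreSQL renders arrays as '{1,2,3}'; strip the braces and split
df = df.withColumn(
    "arr_col",
    F.split(F.regexp_replace("arr_col", r"[{}]", ""), ","),
)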

column object not callable spark

Submitted by 依然范特西╮ on 2021-02-10 06:22:29
Question: I tried to install Spark and run the commands given in the tutorial at https://spark.apache.org/docs/latest/quick-start.html, but I get the following error:

P-MBP:spark-2.0.2-bin-hadoop2.4 prem$ ./bin/pyspark
Python 2.7.13 (default, Apr 4 2017, 08:44:49)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to …
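
The transcript is cut off before the error, but a plausible cause, given the title, is a docs/version mismatch (an assumption, not confirmed by the excerpt): the /docs/latest/ quick start targets a newer Spark than this 2.0.2 build. Column.contains only arrived in PySpark 2.2; on 2.0.2, textFile.value.contains falls through Column.__getattr__ to a nested-field Column, so calling it raises TypeError: 'Column' object is not callable. A like() filter works on both:

# from the quick start; needs Spark 2.2+ where Column.contains exists
textFile = spark.read.text("README.md")
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

# equivalent filter that also works on Spark 2.0.x
linesWithSpark = textFile.filter(textFile.value.like("%Spark%"))
linesWithSpark.count()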

Update the Nested Json with another Nested Json using Python

Submitted by 两盒软妹~` on 2021-02-10 05:02:14
Question: For example, I have one full set of nested JSON, and I need to update this JSON with the latest values from another nested JSON. Can anyone help me with this? I want to implement this in PySpark. The full-set JSON looks like this:

{
  "email": "abctest@xxx.com",
  "firstName": "name01",
  "id": 6304,
  "surname": "Optional",
  "layer01": {
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": "value4",
    "layer02": {
      "key1": "value1",
      "key2": "value2"
    },
    "layer03": [
      { "inner_key01": "inner value01" },
      { …
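
A plain-Python starting point, sketched here because the second JSON and the expected output are truncated: merge recursively, updating nested dicts key by key and replacing everything else (lists included) with the newer value. The same function can be wrapped in a UDF once the JSON strings are parsed.

import json

def deep_update(full, update):
    # recurse into dicts present on both sides; otherwise `update` wins
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(full.get(key), dict):
            deep_update(full[key], value)
        else:
            full[key] = value
    return full

full_set = json.loads('{"id": 6304, "layer01": {"key1": "value1", "key2": "value2"}}')
latest = json.loads('{"layer01": {"key2": "new value2"}}')
print(deep_update(full_set, latest))
# {'id': 6304, 'layer01': {'key1': 'value1', 'key2': 'new value2'}}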

Convert pyspark dataframe into list of python dictionaries

Submitted by 怎甘沉沦 on 2021-02-10 04:50:38
Question: Hi, I'm new to PySpark and I'm trying to convert a pyspark.sql.dataframe into a list of dictionaries. Below is my dataframe; the type is <class 'pyspark.sql.dataframe.DataFrame'>:

+------------------+----------+------------------------+
|             title|imdb_score|Worldwide_Gross(dollars)|
+------------------+----------+------------------------+
| The Eight Hundred|       7.2|               460699653|
| Bad Boys for Life|       6.6|               426505244|
|             Tenet|       7.8|               334000000|
|Sonic the Hedgehog|       6.5|               308439401|
|          Dolittle|       5.6|               245229088| …
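
The usual answer, shown as a minimal sketch on a trimmed-down version of that frame: collect() brings every row to the driver, and each Row exposes asDict().

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-dicts").getOrCreate()
df = spark.createDataFrame(
    [("Tenet", 7.8, 334000000), ("Dolittle", 5.6, 245229088)],
    ["title", "imdb_score", "Worldwide_Gross(dollars)"],
)

# Row.asDict() turns one collected Row into a plain dict; this suits
# small DataFrames only, since collect() gathers everything on the driver
dict_list = [row.asDict() for row in df.collect()]
print(dict_list)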
