pyspark

Pyspark forward and backward fill within column level

Submitted by 蓝咒 on 2020-07-10 10:28:19
Question: I am trying to fill in missing data in a pyspark dataframe. The pyspark dataframe looks like this:

+---------+---------+-------------------+----+
| latitude|longitude|      timestamplast|name|
+---------+---------+-------------------+----+
|         | 4.905615|2019-08-01 00:00:00|   1|
|51.819645|         |2019-08-01 00:00:00|   1|
| 51.81964| 4.961713|2019-08-01 00:00:00|   2|
|         |         |2019-08-01 00:00:00|   3|
| 51.82918| 4.911187|                   |   3|
| 51.82385| 4.901488|2019-08-01 00:00:03|   5|
+---------+---------+-------------------+----+

Within …
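A minimal sketch of one common approach, assuming the blanks are genuine nulls and the fill should happen per name ordered by timestamplast: forward-fill with last(..., ignorenulls=True) over a window that looks back to the start of the partition, then backward-fill the remaining nulls with first(..., ignorenulls=True) over a window that looks ahead to the end.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Forward fill: from the start of the partition up to the current row
w_ffill = (Window.partitionBy("name").orderBy("timestamplast")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# Backward fill: from the current row to the end of the partition
w_bfill = (Window.partitionBy("name").orderBy("timestamplast")
           .rowsBetween(Window.currentRow, Window.unboundedFollowing))

for c in ["latitude", "longitude"]:
    df = df.withColumn(c, F.last(c, ignorenulls=True).over(w_ffill))
    df = df.withColumn(c, F.first(c, ignorenulls=True).over(w_bfill))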

Why do I see multiple Spark installation directories?

Submitted by 北城余情 on 2020-07-10 10:27:08
Question: I am working on an Ubuntu server that has Spark installed on it. I don't have sudo access to this server, so under my own directory I created a new virtual environment in which I installed pyspark. When I type the command below:

whereis spark-shell
# see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell
/home/abcd/.pyenv/shims/spark-shell2.cmd
/home/abcd/.pyenv/shims/spark-shell.cmd
/home/abcd/ …
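The pip-installed pyspark inside the pyenv virtualenv and the system-wide distribution under /opt are separate installations, which is one common reason for seeing several directories. If the question is which copy a given Python environment actually uses, a quick check from inside the virtualenv (a sketch, nothing here is specific to the server above):

import os
import pyspark

# Which pyspark package is this interpreter importing?
print(pyspark.__file__)      # typically a path inside the virtualenv's site-packages
print(pyspark.__version__)

# Which Spark distribution, if any, is configured via the environment?
print(os.environ.get("SPARK_HOME", "SPARK_HOME is not set"))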

Spark pipeline error gradient boosting model

Submitted by 白昼怎懂夜的黑 on 2020-07-10 10:25:25
Question: I am getting an error when I use a gradient boosting model in Python. I previously normalized the data, used VectorAssembler to transform it, and indexed the columns; the error occurs when I run this:

from pyspark.ml import Pipeline

# pipeline = Pipeline(stages=[gbt])
stages = []
stages += [gbt]
pipeline = Pipeline(stages=stages)
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
prediction.printSchema()

This is the error:

command-3539065191562733> in <module>() 6 7 pipeline = …
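The snippet above assumes gbt and the upstream feature columns were created elsewhere. For reference, a minimal end-to-end sketch of the same pipeline shape, with hypothetical column names (f1, f2, f3, label_raw) standing in for the real ones:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import GBTClassifier

# Hypothetical column names; replace with the columns in df_train
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
indexer = StringIndexer(inputCol="label_raw", outputCol="label")
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

pipeline = Pipeline(stages=[assembler, indexer, gbt])
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
prediction.printSchema()

A frequent cause of failures at pipeline.fit is a mismatch between the columns the estimator expects (labelCol, featuresCol) and what the earlier stages actually produce, so those names are worth checking first.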

Type Casting Large number of Struct Fields to String using Pyspark

Submitted by ≯℡__Kan透↙ on 2020-07-10 07:40:09
Question: I have a pyspark df whose schema looks like this:

root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109: …
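Since the struct has far too many fields to list by hand, one approach (a sketch, assuming the goal is simply every sub-field of company cast to string) is to read the field names from the schema and rebuild the struct:

from pyspark.sql import functions as F

# Enumerate the sub-fields of the company struct from the schema,
# cast each one to string, and reassemble the struct.
company_fields = df.schema["company"].dataType.fieldNames()

df_casted = df.withColumn(
    "company",
    F.struct(*[F.col("company").getField(f).cast("string").alias(f)
               for f in company_fields])
)
df_casted.printSchema()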

How can I concatenate the rows in a pyspark dataframe with multiple columns using groupby and aggregate

Submitted by ぃ、小莉子 on 2020-07-10 03:11:13
Question: I have a pyspark dataframe with multiple columns, for example the one below:

from pyspark.sql import Row

l = [('Jack', "a", "p"), ('Jack', "b", "q"), ('Bell', "c", "r"), ('Bell', "d", "s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)

+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|       a|       p|
|Jack|       b|       q|
|Bell|       c|       r|
|Bell|       d|       s|
+----+--------+--------+

Now I want to …
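The question is cut off above, but a common version of this task is to collapse the rows per name and join the values of each column into one string. A sketch with collect_list and concat_ws, assuming that is the intended aggregation:

from pyspark.sql import functions as F

result = (score_card
          .groupBy("name")
          .agg(F.concat_ws(",", F.collect_list("letters1")).alias("letters1"),
               F.concat_ws(",", F.collect_list("letters2")).alias("letters2")))
result.show()
# Note: collect_list does not guarantee ordering within each group.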

CI/CD tests involving pyspark - JAVA_HOME is not set

Submitted by 孤者浪人 on 2020-07-09 16:25:50
Question: I am working on a project which uses pyspark, and I would like to set up automated tests. Here's what my .gitlab-ci.yml file looks like:

image: "myimage:latest"

stages:
  - Tests

pytest:
  stage: Tests
  script:
    - pytest tests/.

I built the docker image myimage using a Dockerfile such as the following (see this excellent answer):

FROM python:3.7
RUN python --version

# Create app directory
WORKDIR /app

# copy requirements.txt
COPY local-src/requirements.txt ./

# Install app dependencies
RUN pip …
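pyspark only provides the Python bindings; it still needs a JVM at runtime, and the python:3.7 base image does not ship one, which is the usual reason for a "JAVA_HOME is not set" error in this setup, so the image needs a JDK installed. As a side note, a hypothetical conftest.py fixture like the one below makes the failure mode more obvious in CI (the fixture name and failure message are my own, not from the question):

# conftest.py (hypothetical)
import os
import shutil

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Fail fast with a readable message when the CI image has no Java runtime
    if not os.environ.get("JAVA_HOME") and shutil.which("java") is None:
        pytest.fail("No JDK found in the image; pyspark needs Java to start the JVM")
    session = (SparkSession.builder
               .master("local[1]")
               .appName("ci-tests")
               .getOrCreate())
    yield session
    session.stop()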

Jupyter Notebook error while using PySpark Kernel: the code failed because of a fatal error: Error sending http request

Submitted by 独自空忆成欢 on 2020-07-09 08:45:05
Question: I am using the Jupyter notebook's PySpark kernel. I have successfully selected the PySpark kernel, but I keep getting the error below:

The code failed because of a fatal error:
    Error sending http request and maximum retry encountered..

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.

Here's the log also:

2019-10-10 13:37 …

reading a nested JSON file in pyspark

Submitted by 旧时模样 on 2020-07-09 06:59:48
Question: I'd like to create a pyspark dataframe from a JSON file in HDFS. The JSON file has the following content:

{
  "Product": {
    "0": "Desktop Computer",
    "1": "Tablet",
    "2": "iPhone",
    "3": "Laptop"
  },
  "Price": {
    "0": 700,
    "1": 250,
    "2": 800,
    "3": 1200
  }
}

Then, I read this file using pyspark 2.4.4:

df = spark.read.json("/path/file.json")

So, I get a result like this:

df.show(truncate=False)
+---------------------+---------------------------------+
|Price                |Product                          |
+---------------------+------------- …
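The question is cut off, but a common follow-up with this column-oriented layout is turning it into one row per item. A sketch under that assumption (the multiLine option is needed when the file is a single pretty-printed document rather than line-delimited JSON):

from pyspark.sql import functions as F

# Read the whole file as one JSON document
df = spark.read.option("multiLine", True).json("/path/file.json")

# Reshape: one row per shared key of the Product and Price structs
keys = df.schema["Product"].dataType.fieldNames()   # ["0", "1", "2", "3"]
items = [F.struct(F.col("Product").getField(k).alias("Product"),
                  F.col("Price").getField(k).alias("Price"))
         for k in keys]
rows = df.select(F.explode(F.array(*items)).alias("item")).select("item.*")
rows.show()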

How to group by one column in an RDD in pyspark?

Submitted by 时间秒杀一切 on 2020-07-09 04:43:49
Question: The RDD in pyspark consists of lists of four elements each:

[id1, 'aaa', 12, 87]
[id2, 'acx', 1, 90]
[id3, 'bbb', 77, 10]
[id2, 'bbb', 77, 10]
.....

I want to group by the ids in the first column and get the aggregated result of the other three columns, for example:

[id2, [['acx', 1, 90], ['bbb', 77, 10], ...]]

How can I achieve this?

Answer 1:

spark.version
# u'2.2.0'

rdd = sc.parallelize((['id1', 'aaa', 12, 87],
                      ['id2', 'acx', 1, 90],
                      ['id3', 'bbb', 77, 10],
                      ['id2', 'bbb', 77, 10]))
rdd.map(lambda x: (x[0], x[1: …
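The answer snippet is truncated above; a completed version of the same pattern (my completion of the idea, not necessarily the original answer) keys each record by its id, groups, and collects the remaining fields:

grouped = (rdd.map(lambda x: (x[0], x[1:]))
              .groupByKey()
              .mapValues(list))

grouped.collect()
# e.g. [('id2', [['acx', 1, 90], ['bbb', 77, 10]]), ('id1', [['aaa', 12, 87]]), ('id3', [['bbb', 77, 10]])]
# (the order of keys returned by collect() is not guaranteed)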