pyspark

Are Jupyter notebook executors distributed dynamically in Apache Spark?

混江龙づ霸主 submitted on 2020-03-02 09:10:12
Question: I have a question in order to better understand a big data concept within Apache Hadoop/Spark. Not sure if it's off-topic in this forum, but let me know. Imagine an Apache Hadoop cluster with 8 servers managed by the YARN resource manager. I uploaded a file into HDFS (the file system), which is configured with a 64 MB block size and a replication count of 3. That file is then split into blocks of 64 MB. Now let's imagine the blocks are distributed by HDFS onto nodes 1, 2 and 3. But now I'm coding some
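
A minimal sketch (paths and file sizes are hypothetical, not from the question) showing the point the question builds on: by default, the number of input partitions Spark creates for an HDFS file usually mirrors the number of HDFS blocks, and YARN then prefers to place tasks on the nodes that hold those blocks.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("block-locality-check")
         .getOrCreate())

# Assume the file from the question was uploaded to this (hypothetical) HDFS path.
rdd = spark.sparkContext.textFile("hdfs:///data/large_input.txt")

# With a 64 MB block size, a ~192 MB file would typically yield 3 partitions,
# one per block; tasks are then preferentially scheduled on the nodes storing them.
print(rdd.getNumPartitions())

spark.stop()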

PySpark: Programming on Spark with Python

谁说我不能喝 submitted on 2020-03-01 13:03:03
Python Programming Guide The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python. Key Differences in the Python API There are a few key differences between the Python and Scala APIs: Python is dynamically typed, so RDDs can hold objects of multiple types. PySpark does not yet support a few API calls, such as lookup and non-text input files, though
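
A small illustration (not from the guide itself) of the dynamic-typing point above: because Python is dynamically typed, a single PySpark RDD can hold values of several types at once.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-type-rdd").getOrCreate()
sc = spark.sparkContext

# One RDD containing an int, a string, a float, a tuple, and a dict.
mixed = sc.parallelize([1, "two", 3.0, (4, "four"), {"five": 5}])

# Transformations work as long as the function handles each element's type.
print(mixed.map(lambda x: type(x).__name__).collect())
# ['int', 'str', 'float', 'tuple', 'dict']

spark.stop()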

Spark Programming Guide (Python Edition)

拈花ヽ惹草 submitted on 2020-02-29 08:04:28
This post is translated from the official Spark documentation ( http://spark.apache.org ). Since Spark is updated frequently, some of the APIs may already be out of date; this post is for reference only, so defer to the official documentation for your version and to runtime messages. Overview: At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster and processed in parallel. RDDs can be created by opening a file on HDFS (or another Hadoop-supported file system), by parallelizing an existing Scala collection in the driver program, or by transforming another RDD. Users can ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations; in addition, RDDs recover automatically when a node fails. The other abstraction Spark provides is shared variables that can be used in parallel operations. By default, when Spark runs a function as a set of tasks on different nodes, each task gets its own copy of every variable used in the function. Sometimes a variable needs to be shared between tasks, or between tasks and the driver program. Spark supports two kinds of shared variables: broadcast variables, which cache a value in memory on all nodes, and accumulators, which can only be added to, such as counters and sums. This guide shows how these features are used in each of the languages Spark supports (only the Python part is translated here). If you launch Spark's interactive shell, bin/spark
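
A brief sketch of the two shared-variable types described above. It uses the modern SparkSession entry point, which the translated guide predates, so treat it as an illustration rather than the guide's own example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table cached on every node.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

# Accumulator: tasks may only add to it; the driver reads the final value.
missing = sc.accumulator(0)

def translate(key):
    if key not in lookup.value:
        missing.add(1)
        return -1
    return lookup.value[key]

result = sc.parallelize(["a", "b", "x", "c", "y"]).map(translate).collect()
print(result)          # [1, 2, -1, 3, -1]
print(missing.value)   # 2

spark.stop()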

How to convert multiple parquet files into TFrecord files using SPARK?

不羁岁月 submitted on 2020-02-28 17:24:08
Question: I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found another way than to work in two steps: repartition the dataframe according to my condition, using partitionBy(), and write the resulting partitions to parquet files. Read those parquet files to
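
A rough sketch of the two-step approach the question describes. The "tfrecords" format name and the recordType option come from the spark-tensorflow-connector package and should be checked against the connector version on your classpath; paths and the "label" column are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-tfrecord").getOrCreate()

df = spark.read.parquet("/data/source")

# Step 1: write the data out partitioned by the stratification column.
df.write.partitionBy("label").parquet("/data/stratified_parquet")

# Step 2: read each partition back and rewrite it as TFRecord files.
labels = [r["label"] for r in df.select("label").distinct().collect()]
for label in labels:
    part = spark.read.parquet(f"/data/stratified_parquet/label={label}")
    (part.write
         .format("tfrecords")
         .option("recordType", "Example")
         .save(f"/data/tfrecords/label={label}"))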

PySpark: How to specify column with comma as decimal

核能气质少年 submitted on 2020-02-28 03:05:14
Question: I am working with PySpark and loading a csv file. I have a column with numbers in European format, which means that a comma replaces the dot and vice versa. For example, I have 2.416,67 instead of 2,416.67. My data in the .csv file looks like this:

ID; Revenue
21; 2.645,45
23; 31.147,05
...
55; 1.009,11

In pandas, such a file can easily be read by specifying the decimal=',' and thousands='.' options inside pd.read_csv() to read European formats. Pandas code: import pandas as pd df=pd.read_csv(
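
One common PySpark workaround (an assumption, not the asker's final solution): read the column as a string, strip the thousands separator, swap the decimal comma for a dot, then cast to double. The path and column names are hypothetical, mirroring the sample above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, trim, col

spark = SparkSession.builder.appName("european-decimals").getOrCreate()

df = spark.read.csv("/data/revenue.csv", sep=";", header=True)

# "2.645,45" -> "2645,45" -> "2645.45" -> 2645.45
df = df.withColumn(
    "Revenue",
    regexp_replace(regexp_replace(trim(col("Revenue")), r"\.", ""), ",", ".").cast("double"),
)

df.show()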

convert Dense Vector to Sparse Vector in PySpark

旧巷老猫 submitted on 2020-02-27 12:42:01
Question: Is there a built-in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following: Vectors.sparse(len(denseVector), [(i, j) for i, j in enumerate(denseVector) if j != 0]) That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it? Answer 1: import scipy.sparse from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT from pyspark.sql.functions import udf, col If you have just one dense vector this
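
The answer is cut off above; the sketch below continues along the same lines suggested by its visible imports. The scipy round-trip goes through _convert_to_vector, a private PySpark helper, so it relies on implementation details that may change between versions.

import scipy.sparse
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col

spark = SparkSession.builder.appName("dense-to-sparse").getOrCreate()

dense = Vectors.dense([0.0, 1.5, 0.0, 3.0])

# Single vector: build a scipy column matrix, which _convert_to_vector
# turns into a pyspark SparseVector.
sparse = _convert_to_vector(scipy.sparse.csc_matrix(dense.toArray()).T)
print(sparse)  # (4,[1,3],[1.5,3.0])

# Whole column: wrap the same conversion in a UDF.
to_sparse = udf(lambda v: _convert_to_vector(scipy.sparse.csc_matrix(v.toArray()).T),
                VectorUDT())

df = spark.createDataFrame([(1, dense)], ["id", "features"])
df.withColumn("features_sparse", to_sparse(col("features"))).show(truncate=False)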

outlier detection in pyspark

一个人想着一个人 submitted on 2020-02-27 12:00:18
Question: I have a pyspark data frame as shown below.

+---+-------+--------+
|age|balance|duration|
+---+-------+--------+
|  2|   2143|     261|
| 44|     29|     151|
| 33|      2|      76|
| 50|   1506|      92|
| 33|      1|     198|
| 35|    231|     139|
| 28|    447|     217|
|  2|      2|     380|
| 58|    121|      50|
| 43|    693|      55|
| 41|    270|     222|
| 50|    390|     137|
| 53|      6|     517|
| 58|     71|      71|
| 57|    162|     174|
| 40|    229|     353|
| 45|     13|      98|
| 57|     52|      38|
|  3|      0|     219|
|  4|      0|      54|
+---+-------+--------+

and my expected output should look like,

+---+-------+--------+-
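
The expected-output table is truncated above, so the following is only a generic sketch of one common approach: flag a value as an outlier when it falls outside the 1.5 * IQR fences computed per column with approxQuantile. The flag column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iqr-outliers").getOrCreate()

data = [(2, 2143, 261), (44, 29, 151), (33, 2, 76), (50, 1506, 92), (33, 1, 198),
        (35, 231, 139), (28, 447, 217), (2, 2, 380), (58, 121, 50), (43, 693, 55)]
df = spark.createDataFrame(data, ["age", "balance", "duration"])

numeric_cols = df.columns
for c in numeric_cols:
    # Approximate first and third quartiles (relativeError=0.0 gives exact values).
    q1, q3 = df.approxQuantile(c, [0.25, 0.75], 0.0)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df = df.withColumn(
        f"{c}_outlier",
        F.when((F.col(c) < lower) | (F.col(c) > upper), 1).otherwise(0),
    )

df.show()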