pyspark

Are Jupyter notebook executors distributed dynamically in Apache Spark?

混江龙づ霸主 submitted on 2020-03-02 09:10:12
Question: I have a question in order to better understand a big data concept within Apache Hadoop/Spark. Not sure if it's off-topic in this forum, but let me know. Imagine an Apache Hadoop cluster with 8 servers managed by the YARN resource manager. I uploaded a file into HDFS (the file system), which is configured with a 64 MB block size and a replication count of 3. That file is then split into blocks of 64 MB. Now let's imagine the blocks are distributed by HDFS onto nodes 1, 2 and 3. But now I'm coding some
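
A minimal sketch (paths and file sizes are hypothetical, not from the question) showing the point the question builds on: by default, the number of input partitions Spark creates for an HDFS file usually mirrors the number of HDFS blocks, and YARN then prefers to place tasks on the nodes that hold those blocks.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("block-locality-check")
         .getOrCreate())

# Assume the file from the question was uploaded to this (hypothetical) HDFS path.
rdd = spark.sparkContext.textFile("hdfs:///data/large_input.txt")

# With a 64 MB block size, a ~192 MB file would typically yield 3 partitions,
# one per block; tasks are then preferentially scheduled on the nodes storing them.
print(rdd.getNumPartitions())

spark.stop()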

PySpark: Programming on Spark with Python

谁说我不能喝 submitted on 2020-03-01 13:03:03
Python Programming Guide The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python. Key Differences in the Python API There are a few key differences between the Python and Scala APIs: Python is dynamically typed, so RDDs can hold objects of multiple types. PySpark does not yet support a few API calls, such as lookup and non-text input files, though
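
A small illustration (not from the guide itself) of the dynamic-typing point above: because Python is dynamically typed, a single PySpark RDD can hold values of several types at once.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-type-rdd").getOrCreate()
sc = spark.sparkContext

# One RDD containing an int, a string, a float, a tuple, and a dict.
mixed = sc.parallelize([1, "two", 3.0, (4, "four"), {"five": 5}])

# Transformations work as long as the function handles each element's type.
print(mixed.map(lambda x: type(x).__name__).collect())
# ['int', 'str', 'float', 'tuple', 'dict']

spark.stop()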

Spark Programming Guide (Python Edition)

拈花ヽ惹草 submitted on 2020-02-29 08:04:28
This post is translated from the official Spark documentation ( http://spark.apache.org ). Since Spark is updated frequently, some of the APIs may already be out of date; this post is for reference only, so defer to the official documentation for your version and to runtime messages. Overview: At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster and processed in parallel. RDDs can be created by opening a file on HDFS (or another Hadoop-supported file system), by parallelizing an existing Scala collection in the driver program, or by transforming another RDD. Users can ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations; in addition, RDDs recover automatically when a node fails. The other abstraction Spark provides is shared variables that can be used in parallel operations. By default, when Spark runs a function as a set of tasks on different nodes, each task gets its own copy of every variable used in the function. Sometimes a variable needs to be shared between tasks, or between tasks and the driver program. Spark supports two kinds of shared variables: broadcast variables, which cache a value in memory on all nodes, and accumulators, which can only be added to, such as counters and sums. This guide shows how these features are used in each of the languages Spark supports (only the Python part is translated here). If you launch Spark's interactive shell, bin/spark
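
A brief sketch of the two shared-variable types described above. It uses the modern SparkSession entry point, which the translated guide predates, so treat it as an illustration rather than the guide's own example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table cached on every node.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

# Accumulator: tasks may only add to it; the driver reads the final value.
missing = sc.accumulator(0)

def translate(key):
    if key not in lookup.value:
        missing.add(1)
        return -1
    return lookup.value[key]

result = sc.parallelize(["a", "b", "x", "c", "y"]).map(translate).collect()
print(result)          # [1, 2, -1, 3, -1]
print(missing.value)   # 2

spark.stop()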

How to convert multiple parquet files into TFrecord files using SPARK?

不羁岁月 submitted on 2020-02-28 17:24:08
Question: I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found another way than to work in two steps: repartition the dataframe according to my condition, using partitionBy(), and write the resulting partitions to parquet files. Read those parquet files to
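
A rough sketch of the two-step approach the question describes. The "tfrecords" format name and the recordType option come from the spark-tensorflow-connector package and should be checked against the connector version on your classpath; paths and the "label" column are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-tfrecord").getOrCreate()

df = spark.read.parquet("/data/source")

# Step 1: write the data out partitioned by the stratification column.
df.write.partitionBy("label").parquet("/data/stratified_parquet")

# Step 2: read each partition back and rewrite it as TFRecord files.
labels = [r["label"] for r in df.select("label").distinct().collect()]
for label in labels:
    part = spark.read.parquet(f"/data/stratified_parquet/label={label}")
    (part.write
         .format("tfrecords")
         .option("recordType", "Example")
         .save(f"/data/tfrecords/label={label}"))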

PySpark: How to specify column with comma as decimal

核能气质少年 submitted on 2020-02-28 03:05:14
Question: I am working with PySpark and loading a csv file. I have a column with numbers in European format, which means that a comma replaces the dot and vice versa. For example, I have 2.416,67 instead of 2,416.67. My data in the .csv file looks like this:

ID; Revenue
21; 2.645,45
23; 31.147,05
...
55; 1.009,11

In pandas, such a file can easily be read by specifying the decimal=',' and thousands='.' options inside pd.read_csv() to read European formats. Pandas code: import pandas as pd df=pd.read_csv(
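
One common PySpark workaround (an assumption, not the asker's final solution): read the column as a string, strip the thousands separator, swap the decimal comma for a dot, then cast to double. The path and column names are hypothetical, mirroring the sample above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, trim, col

spark = SparkSession.builder.appName("european-decimals").getOrCreate()

df = spark.read.csv("/data/revenue.csv", sep=";", header=True)

# "2.645,45" -> "2645,45" -> "2645.45" -> 2645.45
df = df.withColumn(
    "Revenue",
    regexp_replace(regexp_replace(trim(col("Revenue")), r"\.", ""), ",", ".").cast("double"),
)

df.show()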

convert Dense Vector to Sparse Vector in PySpark

旧巷老猫 submitted on 2020-02-27 12:42:01
Question: Is there a built-in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following: Vectors.sparse(len(denseVector), [(i, j) for i, j in enumerate(denseVector) if j != 0]) That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it? Answer 1: import scipy.sparse from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT from pyspark.sql.functions import udf, col If you have just one dense vector this
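
The answer is cut off above; the sketch below continues along the same lines suggested by its visible imports. The scipy round-trip goes through _convert_to_vector, a private PySpark helper, so it relies on implementation details that may change between versions.

import scipy.sparse
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col

spark = SparkSession.builder.appName("dense-to-sparse").getOrCreate()

dense = Vectors.dense([0.0, 1.5, 0.0, 3.0])

# Single vector: build a scipy column matrix, which _convert_to_vector
# turns into a pyspark SparseVector.
sparse = _convert_to_vector(scipy.sparse.csc_matrix(dense.toArray()).T)
print(sparse)  # (4,[1,3],[1.5,3.0])

# Whole column: wrap the same conversion in a UDF.
to_sparse = udf(lambda v: _convert_to_vector(scipy.sparse.csc_matrix(v.toArray()).T),
                VectorUDT())

df = spark.createDataFrame([(1, dense)], ["id", "features"])
df.withColumn("features_sparse", to_sparse(col("features"))).show(truncate=False)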

outlier detection in pyspark

一个人想着一个人 submitted on 2020-02-27 12:00:18
Question: I have a pyspark data frame as shown below.

+---+-------+--------+
|age|balance|duration|
+---+-------+--------+
|  2|   2143|     261|
| 44|     29|     151|
| 33|      2|      76|
| 50|   1506|      92|
| 33|      1|     198|
| 35|    231|     139|
| 28|    447|     217|
|  2|      2|     380|
| 58|    121|      50|
| 43|    693|      55|
| 41|    270|     222|
| 50|    390|     137|
| 53|      6|     517|
| 58|     71|      71|
| 57|    162|     174|
| 40|    229|     353|
| 45|     13|      98|
| 57|     52|      38|
|  3|      0|     219|
|  4|      0|      54|
+---+-------+--------+

and my expected output should look like,

+---+-------+--------+-
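
The expected-output table is truncated above, so the following is only a generic sketch of one common approach: flag a value as an outlier when it falls outside the 1.5 * IQR fences computed per column with approxQuantile. The flag column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iqr-outliers").getOrCreate()

data = [(2, 2143, 261), (44, 29, 151), (33, 2, 76), (50, 1506, 92), (33, 1, 198),
        (35, 231, 139), (28, 447, 217), (2, 2, 380), (58, 121, 50), (43, 693, 55)]
df = spark.createDataFrame(data, ["age", "balance", "duration"])

numeric_cols = df.columns
for c in numeric_cols:
    # Approximate first and third quartiles (relativeError=0.0 gives exact values).
    q1, q3 = df.approxQuantile(c, [0.25, 0.75], 0.0)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df = df.withColumn(
        f"{c}_outlier",
        F.when((F.col(c) < lower) | (F.col(c) > upper), 1).otherwise(0),
    )

df.show()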