pyspark

Spark MLlib - trainImplicit warning

匆匆过客 submitted on 2019-12-21 03:34:13
Question: I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same. All these warnings come from ALS iterations, from flatMap and also from aggregate; for instance, the origin of the stage where the flatMap is showing these warnings (with Spark 1.3.0, but they are also
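The warning itself is often benign, but a commonly suggested mitigation (shown below as a minimal sketch with toy data; the checkpoint path, block count, and ALS parameters are assumptions, not the asker's settings) is to set a checkpoint directory so the lineage shipped with each ALS task stays short, and to tune the number of blocks:

```python
# Minimal sketch only: toy ratings, placeholder checkpoint path, illustrative parameters.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-implicit-sketch")
sc.setCheckpointDir("/tmp/als-checkpoints")  # hypothetical path; lets ALS truncate lineage

# ratings: RDD of (user, product, confidence) triples
ratings = sc.parallelize([(0, 1, 3.0), (0, 2, 1.0), (1, 2, 5.0)]).map(
    lambda r: Rating(int(r[0]), int(r[1]), float(r[2]))
)

model = ALS.trainImplicit(
    ratings,
    rank=10,
    iterations=10,
    lambda_=0.01,
    blocks=8,     # more, smaller blocks -> smaller per-task payloads; tune for real data
    alpha=0.01,
)
```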

Convert a simple one-line string to RDD in Spark

杀马特。学长 韩版系。学妹 submitted on 2019-12-21 03:14:18
Question: I have a simple line: line = "Hello, world" I would like to convert it to an RDD with only one element. I have tried sc.parallelize(line) but I get: sc.parallelize(line).collect() ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd'] Any ideas? Answer 1: Try using a List as the parameter: sc.parallelize(List(line)).collect() It returns res1: Array[String] = Array(hello,world) Answer 2: The code below works fine in Python: sc.parallelize([line]).collect() ['Hello, world'] Here we are passing the
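For reference, a self-contained PySpark version of the accepted approach, wrapping the string in a single-element list before parallelizing, looks roughly like this:

```python
from pyspark import SparkContext

sc = SparkContext(appName="one-line-rdd")

line = "Hello, world"

# Wrapping the string in a list gives an RDD with a single element;
# passing the bare string would be treated as an iterable of characters.
rdd = sc.parallelize([line])
print(rdd.collect())   # ['Hello, world']
print(rdd.count())     # 1
```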

Spark Container & Executor OOMs during `reduceByKey`

荒凉一梦 submitted on 2019-12-21 02:03:45
Question: I'm running a Spark job on Amazon's EMR in client mode with YARN, using pyspark, to process data from two input files (totaling 200 GB in size). The job joins the data together (using reduceByKey), does some maps and filters, and saves it to S3 in Parquet format. While the job uses Dataframes for saving, all of our actual transformations and actions are performed on RDDs. Note: I've included a detailed rundown of my current configurations and values with which I've experimented already after
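The excerpt cuts off before the configuration details, but a minimal sketch of the knobs typically adjusted for this kind of shuffle-heavy OOM follows; the memory sizes, parallelism, and S3 path are illustrative placeholders, not a tuned answer for this job:

```python
# Sketch only: executor memory, off-heap overhead, and shuffle partition count
# are the usual levers for container/executor OOMs during reduceByKey on YARN.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("reduce-by-key-sketch")
    .set("spark.executor.memory", "10g")
    .set("spark.executor.memoryOverhead", "2g")   # spark.yarn.executor.memoryOverhead on older releases
    .set("spark.default.parallelism", "1000")
)
sc = SparkContext(conf=conf)

# Placeholder input path.
pairs = sc.textFile("s3://bucket/input/*").map(lambda l: (l.split(",")[0], 1))

# Passing numPartitions to reduceByKey spreads the shuffle over more, smaller
# partitions, which lowers peak memory per task.
counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=1000)
```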

How to convert Spark Streaming data into Spark DataFrame

两盒软妹~` submitted on 2019-12-21 01:26:13
Question: So far, Spark hasn't created the DataFrame for streaming data, but when I am doing anomaly detection, it is more convenient and faster to use a DataFrame for data analysis. I have done this part, but when I try to do real-time anomaly detection on streaming data, the problems appeared. I tried several ways and still could not convert the DStream to a DataFrame, and cannot convert the RDD inside the DStream into a DataFrame either. Here's part of my latest version of the code: import sys import re
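A common pattern for this (sketched below with an assumed socket source and a single-column schema, since the asker's code is truncated) is to convert each micro-batch RDD to a DataFrame inside foreachRDD:

```python
# Sketch only: the socket source, port, and one-column Row schema are assumptions.
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-to-df-sketch")
spark = SparkSession.builder.getOrCreate()
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)

def process(time, rdd):
    if rdd.isEmpty():
        return
    # Each micro-batch becomes a regular DataFrame that the batch-style
    # anomaly-detection code can run against.
    df = spark.createDataFrame(rdd.map(lambda v: Row(value=v)))
    df.createOrReplaceTempView("batch")
    spark.sql("SELECT COUNT(*) AS n FROM batch").show()

lines.foreachRDD(process)

ssc.start()
ssc.awaitTermination()
```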

In PySpark, how can I log to log4j from inside a transformation

ε祈祈猫儿з submitted on 2019-12-20 21:01:11
Question: I want to log to the standard logger inside an executor during a transformation, with log levels and formatting respected. Unfortunately, I can't get access to the log4j logger object inside the method, as it's not serializable, and the Spark context isn't available inside the transformation. I could just log, outside of the transformation, all of the objects I'm going to touch, but that doesn't really help with debugging or monitoring code execution. def slow_row_contents_fetch(row): rows = fetch_id_row
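One workaround, sketched below rather than taken from an accepted answer, is to skip the JVM log4j object entirely and configure Python's standard logging lazily inside each partition, so messages land in the executor's stderr where YARN log aggregation picks them up; the logger name and messages are hypothetical:

```python
# Sketch only: per-executor Python logging instead of the (non-serializable) log4j logger.
import logging

from pyspark import SparkContext

def process_partition(rows):
    logger = logging.getLogger("my_job")          # hypothetical logger name
    if not logger.handlers:                       # configure once per Python worker
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    for row in rows:
        logger.info("fetching rows for %s", row)  # appears in executor stderr
        yield row

sc = SparkContext(appName="executor-logging-sketch")
result = sc.parallelize(range(10), 2).mapPartitions(process_partition).collect()
```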

Spark and Hive table schema out of sync after external overwrite

回眸只為那壹抹淺笑 submitted on 2019-12-20 20:31:11
Question: I'm having issues with the schema for Hive tables being out of sync between Spark and Hive on a MapR cluster with Spark 2.1.0 and Hive 2.1.1. I need to resolve this problem specifically for managed tables, but the issue can be reproduced with unmanaged/external tables. Overview of steps: Use saveAsTable to save a dataframe to a given table. Use mode("overwrite").parquet("path/to/table") to overwrite the data for the previously saved table. I am actually modifying the data through a
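A common first step for this symptom (a minimal sketch; the table name is a placeholder and the path stands in for the table's actual location) is to refresh Spark's cached metadata after the files are rewritten outside the table API:

```python
# Sketch only: mirrors the two steps from the question, then refreshes the catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

# Step 1 in the question: create/overwrite the table through the table API.
df.write.mode("overwrite").saveAsTable("my_table")

# Step 2: the files behind the table get rewritten directly
# ("path/to/table" is a placeholder for the table's real location).
df.write.mode("overwrite").parquet("path/to/table")

# Ask Spark to drop its cached file listing and schema for the table so
# later reads re-list the rewritten files.
spark.catalog.refreshTable("my_table")
# spark.sql("REFRESH TABLE my_table")       # SQL equivalent
# spark.sql("MSCK REPAIR TABLE my_table")   # for partitioned external tables
```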

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

岁酱吖の submitted on 2019-12-20 20:16:46
Question: I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. With scikit-learn it takes much, much less. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering if the problem lies within my code, since I am fairly new to Spark. Here it is: df = pd.read_csv(http://archive.ics.uci
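Independent of the asker's code, much of the cost comes from the tuning setup itself: every grid point is trained once per cross-validation fold, so runtime grows as grid size times number of folds. A minimal sketch with tiny synthetic data and illustrative grid values:

```python
# Sketch only: toy data and example grid values to show how the work multiplies.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rf-grid-sketch").getOrCreate()

train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)] * 20,
    ["features", "label"],
)

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50])
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)  # 2 x 2 = 4 grid points

cv = CrossValidator(
    estimator=Pipeline(stages=[rf]),
    estimatorParamMaps=grid,
    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
    numFolds=3,  # 4 grid points x 3 folds = 12 forests trained in total
)
model = cv.fit(train_df)
print(model.avgMetrics)
```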

How to change SparkContext properties in Interactive PySpark session

隐身守侯 submitted on 2019-12-20 19:09:50
Question: How can I change spark.driver.maxResultSize in the pyspark interactive shell? I have used the following code: from pyspark import SparkConf, SparkContext conf = (SparkConf() .set("spark.driver.maxResultSize", "10g")) sc.stop() sc=SparkContext(conf) but it gives me the error AttributeError: 'SparkConf' object has no attribute '_get_object_id' Answer 1: So what you're seeing is that the SparkConf isn't a Java object; this is happening because it's trying to use the SparkConf as the first parameter, if
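The truncated answer is pointing at the fix: SparkContext's first positional parameter is the master URL, so the SparkConf must be passed by keyword. A minimal sketch, assuming it runs inside the interactive shell where sc already exists:

```python
# Sketch only: restart the shell's context with the new conf passed as a keyword.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.driver.maxResultSize", "10g")

sc.stop()                      # stop the existing shell context first
sc = SparkContext(conf=conf)   # conf=conf, not SparkContext(conf)
print(sc.getConf().get("spark.driver.maxResultSize"))
```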

Count number of words in a spark dataframe

孤人 submitted on 2019-12-20 15:19:11
Question: How can we find the number of words in a column of a Spark dataframe without using the REPLACE() function of SQL? Below is the code and input I am working with, but the replace() function does not work. from pyspark.sql import SparkSession my_spark = SparkSession \ .builder \ .appName("Python Spark SQL example") \ .enableHiveSupport() \ .getOrCreate() parqFileName = 'gs://caserta-pyspark-eval/train.pqt' tuesdayDF = my_spark.read.parquet(parqFileName) tuesdayDF.createOrReplaceTempView("parquetFile
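A DataFrame-native way to do this without REPLACE() is split() plus size() from pyspark.sql.functions; the sketch below uses a toy DataFrame and an assumed column name `text` rather than the asker's parquet file:

```python
# Sketch only: 'text' is an assumed column name; the data is a toy stand-in.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

df = spark.createDataFrame([("Hello world",), ("one two three",)], ["text"])

# Split on whitespace and count the resulting tokens per row.
with_counts = df.withColumn("word_count", F.size(F.split(F.col("text"), r"\s+")))
with_counts.show()

# Total words across the whole column.
with_counts.agg(F.sum("word_count").alias("total_words")).show()
```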