pyspark

Low JDBC write speed from Spark to MySQL

Submitted by ぐ巨炮叔叔 on 2020-01-09 19:08:08
Question: I need to write about 1 million rows from a Spark DataFrame to MySQL, but the insert is too slow. How can I improve it? Code below:

    df = sqlContext.createDataFrame(rdd, schema)
    df.write.jdbc(url='xx', table='xx', mode='overwrite')

Answer 1: The answer in https://stackoverflow.com/a/10617768/3318517 worked for me. Add rewriteBatchedStatements=true to the connection URL. (See Configuration Properties for Connector/J.) My benchmark went from 3325 seconds to 42 seconds!

Source: https://stackoverflow.com
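
A minimal sketch of what that change looks like in practice, assuming a Connector/J JDBC URL; the host, database, batch size and partition count below are illustrative placeholders, not values from the thread:

    # Hypothetical host/database. rewriteBatchedStatements=true lets Connector/J
    # collapse the batched single-row INSERTs Spark issues into multi-row statements.
    url = "jdbc:mysql://db-host:3306/mydb?rewriteBatchedStatements=true"

    (df.repartition(8)                      # a handful of parallel writers
       .write
       .option("batchsize", 10000)          # rows sent per JDBC batch
       .jdbc(url=url, table="xx", mode="overwrite"))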

Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

Submitted by 廉价感情. on 2020-01-09 09:18:54
Question: I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/data.parquet', 's3a://dev/2017/01/02/data.parquet']
    df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like sparkSql to load as many of the files as it can and simply skip the ones that are missing.
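
One possible workaround, sketched here rather than taken from the thread: check each path through Hadoop's FileSystem API (reached via Spark's py4j gateway, so it relies on the internal _jsc/_jvm attributes) and pass only the paths that exist to the reader. The helper name path_exists is ours.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()

    def path_exists(path):
        # Resolve the filesystem for the path (s3a here) and ask whether it exists.
        p = sc._jvm.org.apache.hadoop.fs.Path(path)
        return p.getFileSystem(hadoop_conf).exists(p)

    files = ['s3a://dev/2017/01/03/data.parquet', 's3a://dev/2017/01/02/data.parquet']
    existing = [f for f in files if path_exists(f)]
    df = spark.read.parquet(*existing)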

How do I get Python libraries in pyspark?

Submitted by 梦想的初衷 on 2020-01-09 09:07:06
Question: I want to use the matplotlib.bblpath or shapely.geometry libraries in pyspark. When I try to import either of them I get the error below:

    >>> from shapely.geometry import polygon
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named shapely.geometry

I know the module isn't present, but how can these packages be brought to my pyspark libraries?

Answer 1: In the Spark context try using:

    SparkContext.addPyFile("module.py")  # also .zip

quoting from the docs:
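
A minimal sketch of the addPyFile route, with placeholder paths and a hypothetical module name; note that packages with compiled extensions (such as shapely or matplotlib) generally need to be installed on every worker node rather than shipped as a zip:

    from pyspark import SparkContext

    sc = SparkContext(appName="deps-example")
    sc.addPyFile("/path/to/my_module.zip")   # or pass --py-files to spark-submit

    def use_dependency(x):
        # Import inside the function so the lookup happens on the executor,
        # after the shipped file has been added to its sys.path.
        import my_module                     # hypothetical module inside the zip
        return my_module.transform(x)        # hypothetical function

    result = sc.parallelize([1, 2, 3]).map(use_dependency).collect()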

pyspark dataframe filter or include based on list

Submitted by 自古美人都是妖i on 2020-01-09 04:47:29
Question: I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the list or include only those records with a value in the list. My code below does not work:

    # define a dataframe
    rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
    df = sqlContext.createDataFrame(rdd, ["id", "score"])
    # define a list of scores
    l = [10,18,20]
    # filter out records by scores by list l
    records = df.filter(df.score in l)
    # expected: (0,1), (0,1), (0
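
A sketch of the usual fix: Python's `in` operator cannot be applied to a Column, but Column.isin can express both directions of the filter.

    from pyspark.sql import functions as F

    l = [10, 18, 20]
    in_list = df.filter(F.col("score").isin(l))       # keep rows whose score is in l
    not_in_list = df.filter(~F.col("score").isin(l))  # keep rows whose score is not in l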

Avoid performance impact of a single partition mode in Spark window functions

Submitted by 血红的双手。 on 2020-01-08 17:42:07
Question: My question is triggered by the use case of calculating the differences between consecutive rows in a Spark dataframe. For example, I have:

    >>> df.show()
    +-----+----------+
    |index|      col1|
    +-----+----------+
    |  0.0|0.58734024|
    |  1.0|0.67304325|
    |  2.0|0.85154736|
    |  3.0| 0.5449719|
    +-----+----------+

If I choose to calculate these using "Window" functions, then I can do it like so:

    >>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
    >>> import pyspark.sql.functions as f
    >>>
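
For reference, a minimal sketch of the window-function version of the consecutive-row difference; it reproduces the behaviour the question is worried about, since ordering a window without a real partitioning column makes Spark warn that it is moving all data to a single partition:

    from pyspark.sql import Window
    import pyspark.sql.functions as f

    win = Window.orderBy("index")   # no partitionBy -> single-partition window
    diffs = df.withColumn("diff", f.col("col1") - f.lag("col1", 1).over(win))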

'PipelinedRDD' object has no attribute 'toDF' in PySpark

Submitted by 烈酒焚心 on 2020-01-08 12:24:31
Question: I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is:

    from pyspark.mllib.util import MLUtils
    from pyspark import SparkContext

    sc = SparkContext("local", "Teste Original")
    data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

and I'm running it with:

    ./spark-submit my_script.py

And I get the error:

    Traceback (most recent
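
A sketch of the usual resolution, assuming the script creates no SQLContext: toDF is only attached to RDDs once a SQLContext (or, in later versions, a SparkSession) has been constructed, so creating one before the call makes the method available.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.mllib.util import MLUtils

    sc = SparkContext("local", "Teste Original")
    sqlContext = SQLContext(sc)   # creating it is what patches .toDF() onto RDDs
    data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()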

How to convert json to pyspark dataframe (faster implementation) [duplicate]

Submitted by 99封情书 on 2020-01-07 03:47:06
Question: This question already has answers here: reading json file in pyspark (3 answers). Closed 2 years ago. I have json data in the form {'abc':1, 'def':2, 'ghi':3}. How do I convert it into a pyspark dataframe in Python?

Answer 1:

    import json

    j = {'abc':1, 'def':2, 'ghi':3}
    a = [json.dumps(j)]
    jsonRDD = sc.parallelize(a)
    df = spark.read.json(jsonRDD)

    >>> df.show()
    +---+---+---+
    |abc|def|ghi|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+

Source: https://stackoverflow.com/questions/44456076/how-to-convert-json-to-pyspark
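
A possible shortcut for a single small dict, assuming Spark 2.x with a SparkSession bound to `spark` (our addition, not part of the answer above): build the Row directly and skip the round trip through an RDD of JSON strings.

    from pyspark.sql import Row

    j = {'abc': 1, 'def': 2, 'ghi': 3}
    df = spark.createDataFrame([Row(**j)])
    df.show()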

Add column to Data Frame conditionally in Pyspark

Submitted by 爷，独闯天下 on 2020-01-07 03:26:30
Question: I have a data frame in PySpark. I would like to add a column to the data frame conditionally: if the data frame doesn't have the column, then add a column with null values; if the column is present, then do nothing and return the same data frame as a new data frame. How do I express this conditional statement in PySpark?

Answer 1: It is not hard, but you'll need a bit more than a column name to do it right. Required imports:

    from pyspark.sql import types as t
    from pyspark.sql.functions import lit
    from
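
The answer text is cut off above; the following is our sketch of the idea it starts to describe, reusing the imports it lists. The helper name and the default type are ours.

    from pyspark.sql import types as t
    from pyspark.sql.functions import lit

    def with_column_if_missing(df, name, dtype=t.StringType()):
        """Return df unchanged if `name` already exists, otherwise add it filled with nulls."""
        if name in df.columns:
            return df
        return df.withColumn(name, lit(None).cast(dtype))

    df2 = with_column_if_missing(df, "new_col", t.DoubleType())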

How to correctly get the weights using spark for synthetic dataset?

Submitted by 六月ゝ 毕业季﹏ on 2020-01-07 03:15:14
Question: I'm running LogisticRegressionWithSGD on Spark for a synthetic dataset. I've calculated the error in MATLAB using vanilla gradient descent and in R, and it is ~5%; I also got weights similar to the ones used in the model that generated y. The dataset was generated using this example. While I am able to get a very close error rate in the end by tuning the step size, the weights for the individual features aren't the same. In fact, they vary a lot. I tried LBFGS for Spark and it's able to predict both
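
An illustrative comparison rather than the thread's answer, assuming `training_rdd` is the RDD of LabeledPoint rows built from the synthetic data: train both optimisers on the same input and inspect the learned weights. SGD's weights are sensitive to the step size and to feature scaling, which is one common reason they drift from the generating weights while LBFGS recovers them.

    from pyspark.mllib.classification import (
        LogisticRegressionWithSGD,
        LogisticRegressionWithLBFGS,
    )

    # training_rdd: RDD[LabeledPoint], assumed to already exist.
    sgd_model = LogisticRegressionWithSGD.train(training_rdd, iterations=200, step=1.0)
    lbfgs_model = LogisticRegressionWithLBFGS.train(training_rdd, iterations=200)

    print("SGD weights:  ", sgd_model.weights)
    print("LBFGS weights:", lbfgs_model.weights)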