pyspark

Low JDBC write speed from Spark to MySQL

Submitted by ぐ巨炮叔叔 on 2020-01-09 19:08:08
Question: I need to write about 1 million rows from a Spark DataFrame to MySQL, but the insert is too slow. How can I improve it? Code below:

    df = sqlContext.createDataFrame(rdd, schema)
    df.write.jdbc(url='xx', table='xx', mode='overwrite')

Answer 1: The answer in https://stackoverflow.com/a/10617768/3318517 worked for me. Add rewriteBatchedStatements=true to the connection URL. (See Configuration Properties for Connector/J.) My benchmark went from 3325 seconds to 42 seconds!

Source: https://stackoverflow.com
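
A minimal sketch of what that change looks like in practice, assuming a Connector/J JDBC URL; the host, database, batch size and partition count below are illustrative placeholders, not values from the thread:

    # Hypothetical host/database. rewriteBatchedStatements=true lets Connector/J
    # collapse the batched single-row INSERTs Spark issues into multi-row statements.
    url = "jdbc:mysql://db-host:3306/mydb?rewriteBatchedStatements=true"

    (df.repartition(8)                      # a handful of parallel writers
       .write
       .option("batchsize", 10000)          # rows sent per JDBC batch
       .jdbc(url=url, table="xx", mode="overwrite"))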

Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

Submitted by 廉价感情. on 2020-01-09 09:18:54
Question: I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/data.parquet', 's3a://dev/2017/01/02/data.parquet']
    df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like sparkSql to load as many of the files as it can and simply skip the ones that are missing.
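
One possible workaround, sketched here rather than taken from the thread: check each path through Hadoop's FileSystem API (reached via Spark's py4j gateway, so it relies on the internal _jsc/_jvm attributes) and pass only the paths that exist to the reader. The helper name path_exists is ours.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()

    def path_exists(path):
        # Resolve the filesystem for the path (s3a here) and ask whether it exists.
        p = sc._jvm.org.apache.hadoop.fs.Path(path)
        return p.getFileSystem(hadoop_conf).exists(p)

    files = ['s3a://dev/2017/01/03/data.parquet', 's3a://dev/2017/01/02/data.parquet']
    existing = [f for f in files if path_exists(f)]
    df = spark.read.parquet(*existing)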

How do I get Python libraries in pyspark?

Submitted by 梦想的初衷 on 2020-01-09 09:07:06
Question: I want to use the matplotlib.bblpath or shapely.geometry libraries in pyspark. When I try to import either of them I get the error below:

    >>> from shapely.geometry import polygon
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named shapely.geometry

I know the module isn't present, but how can these packages be brought to my pyspark libraries?

Answer 1: In the Spark context try using:

    SparkContext.addPyFile("module.py")  # also .zip

quoting from the docs:
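
A minimal sketch of the addPyFile route, with placeholder paths and a hypothetical module name; note that packages with compiled extensions (such as shapely or matplotlib) generally need to be installed on every worker node rather than shipped as a zip:

    from pyspark import SparkContext

    sc = SparkContext(appName="deps-example")
    sc.addPyFile("/path/to/my_module.zip")   # or pass --py-files to spark-submit

    def use_dependency(x):
        # Import inside the function so the lookup happens on the executor,
        # after the shipped file has been added to its sys.path.
        import my_module                     # hypothetical module inside the zip
        return my_module.transform(x)        # hypothetical function

    result = sc.parallelize([1, 2, 3]).map(use_dependency).collect()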

pyspark dataframe filter or include based on list

Submitted by 自古美人都是妖i on 2020-01-09 04:47:29
Question: I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the list or include only those records with a value in the list. My code below does not work:

    # define a dataframe
    rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
    df = sqlContext.createDataFrame(rdd, ["id", "score"])
    # define a list of scores
    l = [10,18,20]
    # filter out records by scores by list l
    records = df.filter(df.score in l)
    # expected: (0,1), (0,1), (0
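
A sketch of the usual fix: Python's `in` operator cannot be applied to a Column, but Column.isin can express both directions of the filter.

    from pyspark.sql import functions as F

    l = [10, 18, 20]
    in_list = df.filter(F.col("score").isin(l))       # keep rows whose score is in l
    not_in_list = df.filter(~F.col("score").isin(l))  # keep rows whose score is not in l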

Avoid performance impact of a single partition mode in Spark window functions

Submitted by 血红的双手。 on 2020-01-08 17:42:07
Question: My question is triggered by the use case of calculating the differences between consecutive rows in a Spark dataframe. For example, I have:

    >>> df.show()
    +-----+----------+
    |index|      col1|
    +-----+----------+
    |  0.0|0.58734024|
    |  1.0|0.67304325|
    |  2.0|0.85154736|
    |  3.0| 0.5449719|
    +-----+----------+

If I choose to calculate these using "Window" functions, then I can do it like so:

    >>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
    >>> import pyspark.sql.functions as f
    >>>
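
For reference, a minimal sketch of the window-function version of the consecutive-row difference; it reproduces the behaviour the question is worried about, since ordering a window without a real partitioning column makes Spark warn that it is moving all data to a single partition:

    from pyspark.sql import Window
    import pyspark.sql.functions as f

    win = Window.orderBy("index")   # no partitionBy -> single-partition window
    diffs = df.withColumn("diff", f.col("col1") - f.lag("col1", 1).over(win))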

'PipelinedRDD' object has no attribute 'toDF' in PySpark

Submitted by 烈酒焚心 on 2020-01-08 12:24:31
Question: I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is:

    from pyspark.mllib.util import MLUtils
    from pyspark import SparkContext

    sc = SparkContext("local", "Teste Original")
    data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

and I'm running it with:

    ./spark-submit my_script.py

And I get the error:

    Traceback (most recent
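
A sketch of the usual resolution, assuming the script creates no SQLContext: toDF is only attached to RDDs once a SQLContext (or, in later versions, a SparkSession) has been constructed, so creating one before the call makes the method available.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.mllib.util import MLUtils

    sc = SparkContext("local", "Teste Original")
    sqlContext = SQLContext(sc)   # creating it is what patches .toDF() onto RDDs
    data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()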

How to convert json to pyspark dataframe (faster implementation) [duplicate]

Submitted by 99封情书 on 2020-01-07 03:47:06
Question: This question already has answers here: reading json file in pyspark (3 answers). Closed 2 years ago. I have json data in the form {'abc':1, 'def':2, 'ghi':3}. How do I convert it into a pyspark dataframe in Python?

Answer 1:

    import json

    j = {'abc':1, 'def':2, 'ghi':3}
    a = [json.dumps(j)]
    jsonRDD = sc.parallelize(a)
    df = spark.read.json(jsonRDD)

    >>> df.show()
    +---+---+---+
    |abc|def|ghi|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+

Source: https://stackoverflow.com/questions/44456076/how-to-convert-json-to-pyspark
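
A possible shortcut for a single small dict, assuming Spark 2.x with a SparkSession bound to `spark` (our addition, not part of the answer above): build the Row directly and skip the round trip through an RDD of JSON strings.

    from pyspark.sql import Row

    j = {'abc': 1, 'def': 2, 'ghi': 3}
    df = spark.createDataFrame([Row(**j)])
    df.show()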

Add column to Data Frame conditionally in Pyspark

Submitted by 爷，独闯天下 on 2020-01-07 03:26:30
Question: I have a data frame in PySpark. I would like to add a column to the data frame conditionally: if the data frame doesn't have the column, then add a column with null values; if the column is present, then do nothing and return the same data frame as a new data frame. How do I express this conditional statement in PySpark?

Answer 1: It is not hard, but you'll need a bit more than a column name to do it right. Required imports:

    from pyspark.sql import types as t
    from pyspark.sql.functions import lit
    from
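
The answer text is cut off above; the following is our sketch of the idea it starts to describe, reusing the imports it lists. The helper name and the default type are ours.

    from pyspark.sql import types as t
    from pyspark.sql.functions import lit

    def with_column_if_missing(df, name, dtype=t.StringType()):
        """Return df unchanged if `name` already exists, otherwise add it filled with nulls."""
        if name in df.columns:
            return df
        return df.withColumn(name, lit(None).cast(dtype))

    df2 = with_column_if_missing(df, "new_col", t.DoubleType())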

How to correctly get the weights using spark for synthetic dataset?

Submitted by 六月ゝ 毕业季﹏ on 2020-01-07 03:15:14
Question: I'm running LogisticRegressionWithSGD on Spark for a synthetic dataset. I've calculated the error in MATLAB using vanilla gradient descent and in R, and it is ~5%; I also got weights similar to the ones used in the model that generated y. The dataset was generated using this example. While I am able to get a very close error rate in the end by tuning the step size, the weights for the individual features aren't the same. In fact, they vary a lot. I tried LBFGS for Spark and it's able to predict both
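
An illustrative comparison rather than the thread's answer, assuming `training_rdd` is the RDD of LabeledPoint rows built from the synthetic data: train both optimisers on the same input and inspect the learned weights. SGD's weights are sensitive to the step size and to feature scaling, which is one common reason they drift from the generating weights while LBFGS recovers them.

    from pyspark.mllib.classification import (
        LogisticRegressionWithSGD,
        LogisticRegressionWithLBFGS,
    )

    # training_rdd: RDD[LabeledPoint], assumed to already exist.
    sgd_model = LogisticRegressionWithSGD.train(training_rdd, iterations=200, step=1.0)
    lbfgs_model = LogisticRegressionWithLBFGS.train(training_rdd, iterations=200)

    print("SGD weights:  ", sgd_model.weights)
    print("LBFGS weights:", lbfgs_model.weights)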