pyspark-sql

pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

Submitted by 余生长醉 on 2019-11-26 17:12:32
Question: I use the docker image sequenceiq/spark on my Mac to study these Spark examples. During the study process I upgraded the Spark inside that image to 1.6.1 according to this answer, and the error occurred when I started the Simple Data Operations example. Here is what happened: when I run df = sqlContext.read.format("jdbc").option("url",url).option("dbtable","people").load() it raises an error, and the full stack trace from the pyspark console is as follows: Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
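The "No suitable driver" error usually means the MySQL JDBC connector is not on Spark's classpath. A minimal sketch of one common fix, assuming a MySQL database and a Connector/J JAR downloaded locally (the JAR path, URL, and credentials below are placeholders):

    # Start pyspark with the connector JAR on the classpath, e.g.:
    #   pyspark --jars /path/to/mysql-connector-java-5.1.38-bin.jar
    # then name the driver class explicitly when loading:
    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:mysql://hostname:3306/dbname?user=xxx&password=xxx")
          .option("dbtable", "people")
          .option("driver", "com.mysql.jdbc.Driver")  # JDBC driver class to load
          .load())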

How to change dataframe column names in pyspark?

Submitted by ☆樱花仙子☆ on 2019-11-26 17:00:44
I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command: df.columns = new_column_name_list However, the same doesn't work for pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following: df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt") oldSchema = df.schema for i,k in enumerate(oldSchema.fields): k.name = new_column_name_list[i] df =
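A minimal sketch of two common ways to rename columns on such a DataFrame, assuming new_column_name_list has one entry per column as in the question (the single-column names below are placeholders):

    # Rename every column at once by building a new DataFrame
    df = df.toDF(*new_column_name_list)

    # Or rename one column at a time
    df = df.withColumnRenamed("_c0", "first_useful_name")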

Cannot find col function in pyspark

Submitted by 佐手、 on 2019-11-26 15:44:35
Question: In pyspark 1.6.2 I can import the col function with from pyspark.sql.functions import col, but when I try to look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist? Answer 1: It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods. If you carefully
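As a rough illustration of that answer (a simplified sketch, not the actual pyspark source), module-level functions like col can be produced by a factory and injected into the module namespace at import time, so they are importable even though no def col(...) appears in functions.py:

    from pyspark import SparkContext
    from pyspark.sql.column import Column

    def _create_function(name):
        # Build a thin wrapper that forwards to the JVM function of the same name
        def _(col_name):
            sc = SparkContext._active_spark_context
            jc = getattr(sc._jvm.functions, name)(col_name)
            return Column(jc)
        _.__name__ = name
        return _

    # Inject the generated wrappers into the module's namespace
    for _name in ["col", "lit", "asc", "desc"]:
        globals()[_name] = _create_function(_name)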

More than one hour to execute pyspark.sql.DataFrame.take(4)

Submitted by 你。 on 2019-11-26 06:48:29
Question: I am running Spark 1.6 on 3 VMs (i.e. 1x master, 2x slaves), all with 4 cores and 16 GB RAM. I can see the workers registered on the spark-master web UI. I want to retrieve data from my Vertica database to work on it. As I didn't manage to run complex queries, I tried dummy queries to understand. We consider here an easy task. My code is: df = sqlContext.read.format('jdbc').options(url='xxxx', dbtable='xxx', user='xxxx', password='xxxx').load() four = df.take(4) And the output is (note: I
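One hedged sketch of a workaround: with a plain dbtable option, Spark can end up pulling far more of the table than the four rows requested, so pushing the limit down to the database as a subquery keeps the work on the Vertica side. The connection details below are placeholders:

    df = (sqlContext.read.format("jdbc")
          .options(url="jdbc:vertica://host:5433/dbname",
                   dbtable="(SELECT * FROM my_table LIMIT 4) AS tmp",  # limit pushed to the database
                   user="xxxx", password="xxxx")
          .load())
    four = df.take(4)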

Split Spark Dataframe string column into multiple columns

Submitted by 余生颓废 on 2019-11-26 03:54:41
Question: I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple: rdd.map(lambda row: row + [row.my_str_col.split('-')]) which takes something looking like:

    col1 | my_str_col
    -----+-----------
      18 | 856-yygrm
     201 | 777-psgdg

and converts it to this:

    col1 | my_str_col | _col3 | _col4
    -----+------------+-------+------
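A minimal sketch of the usual approach without explode, assuming the values always split into two parts on "-":

    from pyspark.sql.functions import split

    split_col = split(df["my_str_col"], "-")             # array column, e.g. ["856", "yygrm"]
    df = (df.withColumn("_col3", split_col.getItem(0))   # first piece
            .withColumn("_col4", split_col.getItem(1)))  # second piece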

Convert pyspark string to date format

Submitted by 折月煮酒 on 2019-11-25 23:49:37
Question: I have a pyspark dataframe with a string column in the format MM-dd-yyyy and I am attempting to convert this into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a column of nulls. Can anyone help? Answer 1: It is possible (preferable?) to do this without a udf: from pyspark.sql.functions import unix_timestamp, from_unixtime df = spark.createDataFrame( [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)], ['date_str'] ) df2 = df.select(
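A sketch completing that select, assuming the sample data above where the strings are MM/dd/yyyy (the format string must match the real data):

    from pyspark.sql.functions import unix_timestamp, from_unixtime

    df2 = df.select(
        "date_str",
        from_unixtime(unix_timestamp("date_str", "MM/dd/yyyy"))  # parse with an explicit format
            .cast("date")
            .alias("date"),
    )
    df2.show()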

Using a column value as a parameter to a spark DataFrame function

Submitted by 风流意气都作罢 on 2019-11-25 23:13:10
Question: Consider the following DataFrame:

    #+------+---+
    #|letter|rpt|
    #+------+---+
    #|     X|  3|
    #|     Y|  1|
    #|     Z|  2|
    #+------+---+

which can be created using the following code: df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"]) Suppose I wanted to repeat each row the number of times specified in the column rpt, just like in this question. One way would be to replicate my solution to that question using the following pyspark-sql query: query = """ SELECT * FROM
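One sketch of the same idea without dropping into a SQL string, assuming Spark 2.4+ where array_repeat is available (expr lets the rpt column be used as the repeat count):

    from pyspark.sql.functions import explode, expr

    df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"])
    # Build an array with `rpt` copies of the letter, then explode it into rows
    repeated = df.withColumn("letter", explode(expr("array_repeat(letter, rpt)")))
    repeated.show()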

How to make good reproducible Apache Spark examples

Submitted by 試著忘記壹切 on 2019-11-25 22:14:04
Question: I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE, but sometimes getting them to show some sample input/output data is like pulling teeth. For example: see the comments on this question. Perhaps part of the problem is that people just don't know how to easily create an MCVE for
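As a sketch of the kind of snippet this question is after, a reproducible Spark example is just small inline data plus the code and the expected output, so anyone can paste and run it:

    # Small inline sample data instead of a reference to a private file or database
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, "c")],
        ["id", "value"],
    )
    df.show()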