pyspark-sql

What's the difference between --archives, --files, py-files in pyspark job arguments

房东的猫 submitted on 2019-11-27 16:33:46
Question: --archives, --files, --py-files, sc.addFile, and sc.addPyFile are quite confusing; can someone explain these clearly?

Answer 1: These options are truly scattered all over the place. In general, add your data files via --files or --archives and your code files via --py-files. The latter will be added to the classpath (cf. here) so you can import and use them. As you can imagine, the CLI arguments are actually handled by the addFile and addPyFile functions (cf. here). From http://spark.apache.org
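
For illustration, a minimal sketch of the two routes the answer describes; the file names here (lookup.csv, deps.zip, my_job.py) are hypothetical:

    # Equivalent at submit time:
    #   spark-submit --files lookup.csv --py-files deps.zip my_job.py
    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deps-example").getOrCreate()
    sc = spark.sparkContext

    # Ship a data file to every executor; resolve its local path with SparkFiles.get()
    sc.addFile("lookup.csv")
    local_path = SparkFiles.get("lookup.csv")

    # Ship Python code (a .py, .zip, or .egg) so it can be imported on the executors
    sc.addPyFile("deps.zip")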

How to use a subquery for dbtable option in jdbc data source?

风流意气都作罢 submitted on 2019-11-27 16:31:02
Question: I want to use Spark to process some data from a JDBC source. But to begin with, instead of reading original tables from JDBC, I want to run some queries on the JDBC side to filter columns and join tables, and load the query result as a table in Spark SQL. The following syntax to load a raw JDBC table works for me:

    df_table1 = sqlContext.read.format('jdbc').options(
        url="jdbc:mysql://foo.com:3306",
        dbtable="mydb.table1",
        user="me",
        password="******",
        driver="com.mysql.jdbc.Driver"  # mysql JDBC
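
The usual technique (sketched below with the same connection options; the subquery itself is hypothetical) is to pass a parenthesized query with an alias as the dbtable option, so the database runs it and Spark loads only the result:

    subquery = """
        (SELECT t1.id, t1.col_a, t2.col_b
         FROM mydb.table1 t1
         JOIN mydb.table2 t2 ON t1.id = t2.id
         WHERE t2.col_b > 0) AS filtered
    """
    df = sqlContext.read.format('jdbc').options(
        url="jdbc:mysql://foo.com:3306",
        dbtable=subquery,
        user="me",
        password="******",
        driver="com.mysql.jdbc.Driver"
    ).load()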

GroupByKey and create lists of values pyspark sql dataframe

冷暖自知 submitted on 2019-11-27 16:10:52
So I have a spark dataframe that looks like:

    a | b | c
    5 | 2 | 1
    5 | 4 | 3
    2 | 4 | 2
    2 | 3 | 7

And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:

    a | b_list
    5 | (2,4)
    2 | (4,3)

How would I go about doing this with a pyspark sql dataframe? Thank you! :)

Here are the steps to get that DataFrame.

    >>> from pyspark.sql import functions as F
    >>>
    >>> d = [{'a': 5, 'b': 2, 'c': 1}, {'a': 5, 'b': 4, 'c': 3}, {'a': 2, 'b': 4, 'c': 2}, {'a': 2, 'b': 3, 'c': 7}]
    >>> df = spark.createDataFrame(d)
    >>> df.show()
    +---+---+---+
    |  a|  b|  c|
    +--
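
Continuing from the df built above, the grouping step is typically done with collect_list (a minimal sketch; note that collect_list does not guarantee element order):

    >>> grouped = df.groupBy('a').agg(F.collect_list('b').alias('b_list'))
    >>> grouped.show()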

Cannot find col function in pyspark

不打扰是莪最后的温柔 submitted on 2019-11-27 11:56:05
In pyspark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I try to look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?

It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods. If you carefully check the source you'll find col listed among other _functions. This dictionary is further iterated and
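
Roughly, the generation pattern looks like the simplified sketch below; it mirrors the mechanism described above rather than reproducing the actual pyspark source:

    # Simplified sketch: function names are defined dynamically from a dict of
    # name -> docstring, then injected into the module's globals().
    _functions = {
        'col': 'Returns a Column based on the given column name.',
        'lit': 'Creates a Column of literal value.',
    }

    def _create_function(name, doc=""):
        def _(col_name):
            # The real wrapper forwards the call to the JVM; this stub only
            # illustrates the dynamic definition.
            raise NotImplementedError("illustration only")
        _.__name__ = name
        _.__doc__ = doc
        return _

    for _name, _doc in _functions.items():
        globals()[_name] = _create_function(_name, _doc)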

Spark SQL security considerations

感情迁移 submitted on 2019-11-27 08:30:28
Question: What are the security considerations when accepting and executing arbitrary Spark SQL queries? Imagine the following setup. Two files on HDFS are registered as the tables a_secrets and b_secrets:

    # must only be accessed by clients with access to all of customer a's data
    spark.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

    # must only be accessed by clients with access to all of customer b's data
    spark.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")

These two views
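
To make the exposure concrete, a small sketch of why view-level separation alone is not enough (the query string stands in for untrusted user input):

    # Every temp view registered in a SparkSession is reachable from any SQL
    # string executed in that session, so a client that should only see
    # a_secrets can simply ask for the other view.
    untrusted_query = "SELECT * FROM b_secrets"  # hypothetical attacker input
    spark.sql(untrusted_query).show()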

How to skip lines while reading a CSV file as a dataFrame using PySpark?

我只是一个虾纸丫 submitted on 2019-11-27 07:47:25
Question: I have a CSV file that is structured this way:

    Header
    Blank Row
    "Col1","Col2"
    "1,200","1,456"
    "2,000","3,450"

I have two problems in reading this file: I want to ignore the header and the blank row, and the commas within the values are not separators. Here is what I tried:

    df = (sc.textFile("myFile.csv")
          .map(lambda line: line.split(","))               # split by comma
          .filter(lambda line: len(line) == 2).collect())  # this helped me ignore the first two rows

However, this did not work, because the commas
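
One commonly suggested direction (a sketch, not necessarily the accepted answer): drop the first two physical lines by index, then let the DataFrame CSV reader handle the quoted commas and the real header row. This assumes Spark 2.2+, where spark.read.csv accepts an RDD of strings:

    # Skip the junk header and the blank row by line index
    rdd = sc.textFile("myFile.csv") \
            .zipWithIndex() \
            .filter(lambda pair: pair[1] > 1) \
            .map(lambda pair: pair[0])

    # The CSV reader respects the quotes, so "1,200" stays a single value
    df = spark.read.csv(rdd, header=True, quote='"')
    df.show()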

Caching ordered Spark DataFrame creates unwanted job

旧城冷巷雨未停 submitted on 2019-11-27 07:39:15
Question: I want to convert an RDD to a DataFrame and want to cache the results of the RDD:

    from pyspark.sql import *
    from pyspark.sql.types import *
    import pyspark.sql.functions as fn

    schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

    df = spark.createDataFrame(
        sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4),  #.cache(),
        schema=schema,
        verifySchema=False
    ).orderBy("t")  #.cache()

If you don't use a cache function, no job is generated.
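
For reference, a sketch of the two cache placements the commented-out calls hint at, using the same schema as above; whether either avoids the extra job triggered by caching the ordered DataFrame is exactly what the question asks:

    # Variant 1: cache the source RDD before building the ordered DataFrame
    rdd = sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4).cache()
    df1 = spark.createDataFrame(rdd, schema=schema, verifySchema=False).orderBy("t")

    # Variant 2: cache the ordered DataFrame itself
    df2 = spark.createDataFrame(rdd, schema=schema, verifySchema=False).orderBy("t").cache()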

Spark Structured Streaming using sockets, set SCHEMA, Display DATAFRAME in console

左心房为你撑大大i submitted on 2019-11-27 07:33:54
Question: How can I set a schema for a streaming DataFrame in PySpark?

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode
    from pyspark.sql.functions import split
    # Import data types
    from pyspark.sql.types import *

    spark = SparkSession\
        .builder\
        .appName("StructuredNetworkWordCount")\
        .getOrCreate()

    # Create DataFrame representing the stream of input lines from connection to localhost:5560
    lines = spark\
        .readStream\
        .format('socket')\
        .option('host', '192.168.0.113')\
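
Worth noting for context: the socket source always yields a single string column named value, so a schema is normally imposed afterwards by splitting and casting. A minimal sketch, assuming lines has been completed with a port and .load(), and assuming comma-separated "name,age" input:

    from pyspark.sql.functions import split, col

    parsed = lines.select(
        split(col("value"), ",").getItem(0).alias("name"),
        split(col("value"), ",").getItem(1).cast("int").alias("age"),
    )

    query = parsed.writeStream.outputMode("append").format("console").start()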

PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

谁说胖子不能爱 submitted on 2019-11-27 06:46:45
Question: I have a timestamp dataset in the format shown below, and I have written a udf in pyspark to process it and return a Map of key-value pairs, but I am getting the error message below.

Dataset: df_ts_list

    +--------------------+
    |             ts_list|
    +--------------------+
    |[1477411200, 1477...|
    |[1477238400, 1477...|
    |[1477022400, 1477...|
    |[1477224000, 1477...|
    |[1477256400, 1477...|
    |[1477346400, 1476...|
    |[1476986400, 1477...|
    |[1477321200, 1477...|
    |[1477306800, 1477...|
    |[1477062000, 1477...|
    |[1477249200,
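
Since the udf body itself is cut off above, the sketch below only illustrates the usual cause of this error: calling pyspark.sql.functions (which need the driver-side JVM) inside a udf instead of plain Python:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import MapType, StringType, LongType

    # Plain Python builtins inside the udf are fine; calling F.min/F.max or other
    # pyspark.sql.functions here is what typically raises
    # "'NoneType' object has no attribute '_jvm'" on the executors.
    # (Hypothetical udf body, for illustration only.)
    @udf(MapType(StringType(), LongType()))
    def ts_stats(ts_list):
        return {"first": min(ts_list), "last": max(ts_list)}

    result = df_ts_list.withColumn("stats", ts_stats(col("ts_list")))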

Apply a transformation to multiple columns pyspark dataframe

六眼飞鱼酱① submitted on 2019-11-27 06:19:29
Question: Suppose I have the following spark-dataframe:

    +-----+-------+
    | word|  label|
    +-----+-------+
    |  red|  color|
    |  red|  color|
    | blue|  color|
    | blue|feeling|
    |happy|feeling|
    +-----+-------+

Which can be created using the following code:

    sample_df = spark.createDataFrame([
        ('red', 'color'),
        ('red', 'color'),
        ('blue', 'color'),
        ('blue', 'feeling'),
        ('happy', 'feeling')
    ], ('word', 'label')
    )

I can perform a groupBy() to get the counts of each word-label pair:

    sample_df = sample_df.groupBy('word',
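
For the "multiple columns" part of the title, a common pattern is to build the expressions in a comprehension and apply them in one select; a sketch with an assumed upper-casing transformation (not the accepted answer to this question):

    from pyspark.sql import functions as F

    cols_to_transform = ['word', 'label']

    # Apply the same expression to every listed column in a single select
    transformed = sample_df.select(
        *[F.upper(F.col(c)).alias(c) for c in cols_to_transform]
    )
    transformed.show()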