spark-dataframe

How to filter column on values in list in pyspark?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-12 10:43:09
Question: I have a dataframe rawdata on which I have to apply a filter condition on column X with the values CB, CI and CR. So I used the code below:

df = dfRawData.filter(col("X").between("CB","CI","CR"))

But I am getting the following error:

between() takes exactly 3 arguments (4 given)

Please let me know how I can resolve this issue.

Answer 1: between is used to check whether a value lies between two values; its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To
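The answer is cut off above; the usual fix it is leading toward is Column.isin, which checks membership in a list of values. A minimal PySpark sketch, with the column name and values taken from the question and hypothetical sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data standing in for the dfRawData in the question.
dfRawData = spark.createDataFrame([("CB",), ("CI",), ("CR",), ("ZZ",)], ["X"])

# isin checks membership in a list of values; between only takes a lower and an upper bound.
df = dfRawData.filter(col("X").isin("CB", "CI", "CR"))
df.show()
```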

Fetching esJsonRDD from elasticsearch with complex filtering in Spark

Submitted by 大城市里の小女人 on 2019-12-12 10:23:19
Question: I am currently fetching the elasticsearch RDD in our Spark job, filtering with a one-line elastic query such as (example):

val elasticRdds = sparkContext.esJsonRDD(esIndex, s"?default_operator=AND&q=director.name:DAVID + \n movie.name:SEVEN")

Now if our search query becomes complex, like:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "default_operator": "AND",
          "query": "director.name:DAVID + \n movie.name:SEVEN"
        }
      },
      "filter": {
        "nested": {
          "path": "movieStatus.boxoffice.status",
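The excerpt stops mid-query, but the general idea behind answers to this kind of question is that elasticsearch-hadoop accepts a full query-DSL JSON body in place of a URI-style query. A hedged sketch of that idea through the DataFrame reader (a different entry point than esJsonRDD, shown in PySpark for consistency with the other examples); the index name and query body are placeholders and the elasticsearch-hadoop connector must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A full query-DSL body can be supplied via es.query instead of a ?q=... URI query.
query = """{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "director.name:DAVID AND movie.name:SEVEN"
    }
  }
}"""

df = (spark.read
      .format("org.elasticsearch.spark.sql")  # elasticsearch-hadoop Spark SQL data source
      .option("es.query", query)
      .load("movies"))                        # placeholder index name
```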

Add leading zeros to Columns in a Spark Data Frame [duplicate]

Submitted by 梦想的初衷 on 2019-12-12 09:59:53
Question: This question already has an answer here: Prepend zeros to a value in PySpark (1 answer). Closed last year.

In short, I'm leveraging spark-xml to do some parsing of XML files. However, using it removes the leading zeros in all the values I'm interested in. I need the final output, which is a DataFrame, to include those leading zeros, and I'm unsure of / cannot figure out a way to add leading zeros back to the columns I'm interested in.

val df = spark.read
  .format("com.databricks.spark.xml")
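The excerpt ends before the answer; a common approach (an assumption here, not necessarily the accepted answer) is to treat the affected columns as strings and left-pad them with lpad. A PySpark sketch with a hypothetical column name and pad width:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lpad, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; in the question the values come from spark-xml and have lost their leading zeros.
df = spark.createDataFrame([(123,), (45,)], ["account"])

# Cast to string and left-pad with zeros to a fixed width (width 8 is an assumed example).
df = df.withColumn("account", lpad(col("account").cast("string"), 8, "0"))
df.show()
```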

Should I Avoid groupby() in Dataset/Dataframe? [duplicate]

Submitted by 强颜欢笑 on 2019-12-12 09:59:21
Question: This question already has an answer here: DataFrame / Dataset groupBy behaviour/optimization (1 answer). Closed last year.

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since those methods reduce first on each partition before combining the results, which reduces the amount of data being shuffled. Now, my question is whether this still applies to Dataset/DataFrame? I was thinking that
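The linked duplicate's conclusion (paraphrased) is that DataFrame groupBy followed by agg compiles to partial aggregation, so it does not carry groupByKey's full-shuffle penalty. A minimal PySpark illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# groupBy().agg() aggregates within each partition before shuffling,
# similar in spirit to reduceByKey on RDDs.
agg = df.groupBy("key").agg(sum_("value").alias("total"))
agg.explain()  # the physical plan shows a partial HashAggregate before the exchange
agg.show()
```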

Is it better for Spark to select from hive or select from file

Submitted by 无人久伴 on 2019-12-12 08:42:03
Question: I was just wondering what people's thoughts are on reading from Hive vs. reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why?

Mike

Answer 1: tl;dr: I would read it straight from the parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row × 100-column table, some timings I've recorded are
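The answer trails off before the timings; the two access paths being compared look roughly like the sketch below (modern SparkSession API rather than the Spark 1.5 HiveContext used in the answer; database, table, and path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: go through the Hive metastore (external table over the same Parquet files).
df_hive = spark.table("mydb.mytable")             # placeholder database/table

# Option 2: read the underlying Parquet files directly, as the answer recommends.
df_parquet = spark.read.parquet("/data/mytable")  # placeholder path
```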

Filter rows by distinct values in one column in PySpark

Submitted by 折月煮酒 on 2019-12-12 08:23:34
Question: Let's say I have the following table:

+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|    tia1.eskimo.com |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history
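The excerpt ends before the question is fully stated, so the exact intent is ambiguous; if the goal is one row per distinct value of a column (a common reading of the title), dropDuplicates is the usual tool. A hedged PySpark sketch using the host column from the sample table, with made-up rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.createDataFrame(
    [("js002.cc", "/a", 404), ("js002.cc", "/b", 404), ("tia1.eskimo.com", "/c", 404)],
    ["host", "path", "status"],
)

# Keep a single (arbitrary) row per distinct host.
deduped = logs.dropDuplicates(["host"])
deduped.show()
```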

Pyspark read delta/upsert dataset from csv files

Submitted by ぃ、小莉子 on 2019-12-12 08:17:08
Question: I have a dataset that is updated periodically, which I receive as a series of CSV files describing the changes. I'd like a DataFrame that contains only the latest version of each row. Is there a way to load the whole dataset in Spark/PySpark that allows for parallelism?

Example:

File 1 (Key, Value)
1,ABC
2,DEF
3,GHI

File 2 (Key, Value)
2,XYZ
4,UVW

File 3 (Key, Value)
3,JKL
4,MNO

Should result in:
1,ABC
2,XYZ
3,JKL
4,MNO

I know I could do this by loading each file sequentially and then using an
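The excerpt stops before the answer; one way (an assumption, not the accepted answer) to get the latest row per key in a single parallel load is to read all files at once, tag each row with its source file, and keep the newest row per key with a window function. A sketch assuming the file names sort in delivery order (file1.csv, file2.csv, ...) under a placeholder path:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import input_file_name, row_number, col

spark = SparkSession.builder.getOrCreate()

# Read every change file in one pass; the glob pattern and DDL-string schema are assumptions.
df = (spark.read
      .schema("key INT, value STRING")
      .csv("/data/changes/file*.csv")
      .withColumn("source_file", input_file_name()))

# For each key, keep only the row coming from the latest file.
w = Window.partitionBy("key").orderBy(col("source_file").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .filter(col("rn") == 1)
            .drop("rn", "source_file"))
latest.show()
```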

join in a dataframe spark java

Submitted by 。_饼干妹妹 on 2019-12-12 08:06:17
Question: First of all, thank you for taking the time to read my question. My question is the following: in Spark with Java, I load the data of two CSV files into two dataframes. These dataframes will have the following information.

Dataframe Airport
Id | Name    | City
-----------------------
1  | Barajas | Madrid

Dataframe airport_city_state
City   | state
----------------
Madrid | España

I want to join these two dataframes so that it looks like this:

dataframe result
Id | Name | City | state
--------------------
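The excerpt cuts off before the answer; what is being described is an equi-join on the shared City column. The question is in Java, but for consistency with the other examples here is a PySpark sketch of the same join, with the table contents taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

airport = spark.createDataFrame([(1, "Barajas", "Madrid")], ["Id", "Name", "City"])
airport_city_state = spark.createDataFrame([("Madrid", "España")], ["City", "state"])

# Inner join on the shared City column; the equivalent Java call is df1.join(df2, "City").
result = (airport.join(airport_city_state, on="City", how="inner")
                 .select("Id", "Name", "City", "state"))
result.show()
```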

Get list of data types from schema in Apache Spark

Submitted by 倖福魔咒の on 2019-12-12 07:44:04
Question: I have the following code in Spark-Python to get the list of names from the schema of a DataFrame, which works fine, but how can I get the list of the data types?

columnNames = df.schema.names

For example, something like:

columnTypes = df.schema.types

Is there any way to get a separate list of the data types contained in a DataFrame schema?

Answer 1: Here's a suggestion:

df = sqlContext.createDataFrame([('a', 1)])
types = [f.dataType for f in df.schema.fields]
types
> [StringType, LongType]
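The answer's snippet is complete as shown; a closely related alternative not in the excerpt is df.dtypes, which returns (column name, type string) pairs. A small self-contained version using the modern SparkSession entry point:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', 1)])

types = [f.dataType for f in df.schema.fields]  # e.g. [StringType, LongType]
names_and_types = df.dtypes                     # e.g. [('_1', 'string'), ('_2', 'bigint')]
print(types, names_and_types)
```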

Pyspark read multiple csv files into a dataframe (OR RDD?)

Submitted by 二次信任 on 2019-12-12 07:19:41
Question: I've got a Spark 2.0.2 cluster that I'm hitting via PySpark through a Jupyter Notebook. I have multiple pipe-delimited txt files (loaded into HDFS, but also available in a local directory) that I need to load using spark-csv into three separate dataframes, depending on the name of the file. I see three approaches I can take: either I can use Python to somehow iterate through the HDFS directory (I haven't figured out how to do this yet), load each file, and then do a union. I also know that there
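The excerpt stops mid-list of approaches; two common patterns (assumptions here, since the answer is not shown) are passing a glob of paths per file type to a single read, or loading everything at once and keeping the source path via input_file_name(). A PySpark sketch with placeholder HDFS paths, using Spark 2.x's built-in CSV reader:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Pattern 1: one dataframe per file-name pattern (placeholder paths; pipe delimiter from the question).
df_a = spark.read.option("delimiter", "|").csv("hdfs:///data/typeA_*.txt")
df_b = spark.read.option("delimiter", "|").csv("hdfs:///data/typeB_*.txt")

# Pattern 2: load everything at once and keep the source path for splitting later.
df_all = (spark.read.option("delimiter", "|")
          .csv("hdfs:///data/*.txt")
          .withColumn("source_file", input_file_name()))
```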