spark-dataframe

How to filter column on values in list in pyspark?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-12 10:43:09
Question: I have a dataframe rawdata on which I have to apply a filter condition on column X with the values CB, CI and CR. So I used the code below:

df = dfRawData.filter(col("X").between("CB","CI","CR"))

But I am getting the following error:

between() takes exactly 3 arguments (4 given)

Please let me know how I can resolve this issue.

Answer 1: between is used to check whether a value lies between two values; its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To
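The answer is cut off above; the usual fix it is leading toward is Column.isin, which checks membership in a list of values. A minimal PySpark sketch, with the column name and values taken from the question and hypothetical sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data standing in for the dfRawData in the question.
dfRawData = spark.createDataFrame([("CB",), ("CI",), ("CR",), ("ZZ",)], ["X"])

# isin checks membership in a list of values; between only takes a lower and an upper bound.
df = dfRawData.filter(col("X").isin("CB", "CI", "CR"))
df.show()
```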

Fetching esJsonRDD from elasticsearch with complex filtering in Spark

Submitted by 大城市里の小女人 on 2019-12-12 10:23:19
Question: I am currently fetching the elasticsearch RDD in our Spark job, filtering with a one-line elastic query such as (example):

val elasticRdds = sparkContext.esJsonRDD(esIndex, s"?default_operator=AND&q=director.name:DAVID + \n movie.name:SEVEN")

Now if our search query becomes complex, like:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "default_operator": "AND",
          "query": "director.name:DAVID + \n movie.name:SEVEN"
        }
      },
      "filter": {
        "nested": {
          "path": "movieStatus.boxoffice.status",
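The excerpt stops mid-query, but the general idea behind answers to this kind of question is that elasticsearch-hadoop accepts a full query-DSL JSON body in place of a URI-style query. A hedged sketch of that idea through the DataFrame reader (a different entry point than esJsonRDD, shown in PySpark for consistency with the other examples); the index name and query body are placeholders and the elasticsearch-hadoop connector must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A full query-DSL body can be supplied via es.query instead of a ?q=... URI query.
query = """{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "director.name:DAVID AND movie.name:SEVEN"
    }
  }
}"""

df = (spark.read
      .format("org.elasticsearch.spark.sql")  # elasticsearch-hadoop Spark SQL data source
      .option("es.query", query)
      .load("movies"))                        # placeholder index name
```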

Add leading zeros to Columns in a Spark Data Frame [duplicate]

Submitted by 梦想的初衷 on 2019-12-12 09:59:53
Question: This question already has an answer here: Prepend zeros to a value in PySpark (1 answer). Closed last year.

In short, I'm leveraging spark-xml to do some parsing of XML files. However, using it removes the leading zeros in all the values I'm interested in. I need the final output, which is a DataFrame, to include those leading zeros, and I'm unsure of / cannot figure out a way to add leading zeros back to the columns I'm interested in.

val df = spark.read
  .format("com.databricks.spark.xml")
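The excerpt ends before the answer; a common approach (an assumption here, not necessarily the accepted answer) is to treat the affected columns as strings and left-pad them with lpad. A PySpark sketch with a hypothetical column name and pad width:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lpad, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; in the question the values come from spark-xml and have lost their leading zeros.
df = spark.createDataFrame([(123,), (45,)], ["account"])

# Cast to string and left-pad with zeros to a fixed width (width 8 is an assumed example).
df = df.withColumn("account", lpad(col("account").cast("string"), 8, "0"))
df.show()
```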

Should I Avoid groupby() in Dataset/Dataframe? [duplicate]

Submitted by 强颜欢笑 on 2019-12-12 09:59:21
Question: This question already has an answer here: DataFrame / Dataset groupBy behaviour/optimization (1 answer). Closed last year.

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since those methods reduce first on each partition before combining the results, which reduces the amount of data being shuffled. Now, my question is whether this still applies to Dataset/DataFrame? I was thinking that
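The linked duplicate's conclusion (paraphrased) is that DataFrame groupBy followed by agg compiles to partial aggregation, so it does not carry groupByKey's full-shuffle penalty. A minimal PySpark illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# groupBy().agg() aggregates within each partition before shuffling,
# similar in spirit to reduceByKey on RDDs.
agg = df.groupBy("key").agg(sum_("value").alias("total"))
agg.explain()  # the physical plan shows a partial HashAggregate before the exchange
agg.show()
```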

Is it better for Spark to select from hive or select from file

Submitted by 无人久伴 on 2019-12-12 08:42:03
Question: I was just wondering what people's thoughts are on reading from Hive vs. reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why?

Mike

Answer 1: tl;dr: I would read it straight from the parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row × 100-column table, some timings I've recorded are
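The answer trails off before the timings; the two access paths being compared look roughly like the sketch below (modern SparkSession API rather than the Spark 1.5 HiveContext used in the answer; database, table, and path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: go through the Hive metastore (external table over the same Parquet files).
df_hive = spark.table("mydb.mytable")             # placeholder database/table

# Option 2: read the underlying Parquet files directly, as the answer recommends.
df_parquet = spark.read.parquet("/data/mytable")  # placeholder path
```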

Filter rows by distinct values in one column in PySpark

Submitted by 折月煮酒 on 2019-12-12 08:23:34
Question: Let's say I have the following table:

+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|    tia1.eskimo.com |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history
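The excerpt ends before the question is fully stated, so the exact intent is ambiguous; if the goal is one row per distinct value of a column (a common reading of the title), dropDuplicates is the usual tool. A hedged PySpark sketch using the host column from the sample table, with made-up rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.createDataFrame(
    [("js002.cc", "/a", 404), ("js002.cc", "/b", 404), ("tia1.eskimo.com", "/c", 404)],
    ["host", "path", "status"],
)

# Keep a single (arbitrary) row per distinct host.
deduped = logs.dropDuplicates(["host"])
deduped.show()
```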

Pyspark read delta/upsert dataset from csv files

Submitted by ぃ、小莉子 on 2019-12-12 08:17:08
Question: I have a dataset that is updated periodically, which I receive as a series of CSV files describing the changes. I'd like a DataFrame that contains only the latest version of each row. Is there a way to load the whole dataset in Spark/PySpark that allows for parallelism?

Example:

File 1 (Key, Value)
1,ABC
2,DEF
3,GHI

File 2 (Key, Value)
2,XYZ
4,UVW

File 3 (Key, Value)
3,JKL
4,MNO

Should result in:
1,ABC
2,XYZ
3,JKL
4,MNO

I know I could do this by loading each file sequentially and then using an
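The excerpt stops before the answer; one way (an assumption, not the accepted answer) to get the latest row per key in a single parallel load is to read all files at once, tag each row with its source file, and keep the newest row per key with a window function. A sketch assuming the file names sort in delivery order (file1.csv, file2.csv, ...) under a placeholder path:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import input_file_name, row_number, col

spark = SparkSession.builder.getOrCreate()

# Read every change file in one pass; the glob pattern and DDL-string schema are assumptions.
df = (spark.read
      .schema("key INT, value STRING")
      .csv("/data/changes/file*.csv")
      .withColumn("source_file", input_file_name()))

# For each key, keep only the row coming from the latest file.
w = Window.partitionBy("key").orderBy(col("source_file").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .filter(col("rn") == 1)
            .drop("rn", "source_file"))
latest.show()
```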

join in a dataframe spark java

Submitted by 。_饼干妹妹 on 2019-12-12 08:06:17
Question: First of all, thank you for taking the time to read my question. My question is the following: in Spark with Java, I load the data of two CSV files into two dataframes. These dataframes will have the following information.

Dataframe Airport
Id | Name    | City
-----------------------
1  | Barajas | Madrid

Dataframe airport_city_state
City   | state
----------------
Madrid | España

I want to join these two dataframes so that it looks like this:

dataframe result
Id | Name | City | state
--------------------
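The excerpt cuts off before the answer; what is being described is an equi-join on the shared City column. The question is in Java, but for consistency with the other examples here is a PySpark sketch of the same join, with the table contents taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

airport = spark.createDataFrame([(1, "Barajas", "Madrid")], ["Id", "Name", "City"])
airport_city_state = spark.createDataFrame([("Madrid", "España")], ["City", "state"])

# Inner join on the shared City column; the equivalent Java call is df1.join(df2, "City").
result = (airport.join(airport_city_state, on="City", how="inner")
                 .select("Id", "Name", "City", "state"))
result.show()
```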

Get list of data types from schema in Apache Spark

Submitted by 倖福魔咒の on 2019-12-12 07:44:04
Question: I have the following code in Spark-Python to get the list of names from the schema of a DataFrame, which works fine, but how can I get the list of the data types?

columnNames = df.schema.names

For example, something like:

columnTypes = df.schema.types

Is there any way to get a separate list of the data types contained in a DataFrame schema?

Answer 1: Here's a suggestion:

df = sqlContext.createDataFrame([('a', 1)])
types = [f.dataType for f in df.schema.fields]
types
> [StringType, LongType]
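The answer's snippet is complete as shown; a closely related alternative not in the excerpt is df.dtypes, which returns (column name, type string) pairs. A small self-contained version using the modern SparkSession entry point:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', 1)])

types = [f.dataType for f in df.schema.fields]  # e.g. [StringType, LongType]
names_and_types = df.dtypes                     # e.g. [('_1', 'string'), ('_2', 'bigint')]
print(types, names_and_types)
```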

Pyspark read multiple csv files into a dataframe (OR RDD?)

Submitted by 二次信任 on 2019-12-12 07:19:41
Question: I've got a Spark 2.0.2 cluster that I'm hitting via PySpark through a Jupyter Notebook. I have multiple pipe-delimited txt files (loaded into HDFS, but also available in a local directory) that I need to load using spark-csv into three separate dataframes, depending on the name of the file. I see three approaches I can take: either I can use Python to somehow iterate through the HDFS directory (I haven't figured out how to do this yet), load each file, and then do a union. I also know that there
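The excerpt stops mid-list of approaches; two common patterns (assumptions here, since the answer is not shown) are passing a glob of paths per file type to a single read, or loading everything at once and keeping the source path via input_file_name(). A PySpark sketch with placeholder HDFS paths, using Spark 2.x's built-in CSV reader:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Pattern 1: one dataframe per file-name pattern (placeholder paths; pipe delimiter from the question).
df_a = spark.read.option("delimiter", "|").csv("hdfs:///data/typeA_*.txt")
df_b = spark.read.option("delimiter", "|").csv("hdfs:///data/typeB_*.txt")

# Pattern 2: load everything at once and keep the source path for splitting later.
df_all = (spark.read.option("delimiter", "|")
          .csv("hdfs:///data/*.txt")
          .withColumn("source_file", input_file_name()))
```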