apache-spark-sql

How to read multiple partitioned .gzip files into a Spark Dataframe?

Submitted by 扶醉桌前 on 2021-02-07 19:41:50
Question: I have the following folder of partitioned data:

    my_folder
    |--part-0000.gzip
    |--part-0001.gzip
    |--part-0002.gzip
    |--part-0003.gzip

I try to read this data into a dataframe using:

    >>> my_df = spark.read.csv("/path/to/my_folder/*")
    >>> my_df.show(5)
    +--------------------+
    |                 _c0|
    +--------------------+
    |��[I���...|
    |��RUu�[*Ք��g��T...|
    |�t��� �qd��8~��...|
    |�(���b4�:������I�...|
    |���!y�)�PC��ќ\�...|
    +--------------------+
    only showing top 5 rows

The rows come back as raw compressed bytes instead of the decoded CSV contents. I also tried the following to check the data:

    >>> rdd =
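A sketch of one way to get readable rows (an assumption, since the entry is cut off before any answer): Spark decompresses text inputs through Hadoop's compression codecs, which are selected by file extension, and the gzip codec expects .gz, so parts named .gzip are read as raw bytes. Renaming the parts to .gz, or using the matching reader if they are actually another format such as Parquet, is one way forward; the paths below are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumes the parts are gzip-compressed CSV and have been renamed to *.gz,
    # e.g. in a shell:  for f in part-*.gzip; do mv "$f" "${f%.gzip}.gz"; done
    my_df = spark.read.csv("/path/to/my_folder/*.gz", header=False, inferSchema=True)
    my_df.show(5)

    # If the parts are really gzip-compressed Parquet, the Parquet reader
    # handles the compression itself:
    # my_df = spark.read.parquet("/path/to/my_folder/")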

Spark SQL - rlike ignore case

Submitted by 前提是你 on 2021-02-07 19:18:07
Question: I am using Spark SQL and trying to compare a string using rlike. It works fine, but I would like to understand how to ignore case.

This returns true:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ Week Ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

However, this returns false:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

Answer 1: Spark uses the standard Scala/Java regex library, so you can inline the processing flags in the pattern, for example (?i) for case-insensitive matching.
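A small sketch of the inline-flag approach, run through PySpark's SQL interface to match the other examples here; the literal and pattern are taken from the question, and (?i) is the standard Java regex embedded flag:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The (?i) prefix makes the whole pattern case-insensitive,
    # so the lowercase "week ending" now matches "Week Ending".
    spark.sql(r"""
        SELECT "1 Week Ending Jan 14, 2018"
               rlike "(?i)^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}" AS is_match
    """).show()
    # expected: is_match = true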

pyspark - Convert sparse vector obtained after one hot encoding into columns

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-07 18:43:41
Question: I am using Apache Spark MLlib to handle categorical features with one-hot encoding. After running the code below I get a vector c_idx_vec as the output of the encoder. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new, transformed dataframe. Take this dataset as an example:

    >>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
    >>> ss = StringIndexer
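A sketch of one way to expand the encoded vector into ordinary columns, assuming Spark 3.0+ so that pyspark.ml.functions.vector_to_array is available; the column names c_idx and c_idx_vec follow the question, the rest is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.feature import StringIndexer, OneHotEncoder
    from pyspark.ml.functions import vector_to_array   # Spark 3.0+

    spark = SparkSession.builder.getOrCreate()

    fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])

    # Index the string column, then one-hot encode the index
    indexed = StringIndexer(inputCol="c", outputCol="c_idx").fit(fd).transform(fd)
    encoded = OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"]).fit(indexed).transform(indexed)

    # Turn the sparse vector into an array, then pull each position out as its own column
    arr = encoded.withColumn("v", vector_to_array(col("c_idx_vec")))
    n = len(arr.first()["v"])
    out = arr.select("x", "c", *[col("v")[i].alias(f"c_vec_{i}") for i in range(n)])
    out.show()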

Java Spark : Stack Overflow Error on GroupBy

Submitted by 坚强是说给别人听的谎言 on 2021-02-07 16:08:31
Question: I am using Spark 2.3.1 with Java. I have a Dataset that I want to group in order to run some aggregations (say a count() for the example). The grouping must be done according to a given list of columns. My function is the following:

    public Dataset<Row> compute(Dataset<Row> data, List<String> columns) {
        final List<Column> columns_col = new ArrayList<Column>();
        for (final String tag : columns) {
            columns_col.add(new Column(tag));
        }
        Seq<Column> columns_seq = JavaConverters
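For comparison, the same dynamic grouping expressed in PySpark rather than the asker's Java (a sketch of the intent with illustrative column names; groupBy accepts plain column names, so no List-to-Seq conversion is needed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = spark.createDataFrame(
        [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)],
        ["tag1", "tag2", "value"])
    group_cols = ["tag1", "tag2"]   # the given list of grouping columns

    # Unpack the list directly into groupBy, then aggregate
    data.groupBy(*group_cols).count().show()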

How to transform JSON string with multiple keys, from spark data frame rows in pyspark?

Submitted by 南笙酒味 on 2021-02-07 10:53:15
Question: I'm looking for help parsing a JSON string with multiple keys into a JSON struct (see the required output). The answer below shows how to transform a JSON string with a single id:

    jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'

(see: How to parse and transform json string from spark data frame rows in pyspark)

How can I transform thousands of ids in jstr1, jstr2, ..., when the number of ids per JSON string changes from string to string? Current code:

    jstr1 = """ {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [
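A sketch of one way to handle a varying number of ids per string (an assumption, since the entry is cut off before any answer): parse each JSON string as a map from id to an array of structs with from_json, then explode the map. The sample values for id_2 and jstr2 below are made up to complete the truncated snippet:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, from_json
    from pyspark.sql.types import (ArrayType, IntegerType, MapType,
                                   StringType, StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample rows; each string may carry a different number of ids
    jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}]}'
    jstr2 = '{"id_3": [{"a": 7, "b": 8}]}'
    df = spark.createDataFrame([(jstr1,), (jstr2,)], ["jstr"])

    # A map schema keeps the keys dynamic instead of hard-coding id_1, id_2, ...
    schema = MapType(StringType(),
                     ArrayType(StructType([StructField("a", IntegerType()),
                                           StructField("b", IntegerType())])))

    parsed = df.select(explode(from_json(col("jstr"), schema)).alias("id", "values"))
    parsed.show(truncate=False)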

Escape quotes is not working in spark 2.2.0 while reading csv

Submitted by 断了今生、忘了曾经 on 2021-02-07 10:34:18
Question: I am trying to read a tab-separated delimited file, but I am not able to read all the records. Here are my input records:

    head1  head2  head3
    a      b      c
    a2     a3     a4
    a1     "b1    "c1

My code:

    var inputDf = sparkSession.read
      .option("delimiter", "\t")
      .option("header", "true")
      // .option("inferSchema", "true")
      .option("nullValue", "")
      .option("escape", "\"")
      .option("multiLine", true)
      .option("nullValue", null)
      .option("nullValue", "NULL")
      .schema(finalSchema)
      .csv("file:///C:/Users/prhasija/Desktop
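A sketch of one common workaround, not taken from this entry (it is truncated before any answer): the unbalanced quotes in "b1 and "c1 break the default quote/escape handling, and the Spark CSV reader documents that setting the quote option to an empty string turns quotation processing off. Shown in PySpark to match the other examples; the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    input_df = (spark.read
                .option("delimiter", "\t")
                .option("header", "true")
                .option("quote", "")                # empty string disables quote handling entirely
                .csv("file:///path/to/input.tsv"))  # placeholder path
    input_df.show()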