apache-spark-sql

How to read multiple partitioned .gzip files into a Spark Dataframe?

Submitted by 扶醉桌前 on 2021-02-07 19:41:50
Question: I have the following folder of partitioned data:

    my_folder
    |--part-0000.gzip
    |--part-0001.gzip
    |--part-0002.gzip
    |--part-0003.gzip

I try to read this data into a dataframe using:

    >>> my_df = spark.read.csv("/path/to/my_folder/*")
    >>> my_df.show(5)
    +--------------------+
    |                 _c0|
    +--------------------+
    |��[I���...|
    |��RUu�[*Ք��g��T...|
    |�t��� �qd��8~��...|
    |�(���b4�:������I�...|
    |���!y�)�PC��ќ\�...|
    +--------------------+
    only showing top 5 rows

The rows come back as raw compressed bytes instead of the decoded CSV contents. I also tried the following to check the data:

    >>> rdd =
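A sketch of one way to get readable rows (an assumption, since the entry is cut off before any answer): Spark decompresses text inputs through Hadoop's compression codecs, which are selected by file extension, and the gzip codec expects .gz, so parts named .gzip are read as raw bytes. Renaming the parts to .gz, or using the matching reader if they are actually another format such as Parquet, is one way forward; the paths below are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumes the parts are gzip-compressed CSV and have been renamed to *.gz,
    # e.g. in a shell:  for f in part-*.gzip; do mv "$f" "${f%.gzip}.gz"; done
    my_df = spark.read.csv("/path/to/my_folder/*.gz", header=False, inferSchema=True)
    my_df.show(5)

    # If the parts are really gzip-compressed Parquet, the Parquet reader
    # handles the compression itself:
    # my_df = spark.read.parquet("/path/to/my_folder/")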

Spark SQL - rlike ignore case

Submitted by 前提是你 on 2021-02-07 19:18:07
Question: I am using Spark SQL and trying to compare a string using rlike. It works fine, but I would like to understand how to ignore case.

This returns true:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ Week Ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

However, this returns false:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

Answer 1: Spark uses the standard Scala/Java regex library, so you can inline the processing flags in the pattern, for example (?i) for case-insensitive matching.
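A small sketch of the inline-flag approach, run through PySpark's SQL interface to match the other examples here; the literal and pattern are taken from the question, and (?i) is the standard Java regex embedded flag:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The (?i) prefix makes the whole pattern case-insensitive,
    # so the lowercase "week ending" now matches "Week Ending".
    spark.sql(r"""
        SELECT "1 Week Ending Jan 14, 2018"
               rlike "(?i)^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}" AS is_match
    """).show()
    # expected: is_match = true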

pyspark - Convert sparse vector obtained after one hot encoding into columns

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-07 18:43:41
Question: I am using Apache Spark MLlib to handle categorical features with one-hot encoding. After running the code below I get a vector c_idx_vec as the output of the encoder. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new, transformed dataframe. Take this dataset as an example:

    >>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
    >>> ss = StringIndexer
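A sketch of one way to expand the encoded vector into ordinary columns, assuming Spark 3.0+ so that pyspark.ml.functions.vector_to_array is available; the column names c_idx and c_idx_vec follow the question, the rest is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.feature import StringIndexer, OneHotEncoder
    from pyspark.ml.functions import vector_to_array   # Spark 3.0+

    spark = SparkSession.builder.getOrCreate()

    fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])

    # Index the string column, then one-hot encode the index
    indexed = StringIndexer(inputCol="c", outputCol="c_idx").fit(fd).transform(fd)
    encoded = OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"]).fit(indexed).transform(indexed)

    # Turn the sparse vector into an array, then pull each position out as its own column
    arr = encoded.withColumn("v", vector_to_array(col("c_idx_vec")))
    n = len(arr.first()["v"])
    out = arr.select("x", "c", *[col("v")[i].alias(f"c_vec_{i}") for i in range(n)])
    out.show()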

Java Spark : Stack Overflow Error on GroupBy

Submitted by 坚强是说给别人听的谎言 on 2021-02-07 16:08:31
Question: I am using Spark 2.3.1 with Java. I have a Dataset that I want to group in order to run some aggregations (say a count() for the example). The grouping must be done according to a given list of columns. My function is the following:

    public Dataset<Row> compute(Dataset<Row> data, List<String> columns) {
        final List<Column> columns_col = new ArrayList<Column>();
        for (final String tag : columns) {
            columns_col.add(new Column(tag));
        }
        Seq<Column> columns_seq = JavaConverters
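For comparison, the same dynamic grouping expressed in PySpark rather than the asker's Java (a sketch of the intent with illustrative column names; groupBy accepts plain column names, so no List-to-Seq conversion is needed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = spark.createDataFrame(
        [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)],
        ["tag1", "tag2", "value"])
    group_cols = ["tag1", "tag2"]   # the given list of grouping columns

    # Unpack the list directly into groupBy, then aggregate
    data.groupBy(*group_cols).count().show()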

How to transform JSON string with multiple keys, from spark data frame rows in pyspark?

Submitted by 南笙酒味 on 2021-02-07 10:53:15
Question: I'm looking for help parsing a JSON string with multiple keys into a JSON struct (see the required output). The answer below shows how to transform a JSON string with a single id:

    jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'

(see: How to parse and transform json string from spark data frame rows in pyspark)

How can I transform thousands of ids in jstr1, jstr2, ..., when the number of ids per JSON string changes from string to string? Current code:

    jstr1 = """ {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [
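A sketch of one way to handle a varying number of ids per string (an assumption, since the entry is cut off before any answer): parse each JSON string as a map from id to an array of structs with from_json, then explode the map. The sample values for id_2 and jstr2 below are made up to complete the truncated snippet:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, from_json
    from pyspark.sql.types import (ArrayType, IntegerType, MapType,
                                   StringType, StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample rows; each string may carry a different number of ids
    jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}]}'
    jstr2 = '{"id_3": [{"a": 7, "b": 8}]}'
    df = spark.createDataFrame([(jstr1,), (jstr2,)], ["jstr"])

    # A map schema keeps the keys dynamic instead of hard-coding id_1, id_2, ...
    schema = MapType(StringType(),
                     ArrayType(StructType([StructField("a", IntegerType()),
                                           StructField("b", IntegerType())])))

    parsed = df.select(explode(from_json(col("jstr"), schema)).alias("id", "values"))
    parsed.show(truncate=False)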

Escape quotes is not working in spark 2.2.0 while reading csv

Submitted by 断了今生、忘了曾经 on 2021-02-07 10:34:18
Question: I am trying to read a tab-separated delimited file, but I am not able to read all the records. Here are my input records:

    head1  head2  head3
    a      b      c
    a2     a3     a4
    a1     "b1    "c1

My code:

    var inputDf = sparkSession.read
      .option("delimiter", "\t")
      .option("header", "true")
      // .option("inferSchema", "true")
      .option("nullValue", "")
      .option("escape", "\"")
      .option("multiLine", true)
      .option("nullValue", null)
      .option("nullValue", "NULL")
      .schema(finalSchema)
      .csv("file:///C:/Users/prhasija/Desktop
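A sketch of one common workaround, not taken from this entry (it is truncated before any answer): the unbalanced quotes in "b1 and "c1 break the default quote/escape handling, and the Spark CSV reader documents that setting the quote option to an empty string turns quotation processing off. Shown in PySpark to match the other examples; the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    input_df = (spark.read
                .option("delimiter", "\t")
                .option("header", "true")
                .option("quote", "")                # empty string disables quote handling entirely
                .csv("file:///path/to/input.tsv"))  # placeholder path
    input_df.show()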