pyspark

How to define schema for Pyspark createDataFrame(rdd, schema)?

好久不见. Submitted on 2020-07-22 07:19:06

Question: I looked at spark-rdd to dataframe. I read my gzipped JSON into an RDD:

    rdd1 = sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')

I want to convert it to a Spark DataFrame. The first method from the linked SO question does not work. This is the first row from the file:

    {"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000", "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event
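A minimal sketch of one way to attach an explicit schema, assuming the file is newline-delimited JSON and listing only a few of the fields visible in the sample row above (the real schema would enumerate every key); spark.read.json accepts an RDD of JSON strings, so the existing rdd1 can be reused:

    from pyspark.sql.types import StructType, StructField, StringType

    # Partial schema built from the fields shown in the sample row above.
    schema = StructType([
        StructField("code_event", StringType(), True),
        StructField("code_event_system", StringType(), True),
        StructField("company_id", StringType(), True),
        StructField("date_event", StringType(), True),
    ])

    rdd1 = sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
    df = spark.read.json(rdd1, schema=schema)
    df.printSchema()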

SparkException: Chi-square test expect factors

房东的猫 Submitted on 2020-07-21 07:04:30

Question: I have a dataset containing 42 features and 1 label. I want to apply the chi-square selector from the Spark ML library before running a decision tree for anomaly detection, but I get this error when applying the chi-square selector:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square
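The chi-square test only accepts categorical ("factor") features, so continuous columns generally have to be discretized before the selector runs. A hedged sketch, with hypothetical column names, that buckets continuous features with QuantileDiscretizer before assembling them for ChiSqSelector:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import QuantileDiscretizer, VectorAssembler, ChiSqSelector

    continuous_cols = ["f1", "f2"]  # hypothetical names for the continuous features

    # Bucket each continuous column so the chi-square test sees categorical values.
    discretizers = [
        QuantileDiscretizer(numBuckets=10, inputCol=c, outputCol=c + "_bin")
        for c in continuous_cols
    ]
    assembler = VectorAssembler(
        inputCols=[c + "_bin" for c in continuous_cols],
        outputCol="features",
    )
    selector = ChiSqSelector(
        numTopFeatures=20,
        featuresCol="features",
        labelCol="label",
        outputCol="selectedFeatures",
    )
    pipeline = Pipeline(stages=discretizers + [assembler, selector])
    # model = pipeline.fit(df)  # df is the 42-feature DataFrame from the question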

Can I tell spark.read.json that my files are gzipped?

走远了吗. Submitted on 2020-07-20 17:27:07

Question: I have an s3 bucket with nearly 100k gzipped JSON files. These files are called [timestamp].json instead of the more sensible [timestamp].json.gz. I have other processes that use them, so renaming is not an option and copying them is even less ideal. I am using spark.read.json([pattern]) to read these files. If I rename the files to end in .gz this works fine, but while the extension is just .json they cannot be read. Is there any way I can tell Spark that these files are gzipped?
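Spark picks the decompression codec from the file extension, so files without .gz are read as plain text. One hedged workaround sketch (the bucket pattern below is a placeholder): read the raw bytes, gunzip them in Python, and hand the resulting JSON strings to spark.read.json:

    import gzip

    # binaryFiles yields (path, bytes) pairs; decompress each file and split it into lines.
    raw = sc.binaryFiles("s3://my-bucket/path/*.json")  # hypothetical pattern
    json_lines = raw.flatMap(
        lambda kv: gzip.decompress(kv[1]).decode("utf-8").splitlines()
    )
    df = spark.read.json(json_lines)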

Spark: combine multiple rows into a single row based on a specific column, without a groupBy operation

a 夏天 Submitted on 2020-07-20 04:31:06

Question: I have a Spark data frame like the one below, with 7k columns.

    +---+----+----+----+----+----+----+
    | id|   1|   2|   3|sf_1|sf_2|sf_3|
    +---+----+----+----+----+----+----+
    |  2|null|null|null| 102| 202| 302|
    |  4|null|null|null| 104| 204| 304|
    |  1|null|null|null| 101| 201| 301|
    |  3|null|null|null| 103| 203| 303|
    |  1|  11|  21|  31|null|null|null|
    |  2|  12|  22|  32|null|null|null|
    |  4|  14|  24|  34|null|null|null|
    |  3|  13|  23|  33|null|null|null|
    +---+----+----+----+----+----+----+

I wanted to transform the data frame like
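Although the title asks to avoid groupBy, the usual baseline for collapsing such complementary rows is a groupBy on id that takes the first non-null value of every other column; a minimal sketch under that assumption, building the aggregation list programmatically because of the ~7k columns:

    from pyspark.sql import functions as F

    # For each non-id column, keep the first non-null value within each id group.
    agg_exprs = [F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "id"]
    merged = df.groupBy("id").agg(*agg_exprs)
    merged.show()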

UPSERT in parquet Pyspark

假如想象 Submitted on 2020-07-19 01:59:52

Question: I have parquet files in s3 with the following partitions: year / month / date / some_id. Using Spark (PySpark), each day I would like to UPSERT (more or less) the last 14 days: replace the existing data in s3 (one parquet file per partition) without deleting the days that are older than 14 days. I tried two save modes: append wasn't good because it just adds another file, and overwrite deletes the past data and the data for other partitions. Is there any way or best practice to
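One common approach (a sketch, assuming Spark 2.3+ where dynamic partition overwrite is available): with partitionOverwriteMode set to dynamic, overwrite only replaces the partitions present in the DataFrame being written, here the last 14 days, and leaves older partitions untouched. The path below is a placeholder:

    # Only partitions that appear in last_14_days_df are overwritten; the rest stay.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (last_14_days_df  # DataFrame holding just the last 14 days of data
        .write
        .mode("overwrite")
        .partitionBy("year", "month", "date", "some_id")
        .parquet("s3://my-bucket/my-table/"))  # hypothetical S3 path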

Create Spark DataFrame from Pandas DataFrame

笑着哭i Submitted on 2020-07-18 21:09:09

Question: I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

    import pandas as pd
    pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
    spark_df = sqlContext.createDataFrame(pandas_df)
    spark_df.printSchema()

Up to this point everything is OK. The output is:

    root
     |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

    spark_df.show()

This is the result:

    An error occurred while calling o158.collectToPython. : org.apache
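A frequent cause of show()/collect failures like this is a mismatch between the Python interpreter used by the driver and the one launched for the executors. A hedged sketch of that fix, pinning both to the notebook's own interpreter before the session is created (this is only one possible cause; the full Java traceback would confirm it):

    import os, sys

    # Make the executors use the same Python as the driver/notebook.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

    import pandas as pd
    pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})
    spark.createDataFrame(pandas_df).show()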

The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it. (Windows)

前提是你 Submitted on 2020-07-17 11:14:27

Question: I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code:

    import findspark
    findspark.init()

I receive a ValueError:

    ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).

However, the SPARK_HOME variable is set. Here is a screenshot that
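Environment variables set after Jupyter was started (or set only at the user/system level on Windows) are not always visible to the notebook kernel. One common workaround, sketched here with a placeholder install path, is to set SPARK_HOME inside the notebook itself or pass the path directly to findspark.init():

    import os
    import findspark

    # Hypothetical path; point this at the actual pre-built Spark directory.
    os.environ["SPARK_HOME"] = r"C:\spark\spark-2.4.5-bin-hadoop2.7"
    findspark.init()  # equivalently: findspark.init(r"C:\spark\spark-2.4.5-bin-hadoop2.7")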