pyspark

How to define schema for Pyspark createDataFrame(rdd, schema)?

好久不见. Submitted on 2020-07-22 07:19:06

Question: I looked at spark-rdd to dataframe. I read my gzipped JSON into an RDD:

    rdd1 = sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')

I want to convert it to a Spark DataFrame. The first method from the linked SO question does not work. This is the first row from the file:

    {"code_event": "1092406", "code_event_system": "LOTTO", "company_id": "2", "date_event": "2020-05-27 12:00:00.000", "date_event_real": "0001-01-01 00:00:00.000", "ecode_class": "", "ecode_event
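A minimal sketch of one way to attach an explicit schema, assuming the file is newline-delimited JSON and listing only a few of the fields visible in the sample row above (the real schema would enumerate every key); spark.read.json accepts an RDD of JSON strings, so the existing rdd1 can be reused:

    from pyspark.sql.types import StructType, StructField, StringType

    # Partial schema built from the fields shown in the sample row above.
    schema = StructType([
        StructField("code_event", StringType(), True),
        StructField("code_event_system", StringType(), True),
        StructField("company_id", StringType(), True),
        StructField("date_event", StringType(), True),
    ])

    rdd1 = sc.textFile('s3://cw-milenko-tests/Json_gzips/ticr_calculated_2_2020-05-27T11-59-06.json.gz')
    df = spark.read.json(rdd1, schema=schema)
    df.printSchema()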

SparkException: Chi-square test expect factors

房东的猫 Submitted on 2020-07-21 07:04:30

Question: I have a dataset containing 42 features and 1 label. I want to apply the chi-square selector from the Spark ML library before running a decision tree for anomaly detection, but I get this error when applying the chi-square selector:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square
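The chi-square test only accepts categorical ("factor") features, so continuous columns generally have to be discretized before the selector runs. A hedged sketch, with hypothetical column names, that buckets continuous features with QuantileDiscretizer before assembling them for ChiSqSelector:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import QuantileDiscretizer, VectorAssembler, ChiSqSelector

    continuous_cols = ["f1", "f2"]  # hypothetical names for the continuous features

    # Bucket each continuous column so the chi-square test sees categorical values.
    discretizers = [
        QuantileDiscretizer(numBuckets=10, inputCol=c, outputCol=c + "_bin")
        for c in continuous_cols
    ]
    assembler = VectorAssembler(
        inputCols=[c + "_bin" for c in continuous_cols],
        outputCol="features",
    )
    selector = ChiSqSelector(
        numTopFeatures=20,
        featuresCol="features",
        labelCol="label",
        outputCol="selectedFeatures",
    )
    pipeline = Pipeline(stages=discretizers + [assembler, selector])
    # model = pipeline.fit(df)  # df is the 42-feature DataFrame from the question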

Can I tell spark.read.json that my files are gzipped?

走远了吗. Submitted on 2020-07-20 17:27:07

Question: I have an s3 bucket with nearly 100k gzipped JSON files. These files are called [timestamp].json instead of the more sensible [timestamp].json.gz. I have other processes that use them, so renaming is not an option and copying them is even less ideal. I am using spark.read.json([pattern]) to read these files. If I rename the files to end in .gz this works fine, but while the extension is just .json they cannot be read. Is there any way I can tell Spark that these files are gzipped?
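Spark picks the decompression codec from the file extension, so files without .gz are read as plain text. One hedged workaround sketch (the bucket pattern below is a placeholder): read the raw bytes, gunzip them in Python, and hand the resulting JSON strings to spark.read.json:

    import gzip

    # binaryFiles yields (path, bytes) pairs; decompress each file and split it into lines.
    raw = sc.binaryFiles("s3://my-bucket/path/*.json")  # hypothetical pattern
    json_lines = raw.flatMap(
        lambda kv: gzip.decompress(kv[1]).decode("utf-8").splitlines()
    )
    df = spark.read.json(json_lines)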

Spark: combine multiple rows into a single row based on a specific column, without a groupBy operation

a 夏天 Submitted on 2020-07-20 04:31:06

Question: I have a Spark data frame like the one below, with 7k columns.

    +---+----+----+----+----+----+----+
    | id|   1|   2|   3|sf_1|sf_2|sf_3|
    +---+----+----+----+----+----+----+
    |  2|null|null|null| 102| 202| 302|
    |  4|null|null|null| 104| 204| 304|
    |  1|null|null|null| 101| 201| 301|
    |  3|null|null|null| 103| 203| 303|
    |  1|  11|  21|  31|null|null|null|
    |  2|  12|  22|  32|null|null|null|
    |  4|  14|  24|  34|null|null|null|
    |  3|  13|  23|  33|null|null|null|
    +---+----+----+----+----+----+----+

I wanted to transform the data frame like
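Although the title asks to avoid groupBy, the usual baseline for collapsing such complementary rows is a groupBy on id that takes the first non-null value of every other column; a minimal sketch under that assumption, building the aggregation list programmatically because of the ~7k columns:

    from pyspark.sql import functions as F

    # For each non-id column, keep the first non-null value within each id group.
    agg_exprs = [F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "id"]
    merged = df.groupBy("id").agg(*agg_exprs)
    merged.show()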

UPSERT in parquet Pyspark

假如想象 Submitted on 2020-07-19 01:59:52

Question: I have parquet files in s3 with the following partitions: year / month / date / some_id. Using Spark (PySpark), each day I would like to UPSERT (more or less) the last 14 days: replace the existing data in s3 (one parquet file per partition) without deleting the days that are older than 14 days. I tried two save modes: append wasn't good because it just adds another file, and overwrite deletes the past data and the data for other partitions. Is there any way or best practice to
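One common approach (a sketch, assuming Spark 2.3+ where dynamic partition overwrite is available): with partitionOverwriteMode set to dynamic, overwrite only replaces the partitions present in the DataFrame being written, here the last 14 days, and leaves older partitions untouched. The path below is a placeholder:

    # Only partitions that appear in last_14_days_df are overwritten; the rest stay.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (last_14_days_df  # DataFrame holding just the last 14 days of data
        .write
        .mode("overwrite")
        .partitionBy("year", "month", "date", "some_id")
        .parquet("s3://my-bucket/my-table/"))  # hypothetical S3 path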

Create Spark DataFrame from Pandas DataFrame

笑着哭i Submitted on 2020-07-18 21:09:09

Question: I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

    import pandas as pd
    pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
    spark_df = sqlContext.createDataFrame(pandas_df)
    spark_df.printSchema()

Up to this point everything is OK. The output is:

    root
     |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

    spark_df.show()

This is the result:

    An error occurred while calling o158.collectToPython. : org.apache
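A frequent cause of show()/collect failures like this is a mismatch between the Python interpreter used by the driver and the one launched for the executors. A hedged sketch of that fix, pinning both to the notebook's own interpreter before the session is created (this is only one possible cause; the full Java traceback would confirm it):

    import os, sys

    # Make the executors use the same Python as the driver/notebook.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

    import pandas as pd
    pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})
    spark.createDataFrame(pandas_df).show()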

The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it. (Windows)

前提是你 Submitted on 2020-07-17 11:14:27

Question: I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code:

    import findspark
    findspark.init()

I receive a ValueError:

    ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).

However, the SPARK_HOME variable is set. Here is a screenshot that
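Environment variables set after Jupyter was started (or set only at the user/system level on Windows) are not always visible to the notebook kernel. One common workaround, sketched here with a placeholder install path, is to set SPARK_HOME inside the notebook itself or pass the path directly to findspark.init():

    import os
    import findspark

    # Hypothetical path; point this at the actual pre-built Spark directory.
    os.environ["SPARK_HOME"] = r"C:\spark\spark-2.4.5-bin-hadoop2.7"
    findspark.init()  # equivalently: findspark.init(r"C:\spark\spark-2.4.5-bin-hadoop2.7")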