pyspark

Get HDFS file path in PySpark for files in sequence file format

Submitted by 我的未来我决定 on 2021-01-24 07:09:23
Question: My data on HDFS is in SequenceFile format. I am using PySpark (Spark 1.6) and trying to achieve two things: the data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but I think that might not support the SequenceFile format. How do I deal with the point above if I want to crunch a day's worth of data and bring the date into the data? In that case I would be loading data with a yyyy/mm/dd/* pattern. Appreciate any
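A minimal sketch of one way to do this in PySpark 1.6, under the assumption that the records are (key, value) pairs readable with sc.sequenceFile; the base path, day, and record layout below are hypothetical. Each hour's directory is loaded separately, its records are tagged with the timestamp parsed from that path, and the hourly RDDs are then unioned into the day's dataset.

    # Hedged sketch: base path, day, and record layout are assumptions.
    from datetime import datetime, timedelta
    from pyspark import SparkContext

    sc = SparkContext(appName="sequencefile-with-path-timestamp")

    base = "hdfs:///data"            # hypothetical base directory
    day = datetime(2021, 1, 24)      # the day being crunched

    hourly_rdds = []
    for hour in range(24):
        ts = day + timedelta(hours=hour)
        path = ts.strftime(base + "/%Y/%m/%d/%H")
        # Bind ts as a default argument so each hour keeps its own timestamp.
        tagged = sc.sequenceFile(path).map(lambda kv, ts=ts: (ts, kv[0], kv[1]))
        hourly_rdds.append(tagged)

    full_day = sc.union(hourly_rdds)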

How to apply multiple filters in a for loop for pyspark

Submitted by 三世轮回 on 2021-01-23 11:09:18
Question: I am trying to apply a filter on several columns of an RDD. I want to pass in a list of indices as a parameter to specify which columns to filter on, but PySpark only applies the last filter. I've broken the code down into some simple test cases and tried the non-looped version, and those work.

    test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]
    rdd = sc.parallelize(test_input, 1)
    # Index 0 needs to be longer than length 0
    # Index 1 needs to be longer than length 1
    for i in [0, 1]:
        rdd =
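For reference, the behaviour described (only the last filter taking effect) is the classic late-binding closure problem: every lambda built in the loop ends up looking at the final value of i. A minimal sketch of the usual fix, binding the loop variable as a default argument of the lambda:

    test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]
    rdd = sc.parallelize(test_input, 1)

    for i in [0, 1]:
        # idx=i captures the current value of i; without it, every
        # filter lambda would evaluate i only after the loop has finished.
        rdd = rdd.filter(lambda row, idx=i: len(row[idx]) > idx)

    print(rdd.collect())  # only ('0', '00') passes both length checks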

PySpark MongoDB :: java.lang.NoClassDefFoundError: com/mongodb/client/model/Collation

Submitted by 旧街凉风 on 2021-01-23 06:01:19
Question: I was trying to connect to MongoDB Atlas from PySpark and I ran into the following problem:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.types import *
    from pyspark.sql.functions import *

    sc = SparkContext
    spark = SparkSession.builder \
        .config("spark.mongodb.input.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
        .config("spark.mongodb.output.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db
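For context, a NoClassDefFoundError on com/mongodb/client/model/Collation usually means the MongoDB Java driver is not on the classpath, i.e. the connector jar was supplied without its dependencies. A hedged sketch of letting Spark resolve the connector and the driver transitively via spark.jars.packages; the connector version below is an assumption and has to match the Spark/Scala build in use:

    from pyspark.sql import SparkSession

    # spark.jars.packages must be set before any SparkContext/SparkSession exists.
    spark = SparkSession.builder \
        .appName("mongo-atlas-test") \
        .config("spark.jars.packages",
                "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1") \
        .config("spark.mongodb.input.uri",
                "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
        .getOrCreate()

    # Equivalent when launching from the shell (pulls the Java driver transitively):
    # spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 app.py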

write spark dataframe as array of json (pyspark)

Submitted by 余生长醉 on 2021-01-22 06:43:41
Question: I would like to write my Spark dataframe as a set of JSON files, and in particular each file as a JSON array. Let me explain with some simple (reproducible) code. We have:

    import numpy as np
    import pandas as pd
    df = spark.createDataFrame(pd.DataFrame({'x': np.random.rand(100), 'y': np.random.rand(100)}))

Saving the dataframe with:

    df.write.json('s3://path/to/json')

each file just created has one JSON object per line, something like:

    {"x":0.9953802385540144,"y":0.476027611419198}
    {"x":0
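A minimal sketch of one way to get a JSON array per output file instead of one object per line, assuming Spark 2.1+ for to_json and treating the output path as hypothetical: serialize each row to a JSON string, then wrap every partition's rows in brackets and write the result as plain text.

    from pyspark.sql.functions import to_json, struct

    # One JSON string per row.
    json_rows = df.select(to_json(struct(*df.columns)).alias("j")).rdd.map(lambda r: r["j"])

    def partition_as_array(rows):
        # Emit the whole partition as a single JSON array string.
        yield "[" + ",".join(rows) + "]"

    json_rows.mapPartitions(partition_as_array).saveAsTextFile("s3://path/to/json_arrays")

Calling json_rows.coalesce(1) before mapPartitions would produce a single file containing one array, at the cost of writing through a single task.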

Validate date format in a dataframe column in pyspark

Submitted by 自闭症网瘾萝莉.ら on 2021-01-21 12:12:28
Question: I have a dataframe with a Date column along with a few other columns. I want to validate the Date column and check whether its values are in the "dd/MM/yyyy" format; if the Date column holds any other format, the row should be marked as a bad record. So I am using option("dateFormat", "dd/MM/yyyy") to accept dates in that format, and it does accept "dd/MM/yyyy" dates properly, but if I pass an invalid format (YYYY/mm/dd) the record is still not marked as invalid and the date is converted to garbage
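A hedged sketch of an alternative to relying on the read-time dateFormat option: parse the column explicitly and flag the rows that do not match the pattern. The column names are assumptions; on Spark 3.x the default parser generally yields null for values that do not match "dd/MM/yyyy", while Spark 2.x's lenient parser may still roll invalid dates over, in which case a stricter UDF would be needed.

    from pyspark.sql import functions as F

    validated = df.withColumn(
        "parsed_date", F.to_date(F.col("Date"), "dd/MM/yyyy")
    ).withColumn(
        "is_bad_record", F.col("parsed_date").isNull()   # null => did not match dd/MM/yyyy
    )

    bad_records = validated.filter(F.col("is_bad_record"))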