pyspark

Pyspark: Extract date from Datetime value

Submitted by 只愿长相守 on 2020-05-28 15:18:25
Question: I am trying to figure out how to extract a date from a datetime value using PySpark SQL. The datetime values look like this:

    DateTime
    2018-05-21T00:00:00.000-04:00
    2016-02-22T02:00:02.234-06:00

When I load this into a Spark dataframe and try to extract the date (via Date() or Timestamp() and then Date()), I always get the error that a date or timestamp value is expected but a DateTime value was provided. Can someone help me with retrieving the date from this value? I think you need to
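
A minimal sketch of one approach (assuming Spark 2.2+, where to_timestamp accepts a Java date pattern; only the column name and sample values are taken from the excerpt): parse the ISO-8601 string with an explicit pattern, then take the date part with to_date.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp, to_date

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2018-05-21T00:00:00.000-04:00",), ("2016-02-22T02:00:02.234-06:00",)],
        ["DateTime"],
    )

    # parse the offset-aware string into a timestamp, then strip the time part
    df = df.withColumn("ts", to_timestamp("DateTime", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
    df = df.withColumn("date_only", to_date("ts"))
    df.show(truncate=False)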

How to write pyspark dataframe to HDFS and then how to read it back into dataframe?

Submitted by 一笑奈何 on 2020-05-28 13:46:55
Question: I have a very big PySpark dataframe, so I want to perform pre-processing on subsets of it and then store them to HDFS. Later I want to read all of them back and merge them together. Thanks.

回答1 (Answer 1): Writing a DataFrame to HDFS (Spark 1.6):

    ## df is an existing DataFrame object
    df.write.save('/target/path/', format='parquet', mode='append')

Some of the format options are csv, parquet, json, etc. Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
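
As a sketch of the round trip the answer starts to describe (written against the Spark 2.x SparkSession API rather than the Spark 1.6 SQLContext in the excerpt; the path is a placeholder): append each pre-processed subset under one directory, then read the whole directory back as a single DataFrame.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    subset_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # append mode lets several pre-processed subsets accumulate under the same HDFS path
    subset_df.write.save("/target/path/", format="parquet", mode="append")

    # reading the directory merges all the parquet parts into one DataFrame
    merged_df = spark.read.parquet("/target/path/")
    merged_df.show()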

How to load local resource from a python package loaded in AWS PySpark

Submitted by 余生颓废 on 2020-05-28 11:59:10
Question: I have uploaded a Python package into AWS EMR with PySpark. My Python package has a structure like the following, where I have a resource file (a sklearn joblib model) within the package:

    myetllib
    ├── Dockerfile
    ├── __init__.py
    ├── modules
    │   ├── bin
    │   ├── joblib
    │   ├── joblib-0.14.1.dist-info
    │   ├── numpy
    │   ├── numpy-1.18.4.dist-info
    │   ├── numpy.libs
    │   ├── scikit_learn-0.21.3.dist-info
    │   ├── scipy
    │   ├── scipy-1.4.1.dist-info
    │   └── sklearn
    ├── requirements.txt
    └── mysubmodule
        ├── __init__.py
        ├──
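
The excerpt cuts off before the loading code, so the following is only a hedged sketch of a common pattern: resolve the model file relative to the installed package with pkgutil, so it is found wherever Spark unpacks the package on the executors. The resource path models/model.joblib is hypothetical, not taken from the question.

    import io
    import pkgutil

    import joblib

    # pkgutil.get_data returns the raw bytes of a file bundled inside the package
    raw = pkgutil.get_data("myetllib", "models/model.joblib")  # hypothetical resource path
    model = joblib.load(io.BytesIO(raw))  # joblib.load accepts a file-like object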

pyspark RDD word calculate

Submitted by 心已入冬 on 2020-05-28 11:53:25
Question: I have a dataframe with text and category. I want to count the words which are common in these categories. I am using nltk to remove the stop words and tokenize, however I am not able to include the category in the process. Below is my sample code for the problem:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession, Row
    import nltk

    spark_conf = SparkConf() \
        .setAppName("test")
    sc = SparkContext.getOrCreate(spark_conf)

    def wordTokenize(x):
        words = [word for line in x for
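
Since the sample code is truncated, here is a sketch of one way to keep the category in the counts using Spark's own split/explode instead of nltk inside an RDD (the tiny stop-word list is a placeholder for nltk.corpus.stopwords):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, lower, col

    spark = SparkSession.builder.appName("word-count-by-category").getOrCreate()
    df = spark.createDataFrame(
        [("sports", "the team won the game"), ("tech", "the new phone is fast")],
        ["category", "text"],
    )
    stop_words = ["the", "is", "a"]  # placeholder for nltk's stop-word list

    # one row per (category, word), stop words removed
    words = df.select("category", explode(split(lower(col("text")), r"\s+")).alias("word"))
    words = words.filter(~col("word").isin(stop_words))

    # word frequencies per category
    words.groupBy("category", "word").count().show()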

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

Submitted by |▌冷眼眸甩不掉的悲伤 on 2020-05-28 04:17:46
Question: I am using PySpark with Python 2.7. I have a date column stored as a string (with milliseconds) and would like to convert it to a timestamp. This is what I have tried so far:

    df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')))

printSchema() shows end_time: string (nullable = true), whereas I expected timestamp as the type of the variable.

回答1 (Answer 1): Try using from_utc_timestamp:

    from pyspark.sql.functions import from_utc_timestamp
    df = df.withColumn('end_time', from_utc_timestamp(df
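
A sketch of why the original attempt stays a string (not the accepted answer, which the excerpt cuts off): Spark expects Java SimpleDateFormat patterns such as yyyy-MM-dd HH:mm:ss.SSS, not Python's %Y-%M-%d, and unix_timestamp drops the milliseconds anyway, so to_timestamp (Spark 2.2+) is a cleaner fit. The sample value is made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2020-05-28 04:17:46.123",)], ["end_time"])

    # Java-style pattern, fractional seconds preserved
    df = df.withColumn("end_time", to_timestamp("end_time", "yyyy-MM-dd HH:mm:ss.SSS"))
    df.printSchema()  # end_time: timestamp (nullable = true)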

Converting rdd to dataframe: AttributeError: 'RDD' object has no attribute 'toDF' [duplicate]

Submitted by 一世执手 on 2020-05-27 09:18:09
Question: This question already has answers here: 'PipelinedRDD' object has no attribute 'toDF' in PySpark (2 answers). Closed 2 years ago.

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("myApp").setMaster("local")
    sc = SparkContext(conf=conf)

    a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
    a.show()

Results in:

    Traceback (most recent call last):
      File "/Users/ktemlyakov/messing_around/SparkStuff
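
The linked duplicate covers the usual fix; as a sketch, toDF is only attached to RDDs once a SQLContext or SparkSession has been created, so creating one before calling toDF makes the snippet work:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("myApp").master("local").getOrCreate()
    sc = spark.sparkContext  # creating the session is what enables RDD.toDF

    a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(["ind", "state"])
    a.show()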

pyspark: arrays_zip equivalent in Spark 2.3

Submitted by 亡梦爱人 on 2020-05-27 03:31:12
Question: How can I write the equivalent of the arrays_zip function in Spark 2.3? Source code from Spark 2.4:

    def arrays_zip(*cols):
        """
        Collection function: Returns a merged array of structs in which the N-th struct contains all
        N-th values of input arrays.

        :param cols: columns of arrays to be merged.

        >>> from pyspark.sql.functions import arrays_zip
        >>> df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
        >>> df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
        [Row(zipped
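
One possible Spark 2.3 stand-in (a sketch, not necessarily the answer the excerpt is heading toward): zip the arrays in a Python UDF and return an array of structs with an explicit schema.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2, 3], [2, 3, 4])], ["vals1", "vals2"])

    # array<struct<vals1:int, vals2:int>>, mirroring what arrays_zip returns in 2.4
    zip_schema = ArrayType(StructType([
        StructField("vals1", IntegerType()),
        StructField("vals2", IntegerType()),
    ]))

    @udf(returnType=zip_schema)
    def arrays_zip_(a, b):
        return list(zip(a, b))

    df.select(arrays_zip_("vals1", "vals2").alias("zipped")).show(truncate=False)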

Filter pyspark dataframe if contains a list of strings

Submitted by 自作多情 on 2020-05-26 10:10:06
Question: Suppose we have a PySpark dataframe in which one of the columns (column_a) contains some string values, and there is also a list of strings (list_a).

Dataframe:

    column_a     | count
    some_string  | 10
    another_one  | 20
    third_string | 30

list_a:

    ['string', 'third', ...]

I want to filter this dataframe and only keep the rows where column_a's value contains one of list_a's items. This is the code that works to filter column_a against a single string:

    df['column_a'].like('%string_value%')

But how
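
The excerpt stops before the answer, so here is a sketch of one common way to extend the single-string filter: build one contains condition per list item and OR them together with functools.reduce.

    from functools import reduce
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("some_string", 10), ("another_one", 20), ("third_string", 30)],
        ["column_a", "count"],
    )
    list_a = ["string", "third"]

    # keep a row if column_a contains any item of list_a
    condition = reduce(
        lambda acc, s: acc | col("column_a").contains(s),
        list_a[1:],
        col("column_a").contains(list_a[0]),
    )
    df.filter(condition).show()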