pyspark

Pyspark: Extract date from Datetime value

Submitted by 只愿长相守 on 2020-05-28 15:18:25
Question: I am trying to figure out how to extract a date from a datetime value using PySpark SQL. The datetime values look like this:

    DateTime
    2018-05-21T00:00:00.000-04:00
    2016-02-22T02:00:02.234-06:00

When I load this into a Spark dataframe and try to extract the date (via Date() or Timestamp() and then Date()), I always get the error that a date or timestamp value is expected but a DateTime value was provided. Can someone help me with retrieving the date from this value? I think you need to
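
A minimal sketch of one approach (assuming Spark 2.2+, where to_timestamp accepts a Java date pattern; only the column name and sample values are taken from the excerpt): parse the ISO-8601 string with an explicit pattern, then take the date part with to_date.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp, to_date

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2018-05-21T00:00:00.000-04:00",), ("2016-02-22T02:00:02.234-06:00",)],
        ["DateTime"],
    )

    # parse the offset-aware string into a timestamp, then strip the time part
    df = df.withColumn("ts", to_timestamp("DateTime", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
    df = df.withColumn("date_only", to_date("ts"))
    df.show(truncate=False)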

How to write pyspark dataframe to HDFS and then how to read it back into dataframe?

Submitted by 一笑奈何 on 2020-05-28 13:46:55
Question: I have a very big PySpark dataframe, so I want to perform pre-processing on subsets of it and then store them to HDFS. Later I want to read all of them back and merge them together. Thanks.

回答1 (Answer 1): Writing a DataFrame to HDFS (Spark 1.6):

    ## df is an existing DataFrame object
    df.write.save('/target/path/', format='parquet', mode='append')

Some of the format options are csv, parquet, json, etc. Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
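
As a sketch of the round trip the answer starts to describe (written against the Spark 2.x SparkSession API rather than the Spark 1.6 SQLContext in the excerpt; the path is a placeholder): append each pre-processed subset under one directory, then read the whole directory back as a single DataFrame.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    subset_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # append mode lets several pre-processed subsets accumulate under the same HDFS path
    subset_df.write.save("/target/path/", format="parquet", mode="append")

    # reading the directory merges all the parquet parts into one DataFrame
    merged_df = spark.read.parquet("/target/path/")
    merged_df.show()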

How to load local resource from a python package loaded in AWS PySpark

Submitted by 余生颓废 on 2020-05-28 11:59:10
Question: I have uploaded a Python package into AWS EMR with PySpark. My Python package has a structure like the following, where I have a resource file (a sklearn joblib model) within the package:

    myetllib
    ├── Dockerfile
    ├── __init__.py
    ├── modules
    │   ├── bin
    │   ├── joblib
    │   ├── joblib-0.14.1.dist-info
    │   ├── numpy
    │   ├── numpy-1.18.4.dist-info
    │   ├── numpy.libs
    │   ├── scikit_learn-0.21.3.dist-info
    │   ├── scipy
    │   ├── scipy-1.4.1.dist-info
    │   └── sklearn
    ├── requirements.txt
    └── mysubmodule
        ├── __init__.py
        ├──
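
The excerpt cuts off before the loading code, so the following is only a hedged sketch of a common pattern: resolve the model file relative to the installed package with pkgutil, so it is found wherever Spark unpacks the package on the executors. The resource path models/model.joblib is hypothetical, not taken from the question.

    import io
    import pkgutil

    import joblib

    # pkgutil.get_data returns the raw bytes of a file bundled inside the package
    raw = pkgutil.get_data("myetllib", "models/model.joblib")  # hypothetical resource path
    model = joblib.load(io.BytesIO(raw))  # joblib.load accepts a file-like object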

pyspark RDD word calculate

Submitted by 心已入冬 on 2020-05-28 11:53:25
Question: I have a dataframe with text and category. I want to count the words which are common in these categories. I am using nltk to remove the stop words and tokenize, however I am not able to include the category in the process. Below is my sample code for the problem:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession, Row
    import nltk

    spark_conf = SparkConf() \
        .setAppName("test")
    sc = SparkContext.getOrCreate(spark_conf)

    def wordTokenize(x):
        words = [word for line in x for
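
Since the sample code is truncated, here is a sketch of one way to keep the category in the counts using Spark's own split/explode instead of nltk inside an RDD (the tiny stop-word list is a placeholder for nltk.corpus.stopwords):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, lower, col

    spark = SparkSession.builder.appName("word-count-by-category").getOrCreate()
    df = spark.createDataFrame(
        [("sports", "the team won the game"), ("tech", "the new phone is fast")],
        ["category", "text"],
    )
    stop_words = ["the", "is", "a"]  # placeholder for nltk's stop-word list

    # one row per (category, word), stop words removed
    words = df.select("category", explode(split(lower(col("text")), r"\s+")).alias("word"))
    words = words.filter(~col("word").isin(stop_words))

    # word frequencies per category
    words.groupBy("category", "word").count().show()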

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

Submitted by |▌冷眼眸甩不掉的悲伤 on 2020-05-28 04:17:46
Question: I am using PySpark with Python 2.7. I have a date column stored as a string (with milliseconds) and would like to convert it to a timestamp. This is what I have tried so far:

    df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')))

printSchema() shows end_time: string (nullable = true), whereas I expected timestamp as the type of the variable.

回答1 (Answer 1): Try using from_utc_timestamp:

    from pyspark.sql.functions import from_utc_timestamp
    df = df.withColumn('end_time', from_utc_timestamp(df
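
A sketch of why the original attempt stays a string (not the accepted answer, which the excerpt cuts off): Spark expects Java SimpleDateFormat patterns such as yyyy-MM-dd HH:mm:ss.SSS, not Python's %Y-%M-%d, and unix_timestamp drops the milliseconds anyway, so to_timestamp (Spark 2.2+) is a cleaner fit. The sample value is made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2020-05-28 04:17:46.123",)], ["end_time"])

    # Java-style pattern, fractional seconds preserved
    df = df.withColumn("end_time", to_timestamp("end_time", "yyyy-MM-dd HH:mm:ss.SSS"))
    df.printSchema()  # end_time: timestamp (nullable = true)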

Converting rdd to dataframe: AttributeError: 'RDD' object has no attribute 'toDF' [duplicate]

Submitted by 一世执手 on 2020-05-27 09:18:09
Question: This question already has answers here: 'PipelinedRDD' object has no attribute 'toDF' in PySpark (2 answers). Closed 2 years ago.

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("myApp").setMaster("local")
    sc = SparkContext(conf=conf)

    a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
    a.show()

Results in:

    Traceback (most recent call last):
      File "/Users/ktemlyakov/messing_around/SparkStuff
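
The linked duplicate covers the usual fix; as a sketch, toDF is only attached to RDDs once a SQLContext or SparkSession has been created, so creating one before calling toDF makes the snippet work:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("myApp").master("local").getOrCreate()
    sc = spark.sparkContext  # creating the session is what enables RDD.toDF

    a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(["ind", "state"])
    a.show()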

pyspark: arrays_zip equivalent in Spark 2.3

Submitted by 亡梦爱人 on 2020-05-27 03:31:12
Question: How can I write the equivalent of the arrays_zip function in Spark 2.3? Source code from Spark 2.4:

    def arrays_zip(*cols):
        """
        Collection function: Returns a merged array of structs in which the N-th struct contains all
        N-th values of input arrays.

        :param cols: columns of arrays to be merged.

        >>> from pyspark.sql.functions import arrays_zip
        >>> df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
        >>> df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
        [Row(zipped
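
One possible Spark 2.3 stand-in (a sketch, not necessarily the answer the excerpt is heading toward): zip the arrays in a Python UDF and return an array of structs with an explicit schema.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2, 3], [2, 3, 4])], ["vals1", "vals2"])

    # array<struct<vals1:int, vals2:int>>, mirroring what arrays_zip returns in 2.4
    zip_schema = ArrayType(StructType([
        StructField("vals1", IntegerType()),
        StructField("vals2", IntegerType()),
    ]))

    @udf(returnType=zip_schema)
    def arrays_zip_(a, b):
        return list(zip(a, b))

    df.select(arrays_zip_("vals1", "vals2").alias("zipped")).show(truncate=False)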

Filter pyspark dataframe if contains a list of strings

Submitted by 自作多情 on 2020-05-26 10:10:06
Question: Suppose we have a PySpark dataframe in which one of the columns (column_a) contains some string values, and there is also a list of strings (list_a).

Dataframe:

    column_a     | count
    some_string  | 10
    another_one  | 20
    third_string | 30

list_a:

    ['string', 'third', ...]

I want to filter this dataframe and only keep the rows where column_a's value contains one of list_a's items. This is the code that works to filter column_a against a single string:

    df['column_a'].like('%string_value%')

But how
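
The excerpt stops before the answer, so here is a sketch of one common way to extend the single-string filter: build one contains condition per list item and OR them together with functools.reduce.

    from functools import reduce
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("some_string", 10), ("another_one", 20), ("third_string", 30)],
        ["column_a", "count"],
    )
    list_a = ["string", "third"]

    # keep a row if column_a contains any item of list_a
    condition = reduce(
        lambda acc, s: acc | col("column_a").contains(s),
        list_a[1:],
        col("column_a").contains(list_a[0]),
    )
    df.filter(condition).show()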