pyspark

Splitting a dictionary in a Pyspark dataframe into individual columns

Submitted by 不想你离开 on 2020-01-03 03:24:28
Question: I have a dataframe (in PySpark) in which one of the column values is a dictionary:

df.show()

And it looks like:

+----+---+-----------------------------+
|name|age|info                         |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda}    |
+----+---+-----------------------------+

Based on the comments, to give more detail:

df.printSchema()

The types are strings:

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- dict: string
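Not part of the original post, but for context: since the value is a plain string rather than valid JSON, one possible approach is to strip the braces, parse the string with str_to_map, and then promote known keys to columns. A minimal sketch, assuming the column is named info and the keys of interest are color and car:

from pyspark.sql import functions as F

# Strip the surrounding braces, then parse "key: value" pairs into a map column.
df2 = df.withColumn(
    "info_map",
    F.expr("str_to_map(regexp_replace(info, '[{}]', ''), ', ', ': ')")
)

# Promote the keys of interest to top-level columns.
df2 = df2.withColumn("color", F.col("info_map").getItem("color")) \
         .withColumn("car", F.col("info_map").getItem("car"))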

Pyspark ML: how to get subModels values with CrossValidator()

Submitted by 我是研究僧i on 2020-01-03 03:17:05
Question: I would like to get the cross-validation's (internal) training accuracy, using PySpark and its ML library:

lr = LogisticRegression()
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.5])
              .addGrid(lr.maxIter, [5, 10])
              .addGrid(lr.elasticNetParam, [0.01, 0.1])
              .build())
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=5)
model_cv = cv.fit(train)
predictions_lr =
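Not from the original thread: one way to get at the per-fold models is CrossValidator's collectSubModels option (available in newer Spark versions), which keeps every fitted sub-model so each one can be scored separately. A minimal sketch under that assumption; note the fold splits themselves are not exposed, so the example scores each sub-model on the full training set:

cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=5,
                    collectSubModels=True)
model_cv = cv.fit(train)

# subModels is a list with one entry per fold; each entry holds one fitted
# model per parameter combination in the grid.
for fold_idx, fold_models in enumerate(model_cv.subModels):
    for grid_idx, sub_model in enumerate(fold_models):
        # Score with the evaluator's metric (f1 by default for this evaluator).
        score = evaluator.evaluate(sub_model.transform(train))
        print(fold_idx, grid_idx, score)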

Pyspark dataframes: Extract a column based on the value of another column

Submitted by 落花浮王杯 on 2020-01-03 03:16:10
Question: I have a dataframe with the following columns and corresponding values (forgive my formatting, but I don't know how to put it in table format):

Src_ip  dst_ip  V1  V2  V3  top
"A"     "B"     xx  yy  zz  "V1"

Now I want to add a column, let's say top_value, which takes the value of the column whose name appears in "top":

Src_ip  dst_ip  V1  V2  V3  top   top_value
"A"     "B"     xx  yy  zz  "V1"  xx

So basically, get the value of the column named by the value in the column "top" and put it in a new column named "top_value". I have tried by
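A sketch of one possible approach (not from the original post): build one when() per candidate column and coalesce them, so the branch whose column name matches top supplies the value. Column names below are the ones from the example:

from pyspark.sql import functions as F

candidate_cols = ["V1", "V2", "V3"]

# Only the when() whose column name matches `top` is non-null, so coalesce
# picks exactly that column's value.
top_value = F.coalesce(*[F.when(F.col("top") == c, F.col(c)) for c in candidate_cols])

df = df.withColumn("top_value", top_value)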

How to create a HDInsightOnDemand LinkedService with a script action in Data Factory?

Submitted by 纵饮孤独 on 2020-01-03 02:47:07
Question: We are creating a Data Factory for running a PySpark job that uses an HDInsight on-demand cluster. The problem is that we need additional Python dependencies for this job, such as numpy, that are not installed. We believe the way to do this is to configure a Script Action for the HDInsightOnDemandLinkedService, but we cannot find this option in Data Factory or the Linked Services. Is there an alternative for automating the installation of the dependencies on the on-demand HDInsight cluster?

Answer 1:
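The answer above is cut off; for context only, Data Factory v2 lets the on-demand HDInsight linked service declare scriptActions in its typeProperties, which run a script (for example one that pip-installs numpy) when the cluster is provisioned. A rough, unverified sketch of that fragment of the linked-service JSON; the script name and URI are placeholders, and the other required typeProperties (cluster size, storage linked service, and so on) are omitted:

{
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "scriptActions": [
                {
                    "name": "installPythonDeps",
                    "uri": "https://<storage-account>.blob.core.windows.net/scripts/install-deps.sh",
                    "roles": "workernode headnode",
                    "parameters": ""
                }
            ]
        }
    }
}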

Pyspark : How to pick the values till last from the first occurrence in an array based on the matching values in another column

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-03 02:46:22
Question: I have a dataframe where I need to search for the value in one column (StringType) within another column (ArrayType), and I want to pick the values of the array column from the first occurrence of that value through to the last value of the array. Explained below with examples.

Input DF:

Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]

Output DF should look like the below:

Employee_Name
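Not from the original thread: on Spark 2.4+ this can be sketched with array_position and slice, taking the slice from the first occurrence of Employee_ID to the end of the array. This assumes the ID is always present in the array; the output column name Mapped_From_ID is made up:

from pyspark.sql import functions as F

# slice(arr, start, length): start at the first occurrence of Employee_ID and
# take at most size(arr) elements, i.e. everything through the end.
df2 = df.withColumn(
    "Mapped_From_ID",
    F.expr("slice(Mapped_Project_ID, "
           "array_position(Mapped_Project_ID, Employee_ID), "
           "size(Mapped_Project_ID))")
)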

Training a Machine Learning model on selected parts of DataFrame in PySpark

Submitted by 老子叫甜甜 on 2020-01-03 02:41:07
Question: I need to run a Random Forest on a dataset. The dataset is in a DataFrame organised as follows:

training_set_all = sc.parallelize([
    ['u1', 1, 0.9, 0.5, 0.0],
    ['u1', 0, 0.5, 0.1, 0.0],
    ['u2', 1, 0.3, 0.3, 0.8],
    ['u3', 1, 0.2, 0.2, 0.6],
    ['u2', 0, 0.0, 0.1, 0.4],
    ...
]).toDF(('status', 'user', 'product', 'f1', 'f2', 'f3'))

Basically there is a user, the class (the target variable, 1 or 0) and then three numerical float features. In practice every user has its own training set and it is all
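Not from the original post: one straightforward (if not the most scalable) pattern is to loop over users, filter the DataFrame, and fit one RandomForestClassifier per user. A minimal sketch, assuming columns named user, label, f1, f2, f3 (the column names in the snippet above do not quite line up with the rows):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Collect the distinct user ids to the driver, then fit one model per user.
models = {}
for row in training_set_all.select("user").distinct().collect():
    user_df = training_set_all.filter(training_set_all["user"] == row["user"])
    models[row["user"]] = rf.fit(assembler.transform(user_df))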

How to get values from RDD dynamically with Python?

Submitted by 风流意气都作罢 on 2020-01-03 02:21:29
Question: Below is a sample record for a book in our system on campus. Each book record is a text file. I have loaded up the records with:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

This gives me an RDD. One record in the RDD looks like this:

[['Call No: 56CB', 'Title: Global Warming', 'Type: Serial,', 'Database: AWS898,', 'Microfilm: Y,', 'Access: Public ,', ]]

I am trying to extract the values in the 4 to N tuple positions of the RDD. Tuples 0 through 4 are always there, but the RDD may be
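Not part of the original post: assuming each record has already been parsed into the nested-list form shown above (parsed_books below is a placeholder name for an RDD in that form), one way to pull out the trailing fields is to slice from position 4 onward and split each "Key: value" string. A hypothetical sketch:

# Keep fields from position 4 onward and split each "Key: value" string
# into a (key, value) pair; the outer single-element list is unwrapped first.
def tail_fields(record):
    fields = record[0]
    return [tuple(part.strip() for part in f.split(':', 1)) for f in fields[4:]]

tails = parsed_books.map(tail_fields)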

Make and populate a PySpark dataframe with columns as period_range

Submitted by 安稳与你 on 2020-01-03 01:41:06
Question: I have a PySpark dataframe like this:

+----------+--------+----------+----------+
|id_       | p      |d1        | d2       |
+----------+--------+----------+----------+
| 1        | A      |2018-09-26|2018-10-26|
| 2        | B      |2018-06-21|2018-07-19|
| 2        | B      |2018-08-13|2018-10-07|
| 2        | B      |2018-12-31|2019-02-27|
| 2        | B      |2019-05-28|2019-06-25|
| 3        | C      |2018-06-15|2018-07-13|
| 3        | C      |2018-08-15|2018-10-09|
| 3        | C      |2018-12-03|2019-03-12|
| 3        | C      |2019-05-10|2019-06-07|
| 4        | A      |2019-01-30|2019-03-01|
| 4        | A      |2019-05-30|2019-07-25|
| 5
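Not from the original thread: a sketch of one way to get period_range-like columns is to expand each (d1, d2) interval into one row per month with sequence() (Spark 2.4+) and then pivot those months into columns. This assumes d1 and d2 are date columns (cast them first if they are strings); the count aggregation in the pivot is just illustrative:

from pyspark.sql import functions as F

# One row per month covered by the [d1, d2] interval.
months = df.withColumn(
    "month",
    F.explode(F.expr("sequence(trunc(d1, 'MM'), trunc(d2, 'MM'), interval 1 month)"))
)

# Pivot the months into columns, counting how many intervals cover each month.
wide = months.groupBy("id_", "p").pivot("month").count()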

Split one column based on the value of another column in pyspark [duplicate]

Submitted by 懵懂的女人 on 2020-01-03 00:56:08
Question: This question already has an answer here: Using a column value as a parameter to a spark DataFrame function (1 answer). Closed 8 months ago.

I have the following data frame:

+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+

I want to find the relative path for each value of the column "item" by locating that value in the column "path" and extracting the path's LHS, as shown below:

+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|  a
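Not from the original thread: since the lookup value lives in a column, a SQL expression can parameterise one column by another. A minimal sketch that keeps everything in path up to and including the item value:

from pyspark.sql import functions as F

# instr() finds the 1-based position of `item` inside `path`; substring()
# then keeps everything up to and including that occurrence.
df = df.withColumn(
    "rel_path",
    F.expr("substring(path, 1, instr(path, item) + length(item) - 1)")
)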

Passing class functions to PySpark RDD

Submitted by 独自空忆成欢 on 2020-01-02 22:01:37
Question: I have a class named some_class() in a Python file here:

/some-folder/app/bin/file.py

I am importing it into my code here:

/some-folder2/app/code/file2.py

by doing:

import sys
sys.path.append('/some-folder/app/bin')
from file import some_class

clss = some_class()

I want to use this class's function named some_function in a Spark map:

sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x))

This is giving me an error: No module named file. While class.some_function, when I am calling it
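Not from the original thread: a common workaround is to ship the module to the executors with sc.addPyFile, so the serialized function can re-import it on the workers. A minimal sketch using the paths from the question:

# Make file.py importable on every executor, not just the driver.
sc.addPyFile('/some-folder/app/bin/file.py')

from file import some_class

def apply_some_function(x):
    # Construct the object inside the function so only the module reference,
    # not a driver-side instance, needs to be shipped to the workers.
    return some_class().some_function(x)

result = sc.parallelize(some_data_iterator).map(apply_some_function)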