pyspark

Splitting a dictionary in a Pyspark dataframe into individual columns

Submitted by 不想你离开 on 2020-01-03 03:24:28
Question: I have a dataframe (in PySpark) in which one of the column values is a dictionary:

df.show()

And it looks like:

+----+---+-----------------------------+
|name|age|info                         |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda}    |
+----+---+-----------------------------+

Based on the comments, to give more detail:

df.printSchema()

The types are strings:

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- dict: string
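Not part of the original post, but for context: since the value is a plain string rather than valid JSON, one possible approach is to strip the braces, parse the string with str_to_map, and then promote known keys to columns. A minimal sketch, assuming the column is named info and the keys of interest are color and car:

from pyspark.sql import functions as F

# Strip the surrounding braces, then parse "key: value" pairs into a map column.
df2 = df.withColumn(
    "info_map",
    F.expr("str_to_map(regexp_replace(info, '[{}]', ''), ', ', ': ')")
)

# Promote the keys of interest to top-level columns.
df2 = df2.withColumn("color", F.col("info_map").getItem("color")) \
         .withColumn("car", F.col("info_map").getItem("car"))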

Pyspark ML: how to get subModels values with CrossValidator()

Submitted by 我是研究僧i on 2020-01-03 03:17:05
Question: I would like to get the cross-validation's (internal) training accuracy, using PySpark and its ML library:

lr = LogisticRegression()
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.5])
              .addGrid(lr.maxIter, [5, 10])
              .addGrid(lr.elasticNetParam, [0.01, 0.1])
              .build())
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=5)
model_cv = cv.fit(train)
predictions_lr =
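Not from the original thread: one way to get at the per-fold models is CrossValidator's collectSubModels option (available in newer Spark versions), which keeps every fitted sub-model so each one can be scored separately. A minimal sketch under that assumption; note the fold splits themselves are not exposed, so the example scores each sub-model on the full training set:

cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=5,
                    collectSubModels=True)
model_cv = cv.fit(train)

# subModels is a list with one entry per fold; each entry holds one fitted
# model per parameter combination in the grid.
for fold_idx, fold_models in enumerate(model_cv.subModels):
    for grid_idx, sub_model in enumerate(fold_models):
        # Score with the evaluator's metric (f1 by default for this evaluator).
        score = evaluator.evaluate(sub_model.transform(train))
        print(fold_idx, grid_idx, score)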

Pyspark dataframes: Extract a column based on the value of another column

Submitted by 落花浮王杯 on 2020-01-03 03:16:10
Question: I have a dataframe with the following columns and corresponding values (forgive my formatting, but I don't know how to put it in table format):

Src_ip  dst_ip  V1  V2  V3  top
"A"     "B"     xx  yy  zz  "V1"

Now I want to add a column, let's say top_value, which takes the value of the column whose name appears in "top":

Src_ip  dst_ip  V1  V2  V3  top   top_value
"A"     "B"     xx  yy  zz  "V1"  xx

So basically, get the value of the column named by the value in the column "top" and put it in a new column named "top_value". I have tried by
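A sketch of one possible approach (not from the original post): build one when() per candidate column and coalesce them, so the branch whose column name matches top supplies the value. Column names below are the ones from the example:

from pyspark.sql import functions as F

candidate_cols = ["V1", "V2", "V3"]

# Only the when() whose column name matches `top` is non-null, so coalesce
# picks exactly that column's value.
top_value = F.coalesce(*[F.when(F.col("top") == c, F.col(c)) for c in candidate_cols])

df = df.withColumn("top_value", top_value)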

How to create a HDInsightOnDemand LinkedService with a script action in Data Factory?

Submitted by 纵饮孤独 on 2020-01-03 02:47:07
Question: We are creating a Data Factory for running a PySpark job that uses an HDInsight on-demand cluster. The problem is that we need additional Python dependencies for this job, such as numpy, that are not installed. We believe the way to do this is to configure a Script Action for the HDInsightOnDemandLinkedService, but we cannot find this option in Data Factory or the Linked Services. Is there an alternative for automating the installation of the dependencies on the on-demand HDInsight cluster?

Answer 1:
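The answer above is cut off; for context only, Data Factory v2 lets the on-demand HDInsight linked service declare scriptActions in its typeProperties, which run a script (for example one that pip-installs numpy) when the cluster is provisioned. A rough, unverified sketch of that fragment of the linked-service JSON; the script name and URI are placeholders, and the other required typeProperties (cluster size, storage linked service, and so on) are omitted:

{
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "scriptActions": [
                {
                    "name": "installPythonDeps",
                    "uri": "https://<storage-account>.blob.core.windows.net/scripts/install-deps.sh",
                    "roles": "workernode headnode",
                    "parameters": ""
                }
            ]
        }
    }
}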

Pyspark : How to pick the values till last from the first occurrence in an array based on the matching values in another column

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-03 02:46:22
Question: I have a dataframe where I need to search for the value in one column (StringType) within another column (ArrayType), and I want to pick the values of the array column from the first occurrence of that value through to the last value of the array. Explained below with examples.

Input DF:

Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]

Output DF should look like the below:

Employee_Name
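Not from the original thread: on Spark 2.4+ this can be sketched with array_position and slice, taking the slice from the first occurrence of Employee_ID to the end of the array. This assumes the ID is always present in the array; the output column name Mapped_From_ID is made up:

from pyspark.sql import functions as F

# slice(arr, start, length): start at the first occurrence of Employee_ID and
# take at most size(arr) elements, i.e. everything through the end.
df2 = df.withColumn(
    "Mapped_From_ID",
    F.expr("slice(Mapped_Project_ID, "
           "array_position(Mapped_Project_ID, Employee_ID), "
           "size(Mapped_Project_ID))")
)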

Training a Machine Learning model on selected parts of DataFrame in PySpark

Submitted by 老子叫甜甜 on 2020-01-03 02:41:07
Question: I need to run a Random Forest on a dataset. The dataset is in a DataFrame organised as follows:

training_set_all = sc.parallelize([
    ['u1', 1, 0.9, 0.5, 0.0],
    ['u1', 0, 0.5, 0.1, 0.0],
    ['u2', 1, 0.3, 0.3, 0.8],
    ['u3', 1, 0.2, 0.2, 0.6],
    ['u2', 0, 0.0, 0.1, 0.4],
    ...
]).toDF(('status', 'user', 'product', 'f1', 'f2', 'f3'))

Basically there is a user, the class (the target variable, 1 or 0) and then three numerical float features. In practice every user has its own training set and it is all
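Not from the original post: one straightforward (if not the most scalable) pattern is to loop over users, filter the DataFrame, and fit one RandomForestClassifier per user. A minimal sketch, assuming columns named user, label, f1, f2, f3 (the column names in the snippet above do not quite line up with the rows):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Collect the distinct user ids to the driver, then fit one model per user.
models = {}
for row in training_set_all.select("user").distinct().collect():
    user_df = training_set_all.filter(training_set_all["user"] == row["user"])
    models[row["user"]] = rf.fit(assembler.transform(user_df))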

How to get values from RDD dynamically with Python?

Submitted by 风流意气都作罢 on 2020-01-03 02:21:29
Question: Below is a sample record for a book in our system on campus. Each book record is a text file. I have loaded up the records with:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

This gives me an RDD. One record in the RDD looks like this:

[['Call No: 56CB', 'Title: Global Warming', 'Type: Serial,', 'Database: AWS898,', 'Microfilm: Y,', 'Access: Public ,', ]]

I am trying to extract the values in the 4 to N tuple positions of the RDD. Tuples 0 through 4 are always there, but the RDD may be
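Not part of the original post: assuming each record has already been parsed into the nested-list form shown above (parsed_books below is a placeholder name for an RDD in that form), one way to pull out the trailing fields is to slice from position 4 onward and split each "Key: value" string. A hypothetical sketch:

# Keep fields from position 4 onward and split each "Key: value" string
# into a (key, value) pair; the outer single-element list is unwrapped first.
def tail_fields(record):
    fields = record[0]
    return [tuple(part.strip() for part in f.split(':', 1)) for f in fields[4:]]

tails = parsed_books.map(tail_fields)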

Make and populate a PySpark dataframe with columns as period_range

Submitted by 安稳与你 on 2020-01-03 01:41:06
Question: I have a PySpark dataframe like this:

+----------+--------+----------+----------+
|id_       | p      |d1        | d2       |
+----------+--------+----------+----------+
| 1        | A      |2018-09-26|2018-10-26|
| 2        | B      |2018-06-21|2018-07-19|
| 2        | B      |2018-08-13|2018-10-07|
| 2        | B      |2018-12-31|2019-02-27|
| 2        | B      |2019-05-28|2019-06-25|
| 3        | C      |2018-06-15|2018-07-13|
| 3        | C      |2018-08-15|2018-10-09|
| 3        | C      |2018-12-03|2019-03-12|
| 3        | C      |2019-05-10|2019-06-07|
| 4        | A      |2019-01-30|2019-03-01|
| 4        | A      |2019-05-30|2019-07-25|
| 5
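Not from the original thread: a sketch of one way to get period_range-like columns is to expand each (d1, d2) interval into one row per month with sequence() (Spark 2.4+) and then pivot those months into columns. This assumes d1 and d2 are date columns (cast them first if they are strings); the count aggregation in the pivot is just illustrative:

from pyspark.sql import functions as F

# One row per month covered by the [d1, d2] interval.
months = df.withColumn(
    "month",
    F.explode(F.expr("sequence(trunc(d1, 'MM'), trunc(d2, 'MM'), interval 1 month)"))
)

# Pivot the months into columns, counting how many intervals cover each month.
wide = months.groupBy("id_", "p").pivot("month").count()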

Split one column based on the value of another column in pyspark [duplicate]

Submitted by 懵懂的女人 on 2020-01-03 00:56:08
Question: This question already has an answer here: Using a column value as a parameter to a spark DataFrame function (1 answer). Closed 8 months ago.

I have the following data frame:

+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+

I want to find the relative path for each value of the column "item" by locating that value in the column "path" and extracting the path's LHS, as shown below:

+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|  a
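Not from the original thread: since the lookup value lives in a column, a SQL expression can parameterise one column by another. A minimal sketch that keeps everything in path up to and including the item value:

from pyspark.sql import functions as F

# instr() finds the 1-based position of `item` inside `path`; substring()
# then keeps everything up to and including that occurrence.
df = df.withColumn(
    "rel_path",
    F.expr("substring(path, 1, instr(path, item) + length(item) - 1)")
)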

Passing class functions to PySpark RDD

Submitted by 独自空忆成欢 on 2020-01-02 22:01:37
Question: I have a class named some_class() in a Python file here:

/some-folder/app/bin/file.py

I am importing it into my code here:

/some-folder2/app/code/file2.py

by doing:

import sys
sys.path.append('/some-folder/app/bin')
from file import some_class

clss = some_class()

I want to use this class's function named some_function in a Spark map:

sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x))

This is giving me an error: No module named file. While class.some_function, when I am calling it
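Not from the original thread: a common workaround is to ship the module to the executors with sc.addPyFile, so the serialized function can re-import it on the workers. A minimal sketch using the paths from the question:

# Make file.py importable on every executor, not just the driver.
sc.addPyFile('/some-folder/app/bin/file.py')

from file import some_class

def apply_some_function(x):
    # Construct the object inside the function so only the module reference,
    # not a driver-side instance, needs to be shipped to the workers.
    return some_class().some_function(x)

result = sc.parallelize(some_data_iterator).map(apply_some_function)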