apache-spark-ml

Tagging columns as Categorical in Spark

五迷三道 submitted on 2020-01-02 10:18:34
Question: I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process. My questions are: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some metadata of some sort to indicate that the column is categorical? In mllib.tree.RF there was a parameter called categoricalInfo which indicated which columns were categorical.
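In Spark ML the categorical information travels as column metadata rather than as an explicit map the way it does in mllib.tree. A minimal sketch (column names colour, age, and label are hypothetical) showing where that metadata can be inspected:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# StringIndexer marks its output column as nominal (categorical) via ML attribute metadata.
indexer = StringIndexer(inputCol="colour", outputCol="colour_idx")
# VectorAssembler carries that metadata into the feature vector, which the tree learner reads.
assembler = VectorAssembler(inputCols=["colour_idx", "age"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

model = Pipeline(stages=[indexer, assembler, rf]).fit(train_df)  # train_df: hypothetical DataFrame

# Inspect the nominal-attribute metadata StringIndexer attached to its output column.
indexed = model.stages[0].transform(train_df)
print(indexed.schema["colour_idx"].metadata)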

Spark RandomForest training StackOverflow error

风格不统一 submitted on 2020-01-02 09:04:15
Question: I am training my model and I get a StackOverflowError whenever I increase maxDepth above 12. Everything works correctly for 5, 10, and 11. I am using Spark 2.0.2 (and I cannot upgrade it for the next couple of weeks). I have > 3M data points, 200 features, and 2500 trees, and I would like to improve accuracy by increasing the max depth. Is there a way to overcome this problem? Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in
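Not part of the original post, but two mitigations commonly suggested for deep-tree StackOverflowErrors are a larger JVM thread stack (set at submit time) and periodic checkpointing so the lineage built during training stays short. A hedged sketch; the stack size, checkpoint path, and column names are placeholders:

# Thread stack size is normally set when the application is submitted, e.g.:
#   spark-submit --conf spark.driver.extraJavaOptions=-Xss64m \
#                --conf spark.executor.extraJavaOptions=-Xss64m ...
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("rf-deep-trees").getOrCreate()
# Checkpointing truncates the lineage that iterative tree construction accumulates.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=2500, maxDepth=15,
                            cacheNodeIds=True,
                            checkpointInterval=10)  # checkpoint node IDs every 10 iterations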

Spark ML StringIndexer Different Labels Training/Testing

拟墨画扇 submitted on 2019-12-31 03:40:08
Question: I'm using Scala and StringIndexer to assign indices to each category in my training set. It assigns indices based on the frequency of each category. The problem is that in my testing data the frequencies of the categories are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly. I am processing the training/testing data in exactly the same way, and I don't save the model. I have tried manually
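The usual remedy, not spelled out in the excerpt, is to fit the indexer once on the training data and reuse that fitted model to transform the test data, so both sides share one label-to-index mapping. A minimal PySpark sketch with a hypothetical category column:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="category", outputCol="category_idx",
                        handleInvalid="keep")  # "keep" (Spark 2.2+) buckets labels unseen at fit time
indexer_model = indexer.fit(train_df)          # fit on the training data only

train_indexed = indexer_model.transform(train_df)
test_indexed = indexer_model.transform(test_df)  # reuses the training mapping

# Persist the fitted model so the exact same mapping can be applied later.
indexer_model.write().overwrite().save("/tmp/category_indexer")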

How to print the decision path / rules used to predict sample of a specific row in PySpark?

不打扰是莪最后的温柔 submitted on 2019-12-29 07:47:13
Question: How do I print the decision path of a specific sample in a Spark DataFrame? Spark version: '2.3.1'. The code below prints the decision path of the whole model; how can I make it print the decision path of a specific sample? For example, the decision path of the row where tagvalue ball equals 2.

import pyspark.sql.functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import DataFrame
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import
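As of Spark 2.3 there is no public per-sample decision-path API, so a common workaround is to score only the row of interest and match it by hand against the printed split rules. A hedged sketch; pipeline_model and df are placeholders for the objects built in the question, and the tree is assumed to be the last pipeline stage:

import pyspark.sql.functions as F

sample = df.filter(F.col("ball") == 2)   # the row where tagvalue ball equals 2
dt_model = pipeline_model.stages[-1]     # DecisionTreeClassificationModel
print(dt_model.toDebugString)            # full set of split rules for the tree

# Score just that row; its feature values can then be followed through the rules above.
pipeline_model.transform(sample).select("features", "prediction", "probability") \
    .show(truncate=False)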

Create labeledPoints from Spark DataFrame in Python

冷暖自知 submitted on 2019-12-28 11:49:35
Question: What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'? I create the Python DataFrame with this .map() function:

def parsePoint(line):
    listmp = list(line.split('\t'))
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    dataframe.insert(0, 'status', dataframe['accepted'])
    if 'NULL' in dataframe.columns:
        dataframe = dataframe
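A minimal sketch of the mapping being asked about, assuming df is a Spark DataFrame whose non-label columns are all numeric; the label is looked up by name, so it does not need to be the first column:

from pyspark.mllib.regression import LabeledPoint

feature_cols = [c for c in df.columns if c != 'status']

# One LabeledPoint per row: label taken from 'status', features from every other column.
labeled_points = df.rdd.map(
    lambda row: LabeledPoint(row['status'], [row[c] for c in feature_cols])
)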

Issues with Logistic Regression for multiclass classification using PySpark

女生的网名这么多〃 submitted on 2019-12-25 08:58:10
Question: I am trying to use Logistic Regression to classify datasets that have a SparseVector as the feature vector. For the full code base and error log, please check my GitHub repo.

Case 1: I tried using the ML pipeline as follows:

# imported libraries from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
print(type(trainingData))    # for checking only
print(trainingData.take(2))  # for checking the data type
lr = LogisticRegression
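For context, a minimal working version of the pipeline the excerpt starts to set up; the column names (words, label) and parameter values are assumptions, not taken from the linked repo:

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF
from pyspark.ml.classification import LogisticRegression

# HashingTF emits a SparseVector feature column directly from tokenized text.
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 18)
lr = LogisticRegression(labelCol="label", featuresCol="features",
                        maxIter=20, family="multinomial")  # multinomial for multiclass (Spark 2.1+)

pipeline = Pipeline(stages=[hashing_tf, lr])
model = pipeline.fit(trainingData)   # trainingData must be a DataFrame, not an RDD
predictions = model.transform(testData)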

How to train a SparkML gradient boosting classifier given an RDD

北慕城南 submitted on 2019-12-25 07:21:15
Question: Given the following RDD:

training_rdd = rdd.select(
    # Categorical features
    col('device_os'),  # 'ios', 'android'
    # Numeric features
    col('30day_click_count'),
    col('30day_impression_count'),
    np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
    # label
    col('did_click').alias('label')
)

I am confused about the syntax to train a gradient boosting classifier. I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification
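Not from the post, but the usual shape of the next step: spark.ml estimators train on a DataFrame with a single vector-valued features column, so the selected columns are assembled first. Column names are copied from the snippet; the StringIndexer for device_os is an added assumption because GBT needs numeric inputs:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier

# device_os is a string ('ios'/'android'), so index it before assembling features.
os_indexer = StringIndexer(inputCol="device_os", outputCol="device_os_idx")
assembler = VectorAssembler(
    inputCols=["device_os_idx", "30day_click_count",
               "30day_impression_count", "30day_click_through_rate"],
    outputCol="features")
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)

# Despite the name, rdd.select(...) returns a DataFrame, which is what fit() expects.
model = Pipeline(stages=[os_indexer, assembler, gbt]).fit(training_rdd)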

Spark cluster does not scale to small data

最后都变了- submitted on 2019-12-25 05:07:33
Question: I am currently evaluating Spark 2.1.0 on a small cluster (3 nodes with 32 CPUs and 128 GB RAM) with a linear regression benchmark (Spark ML). I only measured the time for the parameter calculation (not including startup, data loading, …) and observed the following behavior: for small datasets of 0.1 to 3 million data points, the measured time does not really increase and stays at about 40 seconds; only with larger datasets, such as 300 million data points, does the processing time go up to 200 seconds. So

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

末鹿安然 submitted on 2019-12-24 17:24:42
Question: Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations, as seen in the code. I am calculating the maximum number of categories and then passing it as a parameter to RF. This takes a lot of time! Is there a parameter to set, or an easier way, to make the model automatically infer the maximum number of categories? It can go above 1000, and I cannot omit those columns. How do I handle unseen labels on new data for prediction since
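Not from the post, but the two questions map onto two parameters that are often pointed at: handleInvalid="keep" on StringIndexer for labels unseen at training time, and a fixed, generous maxBins on the forest instead of recomputing the exact maximum each run. A hedged sketch with hypothetical column names:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

cat_cols = ["cat_col_1", "cat_col_2"]
# "keep" (Spark 2.2+) maps unseen labels to an extra index instead of failing at transform time.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]
assembler = VectorAssembler(inputCols=[c + "_idx" for c in cat_cols], outputCol="features")

# maxBins must be at least the largest category count of any categorical feature;
# an upper bound chosen once avoids the expensive per-run computation.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=1100)

model = Pipeline(stages=indexers + [assembler, rf]).fit(train_df)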

LDA model prediction inconsistency

冷暖自知 submitted on 2019-12-24 11:37:12
Question: I trained an LDA model and loaded it into the environment to transform new data:

from pyspark.ml.clustering import LocalLDAModel
lda = LocalLDAModel.load(path)
df = lda.transform(text)

The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input, otherwise the model is not consistent. However, in practice it is not. May I ask why, and how to fix it?

Answer 1: LDA uses randomness when training and, depending on the
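For reference, a hedged sketch of pinning the training seed; the parameter values are placeholders. As the (truncated) answer notes, this only addresses the randomness at training time; inference for new documents with the online optimizer can remain stochastic:

from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=50, optimizer="online", seed=42)  # fixed seed for reproducible training
lda_model = lda.fit(train_corpus)                          # train_corpus: hypothetical DataFrame
lda_model.write().overwrite().save(path)                   # reuse the same fitted model later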