random-forest

Random Forest Black Box with CleverHans

牧云@^-^@ submitted on 2019-12-06 15:33:19
I am new to this and am trying to attack a Random Forest with a black-box FGSM attack (from CleverHans), but I'm not sure how to implement it. They have a black-box example for the MNIST data, but I don't understand where I should plug in my random forest and where the attack happens. Any help would be appreciated. In the current tutorial, the black-box model is a neural network implemented with TensorFlow, and its predictions (the labels) are used to train a substitute model (a copy of the black-box model). The substitute model is then used to craft adversarial examples that transfer to the black-box model. In your …
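A minimal sketch of the substitute-model idea described above, with the CleverHans specifics swapped out: the random forest is the query-only oracle, a logistic regression stands in for the differentiable substitute, and the FGSM step is hand-rolled from the substitute's coefficients. The dataset and model choices are illustrative, not the tutorial's exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_oracle, X_sub = X[:1500], X[1500:]
y_oracle = y[:1500]

# 1. The black box: a random forest the attacker can only query.
black_box = RandomForestClassifier(n_estimators=100, random_state=0)
black_box.fit(X_oracle, y_oracle)

# 2. Train the substitute on the black box's *predicted* labels,
#    never on the true ones -- that is the black-box assumption.
sub_labels = black_box.predict(X_sub)
substitute = LogisticRegression(max_iter=1000).fit(X_sub, sub_labels)

# 3. FGSM on the substitute: for logistic regression the input gradient
#    of the loss is proportional to the coefficient vector, so the sign
#    of the perturbation flips with the predicted class.
eps = 0.5
w = substitute.coef_[0]
signs = np.where(sub_labels == 1, -1.0, 1.0)  # push across the boundary
X_adv = X_sub + eps * signs[:, None] * np.sign(w)

# 4. Transferability check: examples crafted on the substitute should
#    also degrade the black box's agreement with its own clean labels.
print(black_box.score(X_sub, sub_labels), black_box.score(X_adv, sub_labels))
```

The point is where each piece of the tutorial plugs in: the "where should I put my random forest" answer is step 1 (the oracle that only ever gets queried), and the attack itself always runs against the substitute in step 3.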

Bizarre Behavior of randomForest Package When Dropping One Prediction Class

a 夏天 submitted on 2019-12-06 11:41:59
Question: I am running a random forest model that produces results that make absolutely no sense to me from a statistical perspective, so I'm convinced something must be going wrong code-wise with the randomForest package. The predicted (left-hand-side) variable is, in at least this iteration of the model, a party ID with three possible outcomes: Democrat, Independent, Republican. I run the model and get results, fine. At this point I'm not so concerned with the results per se, but rather what …

Tagging columns as Categorical in Spark

て烟熏妆下的殇ゞ submitted on 2019-12-06 09:59:30
I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process. Some questions: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add metadata of some sort to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo that indicated which columns are categorical. How does ml.tree.RF know, since that parameter is not present? Also, StringIndexer maps categories to …
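For context on why the metadata question matters: Spark's StringIndexer does attach nominal-attribute metadata to its output column, which the ml trees read; scikit-learn trees carry no such metadata, so the encoding alone determines what the model sees. This hedged sklearn-only sketch (toy data) shows the difference between a plain indexed column and a one-hot layout:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# StringIndexer-style: one column of integers, categories sorted as
# blue=0, green=1, red=2 -- which silently imposes blue < green < red.
ordinal = OrdinalEncoder().fit_transform(colors)

# One column per level: no spurious ordering for a tree to exploit.
onehot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.ravel())  # [2. 1. 0. 1.]
print(onehot.shape)     # (4, 3)
```

In Spark the indexed column is usually safe to feed to ml trees precisely because of the attached metadata; in libraries without that metadata, one-hot encoding is the conservative choice.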

Handling categorical features using scikit-learn

£可爱£侵袭症+ submitted on 2019-12-06 06:01:25
What am I doing? I am solving a classification problem using Random Forests. I have a set of fixed-length strings (10 characters long) that represent DNA sequences. The DNA alphabet consists of four letters: A, C, G, T. Here's a sample of my raw data: ATGCTACTGA ACGTACTGAT AGCTATTGTA CGTGACTAGT TGACTATGAT Each DNA sequence comes with experimental data describing a real biological response; the molecule was seen to elicit a biological response (1), or not (0). Problem: The training set consists of both categorical (nominal) and numerical features. It has the following structure: …
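A minimal sketch of the usual approach for the sequence part: treat each of the 10 positions as its own 4-level categorical feature and one-hot encode it, so the forest never sees a fake ordering A < C < G < T. The sequences are taken from the sample above; the labels are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

seqs = ["ATGCTACTGA", "ACGTACTGAT", "AGCTATTGTA", "CGTGACTAGT", "TGACTATGAT"]
labels = [1, 0, 1, 0, 1]  # hypothetical biological responses

# One row per sequence, one column per position (shape (5, 10));
# OneHotEncoder then expands each position independently.
chars = np.array([list(s) for s in seqs])
X = OneHotEncoder().fit_transform(chars).toarray()  # at most 40 columns

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(X.shape)
```

Numerical features can then simply be stacked alongside the one-hot block (e.g. with `np.hstack` or a `ColumnTransformer`), since random forests are indifferent to feature scaling.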

Spark RandomForest training StackOverflow error

别说谁变了你拦得住时间么 submitted on 2019-12-06 05:59:25
I am running a training of my model and I get a StackOverflowError whenever I increase maxDepth above 12. Everything works correctly for 5, 10, and 11. I am using Spark 2.0.2 (and I cannot upgrade it for the next couple of weeks). I have over 3M rows, 200 features, and 2500 trees, and I would like to improve accuracy by increasing the max depth. Is there a way to overcome this problem? Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 789.0 failed 4 times, most recent failure: Lost task 92.3 in stage 789.0 (TID 66903, 10.0.0.11): java …
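The usual workaround for stack overflows in deep Spark tree training is to enlarge the JVM thread stack, since deeper trees deepen the recursion inside Spark's tree code. A hedged config sketch; the 16m value and the jar name are examples, not recommendations:

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Xss16m" \
  --conf "spark.executor.extraJavaOptions=-Xss16m" \
  your_training_job.jar
```

Setting a checkpoint directory (`sc.setCheckpointDir(...)`) together with the estimator's `checkpointInterval` parameter is another commonly suggested mitigation, since it truncates the growing RDD lineage across iterations.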

How to handle categorical features for Decision Tree, Random Forest in spark ml?

倖福魔咒の submitted on 2019-12-06 05:35:51
Question: I am trying to build decision tree and random forest classifiers on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (with string values) in the data set. The Spark ML documentation mentions that categorical variables can be converted to numeric by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer (VectorIndexer requires a vector feature and a VectorAssembler, which converts features …
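A hedged scikit-learn analog of the indexing-plus-assembly pipeline the question describes: encode the string columns, pass the numeric ones through, then fit the forest. The column names mimic the bank-marketing schema, but the rows are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "single", "married"],
    "age": [35, 41, 28, 52],
    "y": [1, 0, 0, 1],
})

# Categorical columns get one-hot encoded; 'age' passes through untouched.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["job", "marital"])],
    remainder="passthrough",
)

pipe = Pipeline([("pre", pre), ("rf", RandomForestClassifier(random_state=0))])
pipe.fit(df[["job", "marital", "age"]], df["y"])
print(pipe.predict(df[["job", "marital", "age"]]))
```

The structure mirrors the Spark version: the ColumnTransformer plays the StringIndexer/OneHotEncoder-plus-VectorAssembler role, and bundling it in a Pipeline keeps the encoding fitted only on training data.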

Extract and Visualize Model Trees from Sparklyr

泪湿孤枕 submitted on 2019-12-06 03:27:34
Question: Does anyone have advice on how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into (a) a format that can be understood by other R tree-related libraries and, ultimately, (b) a visualization of the trees for non-technical consumption? This would include the ability to map back to the actual feature names from the substituted string-indexing values produced by the vector assembler. The …
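The sparklyr-side extraction is the hard part; as a hedged illustration of the target format, this is what a readable tree dump looks like once splits are mapped back to feature names. It uses scikit-learn's `export_text` on the iris data purely as a stand-in for the Spark model.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text prints each split as 'feature <= threshold', indented by depth,
# using the human-readable names instead of indexed column positions.
print(export_text(tree, feature_names=list(iris.feature_names)))
```

Whatever the extraction route, the end state is the same: a node table of (feature name, threshold, left/right child) that generic tree-plotting libraries can consume.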

Neural Network - Working with an imbalanced dataset

孤人 submitted on 2019-12-05 22:26:00
I am working on a classification problem with two labels: 0 and 1. My training dataset is very imbalanced (and so will be the test set, given my problem). The proportion is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million, so I should get around 100,000 samples for label '1'. Given the large number of training samples, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether a NN could be efficient at handling this kind of …
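With ~23M samples at a 250:1 ratio, per-class loss weights are a common first step before resampling schemes like SMOTE, and they apply to neural networks and random forests alike. A small sketch of computing the 'balanced' weights most libraries accept (the 250:1 toy vector stands in for the real data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Miniature 250:1 stand-in for the real label distribution.
y = np.array([0] * 250 + [1] * 1)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class weighted 250x the majority
```

The resulting dict can be passed as `class_weight` to sklearn estimators, or as per-sample weights to most NN training loops, so each minority example contributes 250 times as much to the loss.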

something similar to permutation accuracy importance in h2o package

假如想象 submitted on 2019-12-05 20:26:02
I fitted a random forest for my multinomial target with the randomForest package in R. Looking at variable importance, I found permutation accuracy importance, which is what I was looking for in my analysis. I fitted a random forest with the h2o package too, but the only measures it shows me are relative_importance, scaled_importance, and percentage. My question is: can I extract a measure that shows me which level of the target each variable best classifies? Is permutation accuracy importance the best measure I can use in this case? For example: I have a 3-level …
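For comparison across ecosystems: the same measure the R randomForest package reports is available in scikit-learn as `sklearn.inspection.permutation_importance`. A hedged sketch with a made-up 3-class problem, just to show the shape of the output (it does not reproduce the h2o measures):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and record the accuracy drop:
# that drop is the permutation accuracy importance of the feature.
result = permutation_importance(rf, X, y, scoring="accuracy",
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```

Per-level importance (which target level a variable best separates) is not built into this call; the usual trick is to rerun it with a one-vs-rest relabeling of `y` for each level of interest.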

Python vectorization for classification [duplicate]

我怕爱的太早我们不能终老 submitted on 2019-12-05 19:37:18
This question already has an answer here: Scikit learn - fit_transform on the test set. I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using a random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, the words are not necessarily identical to those in my training set. This is a problem because I end up with a different number of features in my training set than in my test set (so the dimensions for the …
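A minimal sketch of the standard fix the linked duplicate describes: fit the TF-IDF vocabulary on the training corpus once, then only transform the test corpus, so both matrices share the same columns. The documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["spark random forest", "forest fire report"]
test_docs = ["random fire and unseen words"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary here, once
X_test = vec.transform(test_docs)        # reuses it; unseen words are dropped

print(X_train.shape[1], X_test.shape[1])  # identical feature counts
```

Calling `fit_transform` on the test set is exactly what produces the dimension mismatch: it builds a second, different vocabulary. Pickling the fitted vectorizer alongside the forest keeps the mapping stable for future data.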