scikits | 易学教程

Forecasting using Pandas OLS

Question: I have been using the scikits.statsmodels OLS predict function to forecast fitted data, but would now like to switch to pandas. The documentation refers to OLS as well as to a function called y_predict, but I can't find any documentation on how to use it correctly. By way of example:

    exogenous = { "1998": "4760","1999": "5904","2000": "4504","2001": "9808","2002": "4241","2003": "4086","2004": "4687","2005": "7686","2006": "3740","2007": "3075","2008": "3753","2009": "4679","2010": "5468",

Scikit - 3D feature array for SVM

Question: I am trying to train an SVM in scikit. I am following the example and tried to adapt it to my 3D feature vectors. I tried the example from the page http://scikit-learn.org/stable/modules/svm.html and it ran through. While debugging, I went back to the tutorial setup and found this:

    from sklearn import svm

    X = [[0, 0], [1, 1], [2, 2]]
    y = [0, 1, 1]
    clf = svm.SVC()
    clf.fit(X, y)

works, while

    X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
    y = [0, 1, 1]
    clf = svm.SVC()
    clf.fit(X, y)

fails with:

    ValueError: X.shape[1] = 2 should be equal to 3, the number of features at training time

What is wrong here? It's only one additional dimension...
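The quoted error message describes a mismatch between the number of features seen during training (3) and the number in some later input (2), which usually points at a predict call rather than at fit itself. The following is a minimal sketch under that assumption: the question does not show a predict call, so the two-value sample below is hypothetical.

```python
from sklearn import svm

# Training data from the question: three samples, three features each.
X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
y = [0, 1, 1]

clf = svm.SVC()
clf.fit(X, y)  # fitting with three features per sample succeeds

# Assumption: the original script still predicted on a two-feature sample
# left over from the tutorial. That call is what raises the ValueError.
try:
    clf.predict([[2.0, 2.0]])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# A sample with the same number of features as the training data is accepted.
prediction = clf.predict([[2.0, 2.0, 2.0]])
```

If this is what happened, the fix is to give every sample passed to predict the same three columns used in fit.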

Cannot get scikit-learn installed on OS X

Question: I would like to use scikit-learn on an upcoming project and I absolutely cannot install it. I can install other packages without a problem, either by building them from source or through pip. For scikit-learn, I've tried cloning the project from GitHub and installing via pip, both without success. Can anyone please help? Here is part of my pip.log:

    Downloading/unpacking scikit-learn
      Running setup.py egg_info for package scikit-learn
        Warning: Assuming default configuration (scikits/learn/{setup

Is there a way to convert nltk featuresets into a scipy.sparse array?

Question: I'm trying to use scikit-learn, which needs numpy/scipy arrays for input. The featureset generated in nltk consists of unigram and bigram frequencies. I could do the conversion manually, but that would be a lot of effort, so I'm wondering if there's a solution I've overlooked.

Answer 1: Not that I know of, but note that scikit-learn can do n-gram frequency counting itself. Assuming word-level n-grams:

    from sklearn.feature_extraction.text import CountVectorizer, WordNGramAnalyzer
    v = CountVectorizer(analyzer
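The answer above is cut off. As a separate illustration, not the answerer's code: scikit-learn releases that ship `DictVectorizer` can turn a list of feature dictionaries into a `scipy.sparse` matrix directly. The sketch below assumes the nltk featuresets are `(feature_dict, label)` pairs; the sample features and labels are made up.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical nltk-style featuresets: (feature dict, label) pairs holding
# unigram and bigram counts. The values are invented for illustration.
featuresets = [
    ({"the": 2, "cat": 1, "the cat": 1}, "pos"),
    ({"the": 1, "dog": 1, "the dog": 1}, "neg"),
]

# Separate the feature dictionaries from their labels.
feature_dicts = [features for features, _label in featuresets]
labels = [label for _features, label in featuresets]

# fit_transform learns the feature-name-to-column mapping and returns
# a scipy.sparse matrix with one row per sample.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(feature_dicts)
```

The resulting `X` and `labels` can then be passed to an estimator's `fit` method; `vectorizer.transform` maps new feature dictionaries onto the same columns.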

How to load CSV data in scikit and use it for Naive Bayes classification

Question: I am trying to load custom data to perform Naive Bayes (NB) classification in scikit. I need help loading the sample data into scikit and then running NB. How do I load categorical values for the target? Should I use the same data for training and testing, or use a complete set just for testing?

    Sl No,Member ID,Member Name,Location,DOB,Gender,Marital Status,Children,Ethnicity,Insurance Plan ID,Annual Income ($),Twitter User ID
    1,70000001,Fly Dorami,New York,39786,M,Single,,Asian,2002,0,548900028
    2,70000002,Bennie Ariana,Pennsylvania,6

scikit-learn roc_auc_score() returns accuracy values

Question: I am trying to compute the area under the ROC curve with sklearn.metrics.roc_auc_score, using the following call:

    roc_auc = sklearn.metrics.roc_auc_score(actual, predicted)

where actual is a binary vector of ground-truth classification labels and predicted is a binary vector of the labels my classifier has predicted. However, the value of roc_auc that I am getting is EXACTLY the same as the accuracy (the proportion of samples whose labels are correctly predicted). This is not a one-off thing. I try my classifier on various values of the parameters and every time I get the same
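One likely explanation for the behaviour described above: when the second argument holds hard 0/1 labels rather than scores, the ROC curve has a single operating point, and the area under it reduces to (TPR + TNR) / 2, the balanced accuracy. On a class-balanced test set that equals plain accuracy. The sketch below illustrates this in pure Python with made-up labels and scores; `auc_from_scores` and `balanced_accuracy` are hypothetical helpers, not scikit-learn functions.

```python
def auc_from_scores(actual, scores):
    """Rank-based AUC: share of (positive, negative) pairs in which the
    positive has the higher score, counting ties as half a win."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(actual, predicted):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    n_pos = sum(1 for a in actual if a == 1)
    n_neg = len(actual) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Made-up example: four positives, four negatives (balanced classes).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.3, 0.6, 0.75]   # continuous scores
predicted = [1 if s >= 0.5 else 0 for s in scores]   # hard labels

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
auc_hard = auc_from_scores(actual, predicted)   # AUC computed from labels
auc_soft = auc_from_scores(actual, scores)      # AUC computed from scores
bal_acc = balanced_accuracy(actual, predicted)
```

Here `auc_hard`, `bal_acc`, and `accuracy` coincide, while `auc_soft` differs. With scikit-learn, the usual way to get a meaningful AUC is to pass continuous scores, such as the output of `decision_function` or the positive-class column of `predict_proba`, instead of predicted labels.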

text classification with SciKit-learn and a large dataset

Question: First of all, I started with Python yesterday. I'm trying to do text classification with scikit-learn and a large dataset (250,000 tweets). For the algorithm, every tweet will be represented as a 4000 x 1 vector, so the input is 250,000 rows by 4000 columns. When I try to construct this in Python, I run out of memory after 8500 tweets (when working with a list and appending to it), and when I preallocate the memory I just get a MemoryError (np.zeros(4000,2500000)). Is scikit-learn not able to work with datasets this large? Am I doing something wrong (as it is my second day with Python)?
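A rough back-of-the-envelope calculation shows why the dense array described above does not fit in memory, and why a sparse representation (for example a `scipy.sparse` CSR matrix, which many scikit-learn estimators accept) is the usual alternative for text features. The sketch assumes float64 values and, purely for illustration, about 20 non-zero features per tweet; that figure is an assumption, not something stated in the question.

```python
# Dimensions taken from the question.
n_tweets = 250_000
n_features = 4_000
bytes_per_float64 = 8

# Dense storage: every cell is stored, zero or not.
dense_bytes = n_tweets * n_features * bytes_per_float64
dense_gib = dense_bytes / 2**30

# Assumption for illustration: roughly 20 non-zero features per tweet.
nnz_per_tweet = 20

# CSR storage: one float64 value and one int32 column index per non-zero,
# plus an int32 row-pointer array with one entry per row and one extra.
sparse_bytes = n_tweets * nnz_per_tweet * (8 + 4) + (n_tweets + 1) * 4
sparse_mib = sparse_bytes / 2**20

print(f"dense:  {dense_gib:.2f} GiB")
print(f"sparse: {sparse_mib:.2f} MiB")
```

Under these assumptions the dense matrix needs several gibibytes while the sparse one needs tens of mebibytes, so the limit is the dense representation rather than the library.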

Numpy: How to randomly split/select a matrix into n different matrices

Question: I have a numpy matrix with shape (4601, 58). I want to split the matrix randomly into 60%, 20%, and 20% parts by number of rows, for a machine learning task. Is there a numpy function that randomly selects rows?

Answer 1: You can use numpy.random.shuffle:

    import numpy as np

    N = 4601
    data = np.arange(N*58).reshape(-1, 58)
    np.random.shuffle(data)  # shuffles the rows in place

    a = data[:int(N*0.6)]
    b = data[int(N*0.6):int(N*0.8)]
    c = data[int(N*0.8):]

Answer 2: A complement to HYRY's answer if you want to shuffle
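As a variation on the first answer (the second answer is cut off, so this is not a reconstruction of it), the rows can be indexed through a random permutation instead of being shuffled in place, which leaves the original array untouched. A minimal sketch, assuming a NumPy version that provides `np.random.default_rng`; the seed and the names `train`, `val`, and `test` are arbitrary choices.

```python
import numpy as np

# Seeded generator so the split is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)

N = 4601
data = np.arange(N * 58).reshape(-1, 58)  # stand-in for the real matrix

# Random ordering of the row indices; `data` itself is not modified.
perm = rng.permutation(N)

# Cut points for the 60% / 20% / 20% split.
train_end = int(N * 0.6)
val_end = int(N * 0.8)

# np.split cuts the permuted rows at the two indices, giving three arrays.
train, val, test = np.split(data[perm], [train_end, val_end])
```

Because the same `perm` can also index a label vector, this form keeps features and labels aligned when both need to be split.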