Question
I'm learning Scikit-Learn to classify tweets. I have a CSV file with the tweets in one column and their class, from 0 to 11, in the next column. I went through the tutorial on the Scikit-Learn site, and I think I understand how the actual classification is done, but I don't think I really understood the data format. In the tutorial the material was stored as files in folders, where the folder names acted as the classification tags.
In my case I need to load the data from a CSV file, and apparently I have to manually construct the data structure that is fed to the vectorizer and the classifier. How should I approach this? I think the tutorial was a bit ambiguous in this respect, since the data loading was done automagically, which left me in the dark about the structure and loading of custom data.
Answer 1:
Normally you would use pandas.read_csv or, if you don't want a pandas dependency, a NumPy text loader such as numpy.loadtxt, or even load the CSV into a list using the standard library's csv module. With pandas it would look like this:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv('example.csv', header=None, sep=',',
                 names=['tweets', 'class'])  # column names, needed because the file has no header row
vect = TfidfVectorizer()
X = vect.fit_transform(df['tweets'])
y = df['class']
Once you have your X and y, you can feed them to a classifier.
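As a rough illustration of that last step, here is a minimal sketch that goes from CSV text to a prediction. The in-memory CSV contents, the two-class labels, and the choice of MultinomialNB are all assumptions made for the example; they stand in for your own example.csv and whichever classifier you prefer.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for example.csv: one tweet and one class label per row.
csv_text = """great match tonight,0
what a goal by the striker,0
new phone has a great camera,1
this laptop battery is awful,1
"""

df = pd.read_csv(io.StringIO(csv_text), header=None, sep=',',
                 names=['tweets', 'class'])

# Learn the vocabulary and turn the raw text into a sparse TF-IDF matrix.
vect = TfidfVectorizer()
X = vect.fit_transform(df['tweets'])
y = df['class']

# MultinomialNB is an assumed choice; any scikit-learn classifier
# that accepts sparse input could be used here instead.
clf = MultinomialNB()
clf.fit(X, y)

# New text must go through the already fitted vectorizer (transform, not
# fit_transform) so its columns line up with the training matrix.
new_X = vect.transform(['the striker scored a goal'])
predicted = clf.predict(new_X)
```

Note that X is a SciPy sparse matrix with one row per tweet and one column per vocabulary term, while y is simply the column of labels in the same row order. That pairing is the whole "data structure" the classifier expects.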
Source: https://stackoverflow.com/questions/27675395/load-classified-data-from-csv-to-scikit-learn-for-machine-learning