Load classified data from CSV to Scikit-Learn for machine learning

Question

I'm learning Scikit-Learn to classify tweets. I have a CSV with the tweets in one column and their class, from 0 to 11, in the next column. I went through this tutorial from the Scikit-Learn site. I think I understand how the actual classifying is done, but I don't think I really understood the data format. In the tutorial, the material was in files in folders, where the folder names acted as classification tags.

In my case I need to load the data from a CSV file, and apparently I have to manually construct the data structure that is fed to the vectorizer and classifier. How should I approach this? I think the tutorial was a bit ambiguous in this respect: the data loading was done automagically, which left me in the dark about the structure and loading of custom data.

Answer 1:

Normally you would use pandas.read_csv or, if you don't want a pandas dependency, numpy.loadtxt or numpy.genfromtxt (numpy.load reads .npy/.npz files, not CSV), or even load the CSV into a list using the standard library's csv module. With pandas it would look like this:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# header=None plus names=... supplies column names when the file has no header row
df = pd.read_csv('example.csv', header=None, sep=',',
                 names=['tweets', 'class'])
vect = TfidfVectorizer()
X = vect.fit_transform(df['tweets'])  # sparse matrix, one row per tweet
y = df['class']                       # one class label per tweet
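
If you would rather avoid the pandas dependency, the same two columns can be read with the standard library. The following is only a rough sketch: the helper name load_tweets and the in-memory sample rows are made up for illustration, and it assumes each row holds the tweet text first and an integer class second, with no header row.

```python
import csv
import io


def load_tweets(fileobj):
    """Read (tweet, class) rows into two parallel lists."""
    texts, labels = [], []
    for row in csv.reader(fileobj):
        if not row:  # skip blank lines
            continue
        texts.append(row[0])
        labels.append(int(row[1]))
    return texts, labels


# Hypothetical in-memory sample standing in for example.csv.
sample = io.StringIO('"good game, great goal",0\nnew phone camera,1\n')
texts, labels = load_tweets(sample)
```

For a real file you would pass something like open('example.csv', newline='', encoding='utf-8') instead of the StringIO object. The resulting texts list can go to the vectorizer in place of df['tweets'], and labels plays the role of y.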

Once you have your X and y you can feed them to a classifier.
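
As an illustration of that last step, here is a minimal sketch that chains the vectorizer and a classifier in a pipeline. MultinomialNB is just one reasonable choice for text, not something the question requires, and the four toy tweets and their labels are invented stand-ins for df['tweets'] and df['class'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['tweets'] and df['class'].
tweets = [
    "great match today, what a goal",
    "the team played a great game",
    "new phone has a great camera",
    "battery life on this phone is poor",
]
labels = [0, 0, 1, 1]

# The pipeline vectorizes the raw text and then fits the classifier,
# so the vocabulary learned during fit is reused at prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(tweets, labels)

# New, unseen text goes through the same vectorizer before classification.
predicted = model.predict(["what a game", "phone camera review"])
print(predicted)
```

Using a pipeline means you pass raw strings to both fit and predict, so there is no risk of calling fit_transform on new data by mistake. With real data you would also hold out a test set before judging the classifier.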

Source: https://stackoverflow.com/questions/27675395/load-classified-data-from-csv-to-scikit-learn-for-machine-learning

Tags

python

csv

machine-learning

scikit-learn

classification
