Sorry for the vague title, but I don't really know what the problem is. I want to load a CSV file, split it into two arrays, and apply a function to each of them. It works for the first array, but the second one fails even though the code is identical. I'm really stuck. The code is as follows:
    from wordutility import wordutility
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn import cross_validation
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    import pandas as pd
    import numpy as np

    data = pd.read_csv('sts_gold_tweet.csv', header=None,
                       delimiter=';', quotechar='"')
    # test = pd.read_csv('output.csv', header=None,
    #                    delimiter=';', quotechar='"')

    split_ratio = 0.9
    train = data[:round(len(data)*split_ratio)]
    test = data[round(len(data)*split_ratio):]

    y = data[1]

    print("Cleaning and parsing tweets data...\n")

    traindata = []
    for i in range(0, len(train[0])):
        traindata.append(" ".join(wordutility.tweet_to_wordlist(train[0][i], False)))

    testdata = []
    for i in range(0, len(test[0])):
        testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))
The program works up until the very last line. The error is:
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
        result = self.index.get_value(self, key)
      File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line 1417, in get_value
        return self._engine.get_value(s, k)
      File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
      File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
      File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
      File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
      File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
    KeyError: 0
(The traceback says line 2 because I was running the code in the Python shell; it refers to the last line of the code above.)
Hopefully someone can help me :). Thanks
EDIT
OK, it seems the splitting is not working the way I thought it would. I did get two arrays as I wanted, but the rows are still numbered as if they were one file: the array train runs from 0 to 1830 and the array test runs from 1831 to 2034, so the range in my second loop was wrong. How would I go about splitting up the CSV file "correctly"?
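To make the numbering issue concrete, here is a minimal sketch that reproduces it without the CSV file. The ten-row DataFrame is a made-up stand-in for the real data (an assumption, not the actual tweets); the point is only that slicing a DataFrame keeps the original row labels, so a label lookup for 0 on the second slice fails.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: column 0 holds text, column 1 holds the label.
data = pd.DataFrame({0: ["tweet %d" % i for i in range(10)],
                     1: ["negative"] * 10})

split_ratio = 0.9
split_at = int(round(len(data) * split_ratio))

train = data[:split_at]
test = data[split_at:]

# The slices keep the row labels of the original frame.
train_labels = list(train.index)
test_labels = list(test.index)

# test[0] is a Series whose labels start at split_at, so looking up label 0 raises KeyError.
try:
    test[0][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(train_labels, test_labels, lookup_failed)
```

The same thing happens with the real file: `test[0][i]` asks for the row labelled `i`, not the row at position `i`.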
EDIT 2
    >>> print(train[0:5])
                                                       0         1
    0  the angel is going to miss the athlete this we...  negative
    1  It looks as though Shaq is getting traded to C...  negative
    2      @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH  negative
    3  drinking a McDonalds coffee and not understand...  negative
    4  So dissapointed Taylor Swift doesnt have a Twi...  negativ
    >>> print(test[0:5])
                                                          0         1
    1831    Why is my PSP always dead when I want to use it?  negative
    1832  @hillaryrachel oh i know how you feel. i took ...  negative
    1833  @daveknox awesome- corporate housing took awa...  negative
    1834    @lakersnation Is this a joke? I can't find them  negative
    1835                               XBox Live still down  negative
So, as you can see, the array "test" starts at row number 1831. I would have expected it to start at 0. I have fixed my problem for now by changing the range in the second for loop:
    for i in range(len(train[0]), len(data)):
So my original problem is fixed; I'm just curious and eager to learn to write better code. Is this an OK thing to do, or should I split the CSV file in a different way?
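For comparison, here is a hedged sketch of three alternatives to adjusting the loop range by hand. It uses a made-up ten-row DataFrame and a hypothetical `clean` function standing in for `wordutility.tweet_to_wordlist` (both are assumptions for illustration only): renumbering the slice with `reset_index(drop=True)`, accessing rows by position with `.iloc`, or iterating over the column values directly so that no index is needed at all.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: column 0 holds text, column 1 holds the label.
data = pd.DataFrame({0: ["tweet %d" % i for i in range(10)],
                     1: ["negative"] * 10})
split_at = int(round(len(data) * 0.9))

# Option 1: renumber the rows of the slice so the labels start at 0 again.
test = data[split_at:].reset_index(drop=True)
first_by_label = test[0][0]

# Option 2: keep the original labels but access rows by position with .iloc.
test_raw = data[split_at:]
first_by_position = test_raw[0].iloc[0]

# Option 3: iterate over the values themselves, so the index never matters.
def clean(text):
    # Hypothetical placeholder for wordutility.tweet_to_wordlist(text, False).
    return text.lower().split()

testdata = [" ".join(clean(tweet)) for tweet in test_raw[0]]

print(first_by_label, first_by_position, testdata)
```

The third form avoids index arithmetic entirely, which is why it tends to be the least error-prone of the three when the row labels are not needed.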