Problems with Pandas

Anonymous (unverified), submitted 2019-12-03 08:59:04

Question:

Sorry for the vague title, but I don't really know what the problem is. I want to load a CSV file, split it into two arrays, and run a function on each of those arrays. It works for the first array, but the second one fails even though everything else is the same. I'm really stuck. The code is as follows:

    from wordutility import wordutility
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn import cross_validation
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    import pandas as pd
    import numpy as np

    data = pd.read_csv('sts_gold_tweet.csv', header=None, delimiter=';',
                       quotechar='"')

    # test = pd.read_csv('output.csv', header=None,
    #                    delimiter=';', quotechar='"')

    split_ratio = 0.9
    train = data[:round(len(data)*split_ratio)]
    test = data[round(len(data)*split_ratio):]

    y = data[1]

    print("Cleaning and parsing tweets data...\n")

    traindata = []
    for i in range(0, len(train[0])):
        traindata.append(" ".join(wordutility.tweet_to_wordlist
                                  (train[0][i], False)))

    testdata = []
    for i in range(0, len(test[0])):
        testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))

The program works up until the very last line. The error is:

    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
        result = self.index.get_value(self, key)
      File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line 1417, in get_value
        return self._engine.get_value(s, k)
      File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
      File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
      File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
      File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
      File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
    KeyError: 0

(It says line 2 in the traceback because I was trying the code in the Python shell, so line 2 refers to the last line of the code above.)

Hopefully someone can help me :). Thanks

EDIT

OK, it seems the splitting is not working as I thought it would. I did get two arrays as I wanted, but the rows are still numbered as if they were in one file: the array train runs from 0 to 1830 and the array test runs from 1831 to 2034, so the range was wrong. How would I go about splitting up the CSV file "correctly"?

EDIT 2

    >>> print(train[0:5])
                                                       0         1
    0  the angel is going to miss the athlete this we...  negative
    1  It looks as though Shaq is getting traded to C...  negative
    2     @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH   negative
    3  drinking a McDonalds coffee and not understand...  negative
    4  So dissapointed Taylor Swift doesnt have a Twi...  negativ

    >>> print(test[0:5])
                                                          0         1
    1831  Why is my PSP always dead when I want to use it?   negative
    1832  @hillaryrachel oh i know how you feel. i took ...  negative
    1833  @daveknox awesome-  corporate housing took awa...  negative
    1834  @lakersnation Is this a joke?  I can't find them   negative
    1835                              XBox Live still down   negative

So as you can see, the array "test" starts at row number 1831. I would have thought it would start at 0. I fixed my problem by editing the range in the for loop:

    for i in range(len(train[0]), len(data)):

So my original problem is fixed; I'm just curious and eager to learn to write better code. Is this an OK thing to do, or should I split the CSV file in a different way?

Answer 1:

When you do test[0], you are not getting the first element of test; you are getting the column of test with the "name" 0. Indexing that column again, as in test[0][i], likewise looks up the label i, not the i-th position. When you split the pandas DataFrame in two, the original labels were preserved. This means that the test DataFrame has no label 0 along the axis you split on, since that label stayed in the first DataFrame.

Let me give you an example. Say you have the following DataFrame:

           0   1   2   3   4   5   6   7   8   9
    Ind1   0   1   2   3   4   5   6   7   8   9
    Ind2  10  11  12  13  14  15  16  17  18  19

When you split it, you end up with these DataFrames:

           0   1   2   3   4
    Ind1   0   1   2   3   4
    Ind2  10  11  12  13  14

and:

           5   6   7   8   9
    Ind1   5   6   7   8   9
    Ind2  15  16  17  18  19

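Notice that the columns of the second DataFrame start with 5, not 0, because those were the column names before the split. So when you try to get column 0, it isn't there. The same applies to row labels when you slice by rows, as in your code: test keeps the labels 1831 onward, so looking up label 0 fails. That is the source of your error.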
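In the question the DataFrame is cut along the rows, so it is the row labels that carry over. Here is a minimal sketch that reproduces the KeyError, assuming a small made-up DataFrame in place of sts_gold_tweet.csv (the five rows and the cut point 3 are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the tweets file: column 0 holds text, column 1 the label.
data = pd.DataFrame({0: ["a", "b", "c", "d", "e"],
                     1: ["negative"] * 5})

cut = 3  # illustrative cut point, playing the role of round(len(data) * split_ratio)
train = data[:cut]
test = data[cut:]

# Slicing keeps the original row labels on both halves.
train_labels = list(train.index)
test_labels = list(test.index)

# test[0] is the column named 0; indexing it with 0 then looks up the ROW LABEL 0,
# which only exists in train.
try:
    test[0][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(train_labels, test_labels, lookup_failed)
```

Here train keeps labels 0 to 2 and test keeps labels 3 and 4, which mirrors test starting at 1831 in the question.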

The simplest solution would just be to use the position, rather than the label. So instead of something like test[0][i], use test[0].iloc[i] (or test.iloc[i, 0]). That will give the value based on positional index.
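A few ways to read the second half by position are sketched below. This again assumes a small made-up DataFrame rather than the real CSV; which option fits best depends on the rest of the code.

```python
import pandas as pd

# Hypothetical stand-in for the tweets file (values invented for illustration).
data = pd.DataFrame({0: ["a", "b", "c", "d", "e"],
                     1: ["negative"] * 5})
test = data[3:]  # row labels are 3 and 4, not 0 and 1

# Option 1: positional access with .iloc, which ignores the row labels.
by_position = [test[0].iloc[i] for i in range(len(test))]

# Option 2: renumber the rows from 0 so label-based lookups work again.
test_reset = test.reset_index(drop=True)
by_label = [test_reset[0][i] for i in range(len(test_reset))]

# Option 3: skip the index entirely and iterate over the column's values.
direct = list(test[0])

print(by_position, by_label, direct)
```

All three produce the same list of values. Option 3 avoids the range(...) loop altogether, which also removes the need to adjust the loop bounds by hand.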


