I've read a CSV file (tab-separated) into a DataFrame, and I now need it as a numpy array for clustering, without changing the types.
Use label-based selection and the .values attribute of the resulting pandas object, which gives you a numpy array:
>>> df
uid iid rat
0 196 242 3.0
1 186 302 3.0
2 22 377 1.0
>>> df.loc[:,['iid','rat']]
iid rat
0 242 3.0
1 302 3.0
2 377 1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242., 3.],
[ 302., 3.],
[ 377., 1.]])
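For completeness, here is a minimal sketch of the whole path from tab-separated text to an array. The inline string is a hypothetical stand-in for your file (you would pass your own path to read_csv), and it uses .to_numpy(), which newer pandas versions offer as an alternative to .values:

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for the real file
raw = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# sep="\t" tells read_csv that the fields are tab-separated
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Select the columns by label, then convert to a numpy array
arr = df.loc[:, ["iid", "rat"]].to_numpy()
print(arr.shape)  # (3, 2)
```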
Note that your integer column gets promoted to float, because a plain numpy array holds a single dtype for all of its elements.
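You can check the promotion yourself by looking at the dtype of the result. If keeping the per-column types really matters, one possible workaround (my suggestion, not the only option) is a record array via to_records, though many clustering routines will want a plain float array anyway. The frame below is a reconstruction of the sample data above:

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample frame shown above
df = pd.DataFrame({
    "uid": [196, 186, 22],
    "iid": [242, 302, 377],
    "rat": [3.0, 3.0, 1.0],
})

# A plain 2-D array has one dtype, so int + float becomes float
mixed = df.loc[:, ["iid", "rat"]].values
print(mixed.dtype)  # float64

# A record array keeps a separate dtype per field
records = df.loc[:, ["iid", "rat"]].to_records(index=False)
print(records.dtype.names)         # ('iid', 'rat')
print(records.dtype["iid"].kind)   # i  (integer)
print(records.dtype["rat"].kind)   # f  (float)
```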
Also note, this particular selection could be approached in different ways:
>>> df.iloc[:, 1:] # integer-position based
iid rat
0 242 3.0
1 302 3.0
2 377 1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
iid rat
0 242 3.0
1 302 3.0
2 377 1.0
I like label-based because it is more explicit.
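If you want to convince yourself that the three approaches select the same data, a quick check along these lines should do (again using a reconstruction of the sample frame, so the variable names are just for illustration):

```python
import pandas as pd

# Reconstruction of the sample frame shown above
df = pd.DataFrame({
    "uid": [196, 186, 22],
    "iid": [242, 302, 377],
    "rat": [3.0, 3.0, 1.0],
})

by_label = df.loc[:, ["iid", "rat"]]   # label-based
by_position = df.iloc[:, 1:]           # integer-position based
by_columns = df[["iid", "rat"]]        # plain column indexing

# DataFrame.equals compares values, dtypes and labels
same = by_label.equals(by_position) and by_label.equals(by_columns)
print(same)  # True
```

The position-based version silently depends on the column order, which is part of why I prefer selecting by label.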
You aren't seeing commas because of how numpy arrays are printed:
>>> df[['iid','rat']].values
array([[ 242., 3.],
[ 302., 3.],
[ 377., 1.]])
>>> print(df[['iid','rat']].values)
[[ 242. 3.]
[ 302. 3.]
[ 377. 1.]]
And actually, it is the difference between the str and repr results of the numpy array:
>>> print(repr(df[['iid','rat']].values))
array([[ 242., 3.],
[ 302., 3.],
[ 377., 1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242. 3.]
[ 302. 3.]
[ 377. 1.]]
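The same thing can be checked without a DataFrame at all. The sketch below builds the array by hand and also shows np.array2string, which lets you choose the separator if you actually want commas in the printed text; either way the underlying data is identical:

```python
import numpy as np

# The same values as the array above, built directly
arr = np.array([[242.0, 3.0], [302.0, 3.0], [377.0, 1.0]])

as_str = str(arr)    # what print(arr) shows
as_repr = repr(arr)  # what the interactive prompt shows

print("," in as_str)   # False
print("," in as_repr)  # True

# Only the text representation changes; the data does not
with_commas = np.array2string(arr, separator=", ")
print(with_commas)
```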