Python (pyspark) Error = ValueError: could not convert string to float: "17"

风格不统一 submitted on 2019-12-11 10:46:30

Question


I am working with Python on Spark and reading my dataset from a .csv file whose first few rows are:

17  0.2  7
17  0.2  7
39  1.3  7
19   1   7
19   0   7

When I read from the file line by line with the code below:

from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 3, in parsePoint
ValueError: could not convert string to float: "17"

Any help is greatly appreciated.


Answer 1:


Following the comments below this answer, you should use:

[float(x.strip(' "')) for x in line.split(',')]

You do not need to replace ',' with ' '. Simply split on ',' and then strip leading and trailing whitespace and double quotes from each field (x.strip(' "')) before converting it to float.

Also, have a look at the csv module in the standard library, which may simplify your work.
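As a rough sketch of the csv suggestion: the snippet below assumes each line looks something like "17","0.2","7" (the quoting is inferred from the error message; the question does not show the raw file), and parse_values is a hypothetical helper name, not something from the original code. It returns a plain list so that it runs without Spark; the first value would be the label and the rest the features.

```python
import csv

def parse_values(line):
    # csv.reader accepts any iterable of strings, so wrap the single line in a list.
    # It removes the surrounding double quotes; skipinitialspace drops spaces after commas.
    row = next(csv.reader([line], skipinitialspace=True))
    return [float(field) for field in row]

values = parse_values('"17", "0.2", "7"')
label, features = values[0], values[1:]
```

Inside parsePoint you could then build the result as LabeledPoint(label, features).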


Below is the answer to the original question (before comments).

You need to use .split() instead of .split(' '). You have multiple consecutive space characters in your line, so splitting on ' ' results in empty strings, e.g. your first line is split into:

['17', '', '0.2', '', '7']

The problem is those empty strings, which (obviously) cannot be converted to float.

Using split() will solve the problem thanks to the behaviour of split when its sep argument is None (or not present):

If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).

See the documentation for split; a small example shows the difference:

>>> sp5 = ' ' * 5
>>> sp5.split()
[]
>>> sp5.split(' ')
['', '', '', '', '', '']
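Applied to the parsing code from the question, the change might look like the sketch below. It assumes whitespace-separated lines like the sample rows shown in the question; parse_values is a hypothetical name, and it returns a plain list instead of a LabeledPoint so that it runs without Spark.

```python
def parse_values(line):
    # split() with no argument treats any run of whitespace as one separator,
    # so consecutive spaces no longer produce empty strings.
    return [float(x) for x in line.replace(',', ' ').split()]

print(parse_values('17  0.2  7'))  # [17.0, 0.2, 7.0]
```

Note that this only handles the multiple-space problem; if the fields are also wrapped in double quotes, they still need to be stripped before the float conversion.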


Source: https://stackoverflow.com/questions/36113328/python-pyspark-error-valueerror-could-not-convert-string-to-float-17
