Question
I'm reading a CSV file into a Pandas DataFrame that looks like:
     A             B
0                  ("t1", "t2")
1    ("t3", "t4")
Two of the cells have literal tuples in them, and two of the cells are empty.
df = pd.read_csv('my_file.csv', dtype=str, delimiter=',',
                 converters={'A': ast.literal_eval, 'B': ast.literal_eval})
The converter ast.literal_eval works fine to convert the literal tuples into Python tuple objects – but only as long as there are no empty cells. Because I have empty cells, I get the error:
SyntaxError: unexpected EOF while parsing
According to this Stack Overflow answer, I should try to catch the SyntaxError exception for empty strings:
ast uses compile to compile the source string (which must be an expression) into an AST. If the source string is not a valid expression (like an empty string), a SyntaxError will be raised by compile.
However, I am not sure how to catch exceptions for individual cells within the context of the read_csv converters.
What would be the best way to go about this? Is there otherwise some way to convert empty strings/cells into objects which literal_eval would accept or ignore?
NB: My understanding is that having literal tuples in readable files isn't always the best thing, but in my case it's useful.
Answer 1:
You can create a custom function which uses ast.literal_eval conditionally:
from ast import literal_eval
from io import StringIO

import pandas as pd
# replicate csv file
x = StringIO("""A,B
,"('t1', 't2')"
"('t3', 't4')",""")
def literal_converter(val):
    # replace first val with '' or some other null identifier if required
    return val if val == '' else literal_eval(val)
df = pd.read_csv(x, delimiter=',', converters=dict.fromkeys('AB', literal_converter))
print(df)
          A         B
0            (t1, t2)
1  (t3, t4)
Alternatively, you can use try / except to catch the exception. This solution is more lenient, as it will also deal with other malformed syntax, i.e. a SyntaxError / ValueError caused by reasons other than empty values.
def literal_converter(val):
    try:
        return literal_eval(val)
    except (SyntaxError, ValueError):
        return val
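As a quick self-contained check (replicating the same two-row CSV in memory, as in the first snippet), the try / except converter handles both the tuple strings and the empty cells, which are passed to the converter as empty strings:

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

def literal_converter(val):
    # fall back to the raw value (here the empty string) when parsing fails
    try:
        return literal_eval(val)
    except (SyntaxError, ValueError):
        return val

# replicate csv file
x = StringIO("""A,B
,"('t1', 't2')"
"('t3', 't4')",""")

df = pd.read_csv(x, converters=dict.fromkeys('AB', literal_converter))
print(type(df.loc[0, 'B']))  # <class 'tuple'>
```

The non-empty cells come back as real Python tuples, while the empty cells stay as empty strings.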
Answer 2:
I would first read the data as normal, without literal_eval(). That gives us:
              A             B
0           NaN  ("t1", "t2")
1  ("t3", "t4")           NaN
Then I would do this:
newdf = df.fillna('()').applymap(ast.literal_eval)
Which gives:
          A         B
0        ()  (t1, t2)
1  (t3, t4)        ()
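Put together as a minimal runnable sketch (again replicating the CSV in memory; note that newer pandas versions offer DataFrame.map as the replacement name for applymap):

```python
import ast
from io import StringIO

import pandas as pd

# replicate csv file
x = StringIO("""A,B
,"('t1', 't2')"
"('t3', 't4')",""")

df = pd.read_csv(x)                                  # empty cells are read as NaN
newdf = df.fillna('()').applymap(ast.literal_eval)   # '()' parses to an empty tuple
print(newdf.loc[1, 'B'])  # ()
```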
I think it's convenient to have tuples in all the cells, even the empty ones. This will make it easier to operate on the tuples later, for example:
newdf.sum(axis=1)
Which gives you:
0 (t1, t2)
1 (t3, t4)
Because "adding" tuples is concatenation. And even trickier but still very useful:
newdf.A.str[0]
Gives you:
0 NaN
1 t3
Because pd.Series.str, despite looking like it would only work on strings, works just fine on lists and tuples. So you can efficiently and uniformly index elements within each column's tuples.
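A small sketch of this behaviour on a hand-built Series of tuples (mirroring column A above): .str indexing pulls out an element per tuple, with NaN where the index is out of range, and .str.len() reports each tuple's length.

```python
import pandas as pd

s = pd.Series([(), ('t3', 't4')])
print(s.str[0])      # first element per tuple; the empty tuple yields NaN
print(s.str.len())   # tuple lengths: 0 and 2
```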
Source: https://stackoverflow.com/questions/53102545/pandas-read-csv-converter-how-to-handle-exceptions-literal-eval-syntaxerror