Python tabula-py error (pandas error?)

别来无恙 submitted on 2019-12-13 03:55:43

Question


After some reading online I have decided to use tabula-py to extract tables from pdf files. We use Anaconda and I just installed tabula-py 1.1.1.

I wanted to start out with a simple script and see what it would do with a single page pdf file with some text and two tables ("table_p16.pdf").

The code:

from tabula import read_pdf
df = read_pdf("table_p16.pdf")

The error:

Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=c:\Windows\Sun\Java\Deployment\sam.security

Traceback (most recent call last):
  File "H:/Personlich/SVN/blademat_tb/blademat_toolbox/utility/read_pdf.py", line 41, in <module>
    df = read_pdf("table_p16.pdf")
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\tabula\wrapper.py", line 117, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 9, saw 9

Things I have tried:

  • Since the error seems to point to pandas, I tried reading a single-page PDF with one table. The same error occurs.
  • Set the user variable PATH to Java. This did not change anything. I can't set the system variable PATH to Java, since it is currently used for our SVN program.
  • Different code lines, with the same error:

    df = read_pdf(r"table_p9.pdf")
    df = read_pdf("table_p9.pdf", output_format='json')
    

I hope someone can chip in and help me figure out where the problem lies. It could be a Java issue, but I am not that familiar with the required Java interaction. Your help is much appreciated.

Edit

I tried different tables and some seem to be working. It has been difficult to identify what type of tables work. Some with 'merged' columns and others with 'merged' rows seem to work. But clearly not all. Also, I have not been able to read multiple tables (2 or 3) using the argument multiple_tables=True.

Is there any documentation on what kinds of tables Tabula can handle? This makes me wonder whether Tabula is the right program to use. After all the reading I did, I was under the impression that Tabula would be good at this. The tables it seems to struggle with are not complex.

Is there a clear and simple source on how to maximize the use of Tabula? Or otherwise tips on how to deal with tables that Tabula struggles with?

Regards, Gabriel


Answer 1:


This is a rough guideline for tabula (or tabula-py) options.

1) Merged cells in a table with ruling lines

You can use the lattice=True option. In lattice mode, tabula uses the lines of the table to find the cell boundaries. Note that you may need some post-editing, such as fillna, for merged cells. In my experience, some merged columns are extracted left-justified.

AFAIK, it's pretty hard for tabula to extract merged cells from a table that has no ruling lines.

The general tuning options for tabula are lattice, stream, and guess.
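As a rough sketch of how these switches can be passed from tabula-py: read_tables below is a hypothetical helper (not part of tabula-py) that forwards keyword options such as lattice=True to read_pdf and always returns a list. The normalisation is there because, depending on the tabula-py version and options, read_pdf may return a single DataFrame or a list of DataFrames; the None case for "nothing extracted" is an assumption about older releases.

```python
def read_tables(pdf_path, **options):
    """Call tabula.read_pdf and normalise the result to a list of tables.

    `read_tables` is a hypothetical helper, not part of tabula-py.
    """
    # Imported lazily so this module loads even where tabula-py (and Java)
    # are not installed.
    from tabula import read_pdf

    result = read_pdf(pdf_path, **options)
    if result is None:            # assumed: nothing was extracted
        return []
    if isinstance(result, list):  # e.g. multiple_tables=True
        return result
    return [result]               # a single DataFrame


# Possible usage for a table with ruling lines and merged cells:
#   tables = read_tables("table_p9.pdf", lattice=True)
#   filled = [t.ffill() for t in tables]  # copy merged-cell values downwards
#
# For a table without ruling lines, stream mode may work better:
#   tables = read_tables("table_p9.pdf", stream=True)
```

Whether forward-filling is the right repair depends on how the merged cells are laid out, so it is worth inspecting the raw DataFrame first.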

2) Multiple tables on one or more pages

This is a tabula-py specific option: you have to use multiple_tables=True.

By default, tabula-py extracts tables via CSV. This approach benefits from pandas.read_csv features such as inferring column names, but read_csv assumes that the PDF contains a single table, that is, every row has the same number of columns. When rows have different numbers of columns, pandas.read_csv raises a ParserError.
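The message in the question ("Expected 8 fields in line 9, saw 9") fits this explanation: a later line of the intermediate CSV has more fields than the first one. The sketch below uses only the standard library to locate such lines; find_ragged_rows and the sample text are made up for illustration. It flags any row whose width differs from the first row, whereas pandas, as far as I know, only fails on rows that are wider than expected.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (line_number, field_count) for rows whose width differs
    from the first non-empty row."""
    reader = csv.reader(io.StringIO(csv_text))
    expected = None
    ragged = []
    for row in reader:
        if not row:               # skip blank lines
            continue
        if expected is None:
            expected = len(row)   # width of the first row sets the expectation
        elif len(row) != expected:
            ragged.append((reader.line_num, len(row)))
    return ragged


# Two "tables" of different width glued into one CSV, as in the failing case.
sample = "a,b,c\n1,2,3\n4,5,6,7\n"
print(find_ragged_rows(sample))
```

If this reports mismatched lines for CSV produced from a PDF, that is a hint that the page holds more than one table or that column detection went wrong.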

On the other hand, with the multiple_tables option, tabula-py builds the DataFrames via JSON, which can represent multiple tables.
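To illustrate why JSON avoids the problem, here is a minimal sketch. It assumes that tabula's JSON output is a list with one object per table, each holding its rows under "data" and each cell's content under "text"; the sample is hand-written in that assumed layout, not real tabula output, and tables_from_json is a hypothetical helper.

```python
import json


def tables_from_json(json_text):
    """Turn tabula-style JSON into a list of tables,
    each table being a list of rows of strings."""
    tables = []
    for table in json.loads(json_text):
        rows = [[cell.get("text", "") for cell in row]
                for row in table.get("data", [])]
        tables.append(rows)
    return tables


# Hand-written sample in the assumed layout: two tables of different width.
sample = json.dumps([
    {"data": [[{"text": "a"}, {"text": "b"}],
              [{"text": "1"}, {"text": "2"}]]},
    {"data": [[{"text": "x"}, {"text": "y"}, {"text": "z"}]]},
])

for rows in tables_from_json(sample):
    print(len(rows[0]), "columns:", rows)
```

Each table keeps its own width, so nothing has to be squeezed into a single rectangular CSV. With tabula-py itself, read_pdf("table_p16.pdf", multiple_tables=True) does this conversion for you and returns a list of DataFrames.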

One more option: from tabula-py 1.3.0, you can use Tabula app templates with tabula-py. By taking the area data from a template, you can extract tables more reliably, because the area is specified exactly.
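A sketch of what "getting area data from a template" could look like. It assumes that a template exported from the Tabula app is a JSON list whose entries carry page, extraction_method, x1, y1, x2 and y2; the sample values are made up, and template_to_options is a hypothetical helper rather than tabula-py's own code.

```python
import json


def template_to_options(entry):
    """Map one template entry to read_pdf keyword options (hypothetical helper)."""
    options = {
        "pages": entry["page"],
        # tabula-py expects area as [top, left, bottom, right]
        "area": [entry["y1"], entry["x1"], entry["y2"], entry["x2"]],
    }
    method = entry.get("extraction_method")
    if method in ("lattice", "stream", "guess"):
        options[method] = True
    return options


# Hand-written template in the assumed layout (coordinates are made up).
template_text = json.dumps([
    {"page": 1, "extraction_method": "lattice",
     "x1": 50.0, "y1": 100.0, "x2": 550.0, "y2": 300.0},
])

for entry in json.loads(template_text):
    print(template_to_options(entry))
```

The resulting dictionary could be passed on as read_pdf("table_p16.pdf", **options). As far as I know, tabula-py also provides read_pdf_with_template(pdf_path, template_path), which handles the template file directly.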



Source: https://stackoverflow.com/questions/51326900/python-tabula-py-error-pandas-error
