Pdfplumber cannot recognise table python

纵然是瞬间 submitted on 2021-01-25 07:44:29

Question


I use pdfplumber to extract the table on page 2, which is normally in section 3. It works for some PDFs but not for others. For the PDFs that fail, pdfplumber seems to read the bottom table instead of the table I want.

How can I get the table? Link to the PDF that doesn't work: pdfA

Link to the PDF that works: pdfB

Here is my code:

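import pandas as pd
import pdfplumber

pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]  # pages are 0-indexed, so this is page 2
table = page.extract_table()

# Use the first row as the header and the remaining rows as data.
df = pd.DataFrame(table[1:], columns=table[0])
df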

The result is:

But the table I want on page 2 is:

However, this code works for pdfB (mentioned above).

By the way, the table I want in each PDF is in section 3.

Can anyone help?

Many thanks, Joan


Answer 1:


Here is a solution to the problem, but first please read the points below.

  • You used pdfplumber for table extraction, but it has many table settings that are worth reading about. Choosing the settings that match your PDF should resolve this. See the table extraction section of the pdfplumber API documentation.
  • I give a working solution for your PDF below, but please check the pdfplumber documentation first. It covers table extraction as well as text extraction, word extraction and more, so you should be able to find most future answers there.
  • To better understand the table settings, you can also use visual debugging. It is one of pdfplumber's best features: it shows what the table settings do and how the table is extracted with them. See the visual debugging of tables section of the documentation.
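The visual debugging mentioned in the last point could look roughly like the sketch below. It assumes pdfplumber's `to_image()`, `debug_tablefinder()` and `save()` methods; the helper name `save_table_debug_image` and the output file name are made up for illustration. Depending on the pdfplumber version, `to_image()` may need extra rendering dependencies installed.

```python
# Settings copied from the solution in this answer.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
}


def save_table_debug_image(pdf_path, page_index, out_path,
                           table_settings=TABLE_SETTINGS):
    """Save an image of one page with the detected table lines and cells drawn on it.

    Hypothetical helper, not part of pdfplumber itself.
    """
    # Imported here so the settings above can be reused without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]            # 0-indexed: 1 means page 2
        image = page.to_image(resolution=150)   # render the page as an image
        image.debug_tablefinder(table_settings) # overlay what the table finder sees
        image.save(out_path)


# Example call (PDF name taken from this answer, output name is arbitrary):
# save_table_debug_image("GSAP_msds_01259319.pdf", 1, "page2_debug.png")
```

Opening the saved image and changing one setting at a time is usually the quickest way to see why a table is being missed or merged with a neighbouring one.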

Below is the solution to your problem:

import pandas as pd
import pdfplumber

pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]
# Take column edges from the drawn lines and row edges from the text layout.
table = p1.extract_table(table_settings={"vertical_strategy": "lines",
                                         "horizontal_strategy": "text",
                                         "snap_tolerance": 4})
df = pd.DataFrame(table[1:], columns=table[0])
df

See the output of the above code.
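The question mentions that pdfplumber seems to pick up a different table from the one wanted. As far as the pdfplumber documentation describes, `extract_table()` returns only the largest table on the page, while `extract_tables()` returns all of them. If the settings above are not enough for some PDF, a possible fallback is to extract every table and keep the one that contains a known piece of text. This is only a sketch: `pick_table`, `extract_table_by_keyword` and the keyword are hypothetical, and which text identifies the section 3 table is an assumption you would need to check against your own files.

```python
def pick_table(tables, keyword):
    """Return the first table that has a cell containing `keyword`, else None.

    `tables` is a list of tables, each a list of rows, each row a list of cells,
    which is the shape that page.extract_tables() returns.
    """
    needle = keyword.lower()
    for table in tables:
        for row in table:
            # Cells can be None when no text was found in them.
            if any(cell and needle in cell.lower() for cell in row):
                return table
    return None


def extract_table_by_keyword(pdf_path, page_index, keyword, table_settings=None):
    """Extract all tables on one page and return the one matching `keyword`."""
    # Imported here so pick_table can be used and tested without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        tables = page.extract_tables(table_settings=table_settings or {})
    return pick_table(tables, keyword)
```

The result can then be passed to `pd.DataFrame(table[1:], columns=table[0])` as in the code above, after checking that it is not `None`.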



Source: https://stackoverflow.com/questions/63000396/pdfplumber-cannot-recognise-table-python
