How do I extract rows from a PDF file into a csv file?

清酒与你 2021-01-06 10:51

I want to get a list of all the colleges in USA from this PDF file and put it into a CSV file. I will then import the CSV file into SQL server (so that I can run queries eas

4 answers

刺人心 (original poster)

2021-01-06 11:40
I once had a Ruby project that did this kind of work. I used the pdf-reader gem (required as pdf/reader) and eventually it did work, but I advise against this approach. The contents of your PDF have no markers showing where the fields start and stop. Instead, you have to measure the position of each piece of text (and there are many pieces per field). Here is an example of the pieces that make up the first field:
```
"I
NUL ETX
Am"
NUL ETX
School
NUL ETX
Inc.
```
You then compare each position with boundaries that you have to find by experimenting, such as "if the position is > 2.54cm from the left margin and < 5.78cm from the left margin", and so on. It is tedious and error-prone.
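To make the boundary test concrete, here is a minimal sketch in plain Ruby. It does not call pdf/reader at all: the `Fragment` struct, the `COLUMNS` table, the helper names and the sample offsets are all hypothetical, and only the 2.54cm and 5.78cm figures come from the text above. It assumes you already have each text piece together with its distance from the left margin and a line index.

```ruby
# Hypothetical text pieces: left offset in cm (x), line index (row), text.
Fragment = Struct.new(:x, :row, :text)

# Column boundaries in cm from the left margin. The first range uses the
# figures quoted above; the other two are invented for illustration.
COLUMNS = {
  name:    (2.54...5.78),
  address: (5.78...11.0),
  phone:   (11.0...16.0)
}.freeze

# Return the column whose range contains x, or nil if none does.
def column_for(x)
  COLUMNS.each { |name, range| return name if range.cover?(x) }
  nil
end

# Group pieces by line, then join the pieces that fall in each column.
def rebuild_rows(fragments)
  fragments.group_by(&:row).sort.map do |_row, frags|
    cells = Hash.new { |hash, key| hash[key] = [] }
    frags.sort_by(&:x).each do |frag|
      col = column_for(frag.x)
      cells[col] << frag.text if col # pieces outside every range are dropped
    end
    COLUMNS.keys.map { |col| cells[col].join(" ") }
  end
end

fragments = [
  Fragment.new(2.6, 0, '"I'),
  Fragment.new(3.1, 0, 'Am"'),
  Fragment.new(3.9, 0, "School"),
  Fragment.new(4.9, 0, "Inc."),
  Fragment.new(6.0, 0, "118"),
  Fragment.new(6.8, 0, "Siskiyou"),
  Fragment.new(8.2, 0, "Avenue"),
  Fragment.new(11.5, 0, "5309266263")
]

rows = rebuild_rows(fragments)
puts rows.inspect
```

Every range in the table has to be tuned by hand against the real document, which is exactly why this approach is so fragile.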

The easiest solution is to grab the entire text content of your second URL, either by manually scrolling, selecting and copying the content into an editor and removing the extra text at the head and tail, or by using a web scraping gem like mechanize, and then convert this text to CSV. The last part is easy since the structure is fixed:
```
"I Am" School
118 Siskiyou Avenue
Mount Shasta , CA , 96067
5309266263  <--end of first record
424 Aviation
13230 SW 132 Ave.
Miami , FL , 33186
7862424848  <--end of second record
```
If you need help with this last part, just ask.
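For the conversion step, a rough sketch using Ruby's standard csv library could look like the following. It assumes every record is exactly four lines (name, street, "city , state , zip", phone) as in the sample above. The method name `records_to_csv` and the column names are my own choices, not anything dictated by the data.

```ruby
require "csv"

# Sample text in the fixed four-line layout shown above.
text = <<~TEXT
  "I Am" School
  118 Siskiyou Avenue
  Mount Shasta , CA , 96067
  5309266263
  424 Aviation
  13230 SW 132 Ave.
  Miami , FL , 33186
  7862424848
TEXT

# Convert the text to a CSV string, four lines per record.
def records_to_csv(text)
  lines = text.lines.map(&:strip).reject(&:empty?)
  CSV.generate do |csv|
    csv << %w[name street city state zip phone] # hypothetical column names
    lines.each_slice(4) do |name, street, place, phone|
      # "Mount Shasta , CA , 96067" -> ["Mount Shasta", "CA", "96067"]
      city, state, zip = place.to_s.split(",").map(&:strip)
      csv << [name, street, city, state, zip, phone]
    end
  end
end

csv_text = records_to_csv(text)
puts csv_text
```

Zip codes and phone numbers stay as strings, so leading zeros survive. A single missing or extra line would shift every record after it, so it is worth checking that the number of non-empty lines is a multiple of four before trusting the output.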

If this is a one-time operation, you could also use a tool like Able2Extract (if you're on Windows). It reads a PDF and saves it as Excel. The times I used it, the result was decent and the layout was intact.