Looking for recommendations on how to convert a PDF into a structured format

予麋鹿 2021-02-06 00:34

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured f

2 answers
  •  南旧 (original poster)
    2021-02-06 00:49

    After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was not able to put together a completely reusable set of steps for future auction publications.

    As an aside, I did call Los Angeles County to ask whether they could provide the properties up for auction in an alternative format (Excel, etc.), and the answer was no. That's government for you.

    Here's a high-level view of my approach:

    • Convert the PDF into a text file using Poppler
    • Use regex-fu to clean up the text and create XML nodes from the data
    • Use an XML beautifier / validator to find errors and do cleanup
    • Use Python/Ruby to add a Google Maps link node and a link to the LA County Assessor's map (http://assessormap.co.la.ca.us/mapping/rolldata.asp?ain=APN-GOES_HERE)
    • Convert the XML to CSV with Ruby
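
    The link-building step could look roughly like the sketch below. The helper names are hypothetical, and two things are assumptions rather than anything confirmed here: that the assessor site wants the APN as digits only, and that a Google Maps search URL of the form https://www.google.com/maps/search/?api=1&query=... is good enough for a link node.

```ruby
require 'uri'

# Base URL taken from the step list above; the APN goes in the `ain` parameter.
ASSESSOR_BASE = 'http://assessormap.co.la.ca.us/mapping/rolldata.asp'

# Assumption: the assessor site expects digits only,
# so "2006 003 001" is sent as "2006003001".
def assessor_map_link(apn)
  "#{ASSESSOR_BASE}?ain=#{apn.gsub(/\D/, '')}"
end

# Assumption: a Google Maps search URL is sufficient for the link node.
# Returns nil when there is no street address to search for
# (for example, listings marked "VACANT LOT").
def google_maps_link(address)
  cleaned = address.to_s.strip
  return nil if cleaned.empty? || cleaned.casecmp?('VACANT LOT')

  "https://www.google.com/maps/search/?api=1&query=#{URI.encode_www_form_component(cleaned)}"
end
```

    Returning nil for vacant lots keeps the CSV column empty instead of pointing at a meaningless search.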

    I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers.

    Use Homebrew to install Poppler for Mac:

    brew install poppler
    

    After Poppler is installed, you should have access to the pdftotext utility to convert the PDF:

    pdftotext -layout -f 24 -l 687 AuctionBook2013.pdf auction_book.txt
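
    To make this step repeatable for a future auction book, the same command can be wrapped in a small Ruby helper. This is only a sketch: the method names are made up, and it assumes pdftotext is on the PATH. The flags are the ones used above: -layout keeps the physical page layout, and -f / -l set the first and last page to convert.

```ruby
# Builds the argument list for pdftotext using the same flags as the
# command above (-layout, -f first page, -l last page).
def pdftotext_command(pdf_path, txt_path, first_page:, last_page:)
  ['pdftotext', '-layout',
   '-f', first_page.to_s,
   '-l', last_page.to_s,
   pdf_path, txt_path]
end

# Runs the conversion and returns true only if pdftotext exited successfully.
# Passing the arguments as separate strings avoids shell quoting problems
# with unusual file names.
def convert_pdf(pdf_path, txt_path, first_page:, last_page:)
  system(*pdftotext_command(pdf_path, txt_path,
                            first_page: first_page, last_page: last_page)) == true
end
```

    For the 2013 book this would be called as convert_pdf('AuctionBook2013.pdf', 'auction_book.txt', first_page: 24, last_page: 687).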
    

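    Here's a preview of the XML (Click here for full XML):

    <listings>
       <item>
          <id>536</id>
          <minbid>3,422</minbid>
          <apn>2006 003 001</apn>
          <delinquent_year>03</delinquent_year>
          <apn_old>2006 003 001</apn_old>
          <description>LICENSED SURVEYOR'S MAP
              AS PER BK 25 PG 28 OF L S LOT 1
              BLK 1 ASSESSED TO    J   AND   S
              LIMITED LLC C/O DUNA CSARDAS -
              JULIUS JANCSO LOCATION COUNTY OF
              LOS ANGELES
          </description>
          <address>VACANT LOT</address>
       </item>
    </listings>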
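
    Two things in this preview get in the way of analysis: the description keeps the line breaks and padding from the PDF layout, and the minimum bid has a thousands separator. A minimal cleanup sketch follows; the helper names are hypothetical, and it assumes bids are always whole dollar amounts.

```ruby
# Collapses runs of whitespace (including newlines) into single spaces.
def squish(text)
  text.gsub(/\s+/, ' ').strip
end

# Converts a bid such as "3,422" into an Integer.
# Assumption: bids are whole dollar amounts with comma separators.
# Raises ArgumentError if anything other than digits remains.
def parse_minbid(text)
  Integer(text.delete(',').strip, 10)
end
```

    Raising on an unexpected bid format is deliberate here: a bad row fails loudly rather than silently skewing the analysis.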

    Edit: Adding the Ruby I wrote to convert the XML to a CSV.

    require 'rexml/document'
    require 'csv' # lowercase: 'CSV' fails to load on case-sensitive file systems

    # Converts the auction XML into a CSV file, one row per <item>.
    class Auction
      FIELDS = %w[minbid apn delinquent_year apn_old description address].freeze

      def initialize(xml_path = 'AuctionBook2013.xml', csv_path = 'auction.csv')
        # File.read closes the file handle once the contents are loaded.
        doc = REXML::Document.new(File.read(xml_path))

        CSV.open(csv_path, 'w') do |csv|
          csv << ['id', *FIELDS]

          doc.elements.each('/listings/item') do |item|
            # The id may be an attribute of <item> or a child element.
            id = item.attributes['id'] || item.elements['id']&.text
            # &.text leaves the cell empty instead of raising when a node is missing.
            csv << [id, *FIELDS.map { |name| item.elements[name]&.text }]
          end
        end
      end
    end

    Auction.new
    

    Link to Final CSV
