I would like to use Python 2.7 to remove anything that isn't the documents' text from EDGAR filings (which are available online as .txt files). An example of what the file
The link below points to a library that parses EDGAR filings into a SQLite database. It can pull Form10k and Form8Qk filings from the EDGAR FTP site for the years you specify and load them into normalized SQLite tables. Because the filing standard is so poorly adhered to, writing your own parsing script would be a significant undertaking. With that library, code similar to the below loads the filings for the quarters you want, and from there you can query the tables for the data you are seeking.
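# Assumes the edgar package from the link below is on your Python path
import edgar.database

edgar.database.create()
# Load quarterly master index files into local sqlite db
quarters = []
# Q3 2009. A list has no add() method, so append instead; passing each
# quarter as a (year, quarter) tuple is an assumption about what load() expects.
quarters.append((2009, 3))
# Q3 2008
quarters.append((2008, 3))
edgar.database.load(quarters)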
http://rf-contrib.googlecode.com/svn/trunk/ha/src/main/python/edgar/
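Once the filings are loaded, the querying step might look roughly like the sketch below, which uses only the standard sqlite3 module. I don't know the schema the library actually creates, so the table name `filings`, its columns, the helper names `find_filings` and `build_demo_db`, and the sample rows are all hypothetical placeholders; check the real schema and adjust the SQL to match.

```python
import sqlite3

# Hypothetical schema: the table and column names the library really
# creates may differ, so inspect the database before relying on these.
SCHEMA = """
CREATE TABLE IF NOT EXISTS filings (
    company_name TEXT,
    form_type TEXT,
    year INTEGER,
    quarter INTEGER,
    file_name TEXT
)
"""


def find_filings(conn, form_type, year, quarter):
    """Return (company_name, file_name) rows for one form type and quarter."""
    cursor = conn.execute(
        "SELECT company_name, file_name FROM filings "
        "WHERE form_type = ? AND year = ? AND quarter = ? "
        "ORDER BY company_name",
        (form_type, year, quarter),
    )
    return cursor.fetchall()


def build_demo_db():
    """Build an in-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO filings VALUES (?, ?, ?, ?, ?)",
        [
            ("Example Corp A", "10-K", 2009, 3, "example-a-10k.txt"),
            ("Example Corp B", "8-K", 2009, 3, "example-b-8k.txt"),
            ("Example Corp C", "10-K", 2008, 3, "example-c-10k.txt"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for company, file_name in find_filings(demo, "10-K", 2009, 3):
        print("%s: %s" % (company, file_name))
```

To see which tables a real database file contains, you can run `SELECT name FROM sqlite_master WHERE type = 'table'` against it and then rewrite the query above for those names.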