Extracting text from Microsoft Word files in Python with Scrapy

Posted by 三世轮回 on 2019-12-12 02:26:31

Question


Here is my sample Scrapy spider in Python, which tries to extract text from .doc and .docx files linked from a website.

import StringIO
import urlparse

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field

from miette import DocReader


class wordSpiderItem(Item):

    link = Field()
    title = Field()
    Description = Field()


class wordSpider(CrawlSpider):

    name = "penyrheol"

    # Stay within these domains when crawling
    allowed_domains = ["penyrheol-comp.net"]
    start_urls = ["http://penyrheol-comp.net/vacancy"]

    def parse(self, response):
        listings = response.xpath('//div[@class="entry-content"]')
        links = []

        # Scrape the listings page to collect document links
        for listing in listings:
            link = listing.xpath(
                './/div[@class="afi-document-link"]/a/@href').extract()
            links.extend(link)

        # Request each document link so its content can be parsed
        for link in links:
            item = wordSpiderItem()
            item['link'] = link
            if "doc" in link:
                yield Request(urlparse.urljoin(response.url, link),
                              meta={'item': item}, callback=self.parse_data)

    def parse_data(self, response):
        job = wordSpiderItem()
        job['link'] = response.url
        stream = StringIO.StringIO(response.body)
        reader = DocReader(stream)
        for page in reader.pages:
            job['Description'] = page.extractText()
            return job

I get the following error. Please take a look and let me know how to fix this code.

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\sureshp>cd D:\Final\penyrheolcomp
C:\Documents and Settings\sureshp>d:
D:\Final\penyrheolcomp>scrapy crawl penyrheol -o testd.json -t json
2014-09-05 17:49:55+0530 [scrapy] INFO: Scrapy 0.24.2 started (bot: penyrheolcomp)
2014-09-05 17:49:55+0530 [scrapy] INFO: Optional features available: ssl, http11
2014-09-05 17:49:55+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'penyrheolcomp.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['penyrheolcomp.spiders'], 'FEED_URI': 'testd.json', 'BOT_NAME': 'penyrheolcomp'}
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled item pipelines:
2014-09-05 17:49:56+0530 [penyrheol] INFO: Spider opened
2014-09-05 17:49:56+0530 [penyrheol] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-05 17:49:57+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/vacancy> (referer: None)
2014-09-05 17:49:59+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:49:59+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc>
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
            self._startRunCallbacks(result)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "penyrheolcomp\spiders\penyrheolcompnet.py", line 58, in parse_data
            reader = DocReader(stream)
          File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
          File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
        exceptions.TypeError: coercing to Unicode: need string or buffer, instance found
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc>
        [same TypeError traceback as above]
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc>
        [same TypeError traceback as above]
2014-09-05 17:50:01+0530 [penyrheol] INFO: Closing spider (finished)
2014-09-05 17:50:01+0530 [penyrheol] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1208,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 677942,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 9, 5, 12, 20, 1, 140000),
'log_count/DEBUG': 6,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 4,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/TypeError': 3,
'start_time': datetime.datetime(2014, 9, 5, 12, 19, 56, 234000)}
2014-09-05 17:50:01+0530 [penyrheol] INFO: Spider closed (finished)
D:\Final\penyrheolcomp>

Answer 1


Your error is hiding in your stacktrace:

reader = DocReader(stream)
File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found

According to https://github.com/rembish/Miette/blob/master/miette/doc.py, the `DocReader` `__init__` takes the filename of the document you want to read, not its body.

To get around this, you could write `response.body` to a temporary file and then point your `DocReader` at that temporary file.
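That fix can be sketched like this (untested; `body_to_temp_file` is a helper name introduced here, and the page-iteration at the end simply reuses the reader API as written in the question's code, which may itself need adjusting):

```python
import os
import tempfile


def body_to_temp_file(body, suffix='.doc'):
    # Persist the raw downloaded bytes to a temporary file so that
    # libraries expecting a filename (like miette's DocReader) can
    # open the document. The caller must delete the file when done.
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, 'wb') as f:
        f.write(body)
    return path


# The spider's callback could then look like this (sketch only;
# DocReader and wordSpiderItem come from the question's code):
def parse_data(self, response):
    job = wordSpiderItem()
    job['link'] = response.url
    path = body_to_temp_file(response.body)
    try:
        reader = DocReader(path)  # now receives a filename, not a stream
        job['Description'] = '\n'.join(
            page.extractText() for page in reader.pages)
    finally:
        os.remove(path)  # clean up the temporary copy
    return job
```

Writing to disk is slightly wasteful, but it is the simplest way to bridge an in-memory response body and a file-path-only API.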



Source: https://stackoverflow.com/questions/25686285/extracting-text-from-microsoft-word-files-in-python-with-scrapy
