scraper

XPath to select between two HTML comments?

Submitted by 断了今生、忘了曾经 on 2019-12-03 03:37:23
I have a big HTML page, and I want to select certain nodes using XPath: <html> ........ <!-- begin content --> <div>some text</div> <div><p>Some more elements</p></div> <!-- end content --> ....... </html> I can select the HTML after <!-- begin content --> with "//comment()[. = ' begin content ']/following::*", and the HTML before <!-- end content --> with "//comment()[. = ' end content ']/preceding::*". But is there an XPath expression that selects all the HTML between the two comments? I would look for elements that are preceded by the first comment and followed by the second comment.
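One common way to express "between the two comments" in XPath 1.0 is to intersect the two node sets (A[count(. | B) = count(B)]). A minimal sketch using Python and lxml (my choice of library, not mentioned in the question), with a small hypothetical document standing in for the real page:

```python
from lxml import html

# Hypothetical sample document; the real page would be much larger.
doc = html.fromstring("""
<html><body>
  <div>header</div>
  <!-- begin content -->
  <div>some text</div>
  <div><p>Some more elements</p></div>
  <!-- end content -->
  <div>footer</div>
</body></html>
""")

# Elements that follow the begin comment AND precede the end comment:
# the intersection is expressed with the count(. | B) = count(B) idiom.
between = doc.xpath(
    "//comment()[. = ' begin content ']/following::*"
    "[count(. | //comment()[. = ' end content ']/preceding::*)"
    " = count(//comment()[. = ' end content ']/preceding::*)]"
)
for el in between:
    print(el.tag, el.text_content())
```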

BeautifulSoup MemoryError When Opening Several Files in Directory

Submitted by 白昼怎懂夜的黑 on 2019-12-01 22:11:41
Context: Every week I receive a list of lab results as an HTML file. Each week there are about 3,000 results, and each set of results has between two and four tables associated with it. For each result/trial I only care about some standard information stored in one of those tables. That table can be uniquely identified because its first cell, first column always contains the text "Lab Results". Problem: The following code works great when I process one file at a time, i.e. instead of doing a for loop over the directory, I point get_data = open() at a specific file.
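Since only the full directory loop runs out of memory, the usual suspects are parse trees and file handles kept alive across iterations. Below is a hedged sketch of the per-file pattern, assuming the target table is found by a first cell containing "Lab Results"; the directory name, the results list, and the extraction logic are illustrative, not the asker's actual code:

```python
import os
from bs4 import BeautifulSoup

directory = "weekly_reports"  # hypothetical directory name
results = []

for name in os.listdir(directory):
    if not name.endswith(".html"):
        continue
    path = os.path.join(directory, name)
    with open(path, encoding="utf-8") as f:      # file handle closed per iteration
        soup = BeautifulSoup(f, "html.parser")
    for table in soup.find_all("table"):
        first_cell = table.find(["td", "th"])
        if first_cell and "Lab Results" in first_cell.get_text():
            results.append([
                [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
                for row in table.find_all("tr")
            ])
    soup.decompose()   # break the parse tree's internal references
    del soup           # let the tree be garbage-collected before the next file
```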

BeautifulSoup: Strip specified attributes, but preserve the tag and its contents

Submitted by 岁酱吖の on 2019-12-01 02:28:27
I'm trying to 'defrontpagify' the HTML of an MS FrontPage-generated website, and I'm writing a BeautifulSoup script to do it. However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet: REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font', 'dir','face','size','color','style','class','width','height','hspace', 'border','valign','align','background
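For reference, a minimal sketch of one way to strip a blacklist of attributes while keeping the tags and their contents. This completes the truncated snippet with my own loop, so treat the loop and the sample HTML as assumptions about what the asker intended:

```python
from bs4 import BeautifulSoup

# List reconstructed from the question (it is truncated there).
REMOVE_ATTRIBUTES = ['lang', 'language', 'onmouseover', 'onmouseout', 'script',
                     'style', 'font', 'dir', 'face', 'size', 'color', 'class',
                     'width', 'height', 'hspace', 'border', 'valign', 'align',
                     'background']

html = '<p lang="en" class="MsoNormal" style="color:red">Hello <b face="Arial">world</b></p>'
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(True):          # True matches every tag in the document
    for attr in REMOVE_ATTRIBUTES:
        if attr in tag.attrs:
            del tag[attr]                # drop the attribute, keep the tag and its contents

print(soup)  # <p>Hello <b>world</b></p>
```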

How to crawl with PHP Goutte and Guzzle if data is loaded by JavaScript?

Submitted by 余生颓废 on 2019-11-30 20:35:00
Many times when crawling we run into problems where the content rendered on the page is generated with JavaScript, and therefore the scraper is unable to crawl it (e.g. AJAX requests, jQuery). You want to have a look at PhantomJS. There is this PHP implementation: http://jonnnnyw.github.io/php-phantomjs/ if you need it working with PHP, of course. You could render the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like searching the contents, etc.). That would depend on your needs; maybe you can simply use the DOM, like this: How to get
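The underlying pattern is: render the page with a headless browser first, then hand the rendered HTML to an ordinary parser. The sketch below shows that same pattern in Python with Selenium and BeautifulSoup, which is a different stack than the PHP/PhantomJS/Guzzle one discussed above and is shown only to illustrate the idea; the URL and selector are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")            # run Chrome without a window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/js-heavy-page")   # hypothetical URL
rendered_html = driver.page_source                # HTML after JavaScript has run
driver.quit()

# Now any ordinary parser (the Goutte/Guzzle role in the question) can work on it.
soup = BeautifulSoup(rendered_html, "html.parser")
print(soup.select_one("div.content"))             # hypothetical selector
```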

Crawling LinkedIn while authenticated with Scrapy

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-30 10:41:30
So I've read through Crawling with an authenticated session in Scrapy and I am getting hung up. I am 99% sure that my parse code is correct; I just don't believe the login is redirecting and succeeding. I am also having an issue with check_login_response(): I'm not sure what page it is checking, though "Sign Out" would make sense. ====== UPDATED ====== from scrapy.contrib.spiders.init import InitSpider from scrapy.http import Request, FormRequest from scrapy.contrib.linkextractors
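The imports in the question are cut off. For context, the usual shape of an InitSpider-based login in that era of Scrapy (the old scrapy.contrib paths) looked roughly like the sketch below; the class name, login URL, form field names, and the "Sign Out" check are assumptions pieced together from the question, not a verified working LinkedIn crawler:

```python
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LinkedInSpider(InitSpider):
    name = "linkedin"
    login_page = "https://www.linkedin.com/uas/login"   # assumed login URL
    start_urls = ["https://www.linkedin.com/"]

    def init_request(self):
        # Entry point InitSpider calls before the normal crawl starts.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill in the login form found on the page; field names are assumptions.
        return FormRequest.from_response(
            response,
            formdata={"session_key": "user@example.com",
                      "session_password": "password"},
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        # This callback receives whatever page the site redirects to after the
        # form POST; that redirected page is what gets "checked" here.
        if "Sign Out" in response.body:
            self.log("Login succeeded")
            return self.initialized()   # hand control back to InitSpider
        self.log("Login failed")
```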

How to scrape tables in thousands of PDF files?

Submitted by 坚强是说给别人听的谎言 on 2019-11-30 05:05:21
I have about 1,500 PDFs, each consisting of only one page and exhibiting the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example). What I am looking for is a way to iterate over all these files (locally, if possible) and extract the actual contents of the table (as CSV, stored into an SQLite DB, whatever). I would love to do this in Node.js, but couldn't find any suitable libraries for parsing such files. Do you know of any? If it isn't possible in Node.js, I could also code it in Python, if there are better methods available. I didn't know
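In Python, one library that handles single-page, fixed-layout tables like these is pdfplumber. This is my suggestion, not something from the original answer, and the folder and output file names are hypothetical; a rough sketch of the iterate-and-dump loop:

```python
import csv
import glob
import pdfplumber

with open("tables.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in glob.glob("pdfs/*.pdf"):       # hypothetical local folder of the 1,500 files
        with pdfplumber.open(path) as pdf:
            page = pdf.pages[0]                # each file has exactly one page
            table = page.extract_table()       # list of rows (lists of cell strings), or None
            if table:
                for row in table:
                    writer.writerow([path] + row)
```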

Setting up a django-dynamic-scraper (DDS) web scraping environment

Submitted by 拟墨画扇 on 2019-11-29 18:54:15
I had previously learned how powerful and fast Scrapy is and how convenient Django is, but never had the chance to really work with them. A few days ago a senior colleague asked me to look into this framework, and wow, DDS integrates these two capable tools, so with just a simple installation and configuration you can crawl and scrape pages smoothly. Without further ado, here is my write-up of the environment setup process, both as a backup for myself and in the hope that it helps others. Set up the Django environment: see my previous post on setting up a Django development environment. Install Scrapy: the latest version is 0.18, which can be installed with easy_install Scrapy or pip install Scrapy, but DDS does not yet support 0.18, so install 0.16 with "pip install scrapy==0.16". Test that the installation succeeded: scrapy shell http://www.baidu.com. On Windows, some extra packages need to be installed first: win32api, Zope.Interface, Twisted, w3lib, libxml2, pyOpenSSL, lxml. Install django-celery to set up scheduled tasks: pip install django-celery, or download and unpack the package and install it with "python setup.py install". Install PIL (Python Imaging Library): download the package, unpack it, and run the command "python setup.py

crawler vs scraper

Submitted by 风流意气都作罢 on 2019-11-28 05:12:14
Can somebody distinguish between a crawler and a scraper in terms of scope and functionality? Jerry Coffin A crawler gets web pages; i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked from the starting point(s). A scraper takes pages that have been downloaded, or, in a more general sense, data that's formatted for display, and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired. Depending on how you
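To make the distinction concrete, here is a toy sketch in Python with requests and BeautifulSoup (purely illustrative, not from the answer): the crawl step fetches pages and follows links within a budget, while the scrape step extracts structured data from a single fetched page.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_pages=10):
    """Crawler: fetch pages and follow links, up to a page budget."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        pages[url] = html
        soup = BeautifulSoup(html, "html.parser")
        queue.extend(urljoin(url, a["href"]) for a in soup.find_all("a", href=True))
    return pages

def scrape(html):
    """Scraper: extract structured data (here, just the title) from one downloaded page."""
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.title.string if soup.title else None}

pages = crawl("https://example.com")            # hypothetical starting point
data = [scrape(html) for html in pages.values()]
```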