web-crawler

How to extract specific text from a pdf file - python

折月煮酒 submitted on 2019-12-12 19:19:46
Question: I am trying to extract this text from this PDF file: DLA LAND AND MARITIME ACTIVE DEVICES DIVISION PO BOX 3990 COLUMBUS OH 43218-3990 USA Name: Desmond Forshey Buyer Code: PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930 Email: Desmond.Forshey@dla.mil. I was able to extract some text between two references using the code below:

import PyPDF2
pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj1 = pdfReader.getPage(0)
pagecontent
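A minimal sketch of one way to pull that block out with the same PyPDF2 1.x-era API the question uses (PdfFileReader, getPage, extractText). The start marker and the e-mail pattern are assumptions based on the text quoted above, not the confirmed layout of the real PDF:

import re
import PyPDF2

# Read the first page of the PDF named in the question.
with open('SPE7M518T446E.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)
    page_text = reader.getPage(0).extractText()

# Grab everything from the buyer block header up to the e-mail address.
match = re.search(r'DLA LAND AND MARITIME.*?Email:\s*\S+', page_text, re.DOTALL)
if match:
    print(match.group(0))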

Automating a Python Web Crawler - How to prevent raw_input all the time?

前提是你 submitted on 2019-12-12 19:12:57
Question: I have been trying to create a Python web crawler that fetches a web page, reads its list of links, returns the link at a pre-specified position, and repeats this a certain number of times (defined by the count variable). My issue is that I have not been able to automate the process: I have to keep typing in the link that the code finds. Here is my code: The first URL is http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html The count_1 is equal to 7 The
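A minimal sketch of the usual way to automate this without prompting for input on every pass, assuming urllib and BeautifulSoup (the setup implied by the course-style URL). The count comes from the question; the position value is an illustrative placeholder because the excerpt cuts off before giving it:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html'
count = 7       # "The count_1 is equal to 7" in the question
position = 18   # illustrative placeholder; not quoted in the excerpt

for _ in range(count):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    anchors = soup('a')
    url = anchors[position - 1].get('href')  # follow the link at the given position
    print(url)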

Issue Crawling Amazon, Element Cannot Be Scrolled into View

橙三吉。 submitted on 2019-12-12 18:23:30
Question: I'm having an issue crawling pages on Amazon. I've tried executing a JS script, Action Chains, and explicit waits. Nothing seems to work; everything throws one exception or error or another. Base script:

ff = create_webdriver_instance()
ff.get('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB
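One commonly suggested pattern, shown only as a sketch: ask the browser itself to scroll the element into view, then wait for it to become clickable before clicking. The Firefox driver stands in for create_webdriver_instance() and the locator is a placeholder, not the real (and frequently changing) Amazon markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ff = webdriver.Firefox()  # stand-in for create_webdriver_instance()
ff.get('https://www.amazon.ca/gp/goldbox')

locator = (By.CSS_SELECTOR, 'div.dealContainer a')  # placeholder selector

element = WebDriverWait(ff, 10).until(EC.presence_of_element_located(locator))
# Bring the element into the viewport before interacting with it.
ff.execute_script('arguments[0].scrollIntoView({block: "center"});', element)
WebDriverWait(ff, 10).until(EC.element_to_be_clickable(locator)).click()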

Skipping web pages with extensions pdf, zip from crawling in Anemone

点点圈 submitted on 2019-12-12 17:18:53
Question: I am developing a crawler using the anemone gem (Ruby 1.8.7 and Rails 3.1.1). How should I skip web pages with extensions pdf, doc, zip, etc. from being crawled/downloaded?

Answer 1:

ext = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)

Anemone.crawl(url) do |anemone|
  # Group the alternation so the leading dot and trailing anchor apply to every extension.
  anemone.skip_links_like /\.(#{ext.join('|')})$/
  ...
end

Source: https://stackoverflow.com

rubyXL (Errno::ENOENT)

╄→尐↘猪︶ㄣ submitted on 2019-12-12 15:27:09
Question: I'm having trouble with a crawler I'm building using rubyXL. It correctly traverses my file system, but I am receiving an (Errno::ENOENT) error. I've looked through the rubyXL code and everything appears to check out. My code is attached below - any suggestions?

/Users/.../testdata.xlsx
/Users/.../moretestdata.xlsx
/Users/.../Lab 1 Data.xlsx
/Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:404:in `initialize': No such file or directory - /Users/Dylan/...

Allowing Flash to run on all sites in Puppeteer

♀尐吖头ヾ submitted on 2019-12-12 10:56:22
Question: Disclaimer: I know that Flash will be abandoned by the end of 2020, but I simply cannot drop this case and need to have Flash in Puppeteer, though I don't like it either. I need to crawl certain Flash sites and take screenshots of them for later programmatic comparison. I could provide a finite list of domains to check against (though the list may change over time, so it would be great to be able to somehow load it at runtime). I have been searching the Internet for solutions

What should I know about search engine crawling?

时光毁灭记忆、已成空白 submitted on 2019-12-12 10:54:30
Question: I don't mean SEO things. What should I know? For example: Do engines run JavaScript? Do they use cookies? Will cookies carry across crawl sessions (say, cookies from today and a crawl next week or next month)? Are certain JS files not loaded for some reason (such as a suspected ad being ignored for optimization)? I don't want every indexed page to accidentally show some kind of error or warning message like "please turn on cookies" or "browser not supported", or to fail to be indexed because I did something

Facebook crawler is hitting my server hard and ignoring directives. Accessing same resources multiple times

我与影子孤独终老i submitted on 2019-12-12 10:34:18
Question: The Facebook crawler is hitting my servers multiple times every second, and it seems to be ignoring both the Expires header and the og:ttl property. In some cases it accesses the same og:image resource multiple times over the space of 1-5 minutes. In one example, the crawler accessed the same image 12 times over the course of 3 minutes, using 12 different IP addresses. I only had to log requests for 10 minutes before I caught the following example: List of times and crawler IP addresses
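The excerpt does not include an accepted fix; one mitigation people reach for is to detect the crawler by its documented facebookexternalhit user agent and throttle it at the application layer. A hypothetical Flask sketch, with illustrative names and limits:

import time
from flask import Flask, request, abort

app = Flask(__name__)

last_served = {}    # tiny in-memory throttle keyed by path; single-process only
MIN_INTERVAL = 60   # seconds between serves of the same path to the crawler

@app.before_request
def throttle_facebook_crawler():
    agent = request.headers.get('User-Agent', '')
    if 'facebookexternalhit' in agent:
        now = time.time()
        if now - last_served.get(request.path, 0) < MIN_INTERVAL:
            abort(429)  # tell the crawler to back off
        last_served[request.path] = now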

How to hold the cache in Selenium in a loop after a page gets refreshed?

旧巷老猫 submitted on 2019-12-12 10:08:00
Question: I am using this spider to click on a color; the page then gets refreshed, and the subsequent clicking on the links breaks partway through with the error "Element not found in the cache - perhaps the page has changed since it was looked up". How do I get hold of the original page after the loop completes? I couldn't find any suitable solution for this.

import scrapy
from scrapy.contrib.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import
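The quoted message is Selenium's stale element error: element references captured before the refresh no longer point at live DOM nodes. A common pattern, sketched here with a placeholder URL and selector, is to count the links once and then re-locate them by index on every pass instead of reusing the old list:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com')  # placeholder URL, not from the question

link_selector = 'ul.colors a'     # placeholder selector
count = len(driver.find_elements(By.CSS_SELECTOR, link_selector))

for i in range(count):
    links = driver.find_elements(By.CSS_SELECTOR, link_selector)  # fresh lookup each pass
    links[i].click()                                              # the page refreshes here
    driver.back()                                                 # return to the listing page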

DRY search every page of a site with nokogiri

ⅰ亾dé卋堺 submitted on 2019-12-12 08:54:56
Question: I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat, while also taking measures not to repeat effort. It starts very easily:

page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a'  # find all links on the current page
main_links = links.map{ |l| l['href'] if l['href'] =~ /^\// }.compact.uniq

"main_links" is now an array of links from the active page that start with "/" (which
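The question continues in Ruby, but the crawl-and-de-duplicate idea it describes looks roughly like the sketch below, written in Python for consistency with the other sketches in this listing (requests and BeautifulSoup assumed; the start URL is the example.com placeholder from the question):

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start = 'http://example.com'
domain = urlparse(start).netloc
to_visit = [start]
visited = set()

while to_visit:
    url = to_visit.pop()
    if url in visited:
        continue                                    # avoid repeating effort
    visited.add(url)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for a in soup.find_all('a', href=True):
        absolute = urljoin(url, a['href'])          # resolve "/..." style links
        if urlparse(absolute).netloc == domain:     # stay within the domain
            to_visit.append(absolute)

print(len(visited), 'pages crawled')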