web-crawler

Web-Scraping with R

Submitted by 南笙酒味 on 2019-12-06 10:19:58
Question: I'm having some problems scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following website: http://spiderbook.com/company/17495/details?rel=300795. In particular, I want to extract the links to the articles on this site. My idea so far:

xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesCharSep <- lapply(sourcesChar
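The R excerpt above is cut off mid-expression. As a rough illustration of the same task, pulling the links out of that page, here is a minimal sketch in Python using requests and BeautifulSoup; the choice of libraries and the assumption that plain <a href> extraction is enough for this page are mine, not the asker's.

import requests
from bs4 import BeautifulSoup

url = "http://spiderbook.com/company/17495/details?rel=300795"
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Grab every hyperlink on the page; narrowing this down to just the article
# links would need a selector specific to this site's markup.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)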

Is there a web crawler library available for PHP or Ruby? [closed]

Submitted by 元气小坏坏 on 2019-12-06 09:37:09
Question: (Closed as off-topic for Stack Overflow; it is not accepting answers and was closed 4 years ago.) Is there a web crawler library available for PHP or Ruby? A library that can crawl depth-first or breadth-first, and that handles links even when href="../relative_path.html" is used together with a base URL.

Answer 1: Check this page out for a Ruby library: Ruby Mechanize. I'd like to mention that you would still be responsible
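The answer is cut off above. As a sketch of the two requirements in the question, breadth-first traversal and resolving relative hrefs such as ../relative_path.html, here is a minimal illustration in Python rather than PHP or Ruby; note that urljoin resolves against the URL the page was fetched from, so honoring an explicit <base> tag would need extra handling.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_breadth_first(start_url, max_pages=50):
    """Visit pages breadth-first, resolving each relative link against its page URL."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        html = requests.get(url, timeout=30).text
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, a["href"])  # handles href="../relative_path.html"
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

Swapping the deque's popleft() for pop() turns the same loop into a depth-first crawl.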

Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

Submitted by 牧云@^-^@ on 2019-12-06 09:35:28
Question: After a Nutch crawl in distributed (deploy) mode as follows:

bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20

I need to extract each URL fetched along with its content in a map-reduce-friendly format. Using the readseg command below, the contents are fetched, but the output format doesn't lend itself to being map-reduced.

bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext

Ideally the output should be in this format:
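The desired format itself is cut off in the excerpt. As a rough sketch of what "map-reduce friendly" usually amounts to here, one record per line with tab-separated fields and no raw newlines inside a field, the Python below turns (url, content) pairs into such lines; the record iterable is a hypothetical stand-in, not part of Nutch's API.

import base64

def to_mr_friendly_lines(records):
    """Yield one 'URL<TAB>base64(content)' line per fetched page.

    records is a hypothetical iterable of (url, content_bytes) pairs, e.g.
    produced by whatever parses the readseg dump; base64-encoding the content
    keeps embedded newlines from breaking record boundaries."""
    for url, content in records:
        yield "%s\t%s" % (url, base64.b64encode(content).decode("ascii"))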

Creating a static copy of a web page on UNIX commandline / shell script

Submitted by 眉间皱痕 on 2019-12-06 09:15:33
Question: I need to create a static copy of a web page (all media resources, such as CSS, images, and JS, included) in a shell script. This copy should be openable offline in any browser. Some browsers have a similar feature ("Save As... Web Page, complete") which creates a folder for a page and rewrites external resources as relative static resources in that folder. What's a way to accomplish and automate this on the Linux command line for a given URL?

Answer 1: You can use wget like this:

wget --recursive -
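The answer's command is truncated above. A commonly used combination of wget options for a complete, offline-viewable copy of a single page looks like the following; this is an assumption about the intended command rather than the answer's exact text, and assets served from other domains would additionally need --span-hosts.

wget --recursive --level=1 --page-requisites --convert-links --adjust-extension --no-parent "http://example.com/page.html"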

How do I scrape HTML between two HTML comments using Nokogiri?

Submitted by 纵饮孤独 on 2019-12-06 09:12:48
I have some HTML pages where the contents to be extracted are marked with HTML comments, like below:

<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments. I want to extract the full elements between these two HTML comments:

<div>some text</div>
<div><p>Some more elements</p></div>

I can get the text-only version using this characters callback:

class TextExtractor < Nokogiri::XML::SAX::Document
  def
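Since the markers are plain comments in the source, one simple language-agnostic approach is to slice the raw markup between the two comment strings; the sketch below shows that idea in Python with a regular expression rather than Nokogiri's SAX interface.

import re

def extract_between_comments(html):
    """Return the raw HTML between the begin/end content comment markers."""
    match = re.search(
        r"<!--\s*begin content\s*-->(.*?)<!--\s*end content\s*-->",
        html,
        re.DOTALL,
    )
    return match.group(1).strip() if match else None

# extract_between_comments(page_source) would return the two <div> elements above.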

File Crawler PHP

Submitted by 青春壹個敷衍的年華 on 2019-12-06 08:43:15
Just wondering how it would be possible to recursively search through a website folder directory (the same one the script is uploaded to), open/read every file, and search for a specific string? For example I might have this:

search.php?string=hello%20world

This would run a process and then output something like "hello world found inside":

httpdocs
/index.php
/contact.php
httpdocs/private/
../private.php
../morestuff.php
../tastey.php
httpdocs/private/love
../../goodness.php

I don't want it to link-crawl, as private files and unlinked files are around, but I'd like every other non-binary file to
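As a sketch of the recursive walk itself, shown in Python rather than PHP; the NUL-byte test for skipping binary files is a crude heuristic of mine, not something from the question.

import os

def find_string_in_files(root, needle):
    """Walk every file under root and return the paths that contain needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file; skip it
            if b"\x00" in data:
                continue  # crude binary check: skip files containing NUL bytes
            if needle.encode("utf-8") in data:
                hits.append(path)
    return hits

# Example: find_string_in_files("httpdocs", "hello world")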

Downloading pdf files using mechanize and urllib

Submitted by 戏子无情 on 2019-12-06 08:41:14
I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain web pages and downloads them. Here's my current approach (just for one sample URL):

import mechanize
import urllib
import sys
from mechanize import HTTPError  # needed for the except clause below

mech = mechanize.Browser()
mech.set_handle_robots(False)
url = "http://www.xyz.com"
try:
    mech.open(url, timeout=30.0)
except HTTPError, e:
    sys.exit("%d: %s" % (e.code, e.msg))

links = mech.links()
for l in links:
    # Some are relative links
    path = str(l.base_url[:-1]) + str(l.url)
    if path.find(".pdf") > 0:
        urllib.urlretrieve(path)

The program runs without any errors, but I am
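The excerpt breaks off before stating the actual problem, but one thing worth flagging in the code as shown: urllib.urlretrieve(path) with no second argument saves each file under an auto-generated temporary name, so downloaded PDFs are easy to miss. A sketch that keeps the excerpt's Python 2 style and writes each PDF to a predictable location; the pdfs output directory and the fallback filename are my assumptions.

import os
import urllib
import urlparse

def save_pdf(pdf_url, out_dir="pdfs"):
    """Download one PDF into out_dir under a readable filename."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    # Derive the filename from the last path component of the URL.
    name = os.path.basename(urlparse.urlparse(pdf_url).path) or "download.pdf"
    urllib.urlretrieve(pdf_url, os.path.join(out_dir, name))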

Using scrapy to find specific text from multiple websites

Submitted by 与世无争的帅哥 on 2019-12-06 08:14:51
I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't work out how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report which link it was found on. Could anyone point me to where I could read more about this? I have been reading Scrapy's documentation, but I can't seem to find this. Thank you.

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number
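A minimal sketch of the keyword check itself, built on Scrapy's standard Spider API; the spider below is a simplified stand-in rather than the FinalSpider from the excerpt, and the hard-coded keyword and start URL are assumptions.

import scrapy

class KeywordSpider(scrapy.Spider):
    name = "keyword"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]
    keyword = "hello world"  # hypothetical keyword to search for

    def parse(self, response):
        # Report the page URL whenever the keyword appears in the page body.
        if self.keyword.lower() in response.text.lower():
            yield {"url": response.url, "keyword": self.keyword}
        # Follow links so further pages on the allowed domain get checked too.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)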

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503 (google scholar ban?)

Submitted by 你离开我真会死。 on 2019-12-06 08:08:04
I am working on a crawler and I have to extract data from 200-300 links on Google Scholar. I have a working parser which gets data from the pages (each page shows 1-10 people profiles as results of my query; I extract the proper links, go to the next page, and do it again). During a run of my program I spotted the error below:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors%3DAGH%2BUniversity%2Bof%2BScience%2Band%2BTechnology%26hl%3Dpl%26view_op%3Dsearch_authors&q
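A 503 pointing at google.com/sorry/ is Google's rate-limiting/CAPTCHA page, so the usual mitigation is to slow the crawl down and back off whenever a 503 comes back; once a CAPTCHA is being served, backing off alone may not clear it. A sketch of that pattern in Python (requests) rather than jsoup; the User-Agent string and delay values are arbitrary choices of mine.

import time
import requests

def fetch_with_backoff(url, max_tries=5, base_delay=5.0):
    """Fetch a URL, waiting exponentially longer after each 503 response."""
    resp = None
    for attempt in range(max_tries):
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        if resp.status_code != 503:
            return resp
        time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    return resp  # still 503 after all retries; likely blocked or CAPTCHA-gated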

php crawler detection

Submitted by 徘徊边缘 on 2019-12-06 07:40:11
I'm trying to write a sitemap.php which acts differently depending on who is looking. I want to redirect crawlers to my sitemap.xml, as that will be the most up-to-date page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page. This will all be controlled from within the PHP header, and I've found this code on the web which by the looks of it should work, but it's not working. Can anyone help crack this for me?

function getIsCrawler($userAgent) {
    $crawlers = 'firefox|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcioRobot|ASPSeek
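One thing that stands out in the pattern as quoted: it begins with firefox, which matches ordinary Firefox visitors, so human readers would be classified as crawlers. Below is a sketch of the same User-Agent check in Python rather than PHP; the bot tokens listed are real crawler User-Agent substrings, but the list is illustrative, not exhaustive.

import re

CRAWLER_RE = re.compile(
    r"googlebot|bingbot|msnbot|slurp|duckduckbot|baiduspider|yandexbot",
    re.IGNORECASE,
)

def is_crawler(user_agent):
    """Return True when the User-Agent string looks like a known crawler."""
    return bool(CRAWLER_RE.search(user_agent or ""))

# Usage idea: if is_crawler(ua), send a 302 redirect to /sitemap.xml;
# otherwise render the HTML sitemap page for human readers.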