web-crawler

Getting the text between all tags in a given HTML document and recursively following links

Submitted by 本小妞迷上赌 on 2019-11-29 17:33:38
I have checked a couple of posts on Stack Overflow about getting all the words between all the HTML tags, and they left me confused: some people recommend regular expressions aimed at a single tag, while others mention parsing techniques. I am basically trying to make a web crawler. I have fetched the HTML of a link into a string in my program, and I have also extracted the links from that HTML string. Now I want to crawl to the next depth and extract the words on the page of every link I extracted. I have two questions: how can I fetch
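
Most answers to this kind of question favor a parser over regular expressions. Below is a minimal sketch of that approach, assuming the requests and BeautifulSoup packages (neither is named in the excerpt), with a simple breadth-first traversal over the extracted links:

```python
# Sketch: extract visible text and links with a parser, then follow the links
# breadth-first up to a fixed depth. Assumes requests + beautifulsoup4.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=2):
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # get_text() returns the text between all tags, no regex needed
        words = soup.get_text(separator=" ").split()
        print(url, len(words), "words")
        if depth < max_depth:
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])   # make relative hrefs absolute
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
```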

mechanize._mechanize.FormNotFoundError: no form matching name 'q'

Submitted by 社会主义新天地 on 2019-11-29 16:57:26
Can anyone help me get this form selection correct? Trying to crawl Google, I get the error: mechanize._mechanize.FormNotFoundError: no form matching name 'q'. That is unusual, since I have seen several other tutorials using it, and the form listing shows:
<f GET http://www.google.com.tw/search application/x-www-form-urlencoded
  <HiddenControl(ie=Big5) (readonly)>
  <HiddenControl(hl=zh-TW) (readonly)>
  <HiddenControl(source=hp) (readonly)>
  <TextControl(q=)>
(P.S. I don't plan to SLAM Google with requests; I just hope to use an automatic selector to take the effort out of finding academic citation PDFs from time to time.)
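
The usual explanation for this error is that 'q' is the name of the text control inside the search form, not the name of the form itself. A hedged sketch of selecting the form by position and then filling the 'q' control (Google's markup changes over time, so the nr=0 index is an assumption):

```python
# Sketch: select the form by index, then set the TextControl named 'q'.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # only for light, occasional use
br.addheaders = [("User-agent", "Mozilla/5.0")]
br.open("http://www.google.com.tw")

br.select_form(nr=0)                 # first form on the page, not name='q'
br["q"] = "academic citation pdf"    # fill the control named 'q'
response = br.submit()
print(response.geturl())
```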

Making my own web crawler in Python which shows the main idea of PageRank

Submitted by ╄→尐↘猪︶ㄣ on 2019-11-29 16:51:24
I'm trying to make a web crawler that shows the basic idea of PageRank. The code seems fine to me, but it gives me errors, e.g.:
Traceback (most recent call last):
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 89, in <module>
    webpages()
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 17, in webpages
    get_single_item_data(href)
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 23, in get_single_item_data
    source_code = requests.get(item_url)
  File "C:\Python34\lib\site-packages\requests\api.py", line 65, in get
    return request('get', url, **kwargs)
  File
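
The traceback ends inside requests.get(item_url), which is the classic symptom of passing a relative href (or None) where a full URL is expected. A sketch of one plausible fix, joining each href against the page it came from; the function name comes from the traceback, but the body is an assumption and BeautifulSoup is used only for illustration:

```python
# Sketch: always turn an href into an absolute URL before fetching it.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def get_single_item_data(base_url, href):
    item_url = urljoin(base_url, href)       # '/path' -> 'http://host/path'
    source_code = requests.get(item_url, timeout=10)
    soup = BeautifulSoup(source_code.text, "html.parser")
    return soup.get_text()
```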

Selenium implicit wait doesn't work

Submitted by 元气小坏坏 on 2019-11-29 16:06:09
This is the first time I have used Selenium and a headless browser, as I want to crawl some web pages that use AJAX. It works well, but in some cases it takes too much time to load the whole page (especially when some resource is unavailable), so I have to set a timeout for Selenium. First I tried set_page_load_timeout() and set_script_timeout(), but when I set these timeouts I don't get any page source if the page doesn't load completely, as in the code below:
driver = webdriver.Chrome(chrome_options=options)
driver.set_page_load_timeout(5)
driver.set_script_timeout(5)
try:
    driver.get
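
One common workaround is to keep the page-load timeout but catch the TimeoutException, stop outstanding loads, and read whatever part of the page has rendered. A sketch under those assumptions (the URL is a placeholder):

```python
# Sketch: time out a slow page, abort pending loads, and keep the partial DOM.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(chrome_options=options)
driver.set_page_load_timeout(5)
driver.set_script_timeout(5)

try:
    driver.get("http://example.com/slow-ajax-page")   # placeholder URL
except TimeoutException:
    driver.execute_script("window.stop();")           # abort pending requests

html = driver.page_source                              # partial but usable DOM
driver.quit()
```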

Scrapy not crawling subsequent pages in order

Submitted by 两盒软妹~` on 2019-11-29 15:09:24
I am writing a crawler to get the names of items from a website. The website has 25 items per page and multiple pages (200 for some item types). Here is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]

    def start_requests(self):
        for i in xrange(8):
            yield self.make_requests_from_url("http://www.lonelyplanet
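
Scrapy sends the requests from start_requests concurrently, so responses arrive in whatever order the server answers; the spider itself is not broken. A sketch of two ways to restore order, either serializing requests or tagging them with priorities. The paging URL below is hypothetical because the original one is truncated, and the sketch uses a plain scrapy.Spider rather than the original CrawlSpider:

```python
# Sketch: force ordered crawling by limiting concurrency and/or using priority.
import scrapy

class OrderedLonelyplanetSpider(scrapy.Spider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 1,   # one in-flight request at a time
    }

    def start_requests(self):
        for i in range(8):
            # higher priority values are dequeued first, so earlier pages win
            yield scrapy.Request(
                "http://www.lonelyplanet.com/items?page=%d" % i,  # hypothetical URL pattern
                priority=-i,
                callback=self.parse,
            )

    def parse(self, response):
        pass  # extract the 25 item names per page here
```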

Is the User-Agent line in robots.txt an exact match or a substring match?

Submitted by 百般思念 on 2019-11-29 15:05:43
When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly against its own User-Agent, or does it attempt to match it as a substring of its User-Agent? Nothing I have read explicitly answers this question. According to another StackOverflow thread it is an exact match. However, the RFC draft makes me believe that it is a substring match. For example, User-Agent: Google will match "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC: The robot must obey the first record in /robots.txt that contains a User-Agent line whose
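
Under the RFC draft's substring reading, a record applies when its User-agent token is a case-insensitive substring of the crawler's product name. A small sketch of that interpretation (real crawlers may implement the comparison differently):

```python
# Sketch: substring interpretation of robots.txt User-agent matching.
def record_applies(record_token: str, crawler_name: str) -> bool:
    token = record_token.strip().lower()
    if token == "*":
        return True                     # wildcard record applies to everyone
    return token in crawler_name.lower()

# "Google" matches both Googlebot and Googlebot-News under this reading
assert record_applies("Google", "Googlebot")
assert record_applies("Google", "Googlebot-News")
assert not record_applies("Bingbot", "Googlebot")
```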

Nutch regex-urlfilter syntax

Submitted by 浪尽此生 on 2019-11-29 14:52:50
I am running Nutch v1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax right in NUTCH_ROOT/conf/regex-urlfilter.txt. The site I want to crawl has a URL similar to this: http://www.example.com/foo.cfm. On that page there are numerous links that match the following pattern: http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976, and I want to crawl links that match the second example as well. In my regex-urlfilter.txt I have the following:
+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$
Nutch matches on the first one
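
Nutch evaluates regex-urlfilter.txt top-down and the first rule whose pattern is found in the URL wins, so the stock '-[?*!@=]' reject rule (if it sits above the '+' rules) silently drops the deep links because they contain '='. A Python sketch that simulates this first-match-wins evaluation; the reject rule shown is an assumption about the default file:

```python
# Sketch: simulate Nutch's first-match-wins rule evaluation on the two URLs.
import re

rules = [
    ("-", r"[?*!@=]"),                                   # stock Nutch reject rule (assumed)
    ("+", r"^http://www\.example\.com/foo\.cfm$"),
    ("+", r"^http://www\.example\.com/foo\.cfm/.+"),
]

def accepted(url):
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False                                         # no rule matched -> reject

print(accepted("http://www.example.com/foo.cfm"))        # True
print(accepted("http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"))
# False until the '-' rule is moved below the '+' rules or edited to allow '='
```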

Scrapy delay request

Submitted by 前提是你 on 2019-11-29 14:36:17
Question: Every time I run my code, my IP gets banned. I need help delaying each request by 10 seconds. I've tried to place DOWNLOAD_DELAY in the code but it gives no results. Any help is appreciated.
# item class included here
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "https://washingtondc.craigslist.org/search/fua"
    ]
    BASE_URL =
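
DOWNLOAD_DELAY is a Scrapy setting, so it only takes effect where Scrapy reads settings, for example in settings.py or in the spider's custom_settings dictionary, not as an arbitrary attribute elsewhere in the code. A sketch built on the spider from the excerpt; the parse body is omitted:

```python
# Sketch: enforce a 10-second delay between requests via spider settings.
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://washingtondc.craigslist.org/search/fua"]

    custom_settings = {
        "DOWNLOAD_DELAY": 10,                    # seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": False,       # keep the delay at exactly 10s
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,     # no parallel hits on one host
    }

    def parse(self, response):
        pass  # item extraction from the original spider goes here
```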

Is there a .NET equivalent of Perl's LWP / WWW::Mechanize?

Submitted by 瘦欲@ on 2019-11-29 14:19:23
After working with .NET's HttpWebRequest/Response objects, I'd rather shoot myself than use them to crawl through websites. I'm looking for an existing .NET library that can fetch URLs and give you the ability to follow links, and extract/fill in/submit forms on the page, etc. Perl's LWP and WWW::Mechanize modules do this very well, but I'm working on a .NET project. I've come across the HTML Agility Pack, which looks awesome, but it stops short of simulating links/forms. Does such a tool already exist? Somebody built a bit of code to run as an add-on to the HTML Agility Pack (which I also

How to tell if a web request is coming from Google's crawler?

Submitted by 与世无争的帅哥 on 2019-11-29 14:15:11
From the HTTP server's perspective: I have captured the Google crawler's requests in my ASP.NET application, and here is what the signature of the Google crawler looks like.
Requesting IP: 66.249.71.113
Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
My logs show many different IPs for the Google crawler in the 66.249.71.* range. All these IPs are geo-located in Mountain View, CA, USA. A simple way to check whether the request is coming from the Google crawler would be to verify that the request contains Googlebot and http://www.google.com/bot.html. As I said, there are many IPs
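
Because the User-Agent string can be spoofed, Google's documented verification is a reverse-then-forward DNS check rather than an IP list: reverse-resolve the IP, confirm the host ends in googlebot.com or google.com, then forward-resolve that host and confirm it maps back to the same IP. The original application is ASP.NET, so the Python sketch below only illustrates the logic:

```python
# Sketch: verify a claimed Googlebot request with a reverse + forward DNS check.
import socket

def is_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)              # reverse DNS
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]     # forward DNS
    except socket.gaierror:
        return False
    return ip in forward_ips                               # must round-trip

print(is_googlebot("66.249.71.113"))   # IP taken from the logs quoted above
```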