web-crawler

angularjs: different meta tags for each page

北城余情 submitted on 2019-12-02 03:11:03
Question: I have developed a website using Ruby on Rails and AngularJS (with JavaScript and jQuery). I just want to know: is it possible to have different meta tags for each page in an AngularJS application? As I understand it, a crawler only detects the meta tags generated server-side, so please tell me whether this is possible in AngularJS or not. Thanks. Source: https://stackoverflow.com/questions/22065133/angularjs-different-meta-tags-for-each-page
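A minimal sketch of the questioner's premise: a crawler that does not execute JavaScript only ever sees the server-rendered HTML, so meta tags injected client-side by Angular never show up in what it downloads. The URL below is a placeholder assumption; Python 2 standard library only.

    # Fetch the raw HTML the way a simple crawler would (no JavaScript runs),
    # then list the meta tags actually present in that markup. Tags added
    # client-side by Angular will be missing from this list.
    import urllib2
    from HTMLParser import HTMLParser

    class MetaTagCollector(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.meta_tags = []

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                self.meta_tags.append(dict(attrs))

    html = urllib2.urlopen("http://example.com/angular-page").read()  # placeholder URL
    collector = MetaTagCollector()
    collector.feed(html)
    for tag in collector.meta_tags:
        print(tag)  # only the server-generated meta tags appear here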

Stop abusive bots from crawling?

帅比萌擦擦* submitted on 2019-12-01 23:58:20
Question: Is this a good idea? http://browsers.garykeith.com/stream.asp?RobotsTXT What does abusive crawling mean? How is it bad for my site? Answer 1: Not really. Most "bad bots" ignore the robots.txt file anyway. Abusive crawling usually means scraping: these bots show up to harvest email addresses or, more commonly, content. As for how you can stop them: that's really tricky and often unwise, since anti-crawl techniques tend to be imperfect and cause problems for regular human visitors. Sadly, like "shrinkage" in retail, it's a cost of doing business on the web. A user-agent (which includes …
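For contrast with the "bad bots" the answer describes, here is a minimal sketch of the polite behavior: a well-behaved crawler checks robots.txt before fetching, while abusive bots simply skip this step. The URL and user-agent string are illustrative assumptions (Python 2 standard library).

    # A polite crawler consults robots.txt first; abusive bots ignore it.
    import robotparser

    rp = robotparser.RobotFileParser("http://example.com/robots.txt")
    rp.read()  # download and parse the robots.txt rules

    if rp.can_fetch("MyCrawler/1.0", "http://example.com/members/"):
        print("allowed: a polite bot may fetch this URL")
    else:
        print("disallowed: a polite bot skips this URL")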

I can't get the whole source code of an HTML page

情到浓时终转凉″ submitted on 2019-12-01 23:38:32
Using Python, I want to crawl data on a web page whose source is quite big (it is the Facebook page of some user). Say url is the URL I am trying to crawl. I run the following code:

    import urllib2

    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()

data is supposed to contain the source of the page I am crawling, but for some reason it doesn't contain all the characters that are available when I compare it directly with the source of the page. I don't know what I am doing wrong. I know that the page I am trying to crawl has not been updated recently, so it is not due to the fact …
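One common culprit, offered here as an assumption rather than a confirmed diagnosis: sites like Facebook serve stripped-down HTML to clients with no browser User-Agent, and anything injected by JavaScript never appears in the raw download at all. Sending a browser-like User-Agent sometimes recovers more of the markup. The URL and agent string below are placeholders.

    # Same fetch as the question, but with a browser-like User-Agent header.
    # Content generated by JavaScript will still be absent, since nothing here
    # executes scripts.
    import urllib2

    url = "https://www.facebook.com/some.user"  # placeholder
    request = urllib2.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0",
    })
    usock = urllib2.urlopen(request)
    data = usock.read()
    usock.close()
    print(len(data))  # compare with what the browser's "view source" shows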

JSOUP - How to crawl a “login required” page using JSOUP

回眸只為那壹抹淺笑 submitted on 2019-12-01 23:17:34
I'm having trouble crawling a particular website. The problem is that after successfully logging in to the website, I can't access a link which requires a valid login. For example:

    public Document executeLogin(String user, String password) {
        try {
            Connection.Response loginForm = Jsoup.connect(url)
                    .method(Connection.Method.GET)
                    .execute();
            Document mainPage = Jsoup.connect(login-validation-url)
                    .data("user", user)
                    .data("senha", password)
                    .cookies(loginForm.cookies())
                    .post();
            // note: this request sends no session cookies with it
            Document evaluationPage = Jsoup.connect(login-required-url)
                    .get();
            return evaluationPage;
        } catch …

I need to write a web crawler for a specific user agent

倾然丶 夕夏残阳落幕 submitted on 2019-12-01 22:15:32
Question: I need to write a web crawler and want to be able to crawl using a known user agent. For example, I want my crawler to act as an iPhone to crawl the mobile version of a website, then crawl again using a Mozilla PC agent, and so on. That way I'll be able to crawl every "type" of site (mobile and PC). However, I also want to be able to set my crawler's user agent so that webmasters can see in their stats that it was a crawler that visited their whole website, not real users. So my question is, do you guys know …
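A hedged sketch of the idea in the question: fetch the same URL with several User-Agent strings (mobile, then desktop), while appending an explicit crawler token so webmasters can still identify the bot in their logs. The agent strings, token, and URL are all illustrative assumptions.

    # Cycle through user agents; each one carries a "MyCrawler/1.0" token so
    # the traffic is identifiable as a bot in server logs.
    import urllib2

    USER_AGENTS = [
        # iPhone-style agent, plus the crawler token
        "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 MyCrawler/1.0",
        # desktop Firefox-style agent, same crawler token
        "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0 MyCrawler/1.0",
    ]

    url = "http://example.com/"  # placeholder
    for agent in USER_AGENTS:
        request = urllib2.Request(url, headers={"User-Agent": agent})
        html = urllib2.urlopen(request).read()
        print(agent, len(html))  # mobile and desktop variants often differ in size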

Google crawl 503 service unavailable

空扰寡人 submitted on 2019-12-01 22:03:52
Question: I have a very strange problem when I crawl the Google search engine with wget, curl, or Python on my servers. Google redirects me to an address starting with [ipv4|ipv6].google.fr/sorry/IndexRedirect... and finally sends a 503 error, Service Unavailable. Sometimes the crawl works correctly and sometimes not during the day, and I have tried almost everything possible: forcing IPv4/IPv6 instead of the hostname, referer, user agent, VPN, .com/.fr, proxies and Tor, ... I guess this is an error from Google …
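The /sorry/ redirect is Google's automated-traffic interstitial, so the 503s are rate limiting rather than a transient Google bug. A minimal mitigation sketch, under the assumption that spacing requests out and backing off on 503 is enough; note that automated querying of Google search is restricted by its terms of service, and the query URL below is a placeholder.

    # Retry with exponential backoff when Google returns 503 on its
    # /sorry/ rate-limiting interstitial.
    import time
    import urllib2

    def fetch_with_backoff(url, max_attempts=5, base_delay=10.0):
        delay = base_delay
        for attempt in range(max_attempts):
            try:
                return urllib2.urlopen(url).read()
            except urllib2.HTTPError as err:
                if err.code != 503:
                    raise          # some other HTTP error; don't retry blindly
                time.sleep(delay)  # Google throttles bursts of automated queries
                delay *= 2         # wait longer before the next attempt
        raise RuntimeError("still rate-limited after %d attempts" % max_attempts)

    html = fetch_with_backoff("https://www.google.fr/search?q=example")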

Scrapy - Spider crawls duplicate urls

天涯浪子 submitted on 2019-12-01 21:45:29
I'm crawling a search results page and scraping title and link information from that page. Since it is a search page, I also have the links to the next pages, which I have allowed in the SgmlLinkExtractor. The problem is this: on the 1st page, the spider finds the links to Page 2 and Page 3 and crawls them perfectly. But when it crawls the 2nd page, that page again has links to Page 1 (the previous page) and Page 3 (the next page), so it crawls Page 1 again with Page 2 as the referrer, and it goes into a loop. The Scrapy version I use is 0.17. I have searched the web for answers and tried the …
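A hedged sketch against Scrapy 0.17's API (the spider name, domain, and allow pattern are placeholder assumptions): Scrapy's scheduler already drops requests it has seen before, so pagination links pointing back to page 1 should be filtered out as long as requests are not created with dont_filter=True.

    # CrawlSpider that follows pagination links; the built-in duplicate filter
    # prevents re-crawling a page already scheduled, regardless of referrer.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class SearchSpider(CrawlSpider):
        name = "search"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/search?page=1"]

        rules = (
            # unique=True de-duplicates the extracted links themselves;
            # follow=True keeps paginating through page 2, 3, ...
            Rule(SgmlLinkExtractor(allow=(r"search\?page=\d+",), unique=True),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            # scrape title and link information here
            self.log("scraping %s" % response.url)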

How can I scrape LinkedIn company pages with cURL and PHP? "No CSRF token found in headers" error

戏子无情 submitted on 2019-12-01 21:00:03
I want to scrape some LinkedIn company pages with cURL and PHP. The LinkedIn API is not built for that, so I have to do this with PHP. If there are any other options, please let me know... Before scraping the company page I have to sign in to LinkedIn with a personal account via cURL, but it doesn't seem to work: I get a 'No CSRF token found in headers' error. Could someone help me out? Thanks!

    <?php
    require_once 'dom/simple_html_dom.php';

    $linkedin_login_page = "https://www.linkedin.com/uas/login";
    $username = 'linkedin_username';
    $password = 'linkedin_password';

    $ch = curl_init();
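The "No CSRF token found in headers" error usually means the login POST omitted the hidden CSRF fields that the login form carries. A generic Python 2 sketch of the flow, with the caveats that the submit URL and credential field names are assumptions about LinkedIn's form at the time (it has changed since), the hidden-input regex is a rough heuristic, and automated LinkedIn scraping may violate its terms:

    # Fetch the login page, collect its hidden inputs (CSRF token included),
    # then POST them back together with the credentials, keeping cookies
    # across requests like cURL's cookie jar.
    import cookielib
    import re
    import urllib
    import urllib2

    LOGIN_PAGE = "https://www.linkedin.com/uas/login"

    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # 1. Fetch the login page and scrape every hidden input. The regex assumes
    #    the attribute order type/name/value and is only a sketch.
    html = opener.open(LOGIN_PAGE).read()
    fields = dict(re.findall(
        r'<input[^>]+type="hidden"[^>]+name="([^"]+)"[^>]+value="([^"]*)"', html))

    # 2. Add the credentials (field names are assumptions) and POST everything
    #    back to the assumed submit endpoint.
    fields["session_key"] = "linkedin_username"
    fields["session_password"] = "linkedin_password"
    response = opener.open("https://www.linkedin.com/uas/login-submit",
                           urllib.urlencode(fields))
    print(response.getcode())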

Is iframe content crawled by Google?

倾然丶 夕夏残阳落幕 submitted on 2019-12-01 19:00:42
I have an iframe whose source comes from a servlet response; will the content of the iframe be crawled?

Google does crawl framed content now; it's just not clear yet how much link equity is passed through the links. http://www.seroundtable.com/google-iframe-link-14558.html http://www.rimmkaufman.com/blog/do-search-engines-follow-links-in-iframes/31012012/ What Google surely still does not do is associate the framed content with the parent page, so your PageRank will not be influenced.

No, I'm pretty sure Google doesn't. The robot could end up in an endless loop! EDIT: I followed the link given in the …