web-crawler

Ajax used for image loading causes 404 errors

妖精的绣舞 submitted on 2019-12-25 01:33:52
Question: We have a page with over 1,000 images; we show only 10 per page and load them with AJAX when people "see the images" (we also use DataTables). Everything works fine; however, in Google Webmaster Tools I just got thousands of 404 errors, with pages like this: http://example.com/ajax/%5C%22http:%5C/%5C/example.com%5C/image%5C/1937%5C/image-name%5C%22 Of course, if I go to this page I get a 404 error, because no such page exists, but I don't understand why Google fetches URLs like this
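Decoding the percent-encoding makes the cause easier to see: the path is a JSON-escaped URL string, which suggests Googlebot is pulling link-like strings straight out of the raw AJAX/JSON responses. A minimal sketch of the decoding (illustrative only, not from the original post):

```python
# Decode the 404'd path reported by Webmaster Tools.
from urllib.parse import unquote

crawled = "%5C%22http:%5C/%5C/example.com%5C/image%5C/1937%5C/image-name%5C%22"
print(unquote(crawled))
# Output: \"http:\/\/example.com\/image\/1937\/image-name\"
# That is a JSON string literal (escaped quotes and slashes), not a real path:
# Googlebot has evidently read the escaped URL out of the AJAX payload and
# resolved it relative to /ajax/.
```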

php spider breaks in middle (Domdocument, xpath, curl) - help needed

我与影子孤独终老i submitted on 2019-12-25 01:24:49
Question: I am a beginner programmer, designing a spider that crawls pages. The logic goes like this (a sketch of the loop follows below):

- get $url with curl
- create a DOM document
- parse out href tags using XPath
- store href attributes in $totalurls (those that aren't already there)
- update $url from $totalurls

The problem is that after the 10th crawled page the spider says it does not find ANY links on the page, nor any on the next, and so on. But if I begin with the page that was 10th in the previous example, it finds all links with no problem, but
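The question is in PHP, but the loop it describes translates directly; here is a minimal Python sketch (requests + lxml standing in for curl + DOMDocument/XPath). One common cause of "no links found" partway through a crawl is relative hrefs that are never resolved against the current page's URL, hence the urljoin call below:

```python
# Illustrative sketch of the crawl loop described above (Python, not PHP).
import requests
from lxml import html
from urllib.parse import urljoin

def crawl(start_url, max_pages=50):
    total_urls = [start_url]          # plays the role of $totalurls
    seen = {start_url}
    i = 0
    while i < len(total_urls) and i < max_pages:
        url = total_urls[i]
        i += 1
        resp = requests.get(url, timeout=10)
        doc = html.fromstring(resp.text)
        for href in doc.xpath("//a/@href"):
            absolute = urljoin(url, href)   # resolve relative hrefs; skipping
            if absolute not in seen:        # this step makes later pages look
                seen.add(absolute)          # as if they contained no links
                total_urls.append(absolute)
    return total_urls
```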

R data scraping / crawling with dynamic/multiple URLs

拟墨画扇 submitted on 2019-12-24 20:12:20
Question: I am trying to get all decrees of the Federal Supreme Court of Switzerland, available at: https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=&to_date=&x=12&y=12 Unfortunately, no API is provided. The CSS selector for the data I want to retrieve is .para. I am aware of http://relevancy.bger.ch/robots.txt, which reads in part:

User-agent: *
Disallow: /javascript
Disallow: /css
Disallow: /hashtables
Disallow: /stylesheets
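The question asks about R, but the extraction itself is simple in any language once a results page is fetched; a minimal Python sketch of the same idea (in R this would be rvest's html_nodes(".para")). Note that the search pages themselves are not among the paths disallowed by the robots.txt quoted above; pagination across all result pages is omitted here:

```python
# Fetch one results page and print the text of every ".para" node.
import requests
from bs4 import BeautifulSoup

url = ("https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php"
       "?lang=de&type=simple_query&query_words="
       "&top_subcollection_aza=all&from_date=&to_date=&x=12&y=12")
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
for para in soup.select(".para"):
    print(para.get_text(strip=True))
```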

following the information using scrapy in nested div and span tags

有些话、适合烂在心里 submitted on 2019-12-24 18:44:52
Question: I am trying to make a web crawler, using Scrapy in Python, that extracts the information Google shows on the right side when you make a search; I want to extract the information in the box on the right side. The link is: search in google. The source code: source code. Part of the HTML code is: <div class="g rhsvw kno-kp mnr-c g-blk" lang="es-419" data-hveid="CAoQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQjh8oAHoECAoQAA"> <div class="kp-blk knowledge-panel Wnoohf OJXvsb"
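A minimal Scrapy sketch of the selector logic. The class names are copied from the snippet above, and they are effectively assumptions: Google rotates them frequently and actively blocks automated requests, so treat this as an illustration of nested div/span extraction rather than a working Google scraper:

```python
# Sketch: extract all visible text nested inside the knowledge-panel div.
import scrapy

class KnowledgePanelSpider(scrapy.Spider):
    name = "kpanel"
    start_urls = ["https://www.google.com/search?q=example"]  # placeholder query

    def parse(self, response):
        # Narrow to the right-hand panel first, then drill into its children.
        panel = response.css("div.kp-blk.knowledge-panel")
        texts = panel.css("span::text, div::text").getall()
        yield {"panel_text": [t.strip() for t in texts if t.strip()]}
```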

Html Agility Pack Dll [duplicate]

孤街醉人 submitted on 2019-12-24 17:15:14
Question: This question already has an answer here: From the Html Agility Pack download, which one of the 9 "HtmlAgilityPack.dll" do I use? (1 answer). Closed 6 years ago. I have downloaded the Html Agility Pack but I don't know which one I should import; there are lots of folders and I don't know which folder's DLL to import. Folders: Net20, Net40, net40-client, Net45, sl3-wp, sl4, sl4-windowsphone71, sl5, winrt45. I tried importing winrt45 but am getting an error when I use doc.DocumentElement.SelectNodes (There is
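For reference (not stated in the post): the folders follow the usual target-framework naming convention — Net20 is .NET 2.0, Net40 is .NET 4.0, net40-client is the .NET 4.0 Client Profile, Net45 is .NET 4.5, the sl* folders are Silverlight and Windows Phone builds, and winrt45 targets Windows Store (WinRT) apps. The WinRT and phone builds omit the XPath-based API because System.Xml.XPath is unavailable on those platforms, which would explain the missing SelectNodes; a desktop project would normally reference the Net40 or Net45 DLL instead.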

og:url tag in the header is not the same URL as rel='canonical' link in the html

痴心易碎 submitted on 2019-12-24 16:42:49
Question: This is the page: https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fmacondo.tw%2F%23%21%2Fbook%2F50fa1589425b0dc41a000002. I got the warning "og:url tag in the header is not the same URL as rel='canonical' link in the html". However, "See exactly what our scraper sees for your URL" showed that they are the same. Source: https://stackoverflow.com/questions/15000549/ogurl-tag-in-the-header-is-not-the-same-url-as-rel-canonical-link-in-the-html
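For context, the debugger compares the values of exactly two tags, which must match character for character; a generic illustration (not the actual macondo.tw markup):

```html
<!-- Generic example, not the markup from the page in question: the scraper
     warns if these two values differ at all, including trailing slashes
     or #! fragments like the one in the URL above. -->
<link rel="canonical" href="http://example.com/book/50fa1589425b0dc41a000002">
<meta property="og:url" content="http://example.com/book/50fa1589425b0dc41a000002">
```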

Apache Nutch not adding internal links in a web page to fetchlist

对着背影说爱祢 submitted on 2019-12-24 15:41:23
Question: I am using Apache Nutch 1.7 and I am facing a problem when crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. This URL has many internal links on the page and also many external links to other domains; I am only interested in the internal links. However, when this page is crawled, its internal links are not added for fetching in the next round of fetching (I have set a depth of 100). I have already set the db.ignore
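One setting worth checking (an inference, not something confirmed in the truncated post): in Nutch 1.x, db.ignore.internal.links defaults to true in some releases, which produces exactly this symptom — outlinks to the same host are silently dropped from the CrawlDb. A sketch of the relevant conf/nutch-site.xml overrides:

```xml
<!-- Sketch of conf/nutch-site.xml overrides. The property names exist in
     Nutch 1.x; whether they explain this particular crawl is an assumption.
     Internal links are only added when db.ignore.internal.links is false. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value> <!-- keep the crawl restricted to the seed's domain -->
</property>
```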

Crawling a website with PHP, but the website runs JS to generate markup

只谈情不闲聊 submitted on 2019-12-24 15:29:49
Question: I have been doing web crawling for the last couple of weeks. Using a PHP library (PHP Simple DOM), I'm running a PHP script (from the terminal) to fetch some URLs and extract some data from them as JSON. This has been working very nicely so far. Recently I wanted to expand the crawling to a specific site and encountered the following problem: unlike any other site so far, this one only echoes barebones markup server-side and instead relies on a single JS script to build up the relevant markup on load. Obviously my
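The standard remedy is to let a headless browser execute the JS and then parse the resulting DOM. The post uses PHP, but as an illustration of the approach, here is a minimal Python/Selenium sketch (assumes Chrome and chromedriver are installed; the URL is hypothetical, and in PHP one would reach for a headless-Chrome binding or a rendering service instead):

```python
# Sketch: render the JS-built markup in headless Chrome, then parse it
# like any static page.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://example.com/js-rendered-page")  # hypothetical URL
    html = driver.page_source  # markup after the onload scripts have run
finally:
    driver.quit()
# `html` can now be fed to the existing DOM-parsing code; slow pages may
# additionally need an explicit wait before reading page_source.
```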

How to download Google search results? [closed]

Deadly submitted on 2019-12-24 13:23:56
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 7 years ago. Apologies if this is too ignorant a question or has been asked before; a cursory look did not find anything matching it exactly. The question is: how can I download all Word documents that Google has indexed? It would be a daunting task indeed to do it by hand... Thanks for all pointers. Answer 1: I'm afraid, there
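Scraping Google's result pages directly is against its terms of service; the supported route is the Custom Search JSON API, which accepts a fileType filter. A hedged sketch (YOUR_KEY, YOUR_CX, and the query are placeholders, and the API caps results at about 10 per request and roughly 100 per query, so "everything Google has indexed" is not actually reachable this way):

```python
# Sketch: list .doc results via the Custom Search JSON API and download them.
import requests

API = "https://www.googleapis.com/customsearch/v1"
params = {"key": "YOUR_KEY", "cx": "YOUR_CX",        # placeholders
          "q": "annual report", "fileType": "doc", "num": 10}
items = requests.get(API, params=params, timeout=30).json().get("items", [])
for item in items:
    url = item["link"]
    data = requests.get(url, timeout=30).content
    name = url.rsplit("/", 1)[-1] or "download.doc"
    with open(name, "wb") as f:
        f.write(data)
```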

ImportError: No module named html.entities

戏子无情 submitted on 2019-12-24 12:46:32
Question: I am new to Python. I am using Python 2.7.5 and I want to write a web crawler. For that I have installed BeautifulSoup 4.3.2, using this command (I haven't used pip): python setup.py install. I am using Eclipse 4.2 with PyDev installed. When I try to import this library in my script, from bs4 import BeautifulSoup, I get this error: ImportError: No module named html.entities. Please explain what I should do to rectify it. Answer 1: Is there any reason why you are not using pip
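A note on the error itself (an inference, not from the post): html.entities is the Python 3 name of Python 2's htmlentitydefs module, so seeing it under 2.7.5 usually means the installed bs4 copy was converted for Python 3 (running setup.py install with a Python 3 interpreter triggers 2to3). Reinstalling with the interpreter you actually use, e.g. via its pip, is the usual fix. A quick diagnostic sketch:

```python
# Diagnostic only: confirm which interpreter runs and which bs4 copy loads.
import sys
print(sys.version_info)              # expect (2, 7, 5, ...)

import htmlentitydefs                # Python 2 stdlib; the py3 name is html.entities
print(htmlentitydefs.name2codepoint["amp"])   # 38

import bs4                           # raises the ImportError if the installed
print(bs4.__file__)                  # copy was 2to3-converted for Python 3
```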