web-crawler

Selenium fires StaleElementReferenceException

烂漫一生 submitted on 2019-12-11 15:29:17
Problem: I'm trying to build a web crawler with Selenium. My program throws a StaleElementReferenceException. I thought this was because I crawl pages recursively, and when a page has no more links the function navigates to the next page rather than back to the parent page. I therefore introduced a tree data structure to navigate back to the parent whenever the current URL does not equal the parent URL, but that did not solve my problem. Can anybody help me? Code: public class crawler { private static
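A stale reference usually means the element was located on a DOM that has since been replaced by navigation. Below is a minimal sketch of the usual remedy, re-locating elements after every page change and retrying on the exception; it is written in Python with a hypothetical start URL, whereas the original code is Java, so treat it as the pattern rather than a drop-in fix:

    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By

    def collect_links(driver, retries=3):
        # Re-fetch anchors on every attempt; never cache WebElements
        # across navigations, since the old DOM nodes become stale.
        for _ in range(retries):
            try:
                return [a.get_attribute("href")
                        for a in driver.find_elements(By.TAG_NAME, "a")]
            except StaleElementReferenceException:
                continue  # DOM changed mid-iteration; locate again and retry
        return []

    driver = webdriver.Chrome()
    driver.get("https://example.com")   # hypothetical start page
    for url in collect_links(driver):
        if url:
            driver.get(url)   # navigate via stored URLs, not stored elements
            driver.back()
    driver.quit()

Storing plain href strings instead of WebElement references is what makes the recursion safe: the strings survive navigation, the element handles do not.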

Scrapy - LinkExtractor in control flow and why it doesn't work

ぐ巨炮叔叔 submitted on 2019-12-11 15:07:45
Problem: I'm trying to understand why my LinkExtractor doesn't work and when it actually runs in the crawl loop. This is the page I'm crawling. There are 25 listings on each page, and their links are parsed in parse_page; each crawled link is then parsed in parse_item. The script crawls the first page and its items without any problem. The problem is that it doesn't follow to https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means "page" in Turkish) or the other next pages. I think
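Without the full spider it is impossible to say definitively, but one classic cause is overriding parse() in a CrawlSpider, which silently disables its rules. Here is a minimal sketch of how a pagination rule is usually wired up; the spider name and the listing selector are assumptions, and only the sayfa URL pattern comes from the question:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class JobsSpider(CrawlSpider):   # hypothetical spider name
        name = "jobs"
        start_urls = ["https://www.yenibiris.com/is-ilanlari?q=yazilim"]

        # Follow every "sayfa=N" pagination link and parse each result page.
        # Caution: defining a parse() method on a CrawlSpider disables
        # these rules, because CrawlSpider uses parse() internally.
        rules = (
            Rule(LinkExtractor(allow=r"sayfa=\d+"),
                 callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            # assumed listing selector; the real one depends on the markup
            for href in response.css("a.listing::attr(href)").getall():
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            yield {"url": response.url}

With follow=True the extractor runs on every response the rule's callback handles, which is what keeps the crawl moving from sayfa=2 onward.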

Requesting assistance in debugging a Python web crawler

孤街醉人 submitted on 2019-12-11 15:07:29
Problem: I can't get a crawler (named searchengine.py) to run despite my best efforts over the past couple of hours. It seems it cannot successfully index pages as it goes. I will give you the full crawler code. The kind of errors I'm receiving looks like this: Indexing http://www.4futureengineers.com/company.html Could not parse page http://www.4futureengineers.com/company.html I am calling searchengine.py by entering the following commands in my Python interactive session (shell): >> import
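"Could not parse page" is the crawler's own catch-all message, so the first debugging step is to print the underlying exception instead of swallowing it. A small sketch of that idea; the helper name and libraries are illustrative, not the original searchengine.py code:

    import urllib.request
    from bs4 import BeautifulSoup

    def crawl_page(url):
        # Separate fetch errors from parse errors and print the actual
        # exception instead of a generic failure message.
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except Exception as e:
            print(f"Could not fetch {url}: {e!r}")
            return None
        try:
            return BeautifulSoup(html, "html.parser")
        except Exception as e:
            print(f"Could not parse {url}: {e!r}")
            return None

    soup = crawl_page("http://www.4futureengineers.com/company.html")
    if soup is not None:
        print("Indexing", soup.title.string if soup.title else "(no title)")

Once the real exception is visible (encoding error, HTTP error, missing parser, etc.), the actual fix usually follows directly.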

Critique this Python code (crawler with thread pool)

人走茶凉 submitted on 2019-12-11 14:45:37
Problem: How good is this Python code? I need criticism. There is an error in this code: sometimes the script prints "ALL WAIT - CAN FINISH!" and freezes (no more actions happen), but I can't find the reason why this happens. Site crawler with a thread pool: import sys from urllib import urlopen from BeautifulSoup import BeautifulSoup, SoupStrainer import re from Queue import Queue, Empty from threading import Thread W_WAIT = 1 W_WORK = 0 class Worker(Thread): """Thread executing tasks from a given tasks
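The freeze described is characteristic of hand-rolled idle detection (the W_WAIT/W_WORK flags) racing against the queue: every worker can observe the queue as empty at the same moment a peer is about to enqueue new links. A sketch of the standard alternative, letting Queue.join() and task_done() decide when the crawl is complete; this is Python 3, while the original is Python 2:

    import threading, queue

    def worker(tasks):
        while True:
            url = tasks.get()        # blocks until a task arrives
            if url is None:          # sentinel: time to exit
                tasks.task_done()
                return
            try:
                pass                 # fetch/parse url here; put() newly
                                     # discovered links back onto tasks
            finally:
                tasks.task_done()    # always acknowledge, even on errors

    tasks = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks,)) for _ in range(4)]
    for t in threads:
        t.start()
    tasks.put("http://example.com/")   # hypothetical seed URL
    tasks.join()   # returns only when every put() has been task_done()'d
    for _ in threads:
        tasks.put(None)                # unblock and stop each worker
    for t in threads:
        t.join()
    print("ALL DONE")

Because workers put() new links before calling task_done() on the task that produced them, the unfinished-task count never drops to zero prematurely, which removes the race entirely.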

Chrome Console logs not printing Violations

懵懂的女人 submitted on 2019-12-11 14:19:06
Problem: I am using Selenium with ChromeDriver to crawl websites and need to capture everything that gets printed to the Chrome console. For example, I need the Warning and Violation entries from the console below. chrome_debug.log doesn't contain Violations. I have tried passing the args "--verbose", "--v0", and "--v1" to ChromeDriver, and I have also tried setting LoggingPreferences with loggingPreferences.enable(LogType.BROWSER, Level.ALL), with no luck. What am I missing here? Source: https://stackoverflow.com/questions/48579511/chrome-console-logs
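For comparison, here is how the browser log is enabled in the Python bindings via the goog:loggingPrefs capability (the question appears to use the Java bindings, where LoggingPreferences plays the same role). Whether [Violation] entries actually surface in this log varies by Chrome/ChromeDriver version, so this is a starting point rather than a confirmed fix:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Ask ChromeDriver to record all console output, not just errors.
    options.set_capability("goog:loggingPrefs", {"browser": "ALL"})

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")   # hypothetical target page

    # Each entry is a dict with level, message, timestamp, and source.
    for entry in driver.get_log("browser"):
        print(entry["level"], entry["message"])
    driver.quit()

Note that get_log() drains the buffer, so entries read once will not be returned again on the next call.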

Why can't Scrapy's XPath find what my browsers' XPath finds?

痴心易碎 submitted on 2019-12-11 14:18:40
Problem: I want to find something by XPath on a page (my first Scrapy project), for example the page https://github.com/rg3/youtube-dl/pull/11272. In both my Opera inspector and the Firefox TryXpath add-on, this XPath expression gives the same result: //div[@class='file js-comment-container js-resolvable-timeline-thread-container has-inline-notes'] BUT in Scrapy 1.6, when I try to get its result, it does not find anything and just returns an empty list: def parse(self, response):
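A likely explanation (not a confirmed diagnosis without the response body) is that the HTML Scrapy downloads is not the DOM the browser inspector shows: classes can be added or reordered by JavaScript, and an exact @class comparison requires the whole attribute string to match byte-for-byte. Matching one stable class token with contains() is more robust, sketched here with a hypothetical spider name:

    import scrapy

    class PullSpider(scrapy.Spider):   # hypothetical spider name
        name = "pull"
        start_urls = ["https://github.com/rg3/youtube-dl/pull/11272"]

        def parse(self, response):
            # contains() matches a single class token, so it survives
            # extra or reordered classes that break an exact @class match.
            files = response.xpath(
                "//div[contains(@class, 'js-comment-container')]")
            self.logger.info("matched %d nodes", len(files))

To compare against what Scrapy actually received rather than the browser's live DOM, run scrapy view <url>, which opens the downloaded page in a browser.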

Fastest service for crawling web pages or invoking APIs (iTunes in particular)?

让人想犯罪 __ submitted on 2019-12-11 13:53:20
Problem: We need to download metadata for all iOS apps on a daily basis. We plan to extract the information by crawling the iTunes website and by using the iTunes Search API. Since there are 700K+ apps, we need an efficient way to do this. One approach is to set up a bunch of scripts on EC2 and run them in parallel. Before we embark down this path, are there services like 80legs that people have used to accomplish a similar task? Essentially, we want something to help us crawl hundreds of thousands
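Whichever service is chosen, the API side of this parallelizes cheaply because the iTunes lookup endpoint accepts multiple comma-separated IDs per request. A rough sketch with a thread pool; the batch size of 100 and the sample IDs are assumptions to validate against Apple's current rate limits:

    import concurrent.futures
    import requests

    def lookup_batch(app_ids):
        # One request resolves a whole batch of IDs, which cuts the
        # request count dramatically for a 700K-app catalog.
        r = requests.get("https://itunes.apple.com/lookup",
                         params={"id": ",".join(map(str, app_ids))},
                         timeout=30)
        r.raise_for_status()
        return r.json().get("results", [])

    app_ids = [284882215, 389801252]   # sample IDs for illustration
    batches = [app_ids[i:i + 100] for i in range(0, len(app_ids), 100)]

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for results in pool.map(lookup_batch, batches):
            for app in results:
                print(app.get("trackName"))

Throttling (a small sleep per worker, or fewer workers) is worth adding before running this at full scale, since aggressive parallel clients tend to get rate-limited.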

Google crawling, AJAX and HTML5

时光总嘲笑我的痴心妄想 submitted on 2019-12-11 13:30:15
Problem: HTML5's History API lets us update the current URL without refreshing the browser. I've created a small framework on top of HTML5 which lets me leverage this transparently, so I can do all requests using AJAX while still having bookmarkable URLs without hashtags. So e.g. my navigation looks like this: <ul> <li><a href="/home">Home</a></li> <li><a href="/news">News</a></li> <li>...</li> </ul> When a user clicks on the News link, my framework in fact issues an AJAX GET request (jQuery) for the page
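Since these URLs are real paths rather than hashtags, one common server-side pattern (sketched here in Python/Flask with hypothetical routes and templates; the original stack is unspecified) is to answer the same clean URLs with full HTML for direct hits and crawlers, and with just a fragment for the framework's AJAX requests, which jQuery marks with the X-Requested-With header:

    from flask import Flask, render_template, request

    app = Flask(__name__)

    @app.route("/home")
    @app.route("/news")
    def page():
        name = request.path.strip("/")
        if request.headers.get("X-Requested-With") == "XMLHttpRequest":
            # AJAX navigation: return only the content fragment
            return render_template(f"{name}_fragment.html")
        # Direct hit (bookmark, search crawler): return the full document
        return render_template(f"{name}_full.html")

This way a crawler requesting /news gets complete, indexable HTML even though in-browser navigation happens over AJAX.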

How to access the subclass using jsoup

浪尽此生 submitted on 2019-12-11 13:29:45
Problem: I want to access this webpage: https://www.google.com/trends/explore#q=ice%20cream and extract the data in the center line graph. The HTML is (here I only paste the part that I use): <div class="center-col"> <div class="comparison-summary-title-line">...</div> ... <div id="reportContent" class="report-content"> <!-- This tag handles the report titles component --> ... <div id="report"> <div id="reportMain"> <div class="timeSection"> <div class = "primaryBand timeBand">...</div> .
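On the selector mechanics: jsoup's select() takes CSS selectors, so the nested div can be reached with something like doc.select("#reportMain .primaryBand.timeBand"). The same idea is sketched below with Python's BeautifulSoup to keep this digest's examples in one language; note also that the Trends chart itself is rendered by JavaScript, so the plotted values may simply not be present in the static HTML jsoup downloads:

    from bs4 import BeautifulSoup

    html = """
    <div class="center-col">
      <div id="reportContent" class="report-content">
        <div id="report"><div id="reportMain">
          <div class="timeSection">
            <div class="primaryBand timeBand">42</div>
          </div>
        </div></div>
      </div>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")
    # Descendant CSS selector: any element carrying both the primaryBand
    # and timeBand classes anywhere under #reportMain.
    for band in soup.select("#reportMain .primaryBand.timeBand"):
        print(band.get_text(strip=True))

If the values turn out to be JS-rendered, a browser-driving tool (Selenium) or the underlying data endpoint is needed instead of a static HTML parser.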

How to wait for page load to complete?

醉酒当歌 submitted on 2019-12-11 13:02:54
Problem: I'm trying to get the available boot sizes (under $('option.addedOption')) from http://www.neimanmarcus.com/Stuart-Weitzman-Reserve-Suede-Over-the-Knee-Boot-Black/prod179890262/p.prod I tried the code below, but it always returns before the sizes have loaded. # config.url = 'http://www.neimanmarcus.com/Stuart-Weitzman-Reserve-Suede-Over-the-Knee-Boot-Black/prod179890262/p.prod' import urllib2 import requests import config import time from lxml.cssselect import CSSSelector from lxml.html import fromstring
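The sizes are injected by JavaScript after the initial response, so urllib2/requests/lxml only ever see the HTML that predates them. A common workaround, sketched here with Selenium's explicit waits rather than the original lxml approach, is to block until the option elements exist:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("http://www.neimanmarcus.com/Stuart-Weitzman-Reserve-Suede-"
               "Over-the-Knee-Boot-Black/prod179890262/p.prod")

    # Block (up to 15s) until the JS-injected size options are in the DOM,
    # instead of parsing the initial HTML that predates them.
    options = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "option.addedOption")))
    print([o.text for o in options])
    driver.quit()

The alternative, where it exists, is to find the XHR endpoint the page calls for sizes and request that JSON directly, which avoids driving a browser at all.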