screen-scraping

How to scrape ID-less website elements with XPath-only regex patterns

空扰寡人 submitted on 2019-12-13 03:57:28
Question: There are several similar questions about using regex in XPath searches -- however, some are not very illuminating to me, and others failed for my specific problem. So, for future users who might come across the same issue, I post the following question: using one call in Python/Selenium, I want to be able to scrape all of the elements below at once (shown without code formatting, for readability): /html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**1**]/div/div[2]/div[1] /html
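
A note on the approach: XPath 1.0, which Selenium uses, has no regex support, but when only one positional index varies you can simply drop that predicate and match every sibling in one call. A minimal sketch, assuming the asker's path with the varying div index removed (the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com")  # placeholder for the real page

    # Same absolute path as in the question, but with the varying
    # positional predicate removed so all matching siblings are returned.
    xpath = ("/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]"
             "/div/div/div[2]/div[1]")
    for element in driver.find_elements(By.XPATH, xpath):
        print(element.text)

    driver.quit()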

Undefined Offset Error in cURL Code

左心房为你撑大大i submitted on 2019-12-13 03:52:57
Question: I am building a PHP script that uses cURL to search and scrape Google pages, and I am receiving the following error. Undefined offset: 1 in /home/content/53/7382753/html/Summer/wootsummer.php on line 25 The offending line, from the cURL settings, is below: curl_setopt($ch, CURLOPT_URL,$urls[$counter]); Any suggestions or comments would be much appreciated, as I am new to cURL. For reference, the script wootsummer.php is below: <html> <body> <?php error_reporting(E_ALL); set_time_limit (0); $urls
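
Whatever the truncated script does, the error itself just means $urls has no element at index 1: the counter ran past the end of the list. The language-agnostic fix is to iterate over the URLs directly rather than indexing with a counter; a minimal sketch of that idea in Python (the URL list is a placeholder):

    import requests

    urls = ["https://www.google.com/search?q=woot"]  # placeholder list

    # Iterating directly means an out-of-range offset can never occur,
    # no matter how many (or few) URLs the list ends up holding.
    for url in urls:
        response = requests.get(url, timeout=10)
        print(response.status_code, len(response.text))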

How do I implement a screen scraper in PHP?

烂漫一生 submitted on 2019-12-13 03:42:36
Question: I have a user ID and a password to log in to a web site via my program. Once logged in, the URL will change from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp. How can I "screen scrape" the data from the second URL using PHP? Answer 1: You would use cURL. cURL can log in to the page, then access the newly referred page and download it in its entirety. Check out the PHP manual for cURL, as well as this tutorial: How to screen-scrape with PHP and Curl. Answer 2: I'm
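
The flow the answer describes is: submit the credentials, keep the session cookie, then request the post-login page. A sketch of that flow in Python with requests; the form field names and the login action URL are assumptions and must be read off the real login form:

    import requests

    LOGIN_URL = "http://www.4wtech.com/csp/web/Employee/Login.csp"  # assumed

    with requests.Session() as session:
        # The session object stores the cookies set by the login response.
        session.post(LOGIN_URL, data={"username": "me", "password": "secret"})
        # Subsequent requests ride on the same authenticated session.
        page = session.get(LOGIN_URL)
        print(page.text)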

How do I screen scrape a website and get data within div?

橙三吉。 submitted on 2019-12-13 03:24:25
Question: How can I screen scrape a website using cURL and show the data within a specific div? Answer 1: Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM, to extract the value from the div element. Answer 2: After downloading with cURL, use XPath to select the div and extract the content. Answer 3: A possible alternative. # We will store the web page in a string variable. var string page # Read the page into the string
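
The download-then-parse split the answers describe, sketched in Python (the URL and the div's id are placeholders):

    import requests
    from bs4 import BeautifulSoup

    # Step 1: download the page (cURL's role in the answers above).
    html = requests.get("https://example.com", timeout=10).text

    # Step 2: parse the DOM and pull out the one div we care about.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="content")  # placeholder id
    if div is not None:
        print(div.get_text(strip=True))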

Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

半腔热情 submitted on 2019-12-13 03:23:49
Question: I wish there was a central, fully customizable, open-source, universal login system that allowed you to log in to and manage all of your online accounts (maybe there is?)... I just found RPXNow today, after starting to build a Sinatra app to log in to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save some time. But I keep wondering, not being an authentication guru: why couldn't I just have a sleek login page saying "Enter username and password, and check

Issue Crawling Amazon, Element Cannot Be Scrolled into View

橙三吉。 submitted on 2019-12-12 18:23:30
Question: I'm having an issue crawling pages on Amazon. I've tried executing a JS script, action chains, and explicit waits. Nothing seems to work; each throws one exception or error or another. Base script: ff = create_webdriver_instance() ff.get('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB
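
A common way around "element cannot be scrolled into view" is to combine the three attempts: wait until the element is present in the DOM, scroll it into view with JavaScript, then click it with JavaScript too, which sidesteps Selenium's interactability check entirely. A sketch; the CSS selector is a placeholder for whichever element fails on the page:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("https://www.amazon.ca/gp/goldbox")

    # Wait only for presence in the DOM, not for visibility.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a.a-button-text"))
    )
    # Scroll and click via JS, bypassing the native interactability checks.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)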

beautifulsoup 4: Segmentation fault (core dumped)

半城伤御伤魂 submitted on 2019-12-12 16:29:00
Question: I crawled the following page: http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html but I got Segmentation fault (core dumped) when calling BeautifulSoup(page_html), where page_html is the content fetched with the requests library. Is this a bug in BeautifulSoup? Is there any way to get around it? Even an approach like try...except would help me get my code running. Thanks in advance. The code is as follows: import requests from bs4 import BeautifulSoup toy_url = 'http://www
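
One caveat up front: try...except cannot catch a segmentation fault, because the crash happens in native code and kills the whole interpreter. Two common workarounds, sketched below, are to force Beautiful Soup's pure-Python parser (such crashes usually originate in a C extension like lxml) and, for defensive crawling, to run the parse in a child process so a crash is survivable:

    import requests
    from multiprocessing import Process
    from bs4 import BeautifulSoup

    URL = "http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html"

    def parse(html):
        # Workaround 1: the pure-Python parser avoids crashes in C extensions.
        BeautifulSoup(html, "html.parser")

    if __name__ == "__main__":
        page_html = requests.get(URL, timeout=10).text
        # Workaround 2: isolate the parse in a subprocess so a native
        # crash cannot take down the main program.
        worker = Process(target=parse, args=(page_html,))
        worker.start()
        worker.join()
        if worker.exitcode != 0:
            print("parser crashed; skipping this page")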

How to scrape xml file using htmlagilitypack

淺唱寂寞╮ submitted on 2019-12-12 16:14:45
Question: I need to scrape an XML file from http://feeds.feedburner.com/Torrentfreak for its links and descriptions. I used this code: var webGet = new HtmlWeb(); var document = webGet.Load("http://feeds.feedburner.com/TechCrunch"); var TechCrunch = from info in document.DocumentNode.SelectNodes("//channel") from link in info.SelectNodes("//guid[@isPermaLink='false']") from content in info.SelectNodes("//description") select new { LinkURL = info.InnerText, Content = content.InnerText, }; lvLinks
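
Two things stand out in the posted query: an XPath beginning with // always searches the whole document rather than staying relative to info (a relative query would start with .//), and LinkURL reads info.InnerText where it presumably means link.InnerText. Since the target is an RSS feed rather than HTML, the same extraction is also straightforward with a plain XML parser; a sketch in Python using only the standard library:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED = "http://feeds.feedburner.com/Torrentfreak"
    xml_data = urllib.request.urlopen(FEED).read()
    root = ET.fromstring(xml_data)

    # RSS 2.0 items carry an un-namespaced <guid> and <description>.
    for item in root.iter("item"):
        print(item.findtext("guid"))
        print(item.findtext("description"))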

Python Scrapy : allowed_domains adding new domains from database

Deadly submitted on 2019-12-12 12:41:09
Question: I need to add more domains to allowed_domains so I don't get the "Filtered offsite request to" message. My app gets the URLs to fetch from a database, so I can't add them manually. I tried to override the spider's init like this: def __init__(self): super( CrawlSpider, self ).__init__() self.start_urls = [] for destination in Phpbb.objects.filter(disable=False): self.start_urls.append(destination.forum_link) self.allowed_domains.append(destination.link) start_urls was fine; this was my first issue to solve
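
Two details are worth noting in that override: super(CrawlSpider, self).__init__() skips CrawlSpider's own initializer (it should name the subclass, so the parent's setup still runs), and allowed_domains expects bare domain names, not full links. A sketch of the corrected pattern, where Phpbb is the asker's Django model from the question (not defined here):

    from urllib.parse import urlparse
    from scrapy.spiders import CrawlSpider

    class ForumSpider(CrawlSpider):
        name = "forums"

        def __init__(self, *args, **kwargs):
            # Call the direct parent so CrawlSpider's own setup still runs.
            super().__init__(*args, **kwargs)
            self.start_urls = []
            self.allowed_domains = []
            for destination in Phpbb.objects.filter(disable=False):
                self.start_urls.append(destination.forum_link)
                # allowed_domains wants "example.com", not a full URL.
                self.allowed_domains.append(urlparse(destination.link).netloc)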

How to make mechanize not fail with forms on this page?

旧街凉风 submitted on 2019-12-12 11:41:07
Question: import mechanize url = 'http://steamcommunity.com' br=mechanize.Browser(factory=mechanize.RobustFactory()) br.open(url) print br.request print br.form for each in br.forms(): print each print The above code results in: Traceback (most recent call last): File "./mech_test.py", line 12, in <module> for each in br.forms(): File "build/bdist.linux-i686/egg/mechanize/_mechanize.py", line 426, in forms File "build/bdist.linux-i686/egg/mechanize/_html.py", line 559, in forms File "build/bdist.linux
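
The traceback dies inside mechanize's own form parser, which is strict about malformed HTML. A common workaround when a page breaks it is to take the raw response and enumerate the forms with a more forgiving parser such as BeautifulSoup; a sketch (this only inspects the forms -- submitting them would still have to be done by hand or via mechanize):

    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser(factory=mechanize.RobustFactory())
    response = br.open("http://steamcommunity.com")

    # Parse the same HTML with a tolerant parser instead of br.forms().
    soup = BeautifulSoup(response.read(), "html.parser")
    for form in soup.find_all("form"):
        print(form.get("action"), form.get("method"))
        for field in form.find_all("input"):
            print("  ", field.get("name"), field.get("type"))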