screen-scraping

How to scrape ID-less website elements with XPath-only regex patterns

空扰寡人 submitted on 2019-12-13 03:57:28
Question: There are several similar questions about using regex in XPath searches -- however, some are not very illuminating to me, and others failed for my specific problem. So, for future users who might come across the same issue, I post the following question: using one call in Python/Selenium, I want to be able to scrape all of the elements below at once (shown without code formatting, for readability): /html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**1**]/div/div[2]/div[1] /html
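
A note on the approach: XPath 1.0, which Selenium uses, has no regex support, but when only one positional index varies you can simply drop that predicate and match every sibling in one call. A minimal sketch, assuming the asker's path with the varying div index removed (the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com")  # placeholder for the real page

    # Same absolute path as in the question, but with the varying
    # positional predicate removed so all matching siblings are returned.
    xpath = ("/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]"
             "/div/div/div[2]/div[1]")
    for element in driver.find_elements(By.XPATH, xpath):
        print(element.text)

    driver.quit()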

Undefined Offset Error in cURL Code

左心房为你撑大大i submitted on 2019-12-13 03:52:57
Question: I am building a PHP script that uses cURL to search and scrape Google pages, and I am receiving the following error. Undefined offset: 1 in /home/content/53/7382753/html/Summer/wootsummer.php on line 25 The offending line, from the cURL settings, is below: curl_setopt($ch, CURLOPT_URL,$urls[$counter]); Any suggestions or comments would be much appreciated, as I am new to cURL. For reference, the script wootsummer.php is below: <html> <body> <?php error_reporting(E_ALL); set_time_limit (0); $urls
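
Whatever the truncated script does, the error itself just means $urls has no element at index 1: the counter ran past the end of the list. The language-agnostic fix is to iterate over the URLs directly rather than indexing with a counter; a minimal sketch of that idea in Python (the URL list is a placeholder):

    import requests

    urls = ["https://www.google.com/search?q=woot"]  # placeholder list

    # Iterating directly means an out-of-range offset can never occur,
    # no matter how many (or few) URLs the list ends up holding.
    for url in urls:
        response = requests.get(url, timeout=10)
        print(response.status_code, len(response.text))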

How do I implement a screen scraper in PHP?

烂漫一生 submitted on 2019-12-13 03:42:36
Question: I have a user ID and a password to log in to a web site via my program. Once logged in, the URL will change from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp. How can I "screen scrape" the data from the second URL using PHP? Answer 1: You would use cURL. cURL can log in to the page, then access the newly referred page and download it in its entirety. Check out the PHP manual for cURL, as well as this tutorial: How to screen-scrape with PHP and Curl. Answer 2: I'm
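
The flow the answer describes is: submit the credentials, keep the session cookie, then request the post-login page. A sketch of that flow in Python with requests; the form field names and the login action URL are assumptions and must be read off the real login form:

    import requests

    LOGIN_URL = "http://www.4wtech.com/csp/web/Employee/Login.csp"  # assumed

    with requests.Session() as session:
        # The session object stores the cookies set by the login response.
        session.post(LOGIN_URL, data={"username": "me", "password": "secret"})
        # Subsequent requests ride on the same authenticated session.
        page = session.get(LOGIN_URL)
        print(page.text)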

How do I screen scrape a website and get data within div?

橙三吉。 submitted on 2019-12-13 03:24:25
Question: How can I screen scrape a website using cURL and show the data within a specific div? Answer 1: Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM, to extract the value from the div element. Answer 2: After downloading with cURL, use XPath to select the div and extract the content. Answer 3: A possible alternative. # We will store the web page in a string variable. var string page # Read the page into the string
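
The download-then-parse split the answers describe, sketched in Python (the URL and the div's id are placeholders):

    import requests
    from bs4 import BeautifulSoup

    # Step 1: download the page (cURL's role in the answers above).
    html = requests.get("https://example.com", timeout=10).text

    # Step 2: parse the DOM and pull out the one div we care about.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="content")  # placeholder id
    if div is not None:
        print(div.get_text(strip=True))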

Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

半腔热情 submitted on 2019-12-13 03:23:49
Question: I wish there was a central, fully customizable, open-source, universal login system that allowed you to log in to and manage all of your online accounts (maybe there is?)... I just found RPXNow today, after starting to build a Sinatra app to log in to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save some time. But I keep wondering, not being an authentication guru: why couldn't I just have a sleek login page saying "Enter username and password, and check

Issue Crawling Amazon, Element Cannot Be Scrolled into View

橙三吉。 submitted on 2019-12-12 18:23:30
Question: I'm having an issue crawling pages on Amazon. I've tried executing a JS script, action chains, and explicit waits. Nothing seems to work; each throws one exception or error or another. Base script: ff = create_webdriver_instance() ff.get('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB
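
A common way around "element cannot be scrolled into view" is to combine the three attempts: wait until the element is present in the DOM, scroll it into view with JavaScript, then click it with JavaScript too, which sidesteps Selenium's interactability check entirely. A sketch; the CSS selector is a placeholder for whichever element fails on the page:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("https://www.amazon.ca/gp/goldbox")

    # Wait only for presence in the DOM, not for visibility.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a.a-button-text"))
    )
    # Scroll and click via JS, bypassing the native interactability checks.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)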

beautifulsoup 4: Segmentation fault (core dumped)

半城伤御伤魂 submitted on 2019-12-12 16:29:00
Question: I crawled the following page: http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html but I got Segmentation fault (core dumped) when calling BeautifulSoup(page_html), where page_html is the content fetched with the requests library. Is this a bug in BeautifulSoup? Is there any way to get around it? Even an approach like try...except would help me get my code running. Thanks in advance. The code is as follows: import requests from bs4 import BeautifulSoup toy_url = 'http://www
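
One caveat up front: try...except cannot catch a segmentation fault, because the crash happens in native code and kills the whole interpreter. Two common workarounds, sketched below, are to force Beautiful Soup's pure-Python parser (such crashes usually originate in a C extension like lxml) and, for defensive crawling, to run the parse in a child process so a crash is survivable:

    import requests
    from multiprocessing import Process
    from bs4 import BeautifulSoup

    URL = "http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html"

    def parse(html):
        # Workaround 1: the pure-Python parser avoids crashes in C extensions.
        BeautifulSoup(html, "html.parser")

    if __name__ == "__main__":
        page_html = requests.get(URL, timeout=10).text
        # Workaround 2: isolate the parse in a subprocess so a native
        # crash cannot take down the main program.
        worker = Process(target=parse, args=(page_html,))
        worker.start()
        worker.join()
        if worker.exitcode != 0:
            print("parser crashed; skipping this page")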

How to scrape xml file using htmlagilitypack

淺唱寂寞╮ submitted on 2019-12-12 16:14:45
Question: I need to scrape an XML file from http://feeds.feedburner.com/Torrentfreak for its links and descriptions. I used this code: var webGet = new HtmlWeb(); var document = webGet.Load("http://feeds.feedburner.com/TechCrunch"); var TechCrunch = from info in document.DocumentNode.SelectNodes("//channel") from link in info.SelectNodes("//guid[@isPermaLink='false']") from content in info.SelectNodes("//description") select new { LinkURL = info.InnerText, Content = content.InnerText, }; lvLinks
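
Two things stand out in the posted query: an XPath beginning with // always searches the whole document rather than staying relative to info (a relative query would start with .//), and LinkURL reads info.InnerText where it presumably means link.InnerText. Since the target is an RSS feed rather than HTML, the same extraction is also straightforward with a plain XML parser; a sketch in Python using only the standard library:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED = "http://feeds.feedburner.com/Torrentfreak"
    xml_data = urllib.request.urlopen(FEED).read()
    root = ET.fromstring(xml_data)

    # RSS 2.0 items carry an un-namespaced <guid> and <description>.
    for item in root.iter("item"):
        print(item.findtext("guid"))
        print(item.findtext("description"))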

Python Scrapy : allowed_domains adding new domains from database

Deadly submitted on 2019-12-12 12:41:09
Question: I need to add more domains to allowed_domains so I don't get the "Filtered offsite request to" message. My app gets the URLs to fetch from a database, so I can't add them manually. I tried to override the spider's init like this: def __init__(self): super( CrawlSpider, self ).__init__() self.start_urls = [] for destination in Phpbb.objects.filter(disable=False): self.start_urls.append(destination.forum_link) self.allowed_domains.append(destination.link) start_urls was fine; this was my first issue to solve
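
Two details are worth noting in that override: super(CrawlSpider, self).__init__() skips CrawlSpider's own initializer (it should name the subclass, so the parent's setup still runs), and allowed_domains expects bare domain names, not full links. A sketch of the corrected pattern, where Phpbb is the asker's Django model from the question (not defined here):

    from urllib.parse import urlparse
    from scrapy.spiders import CrawlSpider

    class ForumSpider(CrawlSpider):
        name = "forums"

        def __init__(self, *args, **kwargs):
            # Call the direct parent so CrawlSpider's own setup still runs.
            super().__init__(*args, **kwargs)
            self.start_urls = []
            self.allowed_domains = []
            for destination in Phpbb.objects.filter(disable=False):
                self.start_urls.append(destination.forum_link)
                # allowed_domains wants "example.com", not a full URL.
                self.allowed_domains.append(urlparse(destination.link).netloc)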

How to make mechanize not fail with forms on this page?

旧街凉风 submitted on 2019-12-12 11:41:07
Question: import mechanize url = 'http://steamcommunity.com' br=mechanize.Browser(factory=mechanize.RobustFactory()) br.open(url) print br.request print br.form for each in br.forms(): print each print The above code results in: Traceback (most recent call last): File "./mech_test.py", line 12, in <module> for each in br.forms(): File "build/bdist.linux-i686/egg/mechanize/_mechanize.py", line 426, in forms File "build/bdist.linux-i686/egg/mechanize/_html.py", line 559, in forms File "build/bdist.linux
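
The traceback dies inside mechanize's own form parser, which is strict about malformed HTML. A common workaround when a page breaks it is to take the raw response and enumerate the forms with a more forgiving parser such as BeautifulSoup; a sketch (this only inspects the forms -- submitting them would still have to be done by hand or via mechanize):

    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser(factory=mechanize.RobustFactory())
    response = br.open("http://steamcommunity.com")

    # Parse the same HTML with a tolerant parser instead of br.forms().
    soup = BeautifulSoup(response.read(), "html.parser")
    for form in soup.find_all("form"):
        print(form.get("action"), form.get("method"))
        for field in form.find_all("input"):
            print("  ", field.get("name"), field.get("type"))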