screen-scraping

Scraping an AJAX e-commerce site using Python

亡梦爱人 submitted on 2019-12-04 01:59:51

Question: I have a problem scraping an e-commerce site using BeautifulSoup. I did some Googling but I still can't solve it. Please refer to the pictures: 1. Chrome F12; 2. Result. Here is the site I tried to scrape: "https://shopee.com.my/search?keyword=h370m". Problem: when I open Inspect Element in Google Chrome (F12), I can see the markup for the product's name, price, etc. But when I run my Python program, I cannot find the same code and tags in the result. After some …
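The usual cause of this mismatch is that the markup seen in DevTools is rendered by JavaScript after page load, so the raw HTML the script downloads never contains it; the product data typically arrives as JSON from an XHR endpoint visible under the Network tab in F12. A minimal sketch of working with such a response, assuming a made-up payload shape (the real Shopee endpoint and field names must be confirmed in the Network tab):

```python
import json

# Hypothetical sample shaped like the JSON a shop's AJAX search endpoint
# might return; the real response shape must be read off the Network tab.
sample = """
{"items": [
    {"name": "GIGABYTE H370M DS3H", "price": 33900000},
    {"name": "MSI H370M BAZOOKA",   "price": 42500000}
]}
"""

def extract_products(payload):
    """Return (name, price) pairs from the JSON body of the XHR response."""
    data = json.loads(payload)
    # Many such APIs send prices as integers in the smallest currency unit;
    # the divisor here is an assumption for illustration.
    return [(item["name"], item["price"] / 100000) for item in data["items"]]

for name, price in extract_products(sample):
    print(name, price)
```

Fetching the endpoint itself is then a plain HTTP GET (urllib or similar) rather than HTML parsing, which is why BeautifulSoup alone cannot reach this data.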

Interpreting JavaScript in PHP

怎甘沉沦 submitted on 2019-12-03 23:48:39

I'd like to be able to run JavaScript and get the results with PHP, and am wondering if there is a library for PHP that lets me parse it out. My first thought was to use node.js, but since node.js has access to sockets, files and so on, I think I'd prefer to avoid it. Rationale: I'm doing screen scraping in PHP and have encountered many scenarios where the data is produced by JavaScript on the frontend, and I would like to avoid writing specialized filtering functions that act on the JavaScript on a per-case basis, since that takes a lot of time. The more general case would be to …
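A common alternative to executing the JavaScript at all is to notice that frontend scripts usually embed their data as a JSON object literal in the page, which can be extracted and decoded directly. The idea is language-agnostic (in PHP it would be preg_match plus json_decode); a Python sketch with a hypothetical page fragment:

```python
import json
import re

# Hypothetical page fragment: data produced by frontend JavaScript often
# sits in the page as a JSON object literal assigned to a variable.
sample_html = """
<script>
var productData = {"name": "Widget", "price": 9.99};
</script>
"""

def embedded_json(html, var_name):
    """Extract a JSON object assigned to var_name, without executing any JS.

    Only works when the literal is valid JSON (no functions, no comments).
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    m = re.search(pattern, html, re.S)
    return json.loads(m.group(1)) if m else None

print(embedded_json(sample_html, "productData"))
```

This avoids per-case filtering only when the data really is a JSON literal; pages that compute values in JS still need a real JS engine.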

How to read someone else's forum

落花浮王杯 submitted on 2019-12-03 22:09:12

My friend has a forum, which is full of posts containing information. Sometimes she wants to review the posts and come to conclusions. At the moment she reviews posts by clicking through the forum, and forms a not-necessarily-accurate picture of the data (in her brain) from which she draws conclusions. My thought today was that I could probably bang out a quick Ruby script to parse the necessary HTML and give her a real idea of what the data is saying. I am using Ruby's net/http library for the first time today, and I have encountered a problem. While my browser has no …
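Once the pages are retrieved (a browser-works-but-script-fails difference is often the server rejecting the default library User-Agent, so sending a browser-like one is worth trying), the summarizing step is straightforward with a streaming parser. A sketch of the same idea in Python's stdlib, assuming hypothetical forum markup where each author sits in a <span class="author"> element:

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical forum markup; the real class names and structure must be
# taken from the actual forum's HTML.
sample = """
<div class="post"><span class="author">alice</span><p>hello</p></div>
<div class="post"><span class="author">bob</span><p>hi</p></div>
<div class="post"><span class="author">alice</span><p>again</p></div>
"""

class AuthorCounter(HTMLParser):
    """Count posts per author while streaming through the HTML."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._in_author = False

    def handle_starttag(self, tag, attrs):
        self._in_author = tag == "span" and ("class", "author") in attrs

    def handle_data(self, data):
        if self._in_author:
            self.counts[data.strip()] += 1
            self._in_author = False

parser = AuthorCounter()
parser.feed(sample)
print(parser.counts.most_common())
```

The equivalent Ruby script would pair net/http with an HTML parser such as Nokogiri rather than hand-rolled string matching.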

Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

99封情书 submitted on 2019-12-03 21:15:58

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (which has more options than RapidMiner's 'crawl web' operator). I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html . But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks …
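Outside RapidMiner, the fetch-match-discard pattern the question asks for is easy to sketch: each page is processed entirely in memory and never written to disk. A minimal Python illustration using ElementTree's limited XPath dialect (real-world HTML is rarely well-formed XML, so in practice a lenient parser such as lxml.html would stand in for ET.fromstring):

```python
import xml.etree.ElementTree as ET

def xpath_matches(page, xpath):
    """Apply an XPath (ElementTree's limited dialect) to one page in memory."""
    root = ET.fromstring(page)
    return [el.text for el in root.findall(xpath)]

# Hypothetical well-formed page; real pages would be fetched one at a time
# (e.g. with urllib), matched, and discarded -- never stored on disk.
sample = "<html><body><p class='name'>one</p><p class='name'>two</p></body></html>"
print(xpath_matches(sample, ".//p[@class='name']"))
```

Looping this over the externally generated URL list keeps memory and disk usage flat no matter how many thousands of pages are scraped.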

What is the best way to programmatically log into a web site in order to screen scrape? (Preferably in Python)

本秂侑毒 submitted on 2019-12-03 21:15:45

I want to be able to log into a website programmatically and periodically obtain some information from it. What tool(s) would make this as simple as possible? I'd prefer a Python library of some kind, because I want to become more proficient in Python, but I'm open to any suggestions. You can try Mechanize ( http://wwwsearch.sourceforge.net/mechanize/ ) for programmatic web browsing, and definitely use Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/ ) for the scraping. Most of us use urllib2 to get the page; it can handle various forms of authentication and …
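The core of a programmatic login is a cookie jar: POST the login form once, let the Set-Cookie response land in the jar, and reuse the same opener for later requests. A stdlib sketch (urllib.request is the modern successor to urllib2; the URL and form field names below are hypothetical and must be copied from the real site's login form or from the request shown in the browser's dev tools):

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical URL and form field names -- copy the real ones from the
# site's login <form> or the browser's network inspector.
LOGIN_URL = "https://example.com/login"

def make_opener():
    """Build an opener that keeps cookies across requests, like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar

def encode_form(username, password):
    """POST body for the login form, as the bytes urllib expects."""
    return urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode()

# Usage (not run here): the session cookie from Set-Cookie lands in the
# jar, so later opener.open(...) calls on protected pages are authenticated.
# opener, jar = make_opener()
# opener.open(LOGIN_URL, encode_form("me", "secret"))
```

Mechanize wraps exactly this pattern (plus form discovery) behind a browser-like API, which is why it is the usual recommendation.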

Python HTML scraping

元气小坏坏 submitted on 2019-12-03 21:14:07

It's not really scraping; I'm just trying to find the URLs in a web page where the class has a specific value. For example:

    <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">

I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code? I'm guessing HTML scraping libs such as BeautifulSoup are a bit of overkill just for this... Huge thanks! Regex is usually a bad idea; try using BeautifulSoup. Quick example:

    html = ...  # get html
    soup = BeautifulSoup(html)
    links = soup.findAll('a', attrs={'class': 'myClass'})
    for link in links:
        ...  # process link
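If pulling in BeautifulSoup really does feel like overkill, the stdlib's html.parser can do this specific job on its own. A small sketch (note the class comparison here is an exact string match; elements with multiple classes would need the attribute split on whitespace first):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class attribute matches exactly."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == self.wanted_class and "href" in a:
            self.hrefs.append(a["href"])

collector = LinkCollector("myClass")
collector.feed(
    '<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">x</a>'
    '<a class="other" href="/skip">y</a>'
)
print(collector.hrefs)
```

Unlike a regex, the parser copes with attribute reordering, extra whitespace, and quoting variations for free.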

Scraping with Python?

混江龙づ霸主 submitted on 2019-12-03 21:07:36

I'd like to grab all the index words and their definitions from here . Is it possible to scrape web content with Python? Firebug exploration shows that the following URL returns my desired contents, including both the index and its definition for 'a': http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined . What modules are used? Is there any tutorial available? I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming. You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML. Example - retrieving all …
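Since Firebug already revealed the per-article URL pattern, the first step is just generating those URLs; fetching each one (urllib.request.urlopen in modern Python, urllib2 in the era of the question) and feeding the body to BeautifulSoup then follows. A sketch of the URL-building step; how the article ids are enumerated is unknown here, so a single known id from the question is used:

```python
from urllib.parse import urlencode

BASE = "http://pali.hum.ku.dk/cgi-bin/cpd/pali"

def entry_url(arid):
    """URL of one dictionary article, mirroring the request Firebug showed."""
    return BASE + "?" + urlencode(
        {"acti": "xart", "arid": arid, "sphra": "undefined"}
    )

print(entry_url(14179))
```

Discovering the full range of valid arid values would require scraping the dictionary's index pages first, which the question leaves open.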

What's the fastest way to scrape a lot of pages in PHP?

不问归期 submitted on 2019-12-03 20:38:15

I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable by the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs 24/7, basically). Running a multi-curl request in a simple while loop is fairly slow. I sped it up by doing individual curl requests in a background process, which works faster, but sooner or later the slower requests start piling up, which ends up crashing …
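The piling-up problem comes from unbounded background processes; the standard fix is bounded concurrency: a fixed-size worker pool (in PHP, a rolling curl_multi window that tops up handles as they finish). The pattern itself, sketched in Python with a placeholder fetch function standing in for the real HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for the real HTTP request (urllib.request.urlopen, etc.).
    return "html for " + url

def scrape_all(urls, workers=8):
    """Fetch with a fixed-size pool: each slow response occupies one worker
    slot instead of spawning yet another unbounded background process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

print(scrape_all(["http://a.example", "http://b.example"]))
```

Because the pool size caps in-flight requests, throughput stays steady over a 24/7 run instead of degrading as stragglers accumulate.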

What is the most elegant way to do screen scraping in node.js?

我怕爱的太早我们不能终老 submitted on 2019-12-03 18:42:01

Question: I'm in the process of hacking together a web app that uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every corner. There must be an easier way to do this. Most notably, two things are irritating: 1. Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish. 2. Redirect following. I want each request to follow through redirects when a …
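The cookie-propagation pain is language-agnostic: the cure is a cookie-jar abstraction that parses Set-Cookie headers and folds them back into a Cookie request header, rather than hand-rolled string splitting. In node that means an HTTP client with jar support; the idea is shown here with Python's stdlib since that abstraction ships out of the box:

```python
from http.cookies import SimpleCookie

def cookie_header(set_cookie_values):
    """Fold Set-Cookie response headers into one Cookie request header,
    instead of hand-rolled string splitting."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses attributes like Path and HttpOnly too
    return "; ".join(f"{name}={m.value}" for name, m in jar.items())

print(cookie_header([
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/",
]))
```

Redirect following is the same story: a client that handles 3xx responses itself (as Python's urllib openers do by default) removes the second source of friction.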

Options for web scraping - C++ version only

馋奶兔 submitted on 2019-12-03 18:18:44

Question: I'm looking for a good C++ library for web scraping. It has to be C/C++ and nothing else, so please do not direct me to Options for HTML scraping or other SO questions/answers where C++ is not even mentioned.

Answer 1: libcurl to download the HTML file, libtidy to convert it to valid XML, libxml to parse/navigate the XML.

Answer 2:

    // download winhttpclient.h
    // --------------------------------
    #include <winhttp\WinHttpClient.h>
    using namespace std;
    typedef unsigned char byte;
    #define foreach BOOST_FOREACH
    …