screen-scraping

Scraping an AJAX e-commerce site using Python

亡梦爱人 submitted on 2019-12-04 01:59:51

Question: I have a problem scraping an e-commerce site using BeautifulSoup. I did some Googling but I still can't solve it. Please refer to the pictures: 1. Chrome F12; 2. Result. Here is the site I tried to scrape: "https://shopee.com.my/search?keyword=h370m". Problem: when I open Inspect Element in Google Chrome (F12), I can see the markup for the product's name, price, etc. But when I run my Python program, I cannot find the same code and tags in the result. After some …
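The usual cause of this mismatch is that the markup seen in DevTools is rendered by JavaScript after page load, so the raw HTML the script downloads never contains it; the product data typically arrives as JSON from an XHR endpoint visible under the Network tab in F12. A minimal sketch of working with such a response, assuming a made-up payload shape (the real Shopee endpoint and field names must be confirmed in the Network tab):

```python
import json

# Hypothetical sample shaped like the JSON a shop's AJAX search endpoint
# might return; the real response shape must be read off the Network tab.
sample = """
{"items": [
    {"name": "GIGABYTE H370M DS3H", "price": 33900000},
    {"name": "MSI H370M BAZOOKA",   "price": 42500000}
]}
"""

def extract_products(payload):
    """Return (name, price) pairs from the JSON body of the XHR response."""
    data = json.loads(payload)
    # Many such APIs send prices as integers in the smallest currency unit;
    # the divisor here is an assumption for illustration.
    return [(item["name"], item["price"] / 100000) for item in data["items"]]

for name, price in extract_products(sample):
    print(name, price)
```

Fetching the endpoint itself is then a plain HTTP GET (urllib or similar) rather than HTML parsing, which is why BeautifulSoup alone cannot reach this data.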

Interpreting JavaScript in PHP

怎甘沉沦 submitted on 2019-12-03 23:48:39

I'd like to be able to run JavaScript and get the results with PHP, and am wondering if there is a library for PHP that lets me parse it out. My first thought was to use node.js, but since node.js has access to sockets, files and so on, I think I'd prefer to avoid it. Rationale: I'm doing screen scraping in PHP and have encountered many scenarios where the data is produced by JavaScript on the frontend, and I would like to avoid writing specialized filtering functions that act on the JavaScript on a per-case basis, since that takes a lot of time. The more general case would be to …
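A common alternative to executing the JavaScript at all is to notice that frontend scripts usually embed their data as a JSON object literal in the page, which can be extracted and decoded directly. The idea is language-agnostic (in PHP it would be preg_match plus json_decode); a Python sketch with a hypothetical page fragment:

```python
import json
import re

# Hypothetical page fragment: data produced by frontend JavaScript often
# sits in the page as a JSON object literal assigned to a variable.
sample_html = """
<script>
var productData = {"name": "Widget", "price": 9.99};
</script>
"""

def embedded_json(html, var_name):
    """Extract a JSON object assigned to var_name, without executing any JS.

    Only works when the literal is valid JSON (no functions, no comments).
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    m = re.search(pattern, html, re.S)
    return json.loads(m.group(1)) if m else None

print(embedded_json(sample_html, "productData"))
```

This avoids per-case filtering only when the data really is a JSON literal; pages that compute values in JS still need a real JS engine.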

How to read someone else's forum

落花浮王杯 submitted on 2019-12-03 22:09:12

My friend has a forum, which is full of posts containing information. Sometimes she wants to review the posts and come to conclusions. At the moment she reviews posts by clicking through the forum, and forms a not-necessarily-accurate picture of the data (in her brain) from which she draws conclusions. My thought today was that I could probably bang out a quick Ruby script to parse the necessary HTML and give her a real idea of what the data is saying. I am using Ruby's net/http library for the first time today, and I have encountered a problem. While my browser has no …
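Once the pages are retrieved (a browser-works-but-script-fails difference is often the server rejecting the default library User-Agent, so sending a browser-like one is worth trying), the summarizing step is straightforward with a streaming parser. A sketch of the same idea in Python's stdlib, assuming hypothetical forum markup where each author sits in a <span class="author"> element:

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical forum markup; the real class names and structure must be
# taken from the actual forum's HTML.
sample = """
<div class="post"><span class="author">alice</span><p>hello</p></div>
<div class="post"><span class="author">bob</span><p>hi</p></div>
<div class="post"><span class="author">alice</span><p>again</p></div>
"""

class AuthorCounter(HTMLParser):
    """Count posts per author while streaming through the HTML."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._in_author = False

    def handle_starttag(self, tag, attrs):
        self._in_author = tag == "span" and ("class", "author") in attrs

    def handle_data(self, data):
        if self._in_author:
            self.counts[data.strip()] += 1
            self._in_author = False

parser = AuthorCounter()
parser.feed(sample)
print(parser.counts.most_common())
```

The equivalent Ruby script would pair net/http with an HTML parser such as Nokogiri rather than hand-rolled string matching.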

Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

99封情书 submitted on 2019-12-03 21:15:58

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (which has more options than RapidMiner's 'crawl web' operator). I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html . But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks …
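Outside RapidMiner, the fetch-match-discard pattern the question asks for is easy to sketch: each page is processed entirely in memory and never written to disk. A minimal Python illustration using ElementTree's limited XPath dialect (real-world HTML is rarely well-formed XML, so in practice a lenient parser such as lxml.html would stand in for ET.fromstring):

```python
import xml.etree.ElementTree as ET

def xpath_matches(page, xpath):
    """Apply an XPath (ElementTree's limited dialect) to one page in memory."""
    root = ET.fromstring(page)
    return [el.text for el in root.findall(xpath)]

# Hypothetical well-formed page; real pages would be fetched one at a time
# (e.g. with urllib), matched, and discarded -- never stored on disk.
sample = "<html><body><p class='name'>one</p><p class='name'>two</p></body></html>"
print(xpath_matches(sample, ".//p[@class='name']"))
```

Looping this over the externally generated URL list keeps memory and disk usage flat no matter how many thousands of pages are scraped.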

What is the best way to programmatically log into a web site in order to screen scrape? (Preferably in Python)

本秂侑毒 submitted on 2019-12-03 21:15:45

I want to be able to log into a website programmatically and periodically obtain some information from it. What tool(s) would make this as simple as possible? I'd prefer a Python library of some kind, because I want to become more proficient in Python, but I'm open to any suggestions. You can try Mechanize ( http://wwwsearch.sourceforge.net/mechanize/ ) for programmatic web browsing, and definitely use Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/ ) for the scraping. Most of us use urllib2 to get the page; it can handle various forms of authentication and …
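The core of a programmatic login is a cookie jar: POST the login form once, let the Set-Cookie response land in the jar, and reuse the same opener for later requests. A stdlib sketch (urllib.request is the modern successor to urllib2; the URL and form field names below are hypothetical and must be copied from the real site's login form or from the request shown in the browser's dev tools):

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical URL and form field names -- copy the real ones from the
# site's login <form> or the browser's network inspector.
LOGIN_URL = "https://example.com/login"

def make_opener():
    """Build an opener that keeps cookies across requests, like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar

def encode_form(username, password):
    """POST body for the login form, as the bytes urllib expects."""
    return urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode()

# Usage (not run here): the session cookie from Set-Cookie lands in the
# jar, so later opener.open(...) calls on protected pages are authenticated.
# opener, jar = make_opener()
# opener.open(LOGIN_URL, encode_form("me", "secret"))
```

Mechanize wraps exactly this pattern (plus form discovery) behind a browser-like API, which is why it is the usual recommendation.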

Python HTML scraping

元气小坏坏 submitted on 2019-12-03 21:14:07

It's not really scraping; I'm just trying to find the URLs in a web page where the class has a specific value. For example:

    <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">

I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code? I'm guessing HTML scraping libs such as BeautifulSoup are a bit of overkill just for this... Huge thanks! Regex is usually a bad idea; try using BeautifulSoup. Quick example:

    html = ...  # get html
    soup = BeautifulSoup(html)
    links = soup.findAll('a', attrs={'class': 'myClass'})
    for link in links:
        ...  # process link
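If pulling in BeautifulSoup really does feel like overkill, the stdlib's html.parser can do this specific job on its own. A small sketch (note the class comparison here is an exact string match; elements with multiple classes would need the attribute split on whitespace first):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class attribute matches exactly."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == self.wanted_class and "href" in a:
            self.hrefs.append(a["href"])

collector = LinkCollector("myClass")
collector.feed(
    '<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">x</a>'
    '<a class="other" href="/skip">y</a>'
)
print(collector.hrefs)
```

Unlike a regex, the parser copes with attribute reordering, extra whitespace, and quoting variations for free.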

Scraping with Python?

混江龙づ霸主 submitted on 2019-12-03 21:07:36

I'd like to grab all the index words and their definitions from here . Is it possible to scrape web content with Python? Firebug exploration shows that the following URL returns my desired contents, including both the index and its definition for 'a': http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined . What modules are used? Is there any tutorial available? I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming. You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML. Example - retrieving all …
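Since Firebug already revealed the per-article URL pattern, the first step is just generating those URLs; fetching each one (urllib.request.urlopen in modern Python, urllib2 in the era of the question) and feeding the body to BeautifulSoup then follows. A sketch of the URL-building step; how the article ids are enumerated is unknown here, so a single known id from the question is used:

```python
from urllib.parse import urlencode

BASE = "http://pali.hum.ku.dk/cgi-bin/cpd/pali"

def entry_url(arid):
    """URL of one dictionary article, mirroring the request Firebug showed."""
    return BASE + "?" + urlencode(
        {"acti": "xart", "arid": arid, "sphra": "undefined"}
    )

print(entry_url(14179))
```

Discovering the full range of valid arid values would require scraping the dictionary's index pages first, which the question leaves open.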

What's the fastest way to scrape a lot of pages in PHP?

不问归期 submitted on 2019-12-03 20:38:15

I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable by the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs 24/7, basically). Running a multi-curl request in a simple while loop is fairly slow. I sped it up by doing individual curl requests in a background process, which works faster, but sooner or later the slower requests start piling up, which ends up crashing …
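The piling-up problem comes from unbounded background processes; the standard fix is bounded concurrency: a fixed-size worker pool (in PHP, a rolling curl_multi window that tops up handles as they finish). The pattern itself, sketched in Python with a placeholder fetch function standing in for the real HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for the real HTTP request (urllib.request.urlopen, etc.).
    return "html for " + url

def scrape_all(urls, workers=8):
    """Fetch with a fixed-size pool: each slow response occupies one worker
    slot instead of spawning yet another unbounded background process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

print(scrape_all(["http://a.example", "http://b.example"]))
```

Because the pool size caps in-flight requests, throughput stays steady over a 24/7 run instead of degrading as stragglers accumulate.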

What is the most elegant way to do screen scraping in node.js?

我怕爱的太早我们不能终老 submitted on 2019-12-03 18:42:01

Question: I'm in the process of hacking together a web app that uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every corner. There must be an easier way to do this. Most notably, two things are irritating: 1. Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish. 2. Redirect following. I want each request to follow through redirects when a …
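The cookie-propagation pain is language-agnostic: the cure is a cookie-jar abstraction that parses Set-Cookie headers and folds them back into a Cookie request header, rather than hand-rolled string splitting. In node that means an HTTP client with jar support; the idea is shown here with Python's stdlib since that abstraction ships out of the box:

```python
from http.cookies import SimpleCookie

def cookie_header(set_cookie_values):
    """Fold Set-Cookie response headers into one Cookie request header,
    instead of hand-rolled string splitting."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses attributes like Path and HttpOnly too
    return "; ".join(f"{name}={m.value}" for name, m in jar.items())

print(cookie_header([
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/",
]))
```

Redirect following is the same story: a client that handles 3xx responses itself (as Python's urllib openers do by default) removes the second source of friction.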

Options for web scraping - C++ version only

馋奶兔 submitted on 2019-12-03 18:18:44

Question: I'm looking for a good C++ library for web scraping. It has to be C/C++ and nothing else, so please do not direct me to Options for HTML scraping or other SO questions/answers where C++ is not even mentioned.

Answer 1: libcurl to download the HTML file, libtidy to convert it to valid XML, libxml to parse/navigate the XML.

Answer 2:

    // download winhttpclient.h
    // --------------------------------
    #include <winhttp\WinHttpClient.h>
    using namespace std;
    typedef unsigned char byte;
    #define foreach BOOST_FOREACH
    …