screen-scraping

Screen scraping the actual page, not the source HTML, with R

和自甴很熟 submitted on 2019-12-24 03:33:04

Question: I am trying to screen scrape tennis results data (point-by-point data, not just the final result) from this page using R: http://www.scoreboard.com/au/match/wang-j-karlovic-i-2014/M1mWYtEF/#point-by-point;1 Using the regular R screen-scraping functions like readLines(), htmlTreeParse(), etc., I am able to scrape the source HTML for the page, but that does not contain the results data. Is it possible to scrape all the text from the page, as if I were on the page in my browser and selected all and then…
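The source HTML lacks the results because the point-by-point data is injected by JavaScript after the page loads, so only a rendered DOM (in R, via RSelenium or a headless browser) contains it. A minimal sketch of the distinction, in Python with hypothetical markup (the container ID, class names, and scores are invented for illustration):

```python
import re

# Hypothetical: what a plain HTTP fetch (readLines in R) returns -- the static
# source, where the point-by-point container is empty until JavaScript fills it.
static_source = '<div id="point-by-point"></div>'

# Hypothetical: the DOM after the browser has executed the page's JavaScript.
rendered_dom = (
    '<div id="point-by-point">'
    '<span class="point">15-0</span>'
    '<span class="point">30-0</span>'
    '</div>'
)

def extract_points(html):
    """Pull the point scores out of the point-by-point container."""
    return re.findall(r'<span class="point">([^<]+)</span>', html)

print(extract_points(static_source))   # [] -- the static source has no data
print(extract_points(rendered_dom))    # scores appear only in the rendered DOM
```

The same extraction logic succeeds or fails depending solely on which document you feed it, which is why a rendered-DOM tool, not a different parser, is the fix here.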

Python Web Scraping; Beautiful Soup

岁酱吖の submitted on 2019-12-24 01:24:07

Question: This was covered in this post: Python web scraping involving HTML tags with attributes. But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland? I'm trying to scrape the values of: <td class="price city-2"> NZ$15.62 <span style="white-space:nowrap;">(AU$12.10)</span> </td> <td class="price city-1"> AU$15.82 </td> Basically, price city-2 and price city-1 (NZ$15.62 and AU$15.82). Currently I have: import urllib2 from…

BeautifulSoup find_all() returns no data

倖福魔咒の submitted on 2019-12-24 00:37:53

Question: I am very new to Python. My recent project is scraping data from a betting website. What I want to scrape is the odds information from the webpage. Here is my code: from urllib.request import urlopen as uReq from bs4 import BeautifulSoup as soup my_url = 'http://bet.hkjc.com/default.aspx?url=football/odds/odds_allodds.aspx&lang=CH&tmatchid=120653' uClient = uReq(my_url) page_html = uClient.read() uClient.close() page_soup = soup(page_html, "html.parser") page_soup.findAll("div",{"class":…
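The usual cause of `find_all()` returning nothing here is that the odds table is loaded by JavaScript after the initial page load, so the element simply is not in the HTML that `urlopen()` fetched. A quick way to confirm is to check whether the class name appears in the raw HTML at all; a sketch with hypothetical stand-in documents (the class name `oddsTable` is invented for illustration):

```python
import re

# Hypothetical stand-ins: what urlopen() returns (odds filled in later by
# JavaScript) versus what the browser's DOM inspector shows.
fetched_html = '<html><body><div id="container"></div></body></html>'
browser_dom  = '<html><body><div class="oddsTable">1.85</div></body></html>'

def has_class(html, cls):
    """Roughly what find_all('div', {'class': cls}) tests: is the class there at all?"""
    return re.search(r'class="[^"]*\b%s\b[^"]*"' % re.escape(cls), html) is not None

print(has_class(fetched_html, "oddsTable"))  # False: nothing for find_all to find
print(has_class(browser_dom, "oddsTable"))   # True only after JavaScript runs
```

When the check fails on the fetched HTML but the element is visible in the browser, the fix is a rendering tool (e.g. Selenium) or finding the underlying data request in the browser's network tab, not a different BeautifulSoup call.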

How can I extract td from HTML in bash?

狂风中的少年 submitted on 2019-12-23 23:06:15

Question: I am querying London postcode data from geonames: http://www.geonames.org/postalcode-search.html?q=london&country=GB I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash? Answer 1: I'm not sure if you mean this \n-delimited list (or one in brackets, comma-delimited): html='http://www.geonames.org/postalcode-search.html?q=london&country=GB' wget -q "$html" -O - | w3m -dump -T 'text/html'|…
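The answer's `w3m -dump` approach flattens the table and then filters by column; the same column-based idea can be sketched in Python with the standard library's html.parser (the table below is a cut-down, hypothetical version of the geonames results table, with the place name assumed to be in column 2):

```python
from html.parser import HTMLParser

# Hypothetical, cut-down version of the geonames results table:
# column 2 holds the place name we want.
html = '''<table class="restable">
<tr><td>1</td><td>Bethnal Green</td><td>E2</td></tr>
<tr><td>2</td><td>Islington</td><td>N1</td></tr>
</table>'''

class NameExtractor(HTMLParser):
    """Collect the text of the second <td> in every row."""
    def __init__(self):
        super().__init__()
        self.col = 0        # current column within the row
        self.in_td = False
        self.names = []
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.col = 0
        elif tag == "td":
            self.col += 1
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and self.col == 2 and data.strip():
            self.names.append(data.strip())

p = NameExtractor()
p.feed(html)
print(p.names)  # ['Bethnal Green', 'Islington']
```

Compared with the text-dump-plus-cut pipeline, parsing by column index survives changes in column widths, though either breaks if geonames reorders the columns.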

How to find element value using Splinter?

时光怂恿深爱的人放手 submitted on 2019-12-23 21:46:38

Question: I have the following piece of HTML: <p class="attrs"><span>foo:</span> <strong>foo</strong></p> <p class="attrs"><span>bar:</span> <strong>bar</strong></p> <p class="attrs"><span>foo2:</span> <strong></strong></p> <p class="attrs"><span>description:</span> <strong>description body</strong></p> <p class="attrs"><span>another foo:</span> <strong>foooo</strong></p> I would like to get "description body" using Splinter. I've managed to get a list of p using browser.find_by_css("p.attrs"). Answer 1: xpath = '…
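The truncated answer is an XPath along the lines of `//p[@class='attrs']/span[text()='description:']/following-sibling::strong`, passed to Splinter's `browser.find_by_xpath(...)`. The pairing logic it expresses (find the label span, take the strong that follows it) can be sketched standalone over the question's own HTML:

```python
import re

html = '''<p class="attrs"><span>foo:</span> <strong>foo</strong></p>
<p class="attrs"><span>description:</span> <strong>description body</strong></p>
<p class="attrs"><span>another foo:</span> <strong>foooo</strong></p>'''

def attr_value(html, label):
    # Pair each <span>label</span> with the <strong> that follows it --
    # the same idea as XPath's following-sibling::strong.
    pattern = r'<span>%s</span>\s*<strong>([^<]*)</strong>' % re.escape(label)
    m = re.search(pattern, html)
    return m.group(1) if m else None

print(attr_value(html, "description:"))  # description body
print(attr_value(html, "foo:"))          # foo
```

In Splinter itself, the matched element's text is read with `.text` on the result of `find_by_xpath`.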

Scraping a table using BeautifulSoup

雨燕双飞 submitted on 2019-12-23 21:23:46

Question: I have a question which I suspect is fairly straightforward. I have the following type of page, from which I want to collect the information in the last table (if you scroll all the way down, it is the one in the box labelled "Procedure"): http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN The HTML for the table I want to scrape looks like this: <tbody><tr class="doc_title"> <td style="background-image: url("/img/struct/navigation/gradient_blue…
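With BeautifulSoup this is typically `soup.find_all("table")[-1]`: collect every table, keep the last. A standard-library sketch of the same "last top-level table" idea, using a hypothetical two-table page in place of the real report (the cell contents are invented):

```python
from html.parser import HTMLParser

# Hypothetical page: several tables, the last being the "Procedure" box.
html = '''<table><tr><td>title block</td></tr></table>
<table><tr><td>Procedure</td><td>2010/0006(COD)</td></tr></table>'''

class TableSplitter(HTMLParser):
    """Collect the cell text of each top-level table separately."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tables = []
    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:       # a new top-level table begins
                self.tables.append([])
    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.tables[-1].append(data.strip())

p = TableSplitter()
p.feed(html)
print(p.tables[-1])  # cells of the last table
```

Indexing by position is fragile if the site adds a table above the target; matching on a distinctive cell value like "Procedure" is sturdier when one exists.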

Ghost.py not finding PySide?

爷,独闯天下 submitted on 2019-12-23 16:07:40

Question: I'm trying to get started with the Ghost.py headless browser on a Mac. I installed Ghost.py and its dependencies using these links/commands: Qt 5.0.1 for Mac (has a GUI installer); PySide 1.1.0, which requires Qt version >= 4.7.4 (has a GUI installer); sudo pip install Ghost.py. I launched Python and confirmed that I can import PySide. However, when I do from ghost import Ghost, it fails to find PySide: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Library…

How to know if the website being scraped has changed?

帅比萌擦擦* submitted on 2019-12-23 09:01:04

Question: I'm using PHP to scrape a website and collect some data. It's all done without using regex; I'm using PHP's explode() function to find particular HTML tags instead. It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data in my database, to avoid wrong data being stored? Answer 1: I think you don't have any clean solutions…
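One practical heuristic (not from the truncated answer; an assumption of mine) is to fingerprint the page's tag skeleton rather than its text: content updates leave the fingerprint alone, while layout changes alter it, so the scraper can refuse to store data when the hash moves. A Python sketch of that idea:

```python
import hashlib
from html.parser import HTMLParser

class Skeleton(HTMLParser):
    """Record only the tag structure, discarding text, so that content
    updates keep the same fingerprint while layout changes break it."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_starttag(self, tag, attrs):
        self.parts.append("<%s>" % tag)
    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

def structure_hash(html):
    s = Skeleton()
    s.feed(html)
    return hashlib.sha256("".join(s.parts).encode()).hexdigest()

v1 = '<div class="odds"><span>1.85</span></div>'
v2 = '<div class="odds"><span>2.10</span></div>'  # same layout, new data
v3 = '<div class="odds"><p>2.10</p></div>'        # layout changed

print(structure_hash(v1) == structure_hash(v2))   # True
print(structure_hash(v1) == structure_hash(v3))   # False
```

This version ignores attributes; including selected attributes (e.g. class names) in the fingerprint would also catch CSS-class renames, at the cost of more false alarms from cosmetic changes.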

Find all tables in html using BeautifulSoup

时光毁灭记忆、已成空白 submitted on 2019-12-23 07:46:33

Question: I want to find all tables in HTML using BeautifulSoup. Inner tables should be included in outer tables. I have created some code which works and gives the expected output, but I don't like this solution because it destroys the 'soup' object. Do you know how to do it in a more elegant way? from BeautifulSoup import BeautifulSoup as bs input = '''<html><head><title>title</title></head> <body> <p>paragraph</p> <div><div> <table>table1<table>inner11<table>inner12</table></table></table> <div><table…
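A read-only traversal avoids mutating the soup at all; in BeautifulSoup, `soup.find_all("table")` already returns nested tables alongside outer ones without modifying anything. The same non-destructive pass can be sketched with the standard library, run here over a completed, hypothetical variant of the question's truncated sample:

```python
from html.parser import HTMLParser

# A completed, hypothetical version of the question's nested-table sample.
html = '''<html><body>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2</table></div>
</body></html>'''

class TableCounter(HTMLParser):
    """Count every <table>, inner ones included, and track nesting depth --
    a read-only pass that leaves the document untouched."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.count = 0
        self.max_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1

p = TableCounter()
p.feed(html)
print(p.count, p.max_depth)  # 4 tables, nested 3 deep
```

Any solution built on extracting tables as it goes (e.g. calling `.extract()` in a loop) mutates the tree, which is exactly the inelegance the question describes; a traversal that only reads never has that problem.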

How to screen scrape from another program

随声附和 submitted on 2019-12-23 05:21:05

Question: I need to automatically get data from a piece of software into a file. However, when I did my research I only found results for web scraping. So, is there a way to get data from a local desktop application that does not have an export function? I need some local-desktop-application sort of scraping. For example, since a local desktop application such as Windows Media Player (a random example) does not have an export function to put its music library data into a file, what do you need to create a program…