screen-scraping

Screen scraping the actual page, not the source HTML, with R

和自甴很熟 submitted on 2019-12-24 03:33:04

Question: I am trying to screen scrape tennis results data (point-by-point data, not just the final result) from this page using R: http://www.scoreboard.com/au/match/wang-j-karlovic-i-2014/M1mWYtEF/#point-by-point;1 Using the regular R screen-scraping functions like readLines(), htmlTreeParse(), etc., I am able to scrape the source HTML for the page, but that does not contain the results data. Is it possible to scrape all the text from the page, as if I were on the page in my browser and selected all and then…
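The source HTML lacks the results because the point-by-point data is injected by JavaScript after the page loads, so only a rendered DOM (in R, via RSelenium or a headless browser) contains it. A minimal sketch of the distinction, in Python with hypothetical markup (the container ID, class names, and scores are invented for illustration):

```python
import re

# Hypothetical: what a plain HTTP fetch (readLines in R) returns -- the static
# source, where the point-by-point container is empty until JavaScript fills it.
static_source = '<div id="point-by-point"></div>'

# Hypothetical: the DOM after the browser has executed the page's JavaScript.
rendered_dom = (
    '<div id="point-by-point">'
    '<span class="point">15-0</span>'
    '<span class="point">30-0</span>'
    '</div>'
)

def extract_points(html):
    """Pull the point scores out of the point-by-point container."""
    return re.findall(r'<span class="point">([^<]+)</span>', html)

print(extract_points(static_source))   # [] -- the static source has no data
print(extract_points(rendered_dom))    # scores appear only in the rendered DOM
```

The same extraction logic succeeds or fails depending solely on which document you feed it, which is why a rendered-DOM tool, not a different parser, is the fix here.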

Python Web Scraping; Beautiful Soup

岁酱吖の submitted on 2019-12-24 01:24:07

Question: This was covered in this post: Python web scraping involving HTML tags with attributes. But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland? I'm trying to scrape the values of: <td class="price city-2"> NZ$15.62 <span style="white-space:nowrap;">(AU$12.10)</span> </td> <td class="price city-1"> AU$15.82 </td> Basically, price city-2 and price city-1 (NZ$15.62 and AU$15.82). Currently I have: import urllib2 from…

BeautifulSoup find_all() returns no data

倖福魔咒の submitted on 2019-12-24 00:37:53

Question: I am very new to Python. My recent project is scraping data from a betting website. What I want to scrape is the odds information from the webpage. Here is my code: from urllib.request import urlopen as uReq from bs4 import BeautifulSoup as soup my_url = 'http://bet.hkjc.com/default.aspx?url=football/odds/odds_allodds.aspx&lang=CH&tmatchid=120653' uClient = uReq(my_url) page_html = uClient.read() uClient.close() page_soup = soup(page_html, "html.parser") page_soup.findAll("div",{"class":…
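The usual cause of `find_all()` returning nothing here is that the odds table is loaded by JavaScript after the initial page load, so the element simply is not in the HTML that `urlopen()` fetched. A quick way to confirm is to check whether the class name appears in the raw HTML at all; a sketch with hypothetical stand-in documents (the class name `oddsTable` is invented for illustration):

```python
import re

# Hypothetical stand-ins: what urlopen() returns (odds filled in later by
# JavaScript) versus what the browser's DOM inspector shows.
fetched_html = '<html><body><div id="container"></div></body></html>'
browser_dom  = '<html><body><div class="oddsTable">1.85</div></body></html>'

def has_class(html, cls):
    """Roughly what find_all('div', {'class': cls}) tests: is the class there at all?"""
    return re.search(r'class="[^"]*\b%s\b[^"]*"' % re.escape(cls), html) is not None

print(has_class(fetched_html, "oddsTable"))  # False: nothing for find_all to find
print(has_class(browser_dom, "oddsTable"))   # True only after JavaScript runs
```

When the check fails on the fetched HTML but the element is visible in the browser, the fix is a rendering tool (e.g. Selenium) or finding the underlying data request in the browser's network tab, not a different BeautifulSoup call.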

How can I extract td from HTML in bash?

狂风中的少年 submitted on 2019-12-23 23:06:15

Question: I am querying London postcode data from geonames: http://www.geonames.org/postalcode-search.html?q=london&country=GB I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash? Answer 1: I'm not sure if you mean this \n-delimited list (or one in brackets, comma-delimited): html='http://www.geonames.org/postalcode-search.html?q=london&country=GB' wget -q "$html" -O - | w3m -dump -T 'text/html'|…
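The answer's `w3m -dump` approach flattens the table and then filters by column; the same column-based idea can be sketched in Python with the standard library's html.parser (the table below is a cut-down, hypothetical version of the geonames results table, with the place name assumed to be in column 2):

```python
from html.parser import HTMLParser

# Hypothetical, cut-down version of the geonames results table:
# column 2 holds the place name we want.
html = '''<table class="restable">
<tr><td>1</td><td>Bethnal Green</td><td>E2</td></tr>
<tr><td>2</td><td>Islington</td><td>N1</td></tr>
</table>'''

class NameExtractor(HTMLParser):
    """Collect the text of the second <td> in every row."""
    def __init__(self):
        super().__init__()
        self.col = 0        # current column within the row
        self.in_td = False
        self.names = []
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.col = 0
        elif tag == "td":
            self.col += 1
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and self.col == 2 and data.strip():
            self.names.append(data.strip())

p = NameExtractor()
p.feed(html)
print(p.names)  # ['Bethnal Green', 'Islington']
```

Compared with the text-dump-plus-cut pipeline, parsing by column index survives changes in column widths, though either breaks if geonames reorders the columns.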

How to find element value using Splinter?

时光怂恿深爱的人放手 submitted on 2019-12-23 21:46:38

Question: I have the following piece of HTML: <p class="attrs"><span>foo:</span> <strong>foo</strong></p> <p class="attrs"><span>bar:</span> <strong>bar</strong></p> <p class="attrs"><span>foo2:</span> <strong></strong></p> <p class="attrs"><span>description:</span> <strong>description body</strong></p> <p class="attrs"><span>another foo:</span> <strong>foooo</strong></p> I would like to get "description body" using Splinter. I've managed to get a list of p using browser.find_by_css("p.attrs"). Answer 1: xpath = '…
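The truncated answer is an XPath along the lines of `//p[@class='attrs']/span[text()='description:']/following-sibling::strong`, passed to Splinter's `browser.find_by_xpath(...)`. The pairing logic it expresses (find the label span, take the strong that follows it) can be sketched standalone over the question's own HTML:

```python
import re

html = '''<p class="attrs"><span>foo:</span> <strong>foo</strong></p>
<p class="attrs"><span>description:</span> <strong>description body</strong></p>
<p class="attrs"><span>another foo:</span> <strong>foooo</strong></p>'''

def attr_value(html, label):
    # Pair each <span>label</span> with the <strong> that follows it --
    # the same idea as XPath's following-sibling::strong.
    pattern = r'<span>%s</span>\s*<strong>([^<]*)</strong>' % re.escape(label)
    m = re.search(pattern, html)
    return m.group(1) if m else None

print(attr_value(html, "description:"))  # description body
print(attr_value(html, "foo:"))          # foo
```

In Splinter itself, the matched element's text is read with `.text` on the result of `find_by_xpath`.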

Scraping a table using BeautifulSoup

雨燕双飞 submitted on 2019-12-23 21:23:46

Question: I have a question which I suspect is fairly straightforward. I have the following type of page, from which I want to collect the information in the last table (if you scroll all the way down, it is the one in the box labelled "Procedure"): http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN The HTML for the table I want to scrape looks like this: <tbody><tr class="doc_title"> <td style="background-image: url("/img/struct/navigation/gradient_blue…
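With BeautifulSoup this is typically `soup.find_all("table")[-1]`: collect every table, keep the last. A standard-library sketch of the same "last top-level table" idea, using a hypothetical two-table page in place of the real report (the cell contents are invented):

```python
from html.parser import HTMLParser

# Hypothetical page: several tables, the last being the "Procedure" box.
html = '''<table><tr><td>title block</td></tr></table>
<table><tr><td>Procedure</td><td>2010/0006(COD)</td></tr></table>'''

class TableSplitter(HTMLParser):
    """Collect the cell text of each top-level table separately."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tables = []
    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:       # a new top-level table begins
                self.tables.append([])
    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.tables[-1].append(data.strip())

p = TableSplitter()
p.feed(html)
print(p.tables[-1])  # cells of the last table
```

Indexing by position is fragile if the site adds a table above the target; matching on a distinctive cell value like "Procedure" is sturdier when one exists.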

Ghost.py not finding PySide?

爷,独闯天下 submitted on 2019-12-23 16:07:40

Question: I'm trying to get started with the Ghost.py headless browser on a Mac. I installed Ghost.py and its dependencies using these links/commands: Qt 5.0.1 for Mac (has a GUI installer); PySide 1.1.0, which requires Qt version >= 4.7.4 (has a GUI installer); sudo pip install Ghost.py. I launched Python and confirmed that I can import PySide. However, when I do from ghost import Ghost, it fails to find PySide: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Library…

How to know if the website being scraped has changed?

帅比萌擦擦* submitted on 2019-12-23 09:01:04

Question: I'm using PHP to scrape a website and collect some data. It's all done without using regex; I'm using PHP's explode() function to find particular HTML tags instead. It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data in my database, to avoid wrong data being stored? Answer 1: I think you don't have any clean solutions…
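One practical heuristic (not from the truncated answer; an assumption of mine) is to fingerprint the page's tag skeleton rather than its text: content updates leave the fingerprint alone, while layout changes alter it, so the scraper can refuse to store data when the hash moves. A Python sketch of that idea:

```python
import hashlib
from html.parser import HTMLParser

class Skeleton(HTMLParser):
    """Record only the tag structure, discarding text, so that content
    updates keep the same fingerprint while layout changes break it."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_starttag(self, tag, attrs):
        self.parts.append("<%s>" % tag)
    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

def structure_hash(html):
    s = Skeleton()
    s.feed(html)
    return hashlib.sha256("".join(s.parts).encode()).hexdigest()

v1 = '<div class="odds"><span>1.85</span></div>'
v2 = '<div class="odds"><span>2.10</span></div>'  # same layout, new data
v3 = '<div class="odds"><p>2.10</p></div>'        # layout changed

print(structure_hash(v1) == structure_hash(v2))   # True
print(structure_hash(v1) == structure_hash(v3))   # False
```

This version ignores attributes; including selected attributes (e.g. class names) in the fingerprint would also catch CSS-class renames, at the cost of more false alarms from cosmetic changes.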

Find all tables in html using BeautifulSoup

时光毁灭记忆、已成空白 submitted on 2019-12-23 07:46:33

Question: I want to find all tables in HTML using BeautifulSoup. Inner tables should be included in outer tables. I have created some code which works and gives the expected output, but I don't like this solution because it destroys the 'soup' object. Do you know how to do it in a more elegant way? from BeautifulSoup import BeautifulSoup as bs input = '''<html><head><title>title</title></head> <body> <p>paragraph</p> <div><div> <table>table1<table>inner11<table>inner12</table></table></table> <div><table…
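A read-only traversal avoids mutating the soup at all; in BeautifulSoup, `soup.find_all("table")` already returns nested tables alongside outer ones without modifying anything. The same non-destructive pass can be sketched with the standard library, run here over a completed, hypothetical variant of the question's truncated sample:

```python
from html.parser import HTMLParser

# A completed, hypothetical version of the question's nested-table sample.
html = '''<html><body>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2</table></div>
</body></html>'''

class TableCounter(HTMLParser):
    """Count every <table>, inner ones included, and track nesting depth --
    a read-only pass that leaves the document untouched."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.count = 0
        self.max_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1

p = TableCounter()
p.feed(html)
print(p.count, p.max_depth)  # 4 tables, nested 3 deep
```

Any solution built on extracting tables as it goes (e.g. calling `.extract()` in a loop) mutates the tree, which is exactly the inelegance the question describes; a traversal that only reads never has that problem.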

How to screen scrape from another program

随声附和 submitted on 2019-12-23 05:21:05

Question: I need to automatically get data from a piece of software into a file. However, when I did my research I only found results for web scraping. So, is there a way to get data from a local desktop application that does not have an export function? I need some local-desktop-application sort of scraping. For example, since a local desktop application such as Windows Media Player (a random example) does not have an export function to put its music library data into a file, what do you need to create a program…