screen-scraping

Scraping with Python?

可紊 submitted on 2019-12-21 06:45:38
Question: I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python? Firebug exploration shows that the following URL returns the content I want, including both the index and its definition for 'a': http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined What modules are used? Is there a tutorial available? I don't know how many words are indexed in the dictionary. I'm an absolute beginner at programming. Answer 1: You should use
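The truncated answer points at an HTML-parsing library; a minimal sketch with BeautifulSoup is below. The markup is a stand-in: the real page's tags and classes must be confirmed with Firebug (or a browser inspector) before adapting the selectors, and the live page would first be fetched with `urllib.request.urlopen(url).read()`.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for dictionary entries; the real page's
# structure (tag names, class names) is an assumption and must be
# checked in the browser before adapting the selectors.
html = """
<div class="entry"><b>a</b><p>the first letter of the alphabet ...</p></div>
<div class="entry"><b>akkha</b><p>axle; die; eye ...</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
entries = {}
for div in soup.find_all("div", class_="entry"):
    word = div.b.get_text(strip=True)        # headword in <b>
    definition = div.p.get_text(strip=True)  # definition in <p>
    entries[word] = definition

print(entries)
```

The same loop works on the fetched page once the real tag/class names are substituted in.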

Python HTML scraping

和自甴很熟 submitted on 2019-12-21 05:48:19
Question: It's not really scraping; I'm just trying to find the URLs in a web page where the class has a specific value. For example: <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e"> I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code? I'm guessing HTML scraping libs such as BeautifulSoup are a bit of overkill just for this... Huge thanks! Answer 1: Regex is usually a bad idea; try using BeautifulSoup. Quick example: html = #get html soup
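Following the answer's suggestion, a BeautifulSoup one-liner handles this; the sample markup below mirrors the question's anchor tag.

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the question; real input would come from
# fetching the page (e.g. with urllib.request).
html = ('<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">x</a>'
        '<a class="other" href="/elsewhere">y</a>')

soup = BeautifulSoup(html, "html.parser")
# class_ (trailing underscore) avoids clashing with Python's keyword.
hrefs = [a["href"] for a in soup.find_all("a", class_="myClass")]
print(hrefs)  # ['/url/7df028f508c4685ddf65987a0bd6f22e']
```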

Scrape HTML tables from a given URL into CSV

六眼飞鱼酱① submitted on 2019-12-21 05:18:09
Question: I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n] If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified, or if there's only one table, it should parse the table and print it to stdout as CSV or TSV. Potential additional features: to be really fancy you could parse a table within a table, but for my purposes -- fetching data from
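The core of such a tablescrape tool is a routine that walks a table's rows and writes them through the csv module. A sketch, using BeautifulSoup with an inline sample table standing in for a fetched page:

```python
import csv
import io

from bs4 import BeautifulSoup

# Inline sample standing in for a fetched page.
html = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apples</td><td>3</td></tr>
  <tr><td>pears</td><td>7</td></tr>
</table>
"""

def table_to_csv(table):
    """Serialize one <table> element's rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in table.find_all("tr"):
        writer.writerow(cell.get_text(strip=True)
                        for cell in row.find_all(["th", "td"]))
    return buf.getvalue()

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")  # summarize these when no index is given
out = table_to_csv(tables[0])
print(out)
```

The "summarize" mode falls out of the same pieces: for each table, print the first row's cells and `len(table.find_all("tr"))`. Nested tables would need `find_all("tr", recursive=False)`-style care, which this sketch skips.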

Screen scraping pages that use CSS for layout and formatting…how to scrape the CSS applicable to the html?

一笑奈何 submitted on 2019-12-21 04:39:24
Question: I am working on an app for screen scraping small portions of external web pages (not an entire page, just a small subset of it). I have the code working perfectly for scraping the HTML, but my problem is that I want to scrape not just the raw HTML but also the CSS styles used to format the section of the page I am extracting, so I can display it on a new page with its original formatting intact. If you are familiar with Firebug, it is able to display which CSS styles are applicable
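Short of computing the cascade the way Firebug does, a practical first step is to collect everything the page declares: inline <style> blocks plus the URLs of linked stylesheets, resolved against the page URL (hypothetical in this sketch):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Inline sample standing in for a fetched page; the page URL is
# hypothetical and only used to resolve relative stylesheet links.
html = """
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <style>.excerpt { color: #333; }</style>
</head><body><div class="excerpt">...</div></body></html>
"""
base = "http://example.com/page"

soup = BeautifulSoup(html, "html.parser")
inline_css = [s.get_text() for s in soup.find_all("style")]
sheet_urls = [urljoin(base, link["href"])
              for link in soup.find_all("link", rel="stylesheet")]

print(sheet_urls)
```

Re-serving the excerpt with those stylesheets (plus the inline blocks) attached preserves most formatting; a proper "only the applicable rules" answer additionally requires matching each CSS selector against the extracted fragment.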

How can I Programmatically perform a search without using an API?

半腔热情 submitted on 2019-12-21 04:28:16
Question: I would like to create a program that will enter a string into the text box on a site like Google (without using their public API), then submit the form and grab the results. Is this possible? Grabbing the results will require HTML scraping, I assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters? Thanks Answer 1: Theory What
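For a form that submits via GET (as Google's search box does), "figuring out query strings/parameters" is exactly right: you don't type into the box, you reproduce the request the form would send. The parameter name q comes from the form's input field; a sketch:

```python
from urllib.parse import urlencode

def search_url(query):
    """Build the GET URL Google's search form would submit."""
    return "https://www.google.com/search?" + urlencode({"q": query})

url = search_url("screen scraping")
print(url)  # https://www.google.com/search?q=screen+scraping
```

Fetching that URL (urllib.request, or any HTTP client) and parsing the returned HTML covers the "grab the results" half. A form with method="post" works the same way, except the encoded parameters go in the request body instead of the URL; note that automated querying may violate a site's terms of service.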

Websites that are particularly challenging to crawl and scrape? [closed]

我只是一个虾纸丫 submitted于 2019-12-21 03:48:51
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 4 years ago. I'm interested in public-facing sites (nothing behind a login / authentication) that have things like: high use of internal 301 and 302 redirects; anti-scraping measures (but not banning crawlers via robots.txt); non-semantic or invalid mark-up; content loaded via AJAX in the form of onclicks or infinite scrolling

Reading and posting to web pages using C#

萝らか妹 submitted on 2019-12-21 03:15:15
Question: I have a project at work that requires me to enter information into a web page, read the next page I get redirected to, and then take further action. A simplified real-world example would be something like going to google.com, entering "Coding tricks" as the search criteria, and reading the resulting page. Small coding examples like the ones linked to at http://www.csharp-station.com/HowTo/HttpWebFetch.aspx tell how to read a web page, but not how to interact with it by submitting

Scraping javascript website

女生的网名这么多〃 submitted on 2019-12-21 02:42:08
Question: I'm able to scrape data off of basic HTML pages, but I'm having trouble scraping the site below. It looks like the data is rendered via JavaScript, and I'm not sure how to approach that. I'd prefer to use R to scrape, if possible, but could also use Python. Any ideas/suggestions? Edit: I need to grab the Year/Manufacturer/Model, the S/N, the Price, the Location, and the short description (starts with "Auction:") for each listing. http://www.machinerytrader.com/list/list.aspx?bcatid

Download all files of a particular type from a website using wget stops in the starting url

血红的双手。 submitted on 2019-12-21 01:47:29
Question: The following did not work: wget -r -A .pdf home_page_url It stops with the following message: .... Removing site.com/index.html.tmp since it should be rejected. FINISHED I don't know why it stops at the starting URL and does not follow the links in it to search for the given file type. Is there any other way to recursively download all PDF files from a website? Answer 1: It may be blocked by robots.txt. Try adding -e robots=off . Other possible problems are cookie-based authentication or agent rejection
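As a fallback, one level of the same crawl can be sketched in Python: collect the page's links, resolve them against the starting URL (hypothetical below), and keep the .pdf ones. Recursing into sub-pages and the actual downloads (e.g. urllib.request.urlretrieve) are left out:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Inline sample standing in for the fetched start page; the base URL
# is hypothetical and only used to absolutize relative links.
html = '<a href="papers/a.pdf">A</a> <a href="about.html">About</a>'
base = "http://site.com/index.html"

soup = BeautifulSoup(html, "html.parser")
links = [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]
pdfs = [u for u in links if u.lower().endswith(".pdf")]
print(pdfs)  # ['http://site.com/papers/a.pdf']
```

This sidesteps wget's robots.txt/agent handling entirely, at the cost of writing the recursion and download loop yourself.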