R: extracting “clean” UTF-8 text from a web page scraped with RCurl

旧巷少年郎 2020-12-01 14:50

Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis.
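A minimal sketch of the RCurl route implied by the title, assuming the target page is actually served as UTF-8 and that the text of interest sits in <p> nodes (the example URL is taken from the answer below; the XPath and output file name are placeholders to adapt):

    library(RCurl)
    library(XML)

    # Fetch the raw HTML, asking RCurl to interpret the bytes as UTF-8
    url  <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
    html <- getURL(url, .encoding = "UTF-8")

    # Parse the page and pull out the visible text nodes (XPath is a placeholder)
    doc <- htmlParse(html, encoding = "UTF-8")
    txt <- xpathSApply(doc, "//p", xmlValue)

    # Write the Japanese text through an explicit UTF-8 connection
    con <- file("page_7203.txt", open = "w", encoding = "UTF-8")
    writeLines(txt, con)
    close(con)

Wrapping these steps in a function and applying it over a vector of URLs with lapply is one way to reach the hundreds-of-pages-a-day scale, provided requests are throttled politely.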

2 Answers
  •  暖寄归人
    2020-12-01 15:13

    Hi, I have written a scraping engine that allows data to be scraped from web pages that are deeply embedded within the main listing page. I wonder if it might be helpful to use it as an aggregator for your web data prior to importing it into R (see the sketch after the parameters below)?

    The engine is located at http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm

    The sample parameters I created to scrape the page you had in mind are below.

    {
      origin_url: 'http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203',
      columns: [
        {
          col_name: 'links_name',
          dom_query: 'a'
        }, {
          col_name: 'links',
          dom_query: 'a',
          required_attribute: 'href'
        }]
    };
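    The answer does not say how the engine returns its results, so purely as an illustration: assuming it exposes an HTTP endpoint that responds with JSON (the /run path below is hypothetical, not documented anywhere in this thread), the aggregated rows could be pulled into R for further processing:

        library(RCurl)
        library(jsonlite)

        # Hypothetical endpoint; the real request format depends on the engine
        resp <- getURL("http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm/run",
                       .encoding = "UTF-8")

        # Parse the JSON payload into a data frame
        rows <- fromJSON(resp)
        str(rows)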
    
