R: extracting “clean” UTF-8 text from a web page scraped with RCurl

旧巷少年郎 2020-12-01 14:50

Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis.
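A minimal sketch of the RCurl route implied by the title, assuming the target page is actually served as UTF-8 and that the text of interest sits in <p> nodes (the example URL is taken from the answer below; the XPath and output file name are placeholders to adapt):

    library(RCurl)
    library(XML)

    # Fetch the raw HTML, asking RCurl to interpret the bytes as UTF-8
    url  <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
    html <- getURL(url, .encoding = "UTF-8")

    # Parse the page and pull out the visible text nodes (XPath is a placeholder)
    doc <- htmlParse(html, encoding = "UTF-8")
    txt <- xpathSApply(doc, "//p", xmlValue)

    # Write the Japanese text through an explicit UTF-8 connection
    con <- file("page_7203.txt", open = "w", encoding = "UTF-8")
    writeLines(txt, con)
    close(con)

Wrapping these steps in a function and applying it over a vector of URLs with lapply is one way to reach the hundreds-of-pages-a-day scale, provided requests are throttled politely.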

2 Answers
  •  暖寄归人
    2020-12-01 15:13

    Hi, I have written a scraping engine that allows data to be scraped from web pages that are deeply embedded within the main listing page. I wonder if it might be helpful to use it as an aggregator for your web data prior to importing it into R (see the sketch after the parameters below)?

    The engine is located at http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm

    The sample parameters I created to scrape the page you had in mind are below.

    {
      origin_url: 'http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203',
      columns: [
        {
          col_name: 'links_name',
          dom_query: 'a'
        }, {
          col_name: 'links',
          dom_query: 'a',
          required_attribute: 'href'
        }]
    };
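    The answer does not say how the engine returns its results, so purely as an illustration: assuming it exposes an HTTP endpoint that responds with JSON (the /run path below is hypothetical, not documented anywhere in this thread), the aggregated rows could be pulled into R for further processing:

        library(RCurl)
        library(jsonlite)

        # Hypothetical endpoint; the real request format depends on the engine
        resp <- getURL("http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm/run",
                       .encoding = "UTF-8")

        # Parse the JSON payload into a data frame
        rows <- fromJSON(resp)
        str(rows)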
    
