Scraping dynamic table in R with POST


I'm trying to scrape this table using R. So far, I've managed to get only 27 lines of it, using the code below. I would like to get all the entries back and, ideally, modi…
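(The code the question refers to is not reproduced here. For context, a minimal sketch of the kind of setup it presumably used — reading the page with rvest and collecting the hidden ASP.NET form fields into param — might look like the following; the selector and object names are assumptions, not the asker's actual code.)

    library(httr)
    library(rvest)

    url <- "http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/"
    pg  <- read_html(url)

    # collect every hidden <input> (e.g. __VIEWSTATE, __VIEWSTATEGENERATOR, ...)
    # so they can be echoed back with the POST request
    hidden <- html_nodes(pg, "input[type='hidden']")
    param  <- as.list(html_attr(hidden, "value"))
    names(param) <- html_attr(hidden, "name")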

1 Answer

    The table in question has an "Export to CSV" link.

    If you click on it, you get the 6.36MB CSV file directly, which is good. I'm assuming that you need/want to do this programmatically, so this worked for me:

    Steps to Programmatically "Click Export-to-CSV"

    1. I'm using Firefox, but Chrome has a similar capability: Inspector. I opened it (Ctrl-Shift-I) and went to the "Network" tab.
    2. Click on the "Export to CSV" button. You should see a new "POST" line in the inspector frame. When it's complete ...
    3. Right-click on the "POST" line and select "Copy POST Data"; this provides:

      __EVENTTARGET
      __EVENTARGUMENT
      __VIEWSTATE=...
      ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton=+
      ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year
      ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber
      ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName
      ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox=20
      ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState
      ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState
      ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState
      __VIEWSTATEGENERATOR=CA0B0334
      

      (I replaced the long base64-string with "...".) The notable line is the fourth, ending in $ExportToCsvButton=+. This is the parameter you need to include in your POST data (param).

    4. Using your code above up through and including defining param, continue with:

      param$`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton` <- "+"
      request <- httr::POST(url, body = param, encode = 'form')
      

    You'll now have:

    request
    # Response [http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/]
    #   Date: 2017-06-01 18:09
    #   Status: 200
    #   Content-Type: text/csv; charset-UTF-8;
    #   Size: 6.36 MB
    # <U+FEFF>"Year","Area Number","Area Name","Carcass Size","Harvest Date","Location"
    # "2000","101","LAKE PIERCE","11 ft. 5 in.","09-22-2000",""
    # "2000","101","LAKE PIERCE","9 ft. 0 in.","10-02-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 10 in.","10-06-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 0 in.","09-25-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 0 in.","10-07-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 0 in.","09-22-2000",""
    # "2000","101","LAKE PIERCE","7 ft. 2 in.","09-21-2000",""
    # "2000","101","LAKE PIERCE","7 ft. 1 in.","09-21-2000",""
    # "2000","101","LAKE PIERCE","6 ft. 11 in.","09-25-2000",""
    # ...
    

    Side note: the website starts the file with <U+FEFF>, a Unicode byte-order mark (BOM). This throws off read.csv and gives you a column name of X.U.FEFF.Year, but the effect is entirely cosmetic.
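    If you want to get rid of the BOM, one option (a small sketch, not part of the original steps) is to strip it from the text before parsing:

    # drop the leading byte-order mark, then parse as before
    txt <- sub("^\ufeff", "", as.character(request))
    dat <- read.csv(textConnection(txt), stringsAsFactors = FALSE)
    names(dat)[1]
    # [1] "Year"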

    Saving to File

    If you don't care about the suggested filename, you can simply do

    write(as.character(request), file="quux.csv")
    
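    If you would rather keep the bytes exactly as the server sent them (BOM and all), a small variation is to write the raw response body instead:

    # write the response body byte-for-byte rather than round-tripping through a string
    writeBin(httr::content(request, as = "raw"), "quux.csv")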

    If you want to use the filename the website suggests for it, you can find it with:

    httr::headers(request)$`content-disposition`
    # [1] "inline;filename=\"FWCAlligatorHarvestData.csv\""
    

    Parsing that should be straightforward.
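    For example, a quick regex over that header (assuming it keeps the quoted filename= form shown above) could be:

    cd    <- httr::headers(request)$`content-disposition`
    fname <- sub('.*filename="([^"]*)".*', "\\1", cd)
    fname
    # [1] "FWCAlligatorHarvestData.csv"
    write(as.character(request), file = fname)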

    Immediate Consumption

    If you don't want/need to save to an intermediate file, you can always consume it immediately:

    head(read.csv(textConnection(as.character(request))))
    # Invalid encoding : defaulting to UTF-8.
    #   X.U.FEFF.Year Area.Number   Area.Name Carcass.Size Harvest.Date Location
    # 1          2000         101 LAKE PIERCE 11 ft. 5 in.   09-22-2000         
    # 2          2000         101 LAKE PIERCE  9 ft. 0 in.   10-02-2000         
    # 3          2000         101 LAKE PIERCE 8 ft. 10 in.   10-06-2000         
    # 4          2000         101 LAKE PIERCE  8 ft. 0 in.   09-25-2000         
    # 5          2000         101 LAKE PIERCE  8 ft. 0 in.   10-07-2000         
    # 6          2000         101 LAKE PIERCE  8 ft. 0 in.   09-22-2000         
    