BeautifulSoup-ing a website with login and site search engine

Posted by 亡梦爱人 on 2019-12-13 12:23:13

Question


I'm trying to scrape the International Maritime Organization's data (https://gisis.imo.org/Public/PAR/Search.aspx) on attacks against shipping vessels between 2002-01-01 and 2005-12-31 (the "is between" option in the site's search engine).

I've previously used the bs4 and requests modules in Python to scrape financial data from Yahoo and weather data from Wunderground, but this site requires a login and password (under the "public" account type). Furthermore, as I said, the data has to be searched and filtered before I can access the HTML on the page.

Once I click on a row in the search results, it expands into a detail view. (Before anyone asks why I don't just download the dataset and pull from there: the download is filtered for some reason, and not all of the columns are included, for example the IMO number.)

Ultimately, the data I am trying to pull is from this detail page, and I need the following (item, CSS path):

  • position of incident

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(1) > td:nth-child(2) > span
    
  • date

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(6) > td.content > span
    
  • ship name

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(4) > td:nth-child(2) > span
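
As a rough sketch of how these three CSS paths could be applied once the detail page's HTML is in hand, the snippet below runs BeautifulSoup's `select_one` over a small made-up fragment. The sample HTML, its labels, and its values are invented for illustration; the real page's structure has not been verified here. One known wrinkle: selectors copied from a browser's developer tools usually contain `tbody`, which browsers insert even when the served HTML omits it, so the hypothetical helper `extract_fields` retries each path with `tbody` removed.

```python
from bs4 import BeautifulSoup

# CSS paths from the question, with the shared prefix factored out.
PANEL = "#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody"
SELECTORS = {
    "position": PANEL + " > tr:nth-child(1) > td:nth-child(2) > span",
    "date": PANEL + " > tr:nth-child(6) > td.content > span",
    "ship_name": PANEL + " > tr:nth-child(4) > td:nth-child(2) > span",
}


def extract_fields(html):
    """Return {field: text or None} for the three selectors above."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for name, selector in SELECTORS.items():
        node = soup.select_one(selector)
        if node is None:
            # Browsers add <tbody>; the raw HTML may not contain it.
            node = soup.select_one(selector.replace(" > tbody", ""))
        record[name] = node.get_text(strip=True) if node else None
    return record


# Invented stand-in for the detail page (NOT the real markup): the table is
# the 4th child of the panel and has no <tbody>.
_ROWS = [
    ("Position", "EXAMPLE POSITION"),
    ("Row 2", "-"),
    ("Row 3", "-"),
    ("Ship name", "EXAMPLE VESSEL"),
    ("Row 5", "-"),
    ("Date", "2003-05-17"),
]
SAMPLE_HTML = (
    '<div id="ctl00_bodyPlaceHolder_ctl00_pnlDetail"><h2>a</h2><p>b</p><p>c</p><table>'
    + "".join(
        f'<tr><td>{label}</td><td class="content"><span>{value}</span></td></tr>'
        for label, value in _ROWS
    )
    + "</table></div>"
)

print(extract_fields(SAMPLE_HTML))
```

Returning `None` for a missing field, rather than raising, makes it easier to spot which selector stopped matching if the page layout differs from what the browser showed.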
    

Needless to say, this seems like a daunting task. Any recommendations?

Here is the old code I've been using to scrape the weather data (I haven't changed anything yet because I don't know where to start with the login/filter process): http://pythonfiddle.com/get-wx-data


Answer 1:


requests alone isn't going to be enough. You'll want to look into mechanize: http://wwwsearch.sourceforge.net/mechanize/

The nice thing about mechanize is that it maintains state (cookies and the current form) from page to page, which bare requests calls do not do unless they share a requests.Session. (You probably could do it with just requests, but I'm not quite that clever.) Here's an example of a simple login interaction.
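
On the parenthetical about doing it with just requests: a `requests.Session` stores the cookies a server sets and sends them back on later calls, which is the page-to-page state a login needs. The sketch below is not the example the answer linked to. `LOGIN_URL`, the form field names, and the helper names are placeholders, since the real ones have to be read from the site's login form.

```python
import requests

# Hypothetical placeholder; the real login URL and field names are assumptions
# that must be replaced after inspecting the site's login form.
LOGIN_URL = "https://example.com/login"


def build_session(user_agent="imo-par-scraper/0.1"):
    """Create a Session so cookies persist across every request made with it."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def log_in(session, login_url, form_fields):
    """POST the login form; any cookies the server sets stay on the session."""
    response = session.post(login_url, data=form_fields)
    response.raise_for_status()
    return response


def cookie_header(session, url):
    """Return the Cookie header the session would send to `url` (no network I/O)."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")
```

A call such as `log_in(session, LOGIN_URL, {"username": "...", "password": "..."})` followed by `session.get(search_url)` would reuse the login cookies, provided the field names match the real form. `cookie_header` is only there to inspect what the session would send.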

This would be awesome if the IMO site were that easy. Instead, it's ASP-based (note the .aspx URL), and that means it's relatively irritating to scrape. Some of the details will vary from site to site, so I'll suggest two things in particular: looking at the Network tab of your browser's developer tools and reading this ScraperWiki post on dealing with ASP sites.
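
One pattern that often comes up with ASP.NET WebForms pages (assumed here from the `.aspx` URL, not confirmed for this site) is that each POST must echo back the form's hidden inputs, such as `__VIEWSTATE` and `__EVENTVALIDATION`, taken from the previous response. The sketch below collects those fields and builds the next payload. The helper names, the sample form, and the `ctl00$example$...` control names are hypothetical; the real names are what the Network tab shows in the browser's own POST.

```python
from bs4 import BeautifulSoup


def hidden_fields(html):
    """Collect name/value pairs of every hidden <input> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        tag["name"]: tag.get("value", "")
        for tag in soup.select("input[type=hidden]")
        if tag.has_attr("name")
    }


def build_postback(html, event_target="", extra_fields=None):
    """Build a POST payload that echoes the hidden fields and adds our own."""
    payload = hidden_fields(html)
    payload["__EVENTTARGET"] = event_target  # control that "fired" the postback
    payload.setdefault("__EVENTARGUMENT", "")
    payload.update(extra_fields or {})
    return payload


def post_back(session, url, html, event_target="", extra_fields=None):
    """Send the payload with any object exposing requests' Session.post."""
    return session.post(url, data=build_postback(html, event_target, extra_fields))


# Invented sample form (NOT the real IMO markup).
SAMPLE_FORM = """
<form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf" />
  <input type="text" name="ctl00$example$txtDateFrom" />
</form>
"""

print(build_postback(SAMPLE_FORM, "ctl00$example$btnSearch",
                     {"ctl00$example$txtDateFrom": "2002-01-01"}))
```

Because the hidden values change with every response, the payload is rebuilt from the most recent page before each POST rather than cached.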

Best of luck!



Source: https://stackoverflow.com/questions/26305631/beautifulsoup-ing-a-website-with-login-and-site-search-engine
