crawling website that needs authentication

Submitted by 时光怂恿深爱的人放手 on 2019-12-07 11:23:53

Question


How would I write a simple script (in cURL/python/ruby/bash/perl/java) that logs in to okcupid and tallies how many messages I've received each day?

The output will be something like:

1/21/2011    1 messages
1/22/2011    0 messages
1/23/2011    2 messages
1/24/2011    1 messages

The main issue is that I have never written a web crawler before. I have no idea how to programmatically log in to a site like okcupid, or how to make the authentication persist while loading different pages.

Once I get access to the raw HTML, I'll be okay parsing it with regexes, maps, and so on.
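For the tallying step the asker mentions, here is a minimal Python sketch. The HTML snippet and the date pattern are invented for illustration; the real inbox markup would need to be inspected first:

```python
import re
from collections import Counter

def tally_messages(html):
    """Count messages per date in a chunk of inbox HTML.

    Assumes each message row contains a date like 1/21/2011 somewhere
    in its markup -- a stand-in pattern, not OkCupid's real layout.
    """
    dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", html)
    return Counter(dates)

# Hypothetical inbox fragment, just to exercise the function:
sample = """
<li class="message">From A <span>1/21/2011</span></li>
<li class="message">From B <span>1/23/2011</span></li>
<li class="message">From C <span>1/23/2011</span></li>
"""

for date, count in sorted(tally_messages(sample).items()):
    print(f"{date}    {count} messages")
```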


Answer 1:


Here's a solution using cURL that downloads the first page of the inbox. A proper solution will iterate the last step for each page of messages. $USERNAME and $PASSWORD need to be filled in with your info.

#!/bin/sh

## Initialize the cookie-jar
curl --cookie-jar cjar --output /dev/null https://www.okcupid.com/login

## Login and save the resulting HTML file as loginResult.html (for debugging purposes)
curl --cookie cjar --cookie-jar cjar \
  --data 'dest=/?' \
  --data-urlencode "username=$USERNAME" \
  --data-urlencode "password=$PASSWORD" \
  --location \
  --output loginResult.html \
    https://www.okcupid.com/login

## Download the inbox and save it as inbox.html
curl --cookie cjar \
  --output inbox.html \
  https://www.okcupid.com/messages

This technique is explained in a video tutorial about cURL.
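The same cookie-persistence idea carries over to Python's standard library: a `CookieJar` attached to an opener plays the role of curl's cookie-jar file, so cookies set during login are resent on later requests automatically. This is only a sketch of the flow above (the form fields `dest`, `username`, and `password` are copied from the curl commands), not something tested against the live site:

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    """Build an opener that stores and resends cookies automatically,
    mirroring curl's --cookie-jar/--cookie pair."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def login(opener, username, password):
    # POST the same form fields the curl command sends; urllib follows
    # redirects by default, matching curl's --location flag.
    data = urllib.parse.urlencode({
        "dest": "/?",
        "username": username,
        "password": password,
    }).encode()
    return opener.open("https://www.okcupid.com/login", data)

def fetch_inbox(opener):
    # The cookies captured during login ride along on this request.
    return opener.open("https://www.okcupid.com/messages").read()
```

A real script would call `make_session()`, then `login(...)`, then `fetch_inbox(...)` once per inbox page.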



Source: https://stackoverflow.com/questions/4787196/crawling-website-that-needs-authentication
