How to crawl a website after logging in to it with a username and password

北慕城南 submitted on 2019-12-05 06:01:29

Question


I have written a web crawler that crawls a website by keyword, but I want it to log in to a specified website first and then filter information by keyword. How can I achieve that? Here is the code I have written so far.

public class DB {

public Connection conn = null;

public DB() {
    try {
        Class.forName("com.mysql.jdbc.Driver");
        String url = "jdbc:mysql://localhost:3306/test";
        conn = DriverManager.getConnection(url, "root","root");
        System.out.println("conn built");
    } catch (SQLException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
}

public ResultSet runSql(String sql) throws SQLException {
    Statement sta = conn.createStatement();
    return sta.executeQuery(sql);
}

public boolean runSql2(String sql) throws SQLException {
    Statement sta = conn.createStatement();
    return sta.execute(sql);
}

@Override
protected void finalize() throws Throwable {
    if (conn != null && !conn.isClosed()) {
        conn.close();
    }
}
}


public class Main {
public static DB db = new DB();

public static void main(String[] args) throws SQLException, IOException {
    db.runSql2("TRUNCATE Record;");
    processPage("http://m.naukri.com/login");
}

public static void processPage(String URL) throws SQLException, IOException{
    //check if the given URL is already in database;
    String sql = "select * from Record where URL = ?";
    PreparedStatement check = db.conn.prepareStatement(sql);
    check.setString(1, URL);
    ResultSet rs = check.executeQuery();
    if(rs.next()){

    }else{
        //store the URL to database to avoid parsing again
        sql = "INSERT INTO  `test`.`Record` " + "(`URL`) VALUES " + "(?);";
        PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
        stmt.setString(1, URL);
        stmt.execute();

        //get useful information
        Connection.Response res = Jsoup.connect("http://www.naukri.com/")
                  .data("username", "lucifer", "password", "lucifer123")
                  .method(Method.POST)
                  .execute();
        //http://m.naukri.com/login
        Map<String, String> loginCookies = res.cookies();
        Document doc = Jsoup.connect("http://m.naukri.com/login")
                  .cookies(loginCookies)
                  .get();

        if(doc.text().contains("")){
            System.out.println(URL);
        }

        //get all links and recursively call the processPage method
        Elements questions = doc.select("a[href]");
        for(Element link: questions){
            if(link.attr("abs:href").contains("naukri.com"))
                processPage(link.attr("abs:href"));
        }
    }
}
}

Here is the table structure as well:

 CREATE TABLE IF NOT EXISTS `Record` (
 `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
 `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

Now I want the crawler to use my username and password so that it can log in to the site dynamically and crawl information based on a keyword. Let's say my username is lucifer and my password is lucifer123.


Answer 1:


Your approach assumes stateless web access. That usually works for web services, whereas websites are generally stateful: you authenticate once, and after that the site uses the session key stored in your cookie to identify you. So the cookie is required, and you must send the same parameters that your browser sends. Try monitoring what your browser sends to the site with Firebug, and reproduce that in your code.
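To make the "authenticate once, then rely on the session key" idea concrete, here is a minimal sketch of what a stateful site might do on its side. The class SessionStore and its methods are hypothetical names invented for this illustration, not anything naukri.com is known to use, and passwords are kept in plain text only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical model of the server side of a stateful site.
final class SessionStore {
    private final Map<String, String> passwords = new HashMap<>(); // user -> password
    private final Map<String, String> sessions = new HashMap<>();  // session key -> user

    void register(String user, String password) {
        passwords.put(user, password);
    }

    /** Checks the credentials once and hands back a session key, or empty if they are wrong. */
    Optional<String> login(String user, String password) {
        if (password == null || !password.equals(passwords.get(user))) {
            return Optional.empty();
        }
        String key = UUID.randomUUID().toString();
        sessions.put(key, user);
        return Optional.of(key);
    }

    /** Later requests are identified by the session key alone (normally sent as a cookie). */
    Optional<String> userFor(String sessionKey) {
        return Optional.ofNullable(sessions.get(sessionKey));
    }

    public static void main(String[] args) {
        SessionStore site = new SessionStore();
        site.register("lucifer", "lucifer123");
        String key = site.login("lucifer", "lucifer123").orElseThrow();
        System.out.println(site.userFor(key));           // Optional[lucifer]
        System.out.println(site.userFor("no-such-key")); // Optional.empty
    }
}
```

A crawler that drops the cookie between requests is in the position of the "no-such-key" call: the site has no way to tell that it ever logged in.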

--update--

Jsoup.connect("url")
  .cookie("cookie-name", "cookie-value")
  .header("header-name", "header-value")
  .data("data-name","data-value");

You can add multiple cookies, headers, and data values, and there are methods for adding values from a Map.
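For a form POST, the data name/value pairs are normally sent as an application/x-www-form-urlencoded request body, which is what the browser's developer console shows as form data. The sketch below uses only the JDK (Java 10 or later) rather than Jsoup to show that encoding; FormBody is a hypothetical helper name and the field values are made up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: turns a Map of form fields into a URL-encoded body.
final class FormBody {
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("username", "lucifer");
        form.put("password", "p&ss word"); // made-up value with characters that need escaping
        System.out.println(encode(form));  // username=lucifer&password=p%26ss+word
    }
}
```

Comparing a string like this with what the browser actually posts is a quick way to spot a missing hidden field or a misnamed parameter.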

To find out what must be set, add Firebug to your browser, or use the browser's built-in developer console, which can usually be opened with F12. Go to the URL you want to get data from and add everything you see there to your Jsoup request. I added some screenshots of the results from your site.

I marked the important parts in red.

You can obtain the required cookies in your code by sending this information to the site and reading the cookies from the response. After getting response.cookies, attach these cookies to every request you make ;)

P.S.: Change your password as soon as possible.
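The flow described here (log in, read the cookies from the response, attach them to later requests) can be sketched end to end. To keep it runnable without extra libraries or a real account, this version swaps Jsoup for the JDK's java.net.http client (Java 11 or later) and talks to a small local stand-in server. The paths /login and /jobs, the cookie name SESSION, and the form field names are all assumptions made for the demo; the real site's names have to be read from the browser's developer console.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class LoginThenCrawl {
    static final HttpClient CLIENT =
            HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();

    // Stand-in for the real site: /login sets a session cookie, /jobs requires it.
    static HttpServer startFakeSite() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/login", ex -> {
            String form = new String(ex.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            boolean ok = form.equals("username=lucifer&password=lucifer123");
            if (ok) ex.getResponseHeaders().add("Set-Cookie", "SESSION=s3cr3t; Path=/");
            reply(ex, ok ? 200 : 401, ok ? "welcome" : "bad credentials");
        });
        server.createContext("/jobs", ex -> {
            String cookie = ex.getRequestHeaders().getFirst("Cookie");
            boolean loggedIn = cookie != null && cookie.contains("SESSION=s3cr3t");
            reply(ex, loggedIn ? 200 : 403, loggedIn ? "java developer jobs" : "please log in");
        });
        server.start();
        return server;
    }

    static void reply(HttpExchange ex, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(status, bytes.length);
        try (var out = ex.getResponseBody()) {
            out.write(bytes);
        }
    }

    /** Posts the login form and returns the received cookies as a Cookie header value. */
    static String login(String baseUrl, String user, String pass)
            throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/login"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("username=" + user + "&password=" + pass))
                .build();
        HttpResponse<String> res = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
        // Keep only the name=value part of each Set-Cookie header.
        return res.headers().allValues("set-cookie").stream()
                .map(h -> h.split(";", 2)[0].trim())
                .collect(Collectors.joining("; "));
    }

    /** Fetches a page, attaching the cookies when there are any. */
    static HttpResponse<String> fetch(String url, String cookieHeader)
            throws IOException, InterruptedException {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (!cookieHeader.isEmpty()) builder.header("Cookie", cookieHeader);
        return CLIENT.send(builder.build(), HttpResponse.BodyHandlers.ofString());
    }

    /** Returns {status without cookie, status with cookie} for the protected page. */
    static int[] runDemo() {
        try {
            HttpServer server = startFakeSite();
            try {
                String base = "http://localhost:" + server.getAddress().getPort();
                int anonymous = fetch(base + "/jobs", "").statusCode();
                String cookies = login(base, "lucifer", "lucifer123");
                int loggedIn = fetch(base + "/jobs", cookies).statusCode();
                return new int[] {anonymous, loggedIn};
            } finally {
                server.stop(0);
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] status = runDemo();
        System.out.println("without cookie: " + status[0] + ", with cookie: " + status[1]);
    }
}
```

In the crawler from the question, the equivalent change would be to log in once, keep the Map returned by res.cookies(), and pass it through .cookies(...) on every page request instead of logging in again inside processPage.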



Source: https://stackoverflow.com/questions/28110219/how-to-crawl-a-website-after-login-in-it-with-username-and-password
