I have written a web crawler that crawls a website by keyword, but I want it to log in to a specific website first and then filter information by keyword. How can I achieve that? Here is the code I have so far.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {
    public Connection conn = null;

    public DB() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/test";
            conn = DriverManager.getConnection(url, "root", "root");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    // run a query that returns rows
    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    // run a statement that does not need to return rows
    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // && (not ||) so that isClosed() is never called on a null connection
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        db.runSql2("TRUNCATE Record;");
        processPage("http://m.naukri.com/login");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // check if the given URL is already in the database
        // (bound as a parameter so quotes in the URL cannot break the query)
        PreparedStatement check = db.conn.prepareStatement("select * from Record where URL = ?");
        check.setString(1, URL);
        ResultSet rs = check.executeQuery();
        if (!rs.next()) {
            // store the URL in the database to avoid parsing it again
            String sql = "INSERT INTO `test`.`Record` " + "(`URL`) VALUES " + "(?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();
            // get useful information
            Connection.Response res = Jsoup.connect("http://www.naukri.com/").data("username","jeet.chatterjee.88@gmail.com","password","Letmein321")
                    .method(Method.POST)
                    .execute();
            // http://m.naukri.com/login
            Map<String, String> loginCookies = res.cookies();
            Document doc = Jsoup.connect("http://m.naukri.com/login")
                    .cookies(loginCookies)
                    .get();
            if (doc.text().contains("")) {
                System.out.println(URL);
            }
            // get all links and recursively call the processPage method
            Elements questions = doc.select("a[href]");
            for (Element link : questions) {
                if (link.attr("abs:href").contains("naukri.com"))
                    processPage(link.attr("abs:href"));
            }
        }
    }
}
And here is the table structure:
CREATE TABLE IF NOT EXISTS `Record` (
`RecordID` INT(11) NOT NULL AUTO_INCREMENT,
`URL` text NOT NULL,
PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
Now I want to use my username and password for the crawl, so that the crawler can log in to the site dynamically and crawl information based on a keyword. Let's say my username is lucifer and my password is lucifer123.
Your approach is meant for stateless web access. That usually works for web services, but websites are stateful: you authenticate once, and after that the site uses the session key stored in your cookie to identify you. So the cookie is required, and you must also send the parameters that your browser sends. Try monitoring what your browser sends to the site with Firebug, and reproduce that in your code.
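To make the stateful flow concrete, here is a minimal sketch that uses only the JDK (Java 11 or later), not Jsoup. It starts a hypothetical stand-in site on 127.0.0.1 whose /login endpoint issues a session cookie and whose /jobs page requires it. The paths, the cookie name SESSION, its value, and the page text are assumptions for illustration only; a real site has its own form fields and cookies, which you find by watching the browser's requests.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.CookieManager;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class SessionDemo {
    // Hypothetical stand-in site: POST /login issues a session cookie, GET /jobs requires it.
    static HttpServer startDemoSite() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/login", exchange -> {
            exchange.getRequestBody().readAllBytes(); // a real site would check the credentials here
            exchange.getResponseHeaders().add("Set-Cookie", "SESSION=abc123; Path=/");
            exchange.sendResponseHeaders(204, -1);
            exchange.close();
        });
        server.createContext("/jobs", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            boolean loggedIn = cookie != null && cookie.contains("SESSION=abc123");
            byte[] body = (loggedIn ? "java developer jobs" : "please log in")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(loggedIn ? 200 : 403, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    static String fetch(HttpClient client, String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Sends the login form once; the client's CookieManager keeps the returned cookie.
    static void login(HttpClient client, String url, String user, String password) throws Exception {
        String form = "username=" + URLEncoder.encode(user, StandardCharsets.UTF_8)
                + "&password=" + URLEncoder.encode(password, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        client.send(request, HttpResponse.BodyHandlers.discarding());
    }

    // Fetches the protected page before and after logging in with the same client.
    static String runDemo() {
        HttpServer server = null;
        try {
            server = startDemoSite();
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .cookieHandler(new CookieManager()) // stores and replays session cookies
                    .build();
            String before = fetch(client, base + "/jobs");
            login(client, base + "/login", "lucifer", "lucifer123");
            String after = fetch(client, base + "/jobs");
            return before + " -> " + after;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

The first fetch is rejected and the second succeeds, because the client stored the cookie from the login response and sent it back on the next request. That is the same thing a browser does for you after you log in.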
--update--
Jsoup.connect("url")
.cookie("cookie-name", "cookie-value")
.header("header-name", "header-value")
.data("data-name","data-value");
You can add multiple cookies, headers, and data values, and there are methods for adding values from a Map.
To find out what must be set, add Firebug to your browser, or use the browser's built-in developer console, which can usually be opened with F12. Go to the URL you want to get data from and add everything you see in that request to your Jsoup request.
I added some screenshots of the requests made to your site.
I marked the important parts in red.
You can obtain the required cookies in your code by sending this information to the site and reading the cookies from the response. After getting response.cookies(), attach those cookies to every request you make ;)
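If you are wondering what "get the cookies from the response and attach them to every request" means at the HTTP level, here is a small sketch in plain Java. The class and method names (CookieJarSketch, fromSetCookie, toCookieHeader) and the sample header values are made up for illustration. With Jsoup you normally do not need this, since res.cookies() and .cookies(loginCookies) from the question's code do the same job.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

final class CookieJarSketch {
    // Collects name=value pairs from raw Set-Cookie header values,
    // dropping attributes such as Path, Max-Age or HttpOnly.
    static Map<String, String> fromSetCookie(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String value : setCookieValues) {
            String pair = value.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    // Builds the value of the Cookie request header to send on later requests.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joined = new StringJoiner("; ");
        cookies.forEach((name, value) -> joined.add(name + "=" + value));
        return joined.toString();
    }

    public static void main(String[] args) {
        // Sample values only; a real login response carries the site's own cookies.
        Map<String, String> cookies = fromSetCookie(List.of(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=en; Max-Age=3600"));
        System.out.println(toCookieHeader(cookies)); // JSESSIONID=abc123; lang=en
    }
}
```

The map is filled once from the login response and then reused for every page the crawler requests, so the login itself only has to happen a single time.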
P.S.: Change your password as soon as possible.
Source: https://stackoverflow.com/questions/28110219/how-to-crawl-a-website-after-login-in-it-with-username-and-password