How to write a crawler?

感情败类 2020-12-02 03:47

I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content.

Does anybody have any thoughts on how to do this?

10 Answers
  •  無奈伤痛
    2020-12-02 04:42

    You'll be reinventing the wheel, to be sure. But here are the basics:

    • A list of unvisited URLs - seed this with one or more starting pages
    • A list of visited URLs - so you don't go around in circles
    • A set of rules for URLs you're not interested in - so you don't index the whole Internet

    Put these in persistent storage, so you can stop and start the crawler without losing state.
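
    For the persistence, here is a minimal sketch using a JSON file (the file name crawler_state.json is an assumption for illustration, not something the answer prescribes):

    import json
    import os

    STATE_FILE = "crawler_state.json"  # hypothetical location for the persisted state

    def load_state(seeds):
        # Resume from a previous run if saved state exists; otherwise start from the seeds.
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                state = json.load(f)
            return set(state["unvisited"]), set(state["visited"])
        return set(seeds), set()

    def save_state(unvisited, visited):
        # Write both lists out so the crawler can stop and restart without losing its place.
        with open(STATE_FILE, "w") as f:
            json.dump({"unvisited": sorted(unvisited), "visited": sorted(visited)}, f)

    Calling save_state every few pages, rather than only on exit, keeps a crash from costing much work.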

    The algorithm, sketched here as minimal runnable Python (the seed URL and the filter rule are placeholder assumptions):

    import re
    import urllib.request
    from urllib.parse import urljoin

    unvisited = {"https://www.example.org/"}  # seed with one or more starting pages
    visited = set()                           # so you don't go around in circles

    def matches_rules(url):
        # Placeholder rule: stay on the seed site rather than indexing the whole Internet.
        return url.startswith("https://www.example.org/")

    while unvisited:
        url = unvisited.pop()        # take a URL off the unvisited list...
        visited.add(url)             # ...and add it to the visited list
        try:
            response = urllib.request.urlopen(url)
        except (OSError, ValueError):
            continue                 # skip URLs that fail to fetch
        content = response.read().decode("utf-8", errors="replace")
        print(url, len(content))     # record whatever you want about the content
        if "html" in response.headers.get("Content-Type", ""):
            for link in re.findall(r'href="([^"#]+)"', content):
                link = urljoin(url, link)  # resolve relative links into absolute ones
                if matches_rules(link) and link not in visited and link not in unvisited:
                    unvisited.add(link)
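
    One design note on the sketch above: popping from a set gives an arbitrary crawl order; swapping the unvisited set for a collections.deque (append new links, popleft() the next URL, with a separate set for membership checks) yields breadth-first order instead. A real crawler would also want to honor robots.txt and rate-limit its requests.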
    
