Designing a web crawler

醉梦人生 2020-12-04 04:54

I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How would you approach this?

10 Answers
  •  鱼传尺愫
    2020-12-04 05:19

    While everybody here has already suggested how to create your web crawler, here is how Google ranks pages.

    Google gives each page a rank based on the number of backlinks (how many links on other websites point to a specific website/page). This is called the relevance score. It rests on the observation that if many other pages link to a page, that page is probably important.

    Each site/page is viewed as a node in a graph, and links to other pages are directed edges. The in-degree of a vertex is the number of incoming edges, and nodes with a higher in-degree are ranked higher.

    Here's how the PageRank is determined. Suppose that page Pj has Lj outgoing links. If one of those links points to page Pi, then Pj passes on 1/Lj of its importance to Pi. The importance of Pi is then the sum of all the contributions made by pages linking to it. So if we denote the set of pages linking to Pi by Bi, we have this formula:

        Importance(Pi) = sum( Importance(Pj)/Lj ) for all pages Pj in Bi
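
    As a minimal sketch of that summation (the dict-based link graph and the names below are my own toy example, not from the answer), one update step in Python could look like this:

        # pages maps each page to the list of pages it links to
        pages = {
            "A": ["B", "C"],
            "B": ["C"],
            "C": ["A"],
        }

        # start from a uniform importance distribution
        importance = {p: 1.0 / len(pages) for p in pages}

        def update(importance):
            """One step: Importance(Pi) = sum(Importance(Pj)/Lj for Pj in Bi)."""
            new = {p: 0.0 for p in pages}
            for pj, links in pages.items():
                lj = len(links)                      # Lj: number of outgoing links of Pj
                for pi in links:
                    new[pi] += importance[pj] / lj   # Pj passes 1/Lj of its score to Pi
            return new

        importance = update(importance)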
    

    These contributions are collected in a matrix called the hyperlink matrix: H.

    The entry H[i,j] is 1/Lj if page Pj links to page Pi, and 0 otherwise. A consequence of this definition is that each column of H sums to 1 (H is column-stochastic).

    Now we need to find an eigenvector of H, call it I, with eigenvalue 1, i.e. a vector satisfying:

        I = H * I
    

    Now we start iterating: I, H*I, H^2*I, ..., H^k*I, until the solution converges, i.e. we get pretty much the same numbers in the vector at step k and step k+1.
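
    A sketch of that power iteration with NumPy (the 3-page matrix H below is a made-up example, not from the answer):

        import numpy as np

        # Column-stochastic hyperlink matrix for pages A, B, C, where
        # A links to B and C, B links to C, and C links to A.
        # H[i, j] = 1/Lj if page j links to page i, else 0.
        H = np.array([
            [0.0, 0.0, 1.0],   # A receives from C (Lc = 1)
            [0.5, 0.0, 0.0],   # B receives from A (La = 2)
            [0.5, 1.0, 0.0],   # C receives from A and B
        ])

        I = np.full(3, 1.0 / 3.0)      # uniform starting vector
        for _ in range(100):
            I_next = H @ I             # one step: I_(k+1) = H * I_k
            if np.allclose(I, I_next, atol=1e-10):
                break                  # entries stopped changing: converged
            I = I_next

        print(I)                       # stationary importance of A, B, C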

    Now whatever is left in the I vector is the importance of each page.

    For a simple class homework example see http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

    As for solving the duplicate issue in your interview question: compute a checksum over the entire page and use either that or a hash of the checksum as your key in a map to keep track of visited pages.
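
    A minimal sketch of that visited-page check (hashlib and the set-based "map" are my choice of tools, not prescribed by the answer):

        import hashlib

        seen_checksums = set()

        def is_duplicate(page_content: bytes) -> bool:
            """Checksum the entire page; treat identical content as already visited."""
            checksum = hashlib.sha256(page_content).hexdigest()
            if checksum in seen_checksums:
                return True
            seen_checksums.add(checksum)
            return False

    The crawler would call is_duplicate(body) before enqueueing a page's outgoing links, which breaks loops through pages that serve the same content under different URLs.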
