Using Nutch to crawl a specified URL list

星月不相逢 2021-01-16 06:32

I have a list of one million URLs to fetch. I use this list as the Nutch seeds and run the basic Nutch crawl command to fetch them. However, I find that Nutch auto

2 answers
  •  南方客 (OP)
    2021-01-16 06:56

    • Delete the crawl and urls directories (if they were created before)
    • Create or update the seed file (URLs listed one per line)
    • Restart the crawling process

    Command

    nutch crawl urllist -dir crawl -depth 3 -topN 1000000
    
    • urllist - Directory that contains the seed file (the URL list)
    • crawl - Directory where the crawl data is stored (the -dir argument)
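
The steps above could be scripted roughly as follows. This is only a sketch: the directory names urllist and crawl come from the command above, while the file name seed.txt and the two example URLs are placeholders I made up for illustration. The nutch call itself is left as a comment because it needs a Nutch 1.x installation that provides the crawl command.

```shell
#!/bin/sh
# Sketch only: urllist and crawl follow the command in the answer;
# seed.txt and the example URLs are hypothetical placeholders.
set -eu

SEED_DIR=urllist
CRAWL_DIR=crawl

# Step 1: remove the output of any earlier run.
rm -rf "$CRAWL_DIR" "$SEED_DIR"

# Step 2: recreate the seed directory with one URL per line.
mkdir -p "$SEED_DIR"
cat > "$SEED_DIR/seed.txt" <<'EOF'
http://example.com/
http://example.org/
EOF

# Report how many seed URLs were written.
SEED_COUNT=$(wc -l < "$SEED_DIR/seed.txt" | tr -d ' ')
echo "seeds: $SEED_COUNT"

# Step 3: restart the crawl with the command from the answer
# (uncomment once nutch is on your PATH):
# nutch crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth 3 -topN 1000000
```

In practice you would copy your real URL list into the seed directory instead of using the here-document.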

    If the problem still persists, try deleting your Nutch folder and restarting the whole process.
