How to read the content of a website?

时光毁灭记忆、已成空白 submitted on 2019-12-25 04:39:53

Question


I want to read the content of a website and store it in a file using C# and ASP.NET. I know the page itself can be read with HttpWebRequest. But is it also possible to read the content behind all the available links on that page?

Ex: suppose I want to read http://www.msn.com. I can give the URL directly and read the home page data, which is no issue. But the msn.com home page contains many links, and I want to read the content of those pages as well. Is that possible?

Can somebody give me a starting point for doing this?

Thanks in advance


Answer 1:


  1. Define a queue of URLs.

  2. Add the main page URL to the queue.

  3. While the queue is not empty:

     3.1 currentUrl = Dequeue()

     3.2 Read the current URL.

     3.3 Extract all URLs from the current page using a regexp.

     3.4 Add all extracted URLs to the queue.

You will have to limit the URLs in the queue to some depth or to some domain; otherwise you will try to download the entire internet :)
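The loop above can be sketched in C# roughly as follows. This is only a minimal illustration, not a production crawler: the depth limit, the domain filter, the output file names, and the use of `HttpClient` (a more modern alternative to `HttpWebRequest`) are all assumptions added for the example, and the regex-based link extraction is deliberately simplistic.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class SimpleCrawler
{
    static readonly HttpClient Client = new HttpClient();

    // Naive href extractor; real pages may need an HTML parser instead.
    static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*[\"'](https?://[^\"']+)[\"']", RegexOptions.IgnoreCase);

    static async Task Main()
    {
        var queue = new Queue<(string Url, int Depth)>();
        var visited = new HashSet<string>();
        const int maxDepth = 2;              // assumed depth limit
        const string domain = "www.msn.com"; // assumed domain filter

        queue.Enqueue(("http://www.msn.com", 0));
        while (queue.Count > 0)
        {
            var (url, depth) = queue.Dequeue();
            if (depth > maxDepth || !visited.Add(url))
                continue; // skip pages beyond the depth limit or already seen

            string html;
            try { html = await Client.GetStringAsync(url); }
            catch (HttpRequestException) { continue; } // skip unreachable pages

            // Store each page's content in its own file.
            File.WriteAllText($"page_{visited.Count}.html", html);

            // Extract links and enqueue the ones within the chosen domain.
            foreach (Match m in HrefPattern.Matches(html))
            {
                var link = m.Groups[1].Value;
                if (link.Contains(domain))
                    queue.Enqueue((link, depth + 1));
            }
        }
    }
}
```

Parsing HTML with a regex is fragile (it misses relative links and breaks on unusual markup); a library such as HtmlAgilityPack is the usual choice for the extraction step if this grows beyond an experiment.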



Source: https://stackoverflow.com/questions/1531106/how-to-read-the-content-of-a-website
