Best way to fingerprint and verify HTML structure

Submitted by 半世苍凉 on 2019-12-25 01:53:24

Question


I would like your opinions on how to fingerprint and verify the HTML/link structure of a page.

The problem I want to solve is this: fingerprint, for example, 10 different sites (HTML pages), and after some time be able to verify them. If a site has changed (for example, its links have changed), verification fails; otherwise verification succeeds. My basic idea is to analyze the link structure by splitting it in some way, building some kind of tree, and generating some kind of code from that tree. But I'm still at the brainstorming stage, where I need to discuss this with someone and hear other ideas.

So any ideas, algorithms, and suggestions would be useful.


Answer 1:


Whatever data or structure you intend to hash, summarize, or otherwise fingerprint, be sure to account for the various forms of noise found on many websites out there.

Examples of such noise or random content are:

  • Company stock value tickers
  • Weather conditions for whatever city they are in
  • A current (now) date-time somewhere in the footer or header, which several pages have
  • Advertisement content (more and more, these are made to look native to the site to defeat ad blockers in web browsers)
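
One way to sidestep text-level noise such as tickers, weather, and timestamps is to fingerprint only the tag skeleton of a page. The following is a minimal sketch using Python's standard library, not a complete solution: the names `StructureCollector` and `structure_fingerprint` are hypothetical, and the choice of SHA-256 is an assumption. It ignores all text and attribute values, so it will not help when an ad changes the markup itself.

```python
import hashlib
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Records the sequence of opening and closing tags only.

    Text nodes and attribute values are ignored, so changing numbers,
    dates, or ad copy does not affect the collected tokens.
    (Hypothetical helper for illustration.)
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")


def structure_fingerprint(html):
    """Return a SHA-256 hex digest of the page's tag skeleton."""
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    skeleton = "".join(collector.tokens)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()
```

Two snapshots that differ only in a stock price or a footer timestamp produce the same digest, while adding or removing an element changes it.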



Answer 2:


You could always hash the raw HTML of the site and compare it. I believe sites can maintain a "last edited" date, but I am not sure whether it is always updated.

Edit: My mistake, this is simply a way to compare the website to a previous version, not really a way to fingerprint it in the way you mean.
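
For reference, hashing the raw HTML could look roughly like the sketch below. The function names `raw_fingerprint` and `verify` are hypothetical, SHA-256 is an assumed choice of hash, and fetching the page is left out. Any byte-level difference, including a changed timestamp, makes verification fail.

```python
import hashlib


def raw_fingerprint(html_bytes):
    """Return a SHA-256 hex digest of the page exactly as downloaded."""
    return hashlib.sha256(html_bytes).hexdigest()


def verify(html_bytes, stored_digest):
    """True if the current page bytes match the stored fingerprint."""
    return raw_fingerprint(html_bytes) == stored_digest
```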




Answer 3:


Just throwing this out there:

Why don't you crawl the site, putting all the links into an XML document that would represent the map of the site?

Create an MD5 checksum of that file and store it. Then, any time in the future, you could recrawl, recreate the XML, redo the checksum, and compare it to your earlier checksum.

If they don't match, the link structure has changed, although you won't necessarily know where.
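
The checksum idea can be sketched for a single page as follows. This is an illustration under some assumptions, not the answer's exact method: it stores the links as a sorted plain-text list instead of an XML document, it handles one page's HTML and leaves the crawling out, and the names `LinkCollector`, `collect_links`, and `link_checksum` are hypothetical. MD5 is used because the answer suggests it.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute, fragment-free targets of <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts so the
                # same target always yields the same string.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute)


def collect_links(html, base_url):
    """Return the set of normalized link targets found in one page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


def link_checksum(links):
    """MD5 over the sorted links, so link order on the page is irrelevant."""
    joined = "\n".join(sorted(links))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

If the link sets themselves are stored alongside the checksum, a set difference such as `new_links - old_links` shows which links were added, which addresses the "won't know where" limitation.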



Source: https://stackoverflow.com/questions/1490686/best-way-to-fingerprint-and-verify-html-structure
