Counting words on an HTML web page using PHP

难免孤独 2020-12-30 10:40

I need a PHP script which takes a URL of a web page and then echoes how many times a word is mentioned.

Example

This is a generic HTML page:

5 answers
  •  春和景丽
    2020-12-30 11:08

    This is a complex job that you should not attempt on your own.

    You have to extract the text that is not part of tags or comments and is not a child of elements such as script and style. For this, you'll also need a lax HTML parser (like the one implemented in libxml2 and used in DOMDocument).
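
    As a rough illustration of that extraction step, here is a minimal sketch built on DOMDocument and DOMXPath. The function name extractVisibleText and the sample markup are made up for this example, and it assumes the page is already available as a string whose encoding libxml2 can detect. Joining text nodes with a space is a simplification that can split a word broken up by inline tags.

    ```php
    <?php
    // Hypothetical helper (not from the original answer): returns the text of
    // an HTML string, skipping comments and the contents of <script>/<style>.
    function extractVisibleText(string $html): string
    {
        $doc = new DOMDocument();
        // Collect parser warnings internally; real-world markup is rarely valid.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        // text() never matches comment nodes; the predicate drops script/style.
        $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

        $parts = [];
        foreach ($nodes as $node) {
            $text = trim($node->nodeValue);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
        // Assumption: a single space between text nodes is good enough here.
        return implode(' ', $parts);
    }

    // Sample markup invented for this sketch.
    $html = '<html><head><style>p{color:red}</style></head>'
          . '<body><p>Hello <b>world</b></p><!-- hidden --><script>var x = 1;</script></body></html>';
    echo extractVisibleText($html), "\n";
    ```

    For a live page, the string could come from something like file_get_contents($url), provided allow_url_fopen is enabled, with error handling added for failed requests.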

    Then you have to tokenize the text, which presents its own challenges. Finally, you'd be interested in some form of stemming before proceeding to count the terms.
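
    To make the tokenizing and counting steps concrete, here is a deliberately naive sketch. The function name countWords and the sample sentence are hypothetical. It splits on anything that is not a Unicode letter or digit, lowercases the input, and does no stemming at all, so "cat" and "cats" are counted separately and "don't" becomes two tokens.

    ```php
    <?php
    // Hypothetical helper (not from the original answer): maps each lowercase
    // word in $text to the number of times it occurs.
    function countWords(string $text): array
    {
        // Prefer mbstring for non-ASCII text; fall back if it is not installed.
        $lower = function_exists('mb_strtolower')
            ? mb_strtolower($text, 'UTF-8')
            : strtolower($text);

        // Naive tokenizer: split on runs of non-letter, non-digit characters.
        $tokens = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
        if ($tokens === false) {
            return []; // e.g. the input was not valid UTF-8
        }
        return array_count_values($tokens);
    }

    // Sample sentence invented for this sketch.
    $counts = countWords('The cat saw the other cat. The end.');
    echo $counts['the'] ?? 0, "\n"; // number of times "the" occurs
    ```

    Looking up a single word is then an array access with a default of 0 for words that never occur. A real solution would replace the regular expression with a proper tokenizer and add a stemmer, which is where the specialized tools come in.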

    I recommend you use specialized tools for this. I haven't used any of these, but you can try HTMLParser for parsing and Lucene for tokenization/stemming (the purpose of Lucene is Text Retrieval, but those operations are necessary for building the index).
