I need a PHP script which takes a URL of a web page and then echoes how many times a word is mentioned.
This is a generic HTML page:
This is a complex job that you should not attempt on your own.
You have to extract text that is not part of tags or comments and is not a child of elements such as script and style. For this, you'll also need a lax HTML parser (like the one implemented in libxml2 and used in DOMDocument).
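To give an idea of what the extraction step involves, here is a minimal sketch using DOMDocument and DOMXPath. The function name extractVisibleText is my own, and the sketch assumes the HTML is already in a string and is not empty; it ignores character-encoding detection, which loadHTML does not handle well on its own.

```php
<?php
// Hypothetical helper: returns the text of an HTML document, skipping
// comments and the contents of script and style elements.
function extractVisibleText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml2's parse warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that has no script or style ancestor.
    // Comments are a different node type, so text() never matches them.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // Join with a space so text from adjacent elements does not run together.
    return implode(' ', $parts);
}
```

Joining with a space keeps the words of neighbouring paragraphs apart, at the cost of splitting a word that is only partly wrapped in an inline tag (for example, a word with its first letters in bold).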
Then you have to tokenize the text, which presents its own challenges. Finally, you'd be interested in some form of stemming before proceeding to count the terms.
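As a rough illustration of the tokenizing and counting steps, the sketch below splits plain text on anything that is not a letter, digit, or apostrophe and counts exact, case-insensitive matches. The function name countWord is made up for this example, it assumes UTF-8 input and the mbstring extension, and it does no stemming at all, so "cats" is not counted as "cat".

```php
<?php
// Hypothetical helper: counts how often $word occurs in $text as a whole token.
function countWord(string $text, string $word): int
{
    // Lowercase first so the comparison is case-insensitive.
    $lower = mb_strtolower($text, 'UTF-8');

    // Split on runs of characters that are not letters, digits, or apostrophes.
    // The /u modifier makes \p{L} and \p{N} match Unicode letters and numbers.
    $tokens = preg_split("/[^\\p{L}\\p{N}']+/u", $lower, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        // preg_split fails on invalid UTF-8; treat that as "no matches".
        return 0;
    }

    // Build a token => frequency map and look up the requested word.
    $counts = array_count_values($tokens);
    return $counts[mb_strtolower($word, 'UTF-8')] ?? 0;
}

echo countWord('The cat sat. The CAT ran; cats purr.', 'cat'), "\n"; // prints 2
```

This naive split is exactly where the challenges show up: hyphenated words, contractions, and languages without spaces between words all need rules that a single regular expression does not capture.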
I recommend you use specialized tools for this. I haven't used any of these, but you can try HTMLParser for parsing and Lucene for tokenization/stemming (the purpose of Lucene is Text Retrieval, but those operations are necessary for building the index).