Counting words on an HTML web page using PHP

难免孤独 2020-12-30 10:40

I need a PHP script which takes a URL of a web page and then echoes how many times a word is mentioned.

Example

This is a generic HTML page:

5 answers
  •  轻奢々 (original poster)
    2020-12-30 11:29

    The script below reads the contents of the remote URL, removes the HTML tags, and counts the occurrences of each unique word in the remaining text.

    Caveat: In your expected output, "This" has a value of 2, but the script below is case-sensitive, so "this" and "This" are recorded as separate words. You could convert the whole input string to lower case before processing if the original case is not significant for your purposes.
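A minimal sketch of that lower-casing step, run on the sample sentence rather than a fetched page. The helper name `count_words_ignoring_case` is made up for illustration and is not part of the original script.

```php
<?php
// Hypothetical helper: counts words case-insensitively by lower-casing first.
function count_words_ignoring_case(string $text): array
{
    // strtolower() covers ASCII letters; other alphabets would need mb_strtolower().
    $words = str_word_count(strtolower($text), 1);
    return array_count_values($words);
}

$counts = count_words_ignoring_case(
    'This is the title some description text here, this is a word.'
);

// "This" and "this" now share a single lower-case key.
echo $counts['this'], "\n";
```

The trade-off is that the original capitalisation is lost in the result keys, which only matters if the case is significant for your purposes.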

    Additionally, as only a basic strip_tags is run on the input, malformed tags will not be removed, so the assumption is that your source HTML is valid.
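If the source HTML cannot be trusted to be valid, one possible alternative (not what the script in this answer does) is to let PHP's DOM extension parse the markup and read the text from the resulting tree. The sketch below assumes the DOM extension is installed; `html_to_text` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: extracts page text with DOMDocument instead of strip_tags().
function html_to_text(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally so malformed markup does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop elements whose contents are not visible page text.
    foreach (['script', 'style', 'head'] as $tag) {
        // Copy the live node list first, because removing nodes changes it.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->documentElement ? $doc->documentElement->textContent : '';
}

// The <b> tag is never closed, yet the parser still recovers the text.
$text = html_to_text('<html><body><p>some text here, <b>this is a word.</p></body></html>');
print_r(array_count_values(str_word_count($text, 1)));
```

Note that loadHTML() assumes ISO-8859-1 unless the document declares another encoding, so non-ASCII pages may need extra handling.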

    Edit: Charlie points out in the comments that things like the head section would still be counted. With the help of a snippet from the user notes on the strip_tags manual page, these are now removed as well.

    generichtml.com

    
    
    

    This is the title

    some description text here, this is a word.

    parser.php

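    <?php
    // URL of the page to analyse (placeholder; replace with the real address)
    $htmlurl = 'http://generichtml.com';

    // Fetch remote html
    $contents = file_get_contents($htmlurl);
    if ($contents === false) {
        exit("Could not fetch $htmlurl\n");
    }

    // Get rid of style, script etc
    $search = array('@<script\b[^>]*>.*?</script>@si',  // Strip out javascript
                    '@<head\b[^>]*>.*?</head>@si',      // Lose the head section
                    '@<style\b[^>]*>.*?</style>@si',    // Strip style tags properly
                    '@<![\s\S]*?--[ \t\n\r]*>@'         // Strip multi-line comments including CDATA
    );

    $contents = preg_replace($search, '', $contents);

    // Strip the remaining tags, split the text into words, and count each unique word
    $result = array_count_values(
                  str_word_count(
                      strip_tags($contents), 1
                      )
                  );

    print_r($result);
    ?>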

    Output:

    Array
    (
        [This] => 1
        [is] => 2
        [the] => 1
        [title] => 1
        [some] => 1
        [description] => 1
        [text] => 1
        [here] => 1
        [this] => 1
        [a] => 1
        [word] => 1
    )
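The question asks how often one particular word is mentioned, so the per-word array can be reduced to a single lookup. The following is a sketch under stated assumptions: `count_word_in_html` is a hypothetical helper, the sample markup is invented for illustration, and the comparison is made case-insensitive by lower-casing both sides.

```php
<?php
// Hypothetical helper: how many times does $word occur in the visible text of $html?
function count_word_in_html(string $html, string $word): int
{
    // Same clean-up idea as the answer's script: drop script, head and style blocks.
    $search = array('@<script\b[^>]*>.*?</script>@si',
                    '@<head\b[^>]*>.*?</head>@si',
                    '@<style\b[^>]*>.*?</style>@si');
    $text = strip_tags(preg_replace($search, '', $html));

    // Lower-case both sides so that "This" and "this" are counted together.
    $counts = array_count_values(str_word_count(strtolower($text), 1));
    return $counts[strtolower($word)] ?? 0;
}

// Inline sample page (an assumption for illustration; a real page could be
// loaded with file_get_contents($url) instead).
$html = <<<'HTML'
<html>
<head><title>This is the title</title></head>
<body>
<h1>This is the title</h1>
<p>some description text here, <b>this</b> is a word.</p>
</body>
</html>
HTML;

echo count_word_in_html($html, 'this'), "\n";
```

Words that never appear simply yield 0, so the caller does not need to check whether the key exists.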
    
