Flesch-Kincaid Readability: Improve PHP function

问题

I wrote this PHP code to implement the Flesch-Kincaid Readability Score as a function:

function readability($text) {
    $total_sentences = 1; // one full stop = two sentences => start with 1
    $punctuation_marks = array('.', '?', '!', ':');
    foreach ($punctuation_marks as $punctuation_mark) {
        $total_sentences += substr_count($text, $punctuation_mark);
    }
    $total_words = str_word_count($text);
    $total_syllable = 3; // assuming this value since I don't know how to count them
    $score = 206.835-(1.015*$total_words/$total_sentences)-(84.6*$total_syllables/$total_words);
    return $score;
}

Do you have suggestions how to improve the code? Is it correct? Will it work?

I hope you can help me. Thanks in advance!

回答1:

The code looks fine as far as a heuristic goes. Here are some points to consider that make the items you need to calculate considerably difficult for a machine:

What is a sentence?

Seriously, what is a sentence? We have periods, but they can also be used for Ph.D., e.g., i.e., Y.M.C.A., and other non-sentence-final purposes. When you consider exclamation points, question marks, and ellipses, you're really doing yourself a disservice by assuming a period will do the trick. I've looked at this problem before, and if you really want a more reliable count of sentences in real text, you'll need to parse the text. This can be computationally intensive, time-consuming, and hard to find free resources for. In the end, you still have to worry about the error rate of the particular parser implementation. However, only full parsing will tell you what's a sentence and what's just a period's other many uses. Furthermore, if you're using text 'in the wild' -- such as, say, HTML -- you're going to also have to worry about sentences ending not with punctuation but with tag endings. For instance, many sites don't add punctuation to h1 and h2 tags, but they're clearly different sentences or phrases.
Syllables aren't something we should be approximating

This is a major hallmark of this readability heuristic, and it's one that makes it the most difficult to implement. Computational analysis of syllable count in a work requires the assumption that the assumed reader speaks in the same dialect as whatever your syllable count generator is being trained on. How sounds fall around a syllable is actual a major part of what makes accents accents. If you don't believe me, try visiting Jamaica sometime. What this means it that even if a human were to do the calculations for this by hand, it would still be a dialect-specific score.
What is a word?

Not to wax psycholingusitic in the slightest, but you will find that space-separated words and what are conceptualized as words to a speaker are quite different. This will make the concept of a computable readability score somewhat questionable.

So in the end, I can answer your question of 'will it work'. If you're looking to take a piece of text and display this readability score among other metrics to offer some kind of conceivable added value, the discerning user will not bring up all of these questions. If you are trying to do something scientific, or even something pedagogical (as this score and those like it were ultimately intended), I wouldn't really bother. In fact, if you're going to use this to make any kind of suggestions to a user about content that they have generated, I would be extremely hesitant.

A better way to measure reading difficulty of a text would more likely be something having to do with the ratio of low-frequency words to high-frequency words along with the number of hapax legomena in the text. But I wouldn't pursue actually coming up with a heuristic like this, because it would be very difficult to empirically test anything like it.

回答2:

Take a look at the PHP Text Statistics class on GitHub.

回答3:

Please have a look at following two classes and its usage information. It will surely help you.

Readability Syllable Count Pattern Library Class:

<?php class ReadabilitySyllableCheckPattern {

public $probWords = [
    'abalone' => 4,
    'abare' => 3,
    'abed' => 2,
    'abruzzese' => 4,
    'abbruzzese' => 4,
    'aborigine' => 5,
    'acreage' => 3,
    'adame' => 3,
    'adieu' => 2,
    'adobe' => 3,
    'anemone' => 4,
    'apache' => 3,
    'aphrodite' => 4,
    'apostrophe' => 4,
    'ariadne' => 4,
    'cafe' => 2,
    'calliope' => 4,
    'catastrophe' => 4,
    'chile' => 2,
    'chloe' => 2,
    'circe' => 2,
    'coyote' => 3,
    'epitome' => 4,
    'forever' => 3,
    'gethsemane' => 4,
    'guacamole' => 4,
    'hyperbole' => 4,
    'jesse' => 2,
    'jukebox' => 2,
    'karate' => 3,
    'machete' => 3,
    'maybe' => 2,
    'people' => 2,
    'recipe' => 3,
    'sesame' => 3,
    'shoreline' => 2,
    'simile' => 3,
    'syncope' => 3,
    'tamale' => 3,
    'yosemite' => 4,
    'daphne' => 2,
    'eurydice' => 4,
    'euterpe' => 3,
    'hermione' => 4,
    'penelope' => 4,
    'persephone' => 4,
    'phoebe' => 2,
    'zoe' => 2
];

public $addSyllablePatterns = [
    "([^s]|^)ia",
    "iu",
    "io",
    "eo($|[b-df-hj-np-tv-z])",
    "ii",
    "[ou]a$",
    "[aeiouym]bl$",
    "[aeiou]{3}",
    "[aeiou]y[aeiou]",
    "^mc",
    "ism$",
    "asm$",
    "thm$",
    "([^aeiouy])\1l$",
    "[^l]lien",
    "^coa[dglx].",
    "[^gq]ua[^auieo]",
    "dnt$",
    "uity$",
    "[^aeiouy]ie(r|st|t)$",
    "eings?$",
    "[aeiouy]sh?e[rsd]$",
    "iell",
    "dea$",
    "real",
    "[^aeiou]y[ae]",
    "gean$",
    "riet",
    "dien",
    "uen"
];

public $prefixSuffixPatterns = [
    "^un",
    "^fore",
    "^ware",
    "^none?",
    "^out",
    "^post",
    "^sub",
    "^pre",
    "^pro",
    "^dis",
    "^side",
    "ly$",
    "less$",
    "some$",
    "ful$",
    "ers?$",
    "ness$",
    "cians?$",
    "ments?$",
    "ettes?$",
    "villes?$",
    "ships?$",
    "sides?$",
    "ports?$",
    "shires?$",
    "tion(ed)?$"
];

public $subSyllablePatterns = [
    "cia(l|$)",
    "tia",
    "cius",
    "cious",
    "[^aeiou]giu",
    "[aeiouy][^aeiouy]ion",
    "iou",
    "sia$",
    "eous$",
    "[oa]gue$",
    ".[^aeiuoycgltdb]{2,}ed$",
    ".ely$",
    "^jua",
    "uai",
    "eau",
    "[aeiouy](b|c|ch|d|dg|f|g|gh|gn|k|l|ll|lv|m|mm|n|nc|ng|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y|z)e$",
    "[aeiouy](b|c|ch|dg|f|g|gh|gn|k|l|lch|ll|lv|m|mm|n|nc|ng|nch|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|th|v|y|z)ed$",
    "[aeiouy](b|ch|d|f|gh|gn|k|l|lch|ll|lv|m|mm|n|nch|nn|p|r|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y)es$",
    "^busi$"
]; } ?>

Another class which is readability algorithm class having two methods to calculate score:

<?php class ReadabilityAlgorithm {
function countSyllable($strWord) {
    $pattern = new ReadabilitySyllableCheckPattern();
    $strWord = trim($strWord);

    // Check for problem words
    if (isset($pattern->{'probWords'}[$strWord])) {
        return $pattern->{'probWords'}[$strWord];
    }

    // Check prefix, suffix
    $strWord = str_replace($pattern->{'prefixSuffixPatterns'}, '', $strWord, $tmpPrefixSuffixCount);

    // Removed non word characters from word
    $arrWordParts = preg_split('`[^aeiouy]+`', $strWord);
    $wordPartCount = 0;
    foreach ($arrWordParts as $strWordPart) {
        if ($strWordPart <> '') {
            $wordPartCount++;
        }
    }
    $intSyllableCount = $wordPartCount + $tmpPrefixSuffixCount;

    // Check syllable patterns 
    foreach ($pattern->{'subSyllablePatterns'} as $strSyllable) {
        $intSyllableCount -= preg_match('`' . $strSyllable . '`', $strWord);
    }

    foreach ($pattern->{'addSyllablePatterns'} as $strSyllable) {
        $intSyllableCount += preg_match('`' . $strSyllable . '`', $strWord);
    }

    $intSyllableCount = ($intSyllableCount == 0) ? 1 : $intSyllableCount;
    return $intSyllableCount;
}

function calculateReadabilityScore($stringText) {
    # Calculate score
    $totalSentences = 1;
    $punctuationMarks = array('.', '!', ':', ';');

    foreach ($punctuationMarks as $punctuationMark) {
        $totalSentences += substr_count($stringText, $punctuationMark);
    }

    // get ASL value
    $totalWords = str_word_count($stringText);
    $ASL = $totalWords / $totalSentences;

    // find syllables value
    $syllableCount = 0;
    $arrWords = explode(' ', $stringText);
    $intWordCount = count($arrWords);
    //$intWordCount = $totalWords;

    for ($i = 0; $i < $intWordCount; $i++) {
        $syllableCount += $this->countSyllable($arrWords[$i]);
    }

    // get ASW value
    $ASW = $syllableCount / $totalWords;

    // Count the readability score
    $score = 206.835 - (1.015 * $ASL) - (84.6 * $ASW);
    return $score;
} } ?>

// Example: how to use

<?php // Create object to count readability score
$readObj = new ReadabilityAlgorithm();
echo $readObj->calculateReadabilityScore("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into: electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently; with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum!");
?>

回答4:

I actually don't see any problems with that code. Of course, it could be optimized a bit if you really wanted to by replacing all the different functions with a single counting loop. However, I'd strongly argue that it isn't necessary and even outright wrong. Your current code is very readable and easy to understand, and any optimizations would probably make things worse from that perspective. Use it as it is, and don't try to optimize it unless it actually turns out to be a performance bottleneck.

来源：https://stackoverflow.com/questions/1076802/flesch-kincaid-readability-improve-php-function

标签

php

formula

readability

flesch-kincaid