How to extract citations from a text (PHP)?

白昼怎懂夜的黑 提交于 2019-12-12 13:45:18

问题


Hello!

I would like to extract all citations from a text. Additionally, the name of the cited person should be extracted. DayLife does this very well.

Example:

“They think it’s ‘game over,’ ” one senior administration official said.

The phrase They think it's 'game over' and the cited person one senior administration official should be extracted.

Do you think that's possible? You can only distinguish between citations and words in quotes if you check whether there's a cited person mentioned.

Example:

“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.

The passage State of the Union is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are less than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.

How to start?

I would first replace all types of quotes by a single type so that you'll have to check for only one quote mark later.

<?php
$text = '';
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>

Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:

<?php
function extract_quotations($text) {
   $result = preg_match_all('/"([^"]+)"/', $text, $found_quotations);
   if ($result == TRUE) {
      return $found_quotations;
      // check for count of blank spaces
   }
   return array();
}
?>

How could you improve this?

I hope you can help me. Thank you very much in advance!


回答1:


As ceejayoz already pointed out, this won't fit into a single function. What you're describing in your question (detecting grammatical function of a quote-escaped part of a sentence - i.e. “I think it is serious and it is deteriorating,” vs "State of the Union") would be best solved with a library that can break down natural language into tokens. I am not aware of any such library in PHP, but you can have a look at the project size of something you would use in python: http://www.nltk.org/

I think the best you can do is define a set of syntax rules that you verify manually. What about something like this:

abstract class QuotationExtractor {

    protected static $instances;

    public static function getAllPossibleQuotations($string) {
        $possibleQuotations = array();
        foreach (self::$instances as $instance) {
            $possibleQuotations = array_merge(
                $possibleQuotations,
                $instance->extractQuotations($string)
            );
        }
        return $possibleQuotations;
    }

    public function __construct() {
        self::$instances[] = $this;
    }

    public abstract function extractQuotations($string);

}

class RegexExtractor extends QuotationExtractor {

    protected $rules;

    public function extractQuotations($string) {
        $quotes = array();
        foreach ($this->rules as $rule) {
            preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $quotes[] = array(
                    'quote' => trim($match[$rule[1]]),
                    'cited' => trim($match[$rule[2]])
                );
            }
        }
        return $quotes;
    }

    public function addRule($regex, $quoteIndex, $authorIndex) {
        $this->rules[] = array($regex, $quoteIndex, $authorIndex);
    }

}

$regexExtractor = new RegexExtractor();
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);

class AnotherExtractor extends Quot...

If you have a structure like the above you can run the same text through any/all of them and list the possible quotations to select the correct ones. I've run the code with this thread as input for testing and the result was:

array(4) {
  [0]=>
  array(2) {
    ["quote"]=>
    string(15) "Not necessarily"
    ["cited"]=>
    string(8) "ceejayoz"
  }
  [1]=>
  array(2) {
    ["quote"]=>
    string(28) "They think it's `game over,'"
    ["cited"]=>
    string(34) "one senior administration official"
  }
  [2]=>
  array(2) {
    ["quote"]=>
    string(46) "I think it is serious and it is deteriorating,"
    ["cited"]=>
    string(14) "Admiral Mullen"
  }
  [3]=>
  array(2) {
    ["quote"]=>
    string(16) "Not necessarily,"
    ["cited"]=>
    string(0) ""
  }
}



回答2:


If there are less than 3 blank spaces it won't be a quotation, right?

"Not necessarily," said ceejayoz.

The passage State of the Union is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are less than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.

b) doesn't even work for this very example - there are 3 blank spaces in "State of the Union".




回答3:


A quotation will always have punctuation--either a comma at the end, to signify that the speaker's name or title is to follow, or the end of the sentence (.!?).



来源:https://stackoverflow.com/questions/1323516/how-to-extract-citations-from-a-text-php

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!