Word breaks in text extraction, lxml XPath

Question

I want to extract words with strikethroughs, i.e. text inside <w:delText> tags. My XPath expression extracts them successfully, except that some words come out broken. For example, the word "They" appears as 'T' and 'hey'. Below is an XML sample where the problem occurs:

<w:delText xml:space="preserve">. </w:delText></w:r>
<w:r w:rsidR="0020338C" w:rsidDel="00147CFE">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
    <w:sz w:val="24"/>
  </w:rPr>
  <w:delText>T</w:delText>
</w:r>
<w:r w:rsidR="00DF6A7D" w:rsidDel="00147CFE">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
    <w:sz w:val="24"/>
  </w:rPr>
  <w:delText>hey</w:delText>
</w:r>
</w:del>
<w:ins w:id="5" w:author="Author" w:date="2014-08-13T10:08:00Z">
  <w:r w:rsidR="00147CFE">
    <w:rPr>
      <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
      <w:sz w:val="24"/>
    </w:rPr>
    <w:t xml:space="preserve"> that helps them</w:t>
  </w:r>
</w:ins>

I used the following code :

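from lxml import etree
# Returns one string per w:delText text node, so a word split across runs comes back in pieces.
find = etree.XPath("//w:p//.//*[local-name() = 'delText']//text()", namespaces={'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
list_of_deleted_words = find(lxml_tree)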

How can I fix this?

Edit:

I realized the problem only affects words that contain capital letters; words like "She" and "He" also get split.

Answer 1:

You wrote that "They" should be counted as one word rather than two, which is what your code currently produces.

The problem arises because stretches of text are arbitrarily put into several so-called "runs". In OOXML, text is organized in w:p elements (paragraphs) like this (simplified structures):

<w:p>
  <w:r>
    <w:t>Simpli</w:t>
  </w:r>
  <w:r>
    <w:t>fied structures</w:t>
  </w:r>
</w:p>

As you can see, the actual text is inside w:t elements, which are in turn inside a w:r element, or "run". Unfortunately, the division into separate runs is so haphazard that it looks entirely arbitrary. To my knowledge, nobody knows how the decision to start a new run is made.
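To illustrate how runs fragment text, here is a minimal sketch that reads the simplified paragraph above and joins its w:t fragments back together. It uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; the calls used here (fromstring, iter) also exist in lxml.etree. The names SAMPLE and paragraph_text are made up for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The simplified paragraph from above, with the namespace declared.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>Simpli</w:t></w:r>
  <w:r><w:t>fied structures</w:t></w:r>
</w:p>
"""

def paragraph_text(p):
    """Concatenate the text of every w:t in a paragraph, in document order."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))

paragraph = ET.fromstring(SAMPLE)
fragments = [t.text for t in paragraph.iter(f"{{{W}}}t")]
print(fragments)                  # ['Simpli', 'fied structures']
print(paragraph_text(paragraph))  # Simplified structures
```

Reading the fragments one by one splits "Simplified" in two, just as your XPath splits "They"; only the concatenation across runs gives the word back.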

Now, turning to your question: w:delText is inside runs, too, and there the fragmentation into runs appears to be just as arbitrary.

With your current method, there is no way of knowing if the text content of a particular w:delText ever was a whole word or not. For that, you'd have to take into account the whole sequence of runs, both the ones that contain normal text and the ones containing deleted text.
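As a rough sketch of what "taking the whole sequence of runs into account" could look like, the function below lists every run of a paragraph in document order and labels it as normal or deleted. The sample paragraph and the name run_sequence are assumptions made for this illustration, and the standard library's xml.etree.ElementTree stands in for lxml (the same iter calls work in lxml.etree).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up paragraph: normal text, a deletion split over two runs, normal text.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Before </w:t></w:r>
  <w:del>
    <w:r><w:delText>T</w:delText></w:r>
    <w:r><w:delText xml:space="preserve">hey </w:delText></w:r>
  </w:del>
  <w:r><w:t>after</w:t></w:r>
</w:p>
"""

def run_sequence(paragraph):
    """List (kind, text) per w:r in document order; kind is 'normal' or 'deleted'."""
    sequence = []
    for run in paragraph.iter(f"{{{W}}}r"):
        deleted = "".join(d.text or "" for d in run.iter(f"{{{W}}}delText"))
        normal = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        if deleted:
            sequence.append(("deleted", deleted))
        if normal:
            sequence.append(("normal", normal))
    return sequence

for kind, text in run_sequence(ET.fromstring(SAMPLE)):
    print(f"{kind:8} {text!r}")
```

With the runs laid out like this, neighbouring entries can be compared to decide whether a deleted fragment stands alone or continues the text next to it.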

Chances are that this would work, because deleted text stays in a run at the position where it was deleted. The example below uses the Word 2003 XML format, which is slightly different, but that does not matter here:

<w:r>
  <w:t>Normal Text before deletion </w:t>
</w:r>
<aml:annotation aml:id="0"
               w:type="Word.Deletion"
               aml:author="Mathias Müller"
               aml:createdate="2014-09-26T22:25:00Z">
  <aml:content>
     <w:r wsp:rsidDel="00F647B7">
        <w:delText>T</w:delText>
     </w:r>
  </aml:content>
</aml:annotation>
<aml:annotation aml:id="1"
               w:type="Word.Deletion"
               aml:author="Mathias Müller"
               aml:createdate="2014-09-26T22:24:00Z">
  <aml:content>
     <w:r wsp:rsidDel="00F647B7">
        <w:delText>hey </w:delText>
     </w:r>
  </aml:content>
</aml:annotation>
<w:r>
  <w:t>Normal Text after deletion </w:t>
</w:r>

Put another way: if there are two or more "deleted runs" in a row, with no whitespace at the boundary between them, then you know that they are parts of a single word.

As for the word boundaries:

If the deleted run is preceded by a normal run with whitespace between them (either at the end of the normal run or at the beginning of the deleted run), you know that the deleted run starts a new word.
If the deleted run is preceded by a normal run without any whitespace, you should conclude that only part of the word was deleted and that this deleted run is not a whole word.
The same applies in reverse to a deleted run that is immediately followed by a normal run, with or without whitespace between them.
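The rules above can be sketched in Python roughly as follows. This is only one possible reading of the heuristic, not a tested solution for real documents: it flattens all runs of a paragraph into characters that remember whether they were deleted, splits on whitespace, and then classifies each word as fully or partly deleted. The sample paragraph and the name deleted_words are invented for the illustration, and the standard library's xml.etree.ElementTree is used in place of lxml (the iter call behaves the same way for this purpose).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
T, DEL_T = f"{{{W}}}t", f"{{{W}}}delText"

# Made-up paragraph: "They " deleted in two runs, "ing" deleted from "Walking".
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">It works. </w:t></w:r>
  <w:del>
    <w:r><w:delText>T</w:delText></w:r>
    <w:r><w:delText xml:space="preserve">hey </w:delText></w:r>
  </w:del>
  <w:ins><w:r><w:t xml:space="preserve">That helps them. Walk</w:t></w:r></w:ins>
  <w:del><w:r><w:delText>ing</w:delText></w:r></w:del>
</w:p>
"""

def deleted_words(paragraph):
    """Return (fully_deleted, partly_deleted) word lists for one w:p element."""
    # Flatten all runs into (character, was_deleted) pairs in document order.
    chars = []
    for node in paragraph.iter():
        if node.tag in (T, DEL_T):
            chars.extend((c, node.tag == DEL_T) for c in node.text or "")

    whole, partial, word, flags = [], [], [], []
    for c, deleted in chars + [(" ", False)]:  # trailing space flushes the last word
        if not c.isspace():
            word.append(c)
            flags.append(deleted)
        elif word:
            if all(flags):        # every character came from w:delText
                whole.append("".join(word))
            elif any(flags):      # only part of the word was deleted
                partial.append("".join(word))
            word, flags = [], []
    return whole, partial

print(deleted_words(ET.fromstring(SAMPLE)))  # (['They'], ['Walking'])
```

Here 'T' and 'hey ' are merged into the single deleted word "They", while "Walking" is reported as partly deleted because "Walk" comes from a normal run with no whitespace before the deleted "ing". Note that the second list holds the full surrounding word, not just the deleted fragment.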

We all know, of course, that relying on whitespace to tell words apart is a crude method, but it might be sufficient in this case.

Source: https://stackoverflow.com/questions/26057180/word-breaks-in-text-extraction-lxml-xpath

Tags: python, xml, xpath, lxml, openxml