问题
here is sample of the text file I am working with:
<Opera>
Tristan/NNP
and/CC
Isolde/NNP
and/CC
the/DT
fatalistic/NN
horns/VBZ
The/DT
passionate/JJ
violins/NN
And/CC
ominous/JJ
clarinet/NN
;/:
The capital letters after the forward slashes are weird tags. I want to be able to search the file for something like "NNP,CC,NNP" and have the program return for this segment "Tristan and Isolde", the three words in a row that match those three tags in a row.
The problem I am having is I want the search string to be user inputed so it will always be different.
I can read the file and find one match but I do not know how to count backwards from that point to print the first word or how to find whether the next tag matches.
回答1:
It appears your source text was possibly produced by Natural Language Toolkit (nltk).
Using nltk, you could tokenize the text, split the token into (word, part_of_speech) tuples, and iterate through ngrams to find those that match the pattern:
import nltk
pattern = 'NNP,CC,NNP'
pattern = [pat.strip() for pat in pattern.split(',')]
text = '''Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ
The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:'''
tagged_token = [nltk.tag.str2tuple(word) for word in nltk.word_tokenize(text)]
for ngram in nltk.ingrams(tagged_token,len(pattern)):
if all(gram[1] == pat for gram,pat in zip(ngram,pattern)):
print(' '.join(word for word, pos in ngram))
yields
Tristan and Isolde
Related link:
- Categorizing and Tagging Words (chapter 5 of the NLTK book)
回答2:
Build a regular expression dynamically from a list of tags you want to search:
text = ("Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ "
"The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN")
tags = ["NNP", "CC", "NNP"]
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
# gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"
from re import findall
print(findall(tags_pattern, text))
回答3:
>>> import re
>>> s = "Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:"
>>> re.findall("(\w+)/NNP (\w+)/CC (\w+)/NNP", s)
[('Tristan', 'and', 'Isolde')]
Similarly, you can do what you need.
EDIT: More generalized.
>>> import re
>>> pattern = 'NNP,CC,NNP'
>>> pattern = pattern.split(",")
>>> p = ""
>>> for i in pattern:
... p = p + r"(\w+)/"+i+ r"\n"
>>> f = open("yourfile", "r")
>>> s = f.read()
>>> f.close()
>>> found = re.findall(p, s, re.MULTILINE)
>>> found #Saved in found
[('Tristan', 'and', 'Isolde')]
>>> found_str = " ".join(found[0]) #Converted to string
>>> f = open("written.txt", "w")
>>> f.write(found_str)
>>> f.close()
来源:https://stackoverflow.com/questions/9143406/navigating-text-file-searches-in-python