Question:

Here is a sample of the text file I am working with:

<Opera>  Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:

The capital letters after the forward slashes are weird tags. I want to be able to search the file for something like "NNP,CC,NNP" and have the program return for this segment "Tristan and Isolde", the three words in a row that match those three tags in a row.

The problem is that the search string will be entered by the user, so it will be different every time.
I can read the file and find one matching tag, but I do not know how to step backwards from that point to print the first word, or how to check whether the next tag also matches.

Answer 1:

Your source text appears to have been produced by the Natural Language Toolkit (nltk).

Using nltk, you could tokenize the text, split each token into a (word, part_of_speech) tuple, and iterate through the n-grams to find those that match the pattern:

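    import nltk

    pattern = 'NNP,CC,NNP'
    pattern = [pat.strip() for pat in pattern.split(',')]
    text = '''Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ
              The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:'''
    # word_tokenize needs the nltk tokenizer data to be downloaded first.
    # str2tuple turns "Tristan/NNP" into ('Tristan', 'NNP').
    tagged_token = [nltk.tag.str2tuple(word) for word in nltk.word_tokenize(text)]
    # nltk.ngrams replaces nltk.ingrams, which was removed in NLTK 3.
    for ngram in nltk.ngrams(tagged_token, len(pattern)):
        # Keep the n-gram only if every tag lines up with the pattern.
        if all(gram[1] == pat for gram, pat in zip(ngram, pattern)):
            print(' '.join(word for word, pos in ngram))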

yields

Tristan and Isolde
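
If nltk is not available, the same sliding-window idea can be sketched in plain Python. This is only an illustration and not part of the original answer: the helper name find_tag_sequences is made up here, and it assumes the text consists of whitespace-separated word/TAG tokens (anything without a slash, such as <Opera>, is skipped).

```python
def find_tag_sequences(text, pattern):
    """Return the runs of words whose tags equal the comma-separated pattern.

    Hypothetical helper; assumes whitespace-separated "word/TAG" tokens.
    """
    wanted = [tag.strip() for tag in pattern.split(",")]
    # rpartition splits on the LAST slash, so a word containing "/" still works.
    pairs = [token.rpartition("/") for token in text.split() if "/" in token]
    words = [word for word, _, _ in pairs]
    tags = [tag for _, _, tag in pairs]
    size = len(wanted)
    matches = []
    # Slide a window of len(wanted) tags across the whole text.
    for start in range(len(tags) - size + 1):
        if tags[start:start + size] == wanted:
            matches.append(" ".join(words[start:start + size]))
    return matches


sample = ("<Opera>  Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN "
          "horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:")
print(find_tag_sequences(sample, "NNP,CC,NNP"))
```

Because words and tags are kept in parallel lists, there is no need to count backwards from a match: the window start index gives the first word directly. The pattern string could just as well come from input().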

Answer 2:

Build a regular expression dynamically from a list of tags you want to search:

    from re import findall

    text = ("Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ "
            "The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN")

    tags = ["NNP", "CC", "NNP"]
    tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
    # gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"

    print(findall(tags_pattern, text))
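
One caveat with user-supplied tags: some Penn Treebank tags contain characters that are special in regular expressions (PRP$, for example), and punctuation tokens such as ;/: are not matched by \w+. A possible variation, offered as a sketch rather than as the answer's own method, escapes each tag with re.escape and uses whitespace lookarounds instead of \b. The helper name build_tags_regex is hypothetical.

```python
import re


def build_tags_regex(tags):
    """Compile a regex that captures the word in front of each tag, in order.

    Hypothetical helper, not from the original answer.
    """
    # re.escape keeps characters such as "$" in "PRP$" literal.
    parts = [r"(\S+)/" + re.escape(tag) for tag in tags]
    # (?<!\S) and (?!\S) demand whitespace or a string boundary around the
    # match, so the tag "NN" cannot match the beginning of "NNP".
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")


text = ("Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ "
        "The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:")

print(build_tags_regex(["NNP", "CC", "NNP"]).findall(text))
print(build_tags_regex(["NN", ":"]).findall(text))
```

Note that findall returns a list of tuples when the pattern has several groups and a list of plain strings when it has only one.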

Answer 3:

    >>> import re
    >>> s = "Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:"
    >>> re.findall(r"(\w+)/NNP (\w+)/CC (\w+)/NNP", s)
    [('Tristan', 'and', 'Isolde')]

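You can adapt the same approach to whatever tag sequence you need.

EDIT: A more general version that builds the regular expression from a tag string and reads the text from a file.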

    >>> import re
    >>> pattern = 'NNP,CC,NNP'
    >>> pattern = pattern.split(",")
    >>> # Join the groups with \s+ so tokens separated by spaces or by newlines both match
    >>> p = r"\s+".join(r"(\w+)/" + i for i in pattern)
    >>> f = open("yourfile", "r")
    >>> s = f.read()
    >>> f.close()
    >>> found = re.findall(p, s, re.MULTILINE)
    >>> found  # Saved in found
    [('Tristan', 'and', 'Isolde')]
    >>> found_str = " ".join(found[0])  # Converted to string
    >>> f = open("written.txt", "w")
    >>> f.write(found_str)  # Python 3 also echoes the number of characters written
    >>> f.close()
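
The session above takes only found[0], which raises an IndexError when nothing matches. As a rough sketch of how this could be wrapped up for repeated use, the function below returns every match as a string and an empty list when there is none. The name search_tagged_file is hypothetical, and the demo writes the sample line from the question to a temporary file purely for illustration.

```python
import os
import re
import tempfile


def search_tagged_file(path, pattern):
    """Return every word sequence in the file whose tags match the pattern.

    Hypothetical helper; pattern is a comma-separated string like "NNP,CC,NNP".
    """
    tags = [tag.strip() for tag in pattern.split(",") if tag.strip()]
    if not tags:
        return []
    # One capturing group per tag; (?!\S) stops "NN" from matching "NNP".
    regex = r"\s+".join(r"(\w+)/" + re.escape(tag) + r"(?!\S)" for tag in tags)
    with open(path, "r") as handle:  # closed automatically on exit
        contents = handle.read()
    return [" ".join(match.groups()) for match in re.finditer(regex, contents)]


# Demo: a temporary file holding the sample line from the question.
sample = ("<Opera>  Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN "
          "horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:\n")
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
try:
    phrases = search_tagged_file(tmp.name, "NNP,CC,NNP")
    print(phrases)
finally:
    os.remove(tmp.name)
```

In a real script the second argument would typically come from input(), and the path would point at your own tagged file.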

Source: navigating text file searches in python
