Question
This came up in another question, but I figured it is best to ask it separately. Given a large list of sentences (on the order of hundreds of thousands):
[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]
what is the best way to code the following function?
def GetSentences(word1, word2, position):
    return ""
where, given two words, word1 and word2, and a position, position, the function should return the list of all sentences satisfying that constraint. For example:
GetSentences("sentence", "another", 3)
should return sentences 1 and 3, as the indices of the matching sentences. My current approach was to use a dictionary like this:
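from collections import defaultdict

# Index[word1][word2][distance] -> indices of sentences where word2 follows word1 by `distance` words
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # start at index + 1 so that i + 1 is the distance between the two words
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)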
But this quickly blows up: on a dataset of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?
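To get a feel for why the nested dictionary grows so fast, it helps to count entries: a sentence of n words contributes roughly n(n-1)/2 (word, later word, distance) entries, one for every pair of positions. The following is only a rough sketch of that count; count_pair_entries is a hypothetical helper name, not part of the code above.

```python
def count_pair_entries(sentences):
    """Count the (word, later word, distance) entries a pair index would hold."""
    total = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Each word is paired with every word after it: n * (n - 1) / 2 pairs.
        total += n * (n - 1) // 2
    return total


sample = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
print(count_pair_entries(sample))
```

Each of those entries also carries the overhead of a list and up to three levels of dictionaries, so the index can end up far larger than the raw text.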
Answer 1:
Use a database to store the values.
- First, add all the sentences to one table (they should have IDs). You may call it e.g. sentences.
- Second, create a table with the words contained in all the sentences (call it e.g. words, and give each word an ID), saving the connection between the sentences table records and the words table records in a separate table (call it e.g. sentences_words; it should have two columns, preferably word_id and sentence_id).
When searching for sentences containing all the mentioned words, your job will be simplified:
You should first find the records in the words table whose words are exactly the ones you are searching for. The query could look like this:
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
Second, you should find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:
SELECT `sentence_id`, `word_id` FROM `sentences_words` WHERE `word_id` IN ([here goes list of words' ids]);
which could be simplified to this:
SELECT `sentence_id`, `word_id` FROM `sentences_words` WHERE `word_id` IN ( SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3') );
Filter the result within Python to return only the sentence_id values that have all the required word_id values.
This is basically a solution based on storing a large amount of data in the form best suited for it: a database.
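The steps above could be sketched with Python's built-in sqlite3 module roughly as follows. This is only an illustration under assumptions: the answer does not name a DBMS, so SQLite is a stand-in, the in-memory database and the index are my choices, and sentences_with_all is a hypothetical helper name.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite, in memory, for the demo
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (sentence_id INTEGER, word_id INTEGER);
    CREATE INDEX idx_sentences_words_word ON sentences_words (word_id);
""")

for sentence_id, sentence in enumerate(sentences):
    conn.execute("INSERT INTO sentences (id, sentence) VALUES (?, ?)",
                 (sentence_id, sentence))
    for word in set(sentence.split()):
        # Register the word once, then link it to the sentence.
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO sentences_words (sentence_id, word_id) "
            "SELECT ?, id FROM words WHERE word = ?",
            (sentence_id, word),
        )


def sentences_with_all(conn, wanted):
    """Return ids of sentences that contain every word in `wanted`."""
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words WHERE word_id IN "
        "(SELECT id FROM words WHERE word IN (%s))" % placeholders,
        list(wanted),
    )
    # Filter in Python: keep sentences that matched all the required word ids.
    found = {}
    for sentence_id, word_id in rows:
        found.setdefault(sentence_id, set()).add(word_id)
    needed = len(set(wanted))
    return sorted(sid for sid, ids in found.items() if len(ids) == needed)


print(sentences_with_all(conn, ["sentence", "another"]))
```

On the sample data this selects the two sentences that contain both words; it does not yet take the position into account.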
EDIT:
- If you only search for two words, you can do even more (almost everything) on the DBMS side.
- Since you also need the position difference, you should store the position of the word in a third column of the sentences_words table (let's call it just position), and when searching for the appropriate words, you should calculate the difference between the values associated with both words.
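One way the two-word, position-aware lookup could be pushed to the DBMS side is a self-join on sentences_words. The sketch below again assumes SQLite through Python's sqlite3 module; the index definition and the function name get_sentences are hypothetical choices of mine, not something the answer prescribes.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite, in memory, for the demo
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (
        sentence_id INTEGER, word_id INTEGER, position INTEGER
    );
    CREATE INDEX idx_sw ON sentences_words (word_id, sentence_id, position);
""")

for sentence_id, sentence in enumerate(sentences):
    for position, word in enumerate(sentence.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        # One row per word occurrence, with its position in the sentence.
        conn.execute(
            "INSERT INTO sentences_words (sentence_id, word_id, position) "
            "SELECT ?, id, ? FROM words WHERE word = ?",
            (sentence_id, position, word),
        )


def get_sentences(conn, word1, word2, distance):
    """Ids of sentences where word2 occurs `distance` words after word1."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b
          ON b.sentence_id = a.sentence_id
         AND b.position = a.position + ?
        WHERE a.word_id = (SELECT id FROM words WHERE word = ?)
          AND b.word_id = (SELECT id FROM words WHERE word = ?)
        ORDER BY a.sentence_id
        """,
        (distance, word1, word2),
    )
    return [row[0] for row in rows]


print(get_sentences(conn, "sentence", "another", 3))
```

Storing one row per word occurrence keeps the table linear in the size of the text, in contrast to an index over every pair of words.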
Answer 2:
Here's how I did it in Python. If this needs to be done more than once, though, a DBMS is the right tool for the job. Still, this seems to work pretty well for me with a million rows.
sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
]
sentences = sentences * 200 * 1000
sentencesProcessed = []

def preprocess():
    global sentences
    global sentencesProcessed
    # may want to do a regex split on whitespace
    sentencesProcessed = [sentence.split(" ") for sentence in sentences]
    # can deallocate sentences now
    sentences = None

def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        # only look at words that still have a word `position` places after them
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__":
    main()
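The comment in preprocess() about splitting on whitespace matters for the sample data: sentence 3 ends with a trailing space, so split(" ") yields an empty final token. A small illustration of the difference, using str.split() with no argument and re.split (the variable names here are only for the example):

```python
import re

sentence = "This is sentence 3 as another example "

# Splits on every single space, so the trailing space leaves an empty token.
by_space = sentence.split(" ")
# Splits on runs of whitespace and drops empty tokens at the ends.
by_whitespace = sentence.split()
# Regex variant: trim the ends first, then split on runs of whitespace.
by_regex = re.split(r"\s+", sentence.strip())

print(by_space[-1] == "")
print(by_whitespace == by_regex)
```

Both checks print True here, so either str.split() or the regex variant would avoid the stray empty word.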
Source: https://stackoverflow.com/questions/8017432/most-efficient-way-to-index-words-in-a-document