Question
I'm running the following code to process a list of documents; it is essentially just two nested for loops.
from nltk.tokenize import TreebankWordTokenizer
from gensim.models import KeyedVectors
from nlpia.loaders import get_data

word_vectors = get_data('w2v', limit=200000)

def tokenize_and_vectorize(dataset):
    tokenizer = TreebankWordTokenizer()
    vectorized_data = []
    expected = []
    for sample in dataset:
        tokens = tokenizer.tokenize(sample[1])
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])
            except KeyError:
                pass
        vectorized_data.append(sample_vecs)
        #print(1)
    return vectorized_data
Then I call the function to process the first 25,000 elements:
vectorized_data=tokenize_and_vectorize(dataset[0:25000])
However, this code seems to run forever: the * sign next to the cell never disappears. (Note: I tried running it on only 50 samples, and the results came back quickly.)
To see where it got stuck, I naively added print(1) just before return vectorized_data, so that every iteration of the loop prints a 1. With that change, all results were returned after 1 min 36 s.
A side observation about memory usage: without print(1), memory usage was high at the beginning and dropped back to a normal level after a couple of minutes. I'm not sure whether this means the process had finished even though the * sign was still showing.
What caused this issue and how do I fix it?
Answer 1:
I assume your dataset contains strings, i.e. lines of text, a book, etc. Each line is broken up into words, which are then turned into word vectors.
Processing can take a long time if your lines are very long or if you are processing a lot of lines at once.
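One way to tell a slow run from a stuck one is to report progress and elapsed time every so often instead of printing on every sample. The following is only a minimal sketch: `toy_vectors`, the `tokenize` and `report_every` parameters, and the use of `str.split` are hypothetical stand-ins so the example runs on its own. In your case you would pass the loaded `word_vectors` and `TreebankWordTokenizer().tokenize` instead.

```python
import time

# Hypothetical stand-in for the loaded word vectors (assumption for this sketch).
toy_vectors = {"good": [0.1, 0.2], "movie": [0.3, 0.4], "bad": [0.5, 0.6]}


def tokenize_and_vectorize(dataset, vectors, tokenize=str.split, report_every=1000):
    """Vectorize (label, text) samples, printing progress along the way."""
    vectorized_data = []
    total = len(dataset)
    start = time.perf_counter()
    for i, sample in enumerate(dataset, start=1):
        sample_vecs = []
        for token in tokenize(sample[1]):
            try:
                sample_vecs.append(vectors[token])
            except KeyError:
                pass  # skip tokens that have no vector
        vectorized_data.append(sample_vecs)
        # Report every `report_every` samples and once more at the very end.
        if i % report_every == 0 or i == total:
            elapsed = time.perf_counter() - start
            print(f"{i}/{total} samples, {elapsed:.1f}s elapsed", flush=True)
    return vectorized_data


# Tiny made-up dataset in the same (label, text) shape as in the question.
dataset = [(1, "good movie"), (0, "bad movie overall")] * 5
result = tokenize_and_vectorize(dataset, toy_vectors, report_every=5)
```

If the counter keeps advancing, the cell is just slow, and the elapsed time for the first few thousand samples gives a rough idea of how long all 25,000 will take.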
Regarding your question about what the '*' means (source: an answer by Gopi Kumar):
An asterisk on Jupyter cell means that cell is still waiting to run. Please check the preceding cells to see the one that is currently running. It is possible you may have an error on one of the previous cell. Also if you see a dark circle on the top right of the browser it means a cell is still executing. A clear circle means it is idle.
Source: https://stackoverflow.com/questions/59710100/python-code-non-stop-when-processing-text-documents