How can I reduce time for filtering my article dataset?

Submitted by 北城以北 on 2019-12-23 03:42:21

Question


I'm trying to filter my dataset, which contains nearly 50K articles. From each article I want to filter out stop words and punctuation. But the process is taking a long time. I've already filtered one dataset, and it took 6 hours. Now I've got another dataset to filter which contains 300K articles.

I'm using Python in an Anaconda environment. PC configuration: 7th Gen. Core i5, 8GB RAM and an NVIDIA 940MX GPU. To filter my dataset I've written a function which takes each article in the dataset, tokenizes the words, and then removes stop words, punctuation and numbers.

def sentence_to_wordlist(sentence, filters="!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!‍.'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘"):
    # Map every character in `filters` (punctuation, digits, etc.) to a space.
    translate_dict = dict((c, ' ') for c in filters)
    translate_map = str.maketrans(translate_dict)
    # Strip the filtered characters, then split the sentence into tokens.
    wordlist = sentence.translate(translate_map).split()
    # `stops` is the stop-word collection, defined elsewhere in the script.
    return list(filter(lambda x: x not in stops, wordlist))
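
For context, a minimal sketch of how this function is applied over the dataset (the names `articles` and `stops` are assumptions; the actual variables aren't shown in the question):

# Hypothetical driver: `articles` is the list of raw article strings and
# `stops` is the stop-word collection referenced inside the function.
filtered_articles = [sentence_to_wordlist(article) for article in articles]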

Now I want to reduce the time for this process. Is there any way to optimize this?


Answer 1:


I've been trying to optimize your process:

from nltk.corpus import stopwords

# Cache the stop words in a set so membership checks are O(1).
cachedStopWords = set(stopwords.words("english"))

filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!‍.'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘"
# Build the translation table once, outside the function.
translate_table = str.maketrans('', '', filters)

def sentence_to_wordlist(sentence):
    # Delete the filtered characters, then split into tokens and drop stop words.
    wordlist = sentence.translate(translate_table).split()
    return [w for w in wordlist if w not in cachedStopWords]

from multiprocessing.pool import Pool

# Process the articles in parallel across 10 worker processes.
p = Pool(10)
results = p.map(sentence_to_wordlist, data)
  • data is a list containing your articles

  • I've been using the stop words from nltk, but you can use your own stop words; just make sure they are stored in a set rather than a list (checking whether an element is in a set is O(1), while checking a list is O(n)); a quick demonstration follows this list
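
To illustrate that point, here is a standalone sketch with made-up words (not the actual data) that times the two membership tests directly:

import timeit

words = [f"word{i}" for i in range(10000)]
stop_list = words[:1000]
stop_set = set(stop_list)

# Membership test against a list scans the elements one by one: O(n) per lookup.
list_time = timeit.timeit('"word9999" in stop_list', globals=globals(), number=10000)
# Membership test against a set is a hash lookup: O(1) on average.
set_time = timeit.timeit('"word9999" in stop_set', globals=globals(), number=10000)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")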

I've been testing with a list of 100k articles, each article around 2k characters, and it took less than 9 seconds.




Answer 2:


I am not sure you can really speed up your code significantly; str.translate() is already pretty fast! It might not change much, but you could start by moving the first two lines of your function's body outside the function, so that you don't rebuild translate_map on every call.
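
For example, hoisting those lines out of the function could look like this (a sketch based on the function from the question; `stops` is assumed to be the stop-word set defined elsewhere):

# Build the translation table once, at module level, instead of on every call.
filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n?,।!‍.'0123456789০১২৩৪৫৬৭৮৯‘\u200c–“”…‘"
translate_map = str.maketrans({c: ' ' for c in filters})

def sentence_to_wordlist(sentence):
    # Replace filtered characters with spaces, split, then drop stop words.
    wordlist = sentence.translate(translate_map).split()
    return [w for w in wordlist if w not in stops]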

You might also consider using Python's multiprocessing package to run the filtering on multiple cores.
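
A minimal sketch of that wiring (assuming `articles` is the list of article strings; the __main__ guard is needed because multiprocessing starts new worker processes):

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:  # one worker per CPU core by default
        filtered = pool.map(sentence_to_wordlist, articles)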



Source: https://stackoverflow.com/questions/56713358/how-can-i-reduce-time-for-filtering-my-article-dataset
