Fastest way to build string feature vectors Python

感情迁移 提交于 2019-12-23 01:41:49

问题


I have a list of strings to search along with a list of words and want to produce boolean feature vectors indicating presence of each word within the strings. What is the fastest way to build these feature vectors in Python?

In the example below the first string would output [1, 0, 1, 1] for example. I'm currently using the Aho Corasick algorithm (Python - find occurrences of list of strings within string) to search each string in a list comprehension but think it's possible there's a faster way. The code below uses this method and averages the time over 10 runs.

import time
import numpy as np
import ahocorasick

def check_strings(A, search_list, string_to_search):
    """Use Aho Corasick algorithm to produce boolean list indicating
    prescence of strings within a longer string"""
    index_list = []
    for item in A.iter(string_to_search):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()


word_list = ["foo", "bar", "hello", "world"]
strings_to_check = ["hello world foo", "foo bar", "bar world"]

A = ahocorasick.Automaton()
for idx, s in enumerate(word_list):
    A.add_word(s, (idx, s))
A.make_automaton()

run_times = []
for i in range(10):
    t0 = time.time()
    feature_vectors = [check_strings(A, word_list, s) for s in strings_to_check]
    run_times.append(time.time()-t0)

print(feature_vectors)
print(np.mean(np.array(run_times)))

Output is:

[[1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
1.65939331055e-05

来源:https://stackoverflow.com/questions/44744895/fastest-way-to-build-string-feature-vectors-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!