n-gram

N-grams: Explanation + 2 applications

Submitted by ﹥>﹥吖頭↗ on 2019-12-03 02:48:29
Question: I want to implement some applications with n-grams (preferably in PHP). Which type of n-gram is more suitable for most purposes: a word-level or a character-level n-gram? How could you implement an n-gram tokenizer in PHP? First, I would like to know what n-grams exactly are. Is this correct? This is how I understand n-grams: Sentence: "I live in NY." Word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #". Character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve",
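A minimal sketch of the two tokenizations described above (the question asks about PHP, but the idea is language-agnostic; shown here in Python, with the "#" boundary padding taken from the example):

    def word_ngrams(sentence, n=2, pad="#"):
        """Word-level n-grams with boundary padding, as in the example above."""
        tokens = [pad] + sentence.split() + [pad]
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def char_ngrams(sentence, n=2):
        """Character-level n-grams over the raw string (spaces included)."""
        return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

    print(word_ngrams("I live in NY.", 2))  # ['# I', 'I live', 'live in', 'in NY.', 'NY. #']
    print(char_ngrams("I live in NY.", 2))  # ['I ', ' l', 'li', 'iv', 've', ...]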

Word embeddings (词向量)

Submitted by 两盒软妹~` on 2019-12-02 23:30:02
Source: https://www.numpy.org.cn/deep/basics/word2vec.html. Word embeddings. The source code for this tutorial lives in the book/word2vec directory; first-time users should consult the Book usage instructions. # Notes: This tutorial can run in CPU or GPU environments (see the CUDA/cuDNN versions supported by the Docker image). If you run the Book via Docker, note that the default image provides a GPU environment of CUDA 8/cuDNN 5; on GPUs such as the NVIDIA Tesla V100 that require CUDA 9, this image may fail to run. On consistency between the document and the scripts: to make this article easier to read and use, we split and rearranged the code of train.py and embedded it in the text. The code in this article produces the same results as train.py, which can be run directly for verification. # Background: This chapter introduces vector representations of words, also known as word embeddings. Word embeddings are a common operation in natural language processing and a basic technique behind internet services such as search engines, advertising systems, and recommender systems. In these services we frequently need to compare how related two words or two pieces of text are. To do that, we usually first represent words in a form that computers can process. The most natural choice is probably the vector space model, in which each word is represented as a real-valued vector (a one-hot vector) whose length is the size of the dictionary
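As a small, self-contained illustration of the one-hot representation mentioned at the end of the excerpt (a toy sketch, not code from the PaddlePaddle tutorial; the vocabulary below is made up):

    import numpy as np

    # Hypothetical toy dictionary; in the tutorial the dictionary is built from the corpus.
    vocab = ["search", "engine", "advert", "recommend", "word"]
    word_to_id = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """One-hot vector: length equals the dictionary size, with a single 1 at the word's index."""
        v = np.zeros(len(vocab))
        v[word_to_id[word]] = 1.0
        return v

    print(one_hot("engine"))  # [0. 1. 0. 0. 0.]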

Ngram model and perplexity in NLTK

Submitted by 送分小仙女□ on 2019-12-02 20:54:41
To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than on data preparation, I chose to use the Brown corpus from NLTK and train the Ngram model provided with NLTK as a baseline (to compare the other LMs against). So my first question is actually about a behaviour of NLTK's Ngram model that I find suspicious. Since the code is rather short, I pasted it here: import nltk print "... build" brown = nltk.corpus.brown corpus = [word.lower() for word in brown.words()] # Train on 95% of the corpus and test on
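For context: the NgramModel the question refers to was removed from later NLTK releases; current NLTK (3.4+) provides the nltk.lm package instead. A hedged sketch of the same experiment (lowercased Brown corpus, 95% train / 5% test, trigram model; the choice of Laplace smoothing here is an assumption, not the asker's code):

    from nltk.corpus import brown
    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    # nltk.download('brown')  # one-time corpus download

    n = 3
    sents = [[w.lower() for w in sent] for sent in brown.sents()]
    split = int(len(sents) * 0.95)          # 95% train / 5% test, as in the question
    train_sents, test_sents = sents[:split], sents[split:]

    train_data, vocab = padded_everygram_pipeline(n, train_sents)
    lm = Laplace(n)                          # add-one smoothing keeps unseen n-grams at finite perplexity
    lm.fit(train_data, vocab)

    test_ngrams = [ng for sent in test_sents
                   for ng in ngrams(pad_both_ends(sent, n=n), n)]
    print("perplexity:", lm.perplexity(test_ngrams))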

Extract keyphrases from text (1-4 word ngrams)

Submitted by 风格不统一 on 2019-12-02 18:53:25
What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl to extract n-grams, but I'm writing this in Node so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can just write it myself? I like the idea, so I've implemented it: see below (descriptive comments are included). Preview at: http://fiddle.jshell.net/WsKMx/ /*@author Rob W, created on 16-17 September 2011, on request for Stackoverflow (http:/
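The linked fiddle is the JavaScript implementation; as a rough language-neutral sketch of the same sliding-window idea (shown in Python, counting every 1-4 word n-gram so the most frequent ones can be kept as candidate keyphrases; the tokenizer regex is an assumption):

    import re
    from collections import Counter

    def extract_ngrams(text, max_n=4):
        """Count all 1..max_n word n-grams in a block of text (simple frequency-based sketch)."""
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
        return counts

    text = "nike shoes and nike clothing; nike shoes are popular"
    print(extract_ngrams(text).most_common(5))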

Creating ARPA language model file with 50,000 words

Submitted by 混江龙づ霸主 on 2019-12-02 17:18:44
I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words? I thought I'd answer this one since it has a few votes, although based on Christina's other questions I don't think this will be a usable answer for her, since a 50,000-word language model almost certainly won't have an acceptable word error rate or recognition speed (or most likely even function for long) with in-app recognition systems for iOS that use this

do searching in a very big ARPA file in a very short time in java

Submitted by 故事扮演 on 2019-12-02 02:37:56
I have an ARPA file which is almost 1 GB. I have to search it in less than 1 minute. I have searched a lot, but I have not found a suitable answer yet. I think I do not have to read the whole file; I just have to jump to a specific line in the file and read that whole line. The lines of the ARPA file do not have the same length. I should mention that ARPA files have a specific format. File format: \data\ ngram 1=19 ngram 2=234 ngram 3=1013 \1-grams: -1.7132 puluh -3.8008 -1.9782 satu -3.8368 \2-grams: -1.5403 dalam dua -1.0560 -3.1626 dalam ini 0.0000 \3-grams: -1.8726 itu dan tiga
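One common way to avoid rescanning the whole file per query (the question is about Java, but the idea is language-agnostic; sketched here in Python) is to make a single indexing pass that records the byte offset of every n-gram line, and then seek() straight to the wanted line on each lookup:

    import re

    def build_arpa_index(path):
        """One pass over an ARPA file: map each n-gram string to the byte offset of its line."""
        index = {}
        current_n = 0
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                raw = f.readline()
                if not raw:
                    break
                line = raw.decode("utf-8", "replace").strip()
                m = re.match(r"\\(\d+)-grams:", line)
                if m:                                   # section headers like \1-grams:
                    current_n = int(m.group(1))
                    continue
                if current_n and line and not line.startswith("\\"):
                    fields = line.split()
                    # fields = [log-prob] + n words + [optional backoff weight]
                    index[" ".join(fields[1:1 + current_n])] = offset
        return index

    def lookup(path, index, ngram):
        """Seek directly to the stored offset and read just that one line."""
        with open(path, "rb") as f:
            f.seek(index[ngram])
            return f.readline().decode("utf-8", "replace").strip()

For a 1 GB model the in-memory index itself can get large; a sorted on-disk index or a binary search over a pre-sorted copy of each n-gram section are common alternatives.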

Remove uni-grams from a list of bi-grams

Submitted by 别来无恙 on 2019-12-02 01:51:48
I have managed to create two lists from text documents. The first is my bi-gram list: keywords = ['nike shoes','nike clothing', 'nike black', 'nike white'] and the second is a list of stop words: stops = ['clothing','black','white'] I want to remove the stops from my keywords list. Using the above example, the output I am after should look like this: new_keywords = ['nike shoes','nike', 'nike', 'nike'] --> eventually I'd like to remove those dupes. This is what I've done so far: keywords = open("keywords.txt", "r") new_keywords = keywords.read().split(",") stops = open("stops.txt","r") new_stops = stops.read(
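Based on the example above, one way to finish the filtering step (a sketch using the in-memory lists from the example rather than the keywords.txt / stops.txt files the asker is reading):

    keywords = ['nike shoes', 'nike clothing', 'nike black', 'nike white']
    stops = ['clothing', 'black', 'white']

    # Drop every stop word from each bi-gram and keep whatever words remain.
    filtered = [" ".join(w for w in kw.split() if w not in stops) for kw in keywords]
    print(filtered)   # ['nike shoes', 'nike', 'nike', 'nike']

    # "eventually I'd like to remove those dupes":
    deduped = list(dict.fromkeys(filtered))
    print(deduped)    # ['nike shoes', 'nike']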

CPU-and-memory efficient NGram extraction with R

Submitted by 雨燕双飞 on 2019-12-01 11:20:26
I wrote an algorithm which extracts n-grams (bigrams, trigrams, ..., up to 5-grams) from a list of 50,000 street addresses. My goal is to have, for each address, a boolean vector representing whether each n-gram is present or not in the address. Each address will therefore be characterized by a vector of attributes, and then I can carry out a clustering on the addresses. The algorithm works this way: I start with the bi-grams and calculate all the combinations of (a-z, 0-9, / and tab): for example: aa, ab, ac, ..., a8, a9, a/, a , ba, bb, ... Then I carry out a loop for each address and extract for
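The question is about R, but the usual way to avoid enumerating every possible character combination up front is to keep only the n-grams that actually occur in the data; a purely illustrative sketch of that idea (shown in Python with scikit-learn, on made-up addresses):

    from sklearn.feature_extraction.text import CountVectorizer

    addresses = ["12 rue de la paix", "3 avenue foch", "12 rue du bac"]  # toy examples

    # binary=True gives presence/absence (0/1) instead of counts;
    # only character n-grams that actually appear become columns.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 5), binary=True)
    X = vectorizer.fit_transform(addresses)   # sparse boolean matrix: addresses x n-grams

    print(X.shape)
    print(vectorizer.get_feature_names_out()[:10])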