linguistics

Best practices for searching for alternate forms of a word with Lucene

梦想的初衷 posted on 2019-12-09 18:32:00
Question: I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this? Please note: this site only has English content. Some approaches I've thought of: Look up the

words usage database?

我怕爱的太早我们不能终老 posted on 2019-12-09 18:24:04
Question: Is there any free database/place out there with commonality/usage ratios of English words? (British or U.S. English, doesn't matter.) I don't care about the exact numbers, only their values relative to each other. Something like:

the | 0.2
car | 0.08
chroma | 0.005
overspread | 0.0000007

Edit: I have found http://en.wiktionary.org/wiki/Wiktionary%3aFrequency_lists which I can scrape for data. However, I would prefer an SQL format, which is easier to work with.

Answer 1: The term you want to google is "word
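As a rough illustration of the kind of table the question asks for, relative frequencies can be computed from any text you already have. This is a minimal sketch, not a substitute for a published frequency list: the function name `relative_frequencies` and the sample sentence are made up for the example, and the tokenization is deliberately crude.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return {word: share of all word occurrences} for the given text."""
    # Crude tokenization (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample text, only to show the shape of the result.
sample = "the car and the chroma of the car"
freqs = relative_frequencies(sample)

# Print the table from most to least frequent word.
for word, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{word} | {share}")
```

Only the ratios matter here, so the result could be written to an SQL table as two columns (word, share) if that format is easier to query.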

How do I preserve my float number in ruby

女生的网名这么多〃 posted on 2019-12-08 01:34:39
Question: So I'm trying some code out to convert numbers into strings. However, I noticed that in certain cases it does not preserve the last two decimal places. For instance, I type 1.01 and 1.04 for addition and I get back 2.04. If I type just 1.05, it preserves the number and returns it exactly. I get what's going on: things are being rounded. I don't know how to prevent it from being rounded, though. Should I just consider sending (1.01+1.04) to self as only one input? Warning! I haven't tried this yet

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

ε祈祈猫儿з posted on 2019-12-07 01:09:19
Question: I wondered how you would go about tokenizing strings in English (or other western languages) if whitespace were removed? The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'. In the novel, the Sheep Man is translated as saying things like: "likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo." So, some punctuation is kept, but not all. Enough for a human to read, but

Thesaurus class or API for PHP [edited]

╄→尐↘猪︶ㄣ posted on 2019-12-06 20:26:32
Question: TL;DR Summary: I need a single command-line application which I can use to get synonyms and other related words. It needs to be multilingual and work cross-platform. Can anyone suggest a suitable program for me, or help me with the ones I've already found? Thanks. Longer version: I've been tasked with writing a system in PHP that can come up with alternative suggestions for words entered by the user. I need to find a thesaurus application / API or similar which I can use to generate these

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

妖精的绣舞 posted on 2019-12-05 04:46:54
I wondered how you would go about tokenizing strings in English (or other western languages) if whitespace were removed? The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'. In the novel, the Sheep Man is translated as saying things like: "likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo." So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary. What would be your strategy for building a parser for this? Common combinations of
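One possible strategy, sketched here under stated assumptions and not taken from the original thread, is dictionary-based segmentation with dynamic programming: for every prefix of the string, remember the split into known words that uses the fewest words. The function name `segment` and the tiny `VOCAB` set are hypothetical; a real parser would need a full word list and some handling for punctuation and apostrophes.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words; return None if no full split exists."""
    max_len = max(len(word) for word in vocabulary)
    # best[i] holds the fewest-words segmentation of text[:i], or None if unreachable.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Hypothetical toy vocabulary, only large enough for the demonstration.
VOCAB = {"like", "we", "said", "do", "what", "can", "a", "an"}

print(segment("likewesaid", VOCAB))  # ['like', 'we', 'said']
print(segment("xyz", VOCAB))         # None
```

Preferring the fewest words is only one tie-breaking rule; weighting candidate words by how common they are is a frequent refinement when several splits are possible.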

Thesaurus class or API for PHP [edited]

杀马特。学长 韩版系。学妹 posted on 2019-12-05 00:36:04
TL;DR Summary: I need a single command-line application which I can use to get synonyms and other related words. It needs to be multilingual and work cross-platform. Can anyone suggest a suitable program for me, or help me with the ones I've already found? Thanks. Longer version: I've been tasked with writing a system in PHP that can come up with alternative suggestions for words entered by the user. I need to find a thesaurus application / API or similar which I can use to generate these suggestions. Importantly, it needs to be multilingual (English, Danish, French and German). This rules

Best practices for searching for alternate forms of a word with Lucene

末鹿安然 posted on 2019-12-04 10:12:54
I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this? Please note: this site only has English content. Some approaches I've thought of: Look up the word in some kind of thesaurus file to determine alternate forms of a given word. Some examples:
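The singular/plural mismatch described above is commonly handled by reducing words to a shared base form both when indexing and when querying, so that either form matches the other. The sketch below only illustrates that idea in plain Python; it is not Lucene code, and `singularize`, `build_index`, and `search` are hypothetical helpers with deliberately naive rules. In Lucene itself, the usual route would be an analyzer that applies English stemming rather than hand-written rules.

```python
def singularize(word):
    """Very naive English plural stripping (illustrative only, many exceptions)."""
    word = word.lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"           # batteries -> battery
    if len(word) > 3 and word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]                 # boxes -> box
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]                 # cars -> car
    return word


def build_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(singularize(token), set()).add(doc_id)
    return index


def search(index, query):
    """Normalize the query term the same way the indexed tokens were normalized."""
    return index.get(singularize(query), set())


# Hypothetical documents: one uses the plural, the other the singular.
docs = {1: "replacement batteries", 2: "battery charger"}
index = build_index(docs)
print(search(index, "battery"))
print(search(index, "batteries"))
```

The key design point is that the same normalization runs on both sides; applying it only to queries or only to documents would reintroduce the mismatch.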

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

女生的网名这么多〃 posted on 2019-12-04 09:37:07
Question: I want to include hyphenated words, for example long-term, self-esteem, etc., as a single token in spaCy. After looking at some similar posts on Stack Overflow, GitHub, the spaCy documentation and elsewhere, I wrote a custom tokenizer as below.

    import re
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''^[\[\("']''')
    suffix_re = re.compile(r'''[\]\)"']$''')
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab, prefix_search

Where can I find a list of English phrases? [closed]

会有一股神秘感。 posted on 2019-12-04 08:28:31
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed last year. I'm tasked with searching for the use of cliches and common phrases in text. The phrases are similar to the phrases you might see for the phrase puzzles on Wheel of Fortune. Here are a few examples: Easy Come Easy Go; Too Good To be True; Winning Isn't Everything. I cannot find a list of phrases, however. Does anybody