nlp | 易学教程

How to fine-tune word2vec when training our CNN for text classification?

Question: I have three questions about fine-tuning word vectors. Please help me out; many thanks in advance! When I train my own CNN for text classification, I use word2vec to initialize the words, then feed these pre-trained vectors to the CNN as input features. Since I never had an embedding layer, the vectors surely cannot be fine-tuned through back-propagation. My question is: if I want to do fine-tuning, does that mean creating an Embedding layer? And how to create

How do I list out all English terms in a sentence that indicate an animal?

Question: For example, in the sentence "The two horses had just lain down when a brood of ducklings, which had lost their mother, filed into the barn, cheeping feebly and wandering from side to side to find some place where they would not be trodden on.", there are two animals: horse and duck. I was looking for vocabulary lists of animal names but was unable to find anything complete enough. The WordNet database looks promising but may be overkill and not broad enough either. Answer 1: WordNet is

Word Base/Stem Dictionary

Question: It seems my Google-fu is failing me. Does anyone know of a freely available word base dictionary that contains just the bases of words? So, for something like strawberries, it would have strawberry, but it would NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but just a text file of mappings or anything that could be read in would be helpful. Answer 1: This is called lemmatization, and what you call the "base of a word" is

Snowball Stemming: defining Regions

Question: I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, that are defined as follows: R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel. http://snowball.tartarus.org/texts/r1r2.html Examples are b e a u t i f u l |<-
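The R1/R2 definition quoted in the question can be sketched in a few lines of Python. This is only an illustration of the definition, not the Snowball project's own code: the helper names `region_start` and `r1_r2` are made up here, and the vowel set `aeiouy` is an assumption that fits English (each Snowball stemmer defines its own vowels, and the real English stemmer adds further special cases).

```python
# Assumed vowel set for English; other languages use different sets.
VOWELS = set("aeiouy")


def region_start(word, start=0):
    """Return the index just after the first non-vowel that follows a vowel,
    looking only at characters from `start` onward.
    Returns len(word) (the null region) if there is no such non-vowel."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the (R1, R2) substrings of `word` per the quoted definition."""
    r1 = region_start(word)
    # R2 applies the same rule again, but inside R1 only.
    r2 = region_start(word, r1)
    return word[r1:], word[r2:]


for w in ("beautiful", "beauty", "beau", "animadversion"):
    print(w, r1_r2(w))
```

For "beautiful" the first non-vowel after a vowel is the "t", so R1 is "iful"; applying the rule again inside R1 stops at the "f", so R2 is "ul".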

NLTK words lemmatizing

Question: I am trying to lemmatize words with NLTK. So far I have found that I can use the stem package to get results such as transforming "cars" to "car" and "women" to "woman"; however, I cannot lemmatize some words with affixes, like "acknowledgement". Using WordNetLemmatizer() on "acknowledgement" returns "acknowledgement", and using PorterStemmer() returns "acknowledg" rather than "acknowledge". Can anyone tell me how to eliminate the affixes of words? Say, when

Get gender from noun using NLTK with German corpora

Question: I'm experimenting with NLTK. My question is whether the library can detect the gender of a noun in German. I want this information in order to determine whether a text is written in a gender-neutral way. See here for more information: https://en.wikipedia.org/wiki/Gender_neutrality_in_languages_with_grammatical_gender The code below categorizes my sentence, but I can't see any information about the gender of "Mitarbeiter". My code so far: sentence = """Der Mitarbeiter geht.""" tokens = nltk

Corpus/data set of English words with syllabic stress information?

Question: I know this is a long shot, but does anyone know of a dataset of English words that has stress information by syllable? Something as simple as the following would be fantastic: AARD vark A ble a BOUT ac COUNT AC id ad DIC tion ad VERT ise ment ... Answer 1: The closest thing I'm aware of is the CMU Pronouncing Dictionary. I don't think it explicitly marks the stressed syllable, but it should be a start. Source: https://stackoverflow.com/questions/2839548/corpus-data-set-of-english-words-with-syllabic
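As a hedged aside to the answer above: CMU Pronouncing Dictionary entries attach a digit to each vowel phoneme (1 for primary stress, 2 for secondary, 0 for unstressed), so a per-vowel stress pattern can be read off an entry, although the dictionary does not give the spelled-out syllables the question asks for. The sketch below assumes that format; the sample lines are written by hand in that style rather than copied from a specific release, and `stress_pattern` is a hypothetical helper name.

```python
# Hand-written sample lines in the CMUdict style (word, then phonemes).
# Assumption: vowel phonemes end in a stress digit 0, 1, or 2.
SAMPLE_LINES = [
    "ABOUT  AH0 B AW1 T",
    "ACID  AE1 S AH0 D",
    "ADDICTION  AH0 D IH1 K SH AH0 N",
]


def stress_pattern(line):
    """Split an entry into the word and the list of stress digits,
    one digit per vowel phoneme (consonants carry no digit)."""
    word, *phonemes = line.split()
    return word, [int(p[-1]) for p in phonemes if p[-1].isdigit()]


for line in SAMPLE_LINES:
    print(stress_pattern(line))
```

Mapping these vowel-level marks back onto written syllables such as "a BOUT" would still need a separate syllabification step, which this sketch does not attempt.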

Why is a self trained NER-Model incompatible with the version of OpenNLP?

Question: I trained an OpenNLP NER model to detect a new entity, but when I use this model I get the following exception: Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: Model version 1.6.0 is not supported by this (1.5.3) version of OpenNLP! I am using OpenNLP version 1.6.0 and my source code is this: String[] sentences = Fragmentation.getSentences(Document); InputStream modelIn = new FileInputStream("Models/en-ner-cvskill.bin");

Regex for Parsing JSON

Question: I have a column of data I'm reading in Tableau directly from Redshift. This column contains a JSON object. It looks like this: {"Age": 58, "City": "Wisconsin Rapids", "Race": "Other", "State": "Wisconsin", "Gender": "Female", "Country": "United States"} I wish to extract this data by generating a column with a calculated field for each data point of interest using Tableau's REGEXP_EXTRACT function, i.e. an Age column, a City column, etc. How do I write a line of regular expressions to get the
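One way to approach the extraction described in the question is a separate pattern per field, each with a single capturing group around the value. The sketch below tries that idea in Python's `re` module on the sample row from the question. The `PATTERNS` table and the `extract` helper are illustrative names, and whether the same patterns behave identically inside Tableau's REGEXP_EXTRACT is an assumption that would need checking there.

```python
import re

# Sample row copied from the question.
row = ('{"Age": 58, "City": "Wisconsin Rapids", "Race": "Other", '
       '"State": "Wisconsin", "Gender": "Female", "Country": "United States"}')

# One pattern per field, each with exactly one capturing group.
# Numbers are matched as digits; strings as anything up to the closing quote.
PATTERNS = {
    "Age": r'"Age":\s*(\d+)',
    "City": r'"City":\s*"([^"]*)"',
    "State": r'"State":\s*"([^"]*)"',
}


def extract(field, text):
    """Return the captured value for `field` as a string, or None if absent."""
    match = re.search(PATTERNS[field], text)
    return match.group(1) if match else None


print(extract("Age", row))    # 58
print(extract("City", row))   # Wisconsin Rapids
```

Patterns like these only hold for flat objects with simple values; escaped quotes inside strings or nested objects would break them, which is the usual argument for a real JSON parser where one is available.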