text-analysis

Java text analysis libraries

I'm looking for a Java-driven solution to a requirement for analysing sentences, to log whether a key word was used positively or negatively. I.e. the key word might be 'cabbages' and the sentence: 'I like cabbages but not peas'. I'd like a Java text analyser of some kind to log this as positive. Can the Lucene (Hibernate Search) libraries be utilized for this? Any thoughts?

You're looking for "sentiment analysis". One possibility is LingPipe, who kindly link to their competitors as well. Jeff Dalton also has a great list of natural language processing tools on his blog. I doubt there's
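To make the requirement concrete, here is a deliberately naive sketch in plain Python (not a Java/Lucene answer; the word lists and the clause split on 'but' are placeholders, and real sentiment analysis handles negation and scope far better):

# Naive keyword polarity: look at sentiment words in the clause containing the key word.
POSITIVE = {"like", "love", "enjoy"}
NEGATIVE = {"hate", "dislike", "not"}

def polarity(sentence, keyword):
    # Splitting on ' but ' keeps "I like cabbages but not peas" scoped correctly.
    for clause in sentence.lower().split(" but "):
        words = set(clause.split())
        if keyword in words:
            pos, neg = len(POSITIVE & words), len(NEGATIVE & words)
            return "positive" if pos > neg else "negative" if neg > pos else "neutral"
    return "keyword not found"

print(polarity("I like cabbages but not peas", "cabbages"))  # positive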

Training data for sentiment analysis [closed]

Where can I get a corpus of documents that have already been classified as positive/negative for sentiment in the corporate domain? I want a large corpus of documents that provide reviews of companies, like the reviews of companies provided by analysts and media. I can find corpora that have reviews of products and movies. Is there a corpus for the business domain, including reviews of companies, that matches the language of business?

Gregory Marton: http://www.cs.cornell.edu/home/llee/data/ http://mpqa.cs.pitt.edu/corpora/mpqa_corpus

You can use Twitter, with its smileys, like this: http://web.archive
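The Twitter suggestion is the "distant supervision" trick: treat the emoticon as a noisy sentiment label and strip it from the text. A minimal sketch (the tweets are placeholders):

# Emoticons as noisy labels: :) -> positive, :( -> negative.
tweets = ["great quarter for $ACME :)", "their support is useless :("]

labeled = []
for t in tweets:
    if ":)" in t:
        labeled.append((t.replace(":)", "").strip(), "positive"))
    elif ":(" in t:
        labeled.append((t.replace(":(", "").strip(), "negative"))

print(labeled)  # [('great quarter for $ACME', 'positive'), ('their support is useless', 'negative')]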

How do I use sklearn CountVectorizer with both 'word' and 'char' analyzer? - python

How do I use sklearn CountVectorizer with both 'word' and 'char' analyzers? http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html I can extract the text features by word or char separately, but how do I create a charword_vectorizer? Is there a way to combine the vectorizers, or to use more than one analyzer?

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> word_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)
>>> char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=1)
>>> x = [
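One way to get a combined vectorizer is sklearn's FeatureUnion, which runs several transformers on the same input and concatenates their outputs. A minimal sketch (the name charword_vectorizer is just illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Run word- and char-level vectorizers side by side and stack the features.
charword_vectorizer = FeatureUnion([
    ('word', CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)),
    ('char', CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=1)),
])

docs = ['this is a test', 'another test']
X = charword_vectorizer.fit_transform(docs)  # sparse matrix: word features, then char features

FeatureUnion also prefixes feature names with 'word__' or 'char__' when you ask it for them, which makes it easy to see where each column came from.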

Any tutorial or code for Tf Idf in java

I am looking for a simple Java class that can compute a tf-idf calculation. I want to do a similarity test on 2 documents. I have found many BIG APIs that include a tf-idf class, but I do not want to use a big jar file just to do my simple test. Please help! Or at least, if someone can tell me how to find TF and IDF, I will calculate the results :) Or, if you can point me to a good Java tutorial for this. Please do not tell me to go looking on Google, I already did for 3 days and couldn't find anything :( Please also do not refer me to Lucene :(

Term Frequency is the square root of the number of times a term occurs
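For reference, the formulas behind that answer, written out (the square-root tf is the one the answer quotes, which is Lucene's classic weighting; the idf shown is the matching Lucene-style smoothed form, an assumption since the snippet cuts off before it):

\mathrm{tf}(t,d) = \sqrt{f_{t,d}}, \qquad \mathrm{idf}(t) = 1 + \log\frac{N}{\mathrm{df}_t + 1}, \qquad \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)

where f_{t,d} is the raw count of term t in document d, N is the total number of documents, and df_t is the number of documents containing t. For the similarity test, compare the two documents' tf-idf vectors with cosine similarity.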

How to find common phrases in a large body of text

I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have three sentences like the following:

The dog jumped over the woman.
The dog jumped into the car.
The dog jumped up the stairs.

From the above example I would want to extract "the dog jumped", as it is the most common phrase in the text. At first I thought, "oh, let's use a directed graph [with repeated nodes]": directed graph http://img.skitch.com/20091218
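Before reaching for a graph, plain n-gram counting already solves the example. A minimal sketch (plain Python; the window size of 3 is chosen to match the example phrase):

from collections import Counter

sentences = [
    "The dog jumped over the woman.",
    "The dog jumped into the car.",
    "The dog jumped up the stairs.",
]

counts = Counter()
for s in sentences:
    words = s.lower().rstrip(".").split()
    # Tally every 3-word window in the sentence.
    for i in range(len(words) - 2):
        counts[" ".join(words[i:i + 3])] += 1

print(counts.most_common(1))  # [('the dog jumped', 3)]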

How can I compute TF/IDF with SQL (BigQuery)

I'm doing text analysis over reddit comments, and I want to calculate the TF-IDF within BigQuery. The query works in 5 stages:

1. Obtain all the reddit posts I'm interested in.
2. Normalize words (LOWER, only letters and ', unescape some HTML).
3. Split those words into an array.
4. Calculate the tf (term frequency) for each word in each doc: count how many times it shows up in each doc, relative to the number of words in said doc.
5. For each word, calculate the number of docs that contain it.

From (3.), obtain the idf (inverse document frequency): "inverse fraction of the documents that contain the word,
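Written out, the quantity the truncated quote is describing (the standard definition; the log base is a convention choice):

\mathrm{idf}(t) = \log\frac{N}{|\{d : t \in d\}|}, \qquad \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)

i.e. the log of the total number of docs divided by the number of docs that contain the word, multiplied by the per-doc term frequency from stage 4.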

NLP: Qualitatively “positive” vs “negative” sentence

I need your help in determining the best approach for analyzing industry-specific sentences (e.g. movie reviews) for "positive" vs "negative". I've seen libraries such as OpenNLP before, but they're too low-level; they just give me the basic sentence composition. What I need is a higher-level structure:

- hopefully with wordlists
- hopefully trainable on my set of data

Thanks!

What you are looking for is commonly dubbed Sentiment Analysis. Typically, sentiment analysis is not able to handle delicate subtleties, like sarcasm or irony, but it fares pretty well if you throw a large set of data at
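Since "trainable on my set of data" is the key requirement, here is the shape of that route as a minimal scikit-learn sketch (Python rather than Java, purely to illustrate the approach; the labeled examples are placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny placeholder training set: domain sentences with polarity labels.
train_texts = ["great movie, loved it", "terrible plot, awful acting",
               "wonderful cast", "boring and slow"]
train_labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words (unigrams + bigrams) feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["a wonderful, great movie"]))  # ['positive']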

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase and, ideally, not enforcing word-for-word matching). My example is any review on Yelp.com that shows 3 snippets from hundreds of reviews of a given restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page: http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed and I've played around with it a bit, but I am honestly overwhelmed by the options. This seems like a rather common
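One concrete starting point inside NLTK is its collocation finders, which tally and rank frequent multi-word pairings. A minimal trigram sketch (toy input standing in for the real reviews):

from nltk.collocations import TrigramCollocationFinder, TrigramAssocMeasures

reviews = ("try the hamburger great place "
           "try the hamburger friendly staff "
           "the fries were cold")

finder = TrigramCollocationFinder.from_words(reviews.split())
finder.apply_freq_filter(2)  # keep only trigrams seen at least twice
print(finder.nbest(TrigramAssocMeasures.raw_freq, 3))  # [('try', 'the', 'hamburger')]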

Extracting text from garbled PDF [closed]

I have a PDF file with valuable textual information. The problem is that I cannot extract the text; all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader into a text file. Even File -> Save as text in Acrobat Reader fails. I have used every tool I could get my hands on, and the result is the same. I believe this has something to do with font embedding, but I don't know what exactly. My questions:

- What is the culprit of this weird text garbling?
- How to extract the text content from the PDF (programmatically, with a tool, manipulating the
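A minimal sketch of the programmatic attempt, using pdfminer.six (my choice of library; the question doesn't name one, and the filename is a placeholder):

from pdfminer.high_level import extract_text

# If the output is gibberish here too, the embedded fonts most likely lack
# ToUnicode CMaps, so no extractor can map glyph codes back to characters;
# the fallback is OCR (render the pages to images and run e.g. tesseract).
text = extract_text("document.pdf")
print(text[:500])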