text-analysis

How to detect duplicates among text documents and return the duplicates' similarity?

半世苍凉 submitted on 2019-11-27 09:35:28
I'm writing a crawler to fetch content from some websites, but the content can be duplicated and I want to avoid that. So I need a function that returns the percentage of similarity between two texts, so that likely duplicates can be detected. Example: Text 1: "I'm writing a crawler to". Text 2: "I'm writing a some text crawler to get". The compare function should report that text 2 matches text 1 with a ratio of 5/8 (5 being the number of words of text 2 that also occur in text 1 in the same order, and 8 being the total number of words in text 2). If "some text" is removed, text 2 becomes identical to text 1 (I need to detect that situation too). How can I do that? You are facing a
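A minimal sketch of one way to compute that ratio in Python, matching words in order with difflib (the texts and the 5/8 figure come from the question; the helper name is my own):

```python
from difflib import SequenceMatcher

def similarity_ratio(text1, text2):
    """Fraction of text2's words that match text1's words, in order."""
    words1, words2 = text1.split(), text2.split()
    matcher = SequenceMatcher(None, words1, words2)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(words2)

print(similarity_ratio("I'm writing a crawler to",
                       "I'm writing a some text crawler to get"))  # 5/8 = 0.625
```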

How can I compute TF/IDF with SQL (BigQuery)

旧街凉风 submitted on 2019-11-27 02:31:30
Question: I'm doing text analysis over reddit comments, and I want to calculate the TF-IDF within BigQuery. Answer 1: This query works in 5 stages: Obtain all the reddit posts I'm interested in. Normalize words (LOWER, only letters and ', unescape some HTML). Split those words into an array. Calculate the tf (term frequency) for each word in each doc - count how many times it shows up in each doc, relative to the number of words in said doc. For each word, calculate the number of docs that contain it. From (3.
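The SQL itself is cut off above; purely as an illustration, the same five stages can be written in Python (the toy corpus and names here are placeholders, not part of the BigQuery answer):

```python
import re
from collections import Counter
from math import log

# Stage 1: the documents of interest (stand-in for the reddit posts)
docs = {
    "doc1": "I love this movie, it's great!",
    "doc2": "This movie is terrible &amp; boring.",
}

def normalize(text):
    # Stage 2: lowercase, unescape a bit of HTML, keep only letters and apostrophes
    text = text.lower().replace("&amp;", "&")
    # Stage 3: split into a list of words
    return re.findall(r"[a-z']+", text)

tokenized = {doc_id: normalize(text) for doc_id, text in docs.items()}

# Stage 4: term frequency relative to the number of words in each doc
tf = {doc_id: {w: c / len(words) for w, c in Counter(words).items()}
      for doc_id, words in tokenized.items()}

# Stage 5: number of docs containing each word, then a standard idf and tf-idf
df = Counter(w for words in tokenized.values() for w in set(words))
idf = {w: log(len(docs) / n) for w, n in df.items()}
tfidf = {doc_id: {w: f * idf[w] for w, f in freqs.items()}
         for doc_id, freqs in tf.items()}

print(tfidf["doc1"])
```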

NLP: Qualitatively “positive” vs “negative” sentence

久未见 submitted on 2019-11-27 00:15:34
Question: I need your help in determining the best approach for analyzing industry-specific sentences (e.g. movie reviews) as "positive" vs "negative". I've seen libraries such as OpenNLP before, but they're too low-level - they just give me the basic sentence composition; what I need is a higher-level structure: - hopefully with wordlists - hopefully trainable on my set of data. Thanks! Answer 1: What you are looking for is commonly dubbed Sentiment Analysis. Typically, sentiment analysis is not able to handle
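As a hedged sketch of the trainable, wordlist-style route in Python with NLTK (the movie_reviews corpus and the bag-of-words features are my assumptions, not part of the answer; the corpus must be downloaded once):

```python
import random
import nltk
from nltk.corpus import movie_reviews

# nltk.download('movie_reviews')  # uncomment on first run

def word_features(words):
    # Simple bag-of-words presence features
    return {word: True for word in words}

labeled = [(word_features(movie_reviews.words(fid)), category)
           for category in movie_reviews.categories()
           for fid in movie_reviews.fileids(category)]
random.shuffle(labeled)

train_set, test_set = labeled[:1600], labeled[1600:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(word_features("a wonderful, uplifting film".split())))
```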

How to extract common / significant phrases from a series of text entries

断了今生、忘了曾经 submitted on 2019-11-26 23:50:10
Question: I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase and, ideally, without enforcing word-for-word matching). My example is any review on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try the hamburger" (in 44 reviews), e.g. the "Review Highlights" section of this page: http://www.yelp.com/biz/sushi-gen-los-angeles/ I have NLTK installed and I've
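Since NLTK is already installed, one minimal sketch of pulling out recurring two-word phrases with its collocation finder (the sample reviews, the naive tokenization, and the frequency threshold are placeholders I chose):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

reviews = [
    "Try the hamburger, the hamburger is amazing.",
    "Great place, try the hamburger and the fries.",
    "The fries are good but try the hamburger first.",
]

# Very naive tokenization to keep the example self-contained
words = [w.strip(".,!").lower() for text in reviews for w in text.split()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)              # keep only phrases seen at least twice
measures = BigramAssocMeasures()

# Most "significant" two-word phrases across all reviews
print(finder.nbest(measures.likelihood_ratio, 5))
```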

Extracting text from garbled PDF [closed]

こ雲淡風輕ζ submitted on 2019-11-26 20:15:45
Question: I have a PDF file with valuable textual information. The problem is that I cannot extract the text; all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader into a text file. Even File -> Save as text in Acrobat Reader fails. I have used all tools I could get my

Stemmers vs Lemmatizers

老子叫甜甜 submitted on 2019-11-26 11:10:34
Natural Language Processing (NLP), especially for English, has evolved to the point where stemming would become an archaic technology if "perfect" lemmatizers existed, because stemmers change the surface form of a word/token into meaningless stems. Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms. Stemmers: [in]: having [out]: hav. Lemmatizers: [in]: having [out]: have. So the question is: are English stemmers of any use at all today?
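The contrast is easy to see in Python with NLTK (a sketch; note that the WordNet lemmatizer needs the wordnet corpus downloaded and a part-of-speech hint to return verb lemmas, and the exact stemmer output varies by word):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet')  # required once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["having", "studies", "meeting"]:
    print(word,
          "-> stem:", stemmer.stem(word),                   # crude suffix stripping, e.g. 'studi'
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # dictionary form, e.g. 'have'
```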