text-processing

Java text classification problem [closed]

偶尔善良 submitted on 2019-11-27 10:03:14
Question: I have a set of Book objects; class Book is defined as follows: class Book { String title; ArrayList<tags> taglist; } where title is the title of the book, for example: Javascript for dummies, and taglist is a list of tags, in our example: Javascript, jquery, "web dev", ... As I said, I have a set of books about different subjects: IT, BIOLOGY, HISTORY, ... Each book has a title and a set of tags describing it. I have to automatically classify those books into separate sets by topic.
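One simple way to attack this (a sketch, not from the thread: Book is modeled here as a (title, tags) tuple, and "same topic" is assumed to mean "tag lists overlap, transitively") is a naive connected-components pass over the tag sets:

```python
# Minimal sketch: cluster books whose tag sets overlap. Each book is a
# (title, tags) tuple; the book titles and tags below are illustrative.
def group_books(books):
    """Group (title, tags) pairs so books sharing at least one tag land together."""
    groups = []  # each group: (set_of_tags, [titles])
    for title, tags in books:
        tags = set(tags)
        merged = [g for g in groups if g[0] & tags]  # existing groups this book touches
        for g in merged:
            groups.remove(g)
        union_tags = tags.union(*(g[0] for g in merged)) if merged else tags
        titles = [title] + [t for g in merged for t in g[1]]
        groups.append((union_tags, titles))
    return [sorted(t) for _, t in groups]

books = [
    ("JavaScript for Dummies", ["javascript", "web dev"]),
    ("Learning jQuery", ["jquery", "javascript"]),
    ("The Cell", ["biology"]),
]
print(group_books(books))
```

With clean tags this puts the two JavaScript books in one group and the biology book in another; real tag data would likely need normalization (casing, synonyms) first.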

Algorithms to detect phrases and keywords from text

前提是你 submitted on 2019-11-27 08:59:06
Question: I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together. If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that appear before and after each one, but now I really cannot figure out what to do next.
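A common next step (shown here as a sketch, not the asker's code; the stop-word list is a tiny illustrative subset) is to count adjacent word pairs and keep only bigrams where neither word is a stop word, which surfaces phrase candidates instead of lone common words:

```python
# Count adjacent word pairs, filtering pairs that contain a stop word.
from collections import Counter

STOP = {"is", "the", "for", "in", "am", "a", "of", "and", "to"}  # toy subset

def bigram_candidates(text, min_count=1):
    words = [w.lower() for w in text.split()]
    pairs = zip(words, words[1:])
    counts = Counter(p for p in pairs if p[0] not in STOP and p[1] not in STOP)
    return [(" ".join(p), c) for p, c in counts.items() if c >= min_count]

sample = "machine learning is fun and machine learning is useful"
print(bigram_candidates(sample))
```

On real data you would raise min_count and extend the idea to trigrams; statistical scores such as pointwise mutual information are the usual refinement.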

Finding common value across multiple files containing single column values

妖精的绣舞 submitted on 2019-11-27 08:04:58
Question: I have 100 text files, each containing a single column. The files look like:

file1.txt: 10032 19873 18326
file2.txt: 10032 19873 11254
file3.txt: 15478 10032 11254

and so on. The size of each file is different. How can I find the numbers that are common to all 100 files? A given number appears at most once per file.

Answer 1: awk to the rescue! To find the elements common to all files (assuming uniqueness within each file): awk '{a[$1]++} END{for(k in a) if(a[k]==ARGC-1) print k}'
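The awk answer counts how many files each value appears in and prints those seen in all of them. The same idea in Python is a plain set intersection (a sketch; the inline data stands in for the files' columns):

```python
# Intersect one set per file: a value common to all files survives.
def common_values(columns):
    """columns: iterable of per-file value lists; returns values present in every one."""
    sets = [set(c) for c in columns]
    return set.intersection(*sets) if sets else set()

files = [
    ["10032", "19873", "18326"],   # file1.txt
    ["10032", "19873", "11254"],   # file2.txt
    ["15478", "10032", "11254"],   # file3.txt
]
print(sorted(common_values(files)))
```

For the sample data above only 10032 appears in every file, matching what the awk one-liner would print.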

Removing stop words from single string

一世执手 submitted on 2019-11-27 07:06:01
Question: My query is string = 'Alligator in water', where in is a stop word. How can I remove it so that I get stop_remove = 'Alligator water' as output? I have tried ismember, but it returns an integer index for the matching word; I want the remaining words as output. in is just an example; I'd like to remove all possible stop words. Answer 1: Use this to remove all stop words. Code: % Source of stopwords: http://norm.al/2009/04/14/list-of-english-stop-words/ stopwords_cellstring={'a', 'about',
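The question and answer are MATLAB; for comparison, the same filtering is a one-liner in Python (a sketch with a tiny illustrative stop-word list, not the norm.al list the answer links):

```python
# Keep only the words that are not in the stop-word set.
def remove_stopwords(s, stopwords=frozenset({"in", "the", "a", "of", "is"})):
    return " ".join(w for w in s.split() if w.lower() not in stopwords)

print(remove_stopwords("Alligator in water"))  # -> Alligator water
```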

What's the fastest way to strip and replace a document of high unicode characters using Python?

怎甘沉沦 submitted on 2019-11-27 05:50:29
Question: I am looking to replace, throughout a large document, all high Unicode characters, such as accented Es and left and right quotes, with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be Perl here: http://www.designmeme.com/mtplugins/lowdown.txt Is there a fast way of doing this in Python without using s.replace(...).replace(...).replace(...)...? I've
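The usual single-pass approach in Python (a sketch of the standard-library technique, not the linked Perl script) combines str.translate for punctuation with NFKD normalization to strip accents:

```python
# One pass: translate curly quotes, then decompose accented letters and
# drop the combining marks by encoding to ASCII with errors ignored.
import unicodedata

TRANS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
})

def asciify(s):
    s = s.translate(TRANS)
    s = unicodedata.normalize("NFKD", s)          # e.g. É -> E + combining accent
    return s.encode("ascii", "ignore").decode("ascii")

print(asciify("\u201cR\u00e9sum\u00e9\u201d"))   # curly-quoted "Résumé"
```

str.translate does all replacements in one scan, avoiding the chained .replace() calls the question wants to escape.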

Eliminate partially duplicate lines by column and keep the last one

核能气质少年 submitted on 2019-11-27 05:44:58
Question: I have a file that looks like this:

2011-03-21 name001 line1
2011-03-21 name002 line2
2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

For each name, I only want its last appearance, so I expect the result to be:

2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

Could someone give me a solution with bash/awk/sed? Answer 1: This gets the lines that are unique by the second field, starting from the end of the file (matching your expected result): tac temp.txt | sort
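The same "keep only the last occurrence per name, in last-seen order" logic is easy to express in Python (a sketch; field layout is assumed to be date name payload, whitespace-separated):

```python
# Keep the last line for each name (second field), ordered by last appearance:
# popping before re-inserting moves an updated key to the end of the dict.
def keep_last(lines):
    seen = {}
    for line in lines:
        key = line.split()[1]      # second field is the name
        seen.pop(key, None)        # forget the earlier occurrence, if any
        seen[key] = line           # re-insert so the key sorts by last appearance
    return list(seen.values())

rows = [
    "2011-03-21 name001 line1",
    "2011-03-21 name002 line2",
    "2011-03-21 name003 line3",
    "2011-03-22 name002 line4",
    "2011-03-22 name001 line5",
]
print(keep_last(rows))
```

For the sample input this reproduces exactly the three lines the question expects.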

Converting a \\u escaped Unicode string to ASCII

北战南征 submitted on 2019-11-27 05:06:32
After reading all about iconv and Encoding, I am still confused. I am scraping the source of a web page and have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'. More simply, if I set x <- 'pretty\\u003D\\u003Ebig', how do I perform a conversion on x to yield pretty=>big? Any suggestions? Answer: Use parse, but don't evaluate the results: x1 <- 'pretty\\u003D\\u003Ebig' x2 <- parse(text = paste0("'", x1, "'")) x3 <- x2[[1]] x3 # [1] "pretty=>big" is.character
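The question and answer are R; the same decoding of literal \uXXXX escape sequences has a direct standard-library equivalent in Python, shown here only for comparison:

```python
# The scraped string contains literal backslash-u escapes; the
# "unicode_escape" codec interprets them the way a source-code literal would be.
import codecs

x = "pretty\\u003D\\u003Ebig"   # i.e. the characters p r e t t y \ u 0 0 3 D ...
print(codecs.decode(x, "unicode_escape"))  # -> pretty=>big
```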

Expanding English language contractions in Python

流过昼夜 submitted on 2019-11-27 04:15:22
Question: The English language has a number of contractions. For instance: you've -> you have, he's -> he is. These can sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions? Answer 1: I turned that Wikipedia contraction-to-expansion page into a Python dictionary (see below). Note, as you might expect, that you definitely want to use double quotes when querying the dictionary. Also, I've left multiple options in, as in the wikipedia
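The dictionary approach the answer describes can be sketched like this (the mapping below is a toy three-entry subset, not the full Wikipedia list the answer built):

```python
# Expand contractions with a single compiled regex over the dictionary keys.
import re

CONTRACTIONS = {"you've": "you have", "he's": "he is", "can't": "cannot"}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b"
)

def expand(text):
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand("you've seen he's here"))  # -> you have seen he is here
```

Note the double quotes around keys containing apostrophes, exactly the pitfall the answer warns about; ambiguous forms ("he's" = "he is" or "he has") still need context to resolve.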

Select random lines from a file [duplicate]

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-27 02:31:04
This question already has an answer here: What's an easy way to read random line from a file in Unix command line? (13 answers) In a Bash script, I want to pick out N random lines from an input file and write them to another file. How can this be done? dogbane: Use shuf with the -n option, as shown below, to get N random lines: shuf -n N input > output user881480: Sort the file randomly and pick the first 100 lines: $ sort -R input | head -n 100 > output Source: https://stackoverflow.com/questions/9245638/select-random-lines-from-a-file
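When shuf or sort -R is not available, the same sampling is easy in Python (a sketch; random.sample picks N distinct items without replacement, like shuf -n N):

```python
# Pick n distinct lines at random; a fixed seed is used here only so the
# example is reproducible.
import random

def sample_lines(lines, n, seed=None):
    rng = random.Random(seed)
    return rng.sample(list(lines), n)

lines = [f"line {i}" for i in range(100)]
picked = sample_lines(lines, 5, seed=0)
print(picked)
```

For files too large for memory, reservoir sampling does the same in one pass with O(n) memory.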

How can I sum values in column based on the value in another column?

落爺英雄遲暮 submitted on 2019-11-27 01:58:54
I have a text file:

ABC 50
DEF 70
XYZ 20
DEF 100
MNP 60
ABC 30

I want output that sums the values for each name. For example, all ABC values total (50 + 30 = 80) and DEF totals (70 + 100 = 170). So the output should sum the values for each unique first-column name:

ABC 80
DEF 170
XYZ 20
MNP 60

Any help will be greatly appreciated. Thanks.

$ awk '{a[$1]+=$2} END{for(i in a) print i,a[i]}' file
ABC 80
XYZ 20
MNP 60
DEF 170

$ perl -lane '$sum{$F[0]} += $F[1]; END { print "$_ $sum{$_}" for sort grep length, keys %sum }' input
ABC 80
DEF 170
MNP 60
XYZ 20
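Both one-liners accumulate totals keyed by the first column. The same pattern in Python is a defaultdict (a sketch; the inline rows stand in for the parsed file):

```python
# Sum the second column grouped by the first column.
from collections import defaultdict

def sum_by_key(rows):
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += int(value)
    return dict(totals)

rows = [("ABC", 50), ("DEF", 70), ("XYZ", 20),
        ("DEF", 100), ("MNP", 60), ("ABC", 30)]
print(sum_by_key(rows))
```

This yields ABC 80, DEF 170, XYZ 20, MNP 60, matching the awk and perl answers (their output order differs because awk's for-in order is unspecified and the perl answer sorts keys).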