text-processing

processing text from a non-flat file (to extract information as if it *were* a flat file)

。_饼干妹妹 submitted on 2019-12-03 20:09:17
I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables):

time subject  var1 var2 var3
t1   subjectA ...
t2   subjectB ...

and

subject  name
subjectA nameA
subjectB nameB

However, the simulation writes its data file in a format similar to the following:

time t1 description
subjectA nameA var1 var2 var3
subjectB nameB var1 var2 var3
time t2 description
subjectA nameA var1 var2 var3
subjectB nameB var1 var2 var3
...(and so on)

I have been using a (python) script to process this output data into a flat text file so that I…
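A minimal Python sketch of the flattening step the question describes. The exact file layout is an assumption reconstructed from the excerpt: each block starts with a "time <t> description" line, followed by one "subject name var1 var2 var3" line per subject.

```python
def flatten(lines):
    """Convert block-structured simulation output into flat rows."""
    rows = []
    current_time = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "time":
            current_time = parts[1]  # trailing description text is ignored
        elif current_time is not None:
            # Assumed layout: subject, name, then the variable values.
            subject, name, *values = parts
            rows.append([current_time, subject] + values)
    return rows

sample = [
    "time t1 description",
    "subjectA nameA 1 2 3",
    "subjectB nameB 4 5 6",
    "time t2 description",
    "subjectA nameA 7 8 9",
]
```

Each output row then matches a row of the first table above (time, subject, var1..var3), ready to write as a flat file.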

Deleting the last line of a file with Java

与世无争的帅哥 submitted on 2019-12-03 18:03:29
Question: I have a .txt file, which I want to process in Java. I want to delete its last line. I need ideas on how to achieve this without copying the entire content into another file and ignoring the last line. Any suggestions?

Answer 1: You could find the beginning of the last line by scanning the file and then truncate it using FileChannel.truncate or RandomAccessFile.setLength.

Answer 2: With RandomAccessFile you can use the method seek(long) to jump forward and read those lines. But you won't know…
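The question asks about Java, but the scan-then-truncate idea from Answer 1 can be sketched in a few lines of Python (a production version would scan backwards in chunks rather than reading the whole file):

```python
def truncate_last_line(path):
    """Cut the file at the start of its last line, in place."""
    with open(path, "rb+") as f:
        data = f.read()
        end = len(data)
        if data.endswith(b"\n"):
            end -= 1  # ignore the trailing newline when locating the last line
        # Find the newline that ends the second-to-last line; -1 + 1 == 0
        # if there is only one line, which empties the file.
        cut = data.rfind(b"\n", 0, end) + 1
        f.truncate(cut)
```

This mirrors what FileChannel.truncate / RandomAccessFile.setLength do on the Java side: the file is shortened without rewriting its remaining content.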

tm custom removePunctuation except hashtag

故事扮演 submitted on 2019-12-03 17:28:33
I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code:

tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)

The problem is that by doing so I also lose the hashtag (#). Is there a way to remove punctuation with tm_map but retain the hashtag?

Answer: You could adapt the existing removePunctuation to suit your needs. For example:

removeMostPunctuation <- function (x, preserve_intra_word_dashes = FALSE) {
  rmpunct <- function(x) {
    x <- gsub("#", "\002", x)
    x <-…
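The R answer above is cut off, but its trick is visible: swap the characters you want to keep for a placeholder byte, strip punctuation, then swap them back. A Python sketch of the same protect-strip-restore pattern:

```python
import re
import string

def remove_most_punctuation(text, keep="#"):
    """Remove punctuation except the characters in `keep`."""
    placeholder = "\002"  # same control-character trick as the R answer
    text = text.replace(keep, placeholder)
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return text.replace(placeholder, keep)
```

Because "\002" is not a punctuation character, the protected hashtags survive the stripping step and are restored afterwards.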

How to extract data from a text file using R or PowerShell?

﹥>﹥吖頭↗ submitted on 2019-12-03 16:14:22
I have a text file containing data like this:

This is just text
-------------------------------
Username: SOMETHI C: [Text]
Account: DFAG
Finish time: 1-JAN-2011 00:31:58.91
Process ID: 2028aaB
Start time: 31-DEC-2010 20:27:15.30
This is just text
-------------------------------
Username: SOMEGG C: [Text]
Account: DFAG
Finish time: 1-JAN-2011 00:31:58.91
Process ID: 20dd33DB
Start time: 12-DEC-2010 20:27:15.30
This is just text
-------------------------------
Username: SOMEYY C: [Text]
Account: DFAG
Finish time: 1-JAN-2011 00:31:58.91
Process ID: 202223DB
Start time: 15-DEC-2010 20:27:15.30

Is…

Split text on paragraphs where paragraph delimiters are non-standard

早过忘川 submitted on 2019-12-03 14:20:08
If I have text with standard paragraph formatting (a blank line followed by an indent), such as text 1, it's easy enough to extract the paragraphs using text.split("\n\n").

Text 1:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sit amet sapien velit, ac sodales ante. Integer mattis eros non turpis interdum et auctor enim consectetur, etc.

Praesent molestie suscipit bibendum. Donec justo purus, venenatis eget convallis sed, feugiat vitae velit, etc.

But what if I have text with non-standard paragraph formatting such as text 2? No blank lines and variable leading whitespace. Text…
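The excerpt is cut off before text 2, so the following is a sketch under an assumption: when there are no blank lines, a new paragraph is signalled only by leading indentation. Splitting before each indented line with a lookahead handles that case:

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs; falls back to an indentation heuristic."""
    if "\n\n" in text:
        # Standard case: blank-line separated.
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    # Assumed non-standard case: a paragraph starts at any indented line;
    # unindented lines continue the current paragraph.
    parts = re.split(r"\n(?=[ \t]+\S)", text)
    return [re.sub(r"\s+", " ", p).strip() for p in parts if p.strip()]
```

The lookahead `(?=...)` splits on the newline without consuming the indentation, so each paragraph keeps its own first line intact.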

How to add double quotes to a line with SED or AWK?

我是研究僧i submitted on 2019-12-03 09:59:51
I have the following list of words: name,id,3 and I need it double quoted like this: "name,id,3". I have tried sed 's/.*/\"&\"/g' and got: "name,id,3 — which has only one double quote and is missing the closing one. I've also tried awk '{print "\""$1"\""}' with exactly the same result. I need help.

Answer: Your input file has carriage returns at the end of the lines. You need to use dos2unix on the file to remove them. Or you can do this:

sed 's/\(.*\)\r/"\1"/g'

which will remove the carriage return and add the quotes.

Answer: Use this to pipe your input into:

sed 's/^/"/;s/$/"/'

^ is the anchor for…
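The carriage-return diagnosis above can be demonstrated outside of sed/awk; this small Python sketch (not part of the original answers) strips any DOS line ending before quoting, which is exactly why the closing quote appeared to vanish in the question:

```python
def quote_line(line):
    """Wrap a line in double quotes, dropping any \r\n / \n first."""
    # Without the rstrip, a trailing \r would sit between the text and
    # the closing quote and make the terminal overwrite the line start.
    return '"%s"' % line.rstrip("\r\n")
```
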

Identifying verb tenses in python

给你一囗甜甜゛ submitted on 2019-12-03 08:27:28
How can I use Python + NLTK to identify whether a sentence refers to the past/present/future? Can I do this using POS tagging only? That seems a bit inaccurate; it seems to me that I need to consider the sentence context and not only the words alone. Any suggestion for another library that can do that?

Answer: It won't be too hard to do this yourself. This table should help you identify the different verb tenses, and handling them will just be a matter of parsing the result of nltk.pos_tag(string). I'm not sure if you want to get into all of the irregular verb tenses like 'could have been' etc., but if…
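A rough sketch of the tag-based approach the answer suggests. It consumes the (word, tag) pairs that nltk.pos_tag returns, so it runs without NLTK itself, and the heuristics (VBD/VBN for past, MD + "will"/"shall" for future) are deliberately simplified assumptions, not a full treatment of tense:

```python
def guess_tense(tagged):
    """tagged: list of (word, Penn Treebank tag) pairs, e.g. from nltk.pos_tag."""
    tags = [tag for _, tag in tagged]
    words = [word.lower() for word, _ in tagged]
    if "MD" in tags and ("will" in words or "shall" in words):
        return "future"   # modal + will/shall
    if "VBD" in tags or "VBN" in tags:
        return "past"     # past tense / past participle
    if "VBP" in tags or "VBZ" in tags or "VBG" in tags:
        return "present"  # non-3rd/3rd person present, gerund
    return "unknown"
```

As the question notes, this word-level view misclassifies constructions like "could have been"; handling those requires looking at tag sequences, not single tags.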

TFIDF calculating confusion

两盒软妹~` submitted on 2019-12-03 08:17:30
I found the following code on the internet for calculating TFIDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1 +" in the function def idf(word, documentList) so I won't get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word, documentList))))

But I am confused about two things. First, I get negative values in some cases; is this correct? Second, I am confused by lines 62, 63 and 64:

documentNumber = 0
for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TFIDF be…
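The negative values follow directly from the "1 +" change: with idf = log(N / (1 + df)), a word that appears in every document gives log(N / (N + 1)) < 0. A short sketch contrasting that with a smoothed variant (the smoothed formula is a common alternative, e.g. scikit-learn's default, not something from the linked script):

```python
import math

def idf_raw(n_docs, df):
    """The question's version: negative whenever df + 1 > n_docs."""
    return math.log(n_docs / (1 + df))

def idf_smooth(n_docs, df):
    """A common smoothed variant: log((1+N)/(1+df)) + 1, always positive."""
    return math.log((1 + n_docs) / (1 + df)) + 1
```

So the negative values are "correct" for the modified formula, but they mean a very common word gets an idf below zero; switching to a smoothed form avoids that.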

Given a document, select a relevant snippet

谁都会走 submitted on 2019-12-03 04:34:47
Question: When I ask a question here, the tool tips for the questions returned by the auto search give the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a question?

My first idea is to trim any leading sentences that contain only words in some list (for instance, stop words, plus words from the title, plus words…
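The questioner's own heuristic (drop leading sentences whose words all come from a stop list or the title) is concrete enough to sketch; the tokenization and the exact word lists here are assumptions:

```python
import re

def trim_leading(sentences, title, stopwords):
    """Drop leading sentences that add no words beyond stopwords + title."""
    boring = set(stopwords) | set(re.findall(r"\w+", title.lower()))
    for i, sent in enumerate(sentences):
        words = set(re.findall(r"\w+", sent.lower()))
        if words - boring:        # this sentence contributes something new
            return sentences[i:]
    return []
```

The snippet shown in the tool tip would then start at the first sentence that carries information not already visible in the title.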

How to get bag of words from textual data? [closed]

梦想的初衷 submitted on 2019-12-03 04:23:34
Question: I am working on a prediction problem using a large textual dataset. I am implementing the Bag of Words model. What should be the best way to get the bag of words? Right now, I have tf-idf of the various words, and the number of words is too large to use for further assignments. If I…
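A minimal bag-of-words sketch using only the standard library; the top-k vocabulary cut is one simple answer (an assumption, not the asker's code) to the "too many words" problem the question mentions:

```python
import re
from collections import Counter

def bag_of_words(docs, top_k=None):
    """Per-document word counts; optionally keep only the top_k words overall."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    if top_k is not None:
        # Restrict the vocabulary to the most frequent words across all docs.
        overall = Counter(w for toks in tokenized for w in toks)
        vocab = {w for w, _ in overall.most_common(top_k)}
        tokenized = [[w for w in toks if w in vocab] for toks in tokenized]
    return [Counter(toks) for toks in tokenized]
```

For larger pipelines the same idea is what CountVectorizer's max_features parameter does in scikit-learn.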