text-processing

Parse string into a tree structure?

独自空忆成欢 submitted on 2019-12-05 06:42:10
I'm trying to figure out how to parse a string in this format into a tree-like data structure of arbitrary depth.

{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}

[[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]]

I've tried playing with some regular expressions for this (such as #"{([^{}]*)}"), but everything I've tried seems to "flatten" the tree into a big list of lists. I could be approaching this from the wrong angle, or maybe a regex just isn't the right tool for the job. Thanks for your help! Don't
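A regex alone cannot track nesting depth, which is why the matches come out flat; a short recursive-descent parser can. A sketch follows in Python rather than Clojure (the #"…" literal suggests the question is Clojure; all names here are mine). Note it keeps a mixed alternative such as globe{.|!} as its own sublist, a slightly different nesting than the expected output shows for that case.

```python
def parse_group(s, i=0):
    """Parse a '{a|b {c|d}|...}' group starting at s[i] == '{'.
    Returns (list_of_alternatives, index_after_closing_brace)."""
    assert s[i] == "{"
    i += 1
    alternatives, current, buf = [], [], ""

    def flush():
        nonlocal buf
        if buf.strip():
            current.append(buf.strip())
        buf = ""

    while i < len(s):
        c = s[i]
        if c == "{":                      # recurse into a nested group
            flush()
            sub, i = parse_group(s, i)
            current.append(sub)
            continue
        if c == "|":                      # end of one alternative
            flush()
            alternatives.append(current[0] if len(current) == 1 else current)
            current = []
        elif c == "}":                    # end of this group
            flush()
            alternatives.append(current[0] if len(current) == 1 else current)
            return alternatives, i + 1
        else:
            buf += c
        i += 1
    raise ValueError("unbalanced braces")
```

On the sample string this yields [[['Hello big', 'Hi', 'Hey'], ['world', 'earth']], [['Goodbye', 'farewell'], ['planet', 'rock', ['globe', ['.', '!']]]]].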

Matlab - read file with varying line lengths

放肆的年华 submitted on 2019-12-05 02:43:21
Question: I have a data file with a varying amount of data per line that I would like to load into Matlab as an array. As an example, suppose the data file looks like

1 2
3 4 5 6
7
8 9 10

I want to read it into Matlab as an array that looks like

1 2 nan nan
3 4 5 6
7 nan nan nan
8 9 10 nan

I can do this with a for loop over all lines of the file, but my files are very large and I am looking for an efficient solution. Any ideas would be highly appreciated. If it helps, I also know an upper bound on the
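In any language the core idea is the same: split each line into numbers, then pad short rows to a common width with NaN. A Python sketch of that padding step (pad_ragged is a hypothetical helper, not a Matlab builtin):

```python
import math

def pad_ragged(lines, width=None):
    """Pad whitespace-separated numeric rows to equal width with NaN."""
    rows = [[float(x) for x in line.split()] for line in lines if line.strip()]
    width = width or max(len(r) for r in rows)
    return [r + [math.nan] * (width - len(r)) for r in rows]
```

Since a file object iterates over its lines, `with open("data.txt") as f: table = pad_ragged(f)` reads the whole file without an explicit per-line loop in user code.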

Which function should I use to read unstructured text file into R? [closed]

二次信任 submitted on 2019-12-05 02:01:14
This is my first ever question here and I'm new to R, trying to figure out my first step in data processing, so please keep it easy : ) I'm wondering what would be the best function, and a useful data structure, in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no newline characters in it. Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it? Thank you in advance. PN P.S. If I use "." as my delimiter, it would treat things like "Mr." as a
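Whatever function ends up reading the file, splitting on a bare "." will indeed break abbreviations like "Mr.". One common heuristic is to split only after sentence punctuation that is followed by whitespace and a capital letter, with lookbehinds excluding a few known abbreviations. A Python sketch of the idea (not an R answer; the abbreviation list is illustrative, and this is still a heuristic rather than a real sentence tokenizer):

```python
import re

# Split after . ! ? followed by whitespace and a capital letter,
# except directly after a few known abbreviations (fixed-width lookbehinds).
SENTENCE_BREAK = re.compile(r"(?<!Mr\.)(?<!Mrs\.)(?<!Dr\.)(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return SENTENCE_BREAK.split(text)
```

For example, split_sentences("Mr. Smith went home. He slept.") keeps "Mr. Smith went home." together as one sentence.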

processing text from a non-flat file (to extract information as if it *were* a flat file)

牧云@^-^@ submitted on 2019-12-05 00:51:39
Question: I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables):

time subject var1 var2 var3
t1 subjectA ...
t2 subjectB ...

and

subject name
subjectA nameA
subjectB nameB

However, the simulation writes a data file in a format similar to the following:

time t1 description
subjectA nameA var1 var2 var3
subjectB nameB var1 var2 var3
time t2 description
subjectA nameA var1 var2 var3
subjectB nameB var1 var2 var3
...(and
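A single pass that remembers the current "time" block is usually enough to flatten such a file into the two tables above. A Python sketch under the field layout assumed from the example (parse_blocks and its return shape are my own):

```python
def parse_blocks(lines):
    """Flatten 'time <t> .../<subject> <name> <vars...>' blocks into
    (rows, names): rows = [time, subject, var1, ...], names = {subject: name}."""
    rows, names, time = [], {}, None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "time":
            time = parts[1]               # e.g. 't1'; trailing 'description' ignored
        else:
            subject, name, *values = parts
            names[subject] = name
            rows.append([time, subject, *values])
    return rows, names
```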

tm custom removePunctuation except hashtag

北城以北 submitted on 2019-12-05 00:47:50
Question: I have a corpus of tweets from Twitter. I clean this corpus (removeWords, tolower, delete URLs) and finally also want to remove punctuation. Here is my code:

tweetCorpus <- tm_map(tweetCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)

The problem now is that by doing so I also lose the hashtag (#). Is there a way to remove punctuation with tm_map but keep the hashtag?

Answer 1: You could adapt the existing removePunctuation to suit your needs. For example: removeMostPunctuation<-
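The (truncated) answer adapts removePunctuation in R. The same idea expressed in Python is to build the deletion set from all punctuation minus the characters to keep (remove_most_punctuation is a made-up name mirroring the R answer's):

```python
import string

def remove_most_punctuation(text, keep="#"):
    """Strip all ASCII punctuation except the characters in `keep`."""
    drop = "".join(c for c in string.punctuation if c not in keep)
    return text.translate(str.maketrans("", "", drop))
```

For example, remove_most_punctuation("Loving #rstats! (mostly)") drops the "!" and parentheses but leaves the hashtag in place.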

Count the number of unique words and occurrence of each word from txt file

一个人想着一个人 submitted on 2019-12-04 22:10:56
Currently I am trying to create an application that does some text processing: it reads in a text file, then uses a dictionary to create an index of words. Technically it works like this: the program runs, reads the text file, and checks each word to see whether it is already in the index, and what the id for it as a unique word is. It then prints out the index number and the total number of appearances for each word, continuing through the entire file, and produces something like this: http://pastebin.com/CjtcYchF Here is an example of the text file I'm inputting: http://pastebin
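The usual structure for this is a dictionary keyed by word, holding a first-appearance id alongside a running count. A minimal Python sketch (word_index and its tokenisation rule are my assumptions, not the asker's code):

```python
import re
from collections import Counter

def word_index(text):
    """Map each word to (first-appearance id, occurrence count)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    ids = {}
    for w in words:
        ids.setdefault(w, len(ids) + 1)   # id assigned in order of first appearance
    return {w: (ids[w], counts[w]) for w in counts}
```

For "the cat and the hat" this gives "the" id 1 with count 2, "cat" id 2 with count 1, and so on.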

Parse log files programmatically in .NET

冷暖自知 submitted on 2019-12-04 20:55:32
We have a large number (read: 50,000) of relatively small (read: under 500K, typically under 50K) log files created using log4net from our client application. A typical log looks like:

Start Painless log
Framework:8.1.7.0
Application:8.1.7.0
2010-05-05 19:26:07,678 [Login ] INFO Application.App.OnShowLoginMessage(194) - Validating Credentials...
2010-05-05 19:26:08,686 [1 ] INFO Application.App.OnShowLoginMessage(194) - Checking for Application Updates...
2010-05-05 19:26:08,830 [1 ] INFO Framework.Globals.InstanceStartup(132) - Application Startup
2010-05-05 19:26:09,293 [1 ] INFO Framework
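Log lines with a fixed layout like this parse cleanly with one anchored regex; header lines such as "Start Painless log" simply fail to match and can be skipped. A Python sketch (the pattern is inferred from the sample above; in .NET the same pattern would work with System.Text.RegularExpressions):

```python
import re

LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"   # timestamp
    r" \[(.*?)\]\s+(\w+)\s+(\S+) - (.*)$"              # [thread] LEVEL source - message
)

def parse_line(line):
    m = LOG_LINE.match(line)
    if not m:
        return None   # header/footer lines won't match
    ts, thread, level, source, message = m.groups()
    return {"ts": ts, "thread": thread.strip(), "level": level,
            "source": source, "message": message}
```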

TFIDF calculating confusion

本小妞迷上赌 submitted on 2019-12-04 14:14:30
Question: I found the following code on the internet for calculating TF-IDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1 +" in the function def idf(word, documentList) so I won't get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things: I get negative values in some cases; is this correct? And I am confused by lines 62, 63 and 64. Code:

documentNumber = 0
for word in documentList[documentNumber]
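The negative values are a direct consequence of the smoothing: with the +1 inside the denominator, a word that appears in all N documents gives log(N / (N + 1)) < 0. Moving the smoothing outside the ratio is one common way to keep idf non-negative (several variants exist; this is one choice, not the linked script's). A small Python check:

```python
import math

def idf_naive(word, documents):
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + df))      # < 0 when df == len(documents)

def idf_smoothed(word, documents):
    df = sum(1 for doc in documents if word in doc)
    return math.log(1 + len(documents) / (1 + df))  # always positive

docs = [["a", "b"], ["a", "c"], ["a", "d"]]         # "a" is in every document
```

Here idf_naive("a", docs) is log(3/4), about -0.29, while idf_smoothed("a", docs) stays positive.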

Balanced word wrap (Minimum raggedness) in PHP

风格不统一 submitted on 2019-12-04 08:24:06
Question: I'm going to write a word-wrap algorithm in PHP. I want to split small chunks of text (short phrases) into n lines of at most m characters (n is not given, so there will be as many lines as needed). The peculiarity is that the line lengths (in characters) have to be as balanced as possible across the lines. Example of input text:

How to do things

Wrong output (this is the normal word-wrap behavior), m=6:

How to
do
things

Desired output, always m=6:

How
to do
things

Does anyone have suggestions or
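Minimum raggedness is the classic dynamic-programming word wrap: minimise the sum of squared leftover spaces per line, with the last line costing nothing. A Python sketch of the DP (balanced_wrap is my name; porting the loops to PHP is mechanical):

```python
def balanced_wrap(text, m):
    """Minimum-raggedness wrap: choose line breaks minimising the sum of
    squared leftover spaces (the last line costs nothing)."""
    words = text.split()
    n = len(words)
    INF = float("inf")

    def line_cost(i, j):                  # cost of putting words[i..j] on one line
        length = sum(len(w) for w in words[i:j + 1]) + (j - i)
        if length > m:
            return INF
        return 0 if j == n - 1 else (m - length) ** 2

    best = [INF] * n + [0.0]              # best[i]: min cost wrapping words[i:]
    choice = [0] * n
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            c = line_cost(i, j)
            if c + best[j + 1] < best[i]:
                best[i], choice[i] = c + best[j + 1], j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:choice[i] + 1]))
        i = choice[i] + 1
    return lines
```

For "How to do things" with m=6 this picks "How" / "to do" / "things" (raggedness 9 + 1 + 0 = 10) over the greedy "How to" / "do" / "things" (0 + 16 + 0 = 16).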

One nearest neighbour using awk

谁说胖子不能爱 submitted on 2019-12-04 04:42:07
Question: This is what I am trying to do using the AWK language. I have a problem mainly with step 2. I have shown a sample dataset, but the original dataset consists of 100 fields and 2000 records.

Algorithm:
1) initialize accuracy = 0
2) for each record r, find the closest other record, o, in the dataset using the distance formula

To find the nearest neighbour for r0, I need to compare r0 with r1 to r9 and do the math as follows:

square(abs(r0.c1 - r1.c1)) + square(abs(r0.c2 - r1.c2)) + ... + square(abs(r0.c5 - r1.c5
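Step 2 is a leave-one-out nearest-neighbour scan: for each record, compare it against every other record using the squared-distance sum shown above (the abs is redundant under squaring). A Python sketch of the whole accuracy loop (the record layout and labels are my assumptions, not the asker's data):

```python
def nearest_neighbour_accuracy(records):
    """records: list of (features, label). For each record, find the closest
    *other* record by squared Euclidean distance; accuracy is the fraction
    whose nearest neighbour shares its label."""
    correct = 0
    for i, (x, label) in enumerate(records):
        best_j, best_d = None, float("inf")
        for j, (y, _) in enumerate(records):
            if i == j:                          # skip the record itself
                continue
            d = sum((a - b) ** 2 for a, b in zip(x, y))
            if d < best_d:
                best_d, best_j = d, j
        if records[best_j][1] == label:
            correct += 1
    return correct / len(records)
```

In AWK the same double loop runs over NR records with the fields of each record cached in arrays.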