text-parsing

Strategy for parsing natural language descriptions into structured data

烈酒焚心 posted on 2019-12-04 00:53:10
I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse out the metadata into a structured format (see requirements below to see what I'm trying to do). I've looked around here and other places, but have found nothing that gives high-level advice on which direction to follow. So, I'll put it to the smart people :-): What's the best / simplest way to solve this problem? Should I use a natural language parser, DSL, Lucene/Solr, or some other

How to create an array from the lines of a command's output

半腔热情 posted on 2019-12-03 08:46:08
I have a file called failedfiles.txt with the following content: failed1 failed2 failed3 I need to use grep to return the content on each line in that file, and save the output in a list to be accessed. So I want something like this: temp_list=$(grep "[a-z]" failedfiles.txt) However, the problem with this is that when I type echo ${temp_list[0]} I get the following output: failed1 failed2 failed3 But what I want is when I do: echo ${temp_list[0]} to print failed1 and when I do: echo ${temp_list[1]} to print failed2 Thanks. You did not create an array. What you did was Command Substitution

Resume/CV Parsing in PHP [closed]

这一生的挚爱 posted on 2019-12-03 08:44:15
We are developing a requirements-based social media site using LAMP. For that we want to do Resume/CV parsing in PHP. We were able to parse the email address and phone number, but we are not sure how to parse other information such as full name, address, education, employment, etc. from the resume. Plus, a resume/CV can be in various formats like doc, html, rtf

Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

蹲街弑〆低调 posted on 2019-12-03 08:28:14
I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure. In terms of getting counts for occurrence: vocabulary = ['hi', 'bye', 'run away!'] corpus = ['run away!'] cv = CountVectorizer(vocabulary=vocabulary) X = cv.fit_transform(corpus) print(X.toarray()) gives: [[0 0 0 0]] What I'm realizing is that the CountVectorizer will break the corpus into what I believe are unigrams: vocabulary = ['hi', 'bye', 'run'] corpus = ['run away!'] cv = CountVectorizer(vocabulary=vocabulary) X = cv.fit_transform(corpus

Java: How to read a file line by line while ignoring “\n”

陌路散爱 posted on 2019-12-03 03:46:36
I'm trying to read a tab-separated text file line by line. The lines are separated by a carriage return plus line feed ("\r\n"), and a bare line feed ("\n") is allowed within the tab-separated text fields. Since I want to read the file line by line, I want my program to ignore a standalone "\n". Unfortunately, BufferedReader treats both as line separators. How can I modify my code to ignore the standalone "\n"? try { BufferedReader in = new BufferedReader(new FileReader(flatFile)); String line = null; while ((line = in.readLine()) != null) { String cells[] = line.split("\t"); System.out
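The question above is about Java, but the underlying idea (treat only "\r\n" as a record separator and leave a bare "\n" inside the field) can be sketched in Python, which is used here purely for illustration. This is a minimal sketch, not the answer from the original thread: the helper name `read_crlf_records` and the sample bytes are hypothetical. It relies on the documented behaviour of the `newline` argument of Python's text I/O layer, where `newline="\r\n"` makes lines end only at that exact sequence.

```python
import io


def read_crlf_records(stream):
    # Yield one list of tab-separated cells per record.
    # The stream must have been opened with newline="\r\n" so that
    # a lone line feed does not end a record.
    for record in stream:
        if record.endswith("\r\n"):
            record = record[:-2]  # drop the CR LF terminator
        yield record.split("\t")


# Hypothetical sample: the second cell of the first record holds a bare "\n".
raw = b"a\tmulti\nline\tc\r\nd\te\tf\r\n"

# For a real file this would be open(path, newline="\r\n").
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="\r\n")
rows = list(read_crlf_records(stream))
```

With this approach the embedded line feed survives as part of the cell, so any later processing has to decide whether to keep or replace it.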

Disturbing odd behavior/bug in Python itertools groupby?

最后都变了- posted on 2019-12-02 06:54:22
Question: I am using itertools.groupby to parse a short tab-delimited text file. The text file has several columns, and all I want to do is group all the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value in variable x. I tried to do this using csv.DictReader and itertools.groupby. In the table, there are 8 rows that match this criterion, so 8 entries should be returned. Instead groupby returns two sets of entries, one
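The excerpt is cut off before the cause is identified, so the following is only one plausible explanation: `itertools.groupby` groups consecutive items with equal keys, so unsorted input can yield the same key more than once. A minimal sketch with made-up rows (the `value` field and the data are hypothetical; only the column name `name2` comes from the question):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, shaped like what csv.DictReader would yield.
rows = [
    {"name2": "x", "value": 1},
    {"name2": "y", "value": 2},
    {"name2": "x", "value": 3},
]
key = itemgetter("name2")

# Without sorting, "x" appears as two separate groups.
unsorted_groups = [
    (k, [r["value"] for r in g]) for k, g in groupby(rows, key=key)
]

# Sorting by the same key first makes each key form a single group.
sorted_groups = [
    (k, [r["value"] for r in g])
    for k, g in groupby(sorted(rows, key=key), key=key)
]
```

If only one value is of interest, a plain list comprehension filtering on `row["name2"] == x` avoids the sorting step altogether.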

Delete row which has more than X columns in a csv

空扰寡人 posted on 2019-12-02 04:41:07
I need to delete all the rows in a csv file which have more than a certain number of columns. This happens because sometimes the code, which generates the csv file, skips some values and prints the following on the same line. Example: Consider the following file to parse. I want to remove all the rows which have more than 3 columns (i.e. the columns of the header): timestamp,header2,header3 1,1val2,1val3 2,2val2,2val3 3,4,4val2,4val3 5val1,5val2,5val3 6,6val2,6val3 The output file I would like to have is: timestamp,header2,header3 1,1val2,1val3 2,2val2,2val3 5val1,5val2,5val3 6,6val2,6val3 I
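One way to approach this, sketched in Python as an illustration rather than as the answer from the original thread: read the header, then keep only rows with at most as many fields as the header. The function name `drop_wide_rows` is hypothetical; the sample data is the example from the question.

```python
import csv
import io


def drop_wide_rows(lines, max_columns=None):
    # Return the header plus every row with at most max_columns fields.
    # When max_columns is None, the header width is used as the limit.
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return []  # empty input
    limit = len(header) if max_columns is None else max_columns
    return [header] + [row for row in reader if len(row) <= limit]


sample = io.StringIO(
    "timestamp,header2,header3\n"
    "1,1val2,1val3\n"
    "2,2val2,2val3\n"
    "3,4,4val2,4val3\n"
    "5val1,5val2,5val3\n"
    "6,6val2,6val3\n"
)
kept = drop_wide_rows(sample)

# Write the surviving rows back out as CSV text.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(kept)
cleaned = out.getvalue()
```

Using the `csv` module instead of counting commas means quoted fields that contain a comma are not miscounted.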

Randomizing text between delimiters

白昼怎懂夜的黑 posted on 2019-12-01 17:14:44
Question: I have this simple input: "I have {red;green;orange} fruit and cup of {tea;coffee;juice}". I use Perl to identify patterns between the two external brace delimiters { and }, and randomize the fields inside using the internal delimiter ;. I'm getting this output: "I have green fruit and cup of coffee". This is my working Perl script: perl -plE 's!\{(.*?)\}!@x=split/;/,$1;$x[rand@x]!ge' <<< 'I have {red;green;orange} fruit and cup of {tea;coffee;juice}' My task is to process this input format: I have { {red
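For comparison, the flat (non-nested) case handled by the Perl one-liner can be sketched in Python. This covers only the simple input; the nested format the question goes on to describe is cut off in the excerpt and is not addressed here. The function name `randomize` and the optional `rng` parameter are hypothetical.

```python
import random
import re


def randomize(text, rng=random):
    # Replace each {a;b;c} group with one randomly chosen alternative.
    # The non-greedy match stops at the first closing brace, so this
    # sketch does not handle nested braces.
    return re.sub(
        r"\{(.*?)\}",
        lambda match: rng.choice(match.group(1).split(";")),
        text,
    )


sentence = "I have {red;green;orange} fruit and cup of {tea;coffee;juice}"
result = randomize(sentence)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient for testing.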

A better way to parse integer values from a T-SQL delimited string

血红的双手。 posted on 2019-12-01 10:11:30
Question: I have a SQL Server 2008 R2 stored procedure that contains an algorithm for parsing out integers from a delimited string. Here's an example of the SQL code that I made for looping through the delimited string and extracting any numbers that may exist in the delimited string: -- Create a delimited list for testing DECLARE @NumericList nvarchar(MAX) = N'1, 33,44 ,55, foo ,666,77 77,8,bar,9,10' -- Declare the delimiter DECLARE @ListDelimiter VARCHAR(1) = ',' -- Remove white space from the list SET
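The intended behaviour can be illustrated outside T-SQL with a short Python sketch. It is not a translation of the stored procedure, whose remaining code is cut off in the excerpt. One assumption is made loudly: a token such as "77 77" is treated as invalid and skipped, whereas the original code may instead collapse the inner space. The name `parse_ints` is hypothetical.

```python
import re

# ASSUMPTION: a token counts as an integer only if, after trimming
# surrounding whitespace, it consists solely of ASCII digits.
_DIGITS = re.compile(r"\d+", re.ASCII)


def parse_ints(delimited, delimiter=","):
    # Return the integer tokens of a delimited string, in order.
    values = []
    for token in delimited.split(delimiter):
        token = token.strip()
        if _DIGITS.fullmatch(token):
            values.append(int(token))
    return values


numeric_list = "1, 33,44 ,55, foo ,666,77 77,8,bar,9,10"
numbers = parse_ints(numeric_list)
```

In SQL Server itself a set-based split is usually preferable to a loop, but which approach fits depends on the server version and is left open here.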

Match the body of a function using Regex

拜拜、爱过 posted on 2019-12-01 08:39:08
Given a dummy function as such: public function handle() { if (isset($input['data']) { switch($data) { ... } } else { switch($data) { ... } } } My intention is to get the contents of that function, the problem is matching nested patterns of curly braces {...} . I've come across recursive patterns but couldn't get my head around a regex that would match the function's body. I've tried the following (no recursion): $pattern = "/function\shandle\([a-zA-Z0-9_\$\s,]+\)?". // match "function handle(...)" '[\n\s]?[\t\s]*'. // regardless of the indentation preceding the { '{([^{}]*)}/'; // find
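An alternative to a recursive regex is to locate the opening brace and count nesting depth until it returns to zero. The sketch below shows that technique in Python for illustration; it is not the approach from the original thread. It assumes that braces never appear inside string literals or comments in the source being scanned, and the name `extract_body` and the sample source are hypothetical.

```python
def extract_body(source, name):
    # Return the text between the outermost braces of `function name(...)`.
    # ASSUMPTION: no braces occur inside strings or comments.
    start = source.find(f"function {name}(")
    if start == -1:
        return None
    open_pos = source.find("{", start)
    if open_pos == -1:
        return None
    depth = 0
    for i in range(open_pos, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                # Matching close for the function's opening brace.
                return source[open_pos + 1:i]
    return None  # unbalanced braces


# Hypothetical PHP-like source with nested blocks.
php = "public function handle() { if ($a) { b(); } else { c(); } } function other() { }"
body = extract_body(php, "handle")
```

For real PHP code, a tokenizer-based approach is more robust than either brace counting or a regex, since it knows about strings and comments.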