text-processing

What is the preferred way to implement 'yield' in Scala?

怎甘沉沦 submitted on 2019-12-03 04:19:21
Question: I am writing code for my PhD research and am starting to use Scala. I often have to do text processing. I am used to Python, whose 'yield' statement is extremely useful for implementing complex iterators over large, often irregularly structured text files. Similar constructs exist in other languages (e.g. C#), for good reason. Yes, I know there have been previous threads on this. But they look like hacked-up (or at least badly explained) solutions that don't clearly work well and often have
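For readers who have not used the Python idiom the question refers to, here is a minimal sketch of a generator over irregular text. It is only an illustration of the behavior the asker wants to reproduce, not a Scala solution; the function name `iter_records` and the blank-line record format are assumptions made for the example.

```python
def iter_records(lines):
    """Lazily yield one record at a time from an iterable of text lines.

    Assumption for this sketch: a record is a run of non-blank lines,
    and records are separated by one or more blank lines.
    """
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            # Non-blank line: it belongs to the current record.
            record.append(line)
        elif record:
            # Blank line after some content: the record is complete.
            yield record
            record = []
    if record:
        # Emit the final record if the input did not end with a blank line.
        yield record


if __name__ == "__main__":
    sample = ["a\n", "b\n", "\n", "\n", "c\n"]
    for rec in iter_records(sample):
        print(rec)
```

Because the function yields as it goes, it can be fed an open file object and never needs to hold the whole file in memory, which is the property the question is asking how to get in Scala.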

NLTK for Named Entity Recognition

拜拜、爱过 submitted on 2019-12-03 01:43:22
Question: I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and I wrote this quick snippet to test it out:

    import nltk

    sentence = "Let's meet tomorrow at 9 pm"
    tokens = nltk.word_tokenize(sentence)
    pos_tags = nltk.pos_tag(tokens)
    print(nltk.ne_chunk(pos_tags, binary=True))

I was assuming that it would identify the date (tomorrow) and time (9 pm). But, surprisingly, it failed to recognize them. I get the following result when I run my

Balanced word wrap (Minimum raggedness) in PHP

瘦欲@ submitted on 2019-12-03 00:31:38
I'm going to write a word wrap algorithm in PHP. I want to split small chunks of text (short phrases) into n lines of at most m characters (n is not given, so there will be as many lines as needed). The peculiarity is that line lengths (in characters) have to be as balanced as possible across lines.

Example of input text:

    How to do things

Wrong output (this is the normal word-wrap behavior), m=6:

    How to
    do
    things

Desired output, again with m=6:

    How
    to do
    things

Does anyone have suggestions or guidelines on how to implement this function? Basically, I'm looking for something to pretty-print short
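One common way to approach this is the "minimum raggedness" dynamic program: choose line breaks that minimize the sum of squared unused space per line. The sketch below is in Python rather than PHP, and it is only one possible reading of "balanced"; the function name `balanced_wrap`, the squared-slack cost, and the choice to count the last line in the cost are all assumptions of this example, not something stated in the question.

```python
def balanced_wrap(text, max_width):
    """Wrap text into lines of at most max_width characters, minimizing
    the sum of squared trailing space over all lines (last line included).
    """
    words = text.split()
    n = len(words)
    if n == 0:
        return []
    for word in words:
        if len(word) > max_width:
            raise ValueError("word longer than max_width: %r" % word)

    infinity = float("inf")
    # best[i] is the minimal cost of wrapping words[i:];
    # next_break[i] is the index of the first word of the following line.
    best = [infinity] * (n + 1)
    next_break = [n] * (n + 1)
    best[n] = 0

    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > max_width:
                break
            cost = (max_width - length) ** 2 + best[j + 1]
            if cost < best[i]:
                best[i] = cost
                next_break[i] = j + 1

    # Walk the recorded break points to rebuild the lines.
    lines = []
    i = 0
    while i < n:
        j = next_break[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines


if __name__ == "__main__":
    print("\n".join(balanced_wrap("How to do things", 6)))
```

With the question's input and m=6 this produces the three lines of the desired output, because "How / to do / things" costs 9 + 1 + 0 = 10 while the greedy "How to / do / things" costs 0 + 16 + 0 = 16. The same recurrence ports directly to PHP arrays.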

How to convert all text to lowercase in Vim

和自甴很熟 submitted on 2019-12-03 00:04:40
Question: How do you convert all text in Vim to lowercase? Is it even possible? Answer 1: If you really mean small caps, then no, that is not possible, just as it isn't possible to convert text to bold or italic in any text editor (as opposed to a word processor). If you want to convert text to lowercase, create a visual block and press u (or U to convert to uppercase). Tilde (~) in command mode reverses the case of the character under the cursor. If you want to see all text in Vim in small caps, you might

How to obtain the first letter in a Bash variable?

我是研究僧i submitted on 2019-12-02 21:32:35
I have a Bash variable, $word, which is sometimes a word or sentence, e.g.:

    word="tiger"

Or:

    word="This is a sentence."

How can I make a new Bash variable which is equal to only the first letter found in the variable? E.g., the above would be:

    echo $firstletter
    t

Or:

    echo $firstletter
    T

    initial="$(echo $word | head -c 1)"

Every time you say "first" in your problem description, head is a likely solution.

    word="tiger"
    firstletter=${word:0:1}

    word=something
    first=${word::1}

A portable way to do it is to use parameter expansion (which is a POSIX feature):

    $ word='tiger'
    $ echo "${word%"${word#?}

What is the preferred way to implement 'yield' in Scala?

情到浓时终转凉″ submitted on 2019-12-02 18:39:59
I am writing code for my PhD research and am starting to use Scala. I often have to do text processing. I am used to Python, whose 'yield' statement is extremely useful for implementing complex iterators over large, often irregularly structured text files. Similar constructs exist in other languages (e.g. C#), for good reason. Yes, I know there have been previous threads on this. But they look like hacked-up (or at least badly explained) solutions that don't clearly work well and often have unclear limitations. I would like to write code something like this: import generator._ def yield_values

Given a document, select a relevant snippet

前提是你 submitted on 2019-12-02 17:45:22
When I ask a question here, the tooltips for the questions returned by the auto search give the first little bit of each question, but a decent percentage of them don't give any text that is more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a question? My first idea is to trim any leading sentences that contain only words in some list (for instance, stop words, plus words from the title, plus words from the SO corpus that have very weak correlation with tags, that is, that are equally likely to occur in
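The "first idea" above (drop leading sentences made up entirely of ignorable words) can be sketched as follows. This is only an illustration under stated assumptions: the function name `trim_leading_filler`, the regex-based sentence splitter, and the tiny stop-word set are all made up for the example, and the tag-correlation word list mentioned in the question is not modeled at all.

```python
import re

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def trim_leading_filler(text, title, stop_words):
    """Drop leading sentences that contain only stop words or title words.

    Returns the text starting at the first sentence that has at least one
    word outside the ignorable set, or "" if no sentence qualifies.
    """
    ignorable = {word.lower() for word in stop_words}
    ignorable |= set(WORD_PATTERN.findall(title.lower()))

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    for index, sentence in enumerate(sentences):
        words = WORD_PATTERN.findall(sentence.lower())
        if any(word not in ignorable for word in words):
            return " ".join(sentences[index:])
    return ""


if __name__ == "__main__":
    stops = {"i", "have", "of", "it", "is"}  # hypothetical stop-word list
    body = "How to sort a list? I have a list of tuples. It is big."
    print(trim_leading_filler(body, "How to sort a list", stops))
```

In this example the first sentence merely repeats the title, so the snippet starts at "I have a list of tuples." A real filter would need a proper sentence tokenizer and a much larger word list.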

NLTK for Named Entity Recognition

断了今生、忘了曾经 submitted on 2019-12-02 15:13:05
I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and I wrote this quick snippet to test it out:

    import nltk

    sentence = "Let's meet tomorrow at 9 pm"
    tokens = nltk.word_tokenize(sentence)
    pos_tags = nltk.pos_tag(tokens)
    print(nltk.ne_chunk(pos_tags, binary=True))

I was assuming that it would identify the date (tomorrow) and time (9 pm). But, surprisingly, it failed to recognize them. I get the following result when I run the above code:

    (S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)

Can someone help me understand

How do I extract column from CSV with quoted commas, using the shell?

懵懂的女人 submitted on 2019-12-02 15:05:11
Question: I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.

    foo,bar,baz,quux
    11,"first line, second column",13.0,6
    210,"second column of second line",23.1,5

(Of course it's longer, and the number of quoted commas is not necessarily one or zero, nor is the text predictable.) The text might also have (escaped) double-quotes within double-quotes, or not have double-quotes at all for a typically quoted field. The only assumption we

How do I extract column from CSV with quoted commas, using the shell?

萝らか妹 submitted on 2019-12-02 07:53:52
I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.

    foo,bar,baz,quux
    11,"first line, second column",13.0,6
    210,"second column of second line",23.1,5

(Of course it's longer, and the number of quoted commas is not necessarily one or zero, nor is the text predictable.) The text might also have (escaped) double-quotes within double-quotes, or not have double-quotes at all for a typically quoted field. The only assumption we can make is that there are no quoted newlines, so we can split lines trivially using \n . Now, I'd like
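Splitting on commas breaks on input like this, so one option is to hand the parsing to a real CSV parser. The sketch below uses Python's standard `csv` module, which can be called from a shell pipeline; it is not a pure-shell answer. The function name `extract_column` is made up for the example, and it assumes embedded double-quotes are escaped by doubling (`""`), which is the module's default; backslash-escaped quotes would need a different dialect setting.

```python
import csv
import io


def extract_column(csv_text, column_index):
    """Return the values of one column (0-based), honoring quoted commas.

    Rows that are too short to have the requested column are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [row[column_index] for row in reader if len(row) > column_index]


if __name__ == "__main__":
    sample = (
        "foo,bar,baz,quux\n"
        '11,"first line, second column",13.0,6\n'
        '210,"second column of second line",23.1,5\n'
    )
    for value in extract_column(sample, 1):
        print(value)
```

To use it on a file from the shell, the same function can be applied to `sys.stdin.read()` in a small script, with the column number taken from the command line.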