Multiple files input to Stanford NER, preserving naming for each output

Question

I have many files (the NYTimes corpus for '05, '06, and '07) that I want to run through the Stanford NER. "Easy," you might think, "just follow the commands in the README doc." If you thought that just now, you would be mistaken, because my situation is a bit more complicated. I don't want everything output into one big jumbled mess; I want to preserve the naming structure of each file. For example, one file is named 1822873.xml, and I processed it earlier using the following command:

java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile /home/matthias/Workbench/SUTD/nytimes_corpus/1822873.xml -outputFormat inlineXML >> output.curtis

If I were to follow this question, i.e. list many files in the command one after the other and then redirect the output somewhere, wouldn't it just send them all to the same file? That sounds like a headache and a disaster of the highest order.

Is there some way to send each file to a separate output file, so that, for instance, our old friend 1822873.xml would emerge from this process as, say, 1822873.output.xml, and likewise for each of the other thousand-odd files? Please keep in mind that I'm trying to achieve this expeditiously.

I guess this should be possible, but what is the best way to do it? With some kind of terminal command, or maybe by writing a small script?

Maybe one among you has some experience with this type of thing.

Thank you for your consideration.

Answer 1:

If you run the StanfordCoreNLP pipeline with the -filelist and -outputDirectory options, you can pass in a list of the files you wish to process and the directory in which you would like to save the processed files. Example:

java -cp "*" -mx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -prop annotators.prop -filelist list_of_files_to_process.txt -outputDirectory "my_output_directory"

For reference, here are the contents of list_of_files_to_process.txt:

C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0334.txt
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0335.txt
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0336.txt
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0337.txt
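
A list like this does not have to be written by hand. As a rough sketch, assuming a Unix-like shell with find and sort available, the list can be generated from a directory of inputs. The helper name make_filelist is made up for illustration and is not part of CoreNLP; the '*.xml' pattern assumes inputs named like those in the question.

```shell
# make_filelist is a hypothetical helper, not part of Stanford CoreNLP.
# It writes one input path per line, sorted, to the given list file.
make_filelist() {
  input_dir="$1"    # directory holding the input files
  list_file="$2"    # where to write the list (keep it outside the *.xml pattern)
  find "$input_dir" -type f -name '*.xml' | sort > "$list_file"
}
```

For the corpus in the question this might be called as make_filelist /home/matthias/Workbench/SUTD/nytimes_corpus list_of_files_to_process.txt, and the resulting file passed to -filelist.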

Here are the contents of my annotators.prop file:

annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, gender, sentiment, natlog, entitymentions, relation

And here's what the contents of my_output_directory will look like:

Answer 2:

UPDATE

You can do it with a bash script like this.
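
A minimal sketch of such a loop, reusing the jar and classifier paths from the question as defaults. The function names (output_name_for, ner_one_file, ner_all_files) and the NER_JAR and NER_MODEL variables are made up for illustration. Note that this reloads the model for every file, which is the cost mentioned in the quoted reply further down.

```shell
# Defaults are taken from the command in the question; override as needed.
NER_JAR="${NER_JAR:-/home/matthias/Workbench/SUTD/nytimes_corpus/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar}"
NER_MODEL="${NER_MODEL:-classifiers/english.all.3class.distsim.crf.ser.gz}"

# Map an input path such as dir/1822873.xml to dir/1822873.output.xml.
output_name_for() {
  printf '%s.output.xml\n' "${1%.xml}"
}

# Tag a single file; the classifier prints inline XML to stdout.
ner_one_file() {
  java -mx600m -cp "$NER_JAR" edu.stanford.nlp.ie.crf.CRFClassifier \
    -loadClassifier "$NER_MODEL" -textFile "$1" -outputFormat inlineXML
}

# Run the classifier once per .xml file in a directory,
# writing each result next to its input.
ner_all_files() {
  dir="$1"
  for input in "$dir"/*.xml; do
    [ -e "$input" ] || continue                       # directory had no .xml files
    case "$input" in *.output.xml) continue ;; esac   # skip results of earlier runs
    ner_one_file "$input" > "$(output_name_for "$input")"
  done
}
```

Calling ner_all_files /home/matthias/Workbench/SUTD/nytimes_corpus would then produce 1822873.output.xml beside 1822873.xml, and so on for the other files. Using > rather than >> means a rerun overwrites an output file instead of appending to it.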

@duhaime I tried that, but I had an issue with the classifier. Also, is it possible to format the output for that as inline XML?

With respect to my original question, check out what I've found:

Unfortunately, there is no option to have multiple input files go to multiple output files. The best you can do in the current situation is to run the CRFClassifier once for each input file you have. If you have a ton of small files, loading the model will be an expensive part of this operation, and you might want to use the CRFClassifier server program and feed files one at a time through the client. However, I doubt that will be worth the effort except in the specific case of having very many small files.

We will try to add this as a feature for the next distribution (we have a general fix-it day coming up) but no promises.

John

My files are all numbered in ascending order. Do you think it would be possible to write some kind of bash script with a loop to process each of them one at a time?

Source: https://stackoverflow.com/questions/29577238/multiple-files-input-to-stanford-ner-preserving-naming-for-each-output

Tags

java

bash

stanford-nlp