Seeking citation parser

Question

I need a parser that will scan scholarly texts, extract citations, and parse those citations into their component parts (author, title, publication date, etc.).

I've tried Paracite, but it is abominably slow and doesn't produce high-quality results.

Any language is OK, but Java is preferred.

Answer 1:

Take a look at ParsCit:

This is the home page of the ParsCit project, which performs two tasks: 1) reference string parsing, sometimes also called citation parsing or citation extraction, and 2) logical structure parsing of scientific documents. It is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service. The code contains the training data, the feature generator, and shell scripts to connect the system to a web service (used on this web site).

Answer 2:

We recently faced a similar problem and ended up writing our own parser based on ParsCit, but using Wapiti instead of CRF++ for the conditional random fields model. As Mike mentions above, the problem with ML-based parsers is getting good tagged training data; for this we wrote a visual editor that lets you tag the results (and save them as training data). This approach works pretty well for parsing bibliographies.

If anyone is interested, we've made both parser and editor available here at anystyle.io.
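To make the idea of tagged training data more concrete, here is a minimal sketch in Java that writes tokens in the column format CRF++ and Wapiti read: one token per line, feature columns after it, the label in the last column, and a blank line ending each sequence. The class name, the three features, and the label names (`author`, `date`) are assumptions made up for illustration; they are not the feature set of ParsCit or AnyStyle.

```java
import java.util.ArrayList;
import java.util.List;

final class CrfFeatureSketch {

    /** Builds one line per token: token, feature columns, then the label last. */
    static List<String> toTrainingLines(String[] tokens, String[] labels) {
        if (tokens.length != labels.length) {
            throw new IllegalArgumentException("tokens and labels must align");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            lines.add(String.join(" ",
                    t,
                    capitalization(t),
                    t.matches(".*\\d.*") ? "hasDigit" : "noDigit",
                    looksLikeYear(t) ? "year" : "notYear",
                    labels[i]));
        }
        lines.add(""); // a blank line separates one citation from the next
        return lines;
    }

    /** Illustrative capitalization feature (hypothetical category names). */
    static String capitalization(String token) {
        boolean hasLetters = !token.toUpperCase().equals(token.toLowerCase());
        if (hasLetters && token.equals(token.toUpperCase())) {
            return "allCaps";
        }
        if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
            return "initCap";
        }
        return "noCaps";
    }

    /** True for tokens such as "2010", "(2010)" or "(2010)." */
    static boolean looksLikeYear(String token) {
        return token.matches("\\(?(1[5-9]|20)\\d{2}\\)?[.,;]?");
    }

    public static void main(String[] args) {
        // Made-up citation fragment, split on whitespace and tagged by hand.
        String[] tokens = {"Doe,", "J.", "(2010)."};
        String[] labels = {"author", "author", "date"};
        toTrainingLines(tokens, labels).forEach(System.out::println);
    }
}
```

The labels are the part a human has to supply, which is why a tagging editor matters: the features can be computed automatically, but the last column cannot.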

Answer 3:

A list of projects is here: https://forums.zotero.org/discussion/1211/

cb2bib uses regexes: http://www.molspaces.com/cb2bib/

CiteSeer uses a big list of author names and titles. You can have a look at their publication list.

Here is another project, though it is in Python: https://code.google.com/p/pdfssa4met/

Also see these Stack Overflow questions:

* Extracting information from PDFs of research papers
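For comparison with the ML-based parsers, the regex approach mentioned for cb2bib can be sketched in a few lines of Java. This is only a rough illustration, not cb2bib's actual patterns: the class name, the pattern, and the example string are made up, and the pattern assumes a single "Authors (Year). Title. Venue." layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCitationSketch {

    // Hypothetical pattern for one style only: "Authors (Year). Title. Venue."
    // The title group stops at the first period, so titles containing
    // periods (abbreviations, "vs.") will be split incorrectly.
    private static final Pattern AUTHOR_YEAR = Pattern.compile(
            "^(?<authors>.+?)\\s*\\((?<year>\\d{4})\\)\\.\\s*"
            + "(?<title>[^.]+)\\.\\s*(?<venue>.+?)\\.?$");

    /** Returns the extracted fields, or empty if the style does not match. */
    static Optional<Map<String, String>> parse(String citation) {
        Matcher m = AUTHOR_YEAR.matcher(citation.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("authors", m.group("authors"));
        fields.put("year", m.group("year"));
        fields.put("title", m.group("title"));
        fields.put("venue", m.group("venue"));
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        // Made-up reference string, not a real publication.
        String example =
                "Doe, J., & Roe, R. (2010). An example title. Journal of Examples, 12(3), 45-67.";
        System.out.println(parse(example));
    }
}
```

Each additional citation style needs its own pattern, which is the main maintenance cost of this approach compared with a trained model.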

Answer 4:

You can also try this little tool for parsing academic citations into fields:

http://citationparser.com

Citationparser.com is still in beta, but the 2017 version works well, especially for journal articles, and also for monographs and book chapters.

The list doesn't have to be in a single style; it can be a mixture of different official or unofficial styles.

You can walk through the references and check for full text, or you can export them as an EndNote file (.ENL). I developed this tool only for smaller lists of hundreds of titles. If you paste a list with more than 1000 titles, it will run much slower.

Answer 5:

You could try looking into an indexing/searching library like Lucene.

Source: https://stackoverflow.com/questions/7444057/seeking-citation-parser

Tags

java

parsing

text

citations