Reverse lookup of Digital Object Identifiers given a table of citations?

Question

I have a table of citations that includes the last name of the first author, the title, journal, year, and page numbers for each citation.

I have posted the first few lines of the table on Google Docs, along with a CSV version (not all records have a DOI).

I would like to be able to query the digital object identifier for each of these citations. For the titles, it would be best if the query could handle "fuzzy matching".

How can I do this?

The table is currently in MySQL, but it would be sufficient to start and end with a .csv file or, since I mostly use R, an R data frame. I would appreciate an answer that goes from start to finish.

Answer 1:

I don't know of any complete packages or functions that do this already, but this is the general approach I would use. crossref.org offers a web-based way to determine a DOI from bibliographic data at http://www.crossref.org/guestquery/

That page offers several different ways to search, the last of which takes an XML-formatted query. The page includes information about how to create the appropriate XML. You would need to submit the XML over HTTP (working out the details by picking apart the page to find the form destination and any additional information that needs to be included) and then parse the response.

Additionally, you would need to verify that doing this in an automated manner does not violate the terms of service of the website in any way.

Below is the XML form for the CrossRef guest query. The searchable terms include article_title, author, year, journal_title, volume, and first_page:

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
  <query enable-multiple-hits="false|exact|multi_hit_per_rule|one_hit_per_rule|true"
            list-components="false"
            expanded-results="false" key="key">
    <article_title match="fuzzy"></article_title>
    <author search-all-authors="false"></author>
    <component_number></component_number>
    <edition_number></edition_number>
    <institution_name></institution_name>
    <isbn></isbn>
    <issn></issn>
    <volume></volume>
    <issue></issue>
    <year></year>
    <first_page></first_page>
    <journal_title></journal_title>
    <proceedings_title></proceedings_title>
    <series_title></series_title>
    <volume_title></volume_title>
    <unstructured_citation></unstructured_citation>
  </query>
</body>
</query_batch>
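As a rough sketch of the "submit the XML over HTTP" step, the R snippet below only builds a request URL that carries a batch like the one above. The endpoint `https://doi.crossref.org/servlet/query` and the parameter names `pid`, `format`, and `qdata` are assumptions based on my understanding of CrossRef's XML query interface, not something taken from the guest query page, so check them against CrossRef's current documentation (and terms of service) before use. `build_query_url` is a hypothetical helper name.

```r
# ASSUMPTION: endpoint and parameter names (pid, format, qdata) are not from
# the original answer; verify them against CrossRef's documentation.
crossref_endpoint <- "https://doi.crossref.org/servlet/query"

# Build a GET URL that carries the XML batch, percent-encoded, in `qdata`.
build_query_url <- function(xml, email, format = "unixref") {
  paste0(
    crossref_endpoint,
    "?pid=", URLencode(email, reserved = TRUE),
    "&format=", URLencode(format, reserved = TRUE),
    "&qdata=", URLencode(xml, reserved = TRUE)
  )
}

# Hypothetical usage (the last line needs network access):
# xml <- paste(readLines("query.xml"), collapse = "\n")
# u <- build_query_url(xml, "you@example.org")
# response <- readLines(u, warn = FALSE)
```

A long batch may not fit in a URL, in which case the same fields would have to be sent as a POST form instead; the response would still need to be parsed to pull out the DOIs.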

Answer 2:

This is an open problem. There are better and worse ways to attack it, but start by reading Karen Coyle's summary of the problem. The bibliography attached to that article is excellent as well.

In short, the problem of quantifying sameness between two bibliographic records is hard, and a substantial amount of machine-learning research has centered on this topic.
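To make "quantifying sameness" concrete for titles alone, here is a minimal sketch in base R that normalises two titles and scores them by edit distance using `adist` from the utils package. The helper names `normalize_title` and `title_similarity` are hypothetical, the example titles are invented, and this is only one simple baseline, not the approach used by CrossRef or in the literature mentioned above.

```r
# Lower-case, replace anything other than a-z, 0-9 and space, collapse spaces.
# Note: this also strips accented and other non-ASCII letters.
normalize_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]", " ", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

# Similarity in [0, 1]: 1 minus edit distance over the longer string's length.
title_similarity <- function(a, b) {
  a <- normalize_title(a)
  b <- normalize_title(b)
  d <- drop(adist(a, b))
  1 - d / max(nchar(a), nchar(b), 1)
}

# Titles that differ only in case and punctuation score 1.
title_similarity("Fuzzy matching of citations: a case study",
                 "Fuzzy Matching of Citations - A Case Study.")
```

Any cut-off for calling two titles "the same" would have to be chosen by inspecting real data, and a full record comparison would also need to weigh author, year, journal, and pages.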

Answer 3:

Here are two options

CSV upload

I have found another promising solution, but it does not work as well in practice as in theory.

CrossRef allows you to upload the linked CSV directly and then performs a text query: http://www.crossref.org/stqUpload/

However, only 18 of the 250 queries (~7%) returned a DOI.

XML Query

Based on the answer by Brian Diggs, here is an attempt that does 95% of the work toward writing the XML-based query. It still has a few bugs that require some deletion using sed, but the biggest problem is that my session timed out when the query was submitted.

The XML syntax includes an option to use fuzzy matching.

The file doiquery.xml contains the template text from Brian's answer; citations.csv is linked above.

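library(XML)

# Parse the query template from Brian's answer into an R-level XML tree
doiquery.xml <- xmlTreeParse('doiquery.xml')
batch <- xmlRoot(doiquery.xml)

# The empty <query> element inside <body> serves as the per-citation template
query <- batch[["body"]][["query"]]

citations <- read.csv("citations.csv", stringsAsFactors = FALSE)

# Return a copy of the template <query> filled in from one row of citations
new.query <- function(citation, template = query){
  q <- template
  xmlValue(q[["author"]]) <- as.character(citation$author)
  xmlValue(q[["year"]]) <- as.character(citation$year)
  xmlValue(q[["article_title"]]) <- as.character(citation$title)
  xmlValue(q[["journal_title"]]) <- as.character(citation$journal)
  return(q)
}

# Build a new <body> with one <query> per citation
body <- xmlNode("body")
for (i in seq_len(nrow(citations))){
  body <- addChildren(body, new.query(citations[i,]))
}

# Swap the template <body> for the populated one and write the batch to disk
batch[["body"]] <- body
saveXML(batch, file = 'foo.xml')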