Shell script to parse CSV to an XML query?


挽巷

I have a list of citations in a CSV file that I would like to use to fill out the XML-based query form at CrossRef.

CrossRef provides an XML template (below, with unu


2 Answers

猫巷女王i

2020-12-12 04:03

Unlike the approaches using text substitution (e.g. awk), this one is guaranteed to always emit a well-formed XML document, with content properly escaped. It's ugly, but it's far more correct. Note that this requires a third-party tool; nothing included with the shell proper is capable of safely editing XML.
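To make the escaping concern concrete, here is a minimal sketch in Python (not part of the XMLStarlet approach; the sample title is made up for illustration). It compares naive string interpolation with letting an XML library serialize the same text:

```python
import xml.etree.ElementTree as ET

# Hypothetical field value containing characters that are special in XML.
title = "Cats & Dogs: <i>a study</i>"

# Naive text substitution: the raw "&" makes the result ill-formed.
naive = "<article_title>" + title + "</article_title>"
try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# Letting the library serialize the text escapes the special characters.
elem = ET.Element("article_title")
elem.text = title
safe = ET.tostring(elem, encoding="unicode")

print("naive output parses:", naive_ok)
print("escaped output:", safe)
```

Any citation title containing an ampersand or angle bracket would trip up the naive version, which is why the rest of this answer goes through an XML-aware tool.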

First, put a document with no body in template.xml:

<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body/>
</query_batch>

Second, build an XMLStarlet command line describing the edits desired, and invoke it:

#!/bin/bash
# Collect every edit operation in an array, then apply them all in one xmlstarlet run.
xmlstarlet_command=( )
read_header=0
while IFS=, read -r author year article_title journal_title volume first_page; do
  # Skip the CSV header row.
  if (( read_header == 0 )); then read_header=1; continue; fi
  # Append an empty <query> to <body>; it is then body's last child.
  xmlstarlet_command+=( -s /qs:query_batch/qs:body -t elem -n query -v '' )
  # -i with -t attr adds an attribute to the selected element.
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n list-components -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n expanded-results -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n key -v key )
  # -s adds a child inside the new <query>; */*[last()] then selects that just-added child.
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n article_title -v "$article_title" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/*[last()]' -t attr -n match -v fuzzy )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n author -v "$author" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/*[last()]' -t attr -n search-all-authors -v false )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n volume -v "$volume" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n year -v "$year" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n first_page -v "$first_page" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n journal_title -v "$journal_title" )
done <in.csv
xmlstarlet ed -N qs=http://www.crossref.org/qschema/2.0 "${xmlstarlet_command[@]}" <template.xml

Note that, like other solutions given here, this doesn't strip the double quotes from the beginning and end of the CSV elements; like other aspects of advanced CSV parsing, this is better left to something like the Python CSV module, which actually knows how to recognize escaped quotes, text fields containing newlines, and all the other little oddities that can happen inside valid CSV files.
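As a rough illustration of the csv-module route, here is a sketch in Python. The sample row, the function name rows_to_queries, and the bare body/query structure are assumptions made for the example; the CrossRef namespace, the head section, and most attributes are left out to keep it short. The column names follow the read loop in the script.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: a quoted field with a comma and an escaped quote.
sample = (
    "author,year,article_title,journal_title,volume,first_page\n"
    '"Smith, J.",2001,"A ""quoted"" title, with comma",J. Examples,12,34\n'
)

FIELDS = ("article_title", "author", "volume", "year", "first_page", "journal_title")

def rows_to_queries(stream):
    """Parse CSV from a text stream; return a <body> with one <query> per row."""
    body = ET.Element("body")
    for row in csv.DictReader(stream):  # the header row supplies the keys
        query = ET.SubElement(body, "query", key="key")
        for name in FIELDS:
            ET.SubElement(query, name).text = row[name]
    return body

body = rows_to_queries(io.StringIO(sample))
print(ET.tostring(body, encoding="unicode"))
```

Here the csv module strips the surrounding quotes and unescapes the doubled quote, so the author field comes through as Smith, J. rather than being split at the comma. With a real file, opening it with newline='' (as the csv documentation recommends) lets the reader handle embedded newlines itself.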

As an aside: be aware that older versions of XMLStarlet have a limit on the number of operations per invocation; this is fixed in the latest release. I have a workaround for this (which also allows edit lists longer than the maximum command line length of roughly 32K), but it probably deserves to be its own question.


挽巷

2020-12-12 04:15

#!/usr/bin/awk -f
# XML Attributes Must be Quoted. Attribute values must always be quoted. Either single or double quotes can be used.

BEGIN{
    FS=","
    print "<?xml version = '1.0' encoding='UTF-8'?>"
    print "<query_batch xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' version='2.0' xmlns='http://www.crossref.org/qschema/2.0'"
    print "  xsi:schemaLocation='http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd'>"
    print "<head>"
    print "   <email_address>test@crossref.org</email_address>"
    print "   <doi_batch_id>test</doi_batch_id>"
    print "</head>"
    print "<body>"
}

NR>1{
    print "  <query enable-multiple-hits='true'"
    print "            list-components='false'"
    print "            expanded-results='false' key='key'>"
    print "    <article_title match='fuzzy'>" $3 "</article_title>"
    print "    <author search-all-authors='false'>" $1 "</author>"
    print "    <volume>" $5 "</volume>"
    print "    <year>" $2 "</year>"
    print "    <first_page>" $6 "</first_page>"
    print "    <journal_title>" $4 "</journal_title>"
    print "  </query>"
}

END{
    print "</body>"
    print "</query_batch>"
}

$ awk -f script.awk input.csv
