Filter element nodes in XML with Clojure zippers

Submitted by 被刻印的时光 ゝ on 2020-01-24 10:38:06

Question


How can you filter out text nodes in XML with Clojure zippers? For example, you may have a pretty-printed XML document that interleaves element nodes with text nodes containing only whitespace:

(def doc
  "<?xml version=\"1.0\"?>
  <root>
    <a>1</a>
    <b>2</b>
  </root>")

If you want to retrieve the content of the root's children, you can do this:

(require '[clojure.data.xml :as xml]
         '[clojure.zip :as zip]
         '[clojure.data.zip :as zf]
         '[clojure.data.zip.xml :as zip-xml])

(-> doc
    xml/parse-str
    zip/xml-zip
    (zip-xml/xml-> :root zf/children zip-xml/text))

However, this returns (" " "1" " " "2" " "), including the whitespace.

How do you filter the zipper, so that only element nodes are selected?

I've come up with this:

(def filter-elements (comp (partial filter (comp xml/element? zip/node)) zf/children))

(-> doc
    xml/parse-str
    zip/xml-zip
    (zip-xml/xml-> :root filter-elements zip-xml/text))
; => ("1" "2")
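
To make the two steps in filter-elements explicit (take the children, then keep only the element locations), here is a minimal sketch that uses nothing but clojure.zip. It relies on the fact that in a zipper built by zip/xml-zip, zip/branch? is false only for strings, i.e. text nodes. The hand-written tree is an assumption standing in for the parsed document, and child-locs and element-child-locs are hypothetical helper names, not part of any library.

```clojure
(require '[clojure.zip :as zip])

;; Assumption: a literal tree standing in for the parsed document,
;; with whitespace-only strings between the element maps.
(def tree
  {:tag     :root
   :attrs   {}
   :content ["\n  "
             {:tag :a :attrs {} :content ["1"]}
             "\n  "
             {:tag :b :attrs {} :content ["2"]}
             "\n"]})

(defn child-locs
  "Hypothetical helper: all immediate child locations of loc, left to right."
  [loc]
  (take-while some? (iterate zip/right (zip/down loc))))

(defn element-child-locs
  "Hypothetical helper: child locations that are elements.
  In an xml-zip, branch? is false only for strings (text nodes)."
  [loc]
  (filter zip/branch? (child-locs loc)))

;; Concatenate the text content of each element child.
(def element-texts
  (->> (zip/xml-zip tree)
       element-child-locs
       (map (comp (partial apply str) :content zip/node))))
;; element-texts => ("1" "2")
```

This is not shorter than the composed version above; it only shows that the filtering step comes down to a single predicate on each child location.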

I suspect it's unnecessarily complex and hence I'm looking for a better solution.


Answer 1:


I think this relates to the general XML parsing problem of deciding which whitespace is meaningful and which isn’t. See for example this Q&A: Why am I getting extra text nodes as child nodes of root node?

I checked and found that data.xml does support skipping whitespace via the :skip-whitespace option. It’s undocumented though (source).

So it is best to solve this at the parsing stage:

(-> doc
    (xml/parse-str :skip-whitespace true)
    zip/xml-zip
    (zip-xml/xml-> :root zf/children zip-xml/text))
; => ("1" "2")
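
To see what the option changes before any zipper is involved, the following sketch compares the root's :content with and without it. The names default-content and skipped-content are mine, and the document string is a shortened stand-in for doc. As far as I can tell, the option drops every whitespace-only text node, so it may not be suitable for documents in which whitespace-only content carries meaning.

```clojure
(require '[clojure.data.xml :as xml])

;; Assumption: a shortened stand-in for the pretty-printed document.
(def doc "<root>\n  <a>1</a>\n  <b>2</b>\n</root>")

;; Default parse: whitespace-only strings sit between the elements.
(def default-content (:content (xml/parse-str doc)))

;; With :skip-whitespace true, only the element children should remain.
(def skipped-content (:content (xml/parse-str doc :skip-whitespace true)))

;; Text nodes present in the default parse.
(filter string? default-content)

;; Tags of the children after skipping whitespace.
(map :tag skipped-content)
;; => (:a :b)
```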



Answer 2:


You can do this using the Tupelo library, which offers XML parsing using both clojure.data.xml and tagsoup parsers:

(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [tupelo.forest :as tf]
    [tupelo.parse.tagsoup :as tagsoup]
    [tupelo.string :as ts]))

(dotest
  (let [doc "<?xml version=\"1.0\"?>
             <root>
               <a>1</a>
               <b>2</b>
             </root>"
        result-enlive (tagsoup/parse (ts/string->stream doc))
        result-hiccup (tf/enlive->hiccup result-enlive)]
    (is= result-enlive
      {:tag   :root,
       :attrs {},
       :content
              [{:tag :a, :attrs {}, :content ["1"]}
               {:tag :b, :attrs {}, :content ["2"]}]})

    (is= result-hiccup
      [:root
       [:a "1"]
       [:b "2"]])))


Source: https://stackoverflow.com/questions/47475799/filter-element-nodes-in-xml-with-clojure-zippers
