How to parse XML in Bash?

情深已故 2020-11-22 03:11

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^|</title&>
        </div>
      </div>
      
      <div class="fly-panel detail-box" id="flyReply">
        <fieldset class="layui-elem-field layui-field-title" style="text-align: center;">
          <legend>15 answers</legend>        </fieldset>

        <ul class="jieda" id="jieda">
                    <li data-id="111" class="jieda-daan">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                         <a class="fly-avatar" href="">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000025.jpg" alt=" 执笔经年 ">
              </a>
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite> 执笔经年</cite>
                                             
                </a>
                
                <span>(original poster)</span>
            
              </div>              <div class="detail-hits">
                <span>2020-11-22 03:38</span>
              </div>

            </div>
            <div class="detail-body jieda-body photos">
              <p>          
<p>This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...</p>

<pre><code>rdom () { local IFS=\> ; read -d \< E C ;}
</code></pre>

<p>Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:</p>

<pre><code>read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}
</code></pre>

<p>Okay, so it defines a function called read_dom. The first line makes IFS (the internal field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on space, tab or newline, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, to stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:</p>

<pre><code><tag>value</tag>
</code></pre>

<p>The first call to <code>read_dom</code> gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: <code>ENTITY=tag</code> and <code>CONTENT=value</code>. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: <code>ENTITY=/tag</code> and <code>CONTENT=</code> (or just the file's trailing newline), but because it reached the end of the file before finding another '<', it returns a non-zero status. So the variables from this last chunk are set, yet a <code>while read_dom</code> loop ends without running its body for them.</p>
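
<p>To watch those calls one at a time, here is a minimal sketch that feeds the same one-line document to <code>read_dom</code> and prints the exit status and both variables after every call. The helper name <code>trace_read_dom</code> is made up for this illustration, and <code>-r</code> is added to <code>read</code> only so that backslashes in the input are left alone:</p>

<pre><code>read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# Hypothetical helper: call read_dom repeatedly on stdin and report each call.
trace_read_dom () {
    local status n=0
    while :; do
        read_dom
        status=$?
        n=$((n + 1))
        printf 'call %d: status=%d ENTITY=[%s] CONTENT=[%s]\n' "$n" "$status" "$ENTITY" "$CONTENT"
        # A non-zero status means read ran out of input before seeing another '<'.
        [ "$status" -ne 0 ] && break
    done
}

# No trailing newline, so CONTENT of the last chunk is really empty.
printf '%s' '<tag>value</tag>' | trace_read_dom
</code></pre>

<p>On this input it should print:</p>

<pre class="lang-none prettyprint-override"><code>call 1: status=0 ENTITY=[] CONTENT=[]
call 2: status=0 ENTITY=[tag] CONTENT=[value]
call 3: status=1 ENTITY=[/tag] CONTENT=[]
</code></pre>

<p>Note that the call which reads '/tag>' already reports a non-zero status, because the input ends before another '<' appears; the variables are still assigned.</p>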

<p>Now his while loop cleaned up a bit to match the above:</p>

<pre><code>while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
</code></pre>

<p>The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits the script, so nothing after the title is read. If it wasn't the title entity, then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the <code>read_dom</code> function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).</p>
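
<p>If the lookup has to live inside a larger script, calling <code>exit</code> may be too drastic. As a sketch only, the same loop can be wrapped in a function that takes the tag name as an argument and uses <code>return</code> instead. The name <code>first_content_of</code> and the sample markup are assumptions made for this example:</p>

<pre><code>read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# Hypothetical helper: print the content of the first element named $1 read from stdin.
first_content_of () {
    local wanted=$1
    while read_dom; do
        if [[ $ENTITY = "$wanted" ]]; then
            printf '%s\n' "$CONTENT"
            return 0          # stop reading, but do not exit the calling script
        fi
    done
    return 1                  # no such element was seen
}

first_content_of title <<'EOF'
<html><head><title>My page</title></head><body></body></html>
EOF
</code></pre>

<p>Run as is, this should print <code>My page</code>. The function returns 1 when the tag never shows up, so the caller can test for that.</p>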

<p>Now given the following (similar to what you get from listing a bucket on S3) for <code>input.xml</code>:</p>

<pre><code><ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>
</code></pre>

<p>and the following loop:</p>

<pre><code>while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml
</code></pre>

<p>You should get the following (whitespace-only content between tags is left out here for readability):</p>

<pre class="lang-none prettyprint-override"><code> => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => item-apple-iso@2x.png
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 
</code></pre>

<p>So if we wrote a <code>while</code> loop like Yuzem's:</p>

<pre><code>while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo "$CONTENT"
    fi
done < input.xml
</code></pre>

<p>We'd get a listing of all the files in the S3 bucket.</p>
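
<p>One thing to watch with loops like these: they stop as soon as <code>read_dom</code> returns non-zero, which happens on the chunk after the last '<', so the final closing tag never reaches the loop body. If you need it, a common shell idiom is to keep looping while the variable is still non-empty. A sketch, where the function name <code>list_entities</code> and the shortened sample input are assumptions:</p>

<pre><code>read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# Print every entity name, including the final closing tag.
list_entities () {
    # read_dom fails at end of input but still sets ENTITY for the last chunk,
    # so also run the body when ENTITY is non-empty.
    while read_dom || [[ -n $ENTITY ]]; do
        [[ -n $ENTITY ]] && printf '%s\n' "$ENTITY"
    done
}

printf '%s\n' '<Contents><Key>a.png</Key><Key>b.png</Key></Contents>' | list_entities
</code></pre>

<p>The last line printed should be <code>/Contents</code>, which a plain <code>while read_dom</code> loop would skip. Once the input is exhausted, the next <code>read</code> leaves <code>ENTITY</code> empty, so the loop ends.</p>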

<p><strong>EDIT</strong>
If for some reason <code>local IFS=\></code> doesn't work for you and you set it globally, you should reset it at the end of the function like:</p>

<pre><code>read_dom () {
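    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    READ_STATUS=$?          # remember whether read succeeded
    IFS=$ORIGINAL_IFS       # restore the global IFS
    return $READ_STATUS     # otherwise the function would always return 0
}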
</code></pre>

<p>Otherwise, any line splitting you do later in the script will be messed up.</p>
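
<p>Another way to keep the change from leaking, if <code>local</code> is not an option, is to put the assignment in front of <code>read</code> itself, so that it applies only to that one command. A sketch (the function name <code>read_dom_scoped</code> is made up here so it doesn't clash with the versions above, and the sample input is shortened):</p>

<pre><code>read_dom_scoped () {
    # IFS is changed only for this read; the shell's own IFS is untouched,
    # and read is the last command, so its exit status is what the function returns.
    IFS=\> read -r -d \< ENTITY CONTENT
}

before=$IFS
while read_dom_scoped; do
    [[ -n $ENTITY ]] && printf '%s => %s\n' "$ENTITY" "$CONTENT"
done <<'EOF'
<Name>sth-items</Name><Size>1785</Size>
EOF
[[ $IFS == "$before" ]] && echo "IFS unchanged"
</code></pre>

<p>This should list the Name, /Name and Size entities and then print <code>IFS unchanged</code>, with no save-and-restore code needed.</p>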

<p><strong>EDIT 2</strong>
To split out attribute name/value pairs you can augment the <code>read_dom()</code> like so:</p>

<pre><code>read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
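    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#"$TAG_NAME"}    # text after the tag name; empty if the tag has no attributes
    ATTRIBUTES=${ATTRIBUTES# }          # drop the space that separated it from the name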
    return $ret
}
</code></pre>

<p>Then write your function to parse and get the data you want like this:</p>

<pre><code>parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}
</code></pre>

<p>Then while you <code>read_dom</code> call <code>parse_dom</code>:</p>

<pre><code>while read_dom; do
    parse_dom
done
</code></pre>

<p>Then given the following example markup:</p>

<pre><code><example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>
</code></pre>

<p>You should get this output:</p>

<pre><code>$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789
</code></pre>
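
<p>Because <code>eval</code> runs whatever text sits in the attribute string, it is only safe on markup you trust. As an alternative sketch, a helper can pull out a single attribute with a regular-expression match instead. The name <code>get_attribute</code> and its argument order are assumptions for this example, and it only handles double-quoted values that contain no escaped quotes:</p>

<pre><code># Hypothetical helper: print the value of attribute $1 found in the string $2.
get_attribute () {
    local name=$1 attributes=$2
    # Keep the pattern in a variable so the quotes and brackets need no extra escaping.
    local pattern="(^|[[:space:]])${name}=\"([^\"]*)\""
    if [[ $attributes =~ $pattern ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    else
        return 1              # attribute not present
    fi
}

ATTRIBUTES='size="1789" type="unknown"'
echo "foo size is: $(get_attribute size "$ATTRIBUTES")"
echo "foo type is: $(get_attribute type "$ATTRIBUTES")"
</code></pre>

<p>With the sample string this should print <code>foo size is: 1789</code> and <code>foo type is: unknown</code>. Inside <code>parse_dom</code> you could call it as <code>get_attribute size "$ATTRIBUTES"</code> in place of the <code>eval</code> line.</p>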

<p><strong>EDIT 3</strong> Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:</p>

<pre><code>read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
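    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#"$TAG_NAME"}    # text after the tag name; empty if the tag has no attributes
    ATTRIBUTES=${ATTRIBUTES# }          # drop the space that separated it from the name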
    return $RET
}
</code></pre>

<p>I don't see any reason why that shouldn't work.</p>
    </p>
             <div class="appendcontent">
                                                        </div>
            </div>
          </li>
                              			
        </ul>
        
      </div>
    </div>
    <div class="layui-col-md4">
          
 <!-- Popular discussion questions -->
     
 <dl class="fly-panel fly-list-one">
        <dt class="fly-panel-title">Hot questions</dt>
            <!-- Show 10 of this week's most-discussed questions -->