from HTML <figure> and <figcaption> to Microsoft Word

Submitted by 懵懂的女人 on 2019-12-01 05:13:40

This may be more roundabout than you would like, but you can save the file as a PDF and then export that PDF to Word to get a Word document. (I created the PDF in Adobe from an HTML file containing figure/figcaption, but you could obviously do that programmatically.) It is perhaps one step too many, but it does work!

Hope this is of some assistance (perhaps a PDF would do?).

EDIT 1: I just found a jQuery plugin by Mark Windsoll that converts HTML to Word, and I made a CodePen that includes figure/figcaption (code below). When you press the button, the content is exported as Word. (I suppose you could save it as well, but his original CodePen didn't actually do anything when you clicked the link that said "export to doc"... sigh.)

// Export the #page-content element when the button is clicked.
jQuery(document).ready(function ($) {
    $(".word-export").click(function (event) {
        $("#page-content").wordExport();
    });
});
img { width: 300px; height: auto; }
figcaption { width: 350px; text-align: center; }
h1 { margin-top: 10px; }
h1, h2 { margin-left: 35px; }
p { width: 95%; padding-top: 20px; margin: 0px auto; }
button { margin: 15px 30px; padding: 5px; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/FileSaver.js"></script>
<script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/jquery.wordexport.js"></script>

<link href="https://www.jqueryscript.net/css/jquerysctipttop.css" rel="stylesheet"/>

<h1>jQuery Word Export Plugin Demo</h1>
<div id="page-content">
<h2>Lovely Trees</h2>
<figure>
  <img src="http://www.rachelgallen.com/images/autumntrees.jpg">
  <figcaption>Autumn Trees</figcaption>
</figure>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec vehicula bibendum lacinia. Pellentesque placerat interdum nisl non semper. Integer ornare, nunc non varius mattis, nulla neque venenatis nibh, vitae cursus risus quam ut nulla. Aliquam erat volutpat. Aliquam erat volutpat. </p>
  <p>And some more text here, but that's quite enough lorem ipsum rubbish!</p>
</div>
<button class="word-export"> Export as .doc </button>
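
As a rough illustration of the general idea behind this kind of export, the sketch below wraps an HTML fragment in a complete HTML document and offers it as a .doc download, a file Word can usually open. It does not reproduce the plugin's internals. The helper names escapeHtml, buildWordHtml and exportElement are hypothetical; only saveAs comes from FileSaver.js, which the demo above already loads. Images referenced by URL are not embedded by this sketch.

```javascript
// Escape text for safe use inside HTML element content.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper (not part of the plugin): wrap an HTML fragment,
// e.g. the inner HTML of #page-content, in a complete HTML document.
function buildWordHtml(fragment, title) {
  return [
    '<!DOCTYPE html>',
    '<html>',
    '<head><meta charset="utf-8"><title>' + escapeHtml(title) + '</title></head>',
    '<body>' + fragment + '</body>',
    '</html>'
  ].join('\n');
}

// Browser-only part: offer the document as a download through FileSaver's saveAs.
// Assumes FileSaver.js has been loaded; defined here but not called.
function exportElement(element, filename) {
  const html = buildWordHtml(element.innerHTML, filename);
  const blob = new Blob(['\ufeff', html], { type: 'application/msword' });
  saveAs(blob, filename + '.doc');
}
```

In the page above, this could be called from the click handler as exportElement(document.getElementById("page-content"), "figure").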

EDIT 2: To convert HTML to Word using C#, you can use GemBox (GemBox.Document), which is free unless you buy the professional version (you could use it for free for a while to evaluate it).

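The C# code is:

using GemBox.Document;

// GemBox.Document needs a license key before use; this is the free-mode key.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");
// Convert HTML to Word (DOCX) document.
DocumentModel.Load("Document.html").Save("Document.docx");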

Rachel

I have never used pandoc; I guess it doesn't support many advanced CSS3 features at the moment.

1. Using Aspose.Words

I copied your CSS and HTML code into an HTML file named figure.htm and converted that file with Aspose.Words; it works as you hoped.

My C# code looks like this:

using Aspose.Words;

        // Start from an empty document and insert the HTML through a DocumentBuilder.
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        using (System.IO.StreamReader sr = new System.IO.StreamReader("./figure.htm"))
        {
            string html = sr.ReadToEnd();
            builder.InsertHtml(html);
        }

        // Save the result as a Word document.
        doc.Save("d:\\DocumentBuilder.InsertTableFromHtml Out.doc");

My Aspose.Words version is 16.7.0.0.

2. Format figcaption tag

There is another way that lets you keep using pandoc: pre-process the HTML file to fix the formatting before you convert it with pandoc. The root of your question is that pandoc can't handle many advanced CSS3 features, so if you do that formatting yourself beforehand, it works well too.

Here is some test code that uses regular expressions (System.Text.RegularExpressions). When you run it, it writes a new HTML file, figure1.htm, in which the inner HTML of every figcaption is replaced with a fixed format.

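        // Requires: using System.Text; using System.Text.RegularExpressions;
        // Match each figcaption element directly (with optional attributes), so captions
        // nested inside <figure> are found; Singleline lets a caption span several lines.
        Regex regex = new Regex(@"<figcaption\b(?<attrs>[^>]*)>(?<html>.*?)</figcaption>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        using (System.IO.StreamReader sr = new System.IO.StreamReader("./figure.htm", Encoding.UTF8))
        {
            string html = sr.ReadToEnd();
            int i = 1;

            // Prefix every caption with a running figure number.
            string newHtml = regex.Replace(html, new MatchEvaluator((m) =>
            {
                string attrs = m.Groups["attrs"].Value;
                string text = m.Groups["html"].Value;
                return $"<figcaption{attrs}>Fig. {i++} - {text}</figcaption>";
            }));

            // Write the numbered HTML to a new file.
            using (System.IO.StreamWriter sw = new System.IO.StreamWriter("./figure1.htm", false, Encoding.UTF8))
            {
                sw.Write(newHtml);
            }
        }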
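
For readers who are not on .NET, the same pre-processing idea can be sketched in JavaScript, for example as a Node.js step that runs before pandoc. This is a minimal sketch under stated assumptions: numberFigcaptions is a hypothetical helper name, and a regular expression is only adequate for simple, well-formed markup like the example in this thread.

```javascript
// Prefix the content of every <figcaption> with "Fig. N - ".
// Hypothetical helper; attributes on the figcaption tag are preserved.
function numberFigcaptions(html) {
  let n = 0;
  return html.replace(
    /<figcaption\b([^>]*)>([\s\S]*?)<\/figcaption>/gi,
    (match, attrs, inner) => `<figcaption${attrs}>Fig. ${++n} - ${inner}</figcaption>`
  );
}

// Example input with two figures.
const sample =
  '<figure><img src="a.jpg"><figcaption>Autumn Trees</figcaption></figure>' +
  '<figure><img src="b.jpg"><figcaption class="c">Oak</figcaption></figure>';

console.log(numberFigcaptions(sample));
```

The counter lives inside the function, so each call starts numbering again at 1.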

I hope my answer helps!

Pandoc already downloads the images and embeds them in the docx file with the command you posted.

I've just implemented and submitted a pull request that parses the figure and figcaption HTML elements properly. It has now been merged into master, so it will be in the nightly builds shortly, or later in pandoc 2.0. With that code, your example generates a docx file in which the caption text has the paragraph style "Image Caption".
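
For reference, an HTML-to-docx conversion with pandoc can be scripted from Node.js roughly as follows. The command from the question is not shown on this page, so the file names here are assumptions, and pandocArgs and convertWithPandoc are hypothetical helper names. The sketch only uses pandoc's standard -f, -t and -o options and needs pandoc on the PATH to actually run a conversion.

```javascript
const { execFileSync } = require('child_process');

// Build the argument list for: pandoc <input> -f html -t docx -o <output>
function pandocArgs(inputHtml, outputDocx) {
  return [inputHtml, '-f', 'html', '-t', 'docx', '-o', outputDocx];
}

// Run pandoc; throws if pandoc is not installed or exits with an error.
// Defined here but not called, so loading this file has no side effects.
function convertWithPandoc(inputHtml, outputDocx) {
  execFileSync('pandoc', pandocArgs(inputHtml, outputDocx), { stdio: 'inherit' });
}
```

Calling convertWithPandoc("figure.htm", "figure.docx") would then write the Word file next to the input.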

To expand on Rachel Gallan's excellent find: the following is code I think might be used to run the converter on a string that contains a full HTML page generated by the Loop.

Would this work to convert the output of a process that creates a page (the Loop)? (The JavaScript and CSS are loaded with wp_enqueue.. commands before this code is called.)

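    <?php
    $x = $post_output;  // $post_output contains an HTML page with doctype/head/body/etc. that was generated by the Loop
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // collect libxml parse errors internally instead of emitting warnings
    $dom->loadHTML($x, LIBXML_NOERROR); // suppress libxml error reports
    $body = $dom->getElementsByTagName('body')->item(0);
    ?>
    <div id="page-content"><?php foreach ($body->childNodes as $node) { echo $dom->saveHTML($node); } ?></div>
    <script type="text/javascript">
        // wordExport() is a jQuery plugin method, so it has to run in the browser
        // on a jQuery selection; it cannot be called on the PHP $dom object.
        jQuery(function ($) {
            $("#page-content").wordExport();
        });
    </script>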

...Rick...
