How can I efficiently run XSLT transformations for a large number of files in parallel?


Question


I have to regularly transform a large number of XML files (min. 100K), all within one folder each time (basically, the unzipped input dataset), and I'd like to learn how to do that as efficiently as possible. My technology stack consists of XSLT sheets and the Saxon XSLT Java library, called from Bash scripts. It runs on an Ubuntu server with 8 cores, an SSD RAID and 64 GB of RAM. Keep in mind I handle XSLT well, but I'm still learning Bash and how to distribute the load properly for such tasks (and Java is little more than a word to me at this point too).

I previously created a post about this issue, as my approach seemed very inefficient and I actually needed help getting it to run properly (see this SO post). Many comments later, it makes sense to present the issue differently, hence this post. Several solutions were proposed, one of which currently works much better than mine, but it could still be more elegant and efficient.

Now, I'm running this:

printf -- '-s:%s\0' input/*.xml | xargs -P 600 -n 1 -0 java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl

I set 600 processes based on some previous tests; going higher would just throw memory errors from Java. But it only uses between 30 and 40 GB of RAM now (all 8 cores are at 100%, though).
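One knob I could try for the Java memory errors (just a sketch; the heap size and process count are guesses to tune, not recommendations) is to give each JVM an explicit heap cap with -Xmx and keep the parallelism close to the core count, since 600 concurrent JVMs mostly end up competing for the same 8 cores anyway:

printf -- '-s:%s\0' input/*.xml | xargs -0 -n 1 -P 8 java -Xmx512m -jar saxon9he.jar -xsl:some-xslt-sheet.xsl

The downside is that this still pays one JVM start-up per file, which is exactly what approaches #1 and #2 below try to avoid.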

In a nutshell, here are all the pieces of advice/approaches I have so far:

  1. Split the XML files across subfolders (e.g. 5K files each) and run the transformation script in parallel, one instance per subfolder
  2. Use the Saxon-EE library specifically (which allows multithreaded execution) with the collection() function to parse the XML files
  3. Set up the Java environment with a lower number of tasks, or decrease the memory per process
  4. Check whether the XSLT sheets are compatible with libxml/libxslt (isn't that only for XSLT 1.0?)
  5. Use a specialized shell such as xmlsh

I can handle solution #2, and it should directly make it possible to control the loop and load the JVM only once; #1 seems clumsier (a rough sketch follows below) and I still need to improve my Bash (load distribution & performance, handling relative/absolute paths); #3, #4 and #5 are totally new to me and I may need more explanation to see how to tackle them.
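For the record, approach #1 might stay fairly simple because the Saxon command line accepts a directory for -s: (with -o: then pointing to an output directory). A rough sketch, assuming the files have already been distributed into subfolders named input/chunk_00 … input/chunk_07 (names and count invented for illustration):

mkdir -p output
for dir in input/chunk_*; do
  mkdir -p output/"$(basename "$dir")"
  java -jar saxon9he.jar -s:"$dir" -o:output/"$(basename "$dir")" -xsl:some-xslt-sheet.xsl &
done
wait

That would start one JVM per chunk in the background and wait for all of them, at the cost of having to balance the chunks myself.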

Any input would be greatly appreciated.


Answer 1:


"the most efficient way possible" is asking a lot, and is not usually a reasonable objective. I doubt, for example, that you would be prepared to put in 6 months' effort to improve the efficiency of the process by 3%. What you are looking for is a way of doing it that meets performance targets and can be implemented with minimum effort. And "efficiency" itself begs questions about what your metrics are.

I'm pretty confident that the design I have suggested, with a single transformation processing all the files using collection() and xsl:result-document (which are both parallelized in Saxon-EE) is capable of giving good results, and is likely to be a lot less work than the only other approach I would consider, which is to write a Java application to hold the "control logic": although if you're good at writing multi-threaded Java applications then you can probably get that to go a bit faster by taking advantage of your knowledge of the workload.
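For reference, a minimal sketch of what that single transformation could look like (untested; the ?select= collection URI syntax is Saxon's standard collection resolver, and the output naming scheme is just one possible choice):

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template name="xsl:initial-template">
    <!-- pull in every *.xml file under input/ -->
    <xsl:for-each select="collection('input?select=*.xml')">
      <!-- write one output file per input file, named after the source document -->
      <xsl:result-document href="output/{tokenize(base-uri(.), '/')[last()]}">
        <xsl:apply-templates select="."/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
  <!-- plus the existing templates from some-xslt-sheet.xsl -->
</xsl:stylesheet>

Run it with something like java -jar saxon9ee.jar -it -xsl:batch.xsl, Saxon-EE being the edition that parallelizes these constructs.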




Answer 2:


Try the xsltproc command-line tool from libxslt. It can take multiple XML files as arguments. To call it that way, you'll need to create an output directory first. Try calling it like this:

mkdir output
xsltproc -o output/ some-xslt-sheet.xsl input/*.xml
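
As written this runs everything in a single xsltproc process; to actually parallelize it, you could combine it with the same xargs pattern as in the question (a sketch, assuming the stylesheet really is XSLT 1.0, which is all libxslt supports):

mkdir -p output
printf '%s\0' input/*.xml | xargs -0 -n 500 -P 8 xsltproc -o output/ some-xslt-sheet.xsl

Here each xsltproc instance handles 500 files, so the start-up cost is paid far less often than with one process per file.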


Source: https://stackoverflow.com/questions/43141161/how-can-i-efficiently-run-xslt-transformations-for-a-large-number-of-files-in-pa
