Docx to PDF using OpenOffice headless is way too slow

I've been using PHPWord to generate docx files, and it's been working great. But now I also need to make some of those files available as PDFs.

After some research I found PyODConverter, which uses OpenOffice.org (OOo). It seemed like a good option since I don't want to depend on third-party web services. I tried it on my machine and it works fine, so I set it up on my server as well. That took a little longer, but I managed to get it working there too.

There is, however, a serious issue. On the server the conversion takes about 21 seconds, while on my machine it takes no longer than 2. :( That is far too slow for my needs, so I've been trying to work out what is causing the delay. Starting OpenOffice in headless mode with the socket listener is fine. So I went through the Python script to find which instruction is slow, and narrowed it down to this line:

context = resolver.resolve("uno:socket,host=127.0.0.1,port=8100;urp;StarOffice.ComponentContext")

This call takes about 20 seconds to execute. Here is the surrounding code:

localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
try:
    context = resolver.resolve("uno:socket,host=127.0.0.1,port=8100;urp;StarOffice.ComponentContext")
except NoConnectException:
    raise DocumentConversionException("failed to connect to OpenOffice.org on port %s" % port)
self.desktop = context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)

Any clues as to what might be causing this delay? I've ruled out the document I'm converting, since these operations happen before it is loaded. Could it be a problem with 'uno'? Or maybe a missing library that makes resolve() try something pointless before it succeeds?
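One way to narrow this down further is to time a plain TCP connect to the listener and compare it with the time resolve() takes. The sketch below is only an illustration and assumes Python 3; `timed` and `probe_tcp` are hypothetical helper names, not part of PyODConverter or uno.

```python
import socket
import time


def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("%s took %.3f s" % (label, elapsed))
    return result, elapsed


def probe_tcp(host, port, timeout=30.0):
    """Open and immediately close a plain TCP connection.

    Returns True if the connection was accepted, False otherwise.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `timed("raw connect", probe_tcp, "127.0.0.1", 8100)` followed by `timed("resolve", resolver.resolve, url)` on the server. If the raw connect is fast and only resolve() is slow, the delay is probably somewhere in the UNO side of the connection rather than in the network stack.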

Any ideas are welcome. :)

Best regards, Restless

I managed to eliminate the delay by using a pipe instead of a socket for the connection.

context = resolver.resolve("uno:pipe,name=myuser_OOffice;urp;StarOffice.ComponentContext")
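To make switching between the two transports less error-prone, the connection string can be built in one place. This is just a sketch: `socket_url` and `pipe_url` are hypothetical helpers of mine, and the URL layout is copied from the two resolve() calls in this thread.

```python
# Suffix shared by both connection strings used in this thread.
_SUFFIX = ";urp;StarOffice.ComponentContext"


def socket_url(host, port):
    """UNO URL for a TCP socket listener, e.g. host 127.0.0.1 and port 8100."""
    return "uno:socket,host=%s,port=%s%s" % (host, port, _SUFFIX)


def pipe_url(name):
    """UNO URL for a named pipe listener."""
    return "uno:pipe,name=%s%s" % (name, _SUFFIX)
```

Then `resolver.resolve(pipe_url("myuser_OOffice"))` is equivalent to the line above. Note that the office process has to be started listening on a pipe with the same name; the exact start-up command is not shown here.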

I still have one problem, though: the user executing the Python script must be the same user that started OOo for everything to work. Usually that would not be much of an issue, but I'm trying to run Python from my web application and I haven't managed to get it working yet. I'm trying something like this:

exec('sudo -u#1000 -s python path/to/DocumentConverter.py filename.docx filename.pdf');

I'm getting nothing back from this, and I don't understand why. Maybe the user running exec() (www-data) does not have permission to execute sudo?
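The call above does not look at the exit status or at stderr, which is where sudo would normally report a permission problem. The same idea applies from PHP; purely as an illustration, here is a Python 3 sketch that captures both. `run_and_report` is a hypothetical helper name.

```python
import subprocess


def run_and_report(argv):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text instead of bytes
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Running it as www-data with something like `run_and_report(["sudo", "-n", "-u", "#1000", "python", "path/to/DocumentConverter.py", "filename.docx", "filename.pdf"])` should show whether sudo itself is refusing. The `-n` flag is my addition: it makes sudo fail with an error instead of waiting for a password prompt that nobody can answer.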

Perhaps the name resolver on the server doesn't know localhost (which would be very odd, but 20 seconds does sound like a DNS timeout). You could try replacing it with 127.0.0.1.

Alternatively, perhaps it's doing the lookup fine, getting both IPv6 and IPv4 addresses back for localhost, trying to make the connection via IPv6 and failing (i.e. the component may not support IPv6, or doesn't bind to that interface by default) and only then falling back to IPv4. In that case, the remedy would be the same: replace localhost with 127.0.0.1.
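Both guesses can be checked from Python on the server by timing the name lookup and listing which addresses come back, in the order they would be tried. A minimal sketch, assuming Python 3; `resolve_candidates` is a hypothetical helper name.

```python
import socket
import time


def resolve_candidates(host, port):
    """Return (seconds, [(family_name, address), ...]) in resolver order."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    elapsed = time.perf_counter() - start
    candidates = [
        (socket.AddressFamily(family).name, sockaddr[0])
        for family, _type, _proto, _canonname, sockaddr in infos
    ]
    return elapsed, candidates
```

Calling `resolve_candidates("localhost", 8100)` and seeing a long elapsed time would point at the resolver; seeing an AF_INET6 entry listed before the AF_INET one would fit the IPv6 fallback theory.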

It's a pity that OpenOffice is so heavy. I was also considering it, but then I found a lighter solution: AbiWord.

I had to generate previews of the first 4 pages of an uploaded document. This is what I did:

abiword document.doc --to=ps --exp-props="pages:1-4"
gs -q -dNOPAUSE -dBATCH -dTextAlphaBits=4  -dGraphicsAlphaBits=4 -r72 -sDEVICE=pnggray -sOutputFile=preview%d.png document.ps

So you could get a recent AbiWord and try something like this:

abiword document.docx --to=pdf

Source: https://stackoverflow.com/questions/5504308/docx-to-pdf-using-openoffice-headless-way-too-slow

Tags: php, python, pdf-generation, docx, headless