Monday, February 01, 2010

Convert PowerPoint to HTML with Python

After I converted MS Word to HTML (and fed it to the application..) the next stage was to convert MS PowerPoint to HTML.
I thought it would be rather straightforward, given the success I experienced with the openoffice headless api converting Word to HTML. It wasn't.
OpenOffice does convert ppt to HTML (filter "impress_html_Export"). The output is a set of files in which each ppt slide is converted to an image (screenshot) and an HTML page. While the screenshots are good, the HTML is not satisfactory. Images embedded in the ppt don't appear in the converted HTML, and neither do tables. In addition, the "2 column layout" produced HTML with only the left-column text, leaving the right-column text out. The same happened to any content added to a blank layout template (e.g. text boxes). Finally, numbered lists (ol) were converted to bullets (ul).
Needless to say this solution is out of the question.

So here I was, looking for a way to convert ppt to html, using Java or (preferably) Python.
Looking for a Python module to do the job I found win32com, which may be good but is not relevant for me since our servers don't run Windows. Although win32com CAN run on debian, I preferred working with software that is not Windows dependent.

AND THEN... I found odfpy.
It's GPL software that defines itself as "Python API and tools to manipulate OpenDocument files".
Since openoffice document is basically an archive file, this module reads and writes the archive structure, allowing for easy manipulation of all kinds of openoffice formats.
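To see the archive structure for yourself, here is a minimal sketch using only the Python standard library (not odfpy). The function names are mine, and make_demo_archive() builds a tiny stand-in zip in memory so the snippet runs without a real file. A real .odp carries more entries, such as styles.xml and meta.xml.

```python
import io
import zipfile


def make_demo_archive():
    """Build a tiny in-memory zip that mimics the layout of an ODF file.

    This is only a stand-in for illustration; a real .odp has more entries.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.presentation")
        archive.writestr("content.xml", "<office:document-content/>")
    buffer.seek(0)
    return buffer


def list_odf_members(source):
    """Return the names of the files stored inside an ODF archive."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_content_xml(source):
    """Return the text of content.xml, where the document body lives."""
    with zipfile.ZipFile(source) as archive:
        return archive.read("content.xml").decode("utf-8")
```

Passing the path of a real .odp file instead of the demo archive should work the same way.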
In addition, it has some built-in scripts for common tasks, e.g. odf2xhtml (which I'm using), odfoutline, csv2odf and more.
SO, I'm converting the ppt to odp using the openoffice headless api, and then converting the odp to HTML using odfpy.
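A rough sketch of the two steps, not the exact code from my setup. It assumes a soffice binary on the PATH that supports the --convert-to command line option (as current LibreOffice builds do) in place of a UNO script, and it uses the ODF2XHTML class that ships with odfpy. The function names and the soffice default are placeholders of mine.

```python
import subprocess
from pathlib import Path


def build_convert_command(ppt_path, out_dir, soffice="soffice"):
    """Build the headless command line that converts a ppt to odp.

    Assumption: the soffice binary accepts --convert-to and --outdir.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "odp",
        "--outdir", str(out_dir),
        str(ppt_path),
    ]


def ppt_to_odp(ppt_path, out_dir, soffice="soffice"):
    """Step 1: run the office suite headless and return the expected odp path."""
    subprocess.run(build_convert_command(ppt_path, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(ppt_path).stem + ".odp")


def odp_to_html(odp_path):
    """Step 2: convert the odp to an XHTML string with odfpy."""
    # Imported here so the rest of the sketch loads even without odfpy installed.
    from odf.odf2xhtml import ODF2XHTML

    converter = ODF2XHTML(generate_css=True, embedable=False)
    return converter.odf2xhtml(str(odp_path))
```

Usage would be along the lines of html = odp_to_html(ppt_to_odp("slides.ppt", "/tmp/out")), where the file name and output directory are just examples.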

And it works!

5 comments:

Anonymous said...

hey im trying to do exactly what you seem to have accomplised would you mind posting the code / command run involved in this process? thanks

Chris Field said...

Thanks for the posting. I'm trying to get plone/cyn.in extended such that I can load a ppt and have a slide viewer displayed. I'm thinking initially of using flash as the viewer (and images as each slide) - sort of like how it is done here: http://www.cynapse.com/solutions/technology-solutions/knowledge-management

Do you think I should follow your approach?

Naor Rosenberg said...

Chris -
If all you need is a viewer IMHO you can use Open Office to convert your ppt to images, and then display the images.
My task required text annotation, so I couldn't use the images that OO generates.

Kenn E. Thompson said...

Can this be done on a Linux box?

Naor Rosenberg said...

Yes, I'm running it on a linux box (debian 64).