Tuesday, January 24, 2012

Problems with PDFbox Extension in Greenstone 2.85

The new features document for 2.85 describes significant improvements in the PDFbox extension, an optional add-on that uses the Apache PDFBox Java library.

When you run Greenstone, it pops up a message box noting that the PDFbox extension has not been installed and listing a URL for the Greenstone developer's TRAC site.

Problem #1

The files referenced by the URL in the message box and on the developer's blog are not valid archive files. Both the .zip file and the .gz file are 8 KB in size instead of the expected 9 MB and cannot be opened in an archive manager, so you cannot get the latest version. Searching the TRAC site is a lesson in frustration.
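A quick way to confirm that a download like this is broken, rather than the archive manager being at fault, is to test the file directly. The sketch below is only an illustration: the function name describe_download is made up for this post, and the guess that a tiny "archive" may really be a saved web page is an assumption, not something the TRAC site documents.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return a short verdict on whether a downloaded file is a usable .zip."""
    p = Path(path)
    size = p.stat().st_size
    # Look at the first bytes to spot a web page saved under an archive name.
    with open(p, "rb") as f:
        head = f.read(512).lstrip().lower()
    if zipfile.is_zipfile(str(p)):
        return f"ok: valid zip archive ({size} bytes)"
    if head.startswith(b"<!doctype") or head.startswith(b"<html"):
        return f"bad: HTML page saved under an archive name ({size} bytes)"
    return f"bad: not a zip archive ({size} bytes)"
```

Calling describe_download on the 8 KB file should report one of the two "bad" verdicts, which at least tells you whether to look for a different download link or a different file.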

I tried packaging the source files from the trunk and adding them manually to the greenstone\ext folder, but the GLI then fails to load until they are removed. Next I looked at the PDFBox project page, but their files don't seem compatible. So I decided to use an older version.

Problem #2

Finding an old version was a bit of work, but I managed to find a link in the TRAC site, download the .zip (9 MB) and extract the files to the greenstone\ext folder. I ran the GLI and configured the PDFPlugin to use the PDFbox extension, then created a new collection and imported a bunch of small PDF files. Only 2 were imported; the others were rejected with a 'writable error'. I checked the pdf-box folder in greenstone\gli and found it was marked as read-only, so I reset the permissions for the folder and its contents and re-ran the GLI and the import. This time it looked like the files were imported, but PDFbox failed with a Java error. The workaround was to untick the PDFbox checkbox in the PDFPlugin. Now all the PDFs were imported OK, but none of the enhanced features offered by PDFbox were available.
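Resetting the read-only flag by hand on every file is tedious, so here is a minimal sketch of doing it in one pass. It assumes the only problem is a missing owner-write permission; the function name make_writable is invented for this post and is not part of Greenstone.

```python
import os
import stat


def make_writable(root):
    """Add the owner-write bit to root and everything beneath it.

    Returns the number of files and folders whose permissions were changed.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    changed = 0
    for path in paths:
        mode = os.stat(path).st_mode
        if not mode & stat.S_IWUSR:
            # On Windows this clears the read-only attribute.
            os.chmod(path, mode | stat.S_IWUSR)
            changed += 1
    return changed
```

You would point it at the extension folder inside your own Greenstone installation; the exact path depends on where Greenstone was installed.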

After wasting a morning chasing this problem down, I had to advise the students NOT to use PDFbox but to go back to the old workaround of converting their PDF files to the older 1.4 version. A big disappointment!
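To find out which files actually need converting, you can read the version each PDF declares in its header. This is a rough sketch, not a Greenstone tool: the function names are my own, and it only looks at the %PDF-x.y header line, which a later entry inside the file is allowed to override.

```python
import re
from pathlib import Path

# The header looks like "%PDF-1.6" and sits near the start of the file.
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(path):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    with open(path, "rb") as f:
        match = HEADER.search(f.read(1024))
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def newer_than_1_4(folder):
    """List (file name, version) for PDFs under folder that declare a version above 1.4."""
    results = []
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        version = pdf_version(pdf)
        if version is not None and version > (1, 4):
            results.append((pdf.name, version))
    return results
```

Running newer_than_1_4 on the folder of source documents gives a list of the files worth converting before the import.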

Comments:

gnickers said...

If you click on the .zip file instead of downloading it you get taken to a page. If you then click on the download link you get a working copy. However, there can still be importing problems:

import.pl> Exception in thread "main" java.io.IOException:
You do not have permission to extract text
at org.apache.pdfbox.ExtractText.main(ExtractText.java:195)

WARNING: No plugin could process AppleScript Reference Guide.pdf
