BoxLang 🚀 A New JVM Dynamic Language Learn More...
Utilize the PDFBox Java library to manipulate PDFs with CFML.
This is an early stage project. Feel free to use the issue tracker to report bugs or suggest improvements.
Why not just use cfpdf
and cfdocument
?
CFML's built in methods have their place - if they work for you, keep using them.
PDFBox's performance is generally faster that CFML's built in functions, particularly for extracting text. It provides more fine-grained control and insight into the underlying structures and data that make up a PDF (forms, links, javascript, metadata, etc.). Some PDF functionality is restricted to certain ColdFusion versions and engines, while PDFBox functions the same across engines and versions, providing flexibility in a codebase.
Instances of pdfbox.cfc
are created by passing it the
absolute path to a PDF document or a PDF file input stream; the
component then provides methods for working with that PDF. It's not a
singleton, so it shouldn't be stored in a permanent scope; you need to
instantiate pdfbox.cfc
for each PDF you're working with.
pdf = new pdfbox( src = 'absolute/path/to/pdf' );
Once created, pdfbox.cfc
provides a growing list of
actions you can take on the PDF. For example:
//Extract Text
text = pdf.getText();
//Flatten a form
pdf.flatten();
//Save a copy of the edited pdf
pdf.save( expandPath( "./output/flattened.pdf" ) );
getText()
Returns the text extracted from the PDF document.
getPageText(
required numeric startpage, numeric endpage = 0 )
Returns the text extracted from specific pages of the pdf document.
The endpage
argument defaults to the same as the
startpage
if not provided.
getTextAsHtml()
Returns the text extracted from the PDF, wrapped in simple html. The underlying class used is PDFText2HTML.
drawRectangle(
required numeric page, required numeric x, required numeric y,
required numeric width, required numeric height, numeric lineWidth =
2, string color = "black")
Draws a rectangle on the PDF at given coordinates.
flatten()
Flattens any forms on the pdf.
Note: Data in XFA forms is not visible after this process. Chrome/Firefox/Safari/Preview no longer support XFA PDFs; the format seems to be on its way out and is only supported by Adobe (via Acrobat) and IE. Adobe ColdFusion does not allow cfpdf's 'sanitize' action on PDFs with XFA content.
listAnnotations()
Returns all annotations within the pdf as an array; the type of each object returned is PDAnnotation, so you'll need to look at the javadocs for that to see what methods are available.
removeAnnotations()
Strips out comments and other annotations.
Note: Form fields are made visible/usable via
annotations (as I understand it); consequently, removing all
annotations renders forms, effectively, invisible and unusable, though
the markup remains present (visible via a PDF Debugger). The default
behavior of pdfbox.cfc
, therefore, is to leave
annotations related to forms present, so that the forms remain
functional. While you can remove form annotations by setting
preserveForm = false
, the better approach is to use flatten()
.
Additionally, be aware that links are a type of annotation (PDAnnotationLink) so they're removed by this method.
removeEmbeddedFiles()
Removes embedded files.
removeJavaScript()
Attempts to remove all javascript from the PDF. Javascript can appear in a lot of places; this tackles the standard locations. If more are found, they'll be incorporated here.
removeEmbeddedJavaScript()
Removes the javascript embedded in the document itself.
removeDocumentJavaScriptActions()
Removes the actions that can be triggered on open, before close, before/after printing, and before/after saving.
removeFormFieldActions()
Removes actions embedded in the form fields ( triggered onFocus, onBlur, etc )
removeLinkActions()
Removes actions embedded in the links ( triggered onFocus, onBlur, etc )
removeMetaData()
Removes metadata from the document.
removeEmbeddedIndex()
If there is an embedded search index, this removes it (at least instances of an embedded searches that I've encountered).
removeBookmarks()
Removes the document outline (bookmarks)
sanitize()
Modeled after cfpdf
's "sanitize" action, this
runs all data removal methods on the PDF. As new methods are added to
the component, they'll be added here as well. Please be aware that I'm
not a PDF expert and make no claims that this is a comprehensive
sanitization. Sensitive data may remain in the PDF, even after running
this method.
addPages( required
any pdfPages )
Add a page or pages to the end of the PDF. The pdfPages
argument must be either the absolute path to a pdf file on disk, or a
ColdFusion PDF object like those created via cfdocument
.
splitPages(
required string dest, required numeric startpage, required numeric
endpage )
Split pages from the source pdf into a separate file. The
dest
argument provides the location for the new file. The
startpage
is the first page to include in the new file,
up to and including the endpage
.
save( string dest =
"" )
By default, this saves the PDF to the same path that it was loaded
from. You can use the dest
argument to save the modified
PDF to a new location. If the destination does not exist, it is
created automatically. Note that the dest
argument is
required in order to save PDFs loaded from file input streams.
Note: For convenience, saving the document also automatically closes the PDFBox instance that was created, so it should be the last thing you do with this object.
close()
PDFBox instances that opened also need to be closed. While calling
save()
will close them automatically, if you're just
extracting data from a PDF, it's preferable to just manually close it
using this method.
getVersion()
Returns the version of the underlying PDFBox Java library being used.
getAcroForm()
If present, this returns the Acroform object. I haven't put this to any use yet. It's more a placeholder for future development.
getEmbeddedFiles()
If the pdf includes embedded files, this returns them as a struct.
hasEmbeddedSearchIndex()
Checks to see if an embedded search index can be found in the pdf.
This includes the same disclaimer as
removeEmbeddedIndex()
- that is, it checks the places
that I've seen embedded search indexes. If different search index
locations are found, it will be updated.
getDocumentOutlineTitles()
Returns an array of with the titles for the document outline sections
(bookmarks). I only added this to make it easier to confirm that the
outline was being removed via removeBookmarks()
For methods not explicity provided, this project uses
onMissingMethod()
to invoke the underlying PDFBox library
class for PDDocument
, which is its in-memory
representation of the PDF document, documented here.
Consequently, you can utilize some of the methods provided by PDFBox
directly. For example, pdfbox.getNumberOfPages()
will
return the number of pages the document has; it does this by
delegating to the getNumberOfPages()
method in the PDDocument
class.
This component depends on the .jar files contained in the
/lib
directory. All of these files can be downloaded from
https://pdfbox.apache.org/download.cgi
There are two ways that you can include them in your project.
Include the files in your <cf_root>/lib
directory. You will need to restart the ColdFusion server.
Use this.javaSettings
in your Application.cfc to
load the .jar files. Just specify the directory that you place
them in; something along the lines of
this.javaSettings = {
loadPaths = [ '.\path\to\jars\' ]
};
When using pdfbox.cfc
with Lucee CFML, you have the
option to provide the directory that contains the PDFBox .jar files
when initializing the object:
classpath = expandPath( "/path/to/pdfbox/jars" );
// will use the PDFBox jars in the class path provided
pdf = new pdfbox( src = 'absolute/path/to/pdf', classpath );
This can be helpful if you want to avoid using
this.javaSettings
(for example, because of LDEV-2516).
To be clear, this approach 1) is not possible with Adobe ColdFusion,
2) is not required for Lucee, and 3) when used with Lucee, means that
you do not need to add the .jars to your
<cf_root>/lib
directory or this.javasettings
.
PDFs can be suprisingly complex; the spec for the PDF document format available online is, no joke, 1,300 pages. While I've browsed it, I am not an expert. As a consequence, you should verify that this component doing what you expect, particularly when it comes to the data sanitization methods. Metadata, javascript, and other functionality and information can be encoded in a range of places within a PDF. As I learn about and encounter examples of these, I'm happy to address them with this component, insofar as it's possible with the underlying PDFBox library.
For questions that aren't about bugs, feel free to hit me up on the CFML Slack Channel; I'm @mjclemente. You'll likely get a much faster response than creating an issue here.
👍 🎉 First off, thanks for taking the time to contribute! 🎉 👍
Before putting the work into creating a PR, I'd appreciate it if you opened an issue. That way we can discuss the best way to implement changes/features, before work is done.
Changes should be submitted as Pull Requests on the
develop
branch.
I will attempt to document all notable changes to this project in this file. I did not keep a changelog for pre-1.0 releases. Apologies.
The format is based on Keep a Changelog.
drawRectangle()
splitPages()
sanitize()
so it no longer fails on PDFs without a documentTree, metadata, etc.getEmbeddedFiles()
, hasEmbeddedSearchIndex()
, getDocumentOutlineTitles()
, and removeBookmarks()
sanitize()
now also removes the document outline (bookmarks)getVersion()
and getAcroForm()
/tests
server.json
for testing.gitignore
listXFAElements()
getText()
encounters a PDF with issues that prevent text extraction, and error is logged but not thrown, and an empty string is returned. Resolves issue #2.
$
box install pdfboxcfc