The PDF file icon of Adobe Reader DC, under public domain

Work with PDF files


PDF files have a lot of features, but their biggest disadvantage is that the format is difficult to work with (complex enough that most readers were vulnerable to maliciously crafted files 🗄️).

I do not work with PDF files often, but sometimes I have to, and every time I need to research how to accomplish some specific tasks.

Extract all images

Some PDF files are just a collection of images.

This is particularly true for scans, and comic books.

In those cases, I prefer a simpler format like .cbz, which is just a renamed .zip archive containing images.

Note 📝
while .cbz (and .cbr) is commonly used for comic books, it works equally well for other types of documents that mainly consist of images, for example, product catalogs, IKEA instructions, magazines, LEGO instructions, photo albums, or scanned documents.

In my experience, .cbz files are easier to modify (just extract the images and work on them directly), and on ebook readers, the program rendering the .cbz files normally does a better job at cropping blank borders.

For extracting all images, see the following commands

# lists all images (and image types) contained in the PDF file
pdfimages -list input.pdf;

# use this command if no conversion is desired, all images prefixed with "img-"
pdfimages -all input.pdf "./img-";

# use this command to save JPEG-encoded images as .jpg (other images are converted to .ppm), all images prefixed with "img-"; see the help for other formats
pdfimages -j input.pdf "./img-";

Unless you have a particular requirement, either use -all or a specific format (use pdfimages --help to see the list of supported formats).

If no format is specified, pdfimages will convert all images to .ppm files.

I wondered for quite some time why a 33.5MiB PDF produced more than 300MiB of images, which, once compressed as .png files, still occupied 175MiB.

For what it's worth, in one case, extracting the images without conversion from a 33.5MiB PDF file resulted in 33.6MiB of images. As the PDF format does not solely consist of images, but also contains some additional structure and metadata, it seemed strange that the images alone occupied slightly more space.

If you look closely at the output of pdfimages -list, the last column will report if an image is compressed losslessly inside the PDF structure, and in my case, all images were compressed:

example output of pdfimages -list ./comic.pdf | head -7
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1069  1611  rgb     3   8  jpeg   no         8  0   200   200  773K  15%
   2     1 image    1083  1586  rgb     3   8  jpeg   no        14  0   200   200  768K  15%
   3     2 image    1046  1611  rgb     3   8  jpeg   no        20  0   200   200  845K  17%
   4     3 image    1038  1611  rgb     3   8  jpeg   no        26  0   200   200  729K  15%
   5     4 image    1050  1598  rgb     3   8  jpeg   no        32  0   200   200  759K  15%

Creating a zip archive with all the images (with default settings for the compression level) resulted in a 32.2MiB file; after stripping the metadata I was not interested in, the images occupied 31.6MiB, which compressed into a 31.1MiB zip archive.

It is not a great achievement in terms of size (a reduction of just 4%), but I end up with a format that is easier to deal with, with no loss in quality.
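For reference, the whole conversion can be sketched with two commands (the img- prefix and the output file name are placeholders):

```shell
# extract the images without re-encoding them
pdfimages -all input.pdf "./img-"

# pack the extracted images into a .cbz archive (a renamed .zip)
zip output.cbz img-*
```

Since .cbz is just a renamed .zip, any archiver works; the images only need to sort in reading order.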

Warning ⚠️
Do not blindly convert all .pdf files to .cbz

You should ensure that your comics actually consist only of images. Some .pdf comics consist of two parts: the images (as images) and the text as text. The effect is that the extracted images would have, for example, empty dialog balloons.

On some documents, a page is composed of multiple images; after extracting them, one would somehow have to combine them into a single image.

Another thing to pay attention to is that even if the whole page consists of an image, the document might contain useful metadata. Some scanned documents have an invisible text overlay (probably created with an OCR program) that makes it possible to copy the scanned text. Needless to say, extracting the images will not preserve the copyable text.

Thus before converting a PDF file to .cbz, ensure that you are not interested in the additional metadata.

Convert every page to an image

Some comics do not have one image per page, but multiple images composed and overlayed together. Or, as mentioned earlier, the text might not be part of the image itself.

In that case, extracting all images will not yield the desired result.

Even if it is a lossy conversion, one might still prefer to have images instead of a PDF file; in that case, it is possible to convert every page to an image file

pdftoppm -png input.pdf "./img-";

In this case, the difference in size can vary much more compared to pdfimages -all, as the conversion can alter the quality of the content.
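The output resolution also matters for the resulting quality and size; pdftoppm renders at 150 DPI by default, and the -r option changes that:

```shell
# render every page as a .png at 200 DPI; higher values increase
# both the image quality and the file size
pdftoppm -png -r 200 input.pdf "./img-"
```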

Compare two PDF files

diffpdf is, as far as I know, the most reliable program for comparing visual differences in PDF files.

It is relatively easy to use and can show if there are any visual differences (images included), or if only the text (words or single characters) differs (optionally ignoring fonts).

It will highlight where those differences are, either by marking them or by showing the difference between one file and another, instead of the two files side by side (which is very practical for documents with images).

Comparing two PDF files is important if you want to ensure that some operation did not modify the content.

I’ve used, for example, diffpdf as part of a pipeline to ensure that the output of generated documents was unchanged after optimizing the build process and cleaning up the source files.

In that case, I did not care if some part of the appearance was different, but mainly if the words or letters did change.

Insert text or image in an existing document

I currently have no good solution for those use cases.

The most common scenario is having a document to sign and send electronically.

The "official" way, excluding a digital signature which is something else, would be to print the document, sign the printed page, and then scan it.

The more environmentally friendly solution is to sign a piece of paper, scan it, and save it as an image somewhere on your PC. Every time you need to sign a document, instead of printing it, insert the image at the appropriate position.

LibreOffice Draw can accomplish this task, but converting between a PDF file and its internal format will generally (slightly) change your document.

Some elements might not be aligned as before anymore, or the fonts used might be different.

If you are not sure, use diffpdf. Ideally, the file should look identical except where the signature or additional text was placed.

Most of the time the differences are minor, especially if you adjust alignment issues by hand, and unless one is looking at the two documents side by side, the difference will be hard to notice.

As of today, no one has asked me why the returned documents look slightly different; I assume that no one noticed, or at least that no one cared enough to make an issue out of it.

Nonetheless, there should be a better way.

A way that

  • works in multiple environments

  • does not otherwise alter the PDF document in any way (just add the new content at the given position)

  • does not require an internet connection or a subscription

  • is easy to use, ideally a graphical program, so that I can place the image with the mouse and see immediately how it looks

I’ve not yet found a program that ticks all the boxes; suggestions are welcome.

  • LibreOffice Draw alters the document

  • with GIMP, I’ve never understood how to handle documents that consist of multiple pages

  • from the command line, it is "hard" to place the image in the "correct" position
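For what it’s worth, qpdf can stamp one PDF on top of another with --overlay; the catch is that the signature must already sit at the correct position on its own PDF page, which is exactly the hard part from the command line. The file names here are placeholders:

```shell
# stamp page 1 of signature.pdf over page 1 of document.pdf;
# signature.pdf must already contain the image at the desired position
qpdf document.pdf --overlay signature.pdf --to=1 -- signed.pdf
```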

I’ve recently tried xournal++; it seems to do a better job than LibreOffice Draw but depending on the document it also introduces some unwanted changes.

Append one PDF to another

If you have two files, and you want to append one to another, then there are multiple tools at your disposal, especially from the command line.

# pdfunite (part of the poppler package)
pdfunite input1.pdf input2.pdf input3.pdf ... output.pdf;

# pdftk
pdftk input1.pdf input2.pdf input3.pdf ... cat output output.pdf;

# qpdf
qpdf --empty --pages input1.pdf input2.pdf input3.pdf ... -- out.pdf;

# pdfjam, might already be available if working with LaTeX
pdfjam input1.pdf input2.pdf input3.pdf ... -o output.pdf;

TODO: not all suggested programs are equivalent. In particular, some might preserve metadata from the original document, and they behave differently when pages have different layouts or contain internal links.

According to pdfinfo:

Command                                                 version  size in bytes  Preserves metadata (title, producer, …)
------------------------------------------------------  -------  -------------  ---------------------------------------
pdfunite input.pdf input.pdf output.pdf                 22.12.0  5678537        no, metadata is empty
qpdf --empty --pages input.pdf input.pdf -- output.pdf  11.9.0   2838893        no, metadata is empty
qpdf input.pdf --pages . input.pdf -- output.pdf        11.9.0   2839187        yes
pdftk input.pdf input.pdf cat output output.pdf         3.3.3    5675065        no, overwrites metadata
pdfjam input.pdf input.pdf -o output.pdf                3.10     5647602        no, overwrites metadata, also adds "Custom metadata", and downgrades the document from PDF 1.5 to PDF 1.3

Size of input file: 2839681 bytes

How can qpdf create such a small file? It deduplicates all objects. Since in this test, I appended a file to itself, most internal objects can be reused.

I did not otherwise validate the output.pdf file, nor did I use a particularly complex document.

Extract one or more pages from a PDF file

# pdftk
pdftk input.pdf cat 2-17 20 23-24 output output.pdf;

# qpdf
qpdf --empty --pages input.pdf 2-17,20,23-24 -- output.pdf;

# pdfjam, might already be available if working with LaTeX
pdfjam input.pdf 2-17,20,23-24 -o output.pdf;

TODO: not all suggested programs are equivalent. In particular, some might preserve metadata from the original document, and they behave differently when pages have different layouts or contain internal links.

According to pdfinfo:

Command                                                     version  size in bytes  Preserves metadata (title, producer, …)
----------------------------------------------------------  -------  -------------  ---------------------------------------
qpdf --empty --pages input.pdf 2-17,20,23-24 -- output.pdf  11.9.0   2452285        no, metadata is empty
qpdf input.pdf --pages . 2-17,20,23-24 -- output.pdf        11.9.0   2452579        yes
pdftk input.pdf cat 2-17 20 23-24 output output.pdf         3.3.3    2452528        no, overwrites metadata
pdfjam input.pdf 2-17,20,23-24 -o output.pdf                3.10     2441923        no, overwrites metadata, also adds "Custom metadata", and downgrades the document from PDF 1.5 to PDF 1.3

I did not validate output.pdf, nor did I use a particularly complex document during my tests.

Show document metadata

pdfinfo input.pdf;

The output will look similar to

Keywords:        <keywords>
Author:          <author>
Creator:         <program used for creating the document>
Producer:
CreationDate:    Wed Jan 22 16:44:51 2020 CET
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           2
Encrypted:       no
Page size:       595.276 x 841.89 pts (A4)
Page rot:        0
File size:       12008 bytes
Optimized:       no
PDF version:     1.4

Note that a document generally has much more metadata than what pdfinfo and other tools show.
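pdfinfo can at least dump the embedded XMP metadata stream, which already reveals more than the default summary:

```shell
# print the XMP metadata stream of the document (if present)
pdfinfo -meta input.pdf
```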

Compress PDF file

Some documents are huge.

PDF files can have complex structures, and there are many ways to represent certain content.

In addition to that, elements inside a PDF file can be compressed.

Thus it should be possible to "optimize" the size of a document, just like jpegoptim and optipng do for .jpg and .png files.

There are mainly two approaches: the first is to remove all elements that are not visible (and possibly metadata too); the second is to represent the content more efficiently.

As of today, all methods I tried (except one) failed. By "failed" I mean that the optimization was not lossless, even if I might not have been able to detect the difference with the naked eye.

Again, diffpdf will help to validate if a conversion was lossless or not.

  • ps2pdf input.pdf output.pdf will generally not work: it converts the PDF file to PostScript and then back to PDF.

  • gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf (or some variation of a similar command) will also not work: it converts the document between different formats.

  • exiftool -all:all= document.pdf creates a bigger file. This effect can be explained if one looks at the output: Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! The metadata is not removed; instead, new empty metadata is appended to the document (and generally only the last metadata is shown, as you can verify with pdfinfo too). If someone suggests using exiftool alone, they never bothered to look at the output, or at the generated document.

In general, gs and ps2pdf are lossy; they can help if the document has hidden content (metadata too) or images at an unnecessarily high resolution.

I’m not sure why exiftool is presented as a tool capable of removing metadata; it is not. It adds additional empty metadata and thus increases the file size.

While it is true that pdfinfo and most other tools only show the last metadata, the original metadata is still there in cleartext.
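This is easy to verify with a plain text search ("Producer" is just an example key; any field that was supposedly removed works):

```shell
# the old metadata entries are still in the file
# and can be found with a plain text search
strings input.pdf | grep -i "Producer"
```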

There are mainly two ways to compress PDF files: with lossy or lossless methods.

In this context, lossless means that the rendered content is identical, and that text remains text (thus it can still be copied). Unused objects and invisible content might be removed, and the internal structure of the PDF document might look completely different.

Lossy means that there might be some artifacts, and text might be converted to an image.

The only method I’ve found for doing a lossless compression is qpdf:

qpdf --recompress-flate --compress-streams=y --object-streams=generate --remove-unreferenced-resources=yes --coalesce-contents input.pdf -- output.pdf;

There is also the parameter --optimize-images, which (I suppose) will apply lossy compression to the images.
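A minimal invocation would be the following; compare the result with diffpdf before discarding the original:

```shell
# re-encode the embedded images (lossy) while rewriting the file
qpdf --optimize-images input.pdf output.pdf
```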

Decrypt password-protected PDF file

Not all programs can handle PDF documents with passwords, and I do not want to decrypt those documents every time I want to read them.

So I generally save a decrypted copy

qpdf --decrypt input.pdf output.nopwd.pdf --password="$PASSWORD";

Remove copy protection and other restrictions

PDF files might have restrictions; for example not being able to print them, or not being able to select and copy the text.

Those restrictions are handled by the PDF reader, thus using a PDF reader that does not restrict what the end-user can do is sufficient.

But if you have to work with a given program that applies those restrictions, you might want to remove them.

The "official" way is to use the "owner password" (not to be confused with the "user password", which is used for decrypting a password-protected file), but you do not need to know it

remove copy protection
qpdf --decrypt input-with-restrictions.pdf output-without-restrictions.pdf;

Create a page with multiple images

Instead of printing one (small) image per page, I find it much more logical to print as many images as possible at once.

Normally I use LibreOffice Writer, but I assume that Microsoft Word is equally good.

In fact, for my main use case, I do not even need to create a PDF document if the device attached to the printer supports .odt, .doc, or .docx documents.

Extract attachments / embedded files

PDF files can act more or less like zip archives, and contain other files.

qpdf has the command-line options --list-attachments and --show-attachment, which can be used for extracting all embedded files.
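For example (the attachment name data.csv is a placeholder):

```shell
# list the embedded files
qpdf input.pdf --list-attachments

# write the attachment named "data.csv" to standard output
qpdf input.pdf --show-attachment=data.csv > data.csv
```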

Validate and repair PDF document

Even if a PDF reader can open, process, and show the content of a document, the file may still have an invalid structure.

Most programs can handle malformed PDF documents, but it would be better not to have such documents, especially if you are dealing with digital signatures, have created such documents to send to other people, or want to be sure to be able to open such documents in the future.

qpdf has an option for verifying if a PDF document is valid: qpdf --check input.pdf

It will not be able to catch every possible type of error, but it seems to verify multiple properties.

If you want to repair a document, then processing it with any of the mentioned tools (qpdf, pdftk, pdfjam, pdfunite, …​) should repair it, as they generally parse a document and create a new one.

For example, qpdf input.pdf output.pdf will create a valid output.pdf file, even if input.pdf is a partially broken PDF file.

Extract text from PDF

If you want to read a PDF file from the command line, or if you have some automated process that depends on the content of a PDF file, you might want to extract the text from it.

# pdf2txt, from the package python3-pdfminer in Debian
pdf2txt input.pdf;

# pdftotext (often part of the poppler package)
pdftotext input.pdf; # saves output to input.txt
pdftotext input.pdf -; # prints output to console

On the documents I’ve tested, pdftotext seems to give better results, as the displayed text has a "better" layout, that also resembles the layout used in the PDF file.

Note that both programs have some options for defining the layout of the output, and for creating HTML files.
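For example, pdftotext can preserve the physical layout of the page, and pdf2txt can emit HTML via its -t option:

```shell
# keep the original physical layout of the text
pdftotext -layout input.pdf -

# emit HTML instead of plain text
pdf2txt -t html input.pdf
```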

Other Programs

There is pdfarranger:

PDF Arranger is a small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop, and rearrange their pages using an interactive and intuitive graphical interface.

— pdfarranger README on github

Xournal++ is a note-taking software that is also able to import, modify, and export PDF files. Contrary to LibreOffice Draw, the resulting document seems to be much more similar to the original, although it might still contain some artifacts.

And tabula-java, which is both a (Java) library and a command line tool specialized in extracting tabular data from PDF files.

And some document readers:

  • Adobe Reader, which does not support GNU/Linux systems. It is, at the end of the day, the document reader that more or less defines whether other people can interact with your document. I’m still using Windows XP from time to time because of it.

  • Sumatra PDF, which, contrary to its name, supports many more formats and is extremely lightweight. Unfortunately, it works only on Windows.

  • MuPDF, a lightweight PDF viewer. There is also a version for Android, and it supports other file formats too.

  • KOReader, "a document viewer primarily aimed at e-ink readers".


Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me anytime.