Work with PDF files
- Extract all images
- Convert every page to an image
- Compare two PDF files
- Insert text or image in an existing document
- Append one PDF to another
- Extract one or more pages from a PDF file
- Show document metadata
- Compress PDF file
- Decrypt password-protected PDF file
- Remove copy protection and other restrictions
- Create a page with multiple images
- Extract attachments / embedded files
- Validate and repair PDF document
- Extract text from PDF
- Other Programs
PDF files have a lot of features, but the biggest disadvantage is that the format is difficult to work with (complex enough that most readers were vulnerable to maliciously crafted files 🗄️)
I do not work with PDF files often, but sometimes I have to, and every time I need to research how to accomplish some specific tasks.
Extract all images
Some PDF files are just a collection of images.
This is particularly true for scans, and comic books.
In those cases, I prefer a simpler format like .cbz
, which is just a renamed .zip
archive containing images.
Note 📝 | while .cbz (and .cbr) is commonly used for comic books, it works equally well for other types of documents that mainly consist of images, for example, product catalogs, IKEA instructions, magazines, LEGO instructions, photo albums, or scanned documents. |
In my experience, .cbz
files are easier to modify (just extract the images and work on them directly), and on ebook readers, the program rendering the .cbz
files normally does a better job at cropping blank borders.
For extracting all images, see the following commands
# lists all images (and image types) contained in the PDF file
pdfimages -list input.pdf;
# use this command if no conversion is desired, all images prefixed with "img-"
pdfimages -all input.pdf "./img-";
# use this command to save jpeg-encoded images as .jpg files (images in other formats are still saved as .ppm/.pbm), all images prefixed with "img-", see the help for other formats
pdfimages -j input.pdf "./img-";
Unless you have a particular requirement, either use -all
or a specific format (use pdfimages --help
to see the list of supported formats).
If no format is specified, pdfimages
will convert all images to .ppm
files.
I wondered for a long time why a 33.5MiB PDF produced more than 300MiB of images, which resulted in 175MiB of compressed .png
images.
For what it’s worth, in one case, extracting the images (without conversions) from a 33.5MiB PDF file resulted in 33.6MiB of images. As the PDF format does not solely consist of images, but also contains some additional structure and metadata, it seemed strange that the images alone occupied slightly more space.
If you look closely at the output of pdfimages -list
, the last columns will report whether and how an image is compressed inside the PDF structure, and in my case, all images were compressed:
pdfimages -list ./comic.pdf | head -7
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1069 1611 rgb 3 8 jpeg no 8 0 200 200 773K 15%
2 1 image 1083 1586 rgb 3 8 jpeg no 14 0 200 200 768K 15%
3 2 image 1046 1611 rgb 3 8 jpeg no 20 0 200 200 845K 17%
4 3 image 1038 1611 rgb 3 8 jpeg no 26 0 200 200 729K 15%
5 4 image 1050 1598 rgb 3 8 jpeg no 32 0 200 200 759K 15%
Creating a zip
archive with all the images (with some default settings for the compression level) resulted in 32.2MiB, and stripping the metadata I was not interested in brought the images to 31.6MiB, compressed in a zip archive to 31.1MiB.
It is not a great achievement in terms of size (a reduction of roughly 4%), but I get a format that is easier to deal with, with no loss in quality.
Warning ⚠️ | Do not blindly convert all .pdf files to .cbz |
You should ensure that your comics actually consist only of images. Some .pdf
comics consist of two parts: the images (as images), and the text (as text). The effect is that the extracted images would have, for example, empty dialog balloons.
On some documents, a page is composed of multiple images, thus after extracting them, one would have to somehow combine them into a single image.
Another thing to pay attention to is that even if the whole page consists of an image, it might contain useful metadata. Some scanned documents have an invisible text overlay (possibly created with an OCR program) that makes it possible to copy the scanned text. It goes without saying that extracting the images will not preserve the copyable text.
Thus before converting a PDF file to .cbz
, ensure that you are not interested in the additional metadata.
Convert every page to an image
Some comics do not have one image per page, but multiple images composed and overlaid together. Or, as mentioned earlier, the text might not be part of the image itself.
In that case, extracting all images will not yield the desired result.
Even if it is a lossy conversion, one might still prefer to have images instead of a PDF file, in that case, it is possible to convert every page to an image file
pdftoppm -png input.pdf "./img-";
In this case, the difference in size can vary much more compared to pdfimages -all
, as the conversion can alter the quality of the content.
Compare two PDF files
diffpdf
is, as far as I know, the most reliable program for comparing visual differences in PDF files.
It is relatively easy to use and can show if there are any visual differences (images included), or if only the text (words or single characters) differs (optionally ignoring fonts).
It will highlight where those differences are, either by marking them or by showing the difference between one file and another, instead of the two files side by side (which is very practical for documents with images).
Comparing two PDF files is important if you want to ensure that some operation did not modify its content.
I’ve used, for example, diffpdf
as part of a pipeline to ensure that the output of generated documents was unchanged after optimizing the build process and cleaning up the source files.
In that case, I did not care if some part of the appearance was different, but mainly if the words or letters did change.
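For that kind of text-only check, a rough scriptable approximation is to extract the text of both files and diff it; this is a sketch (pdfdifftext is a hypothetical helper name, and it assumes pdftotext from poppler is installed), and visual-only differences will of course go unnoticed:

```shell
# pdfdifftext: compare only the extracted text of two PDF files
# (hypothetical helper; assumes pdftotext from poppler is installed)
pdfdifftext() {
    a="$(mktemp)"; b="$(mktemp)"
    pdftotext "$1" - > "$a"
    pdftotext "$2" - > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Usage would look like `pdfdifftext old.pdf new.pdf && echo "no textual differences"`.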
Insert text or image in an existing document
I currently have no good solution for those use cases.
The most common scenario is having a document to sign and send electronically.
The "official" way, excluding a digital signature which is something else, would be to print the document, sign the printed page, and then scan it.
The more environmentally friendly solution is to sign a piece of paper, scan it, and save it as an image somewhere on your PC. Every time you need to sign a document, instead of printing it, insert the image at the appropriate position.
LibreOffice Draw can accomplish this task, but converting between a PDF file and its internal format will generally (slightly) change your document.
Some elements might not be aligned as before anymore, or the fonts used might be different.
If you are not sure, use diffpdf
. Ideally, the file should look identical except where the signature or additional text was placed.
Most of the time the differences are minor, especially if you adjust alignment issues by hand, and unless one is looking at the two documents side by side, the difference will be hard to notice.
As of today, no one has asked me why the returned documents look slightly different; I assume that no one noticed, or at least that no one cared enough to make an issue out of it.
Nonetheless, there should be a better way.
A way that
- works in multiple environments
- does not otherwise alter the PDF document in any way (just adds the new content at the given position)
- does not require an internet connection or a subscription
- is easy to use, ideally a graphical program, so that I can place the image with the mouse and see immediately how it looks
I’ve not yet found a program that ticks all the boxes; suggestions are welcome.
- LibreOffice Draw alters the document
- with GIMP, I’ve never understood how to handle documents that consist of multiple pages
- from the command line, it is "hard" to place the image in the "correct" position
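For the command-line case, one partial approach (sketched below with hypothetical file and helper names) is to first convert the signature image to a one-page PDF with img2pdf, and then overlay it with pdftk's stamp operation. It does not solve the placement problem: the position has to be tuned inside the overlay PDF, and stamp applies the overlay to every page:

```shell
# sign_pdf: overlay signature.png on every page of the given document
# (hypothetical sketch; assumes img2pdf and pdftk are installed, and
# that the signature is already positioned correctly inside sig.pdf)
sign_pdf() {
    img2pdf signature.png -o sig.pdf
    pdftk "$1" stamp sig.pdf output "${1%.pdf}.signed.pdf"
}
```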
Append one PDF to another
If you have two files, and you want to append one to another, then there are multiple tools at your disposal, especially from the command line.
# pdfunite (part of the poppler utilities)
pdfunite input1.pdf input2.pdf input3.pdf ... output.pdf;
# pdftk
pdftk input1.pdf input2.pdf input3.pdf ... cat output output.pdf;
# qpdf
qpdf --empty --pages input1.pdf input2.pdf input3.pdf ... -- out.pdf;
# pdfjam, might be already available if working with LaTeX
pdfjam input1.pdf input2.pdf input3.pdf ... -o output.pdf;
TODO: not all suggested programs are equivalent. In particular, some might preserve metadata from the original document, and they handle pages with different layouts and internal links differently.
According to pdfinfo:

| Command | version | size in bytes | Preserve metadata (title, producer, …) |
|---|---|---|---|
|  |  |  | no, metadata is empty |
|  |  |  | no, metadata is empty |
|  |  |  | yes |
|  |  |  | no, overwrites metadata |
|  |  |  | no, overwrites metadata, also adds "Custom metadata", and "updated" from PDF 1.5 to PDF 1.3 |
Size of input file: 2839681
bytes
How can qpdf
create such a small file? It deduplicates all objects. Since in this test, I appended a file to itself, most internal objects can be reused.
I did not otherwise validate the output.pdf
file, nor did I use a particularly complex document.
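The deduplication effect described above is easy to reproduce by appending a file to itself and comparing the sizes (a sketch with a hypothetical helper name, assuming qpdf and GNU du are available):

```shell
# append_to_itself: merge a PDF with itself; thanks to object
# deduplication, doubled.pdf is far less than twice the input size
# (hypothetical sketch; assumes qpdf is installed)
append_to_itself() {
    qpdf --empty --pages "$1" "$1" -- doubled.pdf
    du -b "$1" doubled.pdf
}
```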
Extract one or more pages from a PDF file
# pdftk
pdftk input.pdf cat 2-17 20 23-24 output output.pdf;
# qpdf
qpdf --empty --pages input.pdf 2-17,20,23-24 -- output.pdf;
# pdfjam, might be already available if working with LaTeX
pdfjam input.pdf 2-17,20,23-24 -o output.pdf;
TODO: not all suggested programs are equivalent. In particular, some might preserve metadata from the original document, and they handle pages with different layouts and internal links differently.
According to pdfinfo:

| Command | version | size in bytes | Preserve metadata (title, producer, …) |
|---|---|---|---|
|  |  |  | no, metadata is empty |
|  |  |  | yes |
|  |  |  | no, overwrites metadata |
|  |  |  | no, overwrites metadata, also adds "Custom metadata", and "updated" from PDF 1.5 to PDF 1.3 |
I did not validate output.pdf
, nor did I use a particularly complex document during my tests.
Show document metadata
pdfinfo input.pdf;
The output will look similar to
Keywords: <keywords>
Author: <author>
Creator: <program used for creating the document>
Producer:
CreationDate: Wed Jan 22 16:44:51 2020 CET
Custom Metadata: no
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 2
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 12008 bytes
Optimized: no
PDF version: 1.4
Note that a document generally has much more metadata than what is shown by pdfinfo
and other tools.
Compress PDF file
Some documents are huge.
PDF files can have complex structures, and there are many ways to represent certain content.
In addition to that, elements inside a PDF file can be compressed.
Thus it should be possible to "optimize" the size of a document, just like jpegoptim
and optipng
do for .jpg
and .png
files.
There are mainly two methods. The first one is to remove all elements that are not visible, possibly metadata too.
The second one is to represent the content more efficiently.
As of today, all methods tried (except one) failed. By "failed" I mean that the optimization was not lossless, although I might not have been able to detect it with the naked eye.
Again, diffpdf
will help to validate if a conversion was lossless or not.
- ps2pdf input.pdf output.pdf will generally not work. It converts a PDF file to PS and then to PDF again.
- gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf (or some variation of a similar command) will also not work. It converts the document between different formats.
- exiftool -all:all= document.pdf creates a bigger file. This effect can be explained if one looks at the output: Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! The metadata is not removed; new empty metadata is appended to the document (and generally only the last metadata is shown, you can try it with pdfinfo too). If someone suggests using exiftool alone, they never bothered to look at the output, or at the generated document.
In general, gs
and ps2pdf
are lossy, and they can help if the document has hidden content (metadata too) or images at an unnecessarily high resolution.
I’m not sure why exiftool
is presented as a tool capable of removing metadata: it is not. It only adds additional empty metadata and thus increases the size.
While it is true that pdfinfo
and most other tools only show the last metadata, the original metadata is still there in cleartext.
There are mainly two ways to compress PDF files: with lossy or lossless methods.
In this context, lossless means that the rendered content is identical, and that text remains text (thus it can still be copied). Unused objects and invisible content might be removed, and the internal structure of the PDF document might look completely different.
Lossy means that there might be some artifacts, and text might be converted to an image.
The only method I’ve found for doing a lossless compression is qpdf
:
qpdf --recompress-flate --compress-streams=y --object-streams=generate --remove-unreferenced-resources=yes --coalesce-contents input.pdf output.pdf;
There is also the parameter --optimize-images
, which (I suppose) will make a lossy compression of the images.
Decrypt password-protected PDF file
Not all programs can handle PDF documents with passwords, and I do not want to decrypt those documents every time I want to read them.
So I generally save a decrypted copy
qpdf --password="$PASSWORD" --decrypt input.pdf output.nopwd.pdf;
Remove copy protection and other restrictions
PDF files might have restrictions; for example not being able to print them, or not being able to select and copy the text.
Those restrictions are handled by the PDF reader, thus using a PDF reader that does not restrict what the end-user can do is sufficient.
But if you have to work with a given program that applies those restrictions, you might want to remove them.
The "official" way is to use the "owner password" (not to be confused with the "user password", used for decrypting a password-protected file), but you do not need to know it
qpdf --decrypt input-with-restrictions.pdf output-without-restrictions.pdf;
Create a page with multiple images
Instead of printing one (small) image per page, I find it much more logical to print as many images as possible at once.
Normally I use LibreOffice Writer, but I assume that Microsoft Word is equally good.
In fact, for my main use case, I do not even need to create a PDF document if the device attached to the printer supports .odt
, .doc
, or .docx
documents.
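If a command-line route is preferred, ImageMagick's montage can tile images onto the pages of a PDF. This is a sketch with a hypothetical helper name; the 2x3 grid and the spacing are arbitrary values to tune:

```shell
# make_contact_sheet: tile the given images six per page into pages.pdf
# (hypothetical sketch; assumes ImageMagick is installed, grid and
# spacing values are arbitrary)
make_contact_sheet() {
    montage "$@" -tile 2x3 -geometry +10+10 pages.pdf
}
```

For example, `make_contact_sheet img-*.png` would write all the images, six per page, into pages.pdf.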
Extract attachments / embedded files
PDF files can act more or less like zip archives, and contain other files.
qpdf
has the command-line options --list-attachments
and --show-attachment
, which can be used for listing and extracting the embedded files.
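Extracting every attachment can be sketched by combining the two options (a hypothetical helper; it assumes a reasonably recent qpdf, and that the first whitespace-separated field of the --list-attachments output is the attachment key):

```shell
# extract_attachments: save every embedded file of a PDF into the
# current directory, using the attachment key as the file name
# (hypothetical sketch; assumes a reasonably recent qpdf)
extract_attachments() {
    qpdf --list-attachments "$1" | cut -d' ' -f1 | while read -r key; do
        qpdf --show-attachment="$key" "$1" > "$key"
    done
}
```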
Validate and repair PDF document
Even if a PDF reader can open, process, and show the content of a document, it does not mean that the file does not have some invalid structure.
Most programs can handle malformed PDF documents, but it would be better not to have such documents, especially if you are dealing with digital signatures, have created such documents to send to other people, or want to be sure to be able to open such documents in the future.
qpdf
has an option for verifying if a PDF document is valid: qpdf --check input.pdf
It will not be able to verify all possible types of errors, but it seems to check multiple properties.
If you want to repair a document, then processing it with any of the mentioned tools (qpdf
, pdftk
, pdfjam
, pdfunite
, …) should repair it, as they generally parse a document and create a new one.
For example, qpdf input.pdf output.pdf
will create a valid output.pdf
file, even if input.pdf
is a partially broken PDF file.
Extract text from PDF
If you want to read a PDF file from the command line, or if you have some automated process that depends on the content of a PDF file, you might want to extract the text from it.
# pdf2txt, from the package python3-pdfminer in Debian
pdf2txt input.pdf;
# pdftotext (often part of the poppler package)
pdftotext input.pdf; # saves output to input.txt
pdftotext input.pdf -; # prints output to console
On the documents I’ve tested, pdftotext
seems to give better results, as the displayed text has a "better" layout, that also resembles the layout used in the PDF file.
Note that both programs have some options for defining the layout of the output, and for creating HTML files.
Other Programs
There is pdfarranger
:
PDF Arranger is a small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop, and rearrange their pages using an interactive and intuitive graphical interface.
Xournal++ is a note-taking software that can also import, modify, and export PDF files. Contrary to LibreOffice Draw, the resulting document seems to be much more similar to the original, although it might still have some artifacts.
And tabula-java, which is both a (Java) library and a command line tool specialized in extracting tabular data from PDF files.
And some document readers:
- Adobe Reader, which does not support GNU/Linux systems. It is, at the end of the day, the document reader that more or less defines whether other people can interact with your document. I’m still using Windows XP from time to time because of it.
- Sumatra PDF, which, contrary to the name, supports many more formats and is extremely lightweight. Unfortunately, it works only on Windows.
- MuPDF viewer, a lightweight PDF viewer. There is also a version for Android, and it supports other file formats too.
- KOReader, "a document viewer primarily aimed at e-ink readers".
Do you want to share your opinion? Or is there an error, some parts that are not clear enough?
You can contact me anytime.