The pyPdf.pdf Module

Destination(title, page, typ, *args) (class) [#]

A class representing a destination within a PDF file.

For more information about this class, see The Destination Class.

DocumentInformation() (class) [#]

A class representing the basic document metadata provided in a PDF file.

For more information about this class, see The DocumentInformation Class.

PageObject(pdf) (class) [#]

This class represents a single page within a PDF file.

For more information about this class, see The PageObject Class.

PdfFileReader(stream) (class) [#]

Initializes a PdfFileReader object.

stream
An object that supports the standard read and seek methods similar to a file object.

For more information about this class, see The PdfFileReader Class.

PdfFileWriter() (class) [#]

This class supports writing PDF files out, given pages produced by another class (typically PdfFileReader).

For more information about this class, see The PdfFileWriter Class.

The Destination Class

Destination(title, page, typ, *args) (class) [#]

A class representing a destination within a PDF file. See section 8.2.1 of the PDF 1.6 reference.

Stability: Added in v1.10, will exist for all v1.x releases.

bottom [#]

Read-only property accessing the bottom vertical coordinate.

Returns:
A number, or None if not available.

left [#]

Read-only property accessing the left horizontal coordinate.

Returns:
A number, or None if not available.

page [#]

Read-only property accessing the destination page.

Returns:
An integer.

right [#]

Read-only property accessing the right horizontal coordinate.

Returns:
A number, or None if not available.

title [#]

Read-only property accessing the destination title.

Returns:
A string.

top [#]

Read-only property accessing the top vertical coordinate.

Returns:
A number, or None if not available.

typ [#]

Read-only property accessing the destination type.

Returns:
A string.

zoom [#]

Read-only property accessing the zoom factor.

Returns:
A number, or None if not available.

The DocumentInformation Class

DocumentInformation() (class) [#]

A class representing the basic document metadata provided in a PDF file.

As of pyPdf v1.10, every text property of the document metadata comes in two forms, e.g. author and author_raw. The non-raw property always returns a TextStringObject, which makes it the better choice when the metadata is being displayed. The raw property can return a ByteStringObject if pyPdf was unable to decode the string's text encoding; callers must handle that case themselves, so the raw property is accessed less often.

author [#]

Read-only property accessing the document's author. Added in v1.6, will exist for all future v1.x releases. Modified in v1.10 to always return a unicode string (TextStringObject).

Returns:
A unicode string, or None if the author is not provided.

creator [#]

Read-only property accessing the document's creator. If the document was converted to PDF from another format, this is the name of the application (for example, OpenOffice) that created the original document. Added in v1.6, will exist for all future v1.x releases. Modified in v1.10 to always return a unicode string (TextStringObject).

Returns:
A unicode string, or None if the creator is not provided.

producer [#]

Read-only property accessing the document's producer. If the document was converted to PDF from another format, this is the name of the application (for example, OSX Quartz) that converted it to PDF. Added in v1.6, will exist for all future v1.x releases. Modified in v1.10 to always return a unicode string (TextStringObject).

Returns:
A unicode string, or None if the producer is not provided.

subject [#]

Read-only property accessing the subject of the document. Added in v1.6, will exist for all future v1.x releases. Modified in v1.10 to always return a unicode string (TextStringObject).

Returns:
A unicode string, or None if the subject is not provided.

title [#]

Read-only property accessing the document's title. Added in v1.6, will exist for all future v1.x releases. Modified in v1.10 to always return a unicode string (TextStringObject).

Returns:
A unicode string, or None if the title is not provided.
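
As a rough illustration of the properties above, the following sketch collects them into a plain dictionary for display. The helper name summarize_info and the placeholder text are hypothetical and not part of pyPdf; the only assumption about info is that it exposes the five documented properties.

```python
def summarize_info(info, placeholder="(not provided)"):
    """Collect the documented text properties of a DocumentInformation
    instance into a dict, substituting a placeholder for missing values."""
    fields = ("title", "author", "subject", "creator", "producer")
    summary = {}
    for name in fields:
        value = getattr(info, name)
        # Each property returns None when the entry is absent.
        summary[name] = placeholder if value is None else value
    return summary

# Possible usage, assuming pyPdf is installed and "input.pdf" exists:
#   from pyPdf.pdf import PdfFileReader
#   info = PdfFileReader(open("input.pdf", "rb")).getDocumentInfo()
#   if info is not None:
#       print(summarize_info(info))
```

Only the non-raw properties are used here, since they are the ones described as suitable for display.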

The PageObject Class

PageObject(pdf) (class) [#]

This class represents a single page within a PDF file. Typically this object will be created by accessing the getPage function of the PdfFileReader class.

artBox [#]

A rectangle (RectangleObject), expressed in default user space units, defining the extent of the page's meaningful content as intended by the page's creator.

Stability: Added in v1.4, will exist for all future v1.x releases.

bleedBox [#]

A rectangle (RectangleObject), expressed in default user space units, defining the region to which the contents of the page should be clipped when output in a production environment.

Stability: Added in v1.4, will exist for all future v1.x releases.

compressContentStreams() [#]

Reduces the size of this page by joining all content streams and applying a FlateDecode filter.

Stability: Added in v1.6, will exist for all future v1.x releases. However, it is possible that this function will perform no action if content stream compression becomes "automatic" for some reason.

cropBox [#]

A rectangle (RectangleObject), expressed in default user space units, defining the visible region of default user space. When the page is displayed or printed, its contents are to be clipped (cropped) to this rectangle and then imposed on the output medium in some implementation-defined manner. Default value: same as MediaBox.

Stability: Added in v1.4, will exist for all future v1.x releases.

extractText() [#]

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Stability: Added in v1.7, will exist for all future v1.x releases. May be overhauled to provide more ordered text in the future.

Returns:
A unicode string.
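
A minimal sketch of collecting text from every page, using only the pages property of PdfFileReader and extractText as documented here. The helper name extract_document_text and the separator argument are hypothetical. As noted above, the order of the extracted text within a page should not be relied upon.

```python
def extract_document_text(reader, separator="\n"):
    """Join the text of every page of a PdfFileReader-like object."""
    chunks = []
    for page in reader.pages:
        # extractText() returns the page's text as a unicode string.
        chunks.append(page.extractText())
    return separator.join(chunks)

# Possible usage, assuming pyPdf is installed and "input.pdf" exists:
#   from pyPdf.pdf import PdfFileReader
#   reader = PdfFileReader(open("input.pdf", "rb"))
#   text = extract_document_text(reader)
```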

mediaBox [#]

A rectangle (RectangleObject), expressed in default user space units, defining the boundaries of the physical medium on which the page is intended to be displayed or printed.

Stability: Added in v1.4, will exist for all future v1.x releases.

mergePage(page2) [#]

Merges the content streams of two pages into one. Resource references (e.g. fonts) are maintained from both pages. The mediaBox, cropBox, etc. of this page are not altered. The content stream of page2 is added to the end of this page's content stream, meaning that it is drawn after, or "on top of", this page.

Stability: Added in v1.4, will exist for all future v1.x releases.

page2
An instance of PageObject to be merged into this one.
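
One common use of mergePage is stamping the same overlay (a watermark, for instance) onto several pages. The sketch below assumes nothing beyond the documented mergePage(page2) call; the helper name stamp_pages is hypothetical.

```python
def stamp_pages(pages, stamp_page):
    """Draw stamp_page on top of each page in pages.

    Every page is modified in place; its boxes are left unchanged,
    and stamp_page's content is appended after the page's own content.
    """
    count = 0
    for page in pages:
        page.mergePage(stamp_page)
        count += 1
    return count

# Possible usage, assuming pyPdf is installed and both files exist:
#   from pyPdf.pdf import PdfFileReader
#   document = PdfFileReader(open("document.pdf", "rb"))
#   stamp = PdfFileReader(open("watermark.pdf", "rb")).getPage(0)
#   stamp_pages(document.pages, stamp)
```

The return value (the number of pages stamped) is a convenience of this sketch, not something mergePage provides.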

rotateClockwise(angle) [#]

Rotates a page clockwise by a multiple of 90 degrees.

Stability: Added in v1.1, will exist for all future v1.x releases.

angle
Angle to rotate the page. Must be a multiple of 90 degrees.

rotateCounterClockwise(angle) [#]

Rotates a page counter-clockwise by a multiple of 90 degrees.

Stability: Added in v1.1, will exist for all future v1.x releases.

angle
Angle to rotate the page. Must be a multiple of 90 degrees.
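
Because both rotation functions only accept right-angle rotations, a caller may want to validate the angle before touching any page. The following sketch does that; checked_angle and rotate_all are hypothetical helper names, and raising ValueError is a choice made for this example rather than documented pyPdf behaviour.

```python
def checked_angle(angle):
    """Return angle unchanged if it is a multiple of 90, else raise."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90, got %r" % (angle,))
    return angle

def rotate_all(pages, angle, clockwise=True):
    """Rotate every page in pages by the same validated angle."""
    angle = checked_angle(angle)  # fail before modifying any page
    for page in pages:
        if clockwise:
            page.rotateClockwise(angle)
        else:
            page.rotateCounterClockwise(angle)
```

Validating first means that a bad angle leaves every page untouched instead of failing partway through the list.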

trimBox [#]

A rectangle (RectangleObject), expressed in default user space units, defining the intended dimensions of the finished page after trimming.

Stability: Added in v1.4, will exist for all future v1.x releases.

The PdfFileReader Class

PdfFileReader(stream) (class) [#]

Initializes a PdfFileReader object. This operation can take some time, as the PDF stream's cross-reference tables are read into memory.

Stability: Added in v1.0, will exist for all v1.x releases.

stream
An object that supports the standard read and seek methods similar to a file object.
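
The stream requirement can be checked up front. This sketch tests only for the two methods named above; the helper name require_reader_stream is hypothetical, and an in-memory io.BytesIO object is used merely as an example of something that satisfies the check.

```python
import io

def require_reader_stream(stream):
    """Return stream if it offers read() and seek(), else raise TypeError."""
    for method in ("read", "seek"):
        if not callable(getattr(stream, method, None)):
            raise TypeError("stream lacks a %s() method" % method)
    return stream

# An in-memory buffer passes the check, as would a file opened with "rb".
buffer_ok = require_reader_stream(io.BytesIO()) is not None
```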

decrypt(password) [#]

When using an encrypted / secured PDF file with the PDF Standard encryption handler, this function will allow the file to be decrypted. It checks the given password against the document's user password and owner password, and then stores the resulting decryption key if either password is correct.

It does not matter which password was matched. Both passwords provide the correct decryption key that will allow the document to be used with this library.

Stability: Added in v1.8, will exist for all future v1.x releases.

Returns:
0 if the password failed, 1 if the password matched the user password, and 2 if the password matched the owner password.
Raises NotImplementedError:
Document uses an unsupported encryption method.
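
The return codes can be turned into something more readable. The sketch below relies only on the documented isEncrypted property and the documented return values of decrypt; the helper name unlock, the result labels, and the use of ValueError for a rejected password are assumptions of this example.

```python
DECRYPT_LABELS = {1: "user", 2: "owner"}

def unlock(reader, password):
    """Decrypt reader if needed and report which password matched.

    Returns "not encrypted", "user" or "owner".
    Raises ValueError if the password is rejected.
    """
    if not reader.isEncrypted:
        return "not encrypted"
    code = reader.decrypt(password)
    if code == 0:
        raise ValueError("password did not match")
    return DECRYPT_LABELS[code]
```

A NotImplementedError raised by decrypt for an unsupported encryption method is deliberately left to propagate to the caller.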

documentInfo [#]

Read-only property that accesses the getDocumentInfo function.

Stability: Added in v1.7, will exist for all future v1.x releases.

getDocumentInfo() [#]

Retrieves the PDF file's document information dictionary, if it exists. Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function.

Stability: Added in v1.6, will exist for all future v1.x releases.

Returns:
A DocumentInformation instance, or None if none exists.

getNamedDestinations(tree=None, retval=None) [#]

Retrieves the named destinations present in the document.

Stability: Added in v1.10, will exist for all future v1.x releases.

Returns:
A dict which maps names to Destination instances.
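
Since the result is an ordinary dict, it can be turned into a sorted listing. This is a small sketch that uses only the title property of Destination; the helper name destination_titles is hypothetical.

```python
def destination_titles(named_destinations):
    """Return (name, title) pairs sorted by destination name."""
    return [(name, named_destinations[name].title)
            for name in sorted(named_destinations)]

# Possible usage, assuming reader is a PdfFileReader instance:
#   for name, title in destination_titles(reader.getNamedDestinations()):
#       print("%s -> %s" % (name, title))
```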

getNumPages() [#]

Calculates the number of pages in this PDF file.

Stability: Added in v1.0, will exist for all v1.x releases.

Returns:
An integer.

getOutlines(node=None, outlines=None) [#]

Retrieves the document outline present in the document.

Stability: Added in v1.10, will exist for all future v1.x releases.

Returns:
A nested list of Destination instances.
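
A nested list is awkward to print directly, so a caller may prefer a flat list with depth information. The sketch below assumes that a sub-list holds the children of the outline level it appears in, which is an interpretation of "nested list" rather than something stated here; flatten_outlines is a hypothetical helper name.

```python
def flatten_outlines(outlines, depth=0):
    """Flatten a nested outline list into (depth, title) pairs."""
    flat = []
    for entry in outlines:
        if isinstance(entry, list):
            # A sub-list is treated as one level deeper.
            flat.extend(flatten_outlines(entry, depth + 1))
        else:
            flat.append((depth, entry.title))
    return flat

# Possible usage, assuming reader is a PdfFileReader instance:
#   for depth, title in flatten_outlines(reader.getOutlines()):
#       print("  " * depth + title)
```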

getPage(pageNumber) [#]

Retrieves a page by number from this PDF file.

Stability: Added in v1.0, will exist for all v1.x releases.

Returns:
A PageObject instance.
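
getPage and getNumPages can be combined to fetch a range of pages. This sketch assumes page numbers start at 0, which is not stated above and should be verified against the installed version; pages_in_range is a hypothetical helper name.

```python
def pages_in_range(reader, start, stop):
    """Return the pages numbered start..stop-1, clipped to the document.

    Assumes zero-based page numbers (an assumption of this sketch).
    """
    stop = min(stop, reader.getNumPages())
    return [reader.getPage(number) for number in range(start, stop)]
```

Clipping stop to getNumPages() avoids asking for a page past the end of the document.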

isEncrypted [#]

Read-only boolean property showing whether this PDF file is encrypted. Note that this property, if true, will remain true even after the decrypt function is called.

namedDestinations [#]

Read-only property that accesses the getNamedDestinations function.

Stability: Added in v1.10, will exist for all future v1.x releases.

numPages [#]

Read-only property that accesses the getNumPages function.

Stability: Added in v1.7, will exist for all future v1.x releases.

outlines [#]

Read-only property that accesses the getOutlines function.

Stability: Added in v1.10, will exist for all future v1.x releases.

pages [#]

Read-only property that emulates a list based upon the getNumPages and getPage functions.

Stability: Added in v1.7, will exist for all future v1.x releases.

The PdfFileWriter Class

PdfFileWriter() (class) [#]

This class supports writing PDF files out, given pages produced by another class (typically PdfFileReader).

addPage(page) [#]

Adds a page to this PDF file. The page is usually acquired from a PdfFileReader instance.

Stability: Added in v1.0, will exist for all v1.x releases.

page
The page to add to the document. This argument should be an instance of PageObject.

encrypt(user_pwd, owner_pwd=None, use_128bit=True) [#]

Encrypt this PDF file with the PDF Standard encryption handler.

user_pwd
The "user password", which allows for opening and reading the PDF file with the restrictions provided.
owner_pwd
The "owner password", which allows for opening the PDF file without any restrictions. By default, the owner password is the same as the user password.
use_128bit
Boolean argument controlling whether 128-bit encryption is used. When false, 40-bit encryption is used. By default, this flag is on.

write(stream) [#]

Writes the collection of pages added to this object out as a PDF file.

Stability: Added in v1.0, will exist for all v1.x releases.

stream
An object to write the file to. The object must support the write and tell methods, similar to a file object.
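
Putting addPage, encrypt and write together, a typical workflow might look like the sketch below. It uses only the calls documented in this section; the helper name write_pages is hypothetical, and calling encrypt before write is an assumption of this example about the expected order.

```python
def write_pages(writer, pages, stream, user_pwd=None):
    """Add pages to a PdfFileWriter-like object and write them to stream.

    If user_pwd is given, the output is encrypted with that password
    (the owner password then defaults to the same value).
    """
    count = 0
    for page in pages:
        writer.addPage(page)
        count += 1
    if user_pwd is not None:
        writer.encrypt(user_pwd)
    writer.write(stream)
    return count

# Possible usage, assuming pyPdf is installed and "input.pdf" exists:
#   from pyPdf.pdf import PdfFileReader, PdfFileWriter
#   reader = PdfFileReader(open("input.pdf", "rb"))
#   output = open("output.pdf", "wb")
#   write_pages(PdfFileWriter(), reader.pages, output)
#   output.close()
```

The output file is opened in binary mode so that the stream's write and tell methods operate on bytes.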