Introduction

This is the documentation for BFO Publisher ${VERSION}, the HTML to PDF converter from https://publisher.bfo.com

BFO Publisher is a CSS Layout engine. It takes HTML or XML as input, applies a CSS stylesheet to it to lay out the elements on one or more pages then writes the result to a PDF, just as a web browser would write the results to your screen.

Here are some highlights:

  • Input formats:

    • HTML5, SVG2, MathML4 or generic XML.

    • Anything we can convert to HTML, such as EPUB, plain-text or AsciiDoc.

    • Embed SVG, bitmaps, video, audio, barcodes, 3D models and PDF. Here’s the list.

    • All the world’s languages supported. Well, not traditional Mongolian, but all the others.

  • Output formats:

    • PDF and PDF 2.0. PDF/A-1, A-2, A-3 and A-4, PDF/X and best-in-class support for PDF/UA with full control over tags and attributes.

    • SVG, including options for multi-page files.

    • Any bitmap format, created from the PDF.

  • Rigorously tested against the Web Platform Tests, which are also used to test browsers. Accurate support for CSS2.1 (passing 98.6% of tests, about the same as your browser) means tables look the same in PDF as they do on screen.

  • Comprehensive CSS3 support: selectors-4, cascade-5, color-5, fonts-4, backgrounds-3, images-4, shapes-2, page-3, gcpm-1, flex-1 and many more.

  • Built on our PDF Library, it generates PDF/A and PDF/UA files that are verified as correct.

  • Digital signatures, including PKCS#11 keystores, Amazon Cloud HSM and the GlobalSign digital-signing service.

  • HTML5 forms are as supported as we can make them!

  • Streaming model means ten-thousand page documents are no problem.

  • XSLT and XInclude support.

  • HTTP or WebSocket-based web-service option to do all of this on a remote host.

  • 100% home-grown Java. No native code, minimal third-party libraries. Multi-thread ready with smart caching for batch conversions.

This document will show you how to do all of the above. However it will not go into heavy detail on CSS itself, as it uses the same CSS layout rules as Firefox, Chrome and Safari. It’s normal CSS. Rather than repeat all of that here, this document will focus mainly on the extensions and differences that make BFO Publisher unique, as well as some of the lesser known areas of CSS that are required for layout to paged media. There are a million guides on the internet for CSS, so you can pick your favourite (we recommend https://developer.mozilla.org/en-US/docs/Web/CSS/Reference) or head to https://www.w3.org/Style/CSS/ for the source.

As well as what we can do, here’s what we can’t do: we’d rather you hear it from us, as it may save you some time.

  • No JavaScript! Why? Because we don’t have a DOM. BFO Publisher is built around the concept of streaming, so XML elements are fed in, styled, put on the page and discarded. The entire document is never in memory, which means if you want to lay out a million row table, you can (it’s 20,200 pages of A4: with 175MB of heap it takes just under two minutes on a modern laptop). Losing JavaScript is the price we pay for that ability.

  • PDF is not interactive, so has no support for scrollable areas, animation, mouse-hover styling and similar.

  • HTML forms use a very different model to PDF forms; we map as close as we can.

  • Special elements like <video> and <audio> are more limited in PDF than in HTML.

  • Quirks mode layout is not supported.

  • Variable OpenType fonts are not supported in PDF.

  • For CSS layout, see our release notes for the list of known issues.

Overall, if your file has no JavaScript we should be able to lay it out into PDF pretty much exactly as you’d see it on screen.

Quick Start

There are three ways to use BFO Publisher: build an application with the API, run the Web Service and control it with HTTP or WebSockets, or run it from the command line (easiest, but also slowest due to Java’s start-up time and the need to reload the system stylesheets and fonts for each conversion).

Command Line

Download the application, unzip it, then run one of the commands below.

# For help on the CLI interface
$ java -jar bfopublisher-bundle.jar --help

# As above, but include extra JAR files in the classpath
$ java -cp bfopublisher-bundle.jar:path/to/otherjar.jar org.faceless.publisher.Main --help

# To run the web-service
$ java -jar bfopublisher-bundle.jar --web

# An example conversion
$ java -jar bfopublisher-bundle.jar --format pdf --input input.html --output out.pdf

GUI

Download as above but double-click on the bfopublisher-bundle.jar. The application will run as a web-service with an icon in the System taskbar. The command java -jar bfopublisher-bundle.jar --web-gui may also be used to start the application in this way.

Docker

Run one of the commands below and then connect to the exported web interface.

# run application normally
$ docker run -P -d bfocom/publisher

# as above, but add any JAR files in "/local/jarpath" to the classpath
$ docker run -v /local/jarpath:/var/publisher/jars -d bfocom/publisher

API "Hello World" example

Create the following in the file Converter.java

import java.io.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import org.faceless.publisher.ReportFactory;
import org.faceless.publisher.Report;
import org.faceless.publisher.output.ReportOutput;
import org.faceless.publisher.type.MediaType;

public class Converter {
  final ReportFactory factory = new ReportFactory();

  public static void main(String[] args) throws Exception {
      new Converter().convert(new File(args[0]), new File(args[1]));
  }

  public void convert(File infile, File outfile)
      throws IOException, SAXException, ParserConfigurationException
  {
    Report report = factory.createReport();
    // Request PDF output by Media Type
    ReportOutput output = factory.createReportOutput(MediaType.parse("application/pdf"));
    report.setReportOutput(output);
    report.load(infile);
    report.parse();
    // try-with-resources closes the stream even if write() throws
    try (FileOutputStream stream = new FileOutputStream(outfile)) {
      output.write(stream);
    }
  }
}

Compile and run:

$ javac -cp bfopublisher-bundle.jar Converter.java
$ java -cp bfopublisher-bundle.jar:. Converter helloworld.html helloworld.pdf

HTML vs. XML

XML should be very well understood by any developer, and in particular its use of namespaces. BFO Publisher’s native input format is XML, and we ship with CSS stylesheets for formatting the XML namespaces most commonly associated with CSS: HTML, SVG, MathML.

HTML is similar enough to XML that it’s often indistinguishable. However there are some important differences between the two which anyone using BFO Publisher should be aware of. In fact, they’re important enough that our number-one rule is: know whether your file is being parsed as HTML or XML. BFO Publisher chooses the XML or HTML parser depending on the Media Type of the file (as set by the HTTP Content-Type header, for example), or the URL suffix if the Media Type is unknown.

If it still can’t be determined the API will guess, and like most guesses it will sometimes be wrong. So always ensure the Media Type or URL can be used to determine the file type.
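The selection rule described above can be sketched in a few lines of plain Java. This is an illustration only and not BFO Publisher's own code: the class name ParserChooser and its methods are hypothetical, and the media types and URL suffixes are simply the ones listed in the XML and HTML comparison that follows.

```java
import java.util.Locale;

/** Hypothetical helper that mirrors the documented rule; not part of the BFO Publisher API. */
final class ParserChooser {
    enum Kind { XML, HTML, UNKNOWN }

    /** The Media Type wins; the URL suffix is consulted only when the type is unknown. */
    static Kind choose(String mediaType, String url) {
        Kind kind = fromMediaType(mediaType);
        return kind != Kind.UNKNOWN ? kind : fromSuffix(url);
    }

    static Kind fromMediaType(String mediaType) {
        if (mediaType == null) return Kind.UNKNOWN;
        // Drop parameters such as "; charset=utf-8" before comparing.
        String type = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (type.equals("text/html")) return Kind.HTML;
        if (type.equals("text/xml") || type.equals("application/xml")
                || (type.startsWith("application/") && type.endsWith("+xml"))) {
            return Kind.XML;
        }
        return Kind.UNKNOWN;
    }

    static Kind fromSuffix(String url) {
        if (url == null) return Kind.UNKNOWN;
        // Ignore any query string or fragment when looking at the suffix.
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        if (path.endsWith(".xml") || path.endsWith(".xht")) return Kind.XML;
        if (path.endsWith(".html") || path.endsWith(".htm")) return Kind.HTML;
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(choose("application/xhtml+xml", "page.html"));
        System.out.println(choose(null, "https://example.com/report.htm?x=1"));
    }
}
```

Note the first call in main: a file served as application/xhtml+xml is treated as XML under this rule even though its URL ends in .html, which is exactly the kind of surprise the number-one rule is meant to prevent.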

XML Parsing uses the Java SAX implementation, and the one supplied with the JVM is normally the best choice. The Woodstox parser at https://github.com/FasterXML/woodstox is certainly faster so if speed is an issue it’s worth a try. We do not currently recommend the stock Apache Xerces parser - this has a number of issues which seriously impact performance, and some outright bugs. Do not use it. HTML parsing requires the parser from http://about.validator.nu - the 1.4 build is very old but works well, and later builds from trunk should be fine too.

XML vs. HTML, point by point:

XML: Media type of text/xml, application/xml or application/nnn+xml. Is presumed for URLs ending in .xml or .xht.
HTML: Media type of text/html. Is presumed for URLs ending in .html or .htm.

XML: Default encoding of UTF-8 may be overridden with an initial <?xml encoding?> processing instruction. XML 1.0 disallows most control characters; XML 1.1 has a larger range but some (eg U+0000) can never be represented in XML, not even escaped.
HTML: Default encoding of ISO-8859-1 may be overridden with a <meta charset> element near the start of the input file. All characters can be represented, even U+0000.

XML: Case-sensitive.
HTML: Case-insensitive. Attributes or elements which are defined as mixed case (only a few, all in SVG) are handled specially by the parser, the rest are lower-cased internally.

XML: XML requires text and attributes to escape & < and > as &amp; &lt; and &gt; (alternatively CDATA blocks can be used, eg for inline CSS stylesheets).
HTML: No escaping is required. The parsing rules vary by element, which allows <script> and <style> to contain text content that would define a child node in other contexts.

XML: Processing instructions can be used, such as <?xml-stylesheet?> to apply an XSLT stylesheet to the entire document (see the XSLT section).
HTML: Processing instructions are not part of the HTML language.

XML: Elements must be either self-closed or closed with an explicit close tag.
HTML: Elements do not need to be closed, although it’s often good practice. Self-closing tags are not part of the HTML language.

XML: Malformed documents will fail to parse (specifically, the SAX parser used by BFO Publisher will throw a SAXParseException).
HTML: Malformed documents are repaired according to well-defined but opaque rules. Parse failures are almost impossible but the repair process can cause surprises. For example the CSS rule p > p { color: blue } will never match: the HTML parser alters the document so a <p> element will never have a <p> child.

XML: Elements have no namespace unless one is set with the xmlns attribute. Elements in the HTML, SVG and MathML namespaces must have their namespaces explicitly specified to pick up the default CSS stylesheet rules.
HTML: An element’s namespace is derived from its name - so <html> and <body> are always in the HTML namespace, <svg> and <rect> in the SVG namespace, <mtext> in the MathML namespace and so on.

XML: Document can interleave any number of different namespaces. Attributes can have namespaces.
HTML: Only HTML, SVG and MathML are recognised officially, and attributes have no namespaces. BFO Publisher has an extension to change this (see HTML Namespace Extensions).

XML: The CSS id, class and :lang() selectors match the id and xml:id, class, and lang and xml:lang attributes for the HTML, SVG and MathML namespaces. Other namespaces default to the same rules, but this can be configured.
HTML: The CSS id, class and :lang() selectors match the id, class and lang attributes. xml:id and xml:lang do not apply.

XML: The xml:base attribute can be used to set the URL against which any relative URLs are resolved. The attribute applies to any child elements, and can be set several times throughout a document. The <base> element in the HTML namespace still works, and sets the default.
HTML: The <base> element can be set once in the document <head> to set the URL against which any relative URLs are resolved. This is a document-wide setting. xml:base does not apply.

XML: <xi:include> can be used to include XML from an external source. BFO Publisher extends this to support including HTML as well.
HTML: As HTML doesn’t support namespaces, <xi:include> can’t be used unless HTML Namespace Extensions are used. When enabled, it works as it does for XML.
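If you generate XML input programmatically, the escaping requirement in the comparison above is the one most likely to produce a SAXParseException. The sketch below shows one way to escape text before writing it into an XML file. It uses only the Java standard library; the class name XmlEscaper is hypothetical, and escaping the double quote in attribute values is an addition beyond the three characters listed above (it is needed whenever the value is written inside double quotes).

```java
/** Hypothetical helper for writing XML input by hand; not part of the BFO Publisher API. */
final class XmlEscaper {
    /** Escapes the three characters listed for XML: & < and >. */
    static String escapeText(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** As escapeText, plus the double quote, for values written inside "...". */
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println("<p title=\"" + escapeAttribute("a \"b\" & c") + "\">"
                + escapeText("1 < 2 && 3 > 2") + "</p>");
    }
}
```

None of this is needed when the same content is parsed as HTML, which is another reason to be sure which parser your file will be given.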

HTML Namespace Extensions

While it is completely possible to use BFO Publisher to parse HTML without any additional namespaces, some additional features require them. In particular the <xi:include> syntax to include external files requires the appropriate namespace to be defined. So BFO Publisher lets you extend the list of namespaces known to HTML with an extension:

Name

bfo-ext-html-namespace

Value

auto | none | xmlns | common | <name>=<uri> [ <name>=<uri> …​ ]

The bfo-ext-html-namespace environment variable can be set to configure how namespaces are derived from HTML input. The values are:

none

no namespaces are derived. Only the HTML, SVG and MathML namespaces can be used. This matches the official HTML specification.

xmlns

the xmlns attribute (and xmlns:nnn attributes) get the same meaning they have in XML, and are used to map both unprefixed and prefixed elements to a particular namespace.

common

attributes or elements in a predefined list are recognised as belonging to a specific namespace. This is identical to the list of name=uri option below, except the list is taken from the HtmlNamespaceExtension extension to the application (which can be modified). By default, it’s equivalent to:

xi:include=http://www.w3.org/2001/XInclude /bfo:=http://bfo.com/ns/publisher

This maps the <xi:include> to the XInclude namespace, and any attributes beginning with bfo: to the BFO Publisher namespace.

auto

either the xmlns or common approach will be used, depending on which one is encountered first while parsing the document.

list of name=uri

a space-separated list of mappings from tags or attributes to a particular namespace. Attributes are prefixed by a slash "/". Children of any elements using this method that have the same prefix will inherit the same namespace. Some examples:

  • include="http://www.w3.org/2001/XInclude" will cause the <include> element in HTML to be mapped to the XInclude namespace

  • xi:include="http://www.w3.org/2001/XInclude" will do the same for the <xi:include> element. The prefix before the colon will be dropped, and the SAX parser will receive an element with a local name of include.

  • /bfo:xslt="http://bfo.com/ns/publisher" will map the attribute bfo:xslt on any element to the BFO Publisher namespace. As with the example above, the prefix will be dropped.

  • xi:include/bfo:xslt="http://bfo.com/ns/publisher" will map the attribute bfo:xslt on only the xi:include element to the BFO Publisher namespace.

  • /bfo:="http://bfo.com/ns/publisher" will map any attribute beginning with bfo: on any namespace to the BFO Publisher namespace.
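To make the name=uri syntax concrete, here is a small Java sketch that splits such a list into its parts. It models only the syntax shown in the examples above and says nothing about how BFO Publisher applies the mappings internally; the NamespaceMappings and Mapping names are hypothetical, and accepting the URI with or without surrounding quotes is an assumption based on the two styles used in this section.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for a space-separated list of name=uri mappings. */
final class NamespaceMappings {
    /** One parsed entry; element or attribute is null when that part is not given. */
    static final class Mapping {
        final String element;
        final String attribute;
        final String uri;

        Mapping(String element, String attribute, String uri) {
            this.element = element;
            this.attribute = attribute;
            this.uri = uri;
        }

        /** True for forms such as "/bfo:" that cover every attribute with that prefix. */
        boolean isAttributePrefix() {
            return attribute != null && attribute.endsWith(":");
        }
    }

    static List<Mapping> parse(String value) {
        List<Mapping> out = new ArrayList<>();
        for (String item : value.trim().split("\\s+")) {
            int eq = item.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected name=uri but got: " + item);
            }
            String key = item.substring(0, eq);
            // Strip optional surrounding double quotes from the URI.
            String uri = item.substring(eq + 1).replaceAll("^\"|\"$", "");
            // A slash separates an optional element name from an attribute name.
            int slash = key.indexOf('/');
            String element = slash < 0 ? key : (slash == 0 ? null : key.substring(0, slash));
            String attribute = slash < 0 ? null : key.substring(slash + 1);
            out.add(new Mapping(element, attribute, uri));
        }
        return out;
    }

    public static void main(String[] args) {
        String common = "xi:include=http://www.w3.org/2001/XInclude /bfo:=http://bfo.com/ns/publisher";
        for (Mapping m : parse(common)) {
            System.out.println(m.element + " / " + m.attribute + " -> " + m.uri);
        }
    }
}
```

Only the key (the part before the first "=") is searched for a slash, so the slashes inside the namespace URI do not confuse the split.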

Once mapped, the additional namespaces can be used in HTML. For example, to use the XInclude syntax:

<html>
 <head>
  <meta name="bfo-ext-html-namespace" content="xmlns">
 </head>
 <body>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" file="include.html"></xi:include>
 </body>
</html>

or, using the namespace mapping

<html>
 <head>
  <meta name="bfo-ext-html-namespace" content="include=http://www.w3.org/2001/XInclude">
 </head>
 <body>
  <include file="include.html"></include>
 </body>
</html>

Note the use of closing tags is required - although we can customize the namespaces in HTML, we cannot customize the parsing process. Self-closing tags are not part of the HTML syntax, and as the HTML parser is tolerant of unclosed tags a failure to close any custom element like this will result in incorrect output.

Environment variables

BFO Publisher is almost entirely configured with environment variables. These are not OS environment variables - the term comes from CSS (see https://drafts.csswg.org/css-env/). In CSS environment variables look like env(name), and as with var(--name) they can be used anywhere in CSS. But as they are normally fixed and only a few are currently defined, they’re not that useful in normal CSS.

However in BFO Publisher we make heavy use of them. First, we define a large number of environment variables and let the user define custom ones as well. More importantly, as well as being usable in CSS we use them to configure the file conversion: choosing between PDF or PDF/A output; defining the current media properties for Media Queries; whether the old or new processing model for display:run-in is used; whether the content property applies only to pseudo-nodes (as specified) or to regular nodes, and many, many more options.

But let’s start with the basics. First, BFO Publisher sets some variables automatically:

Name

bfo-location

Value

always set to the current URL of the file being processed

Name

bfo-format

Value

always set to the shorthand-type of the output being generated - currently "pdf" or "svg"

And both of these can be used in the CSS, just like any other variables:

<html>
 <head>
  <style>
  @page {
    margin: 40mm;
    @top-center {
       content: env(bfo-location);
    }
  }
  </style>
 </head>
 <body>
  ...
 </body>
</html>

This will automatically add the URL of the file to the page margin.

To set your own environment variable, you can choose from one of several options (as the current CSS specification does not provide a way to do this, they’re all custom).

  1. You can set them using a <meta> tag in HTML - any name keys that are not already defined will set the corresponding environment value. So for example:

    <html>
     <head>
      <meta name="author" content="John Smith"/>
      <meta name="first-property" content="my-value"/>
      <meta name="second-property" content="my-value"/>
     </head>
     <body>
      ...
     </body>
    </html>

    This will set the first-property and second-property environment variables, but as the author name is already defined as a meta tag in HTML, it will not set an environment variable.

  2. You can set them using CSS with the special @bfo env at-rule:

    @bfo env {
        first-property: "my value";
        second-property: "my value";
    }

    This is particularly useful because unlike <meta>, CSS stylesheets can be included with <link> or @import. This lets you create a standard set of environment variables in a CSS file and simply import it into all your documents.

  3. They can be set when the conversion is started, via the API, command line or through the Web Service.

Configuring BFO Publisher

Almost every configuration that can be applied to a file conversion in BFO Publisher can be configured by setting the appropriate environment variables. Throughout this documentation they’ll be listed just like the two above, in a box with "Environment Variable" at the top (for example, bfo-pdf-profile).

Like other at-rules in CSS, they are processed in the normal CSS priority: rules in user stylesheets will override author stylesheets, which will override user-agent stylesheets, and if two rules have the same hierarchy, a later value will override an earlier one.

Most importantly, environment variables only apply going forward. They will not be applied to any rules already parsed - they’re the CSS equivalent of global constants, so they should be set as near the top of the document as possible. This mostly matters when environment variables are used in Media Queries, see that section for details on how this is done.

PDF Output

The primary output format of BFO Publisher is PDF, which it generates by building on the BFO PDF Library, on the market since 2001. As a result there are a very large number of options that can be set to control exactly how the PDF is created.

PDF/A, PDF/UA and PDF/X

PDF has a number of predefined profiles which impose restrictions on the format in the name of compatibility. Broadly they are:

  • PDF/A (ISO19005) - a subset intended for long-term archiving

  • PDF/X (ISO15930) - a subset intended for blind exchange of documents for print

  • PDF/UA (ISO14289) - a subset intended to improve the experience for accessibility. Think of it as the PDF version of https://www.w3.org/TR/WCAG21/

They all overlap to a degree - it is possible to create a PDF that is compliant with all three. PDF/A and PDF/X need to be self-describing, so all fonts and colorspaces are embedded, and PDF/UA requires the PDF to be tagged, and places restrictions on the HTML input - for example, images need an alt attribute, tables must be correctly formed with headers etc.

PDF/A Quickstart

For the vast majority of documents, to save as PDF/A all you need to do is this in a stylesheet included by the document:

@bfo env {
    bfo-pdf-profile: "PDF/A-3a";
}

or if you prefer meta tags to CSS:

<meta name="bfo-pdf-profile" content="PDF/A-3a" />

Chinese, Japanese or Korean text in the generic font-families serif, sans-serif or monospace needs to be changed to an embedded font, and if you use device-cmyk color, make sure you add a @color-profile device-cmyk rule (see Anchoring device-cmyk). But for PDF/A, that’s all there is to think about.

Choosing the profile simply involves setting an environment variable, either externally or from within the document by way of CSS or <meta> tags.

Name

bfo-pdf-profile-base

Value

<profile name>

Description

Adds a named PDF profile to the document profile. May be set more than once.

Name

bfo-pdf-profile-require

Value

<feature name>

Description

Sets a named PDF profile feature to required. May be set more than once.

Name

bfo-pdf-profile-deny

Value

<feature name>

Description

Sets a named PDF profile feature to deny. May be set more than once.

Name

bfo-pdf-profile-ignore

Value

<feature name>

Description

Sets a named PDF profile feature to neither require nor deny. May be set more than once.

Name

bfo-pdf-profile

Value

[ <profile name> | +<feature name> | -<feature name> ] +

Description

A shortcut property which allows multiple PDF profiles and features to be set at once. Overrides any previous profile settings. Values are separated with spaces, with feature names prefixed with "+" or "-". A plus makes a feature required (or ignored if it was previously denied), and a "-" makes a feature denied (or ignored if it was previously required).

More than one Profile Name from the following list can be specified, although they must not be incompatible (it cannot be both PDF/A-2 and PDF/A-3, for example). The values are compared lower case and ignoring anything other than letters or digits, so you don’t have to worry about punctuation.
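The comparison rule for profile names (lower case, ignoring anything other than letters or digits) can be sketched as follows. This is a minimal illustration of the rule as stated, not the library's implementation; the ProfileNames class and its methods are hypothetical names.

```java
/** Hypothetical illustration of the documented profile-name comparison rule. */
final class ProfileNames {
    /** Lower-cases the name and drops everything except letters and digits. */
    static String normalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    /** Two spellings name the same profile if their normalized forms are equal. */
    static boolean sameProfile(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(normalize("PDF/A-3a"));
        System.out.println(sameProfile("PDF/UA-1", "pdf ua 1"));
    }
}
```

Under this rule "PDF/A-3a", "pdfa3a" and "PDF A 3A" are all the same profile, while "PDF/A-2a" and "PDF/A-3a" remain distinct because the digit differs.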

Profile Names

PDF/A-1b

PDF/A-1b is the oldest subset of PDF/A and also the most restrictive, disallowing any form of transparency or compositing. So bitmap images with an alpha channel, the use of CSS mix-blend-mode, opacity or any colors that aren’t 100% opaque will cause an error. Unless you have a good reason to target PDF/A-1, we do not recommend it.

PDF/A-1a

Identical to PDF/A-1b except that the PDF is tagged.

PDF/A-2u

PDF/A-2u relaxes the restrictions of PDF/A-1, most notably by allowing transparency. The "u" conformance asserts that all text has a Unicode value, which BFO Publisher does anyway. PDF/A-2 disallows all file attachments unless they are PDF/A-1 or PDF/A-2 files.

PDF/A-2b

This is PDF/A-2u without the "Unicode text" assertion, so there’s no need to use this profile.

PDF/A-2a

Identical to PDF/A-2u except the PDF is tagged.

PDF/A-3u

PDF/A-3u is identical to PDF/A-2u but allows any files to be attached.

PDF/A-3b

PDF/A-3u without the "Unicode text" assertion, so again there’s no need to use this profile.

PDF/A-3a

Identical to PDF/A-3u except the PDF is tagged.

PDF/A-4

PDF/A-4 targets PDF 2.0, so allows some of the more modern features of PDF to be used. It disallows all file attachments unless they are PDF/A-1, PDF/A-2 or PDF/A-4 files.

PDF/A-4f

PDF/A-4f is identical to PDF/A-4 but allows any files to be attached.

PDF/A-4e

PDF/A-4e is identical to PDF/A-4f except it allows video, audio and 3D objects to be embedded, which BFO Publisher does with the <object>, <video> and <audio> tags.

PDF/UA-1

PDF/UA-1 is described below.

PDF/X-1a

PDF/X-1 output is described below. PDF/X-1a is very old and not recommended.

PDF/X-3

PDF/X-3 output is described below. PDF/X-3 is very old and not recommended.

PDF/X-4

PDF/X-4 output is described below. It can be combined with PDF/A-2 and PDF/A-3.

If that’s all a bit overwhelming, this table should help you choose which PDF/A version to target based on your requirements.

Profile    Transparency  Multimedia  Attachments  Tags  PDF Version
PDF/A-1b   -             -           -            -     1.4
PDF/A-1a   -             -           -            yes   1.4
PDF/A-2u   yes           -           strict       -     1.7
PDF/A-2a   yes           -           strict       yes   1.7
PDF/A-3u   yes           -           any          -     1.7
PDF/A-3a   yes           -           any          yes   1.7
PDF/A-4    yes           -           strict       -     2.0
PDF/A-4f   yes           -           any          -     2.0
PDF/A-4e   yes           yes         any          -     2.0

Feature Names are the names of individual PDF profile features in the PDF Library. A predefined profile such as PDF/A-2 is nothing more than a set of features that are required or denied. While there will rarely be a need to alter these directly, being able to alter individual features allows the profiles to be customized if required.

By way of example, here are two equivalent ways of creating a PDF that complies with both PDF/A-3a and PDF/UA-1, but is also uncompressed for easy debugging.

<head>
 <meta name="bfo-pdf-profile" content="PDF/A-3a PDF/UA-1 -RegularCompression">
</head>

<head>
 <meta name="bfo-pdf-profile-base" content="PDF/A-3a">
 <meta name="bfo-pdf-profile-base" content="PDF/UA-1">
 <meta name="bfo-pdf-profile-deny" content="RegularCompression">
</head>
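The relationship between the shortcut and the individual variables in the two examples above can be sketched in Java. This is a simplified model rather than the real implementation: the ProfileShortcut class is hypothetical, and the sketch deliberately ignores the rule that "+" or "-" turns a previously denied or required feature into an ignored one, since that depends on the contents of the profiles themselves.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical expansion of a bfo-pdf-profile value into the individual variables. */
final class ProfileShortcut {
    /** Returns one "variable=value" entry per token, in the order given. */
    static List<String> expand(String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty or blank value yields no settings
            }
            if (token.startsWith("+")) {
                out.add("bfo-pdf-profile-require=" + token.substring(1));
            } else if (token.startsWith("-")) {
                out.add("bfo-pdf-profile-deny=" + token.substring(1));
            } else {
                // No prefix: the token is a profile name such as PDF/A-3a
                out.add("bfo-pdf-profile-base=" + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String setting : expand("PDF/A-3a PDF/UA-1 -RegularCompression")) {
            System.out.println(setting);
        }
    }
}
```

Only a leading "+" or "-" marks a feature, so the hyphen inside a profile name such as PDF/A-3a is left alone.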

One place you may need to interact directly with profile features is in CSS. The features, like all environment values, can be extracted in CSS with the env() function, and these values can be queried in @media rules to allow different rules to apply when a particular feature is required or denied.

The user-agent stylesheets use this approach to choose which font to embed, or whether to embed PDF 1.x or 2.x tags. Here are some examples for how this might work (see below for an explanation of the -bfo-pdf-tag property).

Finally, the PDF/A, PDF/X and PDF/UA profiles all introduce the concept of an output-intent to PDF - the color-space of the device the PDF is intended for.

Name

bfo-pdf-profile-intent

Value

<url> | <dashed-ident> | device-cmyk | srgb

Description

Sets the intended viewing conditions for the PDF. By default this is sRGB, but it can be set to whatever the device-cmyk profile is by using that keyword, to the name of a @color-profile rule that links to an ICC profile, or to the URL of that color profile.

PDF/UA

HTML and XML documents are structured using tags, but PDF is primarily a page description language so in general has no need for structure. But when consuming a PDF with some form of accessibility technology, such as a screen reader, braille display, or simply removing a background behind text for readability, having an XML-like tag structure is crucial.

These accessibility technologies are commonly abbreviated as "AT", and a PDF with this structure is called a Tagged PDF.

PDF/UA (ISO14289) is a particular profile of Tagged PDF designed to work with AT by placing a number of requirements on how the tags are used. The current version is PDF/UA-1 and targets PDF 1.7, so it’s commonly (but not necessarily) combined with PDF/A-2a and PDF/A-3a. PDF/UA-2 will apply to PDF 2.0 and PDF/A-4, but is under development.

Generating a PDF/UA document does not require an in depth knowledge of PDF tags, but it does mean ensuring the source document is created in a certain way. Many of the rules are the same as for HTML documents conforming to WCAG. They’re listed here:

  • The PDF must have a <title> set (or some equivalent that sets the dc:title metadata).

  • The PDF must have a valid language set on the root element, eg with <html lang="NNN">.

  • Any <img> elements must have an alt attribute describing the image.

  • Any <figure> elements must have a <figcaption> element describing the image.

  • Any <svg> elements must have a <desc> element describing the SVG.

  • Every <td> element must map to at least one <th> in the same table.

  • Fonts must be embedded, as they are for PDF/A.

  • Form fields must have a <label> or be labelled by way of aria-labelledby or similar. Digital signatures must have a title parameter to do the same.

  • Appropriate tags must be used - <p> for paragraphs, <hn> for headers and so on.

  • Also be aware that many other requirements for accessibility, such as ensuring suitable contrast between text and background color, are not verified by any PDF/UA checker. These so-called "human" tests need to be verified by visual inspection; if you’re generating accessible documents, you should be aware of these requirements.

A mapping from a <td> to <th> ensures that every non-header cell in a table has a header describing it. For simple tables, say where the first row or first column is entirely <th> elements, this can be derived automatically but for more complex tables this may need to be specified explicitly. As with WCAG, this can be done in one of two ways:

  • A <th> can have scope set to row or col to inform BFO Publisher that the header applies to all <td> cells in the same row or column.

  • A <td> can have a headers attribute set to the ID (or IDs) of the header cells that describe it.

So long as one of these techniques can be used to map every <td> in a table to a corresponding <th>, this condition will be met for PDF/UA.

To enable PDF/UA output, set the bfo-pdf-profile environment variable described in the previous section:

<meta name="bfo-pdf-profile" content="PDF/UA-1" />

or use one of the other methods of setting this variable.

There is a lot of overlap between PDF/UA and PDF/A, with PDF/UA the more demanding of the two specifications. If you’re going to generate PDF/UA files, consider making them PDF/A compliant as well: with very few exceptions there is zero downside to doing this, and you’ll get an objectively "better" PDF as a result.

Tagged output

Even without the relatively strict requirements of PDF/UA, Tagged PDF is useful for many PDF consumers.

  • It is more likely to work with reflow tools on devices such as phones

  • It is easier for tools that want to extract data or text from the PDF, such as search engines.

  • It allows metadata and attachments to be applied to sections of the PDF, rather than just the entire file.

While there is a slight cost in terms of file-size and performance, BFO recommend that tags are always enabled unless you’re certain that the generated PDF will never be accessed by anyone using accessibility tools.

Selecting a tagged profile like PDF/A-2a, PDF/A-3a or PDF/UA-1 will turn on tags automatically, but they can easily be turned on manually:

Name

bfo-pdf-tagged

Value

true | false

Description

Whether to enable PDF tags

Tagged output will use standard tag mappings from HTML, SVG and MathML. One significant difference between PDF 1.x and PDF 2.x is that PDF 2 allows namespaces on the tags. BFO Publisher makes full use of this, and if namespaces are allowed it should be possible to mostly reconstitute the original HTML structure from the PDF.

Customizing PDF Tags and Attributes

To adjust or modify the default tag mappings, or to include various optional attributes, we need to wade into the large number of custom CSS properties.

Name

-bfo-pdf-tag

Value

none | <tag> | <tag> map-to <tag> (PDF 1.x)
none | <namespace> "|" <tag> | <namespace> "|" <tag> map-to <namespace> "|" <tag> (PDF 2.x)

Applies to

any element or pseudo-element that is rendered in the output.

Inherited

no

Description

Sets the PDF "tag" that is applied to this element in the output PDF

Every element has a mapping defined in the user-agent stylesheet data/tags.css, so the default value none only applies to unknown elements. The syntax above is a bit dense and best shown with some examples - note that as PDF 2.x has namespaces and PDF 1.x does not, you need to choose the syntax that matches your PDF output.

@namespace pdf2 url("http://iso.org/pdf2/ssn");
@namespace mathml url("http://www.w3.org/1998/Math/MathML");
@namespace custom url("http://example.com/ns/custom");

.untagged       { -bfo-pdf-tag: none } (1)
span            { -bfo-pdf-tag: tag } (2)
li::marker      { -bfo-pdf-tag: "Lbl" } (3)
span            { -bfo-pdf-tag: tag map-to "Span" } (4)
body            { -bfo-pdf-tag: |tag map-to pdf2|"Part" } (5)
mathml|*        { -bfo-pdf-tag: |tag } (6)
#myselector     { -bfo-pdf-tag: custom|"weird" map-to pdf2|"Div" } (7)
div.heading     { -bfo-pdf-tag: "H1" } (8)
div.heading     { -bfo-pdf-tag: pdf2|"H1" } (9)
1 Any elements with class="untagged" will have no tags in the final PDF - the content of the element will appear to be be merged with its parent.
2 (PDF 1.x) Any <span> elements will be written to the PDF using a <span> tag (the value tag means repeat the original tag. Unless the PDF tag is identical to the source language tag, this is not the best approach. PDF tags are case sensitive, so the correct tag would be <Span>
3 (PDF 1.x) Any ::marker pseudo-elements on a list are mapped to the <Lbl> tag in PDF. This is one of the default rules taken from the user-agent stylesheet. As pseudo-elements do not have a tag, the tag value doesn’t apply here.
4 (PDF 1.x) Any <span> elements will be written to the PDF using a <span> tag, and then role-mapped to the <Span> tag. This is a standard rule taken from the user-agent stylesheet - it allows us to keep as much of the HTML semantics as possible when we write out the PDF, while keeping to the prescribed list of PDF tags. PDF 1.x role-mapping is quite limited, so we will use it where possible, falling back to direct use of the role-mapped tag where we can’t.
5 (PDF 2.x) Any <body> elements will be written as <body> in the element’s namespace, then role-mapped to Part in the pdf2 namespace (which is defined earlier in the CSS using the standard @namespace rule). This is a standard mapping taken from the user-agent stylesheet for PDF 2.x output. Role-mapping in PDF 2.x is more flexible than in PDF 1.x and allows us to keep the namespace URL and use multiple mappings. Elements should generally be role-mapped to the http://iso.org/pdf2/ssn namespace.
6 (PDF 2.x) Any elements in the MathML namespace will be written using the same tags and namespace. This is also a standard mapping taken from the user-agent stylesheet: MathML is a standard namespace in PDF 2.x, with the same tags, so this is valid output.
7 (PDF 2.x) To show the full syntax range for PDF 2.x, this would tag the single element matching that selector as <weird> in the custom namespace, role-mapping it to <Div> in the PDF2 namespace.
8 (PDF 1.x) While it’s rare that the default rules will need overriding, for situations where classes are used to augment the HTML syntax (for example, using <div class="heading"> instead of <h1>) then overriding the default tags is a good idea, to ensure the augmented syntax is reflected in the PDF.
9 (PDF 2.x) This is the PDF 2.x equivalent of (8).
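
For generic XML input the same syntax can map a custom vocabulary onto standard PDF tags. The following is a minimal sketch for PDF 1.x output; the element names chapter, para and pagebreak-marker are hypothetical and stand in for whatever your source format uses.

```css
/* Hypothetical XML vocabulary, PDF 1.x syntax */
chapter          { -bfo-pdf-tag: tag map-to "Sect" }  /* written as <chapter>, role-mapped to <Sect> */
para             { -bfo-pdf-tag: tag map-to "P" }     /* written as <para>, role-mapped to <P> */
pagebreak-marker { -bfo-pdf-tag: none }               /* no tag; content merges with the parent */
```

For PDF 2.x output the equivalent rules would use the namespaced form shown in (5) and (7).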

Setting -bfo-pdf-tag on an element will tag its content in the PDF, but we still need to determine what counts as content and what doesn’t: is a background-image part of the element or not? For this we have the -bfo-pdf-tag-include CSS property.

Name

-bfo-pdf-tag-include

Value

auto | [ background || shadow || if-empty ]

Applies to

any element or pseudo-element with -bfo-pdf-tag other than none

Inherited

yes

Description

Controls which visible aspects of an element are tagged as part of the tag

The default value of auto will attempt to do the right thing - it includes borders and content, but typically not the background of an element as part of its content. For elements where the background is semantically significant, setting this property to background will include it. shadow will include the rasterized box-shadow image. Finally, if-empty will insert the tag even if it has no content. This is not normally useful, but with elements like <td> it’s important to include them even if empty, to keep the table structure.
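
As a rough illustration of the three keywords, the rules below are a sketch only. The class names chart and callout are hypothetical, and the user-agent stylesheet may already apply a rule like the first one.

```css
td, th       { -bfo-pdf-tag-include: if-empty }           /* keep empty cells so the table structure survives */
figure.chart { -bfo-pdf-tag-include: background }          /* the background image is the meaningful content */
.callout     { -bfo-pdf-tag-include: background shadow }   /* tag the background and the box-shadow too */
```

Because the property is inherited, a value set on figure.chart also applies to its descendants unless they set it back to auto.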

Attributes

The tagged structure in PDF, like the tagged structure in XML, can also make use of attributes. There are a large number of these described in the PDF Reference - some (like id) are analogous to XML/HTML attributes, some (like text-decoration-color) are closer to CSS properties.

Name

-bfo-pdf-tag-nnn

Value

unset | copy | see table below for values and defaults

Applies to

any element or pseudo-element with -bfo-pdf-tag other than none

Inherited

no

Description

Sets the value of the attribute nnn on the PDF tag

In all cases, the universal CSS value unset can be used to disable the attribute, and the value copy can be used to request the value is derived from the nearest corresponding HTML or CSS property, if possible. Most attributes default to unset but some default to copy, as shown below. This distinction is arbitrary - we’ve tried to balance how useful the attribute is with how verbose the resulting document will be by including it, and there will be many situations where the context may change this decision.

span.highlight {
    background-color: yellow;
    -bfo-pdf-tag-background-color: copy;    /* it's significant; record it */
}
Custom CSS Property PDF Attribute Value

-bfo-pdf-tag-id

the ID for the PDF tag

<string> (default is copy)

-bfo-pdf-tag-class

the PDF tag class

<string>+ (default is copy)

-bfo-pdf-tag-title

the PDF tag title

<string> (default is copy)

-bfo-pdf-tag-placement

Layout:Placement

block inline before start end (default is copy)

-bfo-pdf-tag-writing-mode

Layout:WritingMode

lr-tb rl-tb tb-rl tb-lr lr-bt rl-bt bt-rl bt-lr

-bfo-pdf-tag-background-color

Layout:BackgroundColor

<color>

-bfo-pdf-tag-border-color

Layout:BorderColor

<color>{1,4}

-bfo-pdf-tag-border-style

Layout:BorderStyle

none hidden dashed solid double groove ridge inset outset {1,4}

-bfo-pdf-tag-border-thickness

Layout:BorderThickness

<length>{1,4}

-bfo-pdf-tag-color

Layout:Color

<color>

-bfo-pdf-tag-padding

Layout:Padding

<length>{1,4}

-bfo-pdf-tag-space-before

Layout:SpaceBefore

<length>

-bfo-pdf-tag-space-after

Layout:SpaceAfter

<length>

-bfo-pdf-tag-start-indent

Layout:StartIndent

<length>

-bfo-pdf-tag-end-indent

Layout:EndIndent

<length>

-bfo-pdf-tag-text-indent

Layout:TextIndent

<length>

-bfo-pdf-tag-text-align

Layout:TextAlign

start center end justify

-bfo-pdf-tag-width

Layout:Width

<length>

-bfo-pdf-tag-height

Layout:Height

<length>

-bfo-pdf-tag-block-align

Layout:BlockAlign

before middle after justify

-bfo-pdf-tag-inline-align

Layout:InlineAlign

start center end

-bfo-pdf-tag-tborder-style

Layout:TBorderStyle

none hidden dashed solid double groove ridge inset outset

-bfo-pdf-tag-tpadding

Layout:TPadding

<length>

-bfo-pdf-tag-line-height

Layout:LineHeight

<length>

-bfo-pdf-tag-baseline-shift

Layout:BaselineShift

<length>

-bfo-pdf-tag-text-decoration-type

Layout:TextDecorationType

none underline overline line-through (default is copy)

-bfo-pdf-tag-text-position

Layout:TextPosition

sup sub normal

-bfo-pdf-tag-text-decoration-color

Layout:TextDecorationColor

<color>

-bfo-pdf-tag-text-decoration-thickness

Layout:TextDecorationThickness

<length>

-bfo-pdf-tag-column-count

Layout:ColumnCount

<integer> (default is copy)

-bfo-pdf-tag-column-widths

Layout:ColumnWidths

<length>

-bfo-pdf-tag-column-gap

Layout:ColumnGap

<length>

-bfo-pdf-tag-glyph-orientation-vertical

Layout:GlyphOrientationVertical

<angle>

-bfo-pdf-tag-ruby-align

Layout:RubyAlign

start center end justify distribute

-bfo-pdf-tag-ruby-position

Layout:RubyPosition

before after warichu inline distribute

-bfo-pdf-tag-list-numbering

List:ListNumbering

none unordered description disc circle square ordered decimal upper-roman lower-roman upper-alpha lower-alpha (default is copy)

-bfo-pdf-tag-table-row-span

Table:RowSpan

<integer> (default is copy)

-bfo-pdf-tag-table-col-span

Table:ColSpan

<integer> (default is copy)

-bfo-pdf-tag-table-headers

Table:Headers

<string> (default is copy)

-bfo-pdf-tag-table-scope

Table:Scope

<string> (default is copy)

-bfo-pdf-tag-table-summary

Table:Summary

<string> (default is copy)

-bfo-pdf-tag-table-short

Table:Short

<string>

-bfo-pdf-tag-artifact-type

Artifact:Type

pagination layout page inline

-bfo-pdf-tag-artifact-subtype

Artifact:Subtype

header footer watermark pagenum bates linenum redaction <string>

-bfo-pdf-tag-nnn

nnn

<content-list>

1 The class attribute in PDF may in theory be used to inherit attribute values, but this is not a good fit for the way classes are used in CSS, so it is not used in the generated PDF.
2 The title of a PDF tag is, roughly, its descriptive text. The source for copy will vary depending on the tag and will often, but not always, be the title attribute of the HTML element.
3 The writing-mode in PDF is problematic, not only because it lists writing modes which are unused in any known language, but also because in PDF the value influences the order of the values in other properties such as -bfo-pdf-tag-border-style. It is here for completeness but we very strongly advise against setting it.
4 -bfo-pdf-tag-border-color and similar attributes can take from 1 to 4 values. In all cases, the values are specified in the CSS order - clockwise from the top, independent of writing mode. This is not the native PDF order, but in this context consistency with CSS is more important. We fix it up internally.
5 Any unrecognised attributes will be stored in the PDF as specified - the syntax is described in https://drafts.csswg.org/css-content-3/#typedef-content-content-list (although the leader() function is excluded). This would generally be used to copy custom attributes from the source file to the PDF - for example:
<html>
 <style>
  [data-custom] { -bfo-pdf-tag-custom: attr(data-custom); }
 </style>
 <body>
  <p data-custom="myvalue">Element will have a PDF attribute of "custom" set to "myvalue"</p>
 </body>
</html>
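
The copy and unset keywords and the four-value ordering from note 4 can be combined in one rule. This is a sketch under assumptions: the selector td.total is hypothetical, and the comments restate the behaviour described above.

```css
td.total {
    border-style: double solid solid solid;
    border-color: black gray gray gray;

    /* Derive the PDF attribute from the CSS border-color above */
    -bfo-pdf-tag-border-color: copy;

    /* One to four values, always in CSS order: top, right, bottom, left */
    -bfo-pdf-tag-border-style: double solid solid solid;

    /* Switch off an attribute that would otherwise default to copy */
    -bfo-pdf-tag-table-row-span: unset;
}
```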

Pronunciation assistance

Tagged PDF version 2.0 allows a Pronunciation Lexicon to be stored in the PDF and phonemes to be associated with a PDF tag. When generating a Tagged PDF 2.0 file, the data-ssml-phoneme-ph and data-ssml-phoneme-alphabet attributes defined in https://www.w3.org/TR/spoken-html/#data-ssml-phoneme will be used if specified (the shorthand data-ssml attribute is also supported).

The lexicon itself is referenced with <link rel="pronunciation">, an approach taken from the EPUB 3.0 standard.

Best practice for pronunciation hints on the internet is far from decided; SSML, ePub and PDF all have enough in common to make implementing support fairly simple, even though the SSML specification is currently a working draft and so liable to change. For now, here’s an example showing the current state.

<html>
 <head>
  <meta charset="utf-8">
  <meta name="bfo-pdf-profile" content="PDF/A-4"/>
  <meta name="bfo-pdf-tagged" content="true"/>
  <link rel="pronunciation" href="path/to/norfolk-lexicon.xml"/>
 </head>
 <body>
  <h1>Welcome to <span data-ssml-phoneme-ph="heɪzburrah">Happisburgh</span></h1>
  <p>You'll never leave!</p>
 </body>
</html>
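
The example above sets only the phoneme string. The fragment below sketches the other two forms mentioned: the explicit data-ssml-phoneme-alphabet attribute, and the shorthand data-ssml attribute. The JSON shape of the shorthand follows our reading of the spoken-html draft and should be treated as an assumption, and the IPA transcriptions are illustrative only.

```html
<p>
 Visit
 <span data-ssml-phoneme-ph="ˈheɪzbərə"
       data-ssml-phoneme-alphabet="ipa">Happisburgh</span>
 or
 <span data-ssml='{"phoneme":{"ph":"ˈwɪndəm","alphabet":"ipa"}}'>Wymondham</span>.
</p>
```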

Attachments

It’s possible to attach files to a PDF being generated by using a special <link> annotation. Some examples:

<link rel="attachment" href="path/file.pdf"/> (1)
<link rel="attachment/source" href="path/file.html" title="Source HTML"/> (2)
<link rel="attachment" name="file.zip" href="path/nnn" type="application/zip" /> (3)
1 The simplest way to add an attachment - just set rel="attachment"
2 It’s also possible to set the type of attachment, which is a concept specific to PDF 2.0 and PDF/A-3. Predefined types are source, data, alternative, supplement, encryptedpayload, formdata, schema or unspecified. PDF/A-3 requires a type, but we’ll set it to unspecified if it’s missing. The title attribute can be used to give a description to the link.
3 You can override the type of the file and use a (non-standard) name attribute to override the file name.

If the href attribute is a fragment URL, the element it refers to will be attached to the PDF as an XML document unless the element is a <style> element (in which case the type will be text/css, and only the content of the element will be attached) or a <script> element (in which case the type will be the value of the script’s type attribute, or text/javascript if unspecified, and only the content of the element will be attached). Fragment URL processing is new in version 1.3.

Finally, if the href attribute is # and the type attribute is text/css, all the stylesheets referenced by the document will be attached to the PDF. Special processing occurs here to ensure @import rules are expanded, CSS is converted to UTF-8, URLs are made absolute, and other steps which allow the CSS to be extracted from the PDF and usefully used where required, such as when deriving HTML from PDF. This functionality is new in version 1.3.
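
The fragment forms described above might be combined as follows. This is a sketch, not a definitive recipe: the ids figures and rawdata and the JSON content are hypothetical.

```html
<html>
 <head>
  <!-- Attach the element with id="figures" as an XML document -->
  <link rel="attachment" href="#figures"/>
  <!-- Attach only the content of the script; the type comes from its type attribute -->
  <link rel="attachment" href="#rawdata"/>
  <!-- Attach all the stylesheets referenced by the document -->
  <link rel="attachment" href="#" type="text/css"/>
  <script id="rawdata" type="application/json">{"total": 42}</script>
 </head>
 <body>
  <table id="figures"><tr><td>42</td></tr></table>
 </body>
</html>
```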

Encryption

The generated PDF can be password encrypted, or encrypted with public keys for specific recipients (although this has limited support in PDF viewers). As usual this is controlled by environment properties.

All forms of encryption are disallowed in PDF/A.

For standard password encryption the following properties apply.

Name

bfo-pdf-encrypt-password

Value

<string>

Description

The password to open the PDF. If unset and no public-key recipients are specified, no password will be used. However the PDF will still be encrypted if other encryption options such as bfo-pdf-encrypt-print are set.

Name

bfo-pdf-encrypt-admin-password

Value

<string>

Description

The password that’s required to open the PDF and change the encryption. If unset, the encryption cannot be changed.

Name

bfo-pdf-encrypt-cipher

Value

aes-256 | aes-128 | rc4-128 | rc4-40

Description

The cipher and bitlength. AES-256bit was introduced with Adobe Acrobat X and is widely supported, so there’s no reason to use an older one. It’s the default value if unset.

Name

bfo-pdf-encrypt-metadata

Value

no | yes

Description

Determine if the metadata is encrypted.

Name

bfo-pdf-encrypt-print

Value

yes | no | lowres

Description

Whether to allow printing. lowres requests lower-resolution printing, but this is likely to collapse to yes in many viewers.

Name

bfo-pdf-encrypt-change

Value

yes | annotations | forms | layout | no

Description

Whether to allow changes to the PDF. yes to allow all changes, annotations to allow form filling and annotations to be added, forms to allow form filling only, layout to allow pages to be rearranged and no for no changes allowed.

Name

bfo-pdf-encrypt-extract

Value

yes | no

Description

Whether to allow text to be extracted for non-accessibility purposes.

It is up to the viewing application to honour the print, change and extract flags. Do not presume that all will.

Here’s an example showing a fairly typical use case for password encryption - the PDF can be opened by anyone with the password secret, and once opened it can’t be printed:
<html>
 <head>
  <meta name="bfo-pdf-encrypt-password" content="secret">
  <meta name="bfo-pdf-encrypt-print" content="no">
 </head>
 ...
</html>
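
A fuller configuration, using only the properties listed above, might look like the sketch below. The password values are placeholders.

```html
<html>
 <head>
  <!-- Password needed to open the PDF -->
  <meta name="bfo-pdf-encrypt-password" content="reader-secret">
  <!-- Password needed to change the encryption settings -->
  <meta name="bfo-pdf-encrypt-admin-password" content="owner-secret">
  <!-- aes-256 is the default; stated here only for clarity -->
  <meta name="bfo-pdf-encrypt-cipher" content="aes-256">
  <!-- Request low-resolution printing, allow form filling only, disallow text extraction -->
  <meta name="bfo-pdf-encrypt-print" content="lowres">
  <meta name="bfo-pdf-encrypt-change" content="forms">
  <meta name="bfo-pdf-encrypt-extract" content="no">
 </head>
 ...
</html>
```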

Public key encryption uses a similar set of properties, but instead of bfo-pdf-encrypt-password and bfo-pdf-encrypt-admin-password there is bfo-pdf-encrypt-recipient.

Name

bfo-pdf-encrypt-recipient

Value

<url>

Description

The URL of a public key to use for encryption.

Each recipient has the print, change and extract rights set at the time the recipient is added - there may be more than one recipient, each with different access rights. The public key is an X.509 certificate which, as with Digital Signatures, can be a KeyStore (the URL may contain fragment parameters to select the key) or a PEM encoded X.509 certificate.

Here’s the above example, changed from password encryption to use a single public key for encryption. Anyone holding the matching private key will be able to open the PDF, but won’t be able to print it.

<html>
 <head>
  <meta name="bfo-pdf-encrypt-print" content="no">
  <meta name="bfo-pdf-encrypt-recipient" content="http://example.com/userx509.cer">
 </head>
 ...
</html>

A more complex example allows two students to view the PDF and make no changes, and one teacher who may edit the form and annotate it. To change things up, let’s assume all the X.509 certificates are in a single file - maybe a Java KeyStore or a single text file with multiple PEM encoded certificates, it doesn’t matter. We’ll use the cn fragment parameter to choose which entry in the key store we want, just as we can do for Digital Signatures.

<html>
 <head>
  <meta name="bfo-pdf-encrypt-change" content="no">
  <meta name="bfo-pdf-encrypt-recipient" content="http://example.com/keystore#cn=Harry">
  <meta name="bfo-pdf-encrypt-recipient" content="http://example.com/keystore#cn=Ron">
  <meta name="bfo-pdf-encrypt-change" content="yes">
  <meta name="bfo-pdf-encrypt-recipient" content="http://example.com/keystore#cn=Dumbledore">
 </head>
 ...
</html>

General Options

The remaining PDF specific environment variables set general PDF Options - this covers aspects like whether the PDF opens with the bookmarks window or the thumbnail window, whether it opens up in single page or one column mode, and so on.

Name

bfo-pdf-nnn

Value

<any>

Description

Any unrecognised property beginning with bfo-pdf- is treated as a PDF option, and passed directly to PDF.setOption() in the underlying PDF API.

Please consult the API method listed above for the full list. Here’s an example showing how to select the thumbnail panel when the PDF is opened, and display the pages as one long column.

<html>
 <head>
  <meta name="bfo-pdf-pagemode" content="UseThumbnails">
  <meta name="bfo-pdf-pagelayout" content="OneColumn">
 </head>
</html>

Layers

While PDF is usually a static document, the format does have some support for layers. We’re using this term to group two different concepts in PDF - annotations, which sit above the page and are largely independent, and optional content layers which are part of the page, but can be selectively turned on or off.

By default no layers are created, but an element can be assigned to a layer with -bfo-layer-type and various other -bfo-layer-nnn properties used to configure the layer.

This approach is also used to apply the special rules required to create hyperlinks and form fields, much like the appearance property has historically been used in browsers. We’re not documenting these aspects.

Name

-bfo-layer-type

Value

none | layer | stamp | note | file

Applies to

any element that forms a stacking context.

Inherited

no

Description

For values other than none, set the type of layer this element will be part of

Name

-bfo-layer-name

Value

<string>

Applies to

any element with -bfo-layer-type other than none

Inherited

no

Description

Set the name of the layer. Names should be unique across the document.

Name

-bfo-layer-visibility

Value

visible | hidden

Applies to

any element with -bfo-layer-type other than none

Inherited

no

Description

Controls when this element’s layer should be visible.

Name

-bfo-layer-print

Value

when-visible | always | none

Applies to

any element with -bfo-layer-type other than none

Inherited

no

Description

Controls when this element’s layer should be printed.

Name

-bfo-layer-export

Value

when-visible | always | none

Applies to

any element with -bfo-layer-type of layer

Inherited

no

Description

Controls when this element’s layer should be included when the element is exported to a bitmap format (

Name

-bfo-layer-lock

Value

none | no-edit | no-move | no-delete

Applies to

any element with -bfo-layer-type other than none

Inherited

no

Description

Controls how the layer can be changed after the PDF is created. no-move implies no-edit and no-delete implies both.

Name

-bfo-layer-author

Value

<string>

Applies to

any element with -bfo-layer-type other than none or layer

Inherited

no

Description

Sets the author of this annotation layer

Name

-bfo-layer-subject

Value

<string>

Applies to

any element with -bfo-layer-type other than none or layer

Inherited

no

Description

Sets the subject of this annotation layer

Name

-bfo-layer-title

Value

<string>

Applies to

any element with -bfo-layer-type other than none or layer

Inherited

no

Description

Sets the "Content" of this annotation layer - the descriptive text. Required for PDF/UA

Name

-bfo-layer-href

Value

<url>

Applies to

any element with -bfo-layer-type of file

Description

Sets the URL of the file to include.

Name

-bfo-layer-processing-step

Value

<string> which is defined in ISO 19593, eg "White", "Braille" or "Structural.Cutting"

Applies to

any element with -bfo-layer-type of layer

Inherited

no

Description

Associates the layer with an ISO 19593 Processing Step.

Optional Content Layers

Layers are mainly used by tools such as Adobe Illustrator when exporting to PDF, and PDF viewers other than Adobe Acrobat are unlikely to have strong support. Here’s a fairly contrived example that creates three layers, an outer one and two inner ones:

<html>
 <style>
  .layer {
    -bfo-layer-type: layer;
    -bfo-layer-name: attr(data-name);
  }
  .layer[data-hidden] {
    -bfo-layer-visibility: hidden;
  }
 </style>
 <div class="layer" data-name="Outer Layer">
  Some content
  <div class="layer" data-name="en">
   An example
  </div>
  <div class="layer" data-name="es" data-hidden lang="es">
   Un ejemplo
  </div>
 </div>
</html>

Open the generated PDF in Adobe Acrobat and you would see the image on the left. Toggle the layers and you would see the image on the right.

The generated PDF with default layers
The same PDF with layers toggled

Just because this isn’t a very imaginative demo doesn’t make this feature uninteresting. Using layers with SVG or absolute positioning would allow relatively complex graphics to be displayed in PDF, such as architectural drawings showing different levels. Help text could be exposed when a user clicks a button, or documents could be toggled between multiple languages.

Optional Content layers like this are valid in PDF/A-2 or later, so we think this opens up some quite interesting possibilities for layout.
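
The visibility, print and export properties are independent, so a layer can behave differently on screen, on paper and in bitmap exports. The following is a sketch; the class name print-only and the layer name are hypothetical, and the comments reflect our reading of the value names listed above.

```html
<html>
 <style>
  .print-only {
    -bfo-layer-type: layer;
    -bfo-layer-name: "Print notice";
    -bfo-layer-visibility: hidden;   /* hidden when the PDF is first opened */
    -bfo-layer-print: always;        /* printed whether visible or not */
    -bfo-layer-export: none;         /* left out of bitmap exports */
  }
 </style>
 <body>
  <p>Main content</p>
  <div class="print-only">Printed copies are uncontrolled.</div>
 </body>
</html>
```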

Stamp Layers, and other annotations

By changing the -bfo-layer-type to stamp, the element’s content will be inserted into a PDF stamp annotation. Annotations in PDF conceptually sit above the page; the same mechanism is used for text highlights and notes, but also for form fields and hyperlinks. As the name implies, they’re intended for stamping content above the page content, like "Top Secret" or "Draft".

Unlike the optional content layers described previously, stamps cannot be nested - any use of -bfo-layer-type inside a stamp will be ignored - and the stamp annotation can be moved, resized or deleted by default in PDF viewers, although this can be disabled with the -bfo-layer-lock property (not every PDF viewer will respect this).

In the above example, change -bfo-layer-type to stamp in the stylesheet. A single stamp is created containing all the text, which can be moved, resized or deleted in most PDF viewers.

Adding -bfo-layer-lock: no-delete will prevent the stamp from being modified, in theory. In practice at least the macOS PDF viewer does not honour this flag.

For a very different result, changing -bfo-layer-type to note will put the content (or at least, the text content) into a popup note. This takes the content out-of-flow, so the content of the note will not affect the layout of surrounding elements, just as if it were absolutely positioned.
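
Putting the stamp properties together, a locked "Draft" stamp might be written as below. This is a sketch under assumptions: the class name, strings and positioning are hypothetical, and z-index is set only so that the element forms a stacking context.

```html
<html>
 <style>
  .stamp {
    position: absolute;
    top: 2cm;
    right: 2cm;
    z-index: 1;                          /* form a stacking context */
    -bfo-layer-type: stamp;
    -bfo-layer-name: "draft-stamp";
    -bfo-layer-title: "Draft copy, not for distribution";
    -bfo-layer-author: "Editorial team";
    -bfo-layer-lock: no-delete;          /* implies no-edit and no-move */
  }
 </style>
 <body>
  <div class="stamp">DRAFT</div>
  <p>Document content</p>
 </body>
</html>
```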

Javascript

Any <script> elements in the input document normally define scripts to run immediately; they’re applied to the document itself. However the PDF format also supports JavaScript, which runs under a completely different environment: there is no DOM, for example, and instead of dealing with elements there are structures representing pages, form fields and other PDF constructs.

Currently BFO Publisher does not support JavaScript which is run on the input document, but scripts which are intended for embedding in the PDF are very much supported. To mark a script as destined for the PDF, set the type attribute to the value bfo/pdfscript

<html>
 <script>
  // Any Javascript here is assumed to be run in the HTML document context;
  // BFO Publisher does not support this, so it will be ignored
 </script>
 <script type="bfo/pdfscript">
  // Any Javascript here will be embedded directly into the final PDF.
  // If the output format is not PDF, it will be ignored.
 </script>
 <!-- Any Javascript in "external.js" will be treated as above. -->
 <script type="bfo/pdfscript" src="external.js"></script>
</html>

The use of a custom Media Type will also prevent this content from being processed if the document is loaded in a web-browser.

PDF URLs

BFO Publisher supports some custom hyperlink formats which allow <a> elements to perform actions within the PDF viewer - although support for these depends on the viewing environment. Actions such as goto() or FirstPage are fairly widely supported, but many others will require Adobe Acrobat or a PDF viewer with a similar level of support.

pdf:show(#fragment)

if the PDF has been created with Optional Content Layers, this action will show the specified layer.

pdf:hide(#fragment)

if the PDF has been created with Optional Content Layers, this action will hide the specified layer.

pdf:toggle(#fragment)

if the PDF has been created with Optional Content Layers, this action will show the specified layer if it’s hidden, and hide it if it’s visible.

<html>
 <div id="foo" style="-bfo-layer-type: layer">
  Optional Content Layer
 </div>
 <a href="pdf:show(#foo)">Show layer</a>
 <a href="pdf:hide(#foo)">Hide layer</a>
</html>
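
The toggle form needs only a single link. A minimal sketch, assuming a layer that starts hidden; the id help and the layer name are hypothetical:

```html
<html>
 <div id="help" style="-bfo-layer-type: layer; -bfo-layer-name: 'Help'; -bfo-layer-visibility: hidden">
  Fill in every field before printing.
 </div>
 <a href="pdf:toggle(#help)">Show or hide help</a>
</html>
```
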
pdf:submit(url, method)

This will submit the PDF "AcroForm" to the specified URL. A detailed discussion of PDF forms is beyond the scope of this document, but as a PDF only has one document-wide form, the process for submitting it is a bit simpler than with HTML. url is required, but method is optional (it defaults to post if unspecified). Valid values are:

  • post to submit the form by HTTP POST

  • xml to submit the form using the XFDF syntax of XML

  • xml+annotations is as xml, but includes the annotations

  • pdf submits the entire PDF as the value, which includes the current values of the form fields as well as everything else.

pdf:reset

This will reset the PDF "AcroForm" fields to their initial values.
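
As a sketch of how the two form actions might be used together: the target URL is a placeholder, and writing the URL unquoted inside the parentheses is an assumption based on the syntax summary above.

```html
<html>
 <body>
  <form>
   <label>Email <input type="text" name="email"/></label>
  </form>
  <!-- method omitted: defaults to post -->
  <a href="pdf:submit(https://example.com/submit)">Send by HTTP POST</a>
  <!-- submit using the XFDF syntax of XML -->
  <a href="pdf:submit(https://example.com/submit, xml)">Send as XFDF</a>
  <a href="pdf:reset">Clear the form</a>
 </body>
</html>
```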

pdf:goto(page, mode)

This is an internal hyperlink to a specific page in the PDF. While normally it’s easier to link to an internal element, eg <a href="#chapter3">, this syntax can be used where a particular page is required. page is a number starting at 0 for the first page, mode is optional but can be fit-width, fit-height or fit to scale the page to fit the window in the viewer.

pdf:FirstPage, pdf:LastPage, pdf:NextPage, pdf:PrevPage

This syntax runs a particular PDF action. The listed four actions are universal, but other actions may be defined for a particular PDF viewer. For example, Adobe define the following values which are supposed to match the corresponding actions in Adobe Acrobat:

GeneralPrefs Quit Scan SaveAs Close GeneralInfo Print Spelling Find FindSearch GoToPage GoBack GoForward FirstPage PrevPage NextPage LastPage SinglePage OneColumn TwoPages TwoColumns ZoomTo ActualSize FitPage FitWidth FitHeight FitVisible CollectionPreview CollectionHome CollectionDetails CollectionShowRoot ShowHideArticles ShowHideFileAttachment ShowHideBookmarks ShowHideOptCont ShowHideModelTree ShowHideThumbnails ShowHideSignatures FullScreenMode FindCurrentBookmark BookmarkShowLocation ZoomViewIn ZoomViewOut HandMenuItem ZoomDragMenuItem

Probably the most common use will be to print, for example <a href="pdf:Print">Print this file</a>
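
The four universal actions and pdf:goto() are enough for a simple navigation bar. A minimal sketch; remember that page numbers start at 0, so the last link targets the third page:

```html
<nav>
 <a href="pdf:FirstPage">First</a>
 <a href="pdf:PrevPage">Previous</a>
 <a href="pdf:NextPage">Next</a>
 <a href="pdf:LastPage">Last</a>
 <a href="pdf:goto(2, fit-width)">Page 3, fitted to width</a>
</nav>
```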

pdf:media-NNN(#fragment, …​)

This syntax is used for control of embedded media, such as video. The syntax is deliberately loose as the current (2022) process for embedding media in PDF is in a state of flux. In general however, the NNN component of the URL will be an action, such as play or pause, and the first parameter will be the fragment-only URL of the multimedia element to control. Depending on the action, other parameters may be specified.

For example, to play an audio clip embedded in the PDF with the legacy annotation type:

<html>
 <audio id="foo" annotation="legacy" src="audio.mp3"></audio>
 <a href="pdf:media-play(#foo)">Play Audio</a>
</html>

Bookmarks

Bookmarks (also called Outlines in PDF) are a semi-standard part of CSS defined in css-gcpm-3.

Name

bookmark-level

Value

none | <integer> | copy | increment

Applies to

any block-level element

Inherited

no

Description

defines the level of the bookmark, with the highest level being 1 (negative and zero values are invalid). none means no bookmark is generated. BFO Publisher adds the custom values copy which copies the bookmark level from the nearest ancestor that set bookmark-level to a value other than none, and increment which does the same but adds one to the level.

Name

bookmark-label

Value

<content-list> | none

Applies to

any block-level element

Inherited

no

Description

defines the value to display in the bookmark. The <content-list> term is defined in css-gcpm-3 but broadly includes a sequence of strings, attr(), string(), counter() or content() to form the label. For example, h1 { bookmark-label: content(text) } would set the bookmark to the text value of the <h1>. The value none means no bookmark is generated.

Name

bookmark-state

Value

open | closed

Inherited

no

Applies to

any block-level element

Description

the initial state of the bookmark, open or closed.

Name

-bfo-bookmark-target

Value

auto | <url>

Inherited

no

Applies to

any block-level element

Description

values other than auto will cause any generated bookmark to link to the specified url instead of the current element.

The canonical examples of CSS bookmarks from the specification tend to look like this:

<html>
 <style>
  h1 { bookmark-level: 1; bookmark-label: content(text); }
  h2 { bookmark-level: 2; bookmark-label: content(text); }
  /* and so on for h3, h4, h5, and h6 */
 </style>
 <body>
  <h1>Header 1</h1>
  <p>Paragraph</p>
  <h2>Header 2</h2>
  <p>Paragraph</p>
 </body>
</html>

which presupposes that the depth of each heading is known in advance; the choice of <h1>, <h2> etc. defines the depth. BFO Publisher adds the copy and increment values for when the depth is not known and the document is structured with nesting. The example below will have the same bookmark levels as the example above.

<html>
 <style>
  section { bookmark-level: increment; bookmark-label: none }
  .heading { bookmark-level: copy; bookmark-label: content(text) }
 </style>
 <body>
  <section>
   <div class="heading">Heading 1</div>
   <p>Paragraph</p>
   <section>
    <div class="heading">Heading 2</div>
    <p>Paragraph</p>
   </section>
  </section>
 </body>
</html>
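
The remaining bookmark properties can be layered on top of either approach. The sketch below builds a label from a counter, collapses chapter bookmarks initially, and redirects one bookmark to another element. The counter name chapter, the class see-appendix and the id appendix are hypothetical.

```html
<html>
 <style>
  h1 {
    counter-increment: chapter;
    bookmark-level: 1;
    bookmark-label: "Chapter " counter(chapter) ": " content(text);
    bookmark-state: closed;   /* bookmark starts collapsed */
  }
  h2 { bookmark-level: 2; bookmark-label: content(text); }
  /* This bookmark links to #appendix instead of the heading itself */
  h1.see-appendix { -bfo-bookmark-target: url("#appendix"); }
 </style>
 <body>
  <h1>Introduction</h1>
  <h2>Scope</h2>
  <h1 class="see-appendix">Reference tables</h1>
  <div id="appendix">Appendix content</div>
 </body>
</html>
```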

SVG Output

BFO Publisher can produce SVG 2 output. Since SVG describes essentially a single image, whereas Publisher is oriented towards paged media with multiple pages, there are several strategies you can use to bridge the paradigm gap between the two.

Essentially there are two independent properties used to control SVG output: encapsulation and pagination.

Encapsulation

Encapsulation describes the strategy used to handle URLs in the SVG output. As a purely Web format, SVG allows specifying arbitrary URLs for links to the resources contained in the document, notably external bitmap images. Arbitrary URLs may also have been specified in the source XML and CSS to load content resources from. However, we usually want the resulting document to be self-contained and not to depend on external content defined elsewhere on the Web, where it may be changed, moved or deleted, or become unavailable due to network failure.

The simplest strategy to overcome this problem is simply to embed all external loaded resources into the target output SVG. For binary content such as bitmap images, this can be done using a data: URL.

The advantage of encapsulation is that the resulting SVG has no dependencies and will always look the same as the source document did at the time that it was processed by BFO Publisher. However, this comes at a cost. Images and other binary content must be Base64 encoded and embedded into the output file, which may result in very large files. Also, the data: URL strategy does not support defining some content once and referencing it in multiple separate places, so there is a potential for massive duplication of binary assets.

Name

bfo-svg-encapsulation

Value

all | public | none

The encapsulation configuration parameter specifies the encapsulation strategy.

all

All URLs will be encapsulated. URL references to external bitmap images and fonts will be converted to data: URLs encoding the resource content.

public

URLs in the source will be encapsulated only if they refer to local resources (files and HTTP resources on localhost). Public HTTP resources on external servers will be referenced as-is and trusted to remain unchanged and constant.

none

No URLs in the source will be encapsulated. If there are any references to resources on the local filesystem, they will only be retrievable and displayed if the SVG viewer is run on the same machine with the same permissions.

Pagination

Pagination describes the strategy used to handle paged media in the SVG output.

Name

bfo-svg-pagination

Value

fragment | stacked | slideshow | individual

The pagination configuration parameter specifies the pagination strategy.

stacked

Pagination is essentially ignored. The resulting SVG document will be one image with all the pages stacked on top of one another. This is a good strategy for source documents that were never really intended to be paginated in the first place, such as web pages.

slideshow

Pagination is performed and a separate <svg> element is created inside the target SVG document for each page. The first "page" <svg> element is marked as "selected" and will be visible in the browser; other pages will not be visible. The output includes some JavaScript that lets the user navigate between pages via the keyboard, and which can also be called programmatically - it changes which page is selected and thus what CSS display value it has. This strategy is suitable for scenarios where the document is intended to be viewed in a browser by a user.

fragment

Pagination is performed and a separate <svg> element is created inside the target SVG document for each page. The first "page" <svg> element is marked as "selected" and will be visible in the browser; other pages will not be visible. The output includes some JavaScript for navigating between pages, but navigation must be triggered programmatically.

individual

One SVG file is created for each page.

Paged Media

An important function of BFO Publisher is its ability to handle CSS paged media instructions and thus format content correctly for paged layout, including page margins, headers and footers, page breaks, and generated content such as page numbers.

Publisher uses absolutely standard CSS syntax to handle pages. However many CSS developers writing for the Web may not be familiar with these instructions, so we’ll explain them a little here.

@page rule

The @page CSS at-rule specifies the overall page layout, including its size and margins. Other CSS properties can be specified, either for all paged content, or only for pages that match certain criteria. The @page rule is defined at https://www.w3.org/TR/css-page-3/#at-page-rule

Page selectors can specify a page identifier or page pseudo-classes. The following page selectors are defined:

The :nth() selectors are defined at https://www.w3.org/TR/css-gcpm-3/#document-page-selectors, which also defines the term page group.

:first

Matches only the first page of the document

:left

Matches pages that would be on the left-hand side if the pages were compiled into a book.

:right

Matches pages that would be on the right-hand side if the pages were compiled into a book.

name

A named page selector matches the corresponding value of the page property on an element - it matches any page in that page group.

:nth(x)

The :nth() selector can be used to match a numbered page in the document - the argument x takes the form An + B. Page numbers start at 1 so @page :nth(1) is identical to @page :first, @page :nth(2n) to @page :left and @page :nth(2n + 1) to @page :right (or the other way around, depending on writing direction).

:nth(x of y)

The :nth(x of y) syntax matches page numbers the same way as the previous selector, except that matching is done within a named page group rather than in the document overall. A page group is established by setting the page property on an element. So @page :nth(1 of body) matches the first page of the body page group.

:blank

Matches blank pages that result from a break-before or break-after value of left, right, recto or verso. Will not match pages that are incidentally blank.

Whether a page is left or right depends on the writing direction of the document. If the writing direction is left-to-right the first page will match :right; if it is right-to-left the first page will match :left.
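
As a brief sketch of how these selectors combine, the rules below give the first page a larger top margin, mirror the inner and outer margins on facing pages, and suppress the running header on forced blank pages. The sizes and the header text are illustrative values, not defaults.

```css
/* Base layout for every page; all values are illustrative */
@page {
  size: A4;
  margin: 2cm;
  @top-center {
    content: "Annual Report";
  }
}
/* Extra space above the title on the first page */
@page :first {
  margin-top: 6cm;
}
/* Wider inner margin for binding: the inner edge of a left page is on its right */
@page :left {
  margin-left: 2cm;
  margin-right: 3cm;
}
@page :right {
  margin-left: 3cm;
  margin-right: 2cm;
}
/* No running header on pages left blank by a forced break */
@page :blank {
  @top-center {
    content: none;
  }
}
```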

The page property forces the element to be on a page from the named page group, triggering a page break if necessary.

Name

page

Value

auto | <string>

Applies to

any block element

Inherited

no

Definition

https://www.w3.org/TR/css-page-3/#using-named-pages

Named page groups allow the document to be broken up into sections, allowing different headers or footers to be applied to different sections of the document - for example, a cover page may have no page number, the introduction page numbers in lower-case roman, while the main body of the document uses arabic numerals. Here’s how you could do that:

<html>
 <style>
  @page introduction {
    @top-center {
      content: counter(page, lower-roman);
    }
  }
  @page body {
    @top-center {
      content: counter(page);
    }
  }
  header {
    counter-reset: page 1;
    page: introduction;
  }
  main {
    counter-reset: page 1;
    page: body;
  }
 </style>

 <div id="frontcover"> ... </div>
 <header> Introduction here </header>
 <main> Main body of document here </main>
</html>

This example resets the page counter to 1 to ensure the header and main sections of the PDF both start at one. The page property causes a new page group to begin, which forces a page break and allows the :nth(n of m) selector to be applied. For example, to select the second page of the body section you could use the selector :nth(2 of body).

Be careful: body:nth(2) means any page which is in the body page group and is page 2 in the document, while :nth(2 of body) means any page which is the second page in the body page group.
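
To make the distinction concrete, here is a minimal sketch of the two selectors side by side, using the body page group from the example above. The margin value is arbitrary.

```css
/* Matches page 2 of the whole document, but only if that page
   belongs to the "body" page group */
@page body:nth(2) {
  margin-top: 4cm;
}
/* Matches the second page of the "body" page group,
   whatever its position in the document */
@page :nth(2 of body) {
  margin-top: 4cm;
}
```

In the earlier example the body page group is preceded by the front cover and the introduction, so its pages can never be page 2 of the document: the first rule would match nothing, while the second would still match.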

Page Margin Boxes

Page margin boxes can be used within the @page rule to further subdivide the page into separate regions, such as headers and footers. These are all CSS at-rules with their own blocks. Conceptually, the page area is divided into nine boxes. The page content is displayed in the center (horizontally) and middle (vertically). The corner areas are then referred to as "corners", and the edge areas (above, below, and to either side of the content) are further subdivided into 3 boxes representing their start, central, and end areas.

The page margin box types are as follows:

@top-left-corner

Specifies rules applied for the top left corner area.

@top-left

Specifies rules applied for the top left area.

@top-center

Specifies rules applied for the top center area.

@top-right

Specifies rules applied for the top right area.

@top-right-corner

Specifies rules applied for the top right corner area.

@left-top

Specifies rules applied for the left top area.

@left-middle

Specifies rules applied for the left middle area.

@left-bottom

Specifies rules applied for the left bottom area.

@right-top

Specifies rules applied for the right top area.

@right-middle

Specifies rules applied for the right middle area.

@right-bottom

Specifies rules applied for the right bottom area.

@bottom-left-corner

Specifies rules applied for the bottom left corner area.

@bottom-left

Specifies rules applied for the bottom left area.

@bottom-center

Specifies rules applied for the bottom center area.

@bottom-right

Specifies rules applied for the bottom right area.

@bottom-right-corner

Specifies rules applied for the bottom right corner area.
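
As an illustration, a typical running header and footer might use three of these boxes as sketched below. The text, font sizes and margins are example values only. Margin boxes are laid out inside the page margin, so the margin must be large enough to hold them.

```css
@page {
  size: A4;
  margin: 25mm 20mm;
  /* Document title at the start of the top edge */
  @top-left {
    content: "User Guide";
    font-size: 9pt;
  }
  /* Page number at the end of the top edge */
  @top-right {
    content: counter(page);
    font-size: 9pt;
  }
  /* A rule and a centred notice along the bottom edge */
  @bottom-center {
    content: "Confidential";
    border-top: 0.5pt solid black;
    font-size: 8pt;
  }
}
```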

Name

size

Value

auto | <length>{1,2} | [ <page-size> || [portrait | landscape] ]

Applies to

the @page rule

Definition

https://www.w3.org/TR/css-page-3/#page-size-prop

The size property specifies the target size and orientation of the page box. It is only relevant inside an @page block. In PDF terminology it sets the trim box (unless the bleed property is negative, in which case it sets the bleed box).

auto

Default values are used. In BFO Publisher this means the size is taken from the media size, which is set with the bfo-media-size environment variable. See Media Queries.

landscape

The page content is displayed in landscape mode: the longest side is horizontal.

portrait

The page content is displayed in portrait mode: the longest side is vertical. This is the default.

<length>

The first value specifies the width of the page and the second its height. If only one value is provided, it specifies both width and height.

<page-size>

This is a keyword representing one of the pre-defined page sizes: A3, A4, A5, B4, B5, JIS-B4, JIS-B5, letter, legal, or ledger.

New sizes are added with an environment variable, e.g. set bfo-sys-page-size-a6 to 105mm 148mm.

Name

marks

Value

none | [ crop || cross ]

Applies to

the @page rule

Definition

https://www.w3.org/TR/css-page-3/#marks

Name

bleed

Value

auto | <length>

Applies to

the @page rule

Definition

https://www.w3.org/TR/css-page-3/#bleed

The marks property determines whether printer marks are added to the page. With the bleed property, printer marks show the printer where to trim the output.

The bleed property specifies the extent of the page bleed area outside the page box defined by size. Bleed is typically set when the page contains backgrounds that are supposed to extend to the edge. Any solid, gradient, or tiled image backgrounds that extend to the edge of the area defined by size will be automatically extended into the bleed box.

When specified as a positive length it determines how far outward, in each direction, the bleed box extends past the page box. If specified as a negative length then the size property is assumed to specify the bleed box, and the bleed property defines the trim box with respect to that.

The default value of auto evaluates to 0 unless the page has crop marks (see the marks CSS property), in which case it’s 6pt.
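
A minimal sketch of a print-ready setup follows: crop and cross marks, a 3mm bleed, and a named cover page whose background runs to the page edge and so is extended into the bleed area. The cover page name, the .cover class and the colour are hypothetical choices for illustration.

```css
@page {
  size: A5;
  margin: 15mm;
  marks: crop cross;   /* printer marks show where to trim */
  bleed: 3mm;          /* bleed box extends 3mm beyond the page box on every side */
}
/* Hypothetical cover page with a background that reaches the page edge */
@page cover {
  margin: 0;
  background: navy;
}
.cover {
  page: cover;
}
```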

Name

-bfo-trim

Value

auto | <length>{1,4} | to <page-size> <landscape>?

Applies to

the @page rule

The -bfo-trim property specifies the distance between the page box (i.e. the box defined by the size property) and the edge of the physical page, known as the media box in PDF. As with bleed, the default value of auto evaluates to 0 unless the page has crop marks, in which case it’s bleed + 6pt.

This property is a shorthand for the properties -bfo-trim-top, -bfo-trim-right, -bfo-trim-bottom, and -bfo-trim-left, which define the distances for each individual side in the same way as padding or margin.

It’s also possible to specify a value like -bfo-trim: to A4, which would set the four trim sizes to expand the page to a media box of A4, centering the content. This syntax is the word to followed by any value that would be valid for the size property (so to auto is valid).

A diagram may help to visualize this.

The various page boxes
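
As a short sketch of the two forms of -bfo-trim described above, the first rule centres an A5 page on an A4 sheet, and the second sets the distances explicitly, with the values expanded in the same way as margin. The proof page name is hypothetical.

```css
/* A5 page box centred on an A4 media box */
@page {
  size: A5;
  marks: crop;
  -bfo-trim: to A4;
}
/* Explicit distances: 20mm top and bottom, 10mm left and right */
@page proof {
  size: A5;
  -bfo-trim: 20mm 10mm;
}
```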

Here are some more example @page rules. Note the default page margin is 0, so if you’re setting any margin content, you should set margin too.

@page {
  size: A4;
  margin: 2cm;
}
@page {
  size: letter;
  margin: 0.5in;
  @bottom-center {
    content: "Page " counter(page) " of " counter(pages);
  }
}
@page :first {
  margin-left: 1.5in;
}

If you’d prefer to set the media box of the page directly and derive the trim box from that, that’s possible with some custom properties and calc() functions:

:root {
  --media-width: 210mm;
  --media-height: 297mm;
  --trim-top: 25mm;
  --trim-right: 25mm;
  --trim-bottom: 25mm;
  --trim-left: 25mm;
}
@page {
    -bfo-trim: var(--trim-top) var(--trim-right) var(--trim-bottom) var(--trim-left);
    size: calc(var(--media-width) - var(--trim-left) - var(--trim-right))
          calc(var(--media-height) - var(--trim-top) - var(--trim-bottom));
}

Page breaks

An important function of pagination is controlling where page breaking occurs. Page breaks are controlled via the following CSS properties:

Name

break-before

Value

auto | always | avoid | left | right | recto | verso | page | avoid-page

Applies to

block-level boxes, grid items, flex items, table row groups, table rows

Inherited

no

Definition

https://www.w3.org/TR/css-break-3/#break-between

Name

break-after

Value

auto | always | avoid | left | right | recto | verso | page | avoid-page

Applies to

block-level boxes, grid items, flex items, table row groups, table rows

Inherited

no

Definition

https://www.w3.org/TR/css-break-3/#break-between

Name

break-inside

Value

auto | always | avoid | avoid-page | avoid-column

Applies to

all elements except inline-level boxes, internal ruby boxes, table column boxes, table column group boxes, absolutely-positioned boxes

Inherited

no

Definition

https://www.w3.org/TR/css-break-3/#break-within

The break-before, break-after, and break-inside properties specify how page breaks should occur before, after, and inside a box respectively.

auto

Allows (but does not force) a break.

always

Forces a break.

avoid

Avoids a break if possible.

left

Forces one or two breaks, so that the box after the break is placed on a "left" page.

right

Forces one or two breaks, so that the box after the break is placed on a "right" page.

recto

Forces one or two breaks, so that the box after the break is placed on a recto page.

verso

Forces one or two breaks, so that the box after the break is placed on a verso page.

page

Forces a page break.

avoid-page

Avoids a page break if possible.

Note that where page breaks occur is part of CSS fragmentation more generally; thus, some property values are agnostic as to whether they are page breaks or column breaks in multi-column output, whereas some are specifically relevant to page breaks.
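
The rules below sketch a common combination of these properties. The section.chapter selector is a hypothetical class used for illustration.

```css
/* Start every chapter on a new right-hand page */
section.chapter {
  break-before: right;
}
/* Keep a heading on the same page as the content that follows it */
h2, h3 {
  break-after: avoid;
}
/* Do not split figures or table rows across pages */
figure, tr {
  break-inside: avoid;
}
```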

Additionally, there are two properties that can be used to avoid breaking inside paragraphs that would result in too few lines in the paragraph before or after the break.

Name

orphans

Value

<integer>

Applies to

block containers that establish an inline formatting context

Inherited

yes

Definition

https://www.w3.org/TR/css-break-3/#widows-orphans

The orphans property specifies the minimum number of lines in a block container that must be shown at the bottom of the page. It must be a positive integer; the default value is 2.

Name

widows

Value

<integer>

Applies to

block containers that establish an inline formatting context

Inherited

yes

Definition

https://www.w3.org/TR/css-break-3/#widows-orphans

The widows property specifies the minimum number of lines in a block container that must be shown at the top of the page. It must be a positive integer; the default value is 2.
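
For example, to require at least three lines on each side of any break inside a paragraph, both properties can be raised from their default of 2:

```css
p {
  orphans: 3;  /* at least 3 lines at the bottom of the page before a break */
  widows: 3;   /* at least 3 lines at the top of the page after a break */
}
```

With these values a five-line paragraph cannot be split at all, since any break would leave fewer than three lines on one side.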

Media Queries

BFO Publisher fully supports media queries as defined in https://www.w3.org/TR/mediaqueries-4/, plus a few useful extensions. The default media type is considered to be print to an A4 page size, but as all media properties can be overridden with environment variables this is easy to change. Pre-defined environment variables correspond to each non-derived media feature defined in https://www.w3.org/TR/mediaqueries-4/: width matches bfo-media-width, resolution matches bfo-media-resolution and so on. Most of these will never need changing, but some that may are listed here.

Name

bfo-media

Value

print | screen

Description

Sets the media type that matches in a Media Query.

Name

bfo-media-size

Value

<page-size> <landscape>? | <length> <length>?

Description

A shortcut property that sets bfo-media-width and bfo-media-height - this sets the size of the media that matches in a Media Query. This property works exactly the same as the size CSS property: a single size can be set, or two lengths that set a width and height. The default value is A4.

Name

bfo-media-monochrome

Value

<integer>

Description

Determines whether the output device is considered to be monochrome or not. The default of 0 indicates a color device.
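
As a sketch of how these variables interact with ordinary media queries, the rules below are selected purely by the values described above. The style declarations themselves are illustrative.

```css
/* Applies when bfo-media is print, which is the default */
@media print {
  a { color: inherit; text-decoration: none; }
}
/* Applies when bfo-media is set to screen */
@media screen {
  a { color: blue; text-decoration: underline; }
}
/* Applies when bfo-media-monochrome is greater than 0 */
@media (monochrome) {
  h1 { color: black; }
}
/* Applies when the media width is at least 250mm, e.g. bfo-media-size set to A3
   but not the default A4 */
@media (min-width: 250mm) {
  body { column-count: 2; }
}
```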

Media Queries and environment variables

BFO Publisher also accepts a non-standard syntax that allows environment variables to be used in Media Queries. This can be used to match a Media Query against any environment variables in use for the document:

@media (env(bfo-format): pdf) {
   /* This block applies if the "bfo-format" environment variable is set to "pdf" */
   :root {
       key: value;
   }
}

As BFO Publisher output is entirely controlled by environment variables, this is extremely useful. For example the bfo-format variable is always set to the output format being generated; in the example above, the rule will only be applied for PDF output, not SVG output.

For PDF output, environment variables are set for each OutputProfile Feature that is required or denied. This terminology comes from the PDF Library, and it’s how we make the distinction between regular PDF output and PDF/A, PDF/X, PDF/UA and so on.

For example, when generating tagged PDF output for PDF/UA, we need to make a distinction between tags suitable for PDF/UA-1 (based on PDF 1.x) and PDF/UA-2 (based on PDF 2.x). One of our user-agent stylesheets does this with a rule like this one:

@media (env(bfo-pdf-profile-feature-PDFVersion20): deny) {
    /* Rules for PDF 1.x are set here */
}
@media (env(bfo-pdf-profile-feature-PDFVersion20): require) {
    /* Rules for PDF 2.x are set here */
}

We can use the same approach to select different fonts for serif and sans-serif, choosing between unembedded or embedded versions depending on what is allowed with the standard PDF output. For a full list of features refer to the PDF Library API Documentation, and for some examples see the data/tags.css stylesheet in the Jar.

Finally, as this syntax is unique to BFO Publisher it can be used to create rules that are only applied in this user-agent, and will never be applied when rendering in browsers or in other CSS to PDF engines.

@media (env(bfo-format)) {
    /* Rules here will only ever be applied in BFO Publisher */
}

Counters and generated content

CSS has a fairly sophisticated method of generating content - text which is inserted into the document but is not part of the input DOM. Typical uses are list counters, but any number of custom counters can be specified. BFO Publisher has full support for all types of generated content in the specification.

Unfortunately the specifications governing this are scattered: CSS Content 3, CSS Lists 3 (for counters), CSS Page 3 (for counters as they apply to pages) and CSS GCPM 3; although the latter is quite ancient, it has been used as a basis by most print layout engines.

Page counters

The user-agent stylesheet defines the following standard rules:

:root {
    counter-reset: page 0;
}
@page {
    counter-increment: page;
}

Together they define a page counter which will increment every time a new page is created (the first page is created after the root box - https://github.com/w3c/csswg-drafts/issues/4759 - so has the value 1). page is like any other counter - it can be reset or have its increment altered at any point. For example, it might be useful to have the header of the document use roman numerals and then reset the counter to 1 for the main body of the document:

@page header {
  @top-center {
    content: counter(page, lower-roman);
  }
}
@page main {
  @top-center {
    content: counter(page, decimal);
  }
}
#header {
  page: header;
  counter-reset: page 1;
}
#main {
  page: main;
  counter-reset: page 1;
}

Other page-based counters can be used too - to ensure they’re in scope (see https://www.w3.org/TR/css-lists-3/#nested-counters) just be sure to add a counter-reset: page 0 your-custom-counter 0 rule to the :root element.
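
As a minimal sketch, the rules below add a second page-based counter that keeps counting through the whole document even when page is reset for a new section. The counter name sheet is hypothetical. Note that the @page rule restates page in counter-increment, because this declaration would otherwise replace the one in the user-agent stylesheet shown above.

```css
/* Put both counters in scope at the root */
:root {
  counter-reset: page 0 sheet 0;
}
@page {
  /* Increment both: this declaration overrides the user-agent rule */
  counter-increment: page sheet;
  @bottom-left {
    content: counter(page);
  }
  @bottom-right {
    content: "Sheet " counter(sheet);
  }
}
```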

All counter formats described in https://www.w3.org/TR/css-counter-styles-3/ are supported, as is the @counter-style rule. BFO Publisher supports one additional counter-style which is particularly useful for page-based counters: auto, which means use whichever format was most recently used for formatting this counter.

This value is particularly useful for target-counter(), where we’ve made it the default format (the specification says it should be decimal). target-counter() is used to reference the value of a counter in another element, typically (but not necessarily) the page number. A default value of auto means that the formatting used in the target element will be used. In the example above, using target-counter(url(#header), page) will get the page counter from the node with id="header", and format it using whatever format was most recently used at that point (which is lower-roman, as that’s the format in the margins for that page). target-counter(url(#main), page) will get a counter format of decimal instead.
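
A typical use is a table of contents. The sketch below assumes a hypothetical nav.toc element containing links to headings. The first rule relies on the auto default described above, and the second passes an explicit format as the third argument.

```css
/* Entry text, a dot leader, then the page number of the link target,
   formatted however the target page is numbered */
nav.toc a::after {
  content: leader(".") target-counter(attr(href url), page);
}
/* Force decimal regardless of the numbering used on the target page */
nav.toc a.plain::after {
  content: leader(".") target-counter(attr(href url), page, decimal);
}
```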

BFO Publisher also defines three special additional counters which cannot be altered by counter-increment or counter-reset.

  • -bfo-page-close - the value of the page counter at the close of the element.

  • -bfo-page-physical - the physical page in the document, with the first page starting at 1.

  • -bfo-page-physical-close - the value of the -bfo-page-physical counter at the close of the element.

These three counters exist mostly for Index generation, but could be used in the document itself. For example, if a section ran over several pages it’s possible to list them all as a range.

<style>
  .pageref[href^='#']::after {
    content: target-counter(attr(href url), page) "-" target-counter(attr(href url), -bfo-page-close);
  }
</style>
<table id="bigtable">
 ... many pages of table ...
</table>
...
<p>
 See the tables on <a class="pageref" href="#bigtable">pages </a> <!-- eg output is "pages 54-95" -->
</p>

PDF viewers (such as Adobe Acrobat) also need to display the current page number outside the document itself, for example in the page dialogs. This value is known as the page label, and it is controlled with the -bfo-page-label property.

Name

-bfo-page-label

Value

normal | none | <content-list>

Applies to

the @page descriptor

Description

sets how the page number is formatted in the user-interface of a PDF viewer

The -bfo-page-label value determines how the page is formatted in external PDF viewers. The default value of normal will set the page label to match the page counter, as it’s displayed in the margins. So for example, if the page is displayed formatted as lower-roman, the value will be stored as lower-roman. If the page number is not displayed anywhere in the margins of the current page, the formatting from the last preceding page containing the page number is used, or (if no preceding page contains the page number), the page is formatted as decimal.

A value of none removes any special formatting, which causes the page number to be formatted as decimal. Other strings are theoretically possible, but the PDF viewer may not honour the value.

The default value normal is likely to be required for compliance with PDF/UA-2.
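
As a hedged sketch of the other values, the rules below set explicit labels on two named pages. The page names cover and appendix are hypothetical, and as noted above a viewer may not honour arbitrary strings.

```css
/* Fixed label for the cover page */
@page cover {
  -bfo-page-label: "Cover";
}
/* Prefixed label built from the page counter, e.g. "A-1", "A-2" */
@page appendix {
  -bfo-page-label: "A-" counter(page);
}
/* Plain decimal label regardless of how the page number is displayed */
@page plain {
  -bfo-page-label: none;
}
```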

Lookahead mode

Layout begins with the first page and ends with the last, which leads to a problem when using the pages counter, or the target-counter for an element that hasn’t yet been laid out. How do we know what the value will be if it hasn’t been computed yet?

There are two approaches to work around this. First, we can allocate a fixed amount of space on the page for the counter, then come back and fill it in when the document is complete. Or, we can do a trial layout of the document to establish the value of the counter.

Which approach is used is controlled with the bfo-lookahead environment variable.

Name

bfo-lookahead

Value

true | false

Description

determines whether BFO Publisher should do a trial layout pass if required.

The default value is true, which means whenever a future value is required, the layout will continue until that value is known, and will then be repeated with the correct value inserted into the DOM. This gives the best results but can require two passes - if the pages counter is used, we have to continue until the end of the document. BFO Publisher will avoid this where possible - for example, an entry in a table-of-contents preceded by a leader() does not require two passes, as the leader can be resized.

The alternative is to allocate a fixed block of space on the page for the number. This allows rendering to run in a single pass, which will be a significant win for documents with thousands of pages. The size of the gap is unlikely to be correct, but careful layout can minimise the effect of this - for example, ensuring there is no content to the right of a pages counter and that it’s left or center aligned on the page means the gap will not be noticed.
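
A footer laid out along those lines might look like the sketch below: the pages counter is the last item in a left-aligned margin box, so any spare space reserved for it falls at the end of the line where it will not be noticed. The margin value is illustrative, and bfo-lookahead itself is set as an environment variable rather than in the stylesheet.

```css
@page {
  margin: 2cm;
  @bottom-left {
    /* The pages counter comes last, with nothing to its right */
    content: "Page " counter(page) " of " counter(pages);
    text-align: left;
  }
}
```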

XInclude

BFO Publisher supports the xi:include element as defined at https://www.w3.org/TR/xinclude/, as well as all the semi-official extensions from the 1.1 working group note at https://www.w3.org/TR/xinclude-11/.

XInclude is fairly well established in XML, although it has not been adopted into HTML. BFO Publisher allows it to be used with either syntax (see HTML Namespace Extensions). It’s typically used to embed XML content into a larger document, but can include plain text as well.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude">
 <body>
  <xi:include href="chapter1.xml" />
  <xi:include href="chapter2.xml" />
 </body>
</html>

The XML resource will be included exactly as if it were part of the source document. Resources can be included more than once, and can themselves include further documents to make a tree of content. Loops will be safely detected and ignored, and in all cases the content will be streamed rather than imported into memory - important, as it’s possible to make very large documents this way.

BFO Publisher includes HTML content by converting it to XML, although the HTML parsing rules mean the included content will always be wrapped in an <html> and <body> element if they are missing from the source files.

As well as the enhancements and attributes defined in the 1.1 working group note, we’ve added a few enhancements of our own to the syntax.

  • The <xi:include> element can take an optional bfo:xslt attribute, which defines one (or more) XSLT stylesheets to apply to the included content (see XSLT).

  • Any other namespaced attributes (excluding bfo:xslt) will be inherited by the included element (or elements), as specified in XInclude 1.1. This includes the bfo:scope attribute which will cause the included elements to act as scoping roots (see Style Scoping).

  • BFO Publisher supports the use of any Media Type in the parse attribute as specified in XInclude 1.1, and defaults to the native Media Type of the resource (note that this is a departure from the specification, which requires a default type of xml; in almost all cases the effect is the same). text/html content will be converted to XML, and types other than text/plain, text/html or an XML type will include the object as an attachment.

  • The XInclude specification disallows URL fragments, requiring a fragid attribute instead. We don’t understand the problem this is trying to solve; BFO Publisher allows both by default, with the fragid taking precedence.

  • The xpointer attribute is supported, although the more powerful bfo:xslt attribute is a better choice for complex work. Schemes xpath and element are supported as well as a bare word matching an element id (note XPath comes as part of XSLT, so requires an implementation in the classpath). When combined with the bfo:xslt attribute the XSLT transformation is applied first, then the xpointer used to select an element from the transformed result. But as XSLT can do everything XPointer can, mixing the two is neither necessary nor advised.

    A particularly useful xpointer example is including only children of the element being included.

    <!-- main file -->
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude">
     <body>
      <table>
       <xi:include href="subtable1.xml" xpointer="xpath(/*/*)" />
      </table>
     </body>
    </html>
    
    <!-- subtable1.xml -->
    <ignored-outer-element xmlns="http://www.w3.org/1999/xhtml">
     <tr><td>...</td></tr>
     <tr><td>...</td></tr>
     <tr><td>...</td></tr>
    </ignored-outer-element>

  • Don’t forget to namespace the included elements! The above example shows how to do it - the included elements will not inherit the namespaces of the parent context: if you forget to specify the namespace, the content is unlikely to be styled correctly.

Resources within an included document are resolved against the URL of the included resource, as specified. This leads to a potential ambiguity when the links are fragment links only, for example <a href="#toc">Contents</a>. We resolve this by first searching the included document for that fragment, falling back to a wider search of the whole document if not found.

This ambiguity can be avoided by specifying the path: if document.xht includes chapter.xht, a link within either of those files to chapter.xht#heading or document.xht#heading is unambiguous. Relative links to an element in a resource included more than once will refer to the first instance.

Style Scoping

When including XML or HTML the content becomes part of the parent document, so it will inherit the styles from its container, and any styles defined in the included file will also apply globally. As this behaviour isn’t always desired, BFO Publisher defines a bfo:scope attribute to control this - while it can be used anywhere, it’s most useful with xi:include.

Name

bfo:scope

Value

normal | inherit | isolate | all

Applies to

the value all applies to <style> and <link>, other values apply to all elements. When set on <xi:include> it applies to the elements that are included.

The values have the following meanings:

normal

No special scoping rules. This is the default.

inherit

Any element with bfo:scope="inherit" will act as a Scoping Root. Any stylesheets defined within this element will be discarded when the element is closed. Any stylesheets applied outside this element will continue to apply to the element and its descendants.

isolate

The same as inherit, except that stylesheets applied outside this element will not apply to the element and its descendants. The subtree is isolated from the rest of the document.

all

When bfo:scope="all" is set on a <style> or <link> element, any style rules defined by that element will be applied even to elements that have set bfo:scope="isolate". The ability to "punch through" this isolation is useful for style rules like @font-face which are intended for all content in the document regardless of isolation.

Here are some examples showing how this works. For clarity we’ve set bfo:scope directly on elements in the document, which is perfectly valid, but we’d expect this to be mostly used on <xi:include>.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:bfo="https://bfo.com/ns/publisher">
 <style>
  p { font-weight: bold }
 </style>
 <body>
  <div id="d1" bfo:scope="inherit">
   <style>
    p { font-style: italic }
   </style>
   <p id="p1">This text is bold and italic</p>
  </div>
  <p id="p2">This text is bold only</p>
 </body>
</html>

In this example, the paragraph p1 has the bold style from the first stylesheet, and the italic style from the second stylesheet. When the div d1 closes, the bfo:scope="inherit" means the inner stylesheet is discarded, and paragraph p2 is no longer italic.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:bfo="https://bfo.com/ns/publisher">
 <style>
  p { font-weight: bold }
 </style>
 <body>
  <div id="d1" bfo:scope="isolate">
   <style>
    p { font-style: italic }
   </style>
   <p id="p1">This text is italic only</p>
  </div>
  <p id="p2">This text is bold only</p>
 </body>
</html>

In this example, the paragraph p1 is only styled with the second stylesheet: the bfo:scope="isolate" prevents stylesheets defined outside the div d1 from applying within it. As with the previous example, the second stylesheet is discarded when d1 is closed.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:bfo="https://bfo.com/ns/publisher">
 <link rel="stylesheet" href="myfont.css" bfo:scope="all" />
 <style>
  p { font-weight: bold; font-family: MyFont }
 </style>
 <body>
  <div id="d1" bfo:scope="isolate">
   <style>
    p { font-style: italic; font-family: MyFont }
   </style>
   <p id="p1">This text is italic "MyFont"</p>
  </div>
  <p id="p2">This text is bold "MyFont"</p>
 </body>
</html>

Sometimes style rules should cross into isolated scopes - for example, when they define resources like fonts. The addition of bfo:scope="all" to the stylesheet at myfont.css will ensure that any rules it defines (such as @font-face) are applied even to isolated scopes.
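
To make the three examples above concrete, the following Java sketch models the rules as a stack of scopes. It only illustrates the behaviour described in this section and is not how BFO Publisher is implemented: the ScopeModel class and its methods are hypothetical names, and stylesheets are reduced to plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of bfo:scope: a stack of scopes, each holding the stylesheets declared inside it. */
class ScopeModel {

  private static final class Scope {
    final boolean isolate;                          // true for bfo:scope="isolate"
    final List<String> sheets = new ArrayList<>();  // every stylesheet declared in this scope
    final List<String> all = new ArrayList<>();     // the subset declared with bfo:scope="all"
    Scope(boolean isolate) { this.isolate = isolate; }
  }

  private final Deque<Scope> stack = new ArrayDeque<>();

  ScopeModel() {
    stack.push(new Scope(false));                   // the document is the outermost scope
  }

  /** An element with bfo:scope="inherit" (isolate=false) or "isolate" (isolate=true) opens. */
  void open(boolean isolate) { stack.push(new Scope(isolate)); }

  /** The scoping root closes: every stylesheet declared inside it is discarded. */
  void close() { stack.pop(); }

  /** A style or link element is seen; punchThrough is true when it carries bfo:scope="all". */
  void addSheet(String name, boolean punchThrough) {
    Scope scope = stack.peek();
    scope.sheets.add(name);
    if (punchThrough) scope.all.add(name);
  }

  /** Stylesheets that apply to content at the current position, outermost first. */
  List<String> applicable() {
    List<String> result = new ArrayList<>();
    boolean isolated = false;                       // set once we have walked out of an isolate scope
    for (Scope scope : stack) {                     // iterates from the innermost scope outwards
      result.addAll(0, isolated ? scope.all : scope.sheets);
      if (scope.isolate) isolated = true;
    }
    return result;
  }

  public static void main(String[] args) {
    ScopeModel model = new ScopeModel();
    model.addSheet("myfont.css", true);             // <link ... bfo:scope="all">
    model.addSheet("bold", false);
    model.open(true);                               // <div bfo:scope="isolate">
    model.addSheet("italic", false);
    System.out.println(model.applicable());         // sheets seen by p1
    model.close();
    System.out.println(model.applicable());         // sheets seen by p2
  }
}
```

Running the third example through this model gives myfont.css and the italic rule for p1, then myfont.css and the bold rule for p2, matching the text in the two paragraphs.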

The bfo:scope attribute can be set on any element. When set on <xi:include>, as with any namespace-qualified attribute it will be propagated to the root element (or elements, if appropriate) of the included resource.

Elements which set bfo:scope to isolate or inherit will match the :scope CSS selector (if no scope is declared, the :scope selector matches the root element). The :scope selector and the concept of a scoping root are standard CSS concepts, but the creation of a scope is not currently defined in CSS or HTML.

HTML used to define a scoped attribute with similar functionality, but this was deprecated in 2014 and eventually removed from implementations. The concept remains popular amongst developers, and we’ve implemented this in Publisher due to the many advantages it brings to managing styles for large documents comprised of many included sections.

XSLT

BFO Publisher supports using XSLT to transform the source XML before processing. There are two ways this can be applied.

First, the <?xml-stylesheet?> processing-instruction, as specified in https://www.w3.org/TR/xml-stylesheet/. For example:

<?xml-stylesheet href="transform.xslt"?>
<?xml-stylesheet href="stylesheet.css"?>
<document xmlns="http://mycompany.com/myschema">
 <widget>
  <data>Content</data>
  ...

The XML will be transformed by the transform.xslt stylesheet before it is processed by BFO Publisher - in the above example, a CSS stylesheet will also be applied to the transformed XML.

It is possible to specify multiple XSLT stylesheets with this approach, and they will be applied in the order they’re specified. Any relative URLs will be resolved against the original URL of the Document, ignoring any HTML <base> element or xml:base attributes in the source.

For a more flexible alternative unique to BFO Publisher, any element in the XML can have a bfo:xslt attribute specified on it.

Name

bfo:xslt

Value

<url> [ <url>* ]

Applies to

all elements, with special processing when set on <xi:include> or <link>

This attribute contains the URL of one or more XSLT stylesheets; multiple URLs are separated with spaces. Unlike <?xml-stylesheet?>, the stylesheets are applied only to the subtree of the element the attribute is set on. This is particularly useful for transforming content included from another file with the <xi:include> element (see XInclude).

If set on <link> and the link element’s rel attribute includes attachment, the target of the link element will be transformed by the specified stylesheet(s) before it’s attached to the PDF (this functionality is new in version 1.3). See Attachments.

<aside id="fxrate">
 <h2>Exchange rates</h2>
 <p>Values current as of today</p>
 <xi:include href="http://fx.example.com/feed.xml" bfo:xslt="fx-to-html.xslt" />
</aside>

The processing is done by the XsltAttributeExtension extension, one of the default set of extensions used by BFO Publisher. It uses the standard javax.xml.transform package to transform the XML.

XSLT is not included in Java SE, so requires an implementation to be added to the classpath. We highly recommend Saxon, and have also tested with Apache Xalan. BFO Publisher will correctly stream events when used with XSLT 3.0 streaming stylesheets, provided the XSLT engine supports them - currently this is only Saxon EE.

Finally, a Java class can be specified to transform the XML instead of XSLT. This is useful for very specific transformations that would be difficult with XSLT, and as BFO Publisher supports multiple transformations it can be applied alongside an XSLT transformation if necessary. The index in the PDF version of this document is created using this approach.

Use a classpath URL to specify the class that will perform the transformation: the class must implement both org.xml.sax.XMLFilter and org.xml.sax.ContentHandler, have a public no-argument constructor, and it must already be in the Java classpath.

<div id="index">
 <xi:include bfo:xslt="classpath:com.example.MyXMLFilter file:/path/to/finaltransform.xsl"
             href="about:index"/>
</div>

The XML will be processed by the com.example.MyXMLFilter class, with the output further transformed by the /path/to/finaltransform.xsl XSLT stylesheet before being included in the final document.

Here’s an example XMLFilter that converts all element names to lower case:

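package com.example;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// XMLFilterImpl implements both XMLFilter and ContentHandler, and the
// implicit default constructor provides the public no-argument constructor.
public class MyXMLFilter extends XMLFilterImpl {

  @Override
  public void startElement(String ns, String lname, String qname, Attributes atts) throws SAXException {
    // Lower-case the local name before passing the event down the chain
    lname = lname.toLowerCase();
    super.startElement(ns, lname, qname, atts);
  }

  @Override
  public void endElement(String ns, String lname, String qname) throws SAXException {
    // The end tag must be renamed the same way as the start tag
    lname = lname.toLowerCase();
    super.endElement(ns, lname, qname);
  }

}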
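
A filter like this can be checked outside BFO Publisher before it is referenced from bfo:xslt. The sketch below uses only the standard SAX API from the JDK: it feeds a small document through a lower-casing filter and records the element names that reach the downstream handler. The FilterDemo and LowerCaseFilter names are our own for illustration, and the nested filter is a copy of the idea above so that the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterDemo {

  /** Lower-cases the local name of every element, like the filter shown above. */
  static class LowerCaseFilter extends XMLFilterImpl {
    @Override
    public void startElement(String ns, String lname, String qname, Attributes atts) throws SAXException {
      super.startElement(ns, lname.toLowerCase(Locale.ROOT), qname, atts);
    }

    @Override
    public void endElement(String ns, String lname, String qname) throws SAXException {
      super.endElement(ns, lname.toLowerCase(Locale.ROOT), qname);
    }
  }

  /** Parses the XML through the filter and returns the local names seen downstream. */
  static List<String> elementNames(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);              // needed for local names to be reported
      XMLReader parser = factory.newSAXParser().getXMLReader();

      LowerCaseFilter filter = new LowerCaseFilter();
      filter.setParent(parser);                     // the filter sits between parser and handler

      final List<String> names = new ArrayList<>();
      filter.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String ns, String lname, String qname, Attributes atts) {
          names.add(lname);
        }
      });
      filter.parse(new InputSource(new StringReader(xml)));
      return names;
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse test document", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(elementNames("<Doc><ITEM/><Item/></Doc>"));
  }
}
```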

Linking Resources

The HTML <link> element is the standard way to link to resources in HTML, and is fully supported in BFO Publisher. What is less well known is that a document-wide link header can also be specified with HTTP Headers.

RFC8288 specifies how the Link HTTP header can be used instead of <link> - the two syntaxes are virtually identical (mozilla.org have a good writeup) and this is supported by BFO Publisher - although it does require that the document XML/HTML is served over HTTP.

Another approach unique to BFO Publisher is the use of a link processing instruction. The syntax looks like this:

<?link rel="attachment" href="path/to/file.xml" ...?>
<html xmlns="http://www.w3.org/1999/xhtml">
 ...
</html>

Anything you can do with a <link> element in the document head, you can also do with this processing instruction. The approach is useful when the input document is being processed somehow (say with FreeMarker or XSLT), and as processing instructions can be added programmatically to any input via the API or with the Web Service, it adds quite a lot of flexibility: the link instruction no longer has to be part of the input document. See the Factur-X section for an example of where this is particularly useful.

FreeMarker and ZTemplate Templates

BFO Publisher (since version 1.3) supports using Apache FreeMarker (https://freemarker.apache.org) or ZPath ZTemplates (as specified at https://zpath.me) as a template pre-processor. Of course either of these or any other template processor can always be run manually to generate the input HTML or XML prior to processing it with BFO Publisher, however there are some advantages to doing the Template processing in BFO Publisher.

  • Ease of use - the template processing (which occurs in FreeMarker or ZTemplate) and the subsequent conversion of the output to PDF (in BFO Publisher) is presented as a single step

  • Templates can be processed via the WebService interface.

  • Where possible, template conversion is streamed into the next stage rather than storing the content in memory.

FreeMarker is invoked by adding a freemarker processing instruction to the file being parsed, exactly the same way as for XSLT processing. If the data is stored in XML this is trivial: the example from the FreeMarker documentation could be represented like so

<?freemarker href="path/to/template.ftl"?>
<data>
 <user>Big Joe</user>
 <latestProduct>
  <url>products/greenmouse.html</url>
  <name>green mouse</name>
 </latestProduct>
</data>

ZTemplates works exactly the same way, except a ztemplate processing instruction is used.

<?ztemplate type="text/html" href="test.ztl"?>
<data header="Colors">
 <items>
  <item first="true"><name>red</name><url>#Red</url></item>
  <item link="true"><name>green</name><url>#Green</url></item>
  <item link="true"><name>blue</name><url>#Blue</url></item>
 </items>
</data>

XML requires a single, named root element - we’ve used <data> for this, and if using FreeMarker the template would reflect that.

XML is a useful example as it allows us to draw parallels with XSLT, but a more typical serialization format for input to a modern Template language would be JSON or CBOR, both of which can be processed the same way. As processing-instructions cannot be added to JSON, one needs to be added manually via the API.

Report report = reportFactory.createReport();
ProcessingInstruction pi =
  new ProcessingInstruction("ztemplate", "href=\"path/to/template.ztl\"");
report.getProcessingInstructions().add(pi);
report.load(new File("data.json"));
report.parse();
PDF pdf = output.getPDF();
pdf.render(new FileOutputStream("out.pdf"));

For FreeMarker, it’s identical except for the change to the processing instruction

ProcessingInstruction pi =
  new ProcessingInstruction("freemarker", "href=\"path/to/template.ftl\"");

// or, if you prefer

ProcessingInstruction pi = new ProcessingInstruction()
   .setType("freemarker")
   .put("href", "path/to/template.ftl");

The same example from the FreeMarker documentation we referred to above could be represented in JSON as

{
  "user": "Big Joe",
  "latestProduct": {
    "url": "products/greenmouse.html",
    "name": "green mouse"
  }
}

Finally, this approach can also be used to apply a Template to a FreeMarker TemplateModel, which can be created from a HashMap or similar. Just pass the TemplateModel into the report.load() method.

For ZTemplate, the engine we’re using is https://github.com/faceless2/zpath - this can accept a com.bfo.json.Json or java.util.Map as input directly, so just pass that into report.load().

HTML, XML and relative paths

The output from any Template is assumed to be HTML by default; this is the case even if the data was originally loaded from an XML file. If the Template generates XML instead then this must be specified by adding a type="text/xml" attribute to the processing instruction. Our first example above would now look like this:

<?freemarker type="text/xml" href="path/to/template.ftl"?>
<data>
 ...

or when using the API

ProcessingInstruction pi =
  new ProcessingInstruction("freemarker", "type=\"text/xml\" href=\"path/to/template.ftl\"");

// or, if you prefer

ProcessingInstruction pi = new ProcessingInstruction()
   .setType("freemarker")
   .put("type", "text/xml")
   .put("href", "path/to/template.ftl");

If the Template contains relative URLs to images or other resources, they will be resolved relative to the path of the Template file, not relative to the data file being parsed.

Configuration and Security

Environment variables beginning with freemarker. are passed to the FreeMarker configuration, minus the freemarker. prefix. For example, setting the Environment variable freemarker.incompatible_improvements to 2.3.27 would configure FreeMarker to use that version of its API. The bfo-lang environment variable, used to set the default language of the Report, is also used to set the FreeMarker default language.

FreeMarker supports recursion, so a malicious template could use all the available memory in Java. BFO Publisher will resolve the URL in the processing-instruction as normal, so in theory Templates could be loaded from any URL. To mitigate the security implications here, we’ve added a concept of a trusted resource. URLs with a scheme of file, jar or classpath are trusted, and so can contain Templates. Attempting to load a Template from a non-trusted URL will fail. This is mostly of interest when using the web-service.

ZTemplate is a much simpler template language, and the only configurable option is bfo-lang, which is used to set the default locale for templates, as with FreeMarker. ZTemplate is designed to prevent recursion and other types of runaway resource use, but for added security the trusted resource concept described for FreeMarker is also required for ZTemplate templates.
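
The scheme rule is easy to check in your own code before submitting a template URL. The following Java sketch is an illustration only: TrustedSchemes and hasTrustedScheme are hypothetical names, not part of the BFO Publisher API, and the check covers only the three schemes named above (it knows nothing about files marked as trusted in the web-service store). It expects an absolute URL, so relative references should be resolved first.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TrustedSchemes {

  // The three schemes described above as trusted for templates
  private static final Set<String> TRUSTED =
      new HashSet<>(Arrays.asList("file", "jar", "classpath"));

  /** Returns true if the absolute URL uses one of the trusted schemes. */
  static boolean hasTrustedScheme(String absoluteUrl) {
    String scheme = URI.create(absoluteUrl).getScheme();
    // URI schemes are case-insensitive, so normalise before comparing
    return scheme != null && TRUSTED.contains(scheme.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(hasTrustedScheme("file:/opt/templates/invoice.ftl"));
    System.out.println(hasTrustedScheme("https://example.com/invoice.ftl"));
  }
}
```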

Templates and the Web-Service

Template processing can run from the Web Service too, so long as the file being converted contains the required processing-instruction or one is specified in the processing_instructions property passed to convert (see [Conversion API in detail]). The Template must be a trusted URL, so to allow Templates to be uploaded to the service, BFO Publisher version 1.3 adds the trusted key to files added to the store. This key can only be set by a user with the admin/trusted grant (see Access Control).

The way we expect this to work is as follows.

  1. An admin user with the admin/trusted grant uploads template(s) to a shared folder.

  2. Later, a regular user uploads a datamodel as CBOR/JSON and converts that by referencing that template with a processing-instruction

Here’s the upload of the Template, as done by the admin user

POST ${BASEPATH}/store HTTP/1.1
Authorization: Bearer …JWT with admin/trusted permission…
Content-Type: application/json

{
  "type": "put",
  "content_type": "text/html",
  "path": "freemarker-templates/template.html",
  "trusted": true,
  "content": …​
}

And here’s the conversion - the data is uploaded as a file, so is sent as JSON serialized as a string (if sending CBOR, the data can also be application/cbor, serialized as a byte buffer). The template is referenced with a processing instruction as shown. The data is the same as the JSON example above.

POST ${BASEPATH}/convert HTTP/1.1
Content-Type: application/json

{
  "url": "data.json",
  "lang": "en-US",
  "put": [
    {
      "path": "data.json",
      "content": "{\"user\":\"Big Joe\",\"latestProduct\":{\"url\":\"products/greenmouse.html\",\"name\":\"green mouse\"}}",
      "content_type": "application/json"
    }
  ],
  "extra_folders": ["freemarker-templates"],
  "processing_instructions": [
    {
      "type": "freemarker",
      "data": "href=\"template.html\""
    }
  ]
}
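
Escaping the data by hand for the content field is error-prone, so it is worth generating the string in code. A JSON library will normally do this for you; the Java sketch below shows the idea using only the JDK. The JsonStringLiteral class and its quote method are hypothetical helpers written for this example.

```java
class JsonStringLiteral {

  /** Wraps text in double quotes and escapes it so it is a valid JSON string value. */
  static String quote(String text) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;   // quote becomes backslash-quote
        case '\\': sb.append("\\\\"); break;   // backslash is doubled
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // other control characters must be written as a four-digit Unicode escape
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  public static void main(String[] args) {
    String data = "{\"user\":\"Big Joe\",\"latestProduct\":"
        + "{\"url\":\"products/greenmouse.html\",\"name\":\"green mouse\"}}";
    // Prints the value to place after "content": in the convert request
    System.out.println(quote(data));
  }
}
```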

Of course if the user doing the conversion has the admin/trusted grant, both the template and the datamodel can be uploaded in a single pass. Templates are cached and reused, and as BFO Publisher will identify them by their checksum it’s OK to upload them repeatedly if necessary - this will create extra network traffic, but not a lot of extra processing.

FreeMarker include and import are supported, as are includes in ZTemplate (they’re the same concept). Relative paths are resolved relative to the path of the Template file. Both are loaded using the BFO Publisher resource loader, so are treated the same way as any other URL (see Security). As a rule of thumb, the URLs for these imported files should be relative in order to work without Security implications.

Metadata

Metadata can be embedded in the source XML or HTML in a number of ways, all of which will be converted to XMP which is the native format for PDF. Typically there is a single XMP object for the whole document, created by merging all the metadata specified in the file.

Traditional HTML metadata

<title>The Document Title</title>
<title lang="de">Der Dokumententitel</title>
<meta name="author" content="The Document Author">

This is how most people think of metadata in HTML - the title and meta elements can be used to set the corresponding fields on the document. The vocabulary available with this approach is very limited however, so if you want to set more than just title, author, subject etc. then you need to look outside the HTML spec for inspiration:

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.date" content="2021-02-01">
<meta name="DC.identifier" content="urn:isbn:9781507760116">
<meta name="DC.contributor" content="Einstein, Albert">
<meta name="DC.contributor" content="Penrose, Roger">
<meta name="DC.contributor" content="Da Vinci, Leonardo">

The Dublin Core™ Elements are the normal gateway to more advanced metadata usage in HTML. BFO Publisher supports the approach recommended at https://www.dublincore.org/specifications/dublin-core/usageguide/2000-07-16/simple-html/ for embedding these properties (and of course this isn’t limited to Dublin Core). The only requirements here are:

  • A <link> with a rel property starting with schema., that maps the prefix to the schema namespace

  • <meta> properties that begin with that prefix followed by a ".", and that define simple values. Repeated definitions of the same property will be assembled into a list.
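
The two requirements above amount to a prefix lookup followed by grouping. The Java sketch below shows one way to picture it; it is not BFO Publisher code, the MetaCollector class is a hypothetical name, and the choice to key each property by namespace plus local name is our own assumption for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetaCollector {

  // prefix -> namespace, taken from link elements whose rel starts with "schema."
  private final Map<String, String> schemas = new LinkedHashMap<>();
  // namespace + local name -> values, in document order
  private final Map<String, List<String>> properties = new LinkedHashMap<>();

  /** Records a link element; only rel values of the form schema.PREFIX are of interest. */
  void link(String rel, String href) {
    if (rel.startsWith("schema.")) {
      schemas.put(rel.substring("schema.".length()), href);
    }
  }

  /** Records a meta element; repeated names accumulate into a list. */
  void meta(String name, String content) {
    int dot = name.indexOf('.');
    if (dot < 0) return;                         // plain names such as "author" are not handled here
    String namespace = schemas.get(name.substring(0, dot));
    if (namespace == null) return;               // prefix was never declared: ignored in this sketch
    String key = namespace + name.substring(dot + 1);
    properties.computeIfAbsent(key, k -> new ArrayList<>()).add(content);
  }

  Map<String, List<String>> properties() { return properties; }

  public static void main(String[] args) {
    MetaCollector collector = new MetaCollector();
    collector.link("schema.DC", "http://purl.org/dc/elements/1.1/");
    collector.meta("DC.date", "2021-02-01");
    collector.meta("DC.contributor", "Einstein, Albert");
    collector.meta("DC.contributor", "Penrose, Roger");
    System.out.println(collector.properties());
  }
}
```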

RDF metadata

<svg>
 <title>The SVG title</title>
 <desc>The SVG description</desc>
 <metadata>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <rdf:Description rdf:about="" xmlns:cc="http://web.resource.org/cc/">
    <cc:license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/2.5/" />
   </rdf:Description>
  </rdf:RDF>
 </metadata>
</svg>

SVG has a native method for embedding structured metadata - the <metadata> element contains RDF content, which is directly embeddable into the XMP object.

There’s no direct equivalent to this tag in HTML, but there is a recognised approach for embedding arbitrary XML in HTML: the <script> tag.

<script type="application/rdf+xml">
 <rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="">
  <mail:headers xmlns:mail="http://example.com/ns/mailschema/">
   <rdf:Seq>
    <rdf:li rdf:parseType="Resource">
     <mail:name>delivered-to</mail:name>
     <mail:value>user@example.com</mail:value>
    </rdf:li>
    <rdf:li rdf:parseType="Resource">
     <mail:name>subject</mail:name>
     <mail:value>email subject</mail:value>
    </rdf:li>
   </rdf:Seq>
  </mail:headers>
 </rdf:Description>
</script>

Set the type of the script to application/rdf+xml and any content within the script tag will be parsed as RDF (an individual RDF description can also be used, as shown here). This functions identically to the SVG <metadata> element.

The <script> element is used because - uniquely, in HTML - its content is always parsed as text. This makes it particularly suitable for embedding XML data. When parsing XML source files, the content can be wrapped in a CDATA block to avoid any parsing issues.

<link rel="meta" type="application/rdf+xml" href="https://example.com/schema/boilerplate.xml"/>

As an alternative to embedding the XML directly into the file, a <link rel="meta"> can be used - the content is the same as with <script>, but this time it’s stored in an external file. Although this technique is fairly widely used it’s not standardized, but it is particularly useful when building an XMP document as it lets us import boilerplate sections of XMP, such as extension schemas required for PDF/A-3.

Metadata Subjects

As mentioned above, typically metadata in the document applies to the document as a whole. But this isn’t always the case. For example, here we have an SVG inside an HTML document, each with its own title:

<html>
 <title>The Document Title</title>
 <meta name="bfo-pdf-tagged" content="">
 <body>
  <p>test</p>
  <svg>
   <title>The SVG Title</title>
   <rect fill="red"/>
  </svg>
 </body>
</html>

This works because the <svg> element is a metadata subject - an element that will "own" any metadata properties set within it. If the PDF is created with tags (for example, by setting the bfo-pdf-tagged environment variable to any value), then the tag corresponding to the SVG in the generated PDF will have its own XMP metadata.

Which elements are metadata subjects is determined by the -bfo-metadata-subject CSS property.

Name

-bfo-metadata-subject

Value

none | <url>

Applies to

all elements

Inherited

no

Description

For values other than none, marks this element as a metadata subject

Note

If a URL is specified, it should always be env(bfo-location) which is the URL of the current document. Other values may become useful in a later release.

By default, <svg>, <object>, <html>, <img> and <iframe> elements are all metadata subjects because the user-agent stylesheet defines this rule:

svg|svg, html, object, iframe {
    -bfo-metadata-subject: env(bfo-location);
}

Any element with -bfo-metadata-subject set to the document location will become a metadata subject. To reverse this, set -bfo-metadata-subject: none.

Note that replaced content like images, video and audio will always form metadata subjects, and will be initialised with metadata stored in the source file. So when embedding a JPEG that includes XMP metadata, the metadata will automatically be stored in the PDF.

Text

Fonts

Fonts are loaded exactly as specified in https://www.w3.org/TR/css-fonts-4/, which is fully supported except for the sections relating to variable fonts. The deprecated embedded-opentype and svg font formats are not supported.

Some features of OpenType are unsupported, and these can be tested using the font-tech() query from https://www.w3.org/TR/css-conditional-5/#at-supports-ext or the tech() function from https://www.w3.org/TR/css-fonts-4/#font-technologies-formats. The full list is below.

Tech               Support                                                   Notes
features-opentype  OpenType layout in all languages with CFF or GLYF shapes
features-aat       -                                                         Apple AAT layout is unsupported
features-graphite  -                                                         SIL Graphite layout tables are unsupported
variations         -                                                         OpenType Variable fonts are unsupported
colr-colrv0        Supported
colr-colrv1        -                                                         OpenType 1.9 introduced a new COLR table format which is unsupported
colr-svg           Supported
colr-sbix          -                                                         Apple SBIX color fonts are unsupported
colr-cbdt          Supported
palette            Supported
incremental        The "range request" approach is supported (see below)

As OpenType Variable fonts are unsupported, the CSS properties font-variation-settings and font-optical-sizing are unrecognised, as are the font-named-instance and font-variation-settings descriptors.

Incremental Loading

Incremental loading is a new idea in CSS, but one which the PDF Library underlying BFO Publisher has unwittingly supported for years. If this tech is specified on an OpenType or TrueType font (not WOFF or WOFF2), the font is loaded over HTTP, and the server supports HTTP Range requests, then only the required sections of the font will be downloaded.

For very large fonts where only a few glyphs are expected to be used, this can speed up processing and reduce network traffic - although as only the glyphs actually used are embedded in the PDF, the size of the final PDF will be unchanged.

/* We're unlikely to use more than a few glyphs in the bold version
 * so load it incrementally
 */
@font-face {
    font-family: "Noto Sans CJK TC";
    font-weight: 400;
    src: url("http://example.com/NotoSansCJKtc-Regular.otf") format(opentype);
}
@font-face {
    font-family: "Noto Sans CJK TC";
    font-weight: 700;
    src: url("http://example.com/NotoSansCJKtc-Bold.otf") format(opentype) tech(incremental);
}

PDF Fonts

The @font-face rule will load a font from a URL and embed it in the PDF, but PDF also supports a number of fonts which can be used without embedding anything. These are pre-defined in the user-agent stylesheet of BFO Publisher, so are available to every document.

  • PDF Times, PDF Helvetica, PDF Courier - these three fonts cover the glyphs required for English, French, German, Portuguese, Italian, Spanish, Dutch (no "ij" ligature), Danish, Swedish, Norwegian, Icelandic, Finnish, Polish, Croatian, Czech, Hungarian, Romanian, Slovak, Slovenian, Latvian, Lithuanian, Estonian, Turkish, Catalan (no Ŀ), Basque, Albanian, Rhaeto-Romance, Sorbian, Faroese, Irish, Scottish, Afrikaans, Swahili, Frisian, Galician, Indonesian/Malay and Tagalog (corrections and additions to this list welcome). They also include a number of extended punctuation characters. The full character map is available at https://bfo.com/products/pdf/docs/map-normal.pdf

  • PDF Symbol - stylistically identical to PDF Times, this is made up of math symbols. The map is available at https://bfo.com/products/pdf/docs/map-symbol.pdf

  • PDF ZapfDingbats - stylistically identical to PDF Times, this is most of the Unicode Dingbats table at U+2700 to U+27C0. The map is available at https://bfo.com/products/pdf/docs/map-zapf.pdf

The five fonts above, along with the bold, italic and bold-italic variants of PDF Times, PDF Helvetica and PDF Courier make up what’s commonly known as the Standard 14 Fonts in PDF. Every glyph in these fonts is guaranteed to be available on all PDF viewers.

There are also some semi-standard CJK fonts available to PDF

  • PDF Mincho - a variable stroke-width (serif-like) font for Japanese: map

  • PDF Kaku Gothic - a fixed stroke-width (sans-serif like) font for Japanese: map

  • PDF Songti TC - a variable stroke-width font for Traditional Chinese: map

  • PDF Heiti TC - a fixed stroke-width font for Traditional Chinese: map

  • PDF Songti SC - a variable stroke-width font for Simplified Chinese: map

  • PDF MyeongJo - a variable stroke-width font for Korean: map

  • PDF Gothic KO - a fixed stroke-width font for Korean: map

The range of glyphs available to each of these is defined but of course is much larger, and also tends to be expanded with new PDF revisions. Consequently whether a glyph can be shown on a PDF viewer or not is less well defined - for example, although the PDF Mincho font includes the Reiwa era glyph (U+32FF ㋿) added in 2019, it’s possible that the PDF viewer displaying the glyph may not have it.

Finally, BFO Publisher ships with the STIX Two Math font, which is pre-loaded for all documents and is used to display MathML.

PDF Fonts and PDF/A

PDF/A, PDF/UA and PDF/X require all fonts are embedded, so BFO Publisher ships with embeddable versions of the Standard 14 Fonts. No change is required to use these: if the PDF output format requires it, the embedded versions will be used automatically.

The same is not true for the CJK fonts listed above, and attempting to use those fonts in a PDF/A file will cause an error to be thrown during conversion. An embedded font must be used instead.

Local Fonts

BFO Publisher does not ship with any local fonts, and does not use the OS fonts directory. Fonts can only be loaded by creating a CSS stylesheet for those fonts and including it.

This process can be automated by using a BFO Publisher extension: the fonts URL scheme. It’s very similar to the file scheme, except that instead of returning the font at the specified path, it returns a CSS file describing the font. If the path is a directory the CSS file will describe all fonts contained in that directory. This makes it easy to add system fonts by simply including fonts:///c:/windows/fonts as a stylesheet.

There is a cost to adding too many fonts. First, they all have to be parsed to determine which characters they contain, which is done the first time the fonts URL is accessed. Second, if a character is used which is not in the current font, all fonts loaded this way will be checked and the first that contains the glyph will be used. For this reason we do not generally recommend adding every font you can find.

The fonts URL is a live URL: its content will be updated if fonts are added to or removed from that folder.

Generic Fonts

CSS defines several generic font families which should always be available, but as PDF does not use the operating system fonts this gets a bit more complicated. BFO Publisher maps these generic fonts to named fonts depending on the element’s language (typically just the script), and this mapping can be changed. The default mappings are listed below; see https://www.w3.org/International/articles/language-tags/ for an explanation of languages in XML and HTML.

Generic     Languages                       Mapped to
serif       *-Latn                          PDF Times, PDF Symbol, PDF ZapfDingbats
serif       *-Jpan, *-Hrkt, *-Hira, *-Kana  PDF Mincho
serif       *-Hant, *-Hanb, *-Bopo          PDF Songti TC
serif       *-Hans, *-Hani                  PDF Songti SC
serif       *-Kore, *-Hang, *-Jamo          PDF MyeongJo
sans-serif  *-Latn                          PDF Helvetica
sans-serif  *-Jpan, *-Hrkt, *-Hira, *-Kana  PDF Kaku Gothic
sans-serif  *-Hant, *-Hanb, *-Bopo          PDF Heiti TC
sans-serif  *-Kore, *-Hang, *-Jamo          PDF Gothic KO
monospace   *-Latn                          PDF Courier
math        *                               STIX Two Math

Other mappings might be preferred - for example, instead of using the standard PDF Fonts for these generic families, you might prefer to map all generic families to the Noto fonts, or to fonts available in your system font directory.

To enable this, BFO Publisher includes some system stylesheets which will do this for you. They are optional but can be included in any document as a user or user-agent stylesheet.

Stylesheet Result

classpath:data/fonts-noto-cjk.css

serif and sans-serif will be mapped to the Noto CJK series of fonts for the Chinese, Japanese and Korean scripts listed in the table above.

classpath:data/fonts-noto-core.css

serif, sans-serif and monospace in the Latn, Cyrl and Grek scripts will be mapped to Noto Serif, Noto Sans and Noto Sans Symbols, and Noto Sans Mono respectively.

classpath:data/fonts-noto.css

serif and sans-serif for languages other than the ones listed above will be mapped to the appropriate Noto font. emoji will be mapped to Noto Color Emoji.

classpath:data/fonts-os-cjk.css

serif, sans-serif, fangsong and cursive for Chinese, Japanese and Korean scripts listed above will be mapped to typical fonts shipped with Windows, macOS or Linux.

classpath:data/fonts-os.css

serif and sans-serif for specific languages other than Chinese, Japanese and Korean are mapped to typical fonts shipped with Windows, macOS or Linux. For example serif in the *-Thai script is mapped to the Angsana New font.

classpath:data/fonts-os-core.css

serif, sans-serif, monospace, cursive, fantasy and emoji generic fonts for unspecified languages are mapped to typical fonts shipped with Windows, macOS or Linux. For example, serif is mapped to Times New Roman

This is inevitably complex but it will usually boil down to this:

  • You want to use the Noto fonts for all generic families? Include the three noto stylesheets listed above, and make sure the fonts you need are loaded with @font-face.

  • You want to use typical OS fonts for all generic families? Include the three os stylesheets listed above.

  • You’re happy with the defaults for Latin and the CJK scripts, but want to add generic families for other languages? Include the fonts-os.css or fonts-noto.css as preferred.

Finally, all this can be customized - the stylesheets above can be retrieved by calling the Report.getResource() method (see API Usage for details) or extracted from the JAR file, and used as a basis for a custom mapping.

In all these cases, you will need to load the fonts by creating a @font-face rule for them. Only the standard PDF Fonts listed in the previous section can be used without a @font-face.

Hyphenation

Hyphenation is pre-supported for a number of languages, using the code and Hyphenation patterns from Apache FOP. Those patterns in turn were derived from TeX.

New hyphenation patterns can be loaded using a <link> element, as shown below. Either the FOP-style XML format (with a root element of <hyphenation-info>) or UTF-8 TeX patterns (which should look like \patterns{ …​ } - some examples) can be used.

<head>
<link rel="hyphenation" lang="ta" href="hyph-ta.tex"/>
<style>
  :lang(ta) {
    hyphens: auto;
    hyphenate-limit-chars: auto 3 auto;
  }
</style>
</head>

Language matching is done using standard BCP47 rules; setting lang="en-GB-oxendict" will look first for a hyphenation dictionary with that exact language tag, falling back to en-GB then en. Hyphenation requires a language to be set: with no lang attribute (set or inherited), no hyphenation will take place.
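
The fallback order can be pictured with a short Java sketch. This is our illustration of the lookup scheme from RFC 4647 (part of BCP47), which repeatedly truncates the tag from the right; the LanguageFallback class is a hypothetical name, and the exact matching code inside BFO Publisher may differ in details such as case normalisation.

```java
import java.util.ArrayList;
import java.util.List;

class LanguageFallback {

  /** Returns the tags to try, most specific first, by truncating from the right. */
  static List<String> candidates(String tag) {
    List<String> result = new ArrayList<>();
    String current = tag;
    while (!current.isEmpty()) {
      result.add(current);
      int dash = current.lastIndexOf('-');
      if (dash < 0) break;                        // only the primary language was left
      current = current.substring(0, dash);       // drop the last subtag
      // RFC 4647 lookup also drops a single-character subtag (such as "x") left at the end
      int previous = current.lastIndexOf('-');
      if (current.length() - previous - 1 == 1) {
        current = previous < 0 ? "" : current.substring(0, previous);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(candidates("en-GB-oxendict"));
  }
}
```

A hyphenation dictionary registered with lang="en" is therefore enough to cover en-GB and en-GB-oxendict content, unless a more specific dictionary is also loaded.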

Color

Color in CSS has been sRGB only for many years, but the 2021 publication of CSS Color 4 has finally brought wide gamut RGB and Lab color to the web. The specification is very RGB-focused (sRGB only uses ⅔ of the gamut of modern displays, many of which are capable of Display P3). But it also brings many improvements which are great for print, like Lab and LCH color.

BFO Publisher fully supports all of CSS Color 4 and a good amount of the evolving CSS Color 5 revision too. With the exception of device-cmyk(), all of the colors used in CSS are calibrated - their color values are defined exactly in a way that allows them to be reproduced reliably on screen and in print.

RGB and Lab/LCH

PDF itself has long supported calibrated colors, and allows them to be defined in CIELab, or by way of an embedded ICC profile. So long as they’re within gamut, any calibrated color can be converted to any other without loss. Which means we’re able to support all the colors now available in CSS, also without loss.

With the explosion of new color-spaces available we won’t list every syntax here - if it’s in the specification, we support it. The following table shows how each is stored in PDF.

Type            CSS Color Spaces                                                PDF Storage
RGB             sRGB, sRGB-linear, Display P3, Adobe™ RGB, ProPhoto, Rec.2020   ICC profile
Hue-based sRGB  HSL, HWB                                                        as for sRGB
Lab/LCH         CIE Lab, CIE LCH, OKLab, OKLCH                                  Lab
XYZ             XYZ                                                             Lab

BFO Publisher ships with the reduced-size ICC profiles for the various RGB spaces listed above, created and placed into the public domain at https://github.com/saucecontrol/Compact-ICC-Profiles.
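
As an illustration of how these color types look in a stylesheet, the rules below use standard CSS Color 4 notations. The class names and color values are arbitrary examples, and the comments simply restate the storage described above.

.brand  { color: color(display-p3 1 0.2 0.1); }   /* RGB: stored with an ICC profile */
.accent { color: hsl(210 60% 40%); }              /* hue-based sRGB: stored as for sRGB */
.ink    { color: lab(29.2% 39.4 20.1); }          /* CIE Lab: stored as Lab */
.tint   { color: oklch(60% 0.15 50); }            /* OKLCH: stored as Lab */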

CMYK and ICC colors

CSS Color 5 defines a syntax for device-dependent CMYK color - the only type of uncalibrated color available to CSS.

div { color: device-cmyk(0 0 0 1); }
div.alpha { color: device-cmyk(0 0 0 1 / 0.5); }  /* With 50% alpha */

As of 2022 this syntax is widely supported by other print CSS engines, although not yet by browsers. Another common syntax, in wide use before the standardization in CSS, is the cmyk() function, which is not supported in BFO Publisher by default. However, it can be added by setting the bfo-sys-colors environment variable:

Name         bfo-sys-colors
Value        none | <string> <string>*
Description  A list of one or more strings, naming the non-standard color functions which may be encountered in existing print documents and should be recognised by BFO Publisher. Currently defined values are cmyk, gray, device-gray, rgb-icc, -ro-spot and prince-color.

For example to use the legacy cmyk() function to define CMYK color:

@bfo env {
    bfo-sys-colors: "cmyk";
}
div { color: cmyk(0, 0, 0, 1); }

Device-dependent color is fine if you’re happy with whatever pigments the printer has, but if you want calibrated CMYK color you will need to reference an ICC Profile by its URL.

For example, to generate CMYK color that is calibrated to "FOGRA 39", also variously known as ISO12647-2:2004 or "ISO Coated v2", you would need the URL of a suitable FOGRA39 ICC profile.

@color-profile --fogra39 {
    src: url("http://example.com/path/to/fogra39.icc");
}
div { color: color(--fogra39 0 1 0 0); }

The @color-profile at-rule is defined in CSS Color 5, and takes a profile name - which must begin with a double-hyphen - and a single descriptor, src, which is the URL of the ICC profile to embed. To reference this new color-space use the standard CSS color() function as shown.

While CSS allows any type of ICC profile in theory, PDF only accepts CMYK, RGB or grayscale profiles that are of type prtr or mntr - printer and monitor, not scanner or other types intended for input devices rather than output.

Anchoring device-cmyk

The @color-profile rule can also be used to anchor any device-cmyk colors to an ICC profile. This is required for PDF/A, PDF/UA and PDF/X documents that make use of uncalibrated CMYK. It’s identical to the example shown above except the name is device-cmyk:

@color-profile device-cmyk {
    src: url("http://example.com/path/to/fogra39.icc");
}
div { color: device-cmyk(0 1 0 0); }

We strongly recommend this approach for CMYK content instead of using the color() function; it’s simpler to manage (device-cmyk is easier to remember than a custom name), causes fewer problems with overprint, and will also catch any CMYK images that don’t reference an ICC profile which would otherwise remain uncalibrated.

Spot colors and Overprint

CSS does not yet have a syntax for spot colors - also called separations, these can be thought of as additional pigments which are added to the printer alongside Cyan, Magenta, Yellow and Black.

The HP Indigo range of industrial printers, for example, has a white pigment called CMYK White, which can be used when printing onto colored stock.

To define spot colors in BFO Publisher, we’ve added two custom descriptors to the @color-profile rule which can be used instead of src: -bfo-components and -bfo-fallback:

Name        -bfo-components
Applies to  the @color-profile rule
Value       none | <string> [, <string> ]*

Name        -bfo-fallback
Applies to  the @color-profile rule
Value       none | <color> [, <color> ]*

These descriptors take a comma-separated list of component names - the name of the ink, e.g. Pantone Reflex Blue C - and their corresponding fallback colors. The two lists should be the same length, and in the vast majority of cases they will both be a single item, to define a Spot color (lists of more than one ink define what’s called a Device-N color in PDF parlance).

The fallback color(s) should typically be in device-cmyk or another CMYK space, but RGB is also allowed. More exotic spaces such as Lab and LCH can be used, but any space that is stored in PDF using the Lab color-space is a bit complicated when it comes to gradients, so be careful. Typically the Spot color name is a well known name from the PANTONE™ range or similar, and is being used because the intended output device is aware of its colorimetry. The fallback will only be used on a device that doesn’t know about that ink, such as a screen or regular desktop printer, so an approximate device-cmyk() is fine.

Once you’ve defined a new @color-profile rule with both descriptors, you can use it like any other.

@color-profile --reflexblue {
    -bfo-components: "PANTONE Reflex Blue C";
    -bfo-fallback: device-cmyk(1 0.723 0 0.02);
}
.a {
  color: color(--reflexblue 1);        /* 100% Reflex Blue */
}
.b {
  color: color(--reflexblue 0.5);     /* 50% Reflex Blue */
}

If you want to create a gradient between two spot colors, or between a spot color and a process color pigment, create a @color-profile rule with all the required components of the gradient. Your new color-space can have as many components as you like - a component value of 0 means no ink, so keep a component at zero to disable it. However that all gets a bit complex, so the example here just shows two inks.

@color-profile --blue-red {
    -bfo-components: "PANTONE Reflex Blue C",
                     "PANTONE Warm Red C";
    -bfo-fallback: device-cmyk(1 0.723 0 0.02),
                   device-cmyk(0 0.75 0.9 0);
}
@color-profile --blue-black {
    -bfo-components: "PANTONE Reflex Blue C",
                     "Black";
    -bfo-fallback: device-cmyk(1 0.723 0 0.02),
                   device-cmyk(0 0 0 1);
}
.a {
  /* A gradient from 100% "Pantone Reflex Blue C"
     to 100% "Pantone Warm Red C"
  */
  background: linear-gradient(to right,
                color(--blue-red 1 0), color(--blue-red 0 1));
}
.b {
  /* A gradient from 100% "Pantone Reflex Blue C"
     to 100% process black.
  */
  background: linear-gradient(to right,
                color(--blue-black 1 0), color(--blue-black 0 1));
}

Another use for Spot colors is to convey special pseudo-colors, which are really just instructions for the output-device: perhaps representing cut or score lines, for example, or an area to apply varnish or glue. Spot colors with overprint are a good choice here as they allow these lines to be marked without interfering with any other color on the page. Drawing these on a PDF Layer allows them to be easily removed during proofing.

Name        -bfo-overprint
Applies to  the @color-profile rule
Value       false | true

Setting -bfo-overprint to true in a @color-profile ensures that any colors created in that color-space are drawn with overprint. Overprinting is a concept unique to print; normally any inks drawn on a page replace all other inks in that area. When overprinting, only the inks in use by that color are replaced. It’s a difficult one to visualise on screen, but take a close look at the color components to get an understanding - the example shows drawing first in a CMYK color, then overwriting that with a spot-color in our custom colorspace: first normally, then with overprint.

Overprinting

To put all this together, imagine we want to create our PDF with a special Fold pseudo-color which indicates to our ISO19593-aware print workflow that a fold should be made at that point. We don’t want this fold line to obscure the color behind it - we want it to overprint.

<svg>
 <style>
  @color-profile --fold {
      -bfo-components: "Fold";
      -bfo-fallback: device-cmyk(0 0 0 1);
      -bfo-overprint: true;
  }
  .fold {
      stroke: color(--fold 1);
      -bfo-layer-type: layer;
      -bfo-layer-name: "Fold";
      -bfo-layer-processing-step: "Structural.Folding";
  }
  rect {
      fill: device-cmyk(0.2 1 0.3 0);
  }
 </style>
 <rect x="0" y="0" width="400" height="100"/>
 <line class="fold" x1="200" y1="0" x2="200" y2="100"/>
</svg>

Registration colors

Finally, printers sometimes make use of registration black - a color which uses all inks available, so that the marks will appear on all plates. The term rich black is sometimes heard too, which is device-cmyk(1 1 1 1). The two are identical unless spot colors are used in the document: spot-color plates will not be marked by rich black, only by registration black.

We’ve added a special color-profile to the list predefined in CSS Color 4, called registration. Printer marks (see the marks CSS property) are drawn in this color.

@page {
    @bottom-right {
        content: env(bfo-location);
        /* draw 100% in every ink on the output device */
        color: color(registration 1);
    }
}

Forms

HTML has form fields and PDF has form fields, but the underlying model between the two environments is quite different. BFO Publisher tries to consolidate these two views as much as possible, but some significant differences remain.

HTML:

  • has multiple independent <form> elements

  • JavaScript access to fields uses the HTML DOM

  • fields do not need a name. When names are specified they should be unique, but nothing will break if they’re not.

  • CSS can be used to style fields in a particular state, for example using the :checked pseudo-class.

  • Although more limited than general HTML elements, form fields can be largely restyled with CSS. appearance:none can be used to override the native appearance.

  • A click on the <label> element is treated as a click on the field, which opens some clever options for restyling checkboxes and radio buttons

  • Form submission can be get or post, with a post method of encoded or multipart depending on the enctype attribute

PDF:

  • has one document-wide Form

  • JavaScript access to fields uses the PDF JavaScript environment

  • every field must have a name - if they don’t, one is auto-assigned

  • field names must be unique across the entire document

  • as a special case, two fields with the same name must have the same type, and will share the same value. If two fields with the same name have different types, conversion will fail.

  • state-related pseudo-classes such as :checked can be used, but they will reflect only the initial state of the field - unchecking the checkbox will not cause it to be restyled.

  • Styling of the dynamic content of form-fields (ie. text) is extremely limited - only font size, style, family and color can realistically be altered.

  • appearance:none is not supported.

  • clicking on the <label> element does nothing.

  • Form submission can be post, fdf, xml or pdf. The post method is not controllable, but is usually encoded.

Within these limitations BFO Publisher will overlay the HTML model onto the PDF model as closely as possible. All HTML5 input types and attributes are supported, with the following caveats:

HTML Element               Comment
input type=email           Identical to text
input type=url             Identical to text
input type=search          Identical to text
input type=tel             Identical to text
input type=radio           Supported but the font cannot be changed
input type=checkbox        Supported but the font cannot be changed
input type=date            Adobe Acrobat styles this as a text field with a special date picker.
input type=time            Adobe Acrobat styles this as a text field with validation on the entered value.
input type=datetime-local  Adobe Acrobat styles the date portion of this as date, and the time portion as time.
input type=month           Styled as date.
input type=week            Identical to text (there is no standard PDF equivalent)
input type=range           Identical to number - styled as a text field with validation.
input type=file            Theoretically supported, but in practice appears unrecognised in Adobe Acrobat or other viewers.
input type=color           Identical to text (there is no standard PDF equivalent)
textarea                   PDF treats this as input type="text" with a multi-line flag. Directionality of auto is unsupported, it will be ltr or rtl. wrap is ignored.

HTML Attribute           Comment
list                     Supported on text fields and their variations, but maxlength will be ignored. Not supported on date types
min and max              Supported on number and range types only.
step                     Partially supported on number and range types - determines number of decimal places, no more
spellcheck               Supported, is not inheritable so should be set on the element.
autocomplete             Not supported
dirname                  Not supported
pattern                  Not supported
placeholder              Not supported
minlength                Not supported
maxlength                Supported on text and text-like fields, unless they have list set
multiple                 Supported on select
target and formtarget    Not supported
enctype and formenctype  Not supported
method and formmethod    See earlier paragraph for details on how this differs.
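
Putting the caveats above together, a form written with PDF output in mind might look like the sketch below. The field names, option values and action URL are invented for illustration; the points to note are that every field has a name that is unique across the document, and that only attributes listed above as supported are relied on.

<form method="post" action="https://example.com/submit">
 <label>Name <input type="text" name="customer-name" maxlength="40"></label>
 <label>Quantity <input type="number" name="quantity" min="1" max="10" step="1"></label>
 <label>Delivery date <input type="date" name="delivery-date"></label>
 <label><input type="checkbox" name="gift-wrap" checked> Gift wrap</label>
 <select name="extras" multiple>
  <option>Card</option>
  <option>Ribbon</option>
 </select>
 <input type="submit" value="Send">
</form>

The checked attribute sets the initial state of the checkbox, which is the only state that :checked styling will reflect in the PDF.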

Digital Signatures

BFO Publisher supports digitally signing the PDF it creates by using the HTML <object> element with a special type value of bfo/signature.

Only the PDF output format can be digitally signed. When used with any other output format, only the visible appearance of the signature will be generated. Also note that only one signature can be applied in PDF - this is a limitation of the PDF signing process.

<html>
<body>
 This document contains a digital signature
 <object type="bfo/signature">
   <param name="keystore" value="path/to/keystore.pkcs12">
   <param name="alias" value="myidentity">
   <param name="password" value="secret">
   <param name="reason" value="Royal Seal of Approval">

   <img src="Richard_III_signature_1.svg" style="width:100%" alt="Signature image"/>
   Richard III<br/>
   Leicester, LE1 5DB
  </object>
 </body>
</html>

An HTML <object> may have <param> children to configure the object; any other children are displayed as normal (as they would be for an HTML <object> with a missing source). Most of the parameters are optional, and which apply will depend on the engine.

engine

The Signature Engine to use for the signature. Options are the default engine (the default) and globalsign.dss or globalsign.qss, which use the GlobalSign Digital Signing Service™ or Qualified Signing Service, described at https://www.globalsign.com/en/digital-signatures.

keystore

The URL of the Key Store containing the digital identity used for signing. BFO Publisher recognises the PKCS#12, JKS and JCEKS Key-Store types normally used by Java, and the identity can also be loaded from a file with a PEM-encoded Private Key and one or more PEM encoded X.509 certificates. This is the only parameter that is always required.

alias

The alias to use from the Key Store. Only required for Key Stores that contain more than one key, such as JKS. This parameter may also be specified as the alias fragment parameter of the keystore URL, for example path/to/keystore.jks#alias=myidentity. Multiple fragment parameters are encoded exactly like query parameters.

serial

The serial number of the identity to use from the Key Store - this serves the same purpose as alias and is also optional. It may be combined with cn or dn. Like alias, it may also be specified as a fragment parameter in the keystore URL.

cn

The X.500 common name of the identity to use from the Key Store - again, like alias or serial this is a method of choosing an identity from a Key Store that contains more than one, and so is optional. It may be combined with serial and may be specified as a fragment parameter in the keystore URL.

dn

The X.500 distinguished name of the identity to use from the Key Store. Only the fields specified will be matched. This is here for completeness but cn is usually a simpler choice. It may be specified as a fragment parameter in the keystore URL.

password

The password to access the chosen identity from the Key Store. It may be specified as a fragment parameter in the keystore URL. If not specified, it will be requested via a callback (see Secrets and Authentication).

store-password

For Key Store formats that allow different passwords for the key and the store itself, this specifies the store password. It may be specified as a fragment parameter in the keystore URL. If not specified, it will be requested via a callback (see Secrets and Authentication).

title

A descriptive title for the signature field, this serves as the description of the field for accessibility purposes (see PDF/UA)

reason

The Reason for signing, a text string stored with the signature.

location

The Location of signing, a text string stored with the signature.

contact-info

The Contact Information of the entity signing the document, a text string stored with the signature.

author

The Author of the signature, a text string stored with the signature. If not specified, defaults to the common name of the signing identity

changes

Which changes to allow to the PDF after signing. Values are comments to allow comments to be applied, fields to also allow form fields to be completed, all to allow any changes allowed by the signature policy of the application displaying the PDF, or none to allow no changes at all (the default).

attestation

If the value of changes is not all, this is an optional text string to be stored with the signature explaining anything affecting the legal integrity of the document.

tsa

The URL of an RFC3161 time-stamp server to time-stamp the signature. This parameter may be specified more than once to provide a list of servers, in case the first one is unavailable.

hash

The digest algorithm to use for signing - the default is sha256

mode

The signature mode - values are ltv for PAdES Long Term Validation, pades (the default) for regular PAdES signatures, or legacy for non-PAdES.

size

The estimated size in bytes of the PKCS#7 object that will be generated. If unspecified, BFO Publisher will estimate the size by doing a test signature first. This is usually the best option unless you’re using a signing service that charges per signature. The value determines how much space is reserved for the signature and needs to be higher than the largest PKCS#7 object that might be generated.

store-type

The type of Java KeyStore to create. This is not required for normal use - the type is automatically determined for regular software-based Key Stores. It’s required for non-standard extensions to Java, such as signing with the Amazon CloudHSM keystore when you would use a value of CloudHSM.
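
As a sketch of how several of the optional parameters combine, the example below selects the identity with a fragment parameter on the keystore URL, requests a Long Term Validation signature and lists two time-stamp servers. The keystore path, alias, server URLs and title text are placeholders. No password is given, so it would be requested via a callback as described above.

<object type="bfo/signature">
 <!-- alias supplied as a fragment parameter of the keystore URL -->
 <param name="keystore" value="path/to/keystore.jks#alias=myidentity">
 <param name="mode" value="ltv">
 <param name="hash" value="sha256">
 <param name="changes" value="fields">
 <!-- hypothetical time-stamp servers; the second is listed in case the first is unavailable -->
 <param name="tsa" value="http://tsa1.example.com/">
 <param name="tsa" value="http://tsa2.example.com/">
 <param name="title" value="Approval signature">
 Signed by the approver
</object>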

The keys above apply to the default signature engine. The globalsign.dss and globalsign.qss engines have different requirements:

apikey

The API key required to access the GlobalSign Service. Required, but will be requested via a callback if missing (see Secrets and Authentication).

apisecret

The API secret required to access the GlobalSign Service. Required, but will be requested via a callback if missing (see Secrets and Authentication).

identity

For globalsign.qss, an email address (which must have previously been registered with the QSS service). For globalsign.dss, an X.500 identity which will be used for signing. Which X.500 fields apply, or whether it applies at all, will depend on the DSS service. Required, but will be requested via a callback if missing (see Secrets and Authentication).

keystore

As for the default engine, this is the path to a Key Store. For globalsign.dss or globalsign.qss signatures the Key Store contains the TLS client certificate issued by GlobalSign to access their service, rather than identity information for the signature. The alias, password etc. properties also apply exactly as they do above.
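
A signature using the GlobalSign DSS engine might then be sketched as follows. The keystore path, password and identity value are placeholders, and the exact form of the identity string is an assumption, since which X.500 fields apply depends on the DSS service. The apikey and apisecret parameters are left out here, so they would be requested via a callback.

<object type="bfo/signature">
 <param name="engine" value="globalsign.dss">
 <!-- Key Store holding the TLS client certificate issued by GlobalSign -->
 <param name="keystore" value="path/to/globalsign-client.pkcs12">
 <param name="password" value="secret">
 <!-- hypothetical X.500 identity used for signing -->
 <param name="identity" value="CN=Example Signer, O=Example Ltd, C=GB">
 Signed via GlobalSign DSS
</object>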

PKCS#11 Hardware tokens

Key Store information can also be taken from an HSM (Hardware Security Module) - in fact, this is a requirement for signatures meeting the standards of the Adobe Approved Trusted List (AATL). We’ve written more on this topic at https://bfo.com/blog/2019/09/23/perfect_pdf_digital_signatures_eu_style/

The only change required to use a PKCS#11 hardware token is changing the URL of the Key Store to use the pkcs11 scheme, as defined in https://tools.ietf.org/html/rfc7512

<html>
<body>
 This document contains a digital signature
 <object type="bfo/signature">
  <param name="keystore"
     value="pkcs11:object=Test%20Tester?pin-value=1234&module-path=/usr/lib/opensc-pkcs11.so"/>
  Signature Here
 </object>
</body>
</html>

RFC7512 describes a number of parameters but not all are available for use with the Java PKCS#11 interface. Those that do apply are described here.

object

This is the CKA_LABEL of the object to retrieve from the PKCS#11 device - effectively the alias of the identity on the keystore. It’s part of the URL Path, and is required.

module-path

This is the path to the native library providing the PKCS#11 implementation to Java. It’s part of the URL Query String, maps to the library parameter in Java’s PKCS#11 interface, and is required.

pin-value

This is the PIN value to retrieve the object from the PKCS#11 device - effectively the password. If pin-value and pin-source are both missing the standard callback mechanism will be used to prompt the user for the PIN (see Secrets and Authentication). It’s part of the URL Query String and is optional.

pin-source

This is the URL of a file containing the PIN value to retrieve the object from the PKCS#11 device. If pin-value and pin-source are both missing the standard callback mechanism will be used to prompt the user for the PIN. It’s part of the URL Query String and is optional.

slot-id

This is the CK_SLOT_ID to select on the PKCS#11 device - a number, typically 0 or 1. It’s part of the URL Path, maps to the slot parameter in Java’s PKCS#11 interface, and is optional.

library-description

This is the CK_INFO libraryDescription - the description of the PKCS#11 library. It’s part of the URL Path, maps to the description parameter in Java’s PKCS#11 interface, and is optional.

java-nnn

Any fields in the URL Path or Query String that begin with java- will be added (minus the java- prefix) to the PKCS#11 interface configuration as specified. All are optional.

Other parameters may be specified but will be ignored.
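
As a further sketch, the URL below adds a slot-id to the path (RFC7512 separates path attributes with a semicolon) and reads the PIN from a file with pin-source rather than embedding it. The PIN file location is a placeholder, and the ampersand is written as &amp; so that the markup is also valid when parsed as XML.

<object type="bfo/signature">
 <param name="keystore"
    value="pkcs11:object=Test%20Tester;slot-id=0?pin-source=file:/etc/token.pin&amp;module-path=/usr/lib/opensc-pkcs11.so"/>
 Signature Here
</object>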

Signature defaults

Each of the parameters specified will take their default values from an environment variable of the form bfo-ext-signature-NNN where NNN is the parameter name. This allows information which might not be available to the document author, such as the path to the keystore, to be specified in advance.

Name

bfo-ext-signature-nnn

Value

string

Description

set the default parameter values for any digital signatures.

Using this approach, the PKCS#11 example above could have been specified like this:

<html>
<meta name="bfo-ext-signature-keystore"
  content="pkcs11:object=Test%20Tester?pin-value=1234&module-path=/usr/lib/opensc-pkcs11.so">
<body>
 This document contains a digital signature
 <object type="bfo/signature">
  Signature Here
 </object>
</body>
</html>

Obviously the environment variable can be specified externally to the source file rather than inline as shown here. Using the Overrides and Defaults controls from the web-service to set these properties would allow an identity used for authentication with the Web Service to preconfigure or control the identity used for signing documents.

Barcodes

BFO Publisher supports inserting dynamically created barcodes into the PDF. While it’s possible to create barcodes externally and insert them as a bitmap, the approach described here will be faster and give better resolution: the barcode is generated with vector graphics.

An HTML <object> element with a special type value of bfo/barcode is used to insert a barcode. An HTML5 <embed> object can also be used if preferred, which allows using attributes instead of nested <param> tags.

<object type="bfo/barcode">
 <param name="code" value="qrcode"/>
 <param name="value" value="Testing Barcode"/>
 <param name="unit" value="1mm"/>
 <p>testing fallback</p>
</object>
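
The same barcode written with <embed> might look like the line below. This is a sketch that assumes the attribute names simply mirror the parameter names; the text says attributes can replace the nested <param> tags but does not spell out their names.

<embed type="bfo/barcode" code="qrcode" value="Testing Barcode" unit="1mm"/>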

The parameters below can be used. Any unrecognised parameters will be ignored, as will any other children to <object>. code and value are required, anything else is optional and the defaults for those values depend on the code symbology.

code

The barcode type - required.

label

Either true (the default) or false, controls whether the text value of the barcode is displayed. Does not apply to every type of barcode.

unit

The barcode unit-size. The exact meaning of this depends on the code, but for 1D codes like Code 128 it’s typically the width of the thinnest bar, and for 2D codes like QR-code it’s the size of the individual squares in the code.

columns

For PDF417 barcodes, the number of columns to format the code - must be >= 1

ecc

The error correction to apply. The generic values none, min and max will be mapped to whatever is appropriate for the chosen barcode type, and integer values can also be used, with 0 meaning none, 1 meaning min and values increasing from there.

ratio

For variable width barcodes like Code 39, the ratio between thin and thick bars. Typically about 2.5

bleed

The ink-bleed amount, specified as a length (or horizontal and vertical lengths separated with a space).

value

The barcode value. Required.

Barcode types can be any of the following

  • qrcode

  • code128

  • ean13

  • ean8

  • upca

  • codebar

  • postnet

  • rm4scc

  • identcode (leitcode is a synonym)

  • intelligentmail

  • databar

  • databar-truncated

  • aztec

  • datamatrix

  • deutschepostmatrix

  • code39

  • code39x

  • code93

  • interleaved25

  • itf14

  • itf14-box (ITF14 with the bearer box)

  • pdf417 (pdf417:2006 and pdf417:2015 are synonyms)

  • pdf417:2001 (see here for an explanation)

  • pdf417:eci

The size of the barcode (the intrinsic size of the object, in CSS terminology) will be dynamically calculated from the barcode type, value and unit and will include the minimal mandatory whitespace around the barcode. It’s possible to override the size in HTML by setting the width or height attributes on the <object> or in CSS, but be aware that for some types this may distort the barcode. Controlling size with unit is the preferred option.
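
Two further sketches show how the optional parameters vary by symbology: a Code 128 barcode sized with unit and with its text label hidden, and a PDF417 barcode with a column count and maximum error correction. The barcode values and sizes are arbitrary examples.

<object type="bfo/barcode">
 <param name="code" value="code128"/>
 <param name="value" value="BFO-0012345"/>
 <param name="unit" value="0.3mm"/>   <!-- width of the thinnest bar -->
 <param name="label" value="false"/>  <!-- hide the text value -->
</object>

<object type="bfo/barcode">
 <param name="code" value="pdf417"/>
 <param name="value" value="Testing Barcode"/>
 <param name="columns" value="4"/>    <!-- must be >= 1 -->
 <param name="ecc" value="max"/>
</object>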

Barcodes will respect many CSS properties like font-family, color and so on. Although this author feels there is a special place in hell reserved for those who place an image over the center of a QR-code, neatly obscuring all the error correction, it can be done by placing an absolutely positioned image in the correct place:

<div style="position: relative; width: min-content">
 <object type="bfo/barcode" style="display: block">
  <param name="code" value="qrcode">
  <param name="ecc" value="max"> <!-- give your image something to obscure -->
  <param name="value" value="Testing Barcode">
 </object>
 <img src="kitten.png"
   style="position: absolute; left: 0; top: 0; width: 100%; height:100%; object-fit: none">
</div>

Footnotes and friends

Documents with many references often need a way to move those references out of flow. Three approaches are typical in paged media.

  • Footnotes - content is moved to the end of the current page.

  • Endnotes - content is moved to the end of the current chapter or section.

  • Sidenotes - content is moved to the side of the page.

BFO Publisher supports all three, and although the method of achieving each is quite different there are some common concepts.

  • In all cases, content is moved away from where it’s defined in the document. To indicate this, a call is left at the point it was removed - typically a counter value, often super-scripted¹ or placed within brackets (1).

  • The content is moved to a region where it is preceded by a marker containing the same counter value as the call, in exactly the same way as a list-marker marks a list item.

Footnote, endnote and sidenote markers and calls

Footnotes

Footnotes are defined in https://www.w3.org/TR/css-gcpm-3/, so are at least semi-standardised in CSS. To create a footnote set the float property on the footnote element to footnote. This does several things:

  1. A ::footnote-call pseudo-element is created as a child of the floated element and positioned at the call location on the page, where the footnote was originally.

  2. A ::footnote-marker pseudo-element is also created as a child of the floated element, and moved with the floated element to the @footnote region of the page.

The @footnote region is similar to the Page Margin Boxes, and can be styled the same way. It is only created on a page when there are footnotes to place into it. The example above could be styled with the following stylesheet.

Use counter-set: footnote 0 rather than counter-reset: footnote to reset a footnote counter in the page margin; the second approach won’t work if counter scoping is implemented as specified.

<style>
@page {
  counter-set: footnote 0; /* Reset the counter to zero on each page */
  @footnote {
    margin-top: 1em;
    border-top: 1px black -bfo-dash(0 50%); /* Draw a top-border that's 50% of its length */
    padding-top: 1em;
  }
}
.note {
  float: footnote;
}
::footnote-call {
  /* These rules are the defaults; they're only shown for completeness */
  counter-increment: footnote;
  content: counter(footnote);
  font-variant-position: super;
}
::footnote-marker {
  /* These rules are the defaults; they're only shown for completeness */
  content: counter(footnote);
}
</style>
<section>
 <p>
  Here is useful
  <span class="note">
   this information is not actually that useful, so we moved it to a footnote
  </span>
  information about something important, which we'll now go on about at length
 </p>
</section>
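
The defaults can be overridden in the usual way. As a sketch, the rules below switch to the bracketed style of call mentioned earlier. They rely only on the standard content property and counter() function, and the exact punctuation is a matter of taste.

::footnote-call {
  content: "(" counter(footnote) ")";   /* bracketed call instead of a superscript */
  font-variant-position: normal;
}
::footnote-marker {
  content: "(" counter(footnote) ") ";  /* matching marker in the @footnote region */
}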

Endnotes

Endnotes are when content is moved to a later position in the document - the end of a chapter, a section, or the document. Endnotes don’t have standard support in CSS, but we can use CSS Regions (see https://www.w3.org/TR/css-regions-1/) to enable this.

BFO Publisher has fairly limited support for CSS Regions; overflow is not yet supported. What is implemented is easily enough to handle endnotes.

Unlike footnotes we have to create the Call, Marker and Region manually, but we can do this with pseudo-nodes so no change to the markup is required. Below is the previous example with the CSS changed to float the notes to the end of each section.

<style>
.note {
  display: block;
  counter-increment: endnote;
  flow-into: endnote;           /* Redirect the content to the "endnote" region */
}
.note::-bfo-call {              /* A custom pseudo-class that allows a "call" to be created  */
  content: counter(endnote);    /* for any element, not just footnotes. Like "::before", it's */
  font-variant-position: super; /* created only if the "content" property is set */
}
.note::before {                 /* The is the "marker" for our endnotes */
  content: counter(endnote) ". ";
}
section {
  counter-reset: endnote;
}
section::after {                /* This is the "endnote region". It's created at the end */
  flow-from: endnote;           /* of each section and will display any content previously */
  display: block;               /* directed to the "endnote" region. */
  margin-top: 1em;
  border-top: 1px black solid;
  padding-top: 1em;
}
section::before {               /* This pseudo-element exists only to flush the "endnote" */
  flow-into: endnote;           /* region content. Every time a new section is encountered, */
  display: block;               /* the "break-before: region" on this element ensures that */
  content: "";                  /* any endnotes that follow it are sent to the next */
  break-before: region;         /* "endnote region" rather than the previous one */
}
</style>
<section>
 <p>
  Here is useful
  <span class="note">
   this information is not actually that useful, so we moved it to a footnote
  </span>
  information about something important, which we'll now go on about at length
 </p>
</section>

Index Generation

A special feature of BFO Publisher is the ability to generate a simple XML document containing values calculated during the layout process. This can be used in a post-processing step, or it can be included back into the document with <xi:include> and a suitable stylesheet to create an index.

The generated XML file has the special URL about:index. It can be retrieved by calling report.getResource("about:index") from the API, or by downloading the about-index.xml file from the web-service after conversion has completed.

Creating the about:index file involves setting some custom CSS properties. The syntax is very similar to the bfo-tag properties used for Tagged output, as both these sets of properties are used to create an XML-like tag structure from the input XML and CSS. It makes use of the <content-list> type definition from https://drafts.csswg.org/css-content-3/#typedef-content-content-list (although the leader() function is excluded, and the content() function can also be used).

Name

-bfo-index

Value

none | <tag> <content-list>?

Applies to

all elements or pseudo-elements that are rendered in the output.

Inherited

no

Description

Controls the existence and format of this element in the generated index

Name

-bfo-index-nnn

Value

none | <content-list>

Applies to

any element or pseudo-element with -bfo-index other than none

Inherited

no

Description

Sets the value of the nnn attribute on this element in the generated index

Setting the -bfo-index property on an element will generate an element with that value in the about:index document. If the element then contains other elements which also have -bfo-index set, those elements will be properly nested.

If the <content-list> is set after the tag name, it will be used to populate the text content of the node (text-content will precede all element children; mixing text and element children is not recommended). And if a <content-list> is set on any -bfo-index-nnn property, it will set the attribute nnn on the -bfo-index element to that content.

As an example: to create an about:index file containing the page number for every element with an id, and noting the content and auto-generated heading number of every <h1> and <h2>, you could do something like this:

<style>
[id] {
    -bfo-index: tag;
    -bfo-index-id: attr(id);
    -bfo-index-page: counter(page);
}
h1, h2 {
    counter-increment: heading;
    -bfo-index: "h" content(before) content(text);
    -bfo-index-page: counter(page);
}
h1::before, h2::before {
    content: counter(heading) ": ";
}
</style>
<section id="sect1">
 <h1>first heading</h1>
 ...
 <h2>second heading</h2>
</section>

This would generate an about:index file like this:

<index:index xmlns:index="about:index">
  <section id="sect1" page="1">
   <h page="2">1: first heading</h>
   <h page="5">2: second heading</h>
  </section>
</index:index>

The id attribute and generated page number for each item are included, and the content() function is used to copy the content - including the generated content in the ::before pseudo-element - from the <h1> and <h2> elements to the generated tag.
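If you use the about:index file in a post-processing step, any XML parser will do. The sketch below uses the standard JDK DOM parser to list the headings from an index shaped like the one above. The class and method names are hypothetical and not part of the BFO Publisher API, and how you obtain the bytes of the file (from the API or the web-service) is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not part of the BFO Publisher API. */
final class IndexHeadings {

    /** Returns "text (page N)" for every <h> element found in the index XML. */
    static List<String> headings(byte[] indexXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(indexXml));
            // The <h> tag name comes from the -bfo-index value used in the stylesheet above
            NodeList nodes = doc.getElementsByTagName("h");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element h = (Element) nodes.item(i);
                out.add(h.getTextContent().trim() + " (page " + h.getAttribute("page") + ")");
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse index XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<index:index xmlns:index='about:index'>"
                   + "<section id='sect1' page='1'>"
                   + "<h page='2'>1: first heading</h>"
                   + "<h page='5'>2: second heading</h>"
                   + "</section></index:index>";
        // prints "1: first heading (page 2)" then "2: second heading (page 5)"
        for (String line : headings(xml.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(line);
        }
    }
}
```

The same approach works for any tag and attribute names you choose with -bfo-index and -bfo-index-nnn.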

Including the generated index

As well as downloading the about:index file after generation, it can also be included into the source document before parsing completes. With a few simple HTML attributes added to the document to create index entries, an index can be generated as part of the conversion - all in a single pass.

This section brings together many of the features we’ve previously described - XInclude, XSLT, Page counters and the -bfo-index properties described above. And we’re now conceptually dealing with three XML documents:

  • the input XML, which will be annotated to create index entries

  • the about:index file which those entries generate

  • the XML file created by transforming about:index with the org.faceless.publisher.ext.IndexFilter stylesheet.

This stylesheet places requirements on the about:index XML that need to be met.

As a result, this chapter is necessarily more complicated than the others. We’ll first walk through a process of how you might do this, show the results, and then in the next sub-section we’ll go into detail of how it works.

An example generated index

The first step is to annotate our XML or HTML input to generate index entries. There are many ways you could do this; for now we’ll just describe the approach we took for the PDF version of this document, which involved setting one of three custom attributes on an HTML element to create an index entry.

Here’s the stylesheet we added to the file.

/* Set "data-index" attribute to create a single index entry for that item */
[data-index] {
    -bfo-index: "entry";
    -bfo-index-id: attr(id);
    -bfo-index-term: attr(data-index);
    -bfo-index-page: counter(-bfo-page-physical) " " counter(page, auto);
}

/* Set "data-index-range" attribute to create a "range" index entry
   for that item, referencing all pages containing the element */
[data-index-range] {
    -bfo-index: "entry";
    -bfo-index-id: attr(id);
    -bfo-index-term: attr(data-index-range);
    -bfo-index-page: counter(-bfo-page-physical) " " counter(page, auto);
    -bfo-index-to-page: counter(-bfo-page-physical-close) " " counter(-bfo-page-close, auto);
}

/* Set "data-index-xref" attribute to create a cross-reference entry which has no page */
[data-index-xref] {
    -bfo-index: "entry";
    -bfo-index-term: attr(data-index-xref);
}

Check the Page counters section for details on those custom counter values.

To use these styles from within our source HTML, we set one of those attributes on any element we want to index (that is not display:none). The value of the attribute is the term - more on that later, but at its simplest it’s just the phrase to index. In this example we’re using a / character in the term to create a hierarchy: entry and sub-entry.

<section data-index-range="fruit">
<p>
 ...
 text describing the <span data-index="fruit/apple">apple</span>
 ...
 more text describing <span data-index="fruit/citrus fruit">citrus fruit</span>
 ...
 an <span data-index-xref="fruit/orange -> fruit/citrus fruit">orange</span> is a type of
 ...
 even more text describing <span data-index="fruit/citrus fruit">citrus fruit</span>
 ...
</p>
</section>

The -bfo-index and -bfo-index-nnn properties have already been described, so if you’ve been following along you’ll know that together, the stylesheet and HTML above would generate something like the following about:index file:

<index:index xmlns:index="about:index">
 <entry term="fruit" page="5 5" to-page="15 15">
  <entry term="fruit/apple" page="7"/>
  <entry term="fruit/citrus fruit" page="9"/>
  <entry term="fruit/orange -> fruit/citrus fruit"/>
 </entry>
</index:index>

We then need to include this XML back into the original source document, applying an appropriate stylesheet to convert it to an HTML index we can style.

Insert the following XML near the end of your input document - it must be after any elements that would add to the index. If you’re doing this in HTML rather than XML, you might want to set the -bfo-ext-html-namespace environment property to auto.

<xi:include
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:bfo="http://bfo.com/ns/publisher"
    bfo:xslt="classpath:org.faceless.publisher.ext.IndexFilter"
    href="about:index"
/>

This will include and transform the about:index file shown above with the stylesheet at org.faceless.publisher.ext.IndexFilter. This is a class implementing XMLFilter rather than an XSLT stylesheet, as the transform is a bit much for XSLT. The transformed output looks something like this:

<section class="bfo-index-container">
 <div class="bfo-index">

  <div class="bfo-index-group" data-term="F">
   <div class="bfo-index-heading" data-term="F">F</div>

   <div class="bfo-index-entry" data-term="fruit">
    <span class="bfo-index-term">fruit</span>
    <span class="bfo-index-pages">
     <a href="pdf:goto(5)">5</a>-<a href="pdf:goto(15)">15</a>
    </span>

    <div class="bfo-index-entry bfo-index-entry-final" data-term="apples">
     <span class="bfo-index-term">apples</span>
     <span class="bfo-index-pages">
      <a href="pdf:goto(7)">7</a>
     </span>
    </div>

    <div class="bfo-index-entry bfo-index-entry-final" data-term="citrus fruit">
     <span class="bfo-index-term">citrus fruit</span>
     <span class="bfo-index-pages">
      <a href="pdf:goto(9)">9</a>,<a href="pdf:goto(12)">12</a>
     </span>
    </div>

    <div class="bfo-index-entry bfo-index-entry-final" data-term="oranges">
     <span class="bfo-index-term">oranges</span>
     <span class="bfo-index-xref">fruit, citrus fruit</span>
    </div>
   </div>

  </div>
 </div>
</section>

As these elements are now included into the source document, they can be styled in the normal way. A default user-agent stylesheet is included as part of the index - it’s in the Jar as org/faceless/publisher/data/index.css and can be retrieved by calling Report.getResource(URL2.parse("classpath:data/index.css")).

Here’s how the generated index looks, with the default styles.

[Figure: the generated index rendered with the default styles]

Index file format details

The main aspect of this section is the transformer org.faceless.publisher.ext.IndexFilter and the format it expects, which we’ll call the BFO Publisher Index format. As the IndexFilter is a class in the classpath, it has API documentation which will add to what’s described here.

The IndexFilter class has a convenient transform() method which can be used to quickly transform an input XML to an output XML. It also has a main method which runs this transform on System.in and writes to System.out, so you can run it from the command line for testing.

The XML format consists of any root element, containing multiple <entry> elements. Each <entry> may contain other <entry> elements (nesting is ignored) or <term> elements. All other elements are transparent to this transformation. The definitions are:

<entry>

defines an entry in the index. Multiple entries can exist for a single term; they will be merged into one entry with multiple pages. The attributes are:

term

specifies the index term or terms. An index term is required, and may be set with the term attribute, a <term> descendant of the <entry> or both. There are three aspects to each term: the term itself, plus optionally how the term is sorted and (if it’s a cross-reference) what it’s referencing. The term attribute is actually a list of terms, for when one <entry> needs to create multiple index entries. The term is parsed with special formatting rules, best demonstrated by example:

term="apples"

add an entry with the term "apples"

term="fruit/apples"

add an entry with the term "fruit" and a sub-entry with the term "apples". The class, page number etc. apply to the sub-entry (apples) rather than the parent (fruit).

term="α-particle { alpha-particle }"

add an entry with the term "α-particle" but sort it as if it were "alpha-particle"

term="α{alpha}-particle"

exactly as the previous example.

term="http:\/\/"

add an entry "http://" - the slash characters are escaped by prefixing with a backslash "\"

term="malus domestica -> apple"

add an entry "malus domestica" which is a cross-reference to the entry "apple".

term="rgb() // rgba()"

add two identical entries, "rgb()" and "rgba()", both of which will refer to the same page set on the entry.

term="rgba() -> rgb() // #number -> rgb()"

create two entries for "rgba()" and "#number", both of which are cross-references to the "rgb()" entry.

If a term creates a cross-reference but no page attribute is specified, a see style entry is created. If a term creates a cross-reference and a page attribute is set, a see also style entry is created instead. There is no limit to how deep terms can go, but we suggest no more than three levels.

page

the page the item is on. The format is an integer starting at 1, followed optionally by the formatted value of the page. For example page="12", page="12 12" and page="12 xii" all link to page 12, but the final version will display the number as the lower-case Roman numeral "xii". If no page attribute is specified, the term attribute must create a cross-reference, otherwise the entry is ignored.

to-page

if the entry covers a range of pages, the to-page attribute specifies the last page of the range. The format is identical to page.

class

the optional class to apply to the term.

page-class

the optional class to apply to the page-number.

id

if specified, the generated hyperlink in the index will be to that item rather than to the numbered page.

<term>

any <term> child of an <entry> will have its text-content and any term attribute merged into the term attribute of the parent <entry>, first replacing any zero-length terms, then being appended if no zero-length terms exist. The XML constructions within each of the following two groups are equivalent:

<entry term="apples" />

<entry>
 <term term="apples"/>
</entry>

<entry>
 <term>apples</term>
</entry>

<entry term="fruit/apples" />

<entry>
  <term>fruit</term>
  <term term="apples" />
</entry>

<entry term="/apples"> <!-- note leading slash - first term is blank -->
  <term term="fruit"/>
</entry>

<entry term="fruit">
  <term>apples</term>
</entry>
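To make the term rules above concrete, here is a rough sketch of how a term attribute could be broken down using the default tokens ("//", "->", "/" and the backslash escape). It is our simplified reading of the syntax for illustration only: it is not the IndexFilter implementation, the class and method names are hypothetical, and the "{ }" sort values are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a simplified reading of the term syntax, not the IndexFilter code. */
final class TermSketch {

    /** One index entry: its hierarchy of terms and an optional cross-reference target. */
    static final class ParsedTerm {
        final List<String> path;  // e.g. ["fruit", "apples"]
        final String xref;        // raw cross-reference target, or null if there is none
        ParsedTerm(List<String> path, String xref) { this.path = path; this.xref = xref; }
    }

    static List<ParsedTerm> parse(String term) {
        List<ParsedTerm> out = new ArrayList<>();
        for (String single : splitUnescaped(term, "//")) {         // "a // b" creates two entries
            List<String> sides = splitUnescaped(single, "->");     // "a -> b" is a cross-reference
            String xref = sides.size() > 1 ? unescape(sides.get(1)).trim() : null;
            List<String> path = new ArrayList<>();
            for (String level : splitUnescaped(sides.get(0), "/")) { // "a/b" is entry and sub-entry
                path.add(unescape(level).trim());
            }
            out.add(new ParsedTerm(path, xref));
        }
        return out;
    }

    /** Splits on a token, skipping over backslash-escaped characters (escapes are kept). */
    static List<String> splitUnescaped(String s, String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                cur.append(c).append(s.charAt(i + 1));
                i += 2;
            } else if (s.startsWith(token, i)) {
                parts.add(cur.toString());
                cur.setLength(0);
                i += token.length();
            } else {
                cur.append(c);
                i++;
            }
        }
        parts.add(cur.toString());
        return parts;
    }

    /** Removes the backslash from each escaped character. */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                sb.append(s.charAt(++i));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, TermSketch.parse("fruit/apples") gives one entry with the path [fruit, apples], and TermSketch.parse("rgb() // rgba()") gives two entries. For real conversions, rely on the IndexFilter class and its API documentation rather than this approximation.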

The transform will convert this input XML into an output XML, an example of which was shown in the previous section. The output XML has the following structure.

  • A wrapper <section> element with class bfo-index-container

  • A single element with class bfo-index

  • One or more elements with class bfo-index-group, one for each group. Groups depend on the language, but broadly there is one per letter.

  • Each bfo-index-group contains one element with class bfo-index-heading, then one or more elements with class bfo-index-entry.

  • Each bfo-index-heading element contains a single element with class bfo-index-term containing the name of the group, eg "F".

  • Each bfo-index-entry element contains a single element with class bfo-index-term containing the term. It also contains either:

    • one element of class bfo-index-pages containing the <a> elements hyperlinking to the page, and the page number text.

    • zero or more elements of class bfo-index-xref-also, for any see also cross-references that exist alongside the page number.

    or

    • one or more elements of class bfo-index-xref, for any see cross-references that exist without a page number.

This reference with the example from the previous section should give you everything you need to style the index if required.

Customizing the transformed index

The transform that is applied to the input XML has a few settings that can be modified.

  • subdivision - the token to split a term into hierarchy. The default is "/" - as in the example "fruit/apples"

  • division - the token to split a term into more than one term. The default is "//", as in the example "rgb() // rgba()"

  • xref - the token which identifies the start of a cross-reference in the term - the default is "->".

  • sort - the token (or pair of tokens) which delimits an optional sort-value in a term. The default is "{ }" - it should be two tokens separated by a space.

  • comma - the output-token inserted between two unrelated page numbers. The default is ",", as in the output "5,9"

  • dash - the output-token inserted between two ranged page numbers. The default is an en dash, as in the output "5–9"

  • separator - the output-token inserted between a term entry and sub-entry in a see or see also cross-reference. The default is ",", as in the output "see fruit, apples".

  • symbol - the text to use for the group of entries that begin with a symbol character. The default is "Symbol".

Each of these can be set as attributes on the root element of the input XML, which requires a particular approach when it’s included with XInclude:

<xi:include
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:bfo="http://bfo.com/ns/publisher"
    xmlns:xila="http://www.w3.org/2001/XInclude/local-attributes"
    bfo:xslt="classpath:org.faceless.publisher.ext.IndexFilter"
    href="about:index"
    xila:subdivision="|"
    xila:division="||"
    xila:symbol="Simbolos"
/>

Alternatively, they can be set as environment variables in the document:

@bfo env {
    bfo-ext-index-subdivision: "|";
    bfo-ext-index-division: "||";
    bfo-ext-index-symbol: "Simbolos";
}

Finally, note that sorting is done with java.text.Collator. This requires the language to be set correctly on the document.
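To show why the language matters, here is a small standalone sketch of locale-aware sorting with java.text.Collator from the JDK. The helper class is hypothetical and says nothing about how IndexFilter configures its own Collator; it only demonstrates that collation order differs from plain String ordering and depends on the Locale passed in.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper demonstrating locale-aware sorting with the JDK Collator. */
final class CollatorSort {

    /** Returns a copy of the terms sorted by the collation rules of the given locale. */
    static List<String> sort(List<String> terms, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator); // Collator is a Comparator, so it can be passed directly
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("banana", "Zebra", "apple");
        // Plain String ordering would put "Zebra" first, as upper-case letters sort first.
        // prints [apple, banana, Zebra]
        System.out.println(sort(terms, Locale.ENGLISH));
    }
}
```

Accented letters are where locales typically disagree, which is why an index in a document with the wrong language set may be ordered in a way readers of that language don't expect.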

Graph Library

With BFO’s Graph Library included in the Java CLASSPATH, BFO Publisher can embed graphs and charts directly into the output. This functionality is new in version 1.3.

The XML syntax for the Graph Library predates BFO Publisher, and has no knowledge of CSS. We’ve adapted the syntax so that attributes defined in the XML syntax can be set by CSS; specifically:

  • The width, height, overflow, font-family, font-size, font-weight and padding-* CSS properties set the corresponding XML attributes

  • The color CSS property sets the textcolor XML attribute

  • The fill, stroke, stroke-width and stroke-dasharray CSS properties set the color, bordercolor, linethickness and dash XML attributes

Here’s an XML example

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <style>
   axesgraph {
     display: block;
     width: 300pt;
     height: 300pt;
     font: 12px sans-serif;
   }
   data[x="04-MAY-2005"] {
     fill: red;
     stroke: transparent;
   }
  </style>
 </head>
 <body>
  <h1>A graph</h1>
  <axesgraph xmlns="http://bfo.co.uk/ns/graph?version=2">
   <label>My Graph</label>
   <axis pos="bottom" type="date(dd-MMM-yyyy)" density="sparse" align="top left"/>
   <axis pos="left" type="int" min="0" max="20"/>
   <barseries name="Freestyle">
    <data x="02-MAY-2005" y="10"/>
    <data x="04-MAY-2005" y="12"/>
    <data x="06-MAY-2005" y="8"/>
   </barseries>
  </axesgraph>
 </body>
</html>

Graph Library tags are identified by their namespace, so using them in HTML requires HTML Namespace Extensions. Probably the easiest approach is to set the namespace for axesgraph, piegraph, and dialgraph in advance, which means the tags can be specified as if they were normal HTML5 tags. The same example as above in HTML syntax would be:

<!DOCTYPE html>
<html>
 <head>
  <meta name="bfo-ext-html-namespace" content="axesgraph=http://bfo.co.uk/ns/graph?version=2 piegraph=http://bfo.co.uk/ns/graph?version=2 dialgraph=http://bfo.co.uk/ns/graph?version=2"></meta>
  <style>
   ...
  </style>
 </head>
 <body>
  <h1>A graph</h1>
  <axesgraph>
   <label>My Graph</label>
   <axis pos="bottom" type="date(dd-MMM-yyyy)" density="sparse" align="top left"></axis>
   ... content as for the XML example above - but remember, no self-closing tags! ...
  </axesgraph>
 </body>
</html>

Factur-X

Factur-X (also known as ZUGFeRD) is a European initiative for electronic invoicing, primarily used in Germany and France. Technically, it’s nothing more than an XML file embedded in a PDF/A-3 document, so BFO Publisher can create Factur-X invoices very easily.

The steps below presume you already have the invoice data as an XML file. BFO Publisher doesn’t do any checks on the validity of this file, but it should comply with CII XML ("Cross Industry Invoice", more formally UN/CEFACT SCRDM CII D16B); the root element should be:

<rsm:CrossIndustryInvoice xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100">
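Since the invoice XML isn't checked during conversion, you may want a quick sanity check of your own before converting. The sketch below uses only the standard JDK XML parser to confirm that the root element is CrossIndustryInvoice in the namespace shown above. The class name is hypothetical, and this is in no way a validation of the invoice against the CII schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical pre-flight check; it does not validate the invoice content. */
final class CiiRootCheck {

    static final String CII_NS = "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100";

    /** Returns true if the XML is well-formed and its root is the CII invoice element. */
    static boolean hasCiiRoot(byte[] xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed to compare namespace and local name
            Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            Element root = doc.getDocumentElement();
            return CII_NS.equals(root.getNamespaceURI())
                && "CrossIndustryInvoice".equals(root.getLocalName());
        } catch (Exception e) {
            return false; // not well-formed, or could not be read
        }
    }

    public static void main(String[] args) {
        String good = "<rsm:CrossIndustryInvoice xmlns:rsm='" + CII_NS + "'/>";
        String bad = "<invoice/>";
        System.out.println(hasCiiRoot(good.getBytes(StandardCharsets.UTF_8))); // prints true
        System.out.println(hasCiiRoot(bad.getBytes(StandardCharsets.UTF_8)));  // prints false
    }
}
```

The check compares the namespace URI rather than the "rsm" prefix, because the prefix itself is arbitrary in XML.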

To take a file like this and turn it into a Factur-X compliant invoice, there are several options.

Create HTML for the invoice manually, and store the invoice XML in an external file.

If you have code that generates the invoice XML from data in your database, an obvious option is to also generate the HTML for the invoice from the same source. The basic format of any HTML file for conversion to a Factur-X invoice is as follows:

<html>
 <head>
  <meta name="bfo-pdf-profile" content="factur-x-basic">
  <link rel="attachment/alternative" name="factur-x.xml" href="path/to/myinvoice.xml">
 </head>
 <body>
  <h1>Invoice</h1>
  ...
 </body>
</html>

The only parts that need adding beyond a basic HTML file are

  • The bfo-pdf-profile meta key is set to factur-x-basic, factur-x-extended or factur-x-en16931

  • The XML is attached using a <link rel="attachment/alternative" name="factur-x.xml" …​>
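If your HTML is generated by code, the two additions listed above can be written by a small helper. Here is a minimal sketch using plain string building; the class and method names are hypothetical, and the caller is assumed to supply body HTML that is already well-formed.

```java
/** Hypothetical helper that wraps invoice body HTML with the Factur-X additions. */
final class FacturXHtml {

    /**
     * @param profile    the bfo-pdf-profile value, for example "factur-x-basic"
     * @param invoiceXml the URL of the invoice XML, relative to the HTML
     * @param bodyHtml   the visible invoice content, inserted as-is
     */
    static String wrap(String profile, String invoiceXml, String bodyHtml) {
        return "<html>\n"
             + " <head>\n"
             + "  <meta name=\"bfo-pdf-profile\" content=\"" + escapeAttr(profile) + "\">\n"
             + "  <link rel=\"attachment/alternative\" name=\"factur-x.xml\" href=\""
             + escapeAttr(invoiceXml) + "\">\n"
             + " </head>\n"
             + " <body>\n"
             + bodyHtml + "\n"
             + " </body>\n"
             + "</html>\n";
    }

    /** Escapes the characters that are unsafe inside a double-quoted attribute value. */
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        System.out.print(wrap("factur-x-basic", "path/to/myinvoice.xml", "  <h1>Invoice</h1>"));
    }
}
```

The name attribute is fixed as "factur-x.xml" in the sketch, since that's the name used for the attachment in every example in this section.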

Create XHTML for the invoice manually, and store the invoice XML in the same file.

If you’re working with XHTML, it’s quite natural to include the Invoice XML data in the same file. This is identical to the process above except the href attribute refers to a fragment URL - the only requirement here is the link is used before the element it refers to. If you’re doing this, you’re probably using XHTML rather than HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta name="bfo-pdf-profile" content="factur-x-basic" />
  <link rel="attachment/alternative" name="factur-x.xml" href="#factur-x" />
   <rsm:CrossIndustryInvoice id="factur-x" xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100">
    ...
   </rsm:CrossIndustryInvoice>
 </head>
 <body>
  <h1>Invoice</h1>
  ...
 </body>
</html>

Create HTML/XHTML for the invoice manually, and store the invoice XML in the same file, wrapped in a <script>

The above option requires that an id attribute is set on the <rsm:CrossIndustryInvoice> element, but if you don’t want to do this (or if you want to use HTML rather than XHTML) then wrap the Invoice XML in a <script> element. This works in HTML because the data model for <script> allows it to embed just about anything; it’s the recommended way to embed JSON-LD, RDF and other content without needing to escape it. If you’re using XHTML then you still need to abide by the rules of XML, which typically means using a <![CDATA[ …​ ]]> immediately inside the <script> to wrap the XML.

<html>
 <head>
  <meta name="bfo-pdf-profile" content="factur-x-basic" />
  <link rel="attachment/alternative" name="factur-x.xml" href="#factur-x" />
  <script type="text/xml" id="factur-x">
   <rsm:CrossIndustryInvoice xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100">
    ...
   </rsm:CrossIndustryInvoice>
  </script>
 </head>
 <body>
  <h1>Invoice</h1>
  ...
 </body>
</html>

Use XSLT to convert the Invoice XML to XHTML

The most elegant solution uses an XSL stylesheet to generate the visual appearance of the invoice directly from the invoice XML.

A custom stylesheet may include the appropriate <link> and <meta> tags to attach the Invoice XML and set the bfo-pdf-profile to the Factur-X profile. The XML being attached is actually the source document itself, so the link tag should look like <link rel="attachment/alternative" name="factur-x.xml" href="#"/> (the href "#" refers to the whole document).

Writing a suitable XSL stylesheet is complex. BFO has published open-source XSLT stylesheets to demonstrate one way to do it, which are probably a useful starting point if you want to write your own. They are available at https://github.com/bfocom/publisher-extra/tree/main/factur-x

Alternatively, the XRechnung developers have published some open-source stylesheets to do this at https://github.com/itplr-kosit/xrechnung-visualization which generate HTML from CII XML input (the one with <rsm:CrossIndustryInvoice> as a root). They do this in two stages: one XSL stylesheet converts from CII to an intermediate XML format, and a second XSL stylesheet converts from this to HTML. Both stylesheets use XSLT 2.0, so you will need an XSLT2 processor. We highly recommend adding Saxon to your CLASSPATH. These stylesheets generate HTML that includes JavaScript and that is not designed for print layout, so some work will be required to get useful output from these when used with BFO Publisher.

We’ll present here a worked example of how to do this with the BFO open-source stylesheets.

First, you’ll need a CII XML document to convert - if you don’t have one, there is a sample included with that package as src/test/instances/wrong-date-with-text-uncefact.xml

The conversion requires that a processing instruction is added to the XML; the most obvious (and wrong - we’ll come to that) way to do this is to edit the XML itself. Add the line shown below to the top of your XML file, ensuring you use the correct path to the XSL stylesheets. We’ve placed ours in the root folder of the downloaded package.

<?xml-stylesheet href="invoice.xsl"?>
<rsm:CrossIndustryInvoice xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100">
 ...
</rsm:CrossIndustryInvoice>

The <?xml-stylesheet?> line ensures the XML will be transformed by the specified stylesheet before it is processed by BFO Publisher. Try it now to convert the XML file to PDF; the output is almost Factur-X compliant.

The problem (which we hinted at above) is that the embedded XML file now contains that processing instruction. Adding it to the file directly won’t work, and in fact it’s far more useful to specify the processing instruction by way of the API or Web Service, whichever you’re using. So if using the API, you might do something like this:

import java.io.*;
import java.util.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import org.faceless.publisher.ReportFactory;
import org.faceless.publisher.Report;
import org.faceless.publisher.output.ReportOutput;
import org.faceless.publisher.type.MediaType;
import org.faceless.publisher.type.ProcessingInstruction;

public class Test {
  ReportFactory factory = new ReportFactory();

  public void convert(File infile, File outfile) throws IOException, SAXException, ParserConfigurationException {
    Report report = factory.createReport();
    List<ProcessingInstruction> pi = report.getProcessingInstructions();
    pi.add(new ProcessingInstruction("xml-stylesheet", "href=\"invoice.xsl\""));
    ReportOutput output = factory.createReportOutput(MediaType.parse("application/pdf"));
    report.setReportOutput(output);
    report.load(infile);
    report.parse();
    // try-with-resources closes the stream even if writing fails
    try (FileOutputStream stream = new FileOutputStream(outfile)) {
      output.write(stream);
    }
  }
}

If using the Web Service, you could send a request like this:

{
  "type": "convert",
  "uri": "myinvoice.xml",
  "processing_instructions": [
    {"type":"xml-stylesheet", "data":"href=\"invoice.xsl\""}
  ],
  "put": [
    {"path": "myinvoice.xml", "content_type": "text/xml", "content": ... },
    {"path": "invoice.xsl", "content_type": "application/xslt+xml", "content": ... },
    {"path": "invoice.css", "content_type": "text/css", "content": ... },
    ...
  ]
}

An even smarter solution would be to put the stylesheet and the resources it refers to in a shared folder, to save uploading them every time. See the extra-folders property in the Web Service documentation for details.

If you’re working with other XSL stylesheets intended for general HTML output, chances are they will not set the correct <link> and <meta> tags in the HTML. You can edit these in, of course, but if that’s not possible both these properties can also be set via the API or Web-Service.

To add the <link>, use a link processing instruction - see Linking Resources for details on this custom BFO Publisher extension. And the bfo-pdf-profile metadata can be set any number of ways, as it’s just an Environment variable. It can be set with the environment key to the Web-Service, or add it to the map returned by Report.getEnvironment() if using the API.

What if my initial format is not CII XML?

Factur-X mandates CII XML as the XML format that is embedded, but it’s a bit unfriendly to work with. If you have an initial XML format that can be transformed to CII XML using an XSL stylesheet, the transformed XML can be embedded by using the bfo:xslt attribute on the <link>.

<!--
  Use the approach shown in earlier examples to embed CII XML
-->
<html xmlns="http://www.w3.org/1999/xhtml">
 <link rel="attachment/alternative" name="factur-x.xml" href="cii.xml" />
 ...
</html>


<!--
  Use this approach to transform another XML format to CII XML before embedding it
-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:bfo="http://bfo.com/ns/publisher">
 <link rel="attachment/alternative" name="factur-x.xml" href="simple.xml" bfo:xslt="transform.xsl" />
 ...
</html>

The XML syntax is shown, but remember with HTML that non-standard namespaces need special processing. See HTML Namespace Extensions for details.

This will also work with the XSLT-based approach to conversion shown above. The initial XML invoice would be transformed to HTML with one stylesheet, and it would be converted to CII XML using a different stylesheet before embedding. If using a <?link?> processing instruction to do this, the namespace requirement on the xslt "attribute" is dropped.

<?xml-stylesheet href="xsl/convert-to-html.xsl"?>
<?link rel="attachment/alternative" name="factur-x.xml" xslt="xsl/convert-to-cii.xsl" href="#"?>
<my-simple-invoice>
 ...
</my-simple-invoice>

API Usage

BFO Publisher is a Java API at heart, and can be easily embedded in any larger product to enable conversion. It’s built on the BFO PDF Library from https://bfo.com/products/pdf, which is included in the JAR.

Dependencies

There are three JAR files included with the product download - the two main ones are:

bfopublisher.jar

Contains BFO Publisher, the BFO PDF Library (from https://bfo.com/products/pdf) and the open-source BFO JSON library from https://faceless2.github.io/json/.

bfopublisher-bundle.jar

As bfopublisher.jar but also includes htmlparser.jar and the Netty JARs listed below. This JAR alone can be used to convert HTML/XML to PDF and run the web service.

When running standalone, the bfopublisher-bundle.jar file is probably the easiest. If you want to manage dependencies yourself then include the bfopublisher.jar in the CLASSPATH, along with any JAR files listed below that are required.

All the JARs below are optional. Conversion can run without them, although some functionality will not be available.

htmlparser-VERSION.jar

Available from https://about.validator.nu/htmlparser/ with more recent builds from https://mvnrepository.com/artifact/nu.validator/htmlparser, we’ve tested both the 1.4 version (from 2012) and recent nightly builds. We strongly recommend this is in the classpath - without it, conversion of HTML (as opposed to XHTML) will not work.

netty-buffer-VERSION.jar
netty-codec-VERSION.jar
netty-common-VERSION.jar
netty-handler-VERSION.jar
netty-transport-VERSION.jar

These five JARs from the Netty project (https://netty.io) are required to use the web-service, but are otherwise optional. Version 4.1.73 or later is required.

saxon-VERSION.jar

From https://www.saxonica.com/, the Saxonica JAR is recommended if XSLT transformation is used. We’ve tested with version 9.9he but any version should be fine as we (mostly) use the standard Java XSLT interface to communicate.

xalan-VERSION.jar

From https://xml.apache.org/xalan-j/, Apache Xalan is an alternative for XSLT transformation. We’ve tested with version 2.7.2 but again, any version should be fine as we use the standard Java XSLT interface.

zstd-jni-VERSION-PLATFORM.jar

From https://github.com/luben/zstd-jni - if available, HTTP connections can use Zstd compression.

lz4.jar

From https://github.com/lz4/lz4-java - if available, HTTP connections can use LZ4 compression.

jlessc-VERSION.jar

From https://github.com/i-net-software/jlessc - if in the classpath, stylesheets with a rel attribute of stylesheet/less will be automatically converted to CSS.

sass-embedded-host-VERSION.jar
sass-embedded-protocol-VERSION.jar
protobuf-java-VERSION.jar

From https://github.com/larsgrefer/dart-sass-java - if in the classpath, stylesheets with a rel attribute of stylesheet/scss will be automatically converted to CSS.

webp-imageio-core-VERSION.jar

From https://github.com/nintha/webp-imageio-core - if available, WEBP format images can be loaded.

bfopdf-jj2000.jar

Included with the download package and also available at https://github.com/faceless2/jpeg2000, this JAR is required if JPEG2000 images need to be decompressed to bitmap. This happens when converting an HTML file referencing a JPEG2000 image bitmap, SVG, or PDF/A-1.

Getting started

The API Documentation is included with this help and served by the embedded web-server, or is available from https://publisher.bfo.com/live/help/api. Only classes which are useful for end-users have been documented, so it won’t be too overwhelming.

Here’s a simple example:

import java.io.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import org.faceless.publisher.ReportFactory;
import org.faceless.publisher.Report;
import org.faceless.publisher.output.ReportOutput;
import org.faceless.publisher.type.MediaType;

public class Test {
  ReportFactory factory = new ReportFactory();

  public void convert(File infile, File outfile) throws IOException, SAXException, ParserConfigurationException {
    Report report = factory.createReport();
    ReportOutput output = factory.createReportOutput(MediaType.parse("application/pdf"));
    report.setReportOutput(output);
    report.load(infile);
    report.parse();
    // try-with-resources closes the stream even if write() throws
    try (FileOutputStream stream = new FileOutputStream(outfile)) {
      output.write(stream);
    }
  }
}

The conversion process usually goes like this:

  1. Create a single ReportFactory object, which is used for multiple conversions. Reusing it allows stylesheets, fonts, images etc. to be cached and shared - it will be faster. A ReportFactory can be shared across threads.

  2. Create a Report for a single conversion by calling factory.createReport()

  3. Create a ReportOutput for that conversion by calling factory.createReportOutput() or simply instantiating the required ReportOutput class directly, e.g. new PDFReportOutput().

  4. At a minimum call report.setReportOutput(), then configure anything else that might be required before conversion.

  5. Call report.load(source) to load the report. The source can be anything we can realistically convert to a byte stream - java.io.File, java.net.URL, java.io.InputStream, byte[] or org.xml.sax.InputSource to name a few.

  6. Call report.parse(). This runs the conversion and writes the report to the ReportOutput

  7. When complete, call output.write() or retrieve the generated report by calling a method on the subclass, eg PDFReportOutput.getPDF()

Exploring the API

Customizing: properties, stylesheets, fonts, resources

We’ve claimed BFO Publisher is entirely configurable with environment properties, so you might expect these to be easy to set through the API.

For example, to force every PDF conversion to generate PDF/A-3 rather than regular PDF:

ReportFactory factory = new ReportFactory();
factory.getEnvironment().put("bfo-pdf-profile", "PDF/A-3a");

Every Report created from this ReportFactory now has this environment property set. Or it can also be done on a per-report basis:

ReportFactory factory = new ReportFactory();
Report report = factory.createReport();
report.getEnvironment().put("bfo-pdf-profile", "PDF/A-3a");

Adding stylesheets can be done in the same way, by calling either the ReportFactory.getUserStylesheets() or Report.getUserStylesheets() methods as appropriate - new URLs can be added to the returned list. In particular the Util.createDataURL() method is useful for creating a Stylesheet from text.

For example, to reproduce the same functionality as the getEnvironment() example shown above with a stylesheet:

String stylesheet = "@bfo env { bfo-pdf-profile: \"PDF/A-3a\"; }";
URL2 uri = Util.createDataURL(MediaType.parse("text/css"), stylesheet);
factory.getUserStylesheets().add(uri);

This approach can also be used to preconfigure BFO Publisher with a stylesheet referring to locally stored fonts: unlike your web browser, BFO Publisher does not make use of the system font directory, and we ship with a very limited selection.

A common setup would be a directory containing some of the Noto Fonts. Simply download the fonts you want into a directory, then:

factory.getUserStylesheets().add(URL2.parse("fonts:/path/to/fontdirectory"));

How does this work? BFO Publisher has a special handler for the URL scheme fonts - it’s similar to a file URL but always results in a CSS file describing the font(s) at that path. You can point it at a single font file or a directory, in which case it will recursively scan the directory for font files and generate a CSS file for them all. It will even set up a watch on that folder, so any fonts that are added or deleted will be automatically included when the URL is requested - making it suitable for long-running instances of a BFO Publisher web-service.

If you want to see how this works, the code below shows how to request the content of that URL so you can see it yourself. You can even re-add this as a data: URL using the approach shown in the previous example.

ReportFactory factory = new ReportFactory();
Report report = factory.createReport();
Blob blob = report.getResource(URL2.parse("fonts:/path/to/fontdirectory"));
System.out.println(blob.getString());  // Print the generated CSS
URL2 uri = Util.createDataURL(blob);
factory.getUserStylesheets().add(uri);

This approach is absolutely fine to use if you prefer, although of course the CSS is fixed and will not reflect any changes in the directory.

The code above demonstrates the Report.getResource() method, which can be used to retrieve objects from a URL exactly as the Report would do while parsing. As well as being useful for debugging - you can see exactly what data is being loaded - it allows you to retrieve the about:index file created during Index Generation. This must be done after the parse() method.

Report report = factory.createReport();
report.setReportOutput(new PDFReportOutput());
report.load(inputsource);
report.parse();
Blob blob = report.getResource(URL2.parse("about:index"));
System.out.println(blob.getString());  // This is the generated index as an XML file.

Customizing: extensions

The ReportFactory.getReportFactoryExtensions() method can be used to retrieve a list of ReportFactoryExtension objects which will be used to create the Reports. Several extensions are included by default, all of which will be instances of the various classes in the org.faceless.publisher.ext package.

The API documents for those classes detail how to configure each of them, and the list can be modified to add new extensions if required.

For example, the MailServerExtension can be added to allow email to be retrieved from an IMAP server and converted to PDF. Here’s an example showing how to do this from GMail - a more complete version and an explanation of how it works is given in the API docs for that class.

ReportFactory factory = new ReportFactory();
MailServerExtension ext = new MailServerExtension(factory);
factory.getReportFactoryExtensions().add(ext);

// Configure the MailServer extension to work with GMail
MailServerExtension.GMailHelper gmail = new MailServerExtension.GMailHelper();
gmail.setEmail(username);
gmail.setClientId("NNNNN.apps.googleusercontent.com");
gmail.setClientSecret("NNNNN");
gmail.setRedirectURI("http://127.0.0.1:8000/oauth");
Json json = gmail.create();
URL2 uri = ext.createStore(json);
Store store = ext.getStore(uri);

// We are connected to the server - retrieve a javax.mail.Message and load it
Folder folder = store.getFolder(foldername);
if (!folder.isOpen()) {
   folder.open(Folder.READ_ONLY);
}
Message message = folder.getMessage(num);
Report report = factory.createReport();
ReportOutput output = factory.createReportOutput(MediaType.parse("application/pdf"));
report.setReportOutput(output);
report.load(message);
report.parse();
output.write(new FileOutputStream("email.pdf"));

Customizing: URL Processors

Any resource loaded from a URL will be processed by all the URLConnectionProcessor objects registered with the ReportFactory, which can configure authentication, SSL parameters or similar. As with the previous section, URL processors can be retrieved by calling ReportFactory.getURLConnectionProcessors() which returns a list preconfigured with defaults; each item in the list can be removed or altered as its API allows.

For example, to load client certificates for use with all HTTPS requests made by BFO Publisher:

ReportFactory factory = new ReportFactory();
URL2 url = URL2.parse("file:/path/to/clientkeystore.jks");
char[] password = "secret".toCharArray();
SSLURLConnectionProcessor processor = new SSLURLConnectionProcessor(null, url, password, factory);
factory.getURLConnectionProcessors().add(processor);

The available subclasses of URLConnectionProcessor will be clear from the API documentation - each of them details how they can be used to alter any network connection made by BFO Publisher.

Secrets and Authentication

There are some situations where sensitive information is required that may not be appropriate to embed in the document being converted - for example, when creating a PDF that contains a digital signature, the password for the signing key may need to be kept secret.

There are two ways to deal with this.

  • the information can be specified in advance using Environment variables

  • the information can be requested on demand by way of a Callback.

Environment variables have already been discussed. Take this example, which refers to a keystore but doesn’t specify the password to access it:

<html>
 <body>
  This document contains a digital signature
  <object type="bfo/signature">
   <param name="keystore" value="path/to/keystore.pkcs12">
   <param name="alias" value="myidentity">
   <img src="https://test.com/secret/signature.svg">
  </object>
 </body>
</html>

The environment variable bfo-ext-signature-password will be checked for the missing password, as described in Signature defaults.

Another situation where secret information may be required is to access a resource that requires a password - in this example, let’s presume the SVG requires an HTTP username and password to access it. This can’t be specified with an environment variable; it has to be done using a PasswordAuthenticationURLConnectionProcessor (see Customizing: URL Processors). Other similar classes exist for other types of authentication, such as OAuth2 and AWS4. Note that all of these URL Connection Processor classes will defer to an appropriate Callback if they can’t supply the required information.

Callbacks

When information is not supplied in advance, it needs to be requested on demand by asking the user.

Both the Report and ReportFactory have a setCallbackHandler method designed to support this. It takes a standard Java CallbackHandler which, if specified, will be called to retrieve the required information on demand.

This will be called whenever a password or other secret information is required to access a resource. Callbacks are typically the standard PasswordCallback and NameCallback, but we’ve also added our own OAuth2Callback to this list which is used when requesting content protected by OAuth2.

To show a simple example, the command-line client we ship includes a CallbackHandler to prompt for information at the command line.

import javax.security.auth.callback.*;
import java.io.Console;

ReportFactory factory = new ReportFactory();
factory.setCallbackHandler(new CallbackHandler() {
    public void handle(Callback[] callbacks) {
        Console console = System.console();
        for (Callback cb : callbacks) {
            if (cb instanceof NameCallback) {
                NameCallback ncb = (NameCallback)cb;
                System.out.print(ncb.getPrompt() + ": ");
                ncb.setName(console.readLine());
            } else if (cb instanceof PasswordCallback) {
                PasswordCallback pcb = (PasswordCallback)cb;
                System.out.print(pcb.getPrompt() + ": ");
                pcb.setPassword(console.readPassword());
            }
        }
    }
});

Converting the HTML shown above will result in a prompt on the command line to "Enter password". If no CallbackHandler were specified or if the user didn’t enter a password, signing will fail.
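
For unattended use, where nobody is sitting at a console, a handler can answer from values obtained elsewhere. The following is a minimal sketch using only the standard javax.security.auth.callback classes; the class name FixedCredentialsHandler is hypothetical and not part of BFO Publisher, and it assumes only name and password callbacks need answering.

```java
import java.io.IOException;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.callback.UnsupportedCallbackException;

// Hypothetical helper: answers name/password callbacks from fixed values
// instead of prompting. Not part of the BFO Publisher API.
final class FixedCredentialsHandler implements CallbackHandler {
    private final String name;
    private final char[] password;

    FixedCredentialsHandler(String name, char[] password) {
        this.name = name;
        this.password = password.clone();
    }

    @Override
    public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
        for (Callback cb : callbacks) {
            if (cb instanceof NameCallback) {
                ((NameCallback) cb).setName(name);
            } else if (cb instanceof PasswordCallback) {
                // PasswordCallback keeps its own copy of the array
                ((PasswordCallback) cb).setPassword(password);
            } else {
                // Anything else (for example an OAuth2 callback) is not handled here
                throw new UnsupportedCallbackException(cb);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        NameCallback nc = new NameCallback("Name");
        PasswordCallback pc = new PasswordCallback("Enter password", false);
        new FixedCredentialsHandler("alice", "secret".toCharArray())
            .handle(new Callback[] { nc, pc });
        System.out.println(nc.getName() + " / " + pc.getPassword().length + " password chars");
    }
}
```

An instance of a class like this could be passed to setCallbackHandler in place of the console-based handler above.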

Although names and passwords are the most obvious case, some callbacks may just require a name, and it’s possible to combine Environment variables and Callbacks as required. For example, the GlobalSign QSS signing service requires several keys: keystore, apikey and apisecret are all used to specify information required to use the service, and that may be site-wide - they apply to the organization as a whole. The identity key is also required to identify the individual signer.

If we presume the following environment variables have been set:

  • bfo-ext-signature-engine to globalsign.qss

  • bfo-ext-signature-keystore to path/to/keystore.pkcs12#password=secret

  • bfo-ext-signature-apikey to …​ (a value supplied by GlobalSign)

  • bfo-ext-signature-apisecret to …​ (again, supplied by GlobalSign)

then the only required parameter is identity. With an appropriate CallbackHandler set, this will be the only information the signer is prompted for:

<html>
 <body>
  This document contains a digital signature
  <object type="bfo/signature">The Signature</object>
 </body>
</html>

Note that the Web Service is configured with a CallbackHandler that forwards to the client, provided the client is connected over a web-socket. See WebSockets for more information.

Web Service

You’re running BFO Publisher version ${VERSION} as a live web service with the prefix ${BASEPATH}/. This section of the documentation is interactive - it will guide you through the process of generating a PDF.

BFO Publisher can be run as a web service. This section describes the service: if you were reading it as part of an installed copy of BFO Publisher it would be interactive, allowing you to test the web service in real-time. So we recommend you download BFO Publisher and view this content from there. Until then, we’ll use http://example.com to represent the path to the service.

See https://github.com/bfocom/publisher-extra for APIs that work with the web-service.

Getting started with HTTP, JSON and CBOR

HTTP forms

Let’s start with a trivial example. Here’s a form field - type some HTML into it and click convert. The PDF will be loaded in this window, just click the back button when you’re done.

<html> <style> address::after { content: " " env(bfo-publisher-version) } </style> Hello, world! <address>Created with BFO Publisher</address> </html>

Now tick the checkbox to request that the PDF is returned as a redirect, rather than immediately. Then click convert again.

When you click convert a request is made to ${BASEURL}/convert with the following fields:

  • put.0.content - the content of your field

  • put.0.path - the value file.html

  • redirect - true if the checkbox is ticked, false otherwise.

The request looks like this:

POST ${BASEPATH}/convert HTTP/1.1
Content-Type: application/x-www-form-urlencoded

redirect=true&put.0.path=file.html&put.0.content=…​

and the response from the Publisher web service looks like this:

HTTP/1.1 200 OK
Content-Type: application/json

{"ok":true,"paths":{
  "base":"store/7W4ABW2UySxh1ra4",
  "log":"store/7W4ABW2UySxh1ra4/log.txt",
  "output":"store/7W4ABW2UySxh1ra4/output.pdf"
}}

Every response will include a boolean ok value (which is false if something went wrong). Assuming it’s true, the response from /convert contains paths for at least the logs and the generated output file (they’re relative to the current request). To retrieve the output, make a standard GET request:

GET ${BASEPATH}/store/7W4ABW2UySxh1ra4/output.pdf HTTP/1.1
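
The form-encoded request body and the relative paths in the response can both be handled with the standard Java library. This is only a sketch: the class name ConvertRequest is hypothetical, and http://example.com stands in for the service URL as elsewhere in this section.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of BFO Publisher.
final class ConvertRequest {
    /** Encodes form fields as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Resolves a path from the "paths" object against the URL the request was sent to. */
    static URI resolve(String requestUrl, String relativePath) {
        return URI.create(requestUrl).resolve(relativePath);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("redirect", "true");
        fields.put("put.0.path", "file.html");
        fields.put("put.0.content", "<html>Hello, world!</html>");
        System.out.println(formBody(fields));
        System.out.println(resolve("http://example.com/convert",
                                   "store/7W4ABW2UySxh1ra4/output.pdf"));
    }
}
```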

As well as regular HTTP POST, it’s possible to use multipart/form-data to upload files for conversion. Click this button to select a file then click convert - as before, the PDF will display in this window so just click the back button when you’re done.

With multipart/form-data the request looks like this:

POST ${BASEPATH}/convert HTTP/1.1
Content-Type: multipart/form-data;boundary=---ZdboxU5pTD41YBew

-----ZdboxU5pTD41YBew
Content-Disposition: form-data; name="put.0"; filename="file.html"
Content-Type: text/html
..

-----ZdboxU5pTD41YBew--

and the response will be the same as the response above. The file to convert is specified as put.0, and any resources that may be needed by that file for conversion (stylesheets, images etc) may be sent as put.1, put.2 and so on. You can easily convert files this way using the curl command-line tool - assuming the file to convert is file.html:

curl --form redirect=true --form put.0=@file.html --form put.1=@logo.png \
  ${BASEURL}/convert

JSON/CBOR

In both the examples above we’ve used HTML forms to upload data, but we recommend JSON (RFC8259) or ideally CBOR (RFC8949). HTTP uploads are converted to one of these before processing - both of the above requests are essentially identical to this:

curl --request POST --header "Content-Type:application/json" ${BASEURL}/convert --data @- <<EOF
{
  "redirect": true,
  "put": [
    {
      "path": "file.html",
      "content_type": "text/html",
      "content": …​
    },
    {
      "path": "logo.png",
      "content_type": "image/png",
      "content": …​
    }
  ]
}
EOF

POST ${BASEPATH}/convert HTTP/1.1
Content-Type: application/json

{
  "redirect": true,
  "put": [
    {
      "path": "file.html",
      "content_type": "text/html",
      "content": …​
    },
    {
      "path": "logo.png",
      "content_type": "image/png",
      "content": …​
    }
  ]
}

It should now be clear why the HTML form fields were called put.0.path or put.0 - they’re mapped to the corresponding property paths in the JSON object.
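
To make the mapping concrete, here is a rough sketch of how dotted field names could be expanded into a nested structure. It illustrates the idea only and is not how BFO Publisher itself is implemented; the class PropertyPaths is hypothetical, and values are left as strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "put.0.path" style property paths.
final class PropertyPaths {
    /** Stores value at a dotted path inside root, creating maps and lists as needed. */
    @SuppressWarnings("unchecked")
    static void set(Map<String, Object> root, String path, Object value) {
        String[] keys = path.split("\\.");
        Object node = root;
        for (int i = 0; i < keys.length; i++) {
            boolean last = (i == keys.length - 1);
            Object child;
            if (last) {
                child = value;
            } else if (isIndex(keys[i + 1])) {
                child = new ArrayList<Object>();             // next key is numeric: a list
            } else {
                child = new LinkedHashMap<String, Object>(); // otherwise: an object
            }
            if (node instanceof List) {
                List<Object> list = (List<Object>) node;
                int index = Integer.parseInt(keys[i]);
                while (list.size() <= index) {
                    list.add(null);                          // pad so the index exists
                }
                if (last || list.get(index) == null) {
                    list.set(index, child);
                }
                node = list.get(index);
            } else {
                Map<String, Object> map = (Map<String, Object>) node;
                if (last) {
                    map.put(keys[i], child);
                } else {
                    map.putIfAbsent(keys[i], child);
                }
                node = map.get(keys[i]);
            }
        }
    }

    private static boolean isIndex(String key) {
        return !key.isEmpty() && key.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        Map<String, Object> root = new LinkedHashMap<>();
        set(root, "redirect", "true");
        set(root, "put.0.path", "file.html");
        set(root, "put.0.content", "<html>Hello</html>");
        System.out.println(root);
    }
}
```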

Files sent as part of an application/json encoded message may be Base64 encoded - the decoder will try that first, falling back to plain UTF-8 if Base64 decoding fails. But by far the best solution is to use application/cbor as the Content-Type and encode the JSON structure as CBOR. This is faster and more compact than JSON, and as it contains a native byte-buffer type no Base64 encoding is required.
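
The "try Base64 first, then fall back to UTF-8" behaviour might look something like the sketch below. This is our illustration of the rule as described, not the service's actual decoder; the class name ContentDecoding is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of a Base64-first content decoder.
final class ContentDecoding {
    /** Returns the Base64-decoded bytes, or the UTF-8 bytes of the text if it is not valid Base64. */
    static byte[] decode(String content) {
        try {
            return Base64.getDecoder().decode(content);
        } catch (IllegalArgumentException notBase64) {
            return content.getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        byte[] binary = { 0, 1, 2, (byte) 255 };
        String encoded = Base64.getEncoder().encodeToString(binary);
        System.out.println(decode(encoded).length);               // binary content, decoded
        System.out.println(decode("<html>Hello</html>").length);  // "<" is not Base64, so UTF-8
    }
}
```

With any scheme like this, a short plain-text file that happens to consist only of Base64 characters would be decoded as Base64, which is one more reason to encode explicitly or to use CBOR.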

From now on all examples will be formatted as JSON for clarity, but CBOR is the recommended encoding for all communication with BFO Publisher. If a request is submitted in CBOR the response will also be CBOR. In all other cases, the response will be JSON. There are CBOR libraries for most languages; if you’re using Java we recommend https://github.com/faceless2/json, which is included as part of BFO Publisher.

The Store

The examples above included a put array, so called because they PUT a file to the Store before conversion. The Store is essentially a simple virtual filesystem - files are first uploaded to it, then BFO Publisher retrieves them for conversion before writing the results back to the Store. They’re then retrieved with a GET, and eventually deleted with DELETE.

The URL for the store looks like ${BASEURL}/store

Files must be stored in a folder - generally (but not necessarily) a conversion will be scoped to a single folder. In the examples above the folder was created automatically, but if you prefer a CRUD interface to upload you can do that too. Pick a random folder name (no slashes, spaces or special characters), then upload the file - for example, with this curl command the following HTTP request would be sent:

curl --header "Content-Type:text/html" --upload-file file.html \
  ${BASEURL}/store/7W4ABW2UySxh1ra4/file.html

PUT ${BASEPATH}/store/7W4ABW2UySxh1ra4/file.html HTTP/1.1
Content-Type: text/html

…​

Files cannot be overwritten: if you need to, you must delete them first. Although a folder name must use a limited range of characters (see below) there are no restrictions on the path within that folder.
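
If you are choosing folder names yourself, something like the following will do. It is a sketch that assumes plain ASCII letters and digits are acceptable (the exact permitted character range is described elsewhere); the class name FolderNames is hypothetical.

```java
import java.security.SecureRandom;

// Hypothetical helper for picking a hard-to-guess folder name.
final class FolderNames {
    // Assumption: letters and digits only, so no slashes, spaces or special characters
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    static String random(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(random(16));
    }
}
```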

Once uploaded, you can run the conversion by specifying the URL of the file to convert with the url property, and the folder to write the output to with the folder property. URLs may be absolute or relative; relative URLs are resolved against the folder, so to refer to the file we uploaded in the previous step we set folder to the folder we created and url to the path within it.

POST ${BASEPATH}/convert HTTP/1.1
Content-Type: application/json

{"redirect":true, "folder":"7W4ABW2UySxh1ra4", "url":"file.html"}

A standard HTTP GET can be used to retrieve any file. The directory listing of files in a folder can be retrieved by requesting the folder path:

GET ${BASEPATH}/store/7W4ABW2UySxh1ra4 HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json

[
 {
  "path": "log.txt",
  "length": 3922,
  "grants": { "all": [ "" ] },
  "expiry": 604800000
 }, {
  "path": "file.html",
  "length": 29,
  "content_type": "text/html",
  "grants": { "all": [ "" ] },
  "expiry": 604800000
 }, {
  "path": "output.pdf",
  "length": 4033,
  "content_type": "application/pdf",
  "ok": true,
  "grants": { "all": [ "" ] },
  "expiry": 604800000
 }
]

It’s possible to delete an individual file in the folder - or the entire folder - with an HTTP DELETE:

curl --request DELETE ${BASEURL}/store/7W4ABW2UySxh1ra4

DELETE ${BASEPATH}/store/7W4ABW2UySxh1ra4 HTTP/1.1

Deleting the folder or the output file within a folder will interrupt any conversion that’s currently running.
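
The same upload, convert, retrieve and delete sequence can be built with the java.net.http classes from Java 11. This sketch only constructs the requests; sending them with an HttpClient is left out. The class name StoreRequests is hypothetical, http://example.com stands in for the service URL, and the folder and path are assumed to contain no characters that need escaping in a URL or in JSON.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

// Hypothetical helper that builds the Store requests described above.
final class StoreRequests {
    private final String base;   // for example "http://example.com"

    StoreRequests(String base) {
        this.base = base;
    }

    /** PUT a file into a folder in the Store. */
    HttpRequest put(String folder, String path, String contentType, byte[] data) {
        return HttpRequest.newBuilder(URI.create(base + "/store/" + folder + "/" + path))
            .header("Content-Type", contentType)
            .PUT(BodyPublishers.ofByteArray(data))
            .build();
    }

    /** POST to /convert, referring to a file already uploaded to the folder. */
    HttpRequest convert(String folder, String url) {
        String json = "{\"redirect\":true, \"folder\":\"" + folder + "\", \"url\":\"" + url + "\"}";
        return HttpRequest.newBuilder(URI.create(base + "/convert"))
            .header("Content-Type", "application/json")
            .POST(BodyPublishers.ofString(json))
            .build();
    }

    /** GET a file (or, with an empty path, the folder listing). */
    HttpRequest get(String folder, String path) {
        String suffix = path.isEmpty() ? "" : "/" + path;
        return HttpRequest.newBuilder(URI.create(base + "/store/" + folder + suffix)).GET().build();
    }

    /** DELETE the whole folder. */
    HttpRequest delete(String folder) {
        return HttpRequest.newBuilder(URI.create(base + "/store/" + folder)).DELETE().build();
    }

    public static void main(String[] args) {
        StoreRequests store = new StoreRequests("http://example.com");
        HttpRequest request = store.put("7W4ABW2UySxh1ra4", "file.html", "text/html", new byte[0]);
        System.out.println(request.method() + " " + request.uri());
    }
}
```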

And finally, because the CRUD interface can be a little clumsy - particularly when using PUT to upload a file - it’s possible to do all these operations with a POST of application/json or application/cbor to the /store URL. Both get and delete require only the path property, whereas put takes the other properties defined for the put array specified to /convert. Here are some equivalents of the above examples:

Equivalent to GET ${BASEPATH}/store/7W4ABW2UySxh1ra4/output.pdf:

POST ${BASEPATH}/store HTTP/1.1
Content-Type: application/json

{"type":"get", "path":"7W4ABW2UySxh1ra4/output.pdf"}

Equivalent to DELETE ${BASEPATH}/store/7W4ABW2UySxh1ra4:

POST ${BASEPATH}/store HTTP/1.1
Content-Type: application/json

{"type":"delete", "path":"7W4ABW2UySxh1ra4"}

Equivalent to PUT ${BASEPATH}/store/7W4ABW2UySxh1ra4/file.html:

POST ${BASEPATH}/store HTTP/1.1
Content-Type: application/json

{
  "type": "put",
  "content_type": "text/html",
  "path": "7W4ABW2UySxh1ra4/file.html",
  "content": …​
}

Some other things to note about the Store:

  • Uploads in all formats are streamed rather than decoded in memory. If you want to upload a 10GB file, it will be read from the stream (raw, CBOR, JSON or HTML encoded) and passed directly to the underlying Store, the default implementation of which will write it directly to disk.

  • If a Content-Type is not specified it will be sniffed from the data, which is necessarily imperfect. To be certain, always specify the type yourself.

  • Any undeleted files and folders will eventually expire some time after their last access, and be deleted automatically. Details on this are in the Configuration section.

  • Retrieving a file that hasn’t yet been created - for example, retrieving the output path while the conversion is still running - returns immediately with HTTP code 202. Retrieving an output path if conversion failed will return HTTP code 500; the logs path will provide detail.

  • Retrieving the logs path before conversion completes will return the existing log data and keep the connection open, with new log messages being streamed as they’re available. When conversion completes the connection is closed.

  • The default FileStore implementation will checksum uploads and de-duplicate them on disk automatically, so uploading a file multiple times will not increase disk space (although of course it will increase network traffic).
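
The status codes described above suggest a simple polling loop for clients that don't want to follow the log stream. This is a sketch under stated assumptions: a finished output is assumed to return 200, the class name OutputPolling is hypothetical, and the status lookup is passed in as a function so the logic can be tried without a running service.

```java
import java.util.function.IntSupplier;

// Hypothetical polling helper based on the 202 / 500 behaviour described above.
final class OutputPolling {
    enum Result { READY, FAILED, TIMED_OUT }

    /**
     * Calls fetchStatus (for example, a GET on the output path that returns the
     * HTTP status code) until the output is ready, conversion fails, or the
     * attempts run out.
     */
    static Result waitFor(IntSupplier fetchStatus, int maxAttempts, long delayMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int status = fetchStatus.getAsInt();
            if (status == 200) {
                return Result.READY;       // assumption: 200 once the file exists
            }
            if (status != 202) {
                return Result.FAILED;      // 500 means conversion failed; check the log
            }
            try {
                Thread.sleep(delayMillis); // 202: not created yet, try again shortly
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Result.TIMED_OUT;
            }
        }
        return Result.TIMED_OUT;
    }

    public static void main(String[] args) {
        int[] statuses = { 202, 202, 200 };  // simulated responses
        int[] next = { 0 };
        System.out.println(waitFor(() -> statuses[next[0]++], 5, 10));
    }
}
```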

Administration

The BFO Publisher web service is designed to be completely configured remotely, which allows identical instances to be deployed from a single virtual-machine image. The following administration tasks can be run:

Pause, Resume and Shutdown

The BFO Publisher web service can be paused at any point - new jobs will be accepted but not started, although running jobs will continue. Resuming will start any jobs that were queued while paused. Both are simple GET requests and take no parameters.

Shutdown will pause BFO Publisher, wait for any running jobs to complete, then shut down the server. If no jobs are running the server is shut down immediately, so no response is typically received.

GET ${BASEPATH}/admin/pause HTTP/1.1

GET ${BASEPATH}/admin/resume HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json

{"ok":true, "message": "Resumed"}

GET ${BASEPATH}/admin/shutdown HTTP/1.1

Status

The internal status of the BFO Publisher web service can be retrieved by calling /status or /admin/status (the latter gives detail on authorization keys, so would typically be restricted - see Access Control). As above, this is a simple GET request and takes no parameters.

GET ${BASEPATH}/admin/status HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json

{
  "class": "ServiceEngine",
  "version": "work-42074M-20220113T010928",
  "secure_mode": true,
  "started": 1642036172188,
  "paused": false,
  "stats": {
    "num_queued": 0,
    "num_running": 0,
    "max_queued": 0,
    "max_running": 0,
    "jobcount": {
      "started": 0,
      "cancelled": 0,
      "completed": 0,
      "failed": 0
    }
  },
  "max_threads": 10,
  "max_queue_size": 50,
  "max_buffer_size": 1048576,
  "authorities": [
    {
      "key": "none",
      "grants": [ "*" ]
    }
  ],
  "egress_filter": [
     "no-files",
     "no-local",
     "default"
  ],
  "url_processors": [
    {
      "type": "ssl-default"
    }
  ],
  "store": {
    "classname": "org.faceless.publisher.web.FileStore",
    "default_expiry": 604800000,
    "max_expiry": 9223372036854776000
  },
  "cache_size": 200,
  "ok": true
}

Configuration

Configuration of the web service is done by a POST call to /admin/configure. Configuration can only be performed when no jobs are running - an error will be returned if that’s not the case when /admin/configure is called. The engine can be paused and any existing jobs allowed to complete before trying again.

The value supplied to /admin/configure is a JSON object, with the keys listed at admin/configure (request). All are optional: specified keys will completely replace any existing value. Unspecified keys will remain unchanged, and all properties can be changed while the server is running but paused; so certificates on a long-running instance can be updated if required.
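
The "replace, don't merge" rule is worth spelling out, because it means a list such as authorities has to be sent in full each time. Here is a small model of that rule using plain maps; it illustrates the semantics described above and is not the service's own code. The class name ConfigureSemantics is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of how /admin/configure keys are applied.
final class ConfigureSemantics {
    /** Keys in the request replace the whole existing value; other keys are kept. */
    static Map<String, Object> apply(Map<String, Object> current, Map<String, Object> request) {
        Map<String, Object> result = new LinkedHashMap<>(current);
        result.putAll(request);   // whole-value replacement, no deep merge
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> current = new LinkedHashMap<>();
        current.put("max_threads", 10);
        current.put("authorities", List.of("none"));

        Map<String, Object> request = new LinkedHashMap<>();
        request.put("authorities", List.of("admin", "user"));

        // max_threads is untouched; the old "none" authority is gone
        System.out.println(apply(current, request));
    }
}
```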

Access Control

By default the web service requires no authentication for any action - anonymous users can perform any action, such as /convert or /admin/configure. In many cases an administrator would want to place some limits on some of these.

The ability to perform any particular action on the web service is called a Grant. One or more Authorities can be set on the web service, each of which can approve a range of Grant actions as requested by the user. The web service uses JSON Web Tokens (JWT, RFC7519) as a way of requesting these grants, very similar to the approach used with OAuth2.

Let’s start with the initial configuration. An excerpt from the /admin/status of a new instance of the web service shows one Authority:

"authorities": [
    {
      "key": "none",
      "grants": [ "**" ]
    }
  ]

The key value shows that no key is required - it authorizes anonymous access - and that it will approve any grant. A ** matches any path and a * matches any path segment (segments are divided with /). Any HTTP or WebSocket requests may include up to the default amount of data, which is set with the max_buffer_size configuration property.
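
The difference between * and ** is easiest to see by trying a few patterns. The matcher below is our own sketch of the rule as described (it converts a grant pattern to a regular expression) and is not taken from the web service; the class name GrantPatterns is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical matcher for grant patterns: "**" = any path, "*" = one path segment.
final class GrantPatterns {
    static boolean matches(String pattern, String action) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*' && i + 1 < pattern.length() && pattern.charAt(i + 1) == '*') {
                regex.append(".*");       // "**" may cross "/" boundaries
                i++;
            } else if (c == '*') {
                regex.append("[^/]*");    // "*" stays within one segment
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), action);
    }

    public static void main(String[] args) {
        System.out.println(matches("**", "admin/configure"));
        System.out.println(matches("*", "convert"));
        System.out.println(matches("*", "admin/configure"));
        System.out.println(matches("admin/*", "admin/pause"));
    }
}
```

Under this reading a grant of * covers single-segment actions such as convert but none of the admin/ actions, because those contain a slash.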

Let’s remove this Authority and replace it with two more - one that allows everything, and one that disallows admin rights. For this we’ll need to generate two JSON Web Keys (JWK). The easiest way to do this is to go to https://mkjwk.org and create two symmetric keys (oct in JWK terminology) using the HS256 algorithm - call one admin and one user. Here’s what we generated:

{
    "kty": "oct",
    "kid": "admin",
    "k": "MbDPJl_WUCDe0HI5ag-czbCXK0_X5iOikg_0GQfTN7IHcLsbsLbiCN2TeCsryKUzgc6aw9kI5vHR0-3BsfyAZwLkJmPnawRwJ0UV8aOspmZlYteWFf7YO3kM1szH6k6C6FwOifLnWreSiu8gFuW7e78aYQPHWQKhhuFcC6oBaXNhYm-ghZrZXA082c-xiJWD2KAJhJGJW2nEHY0NDm9Ae1ZAz7MiKZSDvoDoCViCqCdpznqP_gmniGKSTsXDyW5YcxraMvgwEUgvSRdPovuaKR6cPkXFF8XjL8zRjIqDjlCYYYHZPg_99__HBU9kOKBEXiqI20rXyofuuZ1Du2hhFw",
    "alg": "HS256"
}

{
    "kty": "oct",
    "kid": "user",
    "k": "xqF2PECEf9lzb-B2FOEM6L5SpQRbZHza5_IPAoae-d8AvBDNxb2nd6GhUzKLM61pXsNObJZoUc28Vcihgm3sDsH5Qu-uNGJzjOfz-w0R17qNVZCcUsFdGSxzkbHxUAdvN_rT5pU_9EapIuuEtaaaG8KSD1aenXlBpxxEfaKkGDvqeiYZjnkCx-qX5cFSXTlMo-mVxF8bdT4v9mzR13yIC1BAbexDep4E1Z5vhxCsQujpd1M2Wge20LXuAT4tPmPo86WqkTxPU_SLUAT1fvNUi1vUM0oUrQYTT6wocBX3fPktpqQ8VTUcx2rotNFiOFpMqhwxZKZoVg_A5A5bWxWRVg",
    "alg": "HS256"
}

Now we need to set those keys as Authorities on the web service. The first key, admin, we will allow to approve all grants. The second key user can approve only non-admin grants. The grants largely correspond to the actions defined in the web service, and are currently:

http

Access the web service over HTTP (either this or ws must be granted)

ws

Access the web service over WebSockets

get

Call the get action (the GET method on the store)

put

Call the put action (the PUT method on the store)

delete

Call the delete action (the DELETE method on the store)

convert

Call the convert action

proxy

Call the proxy action to retrieve data from an external URL

status

Retrieve the status of BFO Publisher, excluding any security-related properties

admin/pause

Pause the BFO Publisher service

admin/resume

Resume the BFO Publisher service

admin/shutdown

Shut down the BFO Publisher service, completing all running tasks first

admin/status

Retrieve the status of BFO Publisher, including any security-related properties

admin/configure

Configure BFO Publisher

admin/trusted

Mark a resource added to the store as a trusted resource

bearer/nnn

Access a file in the Store created by bearer token nnn (see Permissions)

So to run all "non-admin" actions over HTTP or WebSockets, a user would need to be granted ["http", "ws", "get", "put", "delete", "convert"] - or, to match any non-admin actions, you can use the grant ["*"]. POST the following message to admin/configure.

POST ${BASEPATH}/admin/configure HTTP/1.1
Content-Type: application/json

{
  "authorities": [
    {
      "key": {
        "kty": "oct",
        "kid": "admin",
        "k": "MbDPJl_WUCDe0HI5ag-czbCXK0_X5iOikg_0GQfTN7IHcLsbsLbiCN2TeCsryKUzgc6aw9kI5vHR0-3BsfyAZwLkJmPnawRwJ0UV8aOspmZlYteWFf7YO3kM1szH6k6C6FwOifLnWreSiu8gFuW7e78aYQPHWQKhhuFcC6oBaXNhYm-ghZrZXA082c-xiJWD2KAJhJGJW2nEHY0NDm9Ae1ZAz7MiKZSDvoDoCViCqCdpznqP_gmniGKSTsXDyW5YcxraMvgwEUgvSRdPovuaKR6cPkXFF8XjL8zRjIqDjlCYYYHZPg_99__HBU9kOKBEXiqI20rXyofuuZ1Du2hhFw",
        "alg": "HS256"
      },
      "grants": [ "**" ]
    },
    {
      "key": {
        "kty": "oct",
        "kid": "user",
        "k": "xqF2PECEf9lzb-B2FOEM6L5SpQRbZHza5_IPAoae-d8AvBDNxb2nd6GhUzKLM61pXsNObJZoUc28Vcihgm3sDsH5Qu-uNGJzjOfz-w0R17qNVZCcUsFdGSxzkbHxUAdvN_rT5pU_9EapIuuEtaaaG8KSD1aenXlBpxxEfaKkGDvqeiYZjnkCx-qX5cFSXTlMo-mVxF8bdT4v9mzR13yIC1BAbexDep4E1Z5vhxCsQujpd1M2Wge20LXuAT4tPmPo86WqkTxPU_SLUAT1fvNUi1vUM0oUrQYTT6wocBX3fPktpqQ8VTUcx2rotNFiOFpMqhwxZKZoVg_A5A5bWxWRVg",
        "alg": "HS256"
      },
      "grants": [ "*" ]
    }
  ]
}
HTTP/1.1 200 OK
Content-Type: application/json

{"ok":true,"code":200,"message":"Engine updated"}

The previous Authority that granted anonymous access has been replaced by these two new Authorities. We can see the effect of this immediately:

GET ${BASEPATH}/admin/status HTTP/1.1
HTTP/1.1 401 Unauthorized
Content-Type: application/json

{"ok":false,"message":"Unauthorized"}

To access the service now, a JSON Web Token must be generated and signed by one of the keys above. We can do this online at https://jwt.io. Copy the value of the user key into the box for the secret key on that website, and tick secret key base64 encoded. Then enter the following into the payload box:

{
  "name": "BFO Publisher User"
}

[Screenshot: jwt.io]

Or if you’d prefer to do this with an API, here’s how to use our open-source JWT library at https://github.com/faceless2/json:

import com.bfo.json.*;

public class MakeJWT {
    public static void main(String[] args) throws Exception {
        JWK userkey = new JWK(Json.read("{ \"kty\": \"oct\", \"kid\": \"user\", \"k\": \"xqF2PECEf9lzb-B2FOEM6L5SpQRbZHza5_IPAoae-d8AvBDNxb2nd6GhUzKLM61pXsNObJZoUc28Vcihgm3sDsH5Qu-uNGJzjOfz-w0R17qNVZCcUsFdGSxzkbHxUAdvN_rT5pU_9EapIuuEtaaaG8KSD1aenXlBpxxEfaKkGDvqeiYZjnkCx-qX5cFSXTlMo-mVxF8bdT4v9mzR13yIC1BAbexDep4E1Z5vhxCsQujpd1M2Wge20LXuAT4tPmPo86WqkTxPU_SLUAT1fvNUi1vUM0oUrQYTT6wocBX3fPktpqQ8VTUcx2rotNFiOFpMqhwxZKZoVg_A5A5bWxWRVg\", \"alg\": \"HS256\" }"));
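        // Create an empty token and add the claims to its payload
        JWT jwt = new JWT();
        jwt.getPayload().put("name", "BFO Publisher User");
        // Sign with the shared secret from the "user" key, using HS256
        jwt.sign(userkey.getSecretKey(), "HS256");
        // Print the signed token, ready to be used as the Bearer token
        System.out.println(jwt);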
    }
}

The generated JWT is passed to the web service as a Bearer token in the Authorization header, exactly as it would be with OAuth 2.0:

POST ${BASEPATH}/convert HTTP/1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJuYW1lIjoiQkZPIFB1Ymxpc2hlciBVc2VyIiwiZ3JhbnRzIjpbImh0dHAiLCJ3cyIsImNvbnZlcnQiLCJnZXQiLCJwdXQiLCJkZWxldGUiXX0.xL0Pk5FPnGbWiiFG_3ICLYbnVpcn-ZZ_0jFUcoRQBYc
Content-Type: application/json

{"url": "https://bfo.com"}
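The same request can be built from Java with the JDK's own HTTP client (Java 11 or later). This is a minimal sketch rather than part of the BFO Publisher API: the base path, port and token passed in main are placeholders, and the url value is assumed to need no JSON escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ConvertRequestSketch {
    /** Builds (but does not send) a POST to the convert action with the JWT as a Bearer token. */
    static HttpRequest build(String basePath, String jwt, String url) {
        // Mirrors the JSON body shown above; 'url' is assumed to contain no quotes or backslashes.
        String body = "{\"url\": \"" + url + "\"}";
        return HttpRequest.newBuilder(URI.create(basePath + "/convert"))
                .header("Authorization", "Bearer " + jwt)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical base path and placeholder token, for illustration only.
        HttpRequest request = build("http://localhost:8080", "header.payload.signature", "https://bfo.com");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray()); with redirect left at its default the response body would be the generated file.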

That’s the basic flow of Authentication, so now we can go into a bit more detail.

  • A request to the web service will be matched against each Authority in sequence. Any Authority whose key signed the JWT can approve any of the requested grants associated with it, and authorization can be granted by more than one Authority - the approved set of grants is the union of all that match. Likewise, the max_buffer_size for a user is the maximum of the max_buffer_size for each matching Authority.

  • Anonymous access to the web service is allowed by an authority with a key of none. This will match any request (with or without a Bearer token). If this Authority exists its grants will be merged with any grants from other Authorities that match the supplied JWT.

  • The * symbol can be used in an Authority’s grants to match any sequence of characters other than /, and the ** symbol matches any sequence of characters including /. So for example,

    • grants: [ "*" ] matches get but not admin/pause

    • grants: [ "admin/*" ] matches admin/pause but not get

    • grants: [ "*", "admin/*"] or grants: [ "**" ] would match both (the second one matches everything).

    • grants: [ "bearer/*" ] matches any Bearer token - this allows downloading or deleting files from the Store created by other users.

  • Any properties (claims, in the language of JWT) can be set in the payload of the JWT, but only the nbf and exp properties are recognised. nbf and exp stand for not-before time and expiry time, and are standard JWT properties to limit the validity of the token - they are both seconds since the UNIX epoch. The name property, if set, will be used to identify a JWT when logging.

  • Any algorithm defined in the JWT core specification can be used - we’ve demonstrated the shared key approach above, but public/private RSA or Elliptic-Curve keys can be used as well.

  • Finally, although it’s best practice to generate a unique JWT for each user, this might not always be practical. For validation purposes, anything following a # in the JWT supplied in the Authorization header is ignored - so Authorization: Bearer eyJhBG…​RQBYc#my-unique-fragment would work in the example above.
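The wildcard rules for grants listed above can be illustrated with a short Java sketch. It is only an approximation of the behaviour described, not BFO Publisher's implementation: the class and method names are hypothetical, and it assumes that a wildcard may also match an empty string.

```java
import java.util.List;
import java.util.regex.Pattern;

final class GrantMatcherSketch {
    /** Translates a grant pattern to a regex: "**" matches anything, "*" anything except '/'. */
    static Pattern toRegex(String grant) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < grant.length(); i++) {
            char c = grant.charAt(i);
            if (c == '*' && i + 1 < grant.length() && grant.charAt(i + 1) == '*') {
                regex.append(".*");      // "**" crosses '/' boundaries
                i++;
            } else if (c == '*') {
                regex.append("[^/]*");   // "*" stops at '/'
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if any grant pattern held by an Authority matches the requested grant in full. */
    static boolean approves(List<String> grants, String requested) {
        return grants.stream().anyMatch(g -> toRegex(g).matcher(requested).matches());
    }

    public static void main(String[] args) {
        System.out.println(approves(List.of("*"), "get"));
        System.out.println(approves(List.of("*"), "admin/pause"));
        System.out.println(approves(List.of("**"), "admin/pause"));
    }
}
```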

Finally there are two other key aspects of an Authenticated workflow that do not apply to an anonymous workflow.

Permissions

Any file stored in the Store, whether uploaded by a user or generated by convert, is stored with a set of grants determining which users can GET or DELETE the file. As discussed above, a grant is part of the authorization process, and custom grants can be added to any authorization - for example, "grants": [ "http", "ws", "convert", "get", "put", "delete", "team-b" ] would grant a user access to all the normal grants required for creating PDFs, plus the team-b grant. Every user is also automatically given the set of grants from the JWT they presented for authorization.

By default, the only grant attached to any file is the Bearer token that created it. In the last example above, the file is stored with the following metadata:

{
  "grants": {
    "all":[
      "bearer/eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJuYW1lIjoiQkZPIFB1Ymxpc2hlciBVc2VyIiwiZ3JhbnRzIjpbImh0dHAiLCJ3cyIsImNvbnZlcnQiLCJnZXQiLCJwdXQiLCJkZWxldGUiXX0.xL0Pk5FPnGbWiiFG_3ICLYbnVpcn-ZZ_0jFUcoRQBYc"
    ]
  }
}

Any user presenting that Bearer token will be able to GET or DELETE the file - to use an analogy with UNIX file permissions, the file is stored with permission 0600, and it can be accessed only by the user that created it.

Allowing access to other users means specifying additional grants when the file is created - grants exist for get to allow downloading, delete to allow deletion and any to allow both. They can be specified when converting to control the grants required for the generated file.

The below JSON could be passed to /convert to allow any user to download the generated file, but only users with the team-b grant to delete it.

{"url": "http://bfo.com", "grants": { "get": [ "*" ], "delete": [ "team-b" ] } }

They can be specified when uploading files as part of a convert in a similar way.

{
  "url": "http://bfo.com",
  "put": [
    {
      "path": "style.css",
      "content_type": "text/css",
      "content": "...",
      "grants": { "any": [ "*" ] }
    }
  ]
}

But when uploading files to the store with an HTTP PUT, the Grants must be specified in the X-Grants HTTP header, as the PUT specifies the file directly. Here’s an example showing how to upload a file that could be downloaded by anyone, but only deleted by the user that creates it:

PUT ${BASEPATH}/store/publicfile/stylesheet.css HTTP/1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJuYW1lIjoiQkZPIFB1Ymxpc2hlciBVc2VyIiwiZ3JhbnRzIjpbImh0dHAiLCJ3cyIsImNvbnZlcnQiLCJnZXQiLCJwdXQiLCJkZWxldGUiXX0.xL0Pk5FPnGbWiiFG_3ICLYbnVpcn-ZZ_0jFUcoRQBYc
X-Grants: { "get": [ "*" ] }
Content-Type: text/css

…​

This is likely to be most useful when used with the extra_folders property for conversion: this way, shared resources can be put in a folder and made public but read-only.

Overrides and Defaults

The overrides and defaults properties on an Authority can be used to override or provide default values for any requests that it approves. This can be used to add a particular stylesheet to every conversion request perhaps, or to ensure that generated files are available for public download without the need to specify grants on every call to convert.

The syntax for both is a map containing other maps, one for each action they apply to (the only action currently defined is convert). The keys are applied before (for defaults) or after (for overrides) the properties specified in the call to convert.

Here’s an example showing how this could work. The goal here is to ensure that every call to convert has a particular User CSS Stylesheet applied to the conversion (which is presumed to have been uploaded earlier to a shared folder), and to set the default path to a keystore for any Digital Signatures. In addition, the grants property is overridden so that the resulting conversion is always available for users to download.

POST ${BASEPATH}/admin/configure HTTP/1.1
Content-Type: application/json

{
  "authorities": [
    {
      "key": …​
      "grants": …​
      "defaults": {
        "convert": {
          "extra_folders": [ "brandingFolder" ],
          "user_stylesheets": [ "branding-style.css" ],
          "env": {
           "bfo-ext-signature-keystore": "path/to/keystore.pkcs12",
           "bfo-ext-signature-password": "secret"
          }
        }
      },
      "overrides": {
        "convert": {
          "grants": { "get": [ "*" ] }
        }
      }
    }
  ]
}

Where the property specified in overrides or defaults is an array (for example, the list of user stylesheets in the above example), overrides and defaults behave slightly differently:

  • defaults will be inserted before any values specified by the user - the arrays will be merged.

  • overrides will completely replace any values specified by the user.
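The two rules for array properties can be sketched in a few lines of Java. The class and method names here are hypothetical and the code is an illustration of the described behaviour, not the service's own logic; a null argument stands for "not specified".

```java
import java.util.ArrayList;
import java.util.List;

final class ArrayPropertyMergeSketch {
    /**
     * Resolves the effective value of an array property such as the list of user stylesheets.
     * Overrides replace everything; otherwise defaults are placed before the user's values.
     */
    static List<String> resolve(List<String> defaults, List<String> requested, List<String> overrides) {
        if (overrides != null) {
            return List.copyOf(overrides);
        }
        List<String> merged = new ArrayList<>();
        if (defaults != null) {
            merged.addAll(defaults);
        }
        if (requested != null) {
            merged.addAll(requested);
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of("branding-style.css"), List.of("report.css"), null));
        System.out.println(resolve(List.of("branding-style.css"), List.of("report.css"), List.of("locked.css")));
    }
}
```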

Security

BFO Publisher inherently deals with URLs, retrieving them and converting them to PDF. There are security implications for this process which are described in this section.

The web service defaults to a secure configuration that disallows all access to the file-system or the local network.

What do we mean by security implications? To answer this we need to clarify two concepts:

  • a resource is anything that BFO Publisher needs to load to do its job: an HTML file, an image, a stylesheet etc.

  • an origin is where that resource was loaded from. HTML defines origin in more detail.

The resource and its origin are both identified by URLs, and each URL falls roughly into one of four zones.

  • public resources are HTTP, HTTPS (or theoretically some other network protocol) and are loaded from a public website. A public website is one with a fully qualified domain name, eg http://bfo.com, or a public IP address, eg http://94.228.42.132. The default filter ensures any public resource cannot access a local-network or file resource.

  • local-network resources are as above, but are loaded from a local, link-local or site-local address. These include URLs like http://127.0.0.1, http://192.168.0.1, http://[fe80::1]/ as well as unqualified domains like http://service and .local domains such as http://service.local. The default filter ensures any local-network resource cannot access a file resource.

  • file resources are loaded from a file or jar URL from the local filesystem.

  • trustworthy resources are loaded from some sort of internal structure that requires no filesystem or network access. The most obvious example is a data: or about: URL, but this also includes other schemes internal to BFO Publisher.
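As a rough illustration of the four zones, the sketch below classifies a URL using only the JDK. It is a simplification under stated assumptions, not the logic BFO Publisher uses: the class name is hypothetical, only the schemes named above are recognised, hostnames are never resolved through DNS, and a host is treated as local if it is a loopback, site-local or link-local IP literal, has no dot, or ends in .local.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Locale;

final class ZoneSketch {
    enum Zone { PUBLIC, LOCAL_NETWORK, FILE, TRUSTWORTHY }

    static Zone classify(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase(Locale.ROOT);
        switch (scheme) {
            case "file":
            case "jar":
                return Zone.FILE;
            case "data":
            case "about":
                return Zone.TRUSTWORTHY;
            default:
                break;
        }
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in " + url);
        }
        // Only IP literals are handed to InetAddress, so no DNS lookup is triggered.
        boolean ipLiteral = host.startsWith("[") || host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
        if (ipLiteral) {
            try {
                InetAddress a = InetAddress.getByName(host);
                boolean local = a.isLoopbackAddress() || a.isSiteLocalAddress() || a.isLinkLocalAddress();
                return local ? Zone.LOCAL_NETWORK : Zone.PUBLIC;
            } catch (UnknownHostException e) {
                throw new IllegalArgumentException("Bad IP literal in " + url, e);
            }
        }
        boolean unqualified = !host.contains(".");
        return (unqualified || host.endsWith(".local")) ? Zone.LOCAL_NETWORK : Zone.PUBLIC;
    }

    public static void main(String[] args) {
        for (String u : new String[] { "http://bfo.com", "http://192.168.0.1", "http://service.local", "file:/path/to/file.html" }) {
            System.out.println(u + " -> " + classify(u));
        }
    }
}
```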

Whenever a request is made for any resource, the egress filter determines whether it can be accessed. The egress filter initially set is stricter than the default rules described above for each zone; it disallows all access to file and local-network URLs, regardless of origin.

To adjust this, the egress_filter property must be set in the admin/configure action. In order to do this the service must be run in secure mode, which is not the default. To enable secure mode:

  • When running as a standalone web server, pass in the --web-server-secure parameter on the command line

  • When running as part of a Servlet environment, add <init-param>secure</init-param> to the servlet context in the web.xml file.

This will allow the egress_filter to be changed remotely by anyone granted admin/configure rights, which includes anonymous users by default. This change could allow read-access to the filesystem and local-network of the computer that BFO Publisher is running on. Choosing to run in secure mode means you’ve considered this and have secured the web service appropriately.

The initial value of egress_filter is the list ["no-files", "no-local", "default"], which puts three filters on any resource being loaded: the first disallows all file access, the second all local-network access, and the third is the default set of rules described for each zone above. Setting this to "egress_filter": "default" would leave only those default rules in place, which would allow the following:

  • Converting file:/path/to/file.html (or file:/etc/passwd) would be allowed. If that URL included another resource from a file URL, that would also be allowed.

  • Converting http://server/file.html would be allowed. If that URL tried to include a file URL it would be disallowed, but it would be allowed to load resources from other local-network and public URLs.

  • Converting http://public.com/file.html would be allowed. If that URL included another resource from a file or local-network URL it would be disallowed by the default rules.

The default rules essentially mirror the rules used in web browsers, and they should usually be the last item in the egress_filter value. More fine-grained control is possible - for example, it could be preceded with {"search": "^file:/etc", "replace":false} to disallow all access to the /etc/ folder.
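The interaction between an origin's zone and the zone of the resource it requests can be modelled as a simple ordering. The Java sketch below is one way to express the rules described in this section; it is not the product's filter code, the names are hypothetical, and it assumes (the text does not say so explicitly) that a file origin may load local-network and public resources. The top-level URL being converted has no origin and is not modelled here.

```java
final class EgressRuleSketch {
    /** Zones ordered from least to most sensitive. */
    enum Zone { PUBLIC, LOCAL_NETWORK, FILE }

    /** Default rule: an origin may load resources from its own zone or a less sensitive one. */
    static boolean defaultRuleAllows(Zone origin, Zone resource) {
        return resource.ordinal() <= origin.ordinal();
    }

    /** Models the initial ["no-files", "no-local", "default"] filter list. */
    static boolean initialFilterAllows(Zone origin, Zone resource) {
        // "no-files" and "no-local" reject those zones outright, regardless of origin.
        if (resource == Zone.FILE || resource == Zone.LOCAL_NETWORK) {
            return false;
        }
        return defaultRuleAllows(origin, resource);
    }

    public static void main(String[] args) {
        for (Zone origin : Zone.values()) {
            for (Zone resource : Zone.values()) {
                System.out.println(origin + " -> " + resource
                        + ": default=" + defaultRuleAllows(origin, resource)
                        + ", initial=" + initialFilterAllows(origin, resource));
            }
        }
    }
}
```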

Static website

Other than the Store mentioned at the top of this document, the only other file-system access made by the BFO Publisher web-service is to a directory which can be used to serve static pages, for a simple website. This is optional - if set, it is checked after any of the URLs described in this document (so it’s not possible to override /store, for example).

Again, as this involves access to the filesystem it requires the secure mode to be set.

The static_path option is simply the absolute path to the directory to serve web pages from. It may contain a 404.html page which is served for any 404 responses. Symbolic links are followed, but only within that directory.

If set with the serve_help option, the path for the help files and live site test files is checked last, after the static_path.

WebSockets

The WebSocket interface is almost identical to the HTTP interface described above. WebSocket messages can be text or binary - for BFO Publisher, text messages must always be formatted as JSON and binary messages always encoded as CBOR.

The WebSocket API is different to the HTTP API in the following ways.

  • HTTP headers are not typically used with WebSockets, so instead of setting the Authorization header to Bearer JWT-token (where JWT-token is the JWT token), the supplied JSON or CBOR should include an authorization property set to JWT-token in every message. The properties for both JSON and CBOR are read in order, so authorization should be one of the first properties in the structure. As of release 1.4, the Authorization header from the initial WebSocket connection will be used as a default if this is not set.

  • WebSocket messages must have a type property set to the various actions listed above - example types are get, put, delete, convert, admin/configure and so on. A normal reply to any one of these messages will be get-response, put-response, admin/configure-response and so on (although the client should be prepared for other response types too, eg error)

  • As of release 1.4, CBOR requests sent to the server may be compressed using one of the algorithms listed in the accept_encodings property returned from a status request. Any compression will be detected and handled automatically and is on a per-message basis; it may be something to consider when sending large HTML files for conversion. Replies will never be compressed, as they are either small or contain a PDF (which is already compressed)

  • WebSocket messages are standalone messages, so do not have the concept of request-then-response that is inherent to HTTP. To assist correlation, if the message sent by the client contains the property message_id then that property will be echoed back as the reply_to property in any response relating to that message. The value can be any string or number.

  • The get, put and delete types use an identical syntax to the POST call to the /store URL described in The Store section. Where an HTTP GET would return the file directly, WebSocket returns it as a structure which will look like the structure below. content is always a buffer (for CBOR) or a Base64-encoded string (for JSON):

    {
      "type": "get-response",
      "content_type": "application/pdf",
      "content": ...
    }
  • WebSockets have one additional message type, callback, which may be sent during a conversion if the conversion requires additional input from the user - usually because a username or password is required to access a resource. The callback message is sent to the client - it must complete the fields for each entry in the callback array (name for a name callback, password for password), change the type to callback-response and send the same object back in reply.

    {
      "type": "callback",
      "reply_to": 123,
      "callback_id": 12313123,
      "callbacks": [
        {"type": "name", "prompt": "http://example.com/secret" },
        {"type": "password", "prompt": "http://example.com/secret" }
      ]
    }
    
    {
      "type": "callback-response",
      "reply_to": 123,
      "callback_id": 12313123,
      "callbacks": [
        {"type": "name", "prompt": "http://example.com/secret", "name": "johndoe123" },
        {"type": "password", "prompt": "http://example.com/secret", "password": "secret" }
      ]
    }
  • When convert is called with redirect set to false (the default), a message with type convert-response is immediately sent in response with "complete":false set on the structure. Logging messages follow, sent with "type":"log", then finally another convert-response message with "complete":true set, containing the content_type and content properties set to the type and value of the generated output, exactly as they would be retrieved with a get.

Here’s an example WebSocket conversion for such a convert operation, showing the initial request and the multiple messages that follow:

{
  "type": "convert",
  "authorization": "...jwt token goes here...",
  "message_id": 123,
  "url": "file.html",
  "put": [
    {"path": "file.html", "content_type": "text/html", "content": ... },
    {"path": "style.css", "content_type": "text/css", "content": ... },
    {"path": "logo.png", "content_type": "image/png", "content": ... }
  ]
}

{
  "type":"convert-response",
  "ok": true,
  "reply_to": 123,
  "complete": false,
  "paths": { "folder": "7W4ABW2UySxh1ra4", "log": "7W4ABW2UySxh1ra4/log.txt" }
}


{
  "type": "log",
  "ok": true,
  "reply_to": 123,
  "log": { ... }               /* will be many of these messages */
}

{
  "type":"convert-response",
  "ok": true,
  "reply_to": 123,
  "complete": true,
  "content_type": "application/pdf",
  "content": ...
}
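On the client side, the message_id / reply_to pairing shown above can be tracked with a small map of pending requests. The following Java sketch is one possible approach, with hypothetical class and method names; JSON or CBOR parsing is left to the caller, which is assumed to extract reply_to, type and complete from each incoming message before calling onMessage.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class ReplyCorrelatorSketch {
    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Reserves a message_id for an outgoing request and registers a future for its final reply. */
    long newRequest() {
        long id = nextId.getAndIncrement();
        pending.put(id, new CompletableFuture<>());
        return id;
    }

    /** Returns the future for a request that is still pending, or null if it is unknown or done. */
    CompletableFuture<String> futureFor(long messageId) {
        return pending.get(messageId);
    }

    /** Called for every incoming message; 'complete' is null when the property is absent. */
    void onMessage(long replyTo, String type, Boolean complete, String rawMessage) {
        boolean interim = type.equals("log")
                || type.equals("callback")
                || (type.equals("convert-response") && Boolean.FALSE.equals(complete));
        if (interim) {
            return; // progress, logging or a prompt: the request is still running
        }
        CompletableFuture<String> future = pending.remove(replyTo);
        if (future != null) {
            future.complete(rawMessage);
        }
    }
}
```

A caller would put the value returned by newRequest() into the message_id property before sending, then wait on the matching future for the final response.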

Java EE Servlet deployment

BFO Publisher ships with a bundled web-server based on Netty, chosen because it has extremely low overhead and is very fast. However some sites may prefer to integrate BFO Publisher into an existing Java EE Servlet container, such as Apache Tomcat or JBoss. This functionality was added in release 1.4.

To deploy as a web-application requires a WAR file, which is simply a Zip file with the following structure:

WEB-INF/
  web.xml
  configure.json
  lib/
    bfopublisher-bundle.jar

The web.xml file is a boilerplate file which requires only the following content:

<web-app>
 <listener>
  <listener-class>
   org.faceless.publisher.web.ServiceEngineServletContextListener
  </listener-class>
 </listener>
</web-app>

The configure.json file is optional. It can be included to set the initial configuration of the web-service - if present, it is sent to the server as an admin/configure message immediately on startup. This can be used to configure default user permissions and other rules.

You should always set an initial configuration. See Security.

If preferred, bfopublisher.jar can be used instead of bfopublisher-bundle.jar, so long as the required auxiliary Jars are included in the WEB-INF/lib folder.

Once the WAR file is built, deployment is usually just a matter of copying it to the web-apps folder of the servlet engine - if called publisher.war, it will be deployed to the /publisher path in the servlet engine. The deployment supports the Web Socket and HTTP conversion API detailed in this section, but does not include the static files (such as this help file or the API documentation).

If you’re wondering why we didn’t use a @WebListener annotation to make deployment easier, the problem is that this annotation is in the javax.servlet.annotation package which won’t be present in a non-servlet installation. Many annotation-based libraries work by scanning all classes for their annotations, and an annotation from an uninstalled package may result in a crash if the library is not prepared for this.

WebService Reference

The Web Service can be accessed using standard HTTP GET, PUT and DELETE, but these are all shortcuts for the universal interface used with HTTP POST and WebSockets. The actions (specified with the path for HTTP or with the type key for WebSockets) are listed below with the properties that can be included with each action.

The WebSockets section also lists some additional properties which can be included with each request. The responses are not described here: they are in general much simpler, and the examples in the previous section show their structure.

convert (request)

Initiate a conversion. The following parameters can be sent as part of a convert message. Unless specified, all values are strings and are optional.

url

the URL that will be converted. If unspecified it will be set to the path of the first item in the put array, and if no put array exists it is an error. URL may be absolute or relative; relative URLs will be loaded from the current folder in the store

folder

the folder that the output files and log file will be created in. If unspecified, a new random folder will be created. Folder names are not hierarchical; permitted characters are ASCII letters, digits, underscore, hyphen and period.

extra_folders

a list of folders to look in for resources after the current folder, highest priority first.

output_name

the preferred name of the output file, for example report.pdf. If unspecified an appropriate name will be chosen.

output_type

the media-type (also known as content-type) of the output. If unspecified defaults to application/pdf. Other possible values are image/svg+xml, image/tiff, image/png or other Image types supported by Java. The resolution parameter can be used to control output resolution, eg image/tiff;resolution=150dpi, and SVG pagination and encapsulation properties may also be set using the corresponding media type parameters. (Note that only SVG pagination types that result in a single file are supported, the directory pagination type is not.)

redirect

if true, the generated file will not be returned directly - instead a JSON object containing a paths property will be returned, with the path to the output file specified within it. If false (the default) BFO Publisher will return the file as part of the same HTTP request when it is complete, or return an HTTP 500 if it fails. If the request was made over the WebSocket interface, a value of false means BFO Publisher will emit a convert-response message with "complete":true when the conversion is complete (see WebSockets section for detail).

baseurl

if set, any relative URLs within the input XML will be resolved against this value. It has the same effect as setting the <base> element in the HTML, although at a lower priority.

lang

the BCP47 language code to use as the document default. This is identical to specifying lang or xml:lang on the root element, but at a lower priority.

referrer_policy

the HTTP Referrer-Policy to use. This is identical to specifying <meta name="referrer"> in the source HTML, but at a lower priority.

user_stylesheets

an array of one or more stylesheets which will be treated as user stylesheets. Each entry is a URL and may be absolute or relative, with relative values resolved against the folders used for the conversion

ua_stylesheets

an array of one or more stylesheets which will be treated as user-agent stylesheets. Each entry is a URL and may be absolute or relative, with relative values resolved against the folders used for the conversion

processing_instructions

an array of one or more processing_instructions specified as a map with type and data keys - for example {"type":"xml-stylesheet", "data":"href=\"stylesheet.xsl\""}. These will be processed while the input file is being parsed, so relative values are relative to the document base (see XSLT).

proxy_servers

[Since 1.1] an array of zero or more strings containing the IP addresses of any HTTP Proxy Servers which proxy for the web-service. If incoming connections are made from one of these addresses, the value of any X-Forwarded-For HTTP header will be used as the address of the connection for logging and authorization. If multiple proxies add multiple X-Forwarded-For headers, each will be checked against this list in sequence.

env

a map containing environment properties which will be set as environment variables on the parse, accessible by the css env() function. Special values of these are used to control the conversion process - for example: "env": { "bfo-sys-mathml-level": "core"} would restrict MathML 4 to the "core" set rather than the default extended set. See Appendix A: Environment variables

url_processors

an array of one or more URL Processors, which plug-in to the HTTP request engine to manage authentication, configure TLS, add client-certificates or trusted-roots, manage cookies and so on. Each type is described separately below

timeout

the number of milliseconds to wait after the conversion begins before cancelling the job. The default value of 0 means jobs are never cancelled.

put

an array of files to PUT to the current folder before conversion begins. Each value is a map, the fields of which are described immediately below

put[n].path

Required. the path to store this file against in the folder. Paths are typically relative but may also be absolute URLs. If the path is referenced from within the source document it will be retrieved from the folders used for the conversion if it exists there, otherwise it will be retrieved normally.

put[n].content

Required. the content of the file to store in the folder. If the convert message is specified as a CBOR buffer (recommended) it will be stored verbatim. If it is specified as a JSON string or CBOR string, it will be Base64 decoded before storing. If Base64 decoding fails the UTF-8 value of the string will be used as input.

put[n].content_type

the media-type (also known as content-type) of the resource. If not specified, it will be sniffed as the file is being uploaded.

put[n].last_modified

the last-modified time of the file, in milliseconds since the UNIX epoch. Used if the file is to be attached to the PDF.

put[n].expiry

the number of milliseconds after the file is last accessed that it should be considered for deletion, subject to the limitations of the store.

put[n].grants

a map containing the access rights for this file. See Access Control below.

put[n].trusted

boolean; if true and the user has the admin/trusted grant, mark the uploaded resource as trusted (since 1.3).
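The decoding rule described for put[n].content when it arrives as a string (try Base64 first, fall back to the UTF-8 bytes of the string) can be sketched as below. This only illustrates the rule as stated: the class name is hypothetical, and the use of the standard (non-URL-safe) Base64 alphabet is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class PutContentSketch {
    /** Returns the bytes to store for a string-valued content property. */
    static byte[] decode(String content) {
        try {
            // Assumption: the standard Base64 alphabet is expected.
            return Base64.getDecoder().decode(content);
        } catch (IllegalArgumentException notBase64) {
            // Not valid Base64, so the text itself is the content.
            return content.getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        System.out.println(new String(decode("aGVsbG8="), StandardCharsets.UTF_8));
        System.out.println(new String(decode("body { color: red }"), StandardCharsets.UTF_8));
    }
}
```

One consequence of a rule like this is that a short plain-text value which happens to be valid Base64 would be decoded rather than stored as text, which is one reason sending content as a CBOR buffer avoids ambiguity.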

URL Processors

There are several different types of URL Processor that can be added to the configuration (they will be applied to all conversions) or to an individual convert call.

The type property must be set and determines which other properties apply. The different types and their properties are summarized here:

type=ssl-default

This attaches the normal Java list of trusted certificates to any SSL request. This is the only URL Processor that is set by default, and this (or ssl-mozilla) should always be set for normal use.

type=ssl-mozilla

This attaches the Mozilla CA Certificate List of trusted certificates to any SSL request. This is the standard list of root certificates used by most browsers. It is a good alternative to ssl-default.

type=ssl-insecure

This allows all SSL certificates, and is a quick way to work with self-signed certificates. It does no checking on the certificate, so is not as secure as the other SSL options.

type=password

adds support for standard HTTP Basic or Digest password security to a request. If a name and password are specified, they will be used without prompting. Otherwise, if using WebSockets then a callback message will be sent to retrieve the name and password from the user.

name

the username to use for the login.

password

the password to use for the login.

realm_match

an optional array of regular expressions which, if set, will be applied to the realm sent by the server. At least one must match for the password to be applied. If this array is empty or the parameter missing, the password is always applied (subject to the match property described below)

type=aws4

adds support for the AWS4 authentication scheme used to access files stored on Amazon S3 and compatible systems. BFO Publisher will only download files from the store, so authentication is no different to regular password authentication. As with the password type, if name and password are not included they will be requested via the callback message.

name

the access-key to use for the login.

password

the secret-key to use for the login.

type=ssl

adds either a custom trust store (to add trusted root certificates), a custom identity store (to add SSL Client certificates) or both.

trust

the URL of a keystore containing one or more X.509 Certificates which will be used as trusted root certificates.

identity

the URL of a keystore containing one or more private keys and their corresponding X.509 Certificates which will be used as SSL client certificates.

password

if identity is set, this is the password that is used to access the identity keystore.

type=cookie-store

adds a store to read and write HTTP Cookies.

Each entry may also have an optional match parameter:

match

an array of one or more regular expressions which will be matched to the URL. The URL Processor will only be applied if one matches (if no match array exists, or it’s empty, the processor is always applied).
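The match rule can be expressed as a small Java predicate. This is a sketch with hypothetical names, and it assumes that a pattern only needs to be found somewhere in the URL rather than match it in full; the source does not state which of the two applies.

```java
import java.util.List;
import java.util.regex.Pattern;

final class ProcessorMatchSketch {
    /** True if a URL Processor with this match array should be applied to the URL. */
    static boolean applies(List<String> match, String url) {
        if (match == null || match.isEmpty()) {
            return true; // no match array, or an empty one: always applied
        }
        for (String regex : match) {
            // Assumption: a partial match anywhere in the URL is sufficient.
            if (Pattern.compile(regex).matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> match = List.of("//example.com", "//login.example.com");
        System.out.println(applies(match, "https://login.example.com/secret"));
        System.out.println(applies(match, "https://other.org/"));
    }
}
```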

Here’s an example showing a url_processors array that adds a client-certificate, on top of the regular list of certificates, and adds a username/password for a specific site before falling back to prompting the user:

{
  "url_processors": [
    {
      "type": "password",
      "name": "myusername",
      "password": "my-secret-password",
      "match": [ "//example.com", "//login.example.com" ]
    },
    {
      "type": "ssl",
      "identity": "http://myserver.com/path/to/identity.pkcs12",
      "password": "my-secret-keystore-password"
    },
    {
      "type": "password"
    },
    {
      "type": "ssl-default"
    }
  ]
}

We recommend always having an empty password processor and either ssl-default or ssl-mozilla as the final two processors, and this is the default setup.

get (request)

Retrieve a file from The Store. Normally only called from WebSockets, as a standard GET can be used with HTTP

path

The path to retrieve from the store. Required.

delete (request)

Delete a file from The Store. Normally only called from WebSockets, as a standard DELETE can be used with HTTP

path

The path to delete from the store. Required.

put (request)

Put a file in The Store. Normally only called from WebSockets, as a standard PUT can be used with HTTP

Parameters are anything that can be set on a single item in the put array for a convert action.

status (request)

Request non-admin status from the server. No parameters.

admin/status (request)

Request admin status from the server. No parameters.

admin/pause (request)

Pause the server for new conversions or store changes. No parameters

admin/resume (request)

Resume a server that was previously paused. No parameters

admin/shutdown (request)

Halt a server. No parameters

admin/configure (request)

Configure the server for use - see Administration

cache_size

Integer. The number of resources to be kept in a memory-sensitive LRU cache in case they’re requested again. Defaults to 200.

max_threads

Integer. Sets the number of threads that will be available for conversion (each conversion process is largely single threaded). Defaults to 0, which means the number of cores available to Java.

max_queue_size

Integer. The number of Jobs to queue when max_threads conversions are already in progress. The default is 0, which means no limit. If this is set lower, conversion requests may be rejected.

max_buffer_size

Long. The maximum number of bytes that can be sent in a single request over HTTP or WebSocket. This value is used only if the corresponding key is not set in any Authority that authorizes the request.

lang

the default value of the property with the same name that’s passed into /convert

referrer_policy

the default value of the property with the same name that’s passed into /convert

user_stylesheets

the default value of the property with the same name that’s passed into /convert

ua_stylesheets

the default value of the property with the same name that’s passed into /convert

processing_instructions

the default value of the property with the same name that’s passed into /convert

env

the default value of the property with the same name that’s passed into /convert

url_processors

the default value of the property with the same name that’s passed into /convert.

license

the license code for BFO Publisher, supplied by BFO when the product is purchased to remove the trial-version stamps from the PDF.

egress_filter

the policy to use for the egress filter used to limit access to external resources. This is a secure option - see Security for more detail. The syntax is one of the following values:

  • never means only trustworthy internal URLs required for normal operation are allowed

  • always means all access is allowed, from any origin

  • default means the typical set of rules used in browsers is allowed.

  • no-files means that all access to filesystem-based URLs is blocked

  • no-local means that all access to local-network URLs is blocked

  • a map may be specified with the following parameters:

    • search is a Java Regular Expression which will be matched to the URL being requested (required).

    • replace is either a string which will be used as part of a replaceAll on the matched URL, or the value false to disable access to that URL. A string value allows URLs to be rewritten; if this filter is part of a list, it is the rewritten URL that is passed into the next filter.

    • baseurl is an optional Java Regular Expression - if set it will be tested against the URL of the resource requesting this URL, and the filter will only be applied if it matches.

  • a list of any of the above values. The filters will be applied in order. If any filter disallows access to a URL then processing stops.
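
As a rough illustration, the value below combines a keyword with two map filters. The host names are hypothetical, and the use of $1 in replace assumes that the usual Java replaceAll group references are honoured, which the description above suggests but does not state outright.

```json
{
  "egress_filter": [
    "no-files",
    {
      "search": "^https?://cdn\\.example\\.com/(.*)$",
      "replace": "https://mirror.example.com/$1"
    },
    {
      "search": "^https?://tracker\\.example\\.com/.*$",
      "replace": false,
      "baseurl": "^https://untrusted\\.example\\.org/.*$"
    }
  ]
}
```

In this sketch filesystem URLs are blocked first, URLs on the first host are then rewritten to the mirror (and it is the rewritten URL that reaches the last filter), and the last filter denies the tracker host only for requests made by a resource whose own URL matches baseurl.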

store

a map which specifies which Store to use, and configuration values for the Store. If it contains the key classname, that is the Java classname of a subclass of Store - if this differs from the current Store, it will be replaced. Other properties can control the Store operation - they will vary across Store types, but the following are defined for the default FileStore type:

store.path

the directory on the filesystem to store files in. Defaults to java.io.tmpdir

store.default_expiry

if no expiry is specified when uploading a resource, this value determines the default. It is the number of milliseconds after a File in the Store is last accessed before the File is considered for deletion. Defaults to 604800000, which is 7 days.

store.max_expiry

the maximum value that can be specified for expiry for any file upload. Values larger than this will be capped. Default is 0, meaning no max.
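
A store map for the default FileStore might look like the sketch below. It assumes the dotted names above (store.path and so on) denote keys nested inside the store map, and the directory shown is a hypothetical path.

```json
{
  "store": {
    "path": "/var/lib/publisher/store",
    "default_expiry": 86400000,
    "max_expiry": 604800000
  }
}
```

With these values a file would by default be considered for deletion one day (86400000 milliseconds) after it was last accessed, and an upload could request an expiry of at most seven days (604800000 milliseconds).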

authorities

a list of signing authorities which will be used to grant access to BFO Publisher. See Access Control below for full details on these parameters

authorities[n].key

the JWK key used to verify any Bearer tokens supplied in calls to this web service, or the word none to match all requests, including anonymous requests.

authorities[n].grants

a list of strings naming the grants which this Authority is allowed to approve. The wildcard * means match any character except '/' and the wildcard ** means match any character. Examples would be status, admin/status or custom/token/myaction for individual grants, or **, *, admin/* or custom/token/* for wildcard matches.

authorities[n].from

[Since 1.1] an optional list of strings listing the IP addresses to match against this Authority (if unspecified it defaults to ["*"] to match all IP addresses). Addresses are specified in the following formats:

  • * will accept all addresses

  • 127.0.0.1 will accept the specified IPv4 address

  • 192.168.0.0/16 will accept the specified IPv4 address range

  • ::1 will accept the specified IPv6 address

  • 2001:db8::/48 will accept the specified IPv6 address range

  • Each of these can have a ! in front of it to reverse the logic - if any address with a ! prefix is matched, the authority does not match

authorities[n].max_buffer_size

the maximum number of bytes that can be sent in a single request over HTTP or WebSocket for a request approved by this Authority.

authorities[n].overrides

a map of properties which is applied over any properties specified by the user, overriding their supplied values

authorities[n].defaults

a map of properties which is applied under any properties specified by the user, setting defaults which can be overridden by the user
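
To illustrate how these parameters fit together, the sketch below defines two authorities that both use the key none, so no Bearer token is needed. The choice of grants, addresses and buffer size is an illustrative assumption, and how several matching authorities are combined is covered under Access Control rather than here.

```json
{
  "authorities": [
    {
      "key": "none",
      "grants": [ "status" ]
    },
    {
      "key": "none",
      "grants": [ "admin/*" ],
      "from": [ "127.0.0.1", "::1" ],
      "max_buffer_size": 1048576
    }
  ]
}
```

The first entry lets any client request status from any address, since from defaults to ["*"]. The second approves single-level admin grants such as admin/status or admin/pause, but only for requests from the local IPv4 or IPv6 loopback address, and limits those requests to 1048576 bytes.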

extensions

[Since 1.3] a list of maps that describe extensions to be added to or removed from Publisher

extensions[n].class

the class name of a ReportFactoryExtension class to be added to or removed from the list of extensions used by BFO Publisher. Extensions that are already in the list will be silently ignored; no duplicates are possible. If the class name cannot be resolved, it will be tested with the org.faceless.publisher.ext prefix added before failing. So for example, "extensions": [ { "class": "FreeMarkerExtension" } ] will add the FreeMarkerExtension to Publisher.

extensions[n].remove

if this optional boolean is true, the extension will be removed rather than added

extensions[n].configuration

if this optional value is set, it will be passed to the configure() method of the extension.

Finally, when the web service is running as a standalone web server there are some additional configuration options that can be set, which do not apply when the web service is running as a web application in an existing servlet engine.

server.http_port

The port the server should listen to for HTTP requests, or 0 to disable HTTP entirely. The default value is 8080.

server.https_port

The port the server should listen to for HTTPS requests, or 0 to disable HTTPS entirely. Note that HTTPS also requires a server.keystore to be set. The port must be a different value to the server.http_port. The default value is 8443.

server.serve_help

If this is true (the default), the help files - including this one - and the simple web interface for testing BFO Publisher are served from the embedded web server. If false they will return 404.

server.prefix

The prefix that should be applied to all URL paths used by the web service. The default is / which means that (for example) the store is accessible on /store, but setting this to /publisher would move the store URL to /publisher/store. When running the web service behind a path-based proxy, it’s a good idea to set the prefix to match.

server.keystore

The keystore to use for the key information required to enable HTTPS. The keystore can be specified in a number of different ways:

  • as a URL to a PKCS#12, JKS or JCEKS format java.security.KeyStore, or to a file containing one or more PEM encoded X.509 certificates and exactly one PEM encoded private key.

  • the special URL about:identity can be used to create a self-signed identity on-the-fly. Fragment Parameters to the URL control the details of the identity, and include:

    • CN - the common name to use, eg about:identity#CN=myservername

    • algorithm - the algorithm to use, eg about:identity#algorithm=SHA256withECDSA

    • curve - for elliptic-curve algorithms, the curve to use (defaults to secp256r1)

    • provider - the java.security.Provider to use to create the identity

    • days - the validity of the self-signed certificate (defaults to 365)

    • keylength - for RSA algorithms, the length of the key (defaults to 2048)

    • anything else - will be treated as an X.500 field of the identity being generated.

  • a JWK Key Set listing exactly one private key and one or more X.509 Certificates, with the server certificate first in the list.

server.password

The password required to access the keystore set by server.keystore, if required.

server.static_path

The optional directory from which to serve static files that don’t match any other URL recognised by the web service. This is a secure option - see Security for more detail.
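
As a sketch, a standalone server that disables plain HTTP and serves HTTPS with a self-signed identity could be configured along these lines. It assumes the server.* names are keys nested inside a server map, in the same way as the other dotted names in this section; the about:identity URL is the one used as an example above.

```json
{
  "server": {
    "http_port": 0,
    "https_port": 8443,
    "keystore": "about:identity#CN=myservername",
    "prefix": "/publisher",
    "serve_help": false
  }
}
```

With this prefix the store would be reachable at /publisher/store, and requests for the help files would return 404.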

web_app.websocket_controller_class

String. The Java classname of the controller for the WebSocket interface. Override this with a subclass of WebsocketController to add new functionality to the WebSocket interface.

web_app.http_controller_class

String. The Java classname of the controller for the HTTP interface. Override this with a subclass of HttpController to add new functionality to the HTTP interface

Appendix A: Environment variables

Below is a largely complete list of environment variables recognised by the current release. Others may be added and the functionality of some may be removed; this list is for information purposes only.

Name Default Description

User Properties

bfo-lang

"" (empty string)

the default language when no other value is known (also the environment language for SVG)

bfo-lookahead

true

should the parser run in lookahead mode (see Lookahead mode)

bfo-stylesheet-alternate

none

the alternate stylesheet name to use

bfo-pages

auto

Guess for the number of pages in the document (to better-size the pages counter in non-lookahead mode)

bfo-publisher-version

the current version of BFO Publisher (read-only)

bfo-format

the output format - "pdf" or "svg" (read-only)

bfo-location

the URI of the document being parsed (read-only)

bfo-metadata-location

public

the URL to record in the Metadata as the source for the generated PDF: none for none, always for the value of bfo-location, public for the value of bfo-location only if it is a public URL. Any other value will be stored as specified.

bfo-table-row-group-buffer

100

the number of table rows to buffer when searching for a table-footer-group element: none for no buffering, infinite to buffer all table rows, or a positive integer of the number of rows to buffer. Set this to a higher value if your table-footer-group is not appearing, or move the table-footer-group before the table-row-group if possible.

System Properties

bfo-sys-html-xml-base

false

does HTML input accept xml:base attribute?

bfo-sys-mathml-level

extended

level of MathML support: core or extended

bfo-sys-resource-timeout

30s

how long to wait before resource requests time out

bfo-sys-same-origin-policy

true

enforce CORS same-origin policy for any network requests

bfo-sys-hold-queue-length

40

how many elements to buffer before starting layout

bfo-sys-display-run-in

new

the display:run-in layout model: new to match the CSS spec, old to match the old CSS2 model and none to not recognise run-in display types

bfo-sys-display-columns

true

whether the layout engine recognises the columns and related properties

bfo-sys-display-grid

true

whether the layout engine recognises the grid display properties (note grid is unsupported; setting this to true will cause an exception to be thrown if grid layout is encountered)

bfo-sys-display-flex

true

whether the layout engine recognises the flex display properties

bfo-sys-content-in-nodes

false

determines whether the full range of values for the content property can be set on nodes; if false, it only applies to pseudo-nodes (as specified in css-content-3)

bfo-sys-content-in-content

false

determines whether the content() value of the content property can be used in pseudo-nodes

bfo-sys-viewport-zoom

1

the default value for zoom if not specified in @viewport

bfo-sys-fallback-id-attribute

"id" "xml:id"

the attributes to use for the id on elements in unrecognised namespaces

bfo-sys-fallback-class-attribute

"class"

the attributes to use for the class on elements in unrecognised namespaces

bfo-sys-fallback-base-attribute

"xml:base"

the attributes to use for the base-url on elements in unrecognised namespaces

bfo-sys-fallback-lang-attribute

"xml:lang"

the attributes to use for the language on elements in unrecognised namespaces

bfo-sys-fallback-style-attribute

none

the equivalent to the HTML style attribute on elements in unrecognised namespaces

bfo-sys-first-letter-skips-inline-marker

true

are inline markers from list-items considered part of the first letter? https://github.com/w3c/csswg-drafts/issues/4506

bfo-sys-zero-leading-expands-linegap

false

do fonts with a line-height of 1em have padding added to the linegap (as Firefox does) or to the ascent/descent (as Chrome/Safari do)

bfo-sys-initial-letter-align

false

can the initial-letter use baseline-shift/alignment-baseline?

bfo-sys-intrinsic-stretch-padding

false

does replaced content sized to stretch fit include padding (as Chrome/Safari) or not (as Firefox)

bfo-sys-counter-image-suffix

" " (two spaces)

the suffix to automatically apply after an image in a counter.

bfo-sys-font-step

1.2

step adjustment for font-size: larger and font-size: smaller

bfo-sys-font-smallcaps-size

0.7

the size to multiply font-size by when synthesizing small-caps

bfo-sys-page-first-of-group

false

whether the :first selector matches the first

bfo-sys-target-text-max-length

80

how many characters of text to store from each node with an id to support the target-text() function

bfo-sys-legacy-css-common

false

whether to automatically support a set of legacy prefixed CSS properties

bfo-sys-legacy-css-ah

false

whether to automatically support a subset of CSS properties used by Antenna House™ Formatter

bfo-sys-legacy-css-epub

false

whether to automatically support a subset of CSS properties used by ePub documents

bfo-sys-legacy-css-all

false

whether to automatically support all of the above legacy subsets of CSS properties

bfo-sys-string-counter-by-reference

false

does string-set(x, counter(page)) copy the counter by reference? https://github.com/w3c/csswg-drafts/issues/4740

bfo-sys-svg-respect-tainting

true

whether we honour the tainting rules for SVG filters (they have no security implications in BFO Publisher)

bfo-sys-svg-non-uniform-turbulence

false

does the SVG feTurbulence algorithm use the legacy, non-uniform vector

bfo-sys-backdrop-filter-color

white

color over which the root element should be composed for CSS backdrop-filter

bfo-sys-svg-blur-default-edgemode

none

default edgeMode for SVG blur filter: spec has none, all implementations use duplicate

bfo-sys-svg-edgemode-box

filter

which box to treat as the edge for the "blur" filter - primitive for the content rectangle, filter for the filter rectangle

bfo-sys-svg-path-earlyclose

false

whether we support the SVG path "early close" mechanism from SVGnext?

bfo-sys-svg-path-bearings

true

whether we support the SVG path bearing command from SVGnext?

bfo-sys-break-all-allows-break-edge

true

whether a break opportunity exists before the first char in a word-break: break-all: https://github.com/w3c/csswg-drafts/issues/3897

bfo-sys-viewport-units

base

how to resolve vw/vh units? base for take from the "base" @page rule, first for take from the first page, dynamic for resolve from the current context (which may fail when used for font-related units across page-breaks where the page size changes). https://github.com/w3c/csswg-drafts/issues/5437

bfo-sys-reorient-svg-images

false

when embedding images in an SVG, should we rotate any images that rely on the EXIF tags for rotation, which is unsupported by many SVG renderers

bfo-sys-colors

none

which extra color functions we recognise - list of strings including "cmyk" "rgb-icc" "gray" "device-gray" "-ro-spot" "prince-color"

bfo-sys-page-size-nnn

default page sizes; for example, the user-agent stylesheet includes a rule bfo-sys-page-size-a4: 210mm 297mm

bfo-sys-font-xheight-derive

false

whether we derive the x-height from a font from the lowercase 'o', as specified. Can lead to odd results

bfo-sys-font-capheight-derive

false

whether we derive the cap-height from a font from the uppercase 'O', as specified. Can lead to odd results

bfo-sys-font-xheight-default

0.8

the default x-height (if no other way is available to determine it)

bfo-sys-font-capheight-default

1

the default cap-height (if no other way is available to determine it)

bfo-sys-font-superpos-default

0.34

the default superscript baseline shift

bfo-sys-font-subpos-default

-0.2

the default subscript baseline shift

bfo-sys-font-family-default

serif

the default font-family

bfo-sys-raster-resolution-default

200dpi

the default resolution to use when rasterizing images

bfo-sys-raster-resolution-max

infinity

the maximum resolution for any raster image stored in the PDF; anything above this will be downsampled

bfo-sys-raster-resolution-target

infinity

for any images downsampled due to the above setting, the resolution to downsample them to (the used resolution will be min(bfo-sys-raster-resolution-max, bfo-sys-raster-resolution-target))

bfo-sys-running-combines

true

whether content: element(nnn) can be combined with other content (spec says no)

bfo-sys-page-collapse-margin

never

whether the margin on :root collapses with the page-margins (always for yes, empty-margins to match prince and only do so if the margin area is empty, never to match everyone else)

bfo-sys-text-indent-percentage-old

false

whether to resolve percentages in text-indent against width of containing block (old) or content-width of element (new): https://github.com/w3c/csswg-drafts/issues/2394

bfo-sys-svg-features

see text

the list of supported SVG 1.1 features (the default is the full list of "SVG-Static" from SVG 1.1, excluding "Font" but adding "Hyperlinking" and a few others). Used by the requiredFeatures SVG property

bfo-sys-svg-extension-nnn

add a url to the list of SVG 2.0 extensions. For example, the user-agent stylesheet includes a rule bfo-sys-svg-extension-html: "http://www.w3.org/1999/xhtml". Used by the requiredExtensions SVG property

bfo-sys-epub-extension-nnn

add a url to the list of EPUB extensions.

bfo-sys-box-shadow-model

default

which model to use for calculating the spread for box-shadow if the box has rounded corners: default or ratio. https://github.com/w3c/csswg-drafts/issues/7103

Media Properties

bfo-media

print

the media type: print or screen

bfo-media-size

A4

the media size - a shortcut for bfo-media-width and bfo-media-height (default is equivalent to bfo-media-width: 210mm; bfo-media-height: 297mm)

bfo-media-color

24

the media color property

bfo-media-resolution

infinite

the media resolution property

bfo-media-overflow

paged

the media paged property

bfo-media-update

none

the media update property

bfo-media-hover

none

the media hover property

bfo-media-scripting

false

the media scripting property

bfo-media-monochrome

0

the media monochrome property

PDF Output Properties

bfo-pdf-profile

none

list of PDF output-profiles to apply

bfo-pdf-profile-feature-nnn

initial

turn on/off a specified PDF output-profile feature. Values are require, deny, ignore or initial

Extension Properties

bfo-ext-html-namespace

none

how to handle namespace extensions in HTML (see HTML Namespace Extensions)

bfo-ext-signature-nnn

default value for param nnn for any signature objects

bfo-ext-index-division

"//"

the string to use for division token in the IndexFilter

bfo-ext-index-subdivision

"/"

the string to use for subdivision token in the IndexFilter

bfo-ext-index-xref

"->"

the string to use for xref in the IndexFilter

bfo-ext-index-sort

"{ }"

the string (or two strings separated by space) to use for sort token in the IndexFilter

bfo-ext-index-comma

","

the string to use for comma in the IndexFilter

bfo-ext-index-dash

"–"

the string to use for dash in the IndexFilter

bfo-ext-index-separator

","

the string to use for separator in the IndexFilter

bfo-ext-mail

default

the configuration the MailExtension should use to convert RFC822 messages. Current values are default, or PDF/mail-1s + PDF/A-3

bfo-ext-spider

none

how the Spider extension attaches any content it traverses: none to not spider, attach to attach as an attached file, include ident to include at the end of the current file or (if an ident is specified) as children of the specified node.

Appendix B: Properties

This section lists every property recognised by BFO Publisher, along with a link to the spec (or specs) that defines it. If the property value varies from the official specification, that’s noted; otherwise the property definition can be retrieved from the listed specification.

align-content
alignment-baseline
aspect-ratio
backdrop-filter
backface-visibility
background
background-attachment
background-attachment is parsed but cannot be supported in print output; it is unsupported in BFO Publisher
background-blend-mode
background-clip
background-color
background-image
background-origin
background-position
background-repeat
background-size
baseline-shift
baseline-source
bleed
bookmark-label
bookmark-level
none | <integer> | copy | increment
none
BFO Publisher accepts the custom values "copy" and "increment"
bookmark-state
-bfo-bookmark-target
auto | <url>
border-block-color
border-block-end-color
border-block-end-style
border-block-end-width
border-block-start-color
border-block-start-style
border-block-start-width
border-block-style
border-block-width
border-bottom
border-bottom-color
border-bottom-left-radius
border-bottom-right-radius
border-bottom-style
border-bottom-width
border-collapse
border-color
border-image
border-image-outset
border-image-repeat
border-image-slice
border-image-source
border-image-width
border-inline-color
border-inline-end-color
border-inline-end-style
border-inline-end-width
border-inline-start-color
border-inline-start-style
border-inline-start-width
border-inline-style
border-inline-width
border-left
border-left-color
border-left-style
border-left-width
border-radius
border-right
border-right-color
border-right-style
border-right-width
border-spacing
border-style
border-top
border-top-color
border-top-left-radius
border-top-right-radius
border-top-style
border-top-width
border-width
box-decoration-break
box-shadow
break-before
break-inside
clip-path
clip-rule
color-interpolation
auto | sRGB | linearRGB
auto;
color-interpolation-filters
column-count
column-fill
column-gap
column-rule
column-rule-color
column-rule-style
column-rule-width
contain-intrinsic-block-size
contain-intrinsic-height
contain-intrinsic-inline-size
contain-intrinsic-width
counter-increment
counter-reset
counter-set
cx
<length-percentage>
cy
<length-percentage>
d
<path>
none
display
"ruby-base", "ruby-text", "ruby-base-container", "ruby-text-container", "grid" and "inline grid" are not supported.
dominant-baseline
empty-cells
fill
fill-opacity
fill-rule
flex-basis
flex-direction
flex-flow
flex-grow
flex-shrink
flex-wrap
float
none | left | right | inline-start | inline-end | footnote | top | bottom | inside | outside
none
BFO Publisher also accepts the values "inside" or "outside", to float to the inside or outside edge of the current page.
flood-color
flood-opacity
flow-from
flow-into
font-family
font-feature-settings
font-kerning
font-language-override
font-palette
font-size
font-size-adjust
font-stretch
font-style
font-synthesis
font-synthesis-small-caps
font-synthesis-style
font-synthesis-weight
font-variant
font-variant-alternates
font-variant-caps
font-variant-east-asian
font-variant-ligatures
font-variant-numeric
font-variant-position
font-weight
footnote-display
hanging-punctuation
hyphenate-character
hyphenate-limit-chars
[ auto | <integer> ]{1,4}
auto
BFO Publisher accepts a non-standard fourth integer, which is the minimum distance from any other hyphen (auto, soft or hard) that an auto-hyphen will be considered. If unspecified it defaults to zero.
hyphenate-limit-lines
hyphenate-limit-zone
hyphens
none | auto | manual | -bfo-no-caps | -bfo-smart-caps | -bfo-nowrap
manual
BFO Publisher accepts the value "-bfo-nowrap", as proposed in https://github.com/w3c/csswg-drafts/issues/3434. The value "-bfo-no-caps" will prevent any hyphenation of words containing a capital letter. The value "-bfo-smart-caps" will not hyphenate any words previously seen in the middle of a sentence with a capital letter, as an experimental approach to identifying and excluding proper nouns in English.
image-orientation
image-resolution
this property is at risk of removal from CSS. BFO Publisher defaults to "from-image"
-bfo-index
none | <tag> <content-list>?
initial-letter
initial-letter-align
[ border-box? [ auto | alphabetic | ideographic | hanging | leading ]? ]!
alphabetic
BFO Publisher also accepts the value "auto", which will choose the most appropriate value based on the script of the initial-letter.
initial-letter-wrap
inline-sizing
inset-block-end
inset-block-start
inset-inline-end
inset-inline-start
justify-content
justify-items
justify-self
letter-spacing
percentage values are not yet added to the spec, but evaluate against the font size. See https://github.com/w3c/csswg-drafts/issues/2165
lighting-color
line-break
line-grid
line-height
line-height-step
line-snap
list-style
list-style-image
list-style-position
list-style-type
margin
margin-block
margin-block-end
margin-block-start
margin-bottom
margin-break
margin-inline
margin-inline-end
margin-inline-start
margin-left
margin-right
margin-top
marker
none | <url>
not defined for shorthand properties
marker-end
none | <url>
marker-mid
none | <url>
marker-side
marker-start
none | <url>
none
marks
mask-border
mask-border-mode
mask-border-outset
mask-border-repeat
mask-border-slice
mask-border-source
mask-border-width
mask-clip
mask-composite
mask-image
mask-mode
mask-origin
mask-position
mask-repeat
mask-size
mask-type
max-block-size
max-inline-size
-bfo-metadata-subject
none | <url>
min-block-size
min-inline-size
object-position
opacity
outline
outline-color
outline-style
outline-width
overflow
"scroll" is treated as "hidden". The value "clip" is supported.
overflow-wrap
padding
padding-block
padding-block-end
padding-block-start
padding-bottom
padding-inline
padding-inline-end
padding-inline-start
padding-left
padding-right
padding-top
page-break-after
page-break-before
page-break-inside
-bfo-page-label
normal | none | <content-list>
normal
page-orientation
paint-order
path-length
<length-percentage>
this is an SVG presentation attribute. Not yet added to specification - see https://github.com/w3c/svgwg/issues/773
-bfo-pdf-tag
none | <tag> | <tag> map-to <tag> | <namespace> "|" <tag> | <namespace> "|" <tag> map-to <namespace> "|" <tag>
perspective
perspective-origin
place-content
place-items
place-self
position
static | relative | absolute | fixed | sticky | running(<name>)
static
the value "sticky" is parsed but is treated as "relative"
r
<length-percentage>
-bfo-raster-resolution
<resolution>
the value of the "bfo-sys-raster-resolution-default" environment property, which defaults to 200dpi
if content has to be rasterized to a bitmap (as it would be for box-shadow, text-shadow, filter) and the resolution is not implied by the operation, this sets the resolution to rasterize at.
row-gap
rx
<length-percentage>
ry
<length-percentage>
shape-image-threshold
shape-inside
shape-margin
shape-outside
shape-padding
stop-color
<color>
black
stop-opacity
<number> | <percentage>
1
stroke
stroke-dasharray
stroke-dashoffset
stroke-linecap
stroke-linejoin
stroke-miterlimit
stroke-opacity
stroke-width
tab-size
table-layout
text-align
auto | start | end | left | right | center | justify | inside | outside
start
the "inside" and "outside" values are BFO Publisher extensions which will align the text to the inside or outside edge of the current page. They can also be applied to "text-align-all" and "text-align-last".
text-align-all
text-align-last
text-anchor
text-decoration
text-decoration-color
text-decoration-fill
<paint>
text-decoration-line
text-decoration-skip-ink
-bfo-text-decoration-skip-ink-clearance
<length>
0.05em
this property determines how much of a gap to leave between the text-decoration and the text if "text-decoration-skip-ink" is set
text-decoration-stroke
<paint>
text-decoration-style
text-decoration-thickness
text-edge
text-emphasis
text-emphasis is not supported by BFO Publisher.
text-emphasis-color
text-emphasis is not (yet) supported in BFO Publisher
text-emphasis-position
text-emphasis is not (yet) supported in BFO Publisher
text-emphasis-style
text-emphasis is not (yet) supported in BFO Publisher
-webkit-text-fill-color
<color>
currentColor
text-indent
text-justify
text-shadow
-webkit-text-stroke
<line-width> || <color>
-webkit-text-stroke-color
<color>
currentColor
-webkit-text-stroke-width
<line-width>
0
text-transform
text-underline-offset
text-underline-position
transform
transform-box
transform-origin
transform-style
BFO Publisher does not (yet) support the "3d" value for transform-style
translate
-bfo-trim
<length> {1,4} | to [ auto | [ <page-size> || portrait | landscape ] ]
-bfo-trim-bottom
<length>
-bfo-trim-left
<length>
-bfo-trim-right
<length>
-bfo-trim-size
<length>
-bfo-trim-top
<length>
unicode-bidi
vertical-align
visibility
white-space
word-break
word-spacing
percentage values are not yet added to the spec, but evaluate against the font size. This is different from the unspecified-but-implemented-in-Firefox behaviour, where percentages are evaluated against the width of the space character. This legacy behaviour can be selected by setting the environment property "bfo-sys-word-spacing-percentage-basis" to "space". See https://github.com/w3c/csswg-drafts/issues/2165
word-wrap
wrap-after
wrap-before
wrap-inside
writing-mode
vertical writing modes are not supported in BFO Publisher
x
<length-percentage>
y
<length-percentage>
z-index

Appendix C: Supported types

At Rules

File Formats for Replaced Content

  • image/tiff

  • image/png

  • image/jpeg

  • image/gif

  • image/x-portable-anymap

  • image/x-portable-bitmap

  • image/x-portable-pixmap

  • image/x-portable-graymap

  • image/jp2

  • image/bmp

  • image/svg+xml

  • image/* (with ImageIO plugin)

  • application/pdf

  • video/mp4

  • video/quicktime

  • model/u3d

  • model/prc

  • audio/mpeg

  • audio/x-wav

  • audio/x-aiff

  • audio/basic

  • audio/* (with AudioFileReader plugin)

  • bfo/barcode

The application/pdf replaced content supports the page and viewrect fragments as defined in RFC8118.

All image/* replaced content supports the generic xywh media fragment from https://www.w3.org/TR/media-frags/ to embed part of an image. image/tiff also supports the page media fragment, to load a particular page - the first page is 1.

<img src="file.png#&xywh=200,200,100,50">
<img src="file.tif#page=1&xywh=200,200,100,50">
<img src="file.pdf#&page=2&viewrect=xywh=200,200,100,50">
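
To make the fragment syntax concrete, here is a minimal Java sketch of how the xywh media fragment in the first two examples could be read from a URL. It is not BFO Publisher's own parser: the class name XywhFragment and its fields are hypothetical, and it only follows the W3C Media Fragments syntax (an optional "pixel:" or "percent:" prefix followed by four comma-separated integers). The viewrect form used with application/pdf is deliberately not handled.

```java
import java.util.Optional;

/** Hypothetical helper: reads an "xywh" media fragment. Not part of BFO Publisher. */
final class XywhFragment {
    final boolean percent;   // true for "percent:" units, false for pixels (the default)
    final int x, y, w, h;

    private XywhFragment(boolean percent, int x, int y, int w, int h) {
        this.percent = percent;
        this.x = x; this.y = y; this.w = w; this.h = h;
    }

    /** Returns the xywh region in the URL's fragment, or empty if absent or malformed. */
    static Optional<XywhFragment> parse(String url) {
        int hash = url.indexOf('#');
        if (hash < 0) {
            return Optional.empty();
        }
        // The fragment is a list of name=value pairs separated by '&';
        // a leading '&' simply produces an empty first pair, which is skipped.
        for (String pair : url.substring(hash + 1).split("&")) {
            if (!pair.startsWith("xywh=")) {
                continue;
            }
            String value = pair.substring("xywh=".length());
            boolean percent = value.startsWith("percent:");
            if (percent) {
                value = value.substring("percent:".length());
            } else if (value.startsWith("pixel:")) {
                value = value.substring("pixel:".length());
            }
            String[] parts = value.split(",");
            if (parts.length != 4) {
                return Optional.empty();
            }
            try {
                return Optional.of(new XywhFragment(percent,
                        Integer.parseInt(parts[0]), Integer.parseInt(parts[1]),
                        Integer.parseInt(parts[2]), Integer.parseInt(parts[3])));
            } catch (NumberFormatException e) {
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

Calling XywhFragment.parse("file.tif#page=1&xywh=200,200,100,50") yields a region with x=200, y=200, w=100 and h=50, while a URL with no xywh pair yields an empty Optional.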

URL Schemes

The classpath URL scheme can be used to load an object from the classpath. It takes a relative path from the Report class (classpath:data/nss.crt), an absolute path beginning with a slash (classpath:/com/example/Name) or an absolute path with dots (classpath:com.example.Name).

Appendix D: Report Generator

The Big Faceless Report Generator is BFO's first attempt at an XML+CSS to PDF converter, first released in 2001. It shares no code with BFO Publisher other than the underlying PDF library. However, it’s quite possible that some users will wish to migrate from the Report Generator to Publisher, and this section goes into detail on how to do so.

The Report Generator uses a custom XML format that is similar to, but not quite the same as HTML. It also nominally uses CSS, although in practice both the meaning of the properties and the parsing rules differ significantly from the CSS specification.

By contrast, BFO Publisher uses HTML and CSS, exactly as they are in the published specification. So migrating from Report Generator to Publisher means converting from the custom Report Generator XML and CSS to standard HTML and CSS.

BFO Publisher can do this in one pass, reading the Report Generator XML, converting to HTML, then parsing that HTML. To identify the Report Generator XML we need to assign it a Media-Type.

  • application/x-vnd.bfo.report;version=1 for the report-1.1 syntax

  • application/x-vnd.bfo.report;version=0 for the report-1.0 syntax

Both of these will be automatically identified from the input files, so there will rarely be a need to specify the type explicitly. This makes parsing a Report Generator file with BFO Publisher fairly simple. Using the command line approach described in the Quick Start section:

$ java -jar bfopublisher-bundle.jar --format pdf --output out.pdf reportgeneratorinput.xml

This should give you a PDF that is almost, but not exactly, the same as you would see from the Report Generator.

Almost? There are some differences:

  • inline vertical alignment as used in Report Generator is very different to how it’s supposed to be in CSS. This will be most noticeable when mixing text or images with very different sizes on the same line. This is likely to need some manual adjustment.

  • the table layout algorithm, which determines the widths of each column based on its content, is different, although in many cases the differences are slight.

  • the requote attribute (which converts straight quotes into curly quotes) is not yet supported.

  • the algorithm for determining whether a box can fit on the page is slightly different, with Publisher more likely to find it possible. Long documents may use slightly fewer pages in Publisher as a result.

  • Publisher has no support for the axesgraph or piegraph elements of the Report Generator.

  • Report Generator had no concept of collapsed borders in Tables. Emulating this exactly in Publisher is quite complicated and on some occasions we get this wrong.

  • The overflow property in Report Generator worked in a very different way to CSS and cannot be duplicated exactly.

  • superscript and subscript are positioned based on the correct font metrics, which puts them in a slightly different location.

  • list bullets are done very differently in Report Generator. BFO Publisher will attempt to duplicate the approach used in Report Generator, but is unable to do so for hierarchical lists. However CSS has a much better model for these, so it will always be simple to revert to the CSS list model.

  • Form element styling is slightly different.

There are also some layout bugs relating to display: flow-root which are particularly obvious with this conversion, although these will be fixed over time.

Converting Report Generator input to HTML input

What we’ve shown above is the process for converting Report Generator XML into PDF. But a proper migration from Report Generator to Publisher would need to focus on converting Report Generator XML into XHTML. Presumably the reason for migration is to access new CSS features, or to avoid the various issues with page-breaking that were an aspect of the Report Generator? If so, the fixes belong in the HTML and CSS, not the original input file.

BFO Publisher can write the HTML it generates from the Report Generator XML as an output format. Modifying the example above:

$ java -jar bfopublisher-bundle.jar --format application/x-vnd.bfo.publisher+xml \
                                    --output out.xhtml reportgeneratorinput.xml

The Media-Type application/x-vnd.bfo.publisher+xml will generate XML output based on the input. It can be used with any input type, although if you’re reading in a supported XML format like XHTML it will simply be an identity transform.

However when the input has special processing it can be useful. In particular, when:

  • the input is a special format, like the Report Generator XML format or a filetype like ODF

  • the input makes use of XInclude or XSLT to transform the input in advance

  • the input is badly formed HTML and you want to tidy it into XML

This approach will show you exactly what is being parsed by BFO Publisher. And, because the output is now regular HTML, you can load it into a web-browser to see how it works.
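
If you have many sample files to convert, the command above can be driven from Java. The following is a minimal sketch that only assembles and launches the command line shown earlier; the class name ConvertToXhtml and its methods are hypothetical, and it assumes that java is on the PATH and that the bundle jar is at the path you pass in.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical wrapper around the documented command line. Not part of BFO Publisher. */
final class ConvertToXhtml {

    /** Builds the same command as the example above, for one input and one output file. */
    static List<String> command(Path jar, Path input, Path output) {
        return List.of(
                "java", "-jar", jar.toString(),
                "--format", "application/x-vnd.bfo.publisher+xml",
                "--output", output.toString(),
                input.toString());
    }

    /** Runs the conversion as a child process and returns its exit code. */
    static int run(Path jar, Path input, Path output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(jar, input, output))
                .inheritIO()   // show the converter's own messages on this console
                .start();
        return process.waitFor();
    }
}
```

For example, ConvertToXhtml.run(Path.of("bfopublisher-bundle.jar"), Path.of("reportgeneratorinput.xml"), Path.of("out.xhtml")) launches the same conversion as the shell example.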

Most of our Report Generator customers are using some sort of template, populating it with data from a source, then feeding that XML input through the Report Generator to create a PDF. The approach for migration we would recommend is:

  1. Generate a sample XML input and save it to a file.

  2. Run it through BFO Publisher as shown in the example above, to convert Report Generator XML into generic HTML.

  3. Open that HTML in a browser.

  4. Start editing the HTML. In particular:

    • the generated HTML will have CSS rules that don’t apply and can be dropped.

    • the CSS will attempt to be bug-for-bug compatible with the Report Generator - for example, Report Generator could not break table-cells across pages, so the CSS contains a rule th, td, li { break-inside: avoid }. Removing this will allow cells and list items to break at page ends.

    • the CSS will all be in the one file, included as a <style>. The bfo-title attribute on each <style> element shows the source of the data, and it will include (for example) our system stylesheets report-1.1.css and conversion.css. It may make sense to break those out into external stylesheet files and include them with <link>.

    • in many cases the CSS we generate is quite awful, to try and retain compatibility with Report Generator. For example, list bullets are created using absolutely positioned generated content. Removing all the CSS that does this and simply relying on regular CSS list bullets is certainly the better option.

  5. Once the HTML is styled to your satisfaction, reverse the process in step 1 - create a template from the HTML and use that HTML template instead of the original Report Generator template.

  6. At that point the Report Generator can be removed from the code and BFO Publisher substituted in.
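
When working through step 4, it can help to see which stylesheets ended up in the generated file before deciding what to split out or drop. The sketch below uses the standard Java XML parser to list the bfo-title attribute of every <style> element. The class name StyleSources is hypothetical, and the sketch assumes the generated XHTML is well-formed XML that can be parsed without fetching an external DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists the sources of the embedded stylesheets. Not part of BFO Publisher. */
final class StyleSources {

    /** Returns the bfo-title of each style element, in document order ("" if it has none). */
    static List<String> titles(String xhtml) {
        try {
            // The default factory is not namespace-aware, so elements are matched
            // by their tag name as written in the file.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList styles = doc.getElementsByTagName("style");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < styles.getLength(); i++) {
                titles.add(((Element) styles.item(i)).getAttribute("bfo-title"));
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Given a file whose <style> elements carry bfo-title="report-1.1.css" and bfo-title="conversion.css", the method returns those two names in that order, which gives a checklist of the stylesheets to review.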

Finally, we should note there is no particular need to do this. The BFO Publisher codebase is ten times the size of the Report Generator; it is a much more capable product, but simple means fast. The Report Generator uses no caching - fonts, images, stylesheets etc are parsed each time they’re used, but the layout is very quick. By contrast, BFO Publisher will cache and share resources like these, but layout is slower. And of course, there will be a development cost to any migration, although for customers with support coverage BFO will certainly be able to help with this.

Our advice would be: if the layout limitations of the Report Generator are causing problems then consider migration, otherwise stay where you are.

Appendix E: Licensing