The CCM-PDF Pipeline
Before this PDF exploration begins in earnest, let's define some terms, like what we mean when we say "CCMS/CCS" content. For most cubicle denizens, normal content works as follows: you have a Word file; then you make a PDF of that file. Congratulations! You have made "normal" content . . or, more specifically, a content deliverable, since the content lives in Word. In CCM, you have a bunch of content files, a set of conditions, and a map that points to those files. You run the map through a processor; it grabs the files (transclusion), applies the conditions (filtering conditional content), and produces the deliverable (PDF, HTML, ePub, IETM, NROFF, etc.). How the map, conditions, and files relate to your product is architecture; how the files are written is the markup. The process by which the map is turned into a deliverable is a build process, but -- somehow in doculand -- it's come to have many names: stylesheet, transformation, formatter, pipeline, report, etc.
Once you get used to the CCMS/CCS way of doing things, you can do revision service for just the files that changed, instead of re-issuing whole deliverables every single rev. You can also make different maps with different conditions - in effect, generating entirely different documents - by making very small changes in a small number of source files. If you have this working well, it can make you look like some sort of wizard. That's "single source," AKA CCM, in its simplest functional definition. Heaven knows there's some fiddle-diddle over definitions in this industry, but that's the short version. DITA, S1000D, DocBook, some wikis, Asciidoc, reStructuredText (sorta), and many others can all do CCM to some extent, but at minimum it requires 1) simple transclusion, 2) component content re-use (selective transclusion), and 3) a usable conditional content mechanism ("if X then show Y").
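To make those three requirements concrete, here's a minimal sketch of a "map" in Asciidoc (the format we'll adopt as our strawman shortly); every file name and condition name below is invented for illustration:

    // book.adoc -- the "map": it owns assembly order and conditions
    :product-tier: pro

    include::chapters/intro.adoc[]

    // conditional content: only appears when "maintenance" is set
    ifdef::maintenance[]
    include::chapters/service-schedule.adoc[]
    endif::maintenance[]

    // selective transclusion: pull one tagged region from a shared component
    include::shared/warnings.adoc[tag=battery]

Run the same map with different conditions set and you get a different deliverable from the same source files - that's the whole trick.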
This CCM approach reaps all sorts of theoretical efficiencies and lends itself to a variety of streamlining practices. Hopefully you'd eventually stop using PDFs for review and adopt something like a git branching model to capture feedback/approvals, with the side effect of broadening your net of contributors. Also hopefully: you'd spool up a build engine to automate output so you don't need headcount for pounding an "Output" button all day. There's all sorts of CMSs to do this stuff as well: some good, some not so good, some unspeakably awful. We're not going to get to any of that today, and we're not going to address issues with the whole concept of CCM (because there's no shortage of those, some are very serious, and wizards aren't real). Today is all about the build process for a PDF deliverable: template mechanisms for turning CCM into PDFs. Weird ones.
The part where you make CCM into PDF has never been what you'd call a smooth highway. It's still a road, of sorts, but it's not very well marked, and the vendor community occasionally booby-traps the alternate routes with punji sticks and ISO standards. The CCM road might even be a dumb road to begin with, being entirely dependent on content architecture and institutional self-knowledge. For now, though, let's pretend our hypothetical CCM architecture has emerged, like Athena from Zeus' brow, perfect and immutable. This'll free us to look at the PDF situation - specifically, at an open PDF template standard that's 1) flexible and 2) usable by normal people. "Usable" meaning "without involving a small army of consultants to figure out things like half-size pages or cross-margin content." The conceptual CCM problem will raise its head again, but we'll be giving it a full treatment in a separate article. With a vengeance, and with math. Stay tuned, but today, PDFs.
Since W3C CSS PMM L3 (the Paged Media Module, the CSS standard for granular print control) never quite came out of its cave as promised back in 2007, "single source" (CCM) analysts tasked with complex PDF requirements have gotten to know the following: dedicated PDF-generation libraries (asciidoctor-pdf and its underlying Prawn), proprietary vendor publishing systems, DBLaTeX and other LaTeX middle layers, DocBook-XSL feeding an FO processor, and - lately - the "web print" engines (Weasy, Paged.js, the Relax descendants).
The exact second that Google (i.e., "The Internet") rolls in CSS PMM L3 with all the promised doodads, that will be the death rattle of the PDF file format. Roll on that happy day. Until then, we have the problem of turning CCM / "single source" content into virtual dead trees, a format that suits it not at all. Fun! We'll be taking a closer look at our weapons - which we met briefly, above - but first, let's pick a CCM/single-source markup to use as a strawman input. Remember, for today, our architecture is perfect - we just need to pick out our strawman markup specification so we can shop around the various types of PDF pipelines. In the process, we'll take a brief peek at the general landscape of CCM markup specifications.
For our strawman CCM format, let's beat up on Asciidoc (aka *.adoc, although according to the working group, that file extension will change). Adoc has the basic attributes of a component content vocabulary, along with a bundle of other neat tricks. It's flat text - an LML - so content is tracked with plain ol' version control: no massive bespoke CMS, no XML merge problem, no Normalization Gremlins waiting in the indents. Flat text is nice - but then, what about the other flat-text LMLs? Aren't they nice? Not nice enough for us. Adoc's lightweight markup siblings either 1) can't make complex PDFs, 2) don't have CCM-like capability, 3) are "forked all to heck", 4) lack inherent semantics, and/or 5) are too dependent on a specific software ecosystem (hello, Sphinx). More importantly, adoc is semantically equivalent to DocBook XML, so what we say about adoc is going to be at least mostly relevant for the other CCM tooling, which is overwhelmingly XML/SGML. DocBook has been around approximately forever, giving it what I call a legacy tail. Adoc-DocBook equivalence dramatically simplifies ETL to/from other XML vocabularies, thanks to that tail, which solves a lot of XML problems.
XML, XML, XML . . what about XML, then? Why not one of those XML thingies that get all the press? All XML vocabularies struggle to some extent with a vanilla VCS due to normalization, parser variance, and other fun times. XML namespaces can be a serious problem in a modern web framework, if they're even being used (and if they're not being used, you're setting yourself up for far worse problems downstream - this is one of the things that killed the Semantic Web). There are other - far more serious - problems with XML, but they lie far away down the information science rabbit hole: datatypes; 1NF; restrictive charset; semantic model; "Hierarchies Are Everything!". Alternatively, just google "XML hate" for amusing nighttime reading.
Most importantly (for us), XML-based candidates are far too strenuous and/or pricey to set up for an exploratory wiggle - or for any wiggle, actually. So much of this ecosystem is vendor-locked that it's sort of painful. Monolithic vocabularies like S1000D and DITA (the "Big XML" vocabularies) have fewer output options due to sheer complexity, overactive customization mechanisms (specialization/BREX/versioning), CGMs ("stealth proprietary entities"), mixed DTD/XSD validation, spotty XML compliance ("Oh. Leading whitespace in attributes. Marvelous."), and a wicked stew of hidden integration requirements (S1000D brings its own architecture with it, and will have a hard time if you're doing something different, like, say, API docs). Other "public" XML vocabularies like OOXML aren't going anywhere near a PDF without a vendor that's squatted on ISO/ECMA working groups for the past 20 years. Poor oManual is just too dang new, although its inventor, Dozuki, is doing the Lord's Work when it comes to maintenance documentation. Godspeed, Dozuki. Yet other XML/SGML-derived document "specs" are little better than undocumented base64 riding around, cowboy style, in elements/external entities/processing instructions (!!) . . combining the efficiency of XML with the transparency of binary formats. Amazing!
Then there's MIL-STD . . <sigh/> . . you know what? This is a terrible subject. Let's take a step back and see why, so we can move on.
Strict conformance validation against a speculative definition is worse than useless. Lightweight markup is taking over the world for a reason: namely, hardly anyone knows what the product is before they have to write about it. Horn's "Information Mapping", taken to extremes, is like building cannons for a ship that no one's designed yet; that analogy describes more pubs cost overruns than I feel comfortable mentioning. The CM / Product hamster wheel is hard enough when it only runs the architecture, but when that churn bleeds into document semantics - the actual day-to-day document markup - it makes a writer's job impossible almost by definition. To use a "normal content" analogy: tying markup to speculative product data is like having MS Office punctuation rules change every day, every project, every deliverable, every ten minutes, or at random whenever the project office feels like it. Every now and then someone might decide against punctuation entirely, but they'd change their minds by EoB and have it brought back as character sets based on Etruscan. By start of business tomorrow. Chop chop!
We're steering away from all that for this discussion. Regarding architecture, we have that perfect theoretical CCM architecture from the Gods. Remember that? The PLM is masterful; our filenames are glorious; our parts data is derived from an ineffable data warehouse; serialized assemblies live in a near-real-time CMMS; warnings and cautions flow from a centralized and certified safety office. It's all generally swell. None of this is real, but we're pretending it is for the moment. As for the markup itself, we're using Asciidoc. There's no Information Mapping here. Adoc semantics don't know CM / Syseng / ILS from Adam. It doesn't enforce inwork status, taxonomy, CSNs, namespaces, systems engineering, reliability analysis, ILS, or, hell, even a schema. Adoc just makes CCM from flat text, with includes, conditionals, transclusion, and some tricks like TextQL. That's it. It leaves everything else to dedicated software, slurping up data when it needs to. And that's why Asciidoc is our strawman format for this exercise: it keeps itself focused on doc semantics and doesn't worry about what the architecture might end up being. Running adoc pipelines is a fun weekend wiggle as opposed to a seven-month, fifteen-million-dollar software project.
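As a taste of that weekend wiggle: the conditions from a map like our earlier sketch can be flipped at build time with the stock Asciidoctor CLI - no code, no CMS (file names invented; -a is the real attribute flag):

    # same source, two deliverables; -a sets/overrides document attributes
    asciidoctor book.adoc -o operator-manual.html
    asciidoctor -a maintenance book.adoc -o maintenance-manual.html

So let's return to Asciidoc, the available PDF tools, and a problem that we can maybe fix.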
Asciidoctor-pdf (based on Prawn) is a good representative of the "dedicated library for PDF generation" type deal. This sucker is unbreakable (particularly compared to DocBook-XSL); Prawn eats anything adoc can throw at it, and - unlike XSL - Prawn's error handling doesn't stdout to a captive black hole. So, fast and good, hooray . . but, downside, its template options don't have the flexibility for complex PDF layouts (not style - layouts - title/subtitle position, position of footnotes, biblio placement, index handling, weird borders, etc.). This is typical of narrowly scoped PDF libraries. You'll need to learn <RANDOM PROGRAMMING LANGUAGE> to crack the hood and wrench on these, but then, even if your hack works, you'll have a harder time in maintenance down the road, and still more work after you hand off the system. Your customers very likely operate in an industry where they can't contribute anything back upstream to OSS, so this hack is going to be off the core project's radar. When the OSS core project updates, no one's checking to see if it wrecks your hack. So you hand it off, the customer is afraid, the core never gets upgraded, things grow moss, and eventually your shiny customized binary becomes the "Holy Relic We Dareth Not Toucheth". Sooner or later IT is going to complain, and that's the end of the line for your system.
Interestingly, this is a common stopping point for proprietary PDF solutions as well; they're a greased weasel out of the box, but then you start customizing heavily, throwing everything from //idstatus or //docinfo into the footer/outer margins/new nonstandard page areas . . or dynamic running content . . or sticking content into the processing layer . . <cough/> . . anyway! The more customization, the more things get clogged and broken (different things!). Also, as we mentioned above, the harder the system is to maintain, the more updates get skipped: moss, risk, shutdown. The customer's going to resent coming back to you for every layout tweak; if they can't figure out how to make those tweaks inside the system, it's a failed system.
There's a "Conservation of Complexity" at work here: focused binaries - both OSS and proprietary - are handling the same amount of complexity on the content layer, so the amount of expertise needed for template customization boils out the same. Either the customer brings 1) programmer-y skills plus three files with twenty keystrokes each or 2) specialized "big IDE" skills plus twenty mouse seeks and twenty keystrokes. I know, I know, the Miracle of Abstraction . . but this problem domain is so narrow that anything abstracted down here isn't going anywhere. It won't be very good abstraction. Then, after all that work abstracting, you have a big IDE laying around. Abstraction has its own overhead; if you don't believe me, look at your giant skull in a mirror, then look at a picture of a monkey. In a proprietary system, the difference at this point- the customization/maintainability problem - is all about the vendor support: a sharp vendor can save the day; a bad vendor leaves you with a problem that you aren't allowed to look at, let alone fix.
Let's try a more Gordian tack. For these two categories - proprietary systems and narrowly scoped systems - you could try to convince your customers to accept the stock PDF template mechanisms. Whatever the doohickey came with, like asciidoctor-pdf's YML. Works great! Unfortunately, by this point, if your customers are pushing this hard for super-specific PDFs, they probably do actually need those fancy customizations. Your customers are probably also furious about something else that you personally had nothing to do with, but we can't fix that. They do need those weird PDFs, though, and maybe we can help with those.
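For scale, here's roughly what staying stock looks like - a fragment of an asciidoctor-pdf theme file. The keys follow the published theming guide, but treat the specifics as a sketch, not gospel:

    # custom-theme.yml -- style-level control: fonts, margins, running content
    extends: default
    page:
      size: Letter
      margin: [0.75in, 1in, 0.75in, 1in]
    base:
      font-size: 10.5
    footer:
      height: 0.65in
      recto:
        right:
          content: '{document-title} | {page-number}'
      verso:
        left:
          content: '{page-number} | {document-title}'

Invoke it with asciidoctor-pdf -a pdf-theme=custom-theme.yml book.adoc. Styling and running content, yes; novel page geometry, not so much - which is exactly the wall those fancy customers hit. Onwards with the dead tree simulator.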
DBLaTeX is, well, LaTeX, but it's also Python-based, which makes it challenging to maintain in a "traditional" controlled Windows environment with no access to Docker, VMs, or WSL. Python version drama adds to that. You might also be walked out by Security for mentioning PyPI - we'll have similar problems with "web print" systems, as we shall soon see. You will need to add TeX to your "general writer skills", not a small hurdle for a typical org. LaTeX is also more of a middle layer in a print pipeline, and for our adoc strawman it's often the second transitional middle layer (adoc -> DocBook -> LaTeX -> PDF, or maybe adoc -> HTML5 -> LaTeX -> PDF). See how these red flags are stacking up? Aside from the complexity and the red flags, LaTeX is a potent solution when it comes to layout - maybe the most potent, with gorgeous print output - but you'll have a heck of a learning curve (although better than that of XSL) and a lot of paperwork to clear with the corporate folks because of its scattered, *NIX-y organization.
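For the record, the adoc route through DBLaTeX is short to type, whatever its maintenance burden - a sketch, with invented file names:

    # adoc -> DocBook -> (dblatex: DocBook -> LaTeX -> PDF)
    asciidoctor -b docbook book.adoc -o book.xml
    dblatex book.xml -o book.pdf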
Assuming we got the overlords' permission, why not write in LaTeX to start with, since it's so good at print? It's a good question, and if your primary output is print, you should consider it. The first answer: "Good luck finding a writer team that'll use it". The other answer? It doesn't have the full gamut of CCM capabilities out of the gate: like Markdown, it depends on a processor to hold condition data; like virtually all markup languages, it doesn't do component re-use (selective transclusion) without very serious wrenching (which I personally have never gotten to work with TeX). Fine, then. Let's say the team is cool with learning TeX; that leaves CCM as the last roadblock requirement. So let's get down to brass tacks: do you need CCM? As we hinted way back at the beginning, the jury is still very much out on whether CCM is ever worth the complexity it requires, particularly since there's a disturbing overlap with modern version control systems ("Isn't CCM what forks are for?"). As previously mentioned, this particular debate is far too involved for a ramble about PDFs. Suffice to say, CCM is a hard requirement for lots of customers, regardless of whether or not it's actually good. Besides, as mentioned in the intro, our theoretical CCM has an architectural perfection as unto the Gods . . for the purpose of this discussion, anyway. This is a bit of a cop-out, but we'll revisit the CCM question in a future essay. For now, we move along, assuming we need CCM in the content layer. Sorry, LaTeX.
And so we come to XSL - probably XSL-FO plus FOP, and DocBook-XSL in the case of our Asciidoc strawman, since adoc is one-for-one with DocBook structure. As terrible as XSL is, this is still DocBook, so it has a very long legacy "tail" going way back into the nineties and beyond. DocBook-XSL is also built into the only extant editor wholly dedicated to Asciidoc, which is a big deal for new adopters. DocBook is not going anywhere, and neither are the other "Big XML" vocabularies, because a surreal amount of money has been spent on them. XSL itself is like the old college buddy with a DoD trust fund who keeps crashing in your living room and hogging the remote. XSL doesn't use RPMs, doesn't leave the house, just sits in its ancient JAR file and farts up the lounger with weird ideas. "Dude," XSL might say, horking down pizza rolls, "what if a state for a process . . was also . . like . . a function?". That's fine, XSL, but can you please turn this S1000D into a PDF that has //@rfu in the center header and //datarest in the outer margin? Right now? "Sure, dude," says XSL, showing the reason you let him crash on your couch. Just remember he's not a full-up language; he needs a buddy to pay the bills and take out the trash. "I'm, like, washing the dishes organically," XSL might say.
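For the curious, here's roughly what "sure, dude" looks like on paper - a toy XSLT 1.0 stylesheet dropping an rfu attribute into the center header via FO static content. A real S1000D stylesheet is several orders of magnitude hairier, and the select pattern here assumes a conveniently findable attribute:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:fo="http://www.w3.org/1999/XSL/Format">
      <xsl:template match="/">
        <fo:root>
          <fo:layout-master-set>
            <fo:simple-page-master master-name="page"
                page-height="11in" page-width="8.5in" margin="0.5in">
              <fo:region-body margin-top="0.75in"/>
              <fo:region-before extent="0.5in"/>
            </fo:simple-page-master>
          </fo:layout-master-set>
          <fo:page-sequence master-reference="page">
            <!-- static-content repeats on every page of the sequence -->
            <fo:static-content flow-name="xsl-region-before">
              <fo:block text-align="center">
                <xsl:value-of select="//@rfu"/>
              </fo:block>
            </fo:static-content>
            <fo:flow flow-name="xsl-region-body">
              <fo:block><xsl:apply-templates/></fo:block>
            </fo:flow>
          </fo:page-sequence>
        </fo:root>
      </xsl:template>
    </xsl:stylesheet>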
XSL also has a very long and scream-y relationship with Java, which you will be spending a lot of time with, because XSL is one of the most convoluted major languages to see wide adoption in the history of computing. XSL inherits the VCS handling problems of XML, to boot, since XSL is itself XML. If you're transforming mixed SGML/XML, boy oh boy, you'd better believe in whitespace, and you'll probably end up integrating some functional programming in your XSL to handle it. Even straight XML - S1000D 3.X - makes use of non-XML-compliant whitespace. You'll often find it easier to fine-tune the JARs (or add parameters) than to mess with the XSL itself (which is why so many XSL pipes use a hacked staging technique, or "multipass", passing output from one XSL to the next, rather than messing with modes). For most non-print XSL jobs you can get away with XPath plus Generic Framework X, or even straight XQuery, which is almost as slick as XSL is awkward. Also, off topic: XQuery is in NPM now, which is well worth a look for all you poor folks tasked with supporting XML frameworks.
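That "multipass" dodge is usually just stage-piping on the command line - normalize or enrich in one pass, generate FO in the next, then hand the result to FOP (tool invocations are real; stylesheet and file names are invented):

    # pass 1: massage the source; pass 2: generate FO; then render
    xsltproc normalize.xsl book.xml > staged.xml
    xsltproc fo.xsl staged.xml > book.fo
    fop -fo book.fo -pdf book.pdf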
Note also that core XSL / FOP development has flatlined; what you've got today is all you're going to get. Occasionally someone will port Saxon to JS or Python. XSL 3.0 - heck, XSL 2.0 - has a tendency to get platform-locked (i.e., only works on Saxon, only works for XEP, only works on Xalan, nothing works on MSXML, try a hacked libxml, etc.). Even assuming 3.0 gets rolling, refactoring XSL . . it's not really refactoring, it's more like "rebuilding", although that's at least partially the fault of XSL's horrific readability. Finally, anyone who knows anything about XSL is going to be mid-to-late career, meaning a smaller, more expensive pool of candidates.
Let's step back a second from this crazy binary/LaTeX/XSL freakshow and see if we can figure out why it's such a mess. Cue the sad music. Aside from islands of rock-solid (but closed-standard) vendor implementations, almost everyone needing print content on this planet has made peace with the limited layouts of CSS+HTML. Alternatively, given the ubiquity of handheld computing, they've ditched the idea of print wholesale, even as an abstract concept. Cross-references, bibliographies, footnotes, indices, version control, and countless similar constructs are far better handled dynamically in software, mitigating the complexity involved in rendering constructs like a "List of Effective Pages" as static output. A PR / changelog / log -p output is a million times more usable and informative than a traditional RevHist / LoEP / @rfu. That's without even mentioning search, data viz, 3d, LMS, predictive maintenance, and all the other cool stuff you can't do well on dead trees. To serve those who can't or won't adapt, content specialists have been keeping old or bespoke tools to hand, with their weird data constructs, praying they neither break nor require change of any kind. We're here talking right now because those prayers failed, very badly, for one or both of us. But the evolutionary drive in content is overwhelmingly moving away from print, which, one suspects, is why the final CSS PMM stalled out into a starburst of @media calls, right in time for the possible end of the open Web itself. And we're out here on the PDF outskirts of the information highway. End the sad music; we've got a job to do.
The story doesn't end here. There's the new(-ish) "random web print engine du jour" - Weasy, Paged.js, whatever they're calling Relax these days, or a zillion others with lots of shared libraries (many of them forks of Relax, actually). They're all basically placeholders for the full CSS PMM, which makes this something of a Holy Grail situation, and very similar to how Prince works under the hood. The problem? Immaturity. These things are still missing a lot of bits. Much like with web 3d content (X3D vs WebGL, most of the time), it's still too early to guess which will become the standard for "print HTML". Maybe one will become the standard, maybe none will. Maybe Adobe pays off the W3C and writes the standard themselves. Maybe ISO standardizes the whole lot and it disappears forever like the Ark of the Covenant (maybe they'll stick the "Web Print" crate right next to the cobwebbed crate for DSSSL). If you guess the wrong standard, you've stuck your org with an orphaned stack until doomsday. And then you're in trouble.
Another thing: these "web print" stacks are not boxed software like the other options; they're modern (i.e., "not finished") software, dependent on package managers: they need infrastructure; you need a way to safely get updates to them; they need a deploy path and a build engine. If you don't have a developer culture in your org, you're going to be standing all of that up from zero. It's not technically hard to do - I was rolling Weasy pipelines in thirty minutes on a "naked" VM, and Asciidoctor-PDF.js (based on Relax, sort of) was even easier - but these stacks are risky for a big company where every dependency update needs to be provisioned, cleared, documented, triple-approved, and on-site, whether they're hosted or bundled (and bundling has its own risks downstream). If your org has a "developer culture", this stuff's already been done for you; you just need to find the group that does it. You can ride on their instance with a few modules, set up some fancy report graphs for compliance, buy IT a pizza, and you're done basically forever. If your org doesn't have that developer culture, you've opened a giant can of worms that will probably get you fired.
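For reference, the thirty-minute Weasy pipeline really is about this small - a sketch that assumes you've already rendered HTML (say, via asciidoctor -b html5) and written a print stylesheet:

    # render_pdf.py -- HTML plus print CSS in, PDF out
    from weasyprint import HTML, CSS

    HTML("book.html").write_pdf(
        "book.pdf",
        stylesheets=[CSS("print.css")],  # @page rules, margin boxes, etc.
    )

The hard part, as discussed, is everything around it: provisioning, approvals, and keeping those PyPI updates flowing.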
On the plus side, "Web Print" - aka HTML5 + CSS + "Template Language Du Jour" - might be the best of all possible worlds when it comes to balancing design elegance and complex print layout, with a bonus that the skills (CSS + JavaScript/Python) are ubiquitous and therefore cheap(er) to staff for. Templating as a procedural language gives you an insane amount of layout control that you can summon from the content layer, and CSS can be reused (sorta) from web output, so you're only styling once. CSS itself is no slouch: you can pull off tricks with CSS that would take you days to figure out in XSL, due in no small part to the superior tooling (although, coming from XSL/TeX, you do lose some fine control of the print - kerning and such). Risks and rewards! With the newer "web print" tech you can roll those dice, but be aware that the downsides are really bad, and the overlords are always watching. As architect you need to guess right - or "right enough" - all the time, every time, seven days a week, fifty-two weeks a year. Unless you're consulting, in which case you're used to seeing an effigy of yourself burnt in a wicker man by the finance ladies. You probably have a little YouTube compilation of Tarannic ceremonies that you enjoy from your giant sofa made of money.
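A taste of that layout control - CSS Paged Media plus the generated-content extensions that Weasy, Prince, and Paged.js implement to varying degrees. Selector and string names here are invented, and feature support varies by engine:

    /* lift the chapter title into the running header */
    h1.chapter { string-set: doctitle content(); }

    @page {
      size: A4;
      margin: 20mm 25mm;
      @top-center   { content: string(doctitle); }
      @bottom-right { content: "Page " counter(page) " of " counter(pages); }
    }

    /* half-size pages - one of the "weird" asks from earlier */
    @page halfpage { size: A5 landscape; }
    section.foldout { page: halfpage; }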
And that's why there's so much old cruft in PDF stacks. The moral: there's no CCM PDF layout pipeline today that doesn't involve some balance of pain, risk, or a dump truck full of cash. For today, you're probably stuck with XSL in some form, but "web print" should be a development target. And, as always, before you even start: stay away from CCM unless you're absolutely, positively, completely sure that your deliverables share 50-70% of their content, and you have a damn clever architecture to re-use that common content. Otherwise you're taking a dive into what, for this family-oriented article, is a very poopy place. We'll take a deeper dive into "The CCM Trap" later, and at length.
Why is PDF so hard? This seems like a ridiculous problem to have in this day and age, but when you stop and think about it, it's not quite so mysterious. We've hinted at the problems of CCM already; those problems leak into the output pipe. PDF is constantly plagued with awesome security vulnerabilities, the management of which routinely breaks PDF pipelines, both CCM and "normal". But PDF itself is inherently a problematic format, occupying a phantom zone between text and graphic, content and data, display and editor (yes, editor! it's used that way, every day). PDFs sit out in front of the user interface, before the interface is even a thing, meaning you get all of the audience, even the ones using brightly colored crayons. PDF formats inherit typewriter requirements - like LoEPs - which go with DVCS / CCM not even a little bit, LoEPs being as they are from the 19th century. A component in a CCM can go literally anywhere; it's a node with N relations. All a PDF knows is pages - a node with two relations, previous/next - and even that is mixed up with a derived metadata model superimposed on the PDF "page" construct, and not in a transparent way, either. Separation of Concerns (SoC) as adapted for content systems has always been clunky, and PDF is where the rubber hits the road; all your files/conditions/maps/layouts/styles get integrated and thrashed against the processor. SoC is a much more fundamental issue, like the CCM problem, but it's an issue all the same. All this complexity, and a PDF still needs to be populated from markup that's simple enough for writers to write in. Complexity has to go somewhere, and a lot of the time no one knows where the complexity of a printed page even came from in the first place. Why is the first letter in a chapter bigger than the others? Well, it's called a Drop Cap or Historiated Initial, and to find out why it's important, you'll need a time machine set to 700 AD Britain. Now add in all the assumptions from fifty-some MIL-STD specs, and you've got a very large pile of crufty requirements that no one can explain.
All of this assumes someone's already hashed out the regulatory requirements, or even the basic print requirements, which is - speaking generously - not always a given. Finally, Adobe - again, speaking generously - is not always acting in good faith when it comes to technology and standards. Yes, PDF is theoretically ISO 32000 (just as OOXML is ISO/IEC 29500), but do you really want to start a frank discussion about the ISO standards process? I don't, because I like being employed, and you probably like being employed too.
All that and - just your luck - PDF is still likely to be THE signed-by-the-grand-poobah official format for a big company. "Weird" PDF layouts are key to unlocking customers from highly regulated - and highly paid - industries. These customers have very specific ideas about how a print document looks, and you can bet your rear quarters they're not going to want to read "no dang internet on no dang tiny tv!" ("Requirement 0.0.1: Do not write in 'Internet'"). You'll always need some aces in your deck to deal with this. Hopefully this scribble has given you a whiff of . . well, if not how to deal with it exactly, then at least an idea of what to expect when you take a bite.
It's not as scary as it sounds, particularly if your organization has boiled out a rock-solid set of requirements - or even a clear idea of what it wants. It's a test of organizational self-knowledge, although not as soul-searching as the CCM - or, G-D forbid, the S1000D - process. Requirements are where 90% of the drama comes from, for this problem, or for any software-related problem, really. More good news: if you're on the fence between "Big XML" and an LML, lightweight markup speeds write/dev/test/review time dramatically. Also, having even a single manager on the customer side who's tech-savvy can make the whole process pretty dang painless. If the customer's a slick techie who knows the latest doodads, even better. Finally, find the doc team headcount who's got the best grasp on the technology and make best friends, so they can smoothly take up the reins when you eventually move on.
"Wait a second," you might say, "to heck with PDFs. Go back to earlier. How do you do single source with lightweight markup?" Now that right there, that's a much happier story. We'll take a look at it in a future scribble, right after we put the math to the CCM approach to see why it fails so often.