Pandoc, the universal document converter, can serve as a good introduction to functional programming with Haskell. For many contributors, including the author of this guide, pandoc was their first real exposure to the language. Despite its size of more than 60,000 lines of Haskell code (excluding the test suite), pandoc remains approachable thanks to its modular architecture, which makes it an interesting subject for learning.
This guide exists to help navigate the large body of source code, to lay out a path that can be followed for learning, and to explain the underlying concepts.
A basic understanding of Haskell and of pandoc’s functionality is assumed.
Pandoc has a publicly accessible git repository on GitHub: https://github.com/jgm/pandoc. To get a local copy of the source:
git clone https://github.com/jgm/pandoc
The source for the main pandoc program is app/pandoc.hs. The source for the pandoc library is in src/, the source for the tests is in test/, and the source for the benchmarks is in benchmark/.
Pandoc has long supported filters, which allow the pandoc abstract syntax tree (AST) to be manipulated between the parsing and the writing phase. Traditional pandoc filters accept a JSON representation of the pandoc AST and produce an altered JSON representation of the AST. They may be written in any programming language, and invoked from pandoc using the --filter option.
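To make the JSON round trip concrete, here is a minimal sketch of a traditional filter written in Python with only the standard library. It relies on the JSON AST encoding in which each element is an object with a "t" (type) key and an optional "c" (contents) key. The function names walk, upper_str, and run_filter are illustrative choices, not part of pandoc, and the API version in the sample document is only a placeholder.

```python
import io
import json


def walk(node, action):
    """Apply `action` to every element object in a pandoc JSON AST, depth first."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        # Rewrite children first, then give `action` a chance to replace the node.
        node = {key: walk(value, action) for key, value in node.items()}
        if "t" in node:
            return action(node)
    return node


def upper_str(elem):
    """Example action: uppercase the text of every Str element."""
    if elem["t"] == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return elem


def run_filter(instream, outstream, action=upper_str):
    """Read a JSON AST from `instream`, transform it, write it to `outstream`."""
    doc = json.load(instream)
    json.dump(walk(doc, action), outstream)


# Demonstration on a tiny hand-written document (the API version is illustrative).
sample = {
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [{"t": "Para", "c": [{"t": "Str", "c": "hello"},
                                   {"t": "Space"},
                                   {"t": "Str", "c": "world"}]}],
}
out = io.StringIO()
run_filter(io.StringIO(json.dumps(sample)), out)
result = json.loads(out.getvalue())
print(result["blocks"][0]["c"][0]["c"])
```

To use this as an actual filter, the demonstration at the bottom would be replaced by a call to run_filter(sys.stdin, sys.stdout), and the executable script (say, a hypothetical upper.py) would be passed to pandoc with the --filter option.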
Although traditional filters are very flexible, they have a couple of disadvantages. First, there is some overhead in writing JSON to stdout and reading it from stdin (twice, once on each side of the filter). Second, whether a filter will work will depend on details of the user’s environment. A filter may require an interpreter for a certain programming language to be available, as well as a library for manipulating the pandoc AST in JSON form. One cannot simply provide a filter that can be used by anyone who has a certain version of the pandoc executable.
Starting with version 2.0, pandoc makes it possible to write filters in Lua without any external dependencies at all. A Lua interpreter (version 5.3) and a Lua library for creating pandoc filters are built into the pandoc executable. Pandoc data types are marshaled to Lua directly, avoiding the overhead of writing JSON to stdout and reading it from stdin.
This document describes pandoc’s handling of JATS.
abstract: <abstract> element.
author: list of article contributors. Each author should have a surname and a given name listed in the entry; if the author has no surname value, then the name item will be used as the contributor’s string-name.
orcid: the contributor’s ORCID identifier.
surname: surname of the contributor. Usually the family name in Western names.
See <surname>.
given-names: personal names of the contributor; this includes middle names (if any) in Western-style names.
See <given-names>.
name: used if author.surname is not available. Tagged with <string-name>.
email: the contributor’s email address.
Used as the contents of the <email> element.
affiliation: either full affiliation entries as described in field affiliation, or a list of affiliation identifiers.
The identifiers link to the organizations with which an author is affiliated. Each identifier in this list must also occur as the id of an affiliation listed in the top-level affiliation list.
If the top-level affiliation field is set, then this entry is assumed to be a list of identifiers; if that field is unset, it is assumed to be a list of full entries.
Full entries must be given if the articleauthoring tag set is used, as affiliation links are not allowed in that schema.
equal-contrib: the equal-contrib attribute, set to yes, is added to the author’s <contrib> element if this is set to a truthy value.
cor-id: see article.author-notes.corresp. If the cor-id value is set, then an <xref> link of ref-type corresp is added. The rid attribute is set to cor-<ID>, where <ID> is the stringified value of this attribute.
affiliation: the list of organizations with which contributors are affiliated. Each institution is added as an <aff> element to the author’s contrib-group.
The fields are given in the order in which they are included in the output.
id: <aff> element’s id value, prefixed with aff-.
group: <institution> element with content-type set to group.
department: <institution> element with content-type set to dept.
organization: <institution> element. The institution element is wrapped in an <institution-wrap> element; any identifiers, like ringgold or ror, are added to the wrapper and must hence belong to this organization (not the department or group).
isni: <institution-id> element with institution-id-type set to ISNI.
ringgold: <institution-id> element with institution-id-type set to Ringgold.
ror: <institution-id> element with institution-id-type set to ROR.
pid: <institution-id> elements. Each item must contain a map with keys type, used as institution-id-type, and id, used as element content.
street-address: <addr-line> element, separated by a comma and space (, ).
city: if street-address is not given, the value is wrapped in a <city> element.
country: <country> element.
country-code: country attribute in the <country> element (if the latter is present).
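The relationship between author entries and the top-level affiliation list can be sketched as a small metadata structure plus a consistency check. The snippet below is only an illustration in Python: the field names come from the descriptions above, while all values and the helper missing_affiliation_ids are hypothetical. It relies on pandoc's --metadata-file option accepting JSON as well as YAML.

```python
import json

# Hypothetical example values; only the field names come from the text above.
metadata = {
    "author": [
        {
            "surname": "Doe",
            "given-names": "Jane",
            "email": "jane.doe@example.org",
            "affiliation": ["inst1"],
        }
    ],
    "affiliation": [
        {
            "id": "inst1",
            "organization": "Example University",
            "department": "Department of Examples",
            "city": "Exampleville",
            "country": "Exampleland",
        }
    ],
}


def missing_affiliation_ids(meta):
    """Return affiliation identifiers used by authors but not defined at top level."""
    known = {str(aff.get("id")) for aff in meta.get("affiliation", [])}
    missing = []
    for author in meta.get("author", []):
        for ref in author.get("affiliation", []):
            # Full entries (maps) are self-contained; only identifiers need a target.
            if not isinstance(ref, dict) and str(ref) not in known:
                missing.append(str(ref))
    return missing


problems = missing_affiliation_ids(metadata)
# Serialized form, suitable for writing to a file passed via --metadata-file.
metadata_json = json.dumps(metadata, indent=2)
print(problems)
```

An empty list means every identifier referenced by an author also occurs as the id of a top-level affiliation, which is the requirement stated for the author-level affiliation field.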
copyright: Copyright and licensing information. This information is rendered via the <permissions> element.
It is recommended to use the license field (described below) for licensing information. If licensing information is included below copyright, then the variables type, link, and text should always be used together.
statement: <copyright-statement>. Use a list for multiple statements.
year: <copyright-year>. Use a list for multiple copyright years. The JATS documentation states that this field need not be used if the year is included in the copyright statement.
holder: <copyright-holder> element. Use a list for multiple copyright holders.
text: <license-p> element.
type: license-type attribute.
link: xlink:href attribute in the <license> element.
date: publication date. This value should usually be a string representation of a date. Pandoc will parse and deconstruct the date into the components given below. It is also possible to pass these components directly.
The publication date is recorded in the document via the <pub-date> element and its sub-elements. The publication-format attribute is always set to electronic.
iso-8601: ISO-8601 representation of the publication date. Used as the value of the <pub-date> element’s iso-8601-date attribute.
This value is set automatically if pandoc can parse the date value as a date.
day, month, year: Day, month, and year of the publication date. Only the publication year is required. The values are used as the contents of the elements with the respective names.
The values are set automatically if pandoc can parse the date value as a date.
type: date-type attribute on the <pub-date> element; defaults to “pub” if not specified.
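The decomposition of a date into its components can be illustrated with a short Python sketch. It is not pandoc's parser: it assumes the date is already an ISO-8601 string and handles no other formats. The function name date_components is hypothetical.

```python
from datetime import date


def date_components(text):
    """Split an ISO-8601 date string into the components listed above."""
    parsed = date.fromisoformat(text)  # raises ValueError for other formats
    return {
        "iso-8601": parsed.isoformat(),
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
    }


components = date_components("2021-03-09")
print(components)
```

A map of this shape could be passed as the date field directly when the components should be controlled by hand rather than derived from a date string.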
article: information concerning the article that identifies or describes it. The key-value pairs within this map are typically used within the <article-meta> element.
publisher-id: <article-id> element with attribute pub-id-type set to publisher-id.
doi: <article-id> element with attribute pub-id-type set to doi.
pmid: <article-id> element with attribute pub-id-type set to pmid.
pmcid: <article-id> element with attribute pub-id-type set to pmcid.
art-access-id: <article-id> element with attribute pub-id-type set to art-access-id.
heading: <subject> element, nested in a <subj-group> element which has heading as its subj-group-type attribute.
categories: <subject> element, grouped in a single <subj-group> element with its subj-group-type attribute set to categories.
author-notes: Additional information about authors, like conflict of interest statements and corresponding author contact info. Wrapped in an <author-notes> element.
conflict: footnote (<fn>) of fn-type conflict.
con: footnote (<fn>) of fn-type con.
corresp: correspondence information with the fields id and email. The info is then rendered via a <corresp> element.
funding-statement: funding-statement element.
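The link between an author's cor-id and the correspondence information can be sketched as follows. This Python snippet assumes that corresp is a list of maps, each with an id and an email; that shape, the sample values, and the helper names corresp_rid and find_corresp are assumptions made for illustration.

```python
def corresp_rid(cor_id):
    """Build the rid value described for the cor-id field: cor-<ID>."""
    return f"cor-{cor_id}"


def find_corresp(meta, cor_id):
    """Look up the corresp item whose id matches an author's cor-id."""
    items = meta.get("article", {}).get("author-notes", {}).get("corresp", [])
    for item in items:
        # Compare stringified values, since identifiers may be numbers or strings.
        if str(item.get("id")) == str(cor_id):
            return item
    return None


# Hypothetical metadata; only the field names are taken from the text above.
meta = {
    "author": [{"surname": "Doe", "given-names": "Jane", "cor-id": 1}],
    "article": {
        "author-notes": {
            "corresp": [{"id": 1, "email": "jane.doe@example.org"}],
        }
    },
}

rid = corresp_rid(meta["author"][0]["cor-id"])
entry = find_corresp(meta, meta["author"][0]["cor-id"])
print(rid, entry["email"])
```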
journal: information on the journal in which the article is published. This must be a map; the following key/value pairs are recognized.
publisher-id: <journal-id> with attribute journal-id-type set to publisher-id.
nlm-ta: <journal-id> with attribute journal-id-type set to nlm-ta.
pmc: <journal-id> with attribute journal-id-type set to pmc.
title: <journal-title> element.
abbrev-title: <abbrev-journal-title> element.
pissn: <issn> element with the publication-format attribute set to print.
eissn: <issn> element with the publication-format attribute set to electronic.
publisher-name: <publisher-name> element.
publisher-loc: <publisher-loc> element.
license: Article licensing information. Each item of this field is rendered as a <license> element within the <permissions> element.
Item content should be either a single paragraph, or a map with the fields listed below.
text: <license-p> element.
type: license-type attribute.
link: xlink:href attribute in the <license> element.
notes: <notes> element.
subtitle: <subtitle> element.
tags: <kwd> element; the elements are grouped in a <kwd-group> with the kwd-group-type value author.
title: <article-title> element.
Pandoc’s handling of org files is similar to that of Emacs org-mode. This document aims to highlight the cases where this is not possible or just not the case yet.
The following export keywords are supported:
AUTHOR: comma-separated list of author(s); fully supported.
CREATOR: output generator; passed as plain-text metadata entry creator, but not used by any default templates.
DATE: creation or publication date; well supported by pandoc.
EMAIL: author email address; passed as plain-text metadata field email, but not used by any default templates.
LANGUAGE: document language; included as plain-text metadata field lang. The value should be a BCP47 language tag.
SELECT_TAGS: tags which select a tree for export.
EXCLUDE_TAGS: tags which prevent a subtree from being exported. Fully supported.
TITLE: document title; fully supported.
EXPORT_FILE_NAME: target filename; unsupported, the output defaults to stdout unless a target is given as a command line option.
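In Org source, these keywords appear as lines of the form #+KEY: value. The following Python sketch collects such lines into a dictionary to show what the keywords look like in practice. It is not how pandoc's Org reader works, and the function name export_keywords and the sample text are hypothetical.

```python
import re

# Matches lines such as "#+TITLE: Demo"; the key is captured without "#+".
KEYWORD_LINE = re.compile(r"^#\+(\w+):[ \t]*(.*)$", re.MULTILINE)


def export_keywords(org_text):
    """Collect `#+KEY: value` lines into a dict; a later duplicate overrides."""
    return {
        match.group(1).upper(): match.group(2).strip()
        for match in KEYWORD_LINE.finditer(org_text)
    }


sample = "#+TITLE: Demo\n#+AUTHOR: Jane Doe, John Roe\n#+LANGUAGE: en-GB\n\nBody text.\n"
keywords = export_keywords(sample)
# AUTHOR holds a comma-separated list, as described above.
authors = [name.strip() for name in keywords["AUTHOR"].split(",")]
print(keywords["TITLE"], authors)
```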