Convert HTML to Markdown

neo, created 1 month ago, view count: 10
pandoc -f html -t markdown hello.html

pandoc

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, Haddock markup, OPML, Emacs Org mode, DocBook, txt2tags, EPUB, ODT and Word docx; and it can write plain text, Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, reStructuredText, XHTML, HTML5, LaTeX (including beamer slide shows), ConTeXt, RTF, OPML, DocBook, OpenDocument, ODT, Word docx, GNU Texinfo, MediaWiki markup, DokuWiki markup, ZimWiki markup, Haddock markup, EPUB (v2 or v3), FictionBook2, Textile, groff man pages, Emacs Org mode, AsciiDoc, InDesign ICML, TEI Simple, and Slidy, Slideous, DZSlides, reveal.js or S5 HTML slide shows. It can also produce PDF output on systems where LaTeX, ConTeXt, or wkhtmltopdf is installed.

Pandoc's enhanced version of Markdown includes syntax for footnotes, tables, flexible ordered lists, definition lists, fenced code blocks, superscripts and subscripts, strikeout, metadata blocks, automatic tables of contents, embedded LaTeX math, citations, and Markdown inside HTML block elements. (These enhancements, described further under Pandoc's Markdown, can be disabled using the markdown_strict input or output format.)

In contrast to most existing tools for converting Markdown to HTML, which use regex substitutions, pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document, and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer.
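The intermediate representation described above can be inspected directly, because pandoc also ships a writer named native. The following is a minimal sketch, assuming pandoc is on the PATH; the helper name show_native is made up for illustration, and the exact output differs between pandoc versions.

```shell
# show_native: hypothetical helper, not part of pandoc.
# Reads Markdown on stdin, runs the markdown reader, and prints
# the document with the "native" writer instead of a real target format.
show_native() {
  pandoc -f markdown -t native
}
```

For example, printf '# Hello\n\nSome *text*.\n' | show_native prints a structure made of elements such as Header, Para, and Emph. Swapping -t native for another writer renders that same structure in a different format.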

Because pandoc's intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc's simple document model. While conversions from pandoc's Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc's Markdown can be expected to be lossy.  

Using pandoc

If no input-file is specified, input is read from stdin. Otherwise, the input-files are concatenated (with a blank line between each) and used as input. Output goes to stdout by default (though output to stdout is disabled for the odt, docx, epub, and epub3 output formats). For output to a file, use the -o option:

pandoc -o output.html input.txt

By default, pandoc produces a document fragment, not a standalone document with a proper header and footer. To produce a standalone document, use the -s or --standalone flag:

pandoc -s -o output.html input.txt

For more information on how standalone documents are produced, see Templates, below.
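The difference between a fragment and a standalone document is easiest to see side by side. This is a small sketch, assuming pandoc is installed; the helper names to_fragment and to_standalone are hypothetical and exist only for this comparison.

```shell
# Hypothetical helpers; both read Markdown on stdin and write HTML to stdout.

# Default behaviour: only the converted body, no header or footer.
to_fragment() {
  pandoc -f markdown -t html
}

# With -s, the body is wrapped in a complete HTML document.
to_standalone() {
  pandoc -s -f markdown -t html
}
```

Running printf 'Hello *world*\n' | to_fragment should print just <p>Hello <em>world</em></p>, while the same input piped through to_standalone is wrapped in html, head, and body elements. Recent pandoc versions may also print a warning about a missing title on stderr in the standalone case.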

Instead of a file, an absolute URI may be given. In this case pandoc will fetch the content using HTTP:

pandoc -f html -t markdown http://www.fsf.org

As noted above, multiple input files are concatenated (with blank lines between them) before parsing. This concatenation is disabled for binary input formats such as EPUB, odt, and docx.

The format of the input and output can be specified explicitly using command-line options. The input format can be specified using the -r/--read or -f/--from options, the output format using the -w/--write or -t/--to options. Thus, to convert hello.txt from Markdown to LaTeX, you could type:

pandoc -f markdown -t latex hello.txt

To convert hello.html from HTML to Markdown:

pandoc -f html -t markdown hello.html

Supported output formats are listed below under the -t/--to option. Supported input formats are listed below under the -f/--from option. Note that the rst, textile, latex, and html readers are not complete; there are some constructs that they do not parse.

If the input or output format is not specified explicitly, pandoc will attempt to guess it from the extensions of the input and output filenames. Thus, for example,

pandoc -o hello.tex hello.txt

will convert hello.txt from Markdown to LaTeX. If no output file is specified (so that output goes to stdout), or if the output file's extension is unknown, the output format will default to HTML. If no input file is specified (so that input comes from stdin), or if the input files' extensions are unknown, the input format will be assumed to be Markdown unless explicitly specified.
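The fallback rules above can be summarised as a small lookup. The function below is only an illustration of those rules, not pandoc's actual code: guess_format is a hypothetical name, and the real extension table covers many more formats than the four listed here.

```shell
# guess_format DIRECTION FILE: illustrative sketch of the guessing rules.
# DIRECTION is "input" or "output"; FILE may be empty (stdin/stdout).
guess_format() {
  direction=$1
  file=$2
  case "$file" in
    *.tex)  echo latex ;;
    *.html) echo html ;;
    *.md)   echo markdown ;;
    *.rst)  echo rst ;;
    *)
      # No file, or an extension this sketch does not list:
      # input falls back to Markdown, output falls back to HTML.
      if [ "$direction" = input ]; then
        echo markdown
      else
        echo html
      fi
      ;;
  esac
}
```

With this sketch, guess_format output hello.tex prints latex, guess_format input hello.txt prints markdown, and guess_format output "" prints html, matching the three cases described above.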

Pandoc uses the UTF-8 character encoding for both input and output. If your local character encoding is not UTF-8, you should pipe input and output through iconv:

iconv -t utf-8 input.txt | pandoc | iconv -f utf-8

Note that in some output formats (such as HTML, LaTeX, ConTeXt, RTF, OPML, DocBook, and Texinfo), information about the character encoding is included in the document header, which will only be included if you use the -s/--standalone option.  
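The iconv pipeline can be made explicit about the local encoding on both ends. The sketch below assumes ISO-8859-1 as a stand-in for the local encoding; LOCAL_ENC and the three function names are hypothetical, so substitute whatever your system actually uses.

```shell
# Assumed local encoding; change this to match your system.
LOCAL_ENC=ISO-8859-1

# Convert stdin from the local encoding to UTF-8 (what pandoc expects).
to_utf8() {
  iconv -f "$LOCAL_ENC" -t UTF-8
}

# Convert pandoc's UTF-8 output back to the local encoding.
from_utf8() {
  iconv -f UTF-8 -t "$LOCAL_ENC"
}

# Full round trip; extra arguments are passed through to pandoc.
pandoc_local() {
  to_utf8 | pandoc "$@" | from_utf8
}
```

A call such as pandoc_local -f markdown -t html < input.txt > output.html then reads and writes files in the local encoding while pandoc itself only ever sees UTF-8.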

Creating a PDF

To produce a PDF, specify an output file with a .pdf extension. By default, pandoc will use LaTeX to convert it to PDF:

pandoc test.txt -o test.pdf

Production of a PDF requires that a LaTeX engine be installed (see --latex-engine, below), and assumes that the following LaTeX packages are available: amsfonts, amsmath, lm, ifxetex, ifluatex, eurosym, listings (if the --listings option is used), fancyvrb, longtable, booktabs, graphicx and grffile (if the document contains images), hyperref, ulem, geometry (with the geometry variable set), setspace (with linestretch), and babel (with lang). The use of xelatex or lualatex as the LaTeX engine requires fontspec; xelatex uses mathspec, polyglossia (with lang), xecjk, and bidi (with the dir variable set). The upquote and microtype packages are used if available, and csquotes will be used for smart punctuation if added to the template or included in any header file. The natbib, biblatex, bibtex, and biber packages can optionally be used for citation rendering. These are included with all recent versions of TeX Live.

Alternatively, pandoc can use ConTeXt or wkhtmltopdf to create a PDF. To do this, specify an output file with a .pdf extension, as before, but add -t context (for ConTeXt) or -t html5 (for wkhtmltopdf) to the command line.

PDF output can be controlled using variables for LaTeX (if LaTeX is used) and variables for ConTeXt (if ConTeXt is used). If wkhtmltopdf is used, then the variables margin-left, margin-right, margin-top, margin-bottom, and papersize will affect the output, as will --css.  
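The three PDF routes differ only in the -t argument and the variables passed along. The wrapper below is a sketch under assumptions: make_pdf and the PANDOC override are hypothetical conveniences, and the margin and paper size values are sample values set with pandoc's -V option.

```shell
# make_pdf ENGINE INPUT OUTPUT: hypothetical wrapper, not a pandoc feature.
# Set PANDOC=echo to preview the arguments without running pandoc.
make_pdf() {
  engine=$1
  input=$2
  output=$3
  case "$engine" in
    latex)
      # Default route: LaTeX is used when the output ends in .pdf.
      "${PANDOC:-pandoc}" "$input" -o "$output" ;;
    context)
      "${PANDOC:-pandoc}" "$input" -t context -o "$output" ;;
    wkhtmltopdf)
      # Sample variable values; adjust margins and paper size as needed.
      "${PANDOC:-pandoc}" "$input" -t html5 \
        -V margin-left=2cm -V papersize=a4 -o "$output" ;;
    *)
      echo "unknown engine: $engine" >&2
      return 1 ;;
  esac
}
```

For instance, PANDOC=echo make_pdf context test.txt test.pdf prints test.txt -t context -o test.pdf, which is the argument list pandoc would receive. Each route still requires the corresponding program (a LaTeX engine, ConTeXt, or wkhtmltopdf) to be installed.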
