Introduction

UpDoc is an Umbraco package that lets editors create content documents by extracting content from external sources — PDFs, web pages, and markdown files — instead of manually copying and pasting.

Organisations often have content that already exists in another format: travel brochure PDFs, legacy website pages, specification documents, marketing collateral. Getting that content into Umbraco means opening the source, copying text field by field, formatting it, and pasting it into the right properties. It’s slow, repetitive, and error-prone.

UpDoc adds a “Create from Source” option to Umbraco’s content section. An editor selects a source document, and UpDoc extracts the content, maps it to the correct fields in a document blueprint, and creates a new Umbraco document — populated and ready to publish.

The extraction is driven by workflows — configurable pipelines that define:

  • Where content comes from (the source — a PDF in the media library, a URL, or a markdown file)
  • How to interpret it (transform rules that identify headings, body text, lists, and group them into meaningful sections)
  • Where it goes (the destination — which blueprint fields and block grid properties receive which content)
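
The three parts of a workflow can be pictured as a small data model. The sketch below is illustrative only: the class and field names (`Workflow`, `Source`, `TransformRule`, `Mapping`, and their attributes) are assumptions made for this example, not UpDoc's actual schema. It also shows one check such a model makes easy: finding sections that rules produce but no mapping consumes.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    kind: str         # "pdf", "web", or "markdown"
    location: str     # media path or URL (hypothetical field)


@dataclass
class TransformRule:
    section: str      # name of the section this rule produces
    description: str  # human-readable condition (hypothetical field)


@dataclass
class Mapping:
    section: str      # transformed section to read from
    destination: str  # blueprint field or block property to fill


@dataclass
class Workflow:
    source: Source
    rules: list[TransformRule] = field(default_factory=list)
    mappings: list[Mapping] = field(default_factory=list)

    def unmapped_sections(self) -> list[str]:
        """Sections that rules produce but no mapping consumes."""
        mapped = {m.section for m in self.mappings}
        return [r.section for r in self.rules if r.section not in mapped]


# Sample values are invented for the demonstration.
workflow = Workflow(
    source=Source(kind="pdf", location="media/brochure.pdf"),
    rules=[
        TransformRule("title", "largest heading on the first page"),
        TransformRule("itinerary", "body text under the itinerary heading"),
    ],
    mappings=[Mapping("title", "pageTitle")],
)
print(workflow.unmapped_sections())
```

Here the `itinerary` section is reported because no mapping sends it anywhere, which is the kind of gap a workflow author would want to catch before editors start using the workflow.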

UpDoc has two audiences:

Editors use UpDoc through the familiar Umbraco content section. They right-click a content node (or use the collection toolbar button), choose “Create from Source”, select a source document, and get a new page. No technical knowledge needed — the workflows are already configured for them.

Workflow authors (typically developers or site administrators) set up the extraction workflows in the Umbraco Settings section. They define the transform rules that tell UpDoc how to interpret each source format, and map extracted sections to destination fields. This is a one-time setup per document type — once configured, editors can create as many documents as they need.

| Source type | How it works |
| --- | --- |
| PDF | Upload a PDF to the Umbraco media library. UpDoc extracts text with full metadata (font size, colour, position) for precise rule matching. |
| Web page | Provide a URL. UpDoc fetches the page and extracts content using the HTML structure (tags, CSS classes, containers) for rule matching. |
| Markdown | Upload a markdown file to the media library. UpDoc parses the heading structure for straightforward section identification. |
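
For the markdown case, identifying sections by heading can be sketched in a few lines. This is a simplified illustration of the general technique, not UpDoc's parser: the function name `split_by_headings` is hypothetical, and the sketch only recognises ATX headings (lines starting with `#`), ignoring Setext headings and fenced code blocks.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Group non-blank lines under the nearest preceding ATX heading.

    Lines that appear before the first heading are dropped.
    """
    sections = []
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = {
                "level": len(match.group(1)),
                "heading": match.group(2),
                "body": [],
            }
            sections.append(current)
        elif current is not None and line.strip():
            current["body"].append(line.strip())
    return sections


# Sample content is invented for the demonstration.
sample = """# Tuscany Walking Tour
Seven days through hill towns.

## Itinerary
- Day 1: Florence
- Day 2: Siena
"""

for section in split_by_headings(sample):
    print(section["level"], section["heading"], section["body"])
```

Each resulting section carries its heading level, so a later rule could, for instance, treat the single level-1 heading as a page title and level-2 headings as content blocks.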

UpDoc is a Razor Class Library (RCL) distributed as a NuGet package. Install it into any Umbraco 17+ project:

dotnet add package Umbraco.Community.UpDoc

No database tables, no configuration files to create manually. UpDoc stores its workflow configuration as JSON files in the updoc/workflows/ directory — human-readable, git-trackable, and compatible with deployment tools like uSync and Umbraco Deploy.
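
Because the workflows are plain JSON on disk, they can be inspected with ordinary tooling. The sketch below parses every JSON file under a workflows directory. It makes no assumption about the keys inside each file; the function name `load_workflow_files` and the sample file are invented for the demonstration, and the exact folder layout under updoc/workflows/ may differ from what is shown.

```python
import json
import tempfile
from pathlib import Path


def load_workflow_files(root: Path) -> dict[str, object]:
    """Parse every *.json file under root, keyed by relative path."""
    loaded = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.relative_to(root).as_posix()] = json.load(handle)
    return loaded


# Demo against a throwaway directory; the file name and content are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "updoc" / "workflows"
    root.mkdir(parents=True)
    (root / "example.json").write_text('{"note": "placeholder"}', encoding="utf-8")
    workflows = load_workflow_files(root)

print(sorted(workflows))
```

The same property is what makes the files easy to review in a pull request or diff between environments.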

| Concept | What it means |
| --- | --- |
| Workflow | A complete pipeline connecting a source type to a destination blueprint. One workflow per source type per blueprint. |
| Source | The external document to extract content from (PDF, web page, or markdown file). |
| Destination | The Umbraco document blueprint that defines the target structure: fields, block grids, and block lists. |
| Transform rules | Rules that identify and classify extracted content (e.g., “text in 36pt blue Helvetica is a tour title”). |
| Mapping | The wiring between a transformed source section and a destination field or block property. |
| Extraction | The raw output from parsing a source document: every text element with its metadata. |
| Transform | The shaped output after applying rules: meaningful sections with headings, content, and descriptions. |
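
The relationship between extraction, transform rules, and mapping can be sketched as follows. Everything here is hypothetical: the element fields (`text`, `font_size`, `colour`), the rule format, and the destination aliases (`tourTitle`, `summary`) are assumptions chosen to mirror the "36pt blue text is a tour title" example, not UpDoc's real data model.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """One extracted piece of text with its metadata (assumed fields)."""
    text: str
    font_size: float
    colour: str


# Extraction: raw elements in reading order (sample data, invented).
extraction = [
    TextElement("Tuscany Walking Tour", 36, "blue"),
    TextElement("Seven days through hill towns.", 11, "black"),
    TextElement("Small group, expert guide.", 11, "black"),
]

# Transform rules: the first rule whose predicate matches names the section.
rules = [
    ("title", lambda e: e.font_size == 36 and e.colour == "blue"),
    ("description", lambda e: e.font_size <= 12),
]

# Mapping: section name -> destination field alias.
mapping = {"title": "tourTitle", "description": "summary"}


def transform(elements, rules):
    """Classify each element with the first matching rule."""
    sections = {}
    for element in elements:
        for name, matches in rules:
            if matches(element):
                sections.setdefault(name, []).append(element.text)
                break
    return sections


def apply_mapping(sections, mapping):
    """Join each mapped section's text and key it by destination field."""
    return {
        mapping[name]: " ".join(texts)
        for name, texts in sections.items()
        if name in mapping
    }


sections = transform(extraction, rules)
document = apply_mapping(sections, mapping)
print(document)
```

Keeping the three stages separate means a rule can be adjusted without touching the mapping, and the same transformed sections could be mapped to a different blueprint.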