Tools and Technologies
Overview
This section provides background on some of the key tools and methodologies used in our digital regulatory infrastructure projects.
This page is written for a non-technical audience; it explains software and applications that developers will already know well.
We are excited to explore how to make these tools approachable for non-technical people to amplify their knowledge and experience and accelerate their existing workflows.
Why these tools matter
Combining DocRef, MCP servers, and other approaches and applications means that digital regulatory infrastructure can be supplemented by:
- Precise Citations - Every human or AI-retrieved piece of content includes exact source references
- Verifiability - Human experts can verify AI outputs or human-authored work products against original sources
- Transparency - Audit trails of what content was accessed and used
- Quality Control - Structured data enables systematic validation
- Reproducibility - Others can follow the same process with the same tools (although results will not be identical given the nature of LLM systems)
List of tools and technologies
DocRef
DocRef is a digital regulatory infrastructure system that converts regulatory text into structured, machine-readable data with precise, granular citations. You can find out more here.
Key capabilities include:
- Machine-readable format for regulatory documents
- Download regulatory documents in formats like HTML, CSV (spreadsheet) and JSON
- Granular citation system (unique URLs for every paragraph, table cell, etc)
- Annotation and tagging system for expert review
- Hierarchical modeling of relationships between parent and child document elements (e.g., a subsection within a section)
DocRef MCP Servers (Model Context Protocol)
Model Context Protocol (MCP) servers provide AI systems with access to structured regulatory data through a GraphRAG (retrieval augmented generation using graph database structures) approach. You can find out more here.
Key features:
- Semantic search across regulatory content
- Hierarchical context retrieval
- Citation-based verification
- Graph database queries
Granular hierarchical identifiers
Granular hierarchical identifiers (GHIs) are the unique codes that DocRef assigns to every element in a document. Think of them as precise addresses for specific parts of a document, similar to how a postal address gets more specific as you add the street number, unit number, and room.
How they work:
A GHI like part1-section2-para6-1-b tells you exactly where to find a piece of content:
- part1 — Part 1 of the document
- section2 — Section 2 within that part
- para6 — Paragraph 6 within that section
- 1-b — Sub-paragraph (1)(b) within that paragraph
Why this matters for digital regulatory infrastructure:
- Every paragraph, bullet point, table cell, and definition gets its own unique identifier
- Traditional documents let you reference "paragraph 6" or "section 2", but these references can be ambiguous—GHIs provide machine-readable precision
- Software systems—including AI—can locate exact content, track changes between document versions, and maintain reliable links over time
- When AI retrieves information, these identifiers travel with the content, so you always know exactly where it came from
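To make this concrete, here is a small sketch of how software might split a GHI into its hierarchy levels. This is illustrative only: the prefixes recognised here (part, section, para) are taken from the example above, and DocRef's actual parsing rules may differ.

```python
def parse_ghi(ghi: str) -> list[str]:
    """Split a granular hierarchical identifier into its hierarchy levels.

    Illustrative sketch only; DocRef's real rules may differ. Pieces that
    name a level (part, section, para) start a new level; trailing pieces
    like "1-b" are treated as sub-paragraph labels on the last level.
    """
    levels = []
    for piece in ghi.split("-"):
        if piece.startswith(("part", "section", "para")):
            levels.append(piece)         # a new hierarchy level
        elif levels:
            levels[-1] += "-" + piece    # a sub-label, e.g. "(1)(b)"
        else:
            levels.append(piece)
    return levels

# parse_ghi("part1-section2-para6-1-b")
# → ["part1", "section2", "para6-1-b"]
```

Once an identifier is split into levels like this, software can navigate from any element up through its parents, which is exactly what the parent-child relationships in our project databases record.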
Example from our projects: In the Identification Management project, the system tracked 10,208 parent-child relationships between document elements—sections containing subsections, subsections containing paragraphs, and so on. Instead of saying "see the authentication requirements in the standards", you can link directly to a specific control like AA9.04 and anyone (or any software) can retrieve exactly that content.
URL (uniform resource locator)
A URL is a web address—the text you type into a browser to visit a website, like https://docref.digital.govt.nz. URLs are fundamental to how the internet works, allowing any resource to be located and accessed.
How DocRef uses URLs:
DocRef combines standard URLs with granular hierarchical identifiers to create precise, clickable links to specific document elements. For example:
https://docref.digital.govt.nz/nz/dia-apis-parta/2022/en/#part2-section1-para3
This URL takes you directly to a specific paragraph within a specific section of the API Guidelines. You can click the link and your browser scrolls straight to that content. The URL tells you: it's from the NZ DIA API Guidelines, Part A, 2022 version, English language, Part 2, Section 1, Paragraph 3.
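The same decomposition can be done by software. Here is a small sketch using Python's standard URL tools; the meaning assigned to each path segment (jurisdiction, document, year, language) is inferred from the example above, not from a published DocRef specification.

```python
from urllib.parse import urlsplit

def split_docref_url(url: str) -> dict:
    """Separate a DocRef-style URL into its document path and the granular
    hierarchical identifier carried in the fragment (the part after '#').

    Illustrative only: the path layout is inferred from one example URL.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "host": parts.netloc,
        "path_segments": segments,    # e.g. jurisdiction, document, year, language
        "element_id": parts.fragment, # the granular hierarchical identifier
    }

info = split_docref_url(
    "https://docref.digital.govt.nz/nz/dia-apis-parta/2022/en/#part2-section1-para3"
)
# info["element_id"] == "part2-section1-para3"
```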
Why this matters:
- You can share a link to a specific paragraph, not just a whole document
- URLs include version information, so you can tell if the source has been updated
- When AI generates content with citations, each citation is a clickable link back to the exact source—transforming vague references into verifiable links that anyone can check
Example from our projects: The API Standard project produced 280 citations, each linking to a specific location in the source documents.
Graph databases
A graph database stores information as a network of connected points (called "nodes") rather than in traditional rows and columns like a spreadsheet. Imagine a mind map or a family tree—each item is connected to related items, and you can follow the connections to explore relationships.
Why this matters:
- Documents aren't flat—they have structure (sections within parts, paragraphs within sections) and relationships (cross-references, related concepts)
- Graph databases naturally represent these connections
- AI can "walk" through the graph to find related content, not just search for keywords
Key concepts:
- Nodes: Individual pieces of content (a paragraph, a definition, a table cell)
- Relationships: Connections between nodes (e.g., "CHILD_OF" for hierarchy, "SEMANTIC_SIMILARITY" for related meaning)
- Properties: Information attached to each node (the text content, its identifier, tags, annotations)
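The concepts above can be sketched in a few lines of ordinary code. This toy example models nodes, properties, and CHILD_OF relationships with Python dictionaries, then "walks" from one node up to the top of the document; a real system would use a graph database such as Neo4j, and the node names and text here are invented for illustration.

```python
# A toy in-memory graph: nodes carry properties; child_of records hierarchy.
# Illustrative only; real projects store this in a graph database (Neo4j).
nodes = {
    "part2":                {"text": "Part 2: Security"},
    "part2-section1":       {"text": "Section 1: Authentication"},
    "part2-section1-para3": {"text": "Use HTTPS for all connections."},
}
child_of = {  # CHILD_OF relationships: child -> parent
    "part2-section1": "part2",
    "part2-section1-para3": "part2-section1",
}

def ancestors(node_id: str) -> list[str]:
    """Walk CHILD_OF relationships from a node up to the document root."""
    chain = []
    while node_id in child_of:
        node_id = child_of[node_id]
        chain.append(node_id)
    return chain

# ancestors("part2-section1-para3") → ["part2-section1", "part2"]
```

Following connections like this, rather than scanning rows in a table, is what lets AI retrieve a paragraph together with the section and part that give it context.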
Example from our projects: The Identification Management project database contains 9,374 nodes from 30 source documents, connected by 10,208 hierarchical links (showing document structure) and 89,735 semantic similarity links (showing which pieces of content are related by meaning).
Neo4j, the graph database we use, has introductory resources explaining graph concepts. You can also learn more about graph databases on Wikipedia.
Semantic search
Semantic search finds content based on meaning rather than exact word matches. Traditional search requires you to guess the exact words used in a document. Semantic search understands that "authentication" and "verifying user identity" mean similar things, even though they use different words.
How it works (simplified):
- Each piece of content is converted into a mathematical representation (called a "vector embedding") that captures its meaning
- When you search, your query is also converted into this mathematical form
- The system finds content whose mathematical representation is closest to your query's
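The three steps above can be sketched with toy numbers. Real embedding models produce vectors with hundreds of dimensions; the stand-in below just counts a few hand-picked keywords, which is enough to show how "closeness" between vectors is measured (here with cosine similarity).

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts a few hand-picked
    keywords. Purely illustrative; real embeddings come from trained
    models and capture meaning far more richly."""
    keywords = ["identity", "verify", "login", "error"]
    return [float(text.lower().count(k)) for k in keywords]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = embed("verify user identity")
doc1 = embed("authentication means to verify the identity of a login")
doc2 = embed("handle every fault and report the error")
# doc1 scores higher than doc2 against the query
```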
Why this matters:
- You can find relevant content even when different documents use different terminology
- Searching for "how to handle errors" will find content about "exception management" or "fault tolerance"
- AI can discover connections between documents that might be missed by keyword search
Example from our projects: In the API Standard project, a search for "OAuth authentication and authorization" found 40 relevant pieces across all three parts of the source guidelines—even when the source text used different phrasing like "access tokens" or "security credentials."
Natural language vs machine languages
Natural language is how humans communicate: English, te reo Māori, and other languages people speak and write. It's flexible, expressive, and sometimes ambiguous—"run" can mean a sprint, operating a program, or a sequence of events, depending on context.
Machine languages (also called programming languages or code) are strict, unambiguous instructions for computers. They use precise syntax where every character matters. For example, { and } have specific meanings—they're not punctuation, they're instructions. A missing comma or quotation mark makes the code fail completely.
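You can see this strictness in action with a tiny experiment. The example below uses Python's built-in JSON parser (chosen as a convenient illustration; the same point applies to any machine language): a string that any human would still understand is rejected outright because one quotation mark is missing.

```python
import json

valid = '{"title": "API Security Requirements"}'
broken = '{"title": "API Security Requirements'  # missing closing quote and brace

# The well-formed input parses into usable data...
print(json.loads(valid)["title"])

# ...but one missing character makes the whole input fail, not just one field.
try:
    json.loads(broken)
except json.JSONDecodeError as err:
    print("Rejected:", err.msg)
```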
Why this matters for digital regulatory infrastructure:
- Regulatory documents are written in natural language - they need to be readable and interpretable by humans, including non-specialists
- Software systems that interpret regulations need machine languages - they require precise, unambiguous definitions so computers can process rules consistently
- The translation between them is critical - converting human-readable policy into machine-executable rules is complex and requires careful validation
Example from our projects: The Identification Management project started with natural language requirements from government policy. These were structured into DocRef (which remains human-readable) but also converted into machine-readable formats (JSON, formal rules) so that systems could automatically check compliance. The original English-language policy was preserved so humans could understand the intent, while the structured formats enabled automated verification.
The role of tools: Many tools in this section exist to bridge the gap—Markdown looks like natural text but has syntax rules; JSON looks structured but humans can read it; DocRef preserves natural language while adding machine-readable identifiers. Even AI systems like Claude work by recognizing patterns in natural language and can generate both human-readable text and code.
VS Code (Visual Studio Code)
VS Code is a free, popular application for writing and editing code and text files. It's made by Microsoft and used by millions of developers worldwide. Think of it as a very powerful text editor with extra features for working with code and documents.
Why we use it:
- It's the environment where Claude Code (the AI coding assistant) operates
- It can display and edit many types of files (code, markdown, JSON, etc.)
- Extensions add extra capabilities (like connecting to databases or AI services)
- It's free and works on Windows, Mac, and Linux
In our projects: We use VS Code as the workspace where AI-assisted drafting happens. The AI can read files, make edits, run commands, and access external tools—all within VS Code.
JSON
JSON (JavaScript Object Notation) is a way of formatting data so that both humans and computers can read it. It uses a simple structure of labels and values, with curly braces {} for groups and square brackets [] for lists.
Example:
{
"title": "API Security Requirements",
"section": "Part 2",
"paragraphs": ["Use HTTPS for all connections", "Validate all inputs"]
}
Why this matters:
- DocRef documents can be downloaded in JSON format for processing by software
- MCP server configurations are stored as JSON files
- It's a universal format that almost any programming language can work with
In our projects: When AI queries the DocRef database, results come back in JSON format, which the AI then processes to extract the relevant content and citations.
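Here is a small sketch of that processing step: parsing JSON text in the shape of the example above into data a program can iterate over. The field names mirror the example and are illustrative, not DocRef's actual schema.

```python
import json

# A small result in the shape shown above (field names are illustrative).
raw = """
{
  "title": "API Security Requirements",
  "section": "Part 2",
  "paragraphs": ["Use HTTPS for all connections", "Validate all inputs"]
}
"""

data = json.loads(raw)           # text -> Python data structures
for para in data["paragraphs"]:  # software can now work through the content
    print(f'{data["section"]}: {para}')
```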
HTML
HTML (HyperText Markup Language) is the standard language for creating web pages. When you view any website, your browser is reading HTML code and displaying it as formatted text, images, and links.
Why this matters:
- DocRef documents can be downloaded in HTML format for viewing in browsers
- Many source documents start as HTML on government websites before being converted to structured data
- The Transparency Hub itself is built from HTML pages
In our projects: Source regulatory documents are often published as HTML web pages. DocRef converts these into structured data while preserving the ability to export back to HTML for human reading.
CSV
CSV (Comma-Separated Values) is a simple format for storing data in a table structure—like a spreadsheet. Each line is a row, and commas separate the columns. You can open CSV files in Excel, Google Sheets, or any spreadsheet application.
Why this matters:
- DocRef documents can be downloaded as CSV files, letting you work with regulatory content in spreadsheet tools
- Useful for analysis, filtering, and transformation using familiar tools
- Easy to share with colleagues who may not have technical software
In our projects: Exporting regulatory documents as CSV allows policy analysts to filter, sort, and analyse requirements using tools they already know, without needing specialised software.
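For readers curious what this looks like in code, here is a sketch of reading a miniature CSV export. The column names are invented for illustration; a real DocRef export would have its own columns. Each row comes back keyed by column name, just as it would appear in a spreadsheet.

```python
import csv
import io

# A miniature export in CSV form (column names are illustrative).
raw = """identifier,text
part1-section2-para6,Use HTTPS for all connections
part1-section2-para7,Validate all inputs
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# rows[0]["identifier"] == "part1-section2-para6"
# rows[0]["text"] == "Use HTTPS for all connections"
```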
Markdown
Markdown is a simple way to format text using plain characters. For example, **bold** becomes bold, and # Heading becomes a heading. It's designed to be readable even before it's converted to formatted text.
Why this matters:
- All the transparency materials on this site are written in Markdown
- It's easy to learn and doesn't require special software
- Files stay readable as plain text, making them easy to version control and share
- AI systems can easily read and write Markdown
In our projects: Research notes, drafts, and final documents are all written in Markdown. The API Standard project saved research findings as organised Markdown files (e.g., research/01_design.md, research/02_security.md) before drafting began.
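To show the idea of plain characters carrying formatting, here is a deliberately tiny converter handling just two Markdown features. Real converters (including the one behind this site) handle far more; this sketch exists only to make the `**bold**` and `# Heading` examples above concrete.

```python
import re

def tiny_markdown(line: str) -> str:
    """Convert two Markdown features to HTML: # headings and **bold**.
    A deliberately minimal sketch, not a real Markdown converter."""
    if line.startswith("# "):
        return f"<h1>{line[2:]}</h1>"
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)

# tiny_markdown("# Heading")     → "<h1>Heading</h1>"
# tiny_markdown("**bold** text") → "<strong>bold</strong> text"
```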
StrucDown: Syncopate has developed a variant called "StrucDown" that adds conventions for the granular hierarchical identifiers DocRef requires. StrucDown documents can be converted into fully structured DocRef datasets while remaining readable as plain text.
Learn Markdown in 10 minutes or explore the full guide at markdownguide.org.
Claude and Claude Code
Claude is an AI assistant made by Anthropic. Claude Code is a version of Claude that can work directly with files and code in VS Code—reading documents, making edits, running commands, and connecting to external tools like MCP servers.
Key capabilities:
- Can work with a "context window" of about 200,000 tokens (roughly 150,000 words) at once
- Can read and write files, run terminal commands, and use external tools
- Follows detailed instructions provided in special configuration files (called CLAUDE.md files)
Why this matters:
- Claude Code can systematically research source documents, then draft new content with citations
- It can be given specific constraints and instructions for each project
- All its actions are logged, creating a transparency trail
- It can write scripts and validation tools that can be independently reviewed and verified before being run
In our projects: The API Standard was generated using Claude Code (specifically Claude 3.5 Sonnet), which conducted 47 searches across source documents before drafting a 14,657-word standard with 280 citations—all in about 2 hours.
The human role: Claude Code doesn't work autonomously. A human operator directs the work, reviews outputs, verifies citations, and makes corrections. The AI accelerates research and drafting, but human judgment remains essential for quality and accuracy.
Learn more about Claude in Anthropic's documentation.
Large Language Models
Large Language Models (LLMs) are AI systems trained on vast amounts of text to understand and generate human language. They power tools like Claude, ChatGPT, and many others. "Large" refers to the billions of parameters (adjustable values) that shape how the model processes language.
Key characteristics:
- They predict what text should come next, based on patterns learned from training data
- They can summarise, translate, answer questions, and generate new content
- They don't "know" things the way humans do—they recognise and reproduce patterns
Important limitations:
- They can generate plausible-sounding but incorrect information ("hallucinations")
- They work best when given specific, structured information to work with
- They benefit from human oversight and verification
In our projects: We use LLMs not as general-purpose tools, but constrained by structured data, specific instructions, and citation requirements. The goal is to amplify human expertise, not replace it. Every AI-generated output includes citations so humans can verify the sources.
Why this matters for digital regulatory infrastructure: LLMs can dramatically accelerate work with regulatory documents—summarising, comparing, drafting, and answering questions. But their statistical nature means they can produce errors or fabricate citations. DocRef's structured data and citation system provides the verification infrastructure needed to use LLMs responsibly: every claim can be traced back to source material.
Scripts
A script is a small program that automates a specific task. Instead of manually performing repetitive steps, you write instructions once and the computer follows them every time. Scripts are typically short and focused on a single purpose.
Why this matters:
- Scripts can validate outputs systematically (checking every citation, verifying formatting)
- They ensure consistency—the same checks are applied every time
- They save time on repetitive tasks and reduce human error
AI-written scripts: Importantly, AI can write these scripts too. Unlike AI-generated prose (which might contain subtle errors or "hallucinations"), a script can be reviewed, tested, and verified before it runs. Once validated, the script executes deterministically—it does exactly what the code says, every time. This means AI can help create validation tools that are themselves fully transparent and verifiable.
In our projects: Validation scripts check that AI-generated documents meet quality standards—for example, verifying that all citations link to valid sources, or that required sections are present. The Identification Management project used scripts to verify that all 109 core controls from source documents were preserved in the final output. These scripts can be archived alongside the outputs they validated, allowing independent parties to re-run the same checks.
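As a sketch of what such a script looks like, here is a minimal citation checker. The citation pattern and the list of known identifiers below are made up for illustration; a real script would load identifiers from the DocRef dataset itself.

```python
import re

# Illustrative only: a real script would load these from the DocRef dataset.
KNOWN_IDS = {"part1-section2-para6", "part2-section1-para3"}
CITATION = re.compile(r"\[([a-z0-9-]+)\]")

def check_citations(draft: str) -> list[str]:
    """Return every cited identifier that does not exist in the source data.
    Once reviewed, this check runs deterministically, the same way every time."""
    return [c for c in CITATION.findall(draft) if c not in KNOWN_IDS]

draft = "APIs must use HTTPS [part2-section1-para3] and log access [part9-para1]."
# check_citations(draft) → ["part9-para1"]  (a citation with no valid source)
```

Because the script is short and readable, a human reviewer can verify exactly what it checks before trusting its results.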
RAG (retrieval-augmented generation)
RAG is a technique that gives AI access to external information before it generates a response. Instead of relying only on what it learned during training, the AI first retrieves relevant content from a database or document collection, then uses that content to inform its response.
The problem RAG solves:
- AI models have a limited "context window"—the amount of information they can consider at once
- While source documents might fit in a single context window initially, ongoing conversation, refinements, and outputs quickly consume available space
- Without RAG, the AI would have to work from memory alone, increasing the risk of errors and limiting the depth of analysis possible
How it works (simplified):
- You ask a question or give a task
- The system searches for relevant content in a database
- Retrieved content (with citations) is provided to the AI
- The AI generates a response based on this retrieved content
- Citations travel through the process, appearing in the final output
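The five steps above can be sketched as a small retrieve-then-generate loop. The "database" and "model" here are simple stand-ins invented for illustration: real systems use a graph database for retrieval and an LLM for generation, but the shape of the pipeline, and the way citations travel through it, is the same.

```python
# Stand-in "database": identifier -> content (illustrative, not DocRef data).
DATABASE = {
    "part2-section1-para3": "Use HTTPS for all connections.",
    "part2-section1-para4": "Validate all inputs before processing.",
}

def retrieve(query: str) -> list[tuple[str, str]]:
    """Step 2: find relevant content (keyword match here; real systems
    use semantic search over a graph database)."""
    words = query.lower().split()
    return [(ghi, text) for ghi, text in DATABASE.items()
            if any(w in text.lower() for w in words)]

def generate(query: str) -> str:
    """Steps 3-5: give retrieved content (with citations) to the model,
    which produces a response that carries the citations through."""
    evidence = "; ".join(f"{text} [{ghi}]" for ghi, text in retrieve(query))
    return f"Based on the sources: {evidence}"

# generate("https connections") cites part2-section1-para3, not para4
```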
GraphRAG combines RAG with graph databases, allowing the AI to follow relationships between content—not just find matching keywords, but explore connected concepts.
Example from our projects: In the API Standard project, while the 5,612 source document nodes could theoretically fit in a single context window, the ongoing process of research, drafting, refinement, and conversation would have quickly overwhelmed available space. Instead, the AI performed 47 targeted searches, retrieving approximately 1,073 relevant results with full citations. These were saved to research files, then selectively read during drafting. The result: a comprehensive document with 280 verified citations, with source material accessed systematically rather than loaded all at once.
Static site
A static site is a website made of pre-built HTML files that are the same for every visitor. Unlike dynamic sites (like Facebook or online banking) that generate pages on-the-fly based on user data, static sites are simpler, faster, and more secure.
Why this matters:
- Fast loading—files are ready to serve immediately
- Secure—no database to hack, no server-side code to exploit
- Reliable—fewer things can go wrong
- Easy to host—can be served from simple, inexpensive hosting
How this site works:
- Content is written in Markdown files
- A tool called 11ty (Eleventy) converts Markdown to HTML
- The site is hosted on Vercel, a platform for static sites
- When we update content, the site automatically rebuilds
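The core of that build step can be sketched in a few lines: every Markdown source page becomes a pre-built HTML page that is identical for every visitor. This is a simplified illustration, not how 11ty works internally (11ty applies templates and full Markdown conversion).

```python
def build_site(pages: dict[str, str]) -> dict[str, str]:
    """Sketch of a static site build: turn each Markdown source page into
    a pre-built HTML page, ready to serve as-is to every visitor.
    Illustrative only; real generators like 11ty do much more."""
    return {
        name.replace(".md", ".html"):
            f"<html><body><pre>{text}</pre></body></html>"
        for name, text in pages.items()
    }

site = build_site({"index.md": "# Tools and Technologies"})
# site contains one pre-built page, "index.html", the same for every visitor
```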
For regulatory documents: Static sites are well-suited to publishing regulatory content because regulations should be stable, reliable, and accessible over long periods. DocRef publishes documents as static HTML pages that can be hosted simply and preserved indefinitely.
In our projects: The Transparency Hub you're reading now is a static site. This approach keeps things simple and focused on the content, while ensuring the site loads quickly and remains accessible.
Further reading
- About DocRef — How the DocRef system works
- About MCP Servers — How AI systems access regulatory data
- Digital Regulatory Infrastructure and Rules as Code — The broader context for this work