Tools and Technologies
Overview
This section provides background on some of the key tools and methodologies used in our digital regulatory infrastructure projects.
This page is written for a non-technical audience; it explains software and applications that developers will already know well.
We are excited to explore how to make these tools approachable for non-technical people to amplify their knowledge and experience and accelerate their existing workflows.
Why these tools matter
Combining DocRef, MCP servers, and other approaches and applications means that digital regulatory infrastructure can be supplemented by:
- Precise Citations - Every human or AI-retrieved piece of content includes exact source references
- Verifiability - Human experts can verify AI outputs or human-authored work products against original sources
- Transparency - Audit trails of what content was accessed and used
- Quality Control - Structured data enables systematic validation
- Reproducibility - Others can follow the same process with the same tools (although results will not be identical given the nature of LLM systems)
List of tools and technologies
DocRef
DocRef is a digital regulatory infrastructure system that converts regulatory text into structured, machine-readable data with precise, granular citations. You can find out more here.
Key capabilities include:
- Machine-readable format for regulatory documents
- Download regulatory documents in formats like HTML, CSV (spreadsheet) and JSON
- Granular citation system (unique URLs for every paragraph, table cell, etc)
- Annotation and tagging system for expert review
- Hierarchical modeling of relationships between parent and child document elements (e.g., a subsection within a section)
DocRef MCP Servers (Model Context Protocol)
Model Context Protocol (MCP) servers provide AI systems with access to structured regulatory data through a GraphRAG (retrieval augmented generation using graph database structures) approach. You can find out more here.
Key features:
- Semantic search across regulatory content
- Hierarchical context retrieval
- Citation-based verification
- Graph database queries
Granular hierarchical identifiers
Granular hierarchical identifiers (GHIs) are the unique codes that DocRef assigns to every element in a document. Think of them as precise addresses for specific parts of a document, similar to how a postal address gets more specific as you add the street number, unit number, and room.
How they work:
A GHI like part1-section2-para6-1-b tells you exactly where to find a piece of content:
- part1 — Part 1 of the document
- section2 — Section 2 within that part
- para6 — Paragraph 6 within that section
- 1-b — Sub-paragraph (1)(b) within that paragraph
Why this matters for digital regulatory infrastructure:
- Every paragraph, bullet point, table cell, and definition gets its own unique identifier
- Traditional documents let you reference "paragraph 6" or "section 2", but these references can be ambiguous—GHIs provide machine-readable precision
- Software systems—including AI—can locate exact content, track changes between document versions, and maintain reliable links over time
- When AI retrieves information, these identifiers travel with the content, so you always know exactly where it came from
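To make this concrete, here is a small sketch of how software might split a GHI into its hierarchy levels. This is illustrative only: the prefixes recognised here (part, section, para) are taken from the example above, and DocRef's actual parsing rules may differ.

```python
def parse_ghi(ghi: str) -> list[str]:
    """Split a granular hierarchical identifier into its hierarchy levels.

    Illustrative sketch only; DocRef's real rules may differ. Pieces that
    name a level (part, section, para) start a new level; trailing pieces
    like "1-b" are treated as sub-paragraph labels on the last level.
    """
    levels = []
    for piece in ghi.split("-"):
        if piece.startswith(("part", "section", "para")):
            levels.append(piece)         # a new hierarchy level
        elif levels:
            levels[-1] += "-" + piece    # a sub-label, e.g. "(1)(b)"
        else:
            levels.append(piece)
    return levels

# parse_ghi("part1-section2-para6-1-b")
# → ["part1", "section2", "para6-1-b"]
```

Once an identifier is split into levels like this, software can navigate from any element up through its parents, which is exactly what the parent-child relationships in our project databases record.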
Example from our projects: In the Identification Management project, the system tracked 10,208 parent-child relationships between document elements—sections containing subsections, subsections containing paragraphs, and so on. Instead of saying "see the authentication requirements in the standards", you can link directly to a specific control like AA9.04 and anyone (or any software) can retrieve exactly that content.
URL (uniform resource locator)
A URL is a web address—the text you type into a browser to visit a website, like https://docref.digital.govt.nz. URLs are fundamental to how the internet works, allowing any resource to be located and accessed.
How DocRef uses URLs:
DocRef combines standard URLs with granular hierarchical identifiers to create precise, clickable links to specific document elements. For example:
https://docref.digital.govt.nz/nz/dia-apis-parta/2022/en/#part2-section1-para3
This URL takes you directly to a specific paragraph within a specific section of the API Guidelines. You can click the link and your browser scrolls straight to that content. The URL tells you: it's from the NZ DIA API Guidelines, Part A, 2022 version, English language, Part 2, Section 1, Paragraph 3.
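The same decomposition can be done by software. Here is a small sketch using Python's standard URL tools; the meaning assigned to each path segment (jurisdiction, document, year, language) is inferred from the example above, not from a published DocRef specification.

```python
from urllib.parse import urlsplit

def split_docref_url(url: str) -> dict:
    """Separate a DocRef-style URL into its document path and the granular
    hierarchical identifier carried in the fragment (the part after '#').

    Illustrative only: the path layout is inferred from one example URL.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "host": parts.netloc,
        "path_segments": segments,    # e.g. jurisdiction, document, year, language
        "element_id": parts.fragment, # the granular hierarchical identifier
    }

info = split_docref_url(
    "https://docref.digital.govt.nz/nz/dia-apis-parta/2022/en/#part2-section1-para3"
)
# info["element_id"] == "part2-section1-para3"
```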
Why this matters:
- You can share a link to a specific paragraph, not just a whole document
- URLs include version information, so you can tell if the source has been updated
- When AI generates content with citations, each citation is a clickable link back to the exact source—transforming vague references into verifiable links that anyone can check
Example from our projects: The API Standard project produced 280 citations, each linking to a specific location in the source documents.
Graph databases
A graph database stores information as a network of connected points (called "nodes") rather than in traditional rows and columns like a spreadsheet. Imagine a mind map or a family tree—each item is connected to related items, and you can follow the connections to explore relationships.
Why this matters:
- Documents aren't flat—they have structure (sections within parts, paragraphs within sections) and relationships (cross-references, related concepts)
- Graph databases naturally represent these connections
- AI can "walk" through the graph to find related content, not just search for keywords
Key concepts:
- Nodes: Individual pieces of content (a paragraph, a definition, a table cell)
- Relationships: Connections between nodes (e.g., "CHILD_OF" for hierarchy, "SEMANTIC_SIMILARITY" for related meaning)
- Properties: Information attached to each node (the text content, its identifier, tags, annotations)
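The concepts above can be sketched in a few lines of ordinary code. This toy example models nodes, properties, and CHILD_OF relationships with Python dictionaries, then "walks" from one node up to the top of the document; a real system would use a graph database such as Neo4j, and the node names and text here are invented for illustration.

```python
# A toy in-memory graph: nodes carry properties; child_of records hierarchy.
# Illustrative only; real projects store this in a graph database (Neo4j).
nodes = {
    "part2":                {"text": "Part 2: Security"},
    "part2-section1":       {"text": "Section 1: Authentication"},
    "part2-section1-para3": {"text": "Use HTTPS for all connections."},
}
child_of = {  # CHILD_OF relationships: child -> parent
    "part2-section1": "part2",
    "part2-section1-para3": "part2-section1",
}

def ancestors(node_id: str) -> list[str]:
    """Walk CHILD_OF relationships from a node up to the document root."""
    chain = []
    while node_id in child_of:
        node_id = child_of[node_id]
        chain.append(node_id)
    return chain

# ancestors("part2-section1-para3") → ["part2-section1", "part2"]
```

Following connections like this, rather than scanning rows in a table, is what lets AI retrieve a paragraph together with the section and part that give it context.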
Example from our projects: The Identification Management project database contains 9,374 nodes from 30 source documents, connected by 10,208 hierarchical links (showing document structure) and 89,735 semantic similarity links (showing which pieces of content are related by meaning).
Neo4j, the graph database we use, has introductory resources explaining graph concepts. You can also learn more about graph databases on Wikipedia.
Semantic search
Semantic search finds content based on meaning rather than exact word matches. Traditional search requires you to guess the exact words used in a document. Semantic search understands that "authentication" and "verifying user identity" mean similar things, even though they use different words.
How it works (simplified):
- Each piece of content is converted into a mathematical representation (called a "vector embedding") that captures its meaning
- When you search, your query is also converted into this mathematical form
- The system finds content whose mathematical representation is closest to your query's
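The three steps above can be sketched with toy numbers. Real embedding models produce vectors with hundreds of dimensions; the stand-in below just counts a few hand-picked keywords, which is enough to show how "closeness" between vectors is measured (here with cosine similarity).

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts a few hand-picked
    keywords. Purely illustrative; real embeddings come from trained
    models and capture meaning far more richly."""
    keywords = ["identity", "verify", "login", "error"]
    return [float(text.lower().count(k)) for k in keywords]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = embed("verify user identity")
doc1 = embed("authentication means to verify the identity of a login")
doc2 = embed("handle every fault and report the error")
# doc1 scores higher than doc2 against the query
```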
Why this matters:
- You can find relevant content even when different documents use different terminology
- Searching for "how to handle errors" will find content about "exception management" or "fault tolerance"
- AI can discover connections between documents that might be missed by keyword search
Example from our projects: In the API Standard project, a search for "OAuth authentication and authorization" found 40 relevant pieces across all three parts of the source guidelines—even when the source text used different phrasing like "access tokens" or "security credentials."
Natural language vs machine languages
Natural language is how humans communicate: English, te reo Māori, and other languages people speak and write. It's flexible, expressive, and sometimes ambiguous—"run" can mean a sprint, operating a program, or a sequence of events, depending on context.
Machine languages (also called programming languages or code) are strict, unambiguous instructions for computers. They use precise syntax where every character matters. For example, { and } have specific meanings—they're not punctuation, they're instructions. A missing comma or quotation mark makes the code fail completely.
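You can see this strictness in action with a tiny experiment. The example below uses Python's built-in JSON parser (chosen as a convenient illustration; the same point applies to any machine language): a string that any human would still understand is rejected outright because one quotation mark is missing.

```python
import json

valid = '{"title": "API Security Requirements"}'
broken = '{"title": "API Security Requirements'  # missing closing quote and brace

# The well-formed input parses into usable data...
print(json.loads(valid)["title"])

# ...but one missing character makes the whole input fail, not just one field.
try:
    json.loads(broken)
except json.JSONDecodeError as err:
    print("Rejected:", err.msg)
```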
Why this matters for digital regulatory infrastructure:
- Regulatory documents are written in natural language - they need to be readable and interpretable by humans, including non-specialists
- Software systems that interpret regulations need machine languages - they require precise, unambiguous definitions so computers can process rules consistently
- The translation between them is critical - converting human-readable policy into machine-executable rules is complex and requires careful validation
Example from our projects: The Identification Management project started with natural language requirements from government policy. These were structured into DocRef (which remains human-readable) but also converted into machine-readable formats (JSON, formal rules) so that systems could automatically check compliance. The original English-language policy was preserved so humans could understand the intent, while the structured formats enabled automated verification.
The role of tools: Many tools in this section exist to bridge the gap—Markdown looks like natural text but has syntax rules; JSON looks structured but humans can read it; DocRef preserves natural language while adding machine-readable identifiers. Even AI systems like Claude work by recognizing patterns in natural language and can generate both human-readable text and code.
VS Code (Visual Studio Code)
VS Code is a free, popular application for writing and editing code and text files. It's made by Microsoft and used by millions of developers worldwide. Think of it as a very powerful text editor with extra features for working with code and documents.
Why we use it:
- It's the environment where Claude Code (the AI coding assistant) operates
- It can display and edit many types of files (code, markdown, JSON, etc.)
- Extensions add extra capabilities (like connecting to databases or AI services)
- It's free and works on Windows, Mac, and Linux
In our projects: We use VS Code as the workspace where AI-assisted drafting happens. The AI can read files, make edits, run commands, and access external tools—all within VS Code.
JSON
JSON (JavaScript Object Notation) is a way of formatting data so that both humans and computers can read it. It uses a simple structure of labels and values, with curly braces {} for groups and square brackets [] for lists.
Example:
{
"title": "API Security Requirements",
"section": "Part 2",
"paragraphs": ["Use HTTPS for all connections", "Validate all inputs"]
}
Why this matters:
- DocRef documents can be downloaded in JSON format for processing by software
- MCP server configurations are stored as JSON files
- It's a universal format that almost any programming language can work with
In our projects: When AI queries the DocRef database, results come back in JSON format, which the AI then processes to extract the relevant content and citations.
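Here is a small sketch of that processing step: parsing JSON text in the shape of the example above into data a program can iterate over. The field names mirror the example and are illustrative, not DocRef's actual schema.

```python
import json

# A small result in the shape shown above (field names are illustrative).
raw = """
{
  "title": "API Security Requirements",
  "section": "Part 2",
  "paragraphs": ["Use HTTPS for all connections", "Validate all inputs"]
}
"""

data = json.loads(raw)           # text -> Python data structures
for para in data["paragraphs"]:  # software can now work through the content
    print(f'{data["section"]}: {para}')
```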
HTML
HTML (HyperText Markup Language) is the standard language for creating web pages. When you view any website, your browser is reading HTML code and displaying it as formatted text, images, and links.
Why this matters:
- DocRef documents can be downloaded in HTML format for viewing in browsers
- Many source documents start as HTML on government websites before being converted to structured data
- The Transparency Hub itself is built from HTML pages
In our projects: Source regulatory documents are often published as HTML web pages. DocRef converts these into structured data while preserving the ability to export back to HTML for human reading.
CSV
CSV (Comma-Separated Values) is a simple format for storing data in a table structure—like a spreadsheet. Each line is a row, and commas separate the columns. You can open CSV files in Excel, Google Sheets, or any spreadsheet application.
Why this matters:
- DocRef documents can be downloaded as CSV files, letting you work with regulatory content in spreadsheet tools
- Useful for analysis, filtering, and transformation using familiar tools
- Easy to share with colleagues who may not have technical software
In our projects: Exporting regulatory documents as CSV allows policy analysts to filter, sort, and analyse requirements using tools they already know, without needing specialised software.
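For readers curious what this looks like in code, here is a sketch of reading a miniature CSV export. The column names are invented for illustration; a real DocRef export would have its own columns. Each row comes back keyed by column name, just as it would appear in a spreadsheet.

```python
import csv
import io

# A miniature export in CSV form (column names are illustrative).
raw = """identifier,text
part1-section2-para6,Use HTTPS for all connections
part1-section2-para7,Validate all inputs
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# rows[0]["identifier"] == "part1-section2-para6"
# rows[0]["text"] == "Use HTTPS for all connections"
```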
Markdown
Markdown is a simple way to format text using plain characters. For example, **bold** becomes bold, and # Heading becomes a heading. It's designed to be readable even before it's converted to formatted text.
Why this matters:
- All the transparency materials on this site are written in Markdown
- It's easy to learn and doesn't require special software
- Files stay readable as plain text, making them easy to version control and share
- AI systems can easily read and write Markdown
In our projects: Research notes, drafts, and final documents are all written in Markdown. The API Standard project saved research findings as organised Markdown files (e.g., research/01_design.md, research/02_security.md) before drafting began.
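To show the idea of plain characters carrying formatting, here is a deliberately tiny converter handling just two Markdown features. Real converters (including the one behind this site) handle far more; this sketch exists only to make the `**bold**` and `# Heading` examples above concrete.

```python
import re

def tiny_markdown(line: str) -> str:
    """Convert two Markdown features to HTML: # headings and **bold**.
    A deliberately minimal sketch, not a real Markdown converter."""
    if line.startswith("# "):
        return f"<h1>{line[2:]}</h1>"
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)

# tiny_markdown("# Heading")     → "<h1>Heading</h1>"
# tiny_markdown("**bold** text") → "<strong>bold</strong> text"
```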
StrucDown: Syncopate has developed a variant called "StrucDown" that adds conventions for the granular hierarchical identifiers DocRef requires. StrucDown documents can be converted into fully structured DocRef datasets while remaining readable as plain text.
Learn Markdown in 10 minutes or explore the full guide at markdownguide.org.
Claude and Claude Code
Claude is an AI assistant made by Anthropic. Claude Code is a version of Claude that can work directly with files and code in VS Code—reading documents, making edits, running commands, and connecting to external tools like MCP servers.
Key capabilities:
- Can work with a "context window" of about 200,000 tokens (roughly 150,000 words) at once
- Can read and write files, run terminal commands, and use external tools
- Follows detailed instructions provided in special configuration files (called CLAUDE.md files)
Why this matters:
- Claude Code can systematically research source documents, then draft new content with citations
- It can be given specific constraints and instructions for each project
- All its actions are logged, creating a transparency trail
- It can write scripts and validation tools that can be independently reviewed and verified before being run
In our projects: The API Standard was generated using Claude Code (specifically Claude 3.5 Sonnet), which conducted 47 searches across source documents before drafting a 14,657-word standard with 280 citations—all in about 2 hours.
The human role: Claude Code doesn't work autonomously. A human operator directs the work, reviews outputs, verifies citations, and makes corrections. The AI accelerates research and drafting, but human judgment remains essential for quality and accuracy.
Learn more about Claude in Anthropic's documentation.
Large Language Models
Large Language Models (LLMs) are AI systems trained on vast amounts of text to understand and generate human language. They power tools like Claude, ChatGPT, and many others. "Large" refers to the billions of parameters (adjustable values) that shape how the model processes language.
Key characteristics:
- They predict what text should come next, based on patterns learned from training data
- They can summarise, translate, answer questions, and generate new content
- They don't "know" things the way humans do—they recognise and reproduce patterns
Important limitations:
- They can generate plausible-sounding but incorrect information ("hallucinations")
- They work best when given specific, structured information to work with
- They benefit from human oversight and verification
In our projects: We use LLMs not as general-purpose tools, but constrained by structured data, specific instructions, and citation requirements. The goal is to amplify human expertise, not replace it. Every AI-generated output includes citations so humans can verify the sources.
Why this matters for digital regulatory infrastructure: LLMs can dramatically accelerate work with regulatory documents—summarising, comparing, drafting, and answering questions. But their statistical nature means they can produce errors or fabricate citations. DocRef's structured data and citation system provides the verification infrastructure needed to use LLMs responsibly: every claim can be traced back to source material.
Scripts
A script is a small program that automates a specific task. Instead of manually performing repetitive steps, you write instructions once and the computer follows them every time. Scripts are typically short and focused on a single purpose.
Why this matters:
- Scripts can validate outputs systematically (checking every citation, verifying formatting)
- They ensure consistency—the same checks are applied every time
- They save time on repetitive tasks and reduce human error
AI-written scripts: Importantly, AI can write these scripts too. Unlike AI-generated prose (which might contain subtle errors or "hallucinations"), a script can be reviewed, tested, and verified before it runs. Once validated, the script executes deterministically—it does exactly what the code says, every time. This means AI can help create validation tools that are themselves fully transparent and verifiable.
In our projects: Validation scripts check that AI-generated documents meet quality standards—for example, verifying that all citations link to valid sources, or that required sections are present. The Identification Management project used scripts to verify that all 109 core controls from source documents were preserved in the final output. These scripts can be archived alongside the outputs they validated, allowing independent parties to re-run the same checks.
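As a sketch of what such a script looks like, here is a minimal citation checker. The citation pattern and the list of known identifiers below are made up for illustration; a real script would load identifiers from the DocRef dataset itself.

```python
import re

# Illustrative only: a real script would load these from the DocRef dataset.
KNOWN_IDS = {"part1-section2-para6", "part2-section1-para3"}
CITATION = re.compile(r"\[([a-z0-9-]+)\]")

def check_citations(draft: str) -> list[str]:
    """Return every cited identifier that does not exist in the source data.
    Once reviewed, this check runs deterministically, the same way every time."""
    return [c for c in CITATION.findall(draft) if c not in KNOWN_IDS]

draft = "APIs must use HTTPS [part2-section1-para3] and log access [part9-para1]."
# check_citations(draft) → ["part9-para1"]  (a citation with no valid source)
```

Because the script is short and readable, a human reviewer can verify exactly what it checks before trusting its results.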
RAG (retrieval-augmented generation)
RAG is a technique that gives AI access to external information before it generates a response. Instead of relying only on what it learned during training, the AI first retrieves relevant content from a database or document collection, then uses that content to inform its response.
The problem RAG solves:
- AI models have a limited "context window"—the amount of information they can consider at once
- While source documents might fit in a single context window initially, ongoing conversation, refinements, and outputs quickly consume available space
- Without RAG, the AI would have to work from memory alone, increasing the risk of errors and limiting the depth of analysis possible
How it works (simplified):
- You ask a question or give a task
- The system searches for relevant content in a database
- Retrieved content (with citations) is provided to the AI
- The AI generates a response based on this retrieved content
- Citations travel through the process, appearing in the final output
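The five steps above can be sketched as a small retrieve-then-generate loop. The "database" and "model" here are simple stand-ins invented for illustration: real systems use a graph database for retrieval and an LLM for generation, but the shape of the pipeline, and the way citations travel through it, is the same.

```python
# Stand-in "database": identifier -> content (illustrative, not DocRef data).
DATABASE = {
    "part2-section1-para3": "Use HTTPS for all connections.",
    "part2-section1-para4": "Validate all inputs before processing.",
}

def retrieve(query: str) -> list[tuple[str, str]]:
    """Step 2: find relevant content (keyword match here; real systems
    use semantic search over a graph database)."""
    words = query.lower().split()
    return [(ghi, text) for ghi, text in DATABASE.items()
            if any(w in text.lower() for w in words)]

def generate(query: str) -> str:
    """Steps 3-5: give retrieved content (with citations) to the model,
    which produces a response that carries the citations through."""
    evidence = "; ".join(f"{text} [{ghi}]" for ghi, text in retrieve(query))
    return f"Based on the sources: {evidence}"

# generate("https connections") cites part2-section1-para3, not para4
```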
GraphRAG combines RAG with graph databases, allowing the AI to follow relationships between content—not just find matching keywords, but explore connected concepts.
Example from our projects: In the API Standard project, while the 5,612 source document nodes could theoretically fit in a single context window, the ongoing process of research, drafting, refinement, and conversation would have quickly overwhelmed available space. Instead, the AI performed 47 targeted searches, retrieving approximately 1,073 relevant results with full citations. These were saved to research files, then selectively read during drafting. The result: a comprehensive document with 280 verified citations, with source material accessed systematically rather than loaded all at once.
Static site
A static site is a website made of pre-built HTML files that are the same for every visitor. Unlike dynamic sites (like Facebook or online banking) that generate pages on-the-fly based on user data, static sites are simpler, faster, and more secure.
Why this matters:
- Fast loading—files are ready to serve immediately
- Secure—no database to hack, no server-side code to exploit
- Reliable—fewer things can go wrong
- Easy to host—can be served from simple, inexpensive hosting
How this site works:
- Content is written in Markdown files
- A tool called 11ty (Eleventy) converts Markdown to HTML
- The site is hosted on Vercel, a platform for static sites
- When we update content, the site automatically rebuilds
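The core of that build step can be sketched in a few lines: every Markdown source page becomes a pre-built HTML page that is identical for every visitor. This is a simplified illustration, not how 11ty works internally (11ty applies templates and full Markdown conversion).

```python
def build_site(pages: dict[str, str]) -> dict[str, str]:
    """Sketch of a static site build: turn each Markdown source page into
    a pre-built HTML page, ready to serve as-is to every visitor.
    Illustrative only; real generators like 11ty do much more."""
    return {
        name.replace(".md", ".html"):
            f"<html><body><pre>{text}</pre></body></html>"
        for name, text in pages.items()
    }

site = build_site({"index.md": "# Tools and Technologies"})
# site contains one pre-built page, "index.html", the same for every visitor
```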
For regulatory documents: Static sites are well-suited to publishing regulatory content because regulations should be stable, reliable, and accessible over long periods. DocRef publishes documents as static HTML pages that can be hosted simply and preserved indefinitely.
In our projects: The Transparency Hub you're reading now is a static site. This approach keeps things simple and focused on the content, while ensuring the site loads quickly and remains accessible.
Further reading
- About DocRef — How the DocRef system works
- About MCP Servers — How AI systems access regulatory data
- Digital Regulatory Infrastructure and Rules as Code — The broader context for this work