About DocRef MCP Servers

This page explains how DocRef datasets are made available to AI systems through MCP servers, enabling a technique called GraphRAG (Graph Retrieval-Augmented Generation).

What is an MCP Server?

MCP (Model Context Protocol) is a standardized protocol developed by Anthropic that allows AI systems to connect to external data sources through dedicated servers. MCP servers act as intermediaries between the AI and specialized databases or APIs.

How it works:

  1. The AI sends a structured query to the MCP server
  2. The server translates the query into database operations (searches, traversals)
  3. The server retrieves results from the database
  4. Results are formatted and returned to the AI with proper citations
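As a rough illustration of step 1, an MCP tool invocation is a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds such a message in Python. The tool name semantic_search is one of the tools described later on this page; the argument names (query, limit) and the helper build_tool_call are assumptions made for illustration, not the server's documented schema.

```python
import json
from itertools import count

# JSON-RPC requests carry an id so that responses can be matched to them.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# The argument names are hypothetical; a real server advertises its own schema.
message = build_tool_call(
    "semantic_search",
    {"query": "requirements for biometric authentication", "limit": 5},
)
print(message)
```

In practice an MCP client library would normally construct and send this message; the sketch only shows the shape of the structured query.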

The key benefit is that AI systems can access far more information than fits in their context window, retrieving only what's relevant to each specific query.

What is GraphRAG?

GraphRAG combines two components:

  1. Knowledge Graph: A database structure (typically Neo4j) that stores documents as connected nodes with relationships, properties, and metadata tags
  2. Retrieval-Augmented Generation (RAG): An AI technique that retrieves relevant information from external sources before generating content

In DocRef's implementation:

  • Documents are broken into individual elements (paragraphs, sections, definitions, tables)
  • Each element becomes a node in a graph database
  • Nodes are connected by relationships: CHILD_OF (hierarchical structure) and SEMANTIC_SIMILARITY (meaning-based connections)
  • Vector embeddings enable semantic search - finding content by meaning, not just keywords
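The structure described above can be sketched in memory without a database. This is a minimal illustration, not DocRef's implementation: the class and field names are hypothetical, the embeddings are 3-dimensional toy vectors rather than real model output, and a linear scan stands in for a vector index.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentNode:
    node_id: str
    text: str
    embedding: List[float]
    parent_id: Optional[str] = None  # target of a CHILD_OF relationship
    similar_ids: List[str] = field(default_factory=list)  # SEMANTIC_SIMILARITY targets


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(nodes, query_embedding, top_k=2):
    """Rank nodes by similarity of meaning (embedding), not by shared keywords."""
    ranked = sorted(nodes, key=lambda n: cosine(n.embedding, query_embedding), reverse=True)
    return ranked[:top_k]


nodes = [
    DocumentNode("sec-1", "Authentication", [1.0, 0.0, 0.0]),
    DocumentNode("par-1", "Authenticators must be protected.", [0.9, 0.1, 0.0], parent_id="sec-1"),
    DocumentNode("par-2", "Records are retained for audit.", [0.0, 0.2, 0.9], parent_id="sec-1"),
]

hits = semantic_search(nodes, query_embedding=[0.8, 0.2, 0.0])
by_id = {n.node_id: n for n in nodes}
parent = by_id[hits[0].parent_id]  # follow CHILD_OF from the best hit to its section
print(hits[0].node_id, "is a child of", parent.node_id)
```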

MCP Servers Used in This Project

identification-management-standards

This MCP server provides access to New Zealand's identification management standards.

Database Statistics:

  • 9,374 DocumentNode entities across 30 documents
  • 7,454 nodes with embeddings (79.5% coverage using 768-dimensional vectors)
  • Relationships:
    • CHILD_OF: 10,208 hierarchical parent-child links
    • SEMANTIC_SIMILARITY: 89,735 precomputed semantic similarity connections (K=10 neighbors per node)

Content includes:

  • 4 core identification standards (Federation, Information, Authentication, Binding)
  • 4 implementation guides (one per standard)
  • Supporting materials: terminology, risk assessment, conformance guidance, counter-fraud techniques
  • Related legislation: Privacy Act 2020, Digital Identity Services Trust Framework Act 2023

Available tools:

Tool                       Purpose
semantic_search            Natural language queries for content discovery
find_semantic_neighbors    Fast traversal of precomputed semantic similarity links
search_by_document         Retrieve all nodes from a specific document
get_hierarchical_context   Navigate parent/child/sibling relationships
run_cypher_query           Custom Neo4j queries for advanced analysis
get_document_stats         Metadata and collection statistics
get_schema                 Node properties and available relationships
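As one example of what run_cypher_query could be used for, the node and embedding counts in the statistics above could in principle be re-derived with a short Cypher query. The sketch below only assembles the tool arguments and checks the arithmetic. The DocumentNode label comes from this page; the embedding property name and the query argument key are assumptions that get_schema would need to confirm.

```python
# Cypher counting all DocumentNode entities and those that carry an embedding.
# count(<expression>) ignores nulls, so nodes without the property are skipped.
# `embedding` is an assumed property name; confirm it with get_schema first.
COVERAGE_QUERY = """
MATCH (n:DocumentNode)
RETURN count(n) AS total_nodes,
       count(n.embedding) AS nodes_with_embeddings
""".strip()


def coverage_percent(total_nodes: int, nodes_with_embeddings: int) -> float:
    """Share of nodes that have an embedding, as a percentage to one decimal."""
    return round(100 * nodes_with_embeddings / total_nodes, 1)


# Hypothetical argument key for the run_cypher_query tool.
tool_arguments = {"query": COVERAGE_QUERY}

# Figures quoted in the database statistics on this page.
print(coverage_percent(9374, 7454))  # 79.5
```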

How it was used: This server was the primary source for content retrieval when generating the consolidated identification management standard. AI agents queried the server to retrieve relevant content with full citations, and that content was then synthesized into the output document.

generative-ai-guidance-gcdo

This MCP server provides access to the Government Chief Digital Officer's (GCDO) guidance for responsible AI use in the public service.

Database Statistics:

  • 1,393 DocumentNode entities across 23 documents
  • 1,027 nodes with embeddings (73.7% coverage using 768-dimensional vectors)
  • Relationships:
    • CHILD_OF: 1,212 hierarchical links
    • SEMANTIC_SIMILARITY: 10,215 connections (K=10, threshold=0.7)
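The K and threshold parameters suggest that each node is linked to at most K of its nearest neighbours, and only when the similarity score reaches the threshold. The sketch below shows one way such links could be precomputed. It assumes cosine similarity and a brute-force comparison, neither of which is stated on this page, and the function names are hypothetical. A threshold would also explain why the link count can fall below K times the number of embedded nodes (1,027 × 10 = 10,270 at most, against 10,215 reported), although the page does not say how the links were actually built.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def precompute_neighbors(embeddings, k=10, threshold=0.7):
    """Return {node_id: [(neighbor_id, score), ...]} with at most k links per node."""
    links = {}
    for node_id, vector in embeddings.items():
        scored = [
            (other_id, cosine(vector, other))
            for other_id, other in embeddings.items()
            if other_id != node_id
        ]
        # Drop weak matches first, then keep the k strongest of what remains.
        kept = [pair for pair in scored if pair[1] >= threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        links[node_id] = kept[:k]
    return links


# Toy 2-dimensional vectors; the real embeddings are 768-dimensional.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
links = precompute_neighbors(toy_embeddings, k=1, threshold=0.7)
total_links = sum(len(v) for v in links.values())
print(total_links, "links for", len(toy_embeddings), "nodes")
```

Here node c has no neighbour above the threshold, so it contributes no links even though K allows one.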

Content includes:

  • Public Service AI Framework (foundational principles)
  • Responsible AI Guidance for GenAI
  • Topic-specific guidance: governance, security, privacy, transparency, bias/discrimination, accessibility
  • Implementation guidance: procurement, skills/capabilities
  • Supporting materials: glossary of AI terms, cloud jurisdictional risk guidance

How it was used: This server was used for evaluation, not generation. It allowed validation of AI-generated content against government AI principles.

How MCP Servers Enable AI-Assisted Drafting

Combining DocRef structured data with MCP servers supports the following drafting workflow:

Source Documents
       ↓
   [DocRef]
  Structure + Citations + Embeddings
       ↓
   [MCP Server]
  Semantic Search + Graph Traversal
       ↓
   [AI System]
  Query → Retrieve → Generate
       ↓
  Output with Citations
       ↓
   [Human Review]
  Click citations to verify

Key capabilities enabled:

  1. Overcome context limitations: Source documents often exceed AI context windows (e.g., 9,374 nodes across 30 documents). MCP servers allow selective retrieval of only relevant content.

  2. Preserve citations: DocRef URLs travel with the content through every step. AI outputs include clickable links to exact source locations.

  3. Semantic understanding: Vector embeddings enable finding content by meaning. A search for "authentication requirements" returns relevant content even when the source text uses different wording.

  4. Structural navigation: Graph relationships allow traversing from a specific paragraph to its parent section, sibling paragraphs, or related content across documents.

  5. Systematic research: AI agents can execute comprehensive searches, save results, then synthesize, which reduces the chance that relevant content is missed.
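Capabilities 1 and 2 can be illustrated together: choose only the retrieved content that fits a budget, and keep each citation attached to its text. This is a simplified sketch under stated assumptions. The names are hypothetical, the URLs are placeholders rather than real DocRef links, and a character count stands in for a token count.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    citation_url: str  # placeholder URLs below, not real DocRef links
    score: float       # relevance score returned by the search


def select_for_context(chunks, max_chars):
    """Greedily keep the highest-scoring chunks that still fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + len(chunk.text) <= max_chars:
            selected.append(chunk)
            used += len(chunk.text)
    return selected


def render_with_citations(chunks):
    """Emit each chunk with its source link so the citation travels with the text."""
    return "\n".join(f"{c.text} [source: {c.citation_url}]" for c in chunks)


chunks = [
    RetrievedChunk("A" * 50, "https://example.org/docref/a", 0.9),
    RetrievedChunk("B" * 80, "https://example.org/docref/b", 0.8),
    RetrievedChunk("C" * 40, "https://example.org/docref/c", 0.7),
]

selected = select_for_context(chunks, max_chars=100)
prompt_context = render_with_citations(selected)
print(prompt_context)
```

The second chunk is skipped because it would exceed the budget, while the smaller third chunk still fits. A production system would more likely budget in tokens.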

Question-and-Answer Capabilities

MCP servers also enable human-AI interaction for document understanding:

  • Human asks question: "What are the requirements for biometric authentication?"
  • AI queries MCP server: Semantic search across all documents
  • AI provides answer: Natural language response with pinpoint citations
  • Human verifies: Click DocRef links to see exact source text

This creates an "explainable knowledge base" where AI answers can always be traced back to authoritative source documents.

Technical Implementation

For those interested in the technical details:

  • Database: Neo4j graph database
  • Embeddings: 768-dimensional vectors (Nomic AI embeddings)
  • Protocol: MCP, a standardized interface built on JSON-RPC 2.0
  • Access: Via npx mcp-remote command or direct API calls
  • Context management: File-based external memory for systematic research
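The file-based external memory mentioned in the last item could look roughly like the sketch below: each search's results are written to disk as they arrive, then reloaded together for synthesis, so intermediate results need not stay in the AI's context. The file layout, function names, and result fields are assumptions for illustration; the page does not describe the actual format.

```python
import json
import tempfile
from pathlib import Path


def save_results(memory_dir, topic, results):
    """Persist one search's results as <topic>.json in the memory directory."""
    path = Path(memory_dir) / f"{topic}.json"
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path


def load_all_results(memory_dir):
    """Reload every saved result file, in filename order, for synthesis."""
    merged = []
    for path in sorted(Path(memory_dir).glob("*.json")):
        merged.extend(json.loads(path.read_text(encoding="utf-8")))
    return merged


# A temporary directory stands in for the agent's working folder.
with tempfile.TemporaryDirectory() as memory_dir:
    save_results(memory_dir, "authentication", [
        {"text": "Example finding one", "citation": "https://example.org/docref/1"},
    ])
    save_results(memory_dir, "binding", [
        {"text": "Example finding two", "citation": "https://example.org/docref/2"},
    ])
    merged = load_all_results(memory_dir)

print(len(merged), "saved findings reloaded")
```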

For more details on the API Standard project's implementation, see the Methodology and Technology Overview.

Learn More