Local LLM Infrastructure: Self-Hosted AI on Mac Mini

Production-grade self-hosted AI infrastructure on Mac Mini M4 Pro. Complete automation with Ollama, Open WebUI, and MCP bridge. Deep dive into LLM deployment, RAG implementation, and enterprise-ready AI systems.

AI Infrastructure | Privacy & Sovereignty | Enterprise-Ready Deployment

Project Results:

  • Complete self-hosted AI infrastructure on Mac Mini M4 Pro
  • 3-service architecture: Ollama + Open WebUI + MCP bridge
  • Qwen 2.5:14B primary model with RAG integration
  • Full automation with LaunchAgents and Python workflows
  • Cross-platform memory system (RT-Assistant JSONL ledger)
  • Zero ongoing API costs for local inference
  • Foundation for enterprise AI deployments

Overview

What if you could run ChatGPT-level AI on your desk—completely private, fully automated, and ready for prototype development—while building deep expertise in LLM infrastructure?

In 2025, I built a production-grade self-hosted AI infrastructure on a Mac Mini M4 Pro. The project demonstrates sophisticated AI deployment, automation engineering, and systems integration—skills that position me for enterprise AI implementations.

This wasn't about saving money on API costs. (Spoiler: I didn't.) This was about capability building: gaining hands-on expertise in LLM operations, deployment, and infrastructure that would translate to client work and enterprise implementations.


The Problem

Three Interconnected Challenges

1. Privacy & Data Sovereignty

Working with sensitive information (client data, strategic planning, personal notes) through cloud APIs means:

  • All data passes through third-party servers
  • Content subject to provider terms of service
  • No control over data retention or processing
  • Dependency on external service availability

Goal: Complete data sovereignty—all processing happens locally, zero data leaves the machine.

2. Context Loss Across Platforms

Each AI platform (Claude, ChatGPT, local tools) maintains separate conversation histories:

  • Re-explaining context in every new session
  • No shared memory across tools
  • Losing decision rationale over time
  • Inability to reference past conversations months later

Goal: Unified memory system accessible across all AI platforms.

3. Learning Challenge: Understanding LLM Operations

Using cloud APIs is easy—but provides zero insight into how LLMs actually work:

  • What happens during inference?
  • How do context windows and chunking work?
  • What's involved in model serving at scale?
  • How do you configure and tune production systems?

Goal: Hands-on operational experience with LLM infrastructure to enable future client deployments.


Technical Approach

Hardware Selection: Mac Mini M4 Pro

Rationale:

  • 24GB unified memory: Sufficient for 14B parameter models with headroom
  • M4 Pro chip: Excellent performance per watt for inference
  • 4TB external storage: Room for multiple models and knowledge bases
  • Always-on capability: Reliable 24/7 operation
  • Quiet operation: Suitable for home office desk placement

Cost: ~$1,400 (Mac Mini) + storage

Alternative considered: Cloud GPU instances (rejected due to ongoing costs and complexity)


Architecture Overview

Three-Service Architecture

The system consists of three integrated services running on macOS:

1. Ollama (Port 11434)

  • Role: Model inference backend
  • Function: Serves LLM completions via OpenAI-compatible API (see the example below)
  • Primary model: Qwen 2.5:14B-instruct
  • Embeddings: nomic-embed-text (for RAG)
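
Because Ollama exposes an OpenAI-compatible API on port 11434, any OpenAI-style client can talk to the local model with a plain HTTP request. A minimal Python sketch (the model tag is an assumption; check ollama list for the exact name on your machine):

  import requests

  # Ollama's OpenAI-compatible chat endpoint (local, no API key required)
  OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

  payload = {
      "model": "qwen2.5:14b-instruct",  # assumed tag; verify with `ollama list`
      "messages": [
          {"role": "system", "content": "You are a concise local assistant."},
          {"role": "user", "content": "Summarize what a RAG pipeline does."},
      ],
      "temperature": 0.2,
  }

  resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])

Swapping the base URL is usually all it takes to point existing OpenAI-based tooling at the local backend.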

2. Open WebUI (Port 3000)

  • Role: Browser-based interface
  • Function: ChatGPT-like interface with advanced features
  • Deployment: Docker container with persistent data volumes
  • Features: RAG collections, workspaces, tool invocation, memory integration

3. mcpo MCP Bridge (Port 11620)

  • Role: Model Context Protocol integration
  • Function: Translates MCP STDIO tools to OpenAPI for Open WebUI
  • Capability: Enables LLM filesystem read/write access
  • Integration: Connects RT-Assistant memory system to Open WebUI

Key Technical Features

  1. Complete Auto-Start Infrastructure

Challenge: Three services must start automatically on boot in correct order.

Solution: macOS LaunchAgents with dependency management

Boot sequence:

  1. User logs in
  2. Docker Desktop starts (if auto-start enabled)
  3. Ollama starts via LaunchAgent
  4. Open WebUI waits for Docker readiness, then starts container
  5. mcpo starts MCP bridge
  6. All services ready in ~30-60 seconds
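
Each service gets its own LaunchAgent plist in ~/Library/LaunchAgents. As a rough sketch of the pattern (the label, binary path, and log paths below are placeholders, not the exact files used in this setup), an agent that keeps Ollama running looks roughly like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
  <plist version="1.0">
  <dict>
      <key>Label</key>
      <string>com.example.ollama</string>            <!-- placeholder label -->
      <key>ProgramArguments</key>
      <array>
          <string>/opt/homebrew/bin/ollama</string>  <!-- adjust to the output of `which ollama` -->
          <string>serve</string>
      </array>
      <key>RunAtLoad</key>
      <true/>                                        <!-- start at login -->
      <key>KeepAlive</key>
      <true/>                                        <!-- restart if the process dies -->
      <key>StandardOutPath</key>
      <string>/tmp/ollama.out.log</string>
      <key>StandardErrorPath</key>
      <string>/tmp/ollama.err.log</string>
  </dict>
  </plist>

Loading it once with launchctl load ~/Library/LaunchAgents/com.example.ollama.plist brings the service up immediately and on every subsequent login; KeepAlive handles crashes, and startup delays in the dependent agents handle ordering.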

Management CLI: rtai

Created unified management tool for all services:

rtai start     # Start all services
rtai stop      # Stop all services
rtai restart   # Restart all services
rtai status    # Check health of all services
rtai logs      # View recent logs

Impact: Zero-touch operation—system runs reliably without manual intervention.


  2. Model Context Protocol (MCP) Integration

The Innovation: Giving the LLM active filesystem access via Model Context Protocol.

What is MCP?

  • New protocol for connecting AI systems to external tools
  • Enables LLMs to read/write files, execute commands, access APIs
  • STDIO-based communication between LLM and tool servers (a minimal tool server is sketched below)
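
To make the STDIO model concrete, a minimal MCP tool server in Python (using the official mcp SDK's FastMCP helper) fits in a few lines; the read_note tool here is a hypothetical illustration, not one of the servers wired into this setup:

  from mcp.server.fastmcp import FastMCP

  # A minimal MCP server exposing one tool over STDIO.
  mcp = FastMCP("demo-notes")

  @mcp.tool()
  def read_note(path: str) -> str:
      """Return the contents of a note file (hypothetical example tool)."""
      with open(path, "r", encoding="utf-8") as f:
          return f.read()

  if __name__ == "__main__":
      mcp.run()  # defaults to the STDIO transport, which is what mcpo bridges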

The Challenge: Open WebUI doesn't natively support MCP STDIO tools.

The Solution: mcpo Bridge

mcpo translates MCP STDIO tools into OpenAPI format that Open WebUI understands.
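
A quick way to confirm the bridge is up is to fetch the OpenAPI document it generates and list the mounted tool routes. This sketch assumes the bridge is listening on port 11620 as configured above; the route names depend entirely on which MCP servers are registered:

  import requests

  # mcpo publishes a standard OpenAPI spec describing every bridged MCP tool.
  spec = requests.get("http://localhost:11620/openapi.json", timeout=10).json()

  print(spec["info"]["title"])
  for path, methods in spec["paths"].items():
      print(path, "->", ", ".join(methods))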

What This Enables:

  • LLM can read files from RT-Assistant knowledge base
  • LLM can write new entries to memory ledger
  • Direct access to 300+ articles in RAG collection
  • Active memory system integration (not just passive context)

Impact: Cross-platform memory system becomes actively accessible during conversations.


  3. RAG Implementation with Knowledge Collections

Open WebUI RAG Collection:

  • Collection ID: RT_Articles (300+ Substack posts)
  • Embeddings: nomic-embed-text model
  • Search: Semantic similarity search during conversations (see the sketch below)
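
Semantic search reduces to comparing embedding vectors. The sketch below asks the local Ollama embeddings endpoint for nomic-embed-text vectors and scores two illustrative strings with cosine similarity (the endpoint and model name follow Ollama's documented API; the strings are just examples):

  import math
  import requests

  def embed(text: str) -> list[float]:
      # Ollama's embeddings endpoint returns one vector per prompt.
      r = requests.post(
          "http://localhost:11434/api/embeddings",
          json={"model": "nomic-embed-text", "prompt": text},
          timeout=60,
      )
      r.raise_for_status()
      return r.json()["embedding"]

  def cosine(a: list[float], b: list[float]) -> float:
      dot = sum(x * y for x, y in zip(a, b))
      return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

  query = embed("How do LaunchAgents start services at login?")
  doc = embed("macOS LaunchAgents run user-level jobs automatically at login.")
  print(f"similarity: {cosine(query, doc):.3f}")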

Article Sync Automation:

Watcher script: rt_watch_poll.sh

  • Monitors /articles/ directory for changes
  • Detects new/modified markdown files (polling logic sketched below)
  • Automatically uploads to Open WebUI
  • Adds to RT_Articles collection
  • Runs continuously in background
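
The production watcher is a shell script, but the polling idea fits in a few lines of Python: track each file's modification time and hand anything new or changed to an upload step. The upload function below is a deliberate placeholder, since the real call depends on the Open WebUI instance's API token and knowledge-collection ID:

  import time
  from pathlib import Path

  ARTICLES_DIR = Path("articles")   # directory to watch (adjust as needed)
  POLL_SECONDS = 60

  def upload_to_openwebui(path: Path) -> None:
      """Placeholder: push the file to Open WebUI and attach it to the RAG collection."""
      print(f"would upload {path}")

  def watch() -> None:
      seen: dict[Path, float] = {}
      while True:
          for md in ARTICLES_DIR.glob("*.md"):
              mtime = md.stat().st_mtime
              if seen.get(md) != mtime:        # new or modified since the last pass
                  upload_to_openwebui(md)
                  seen[md] = mtime
          time.sleep(POLL_SECONDS)

  if __name__ == "__main__":
      watch()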

Impact: RAG collection stays current with latest writing automatically.


  4. Cross-Platform Memory System Integration

RT-Assistant Memory Ledger:

Format: JSONL (JSON Lines) for human-readable, AI-compatible entries

Cross-Platform Access:

  • Claude Code: Via MCP filesystem integration
  • ChatGPT: Manual session rendering
  • Local LLM (Open WebUI): Via mcpo MCP bridge

Automation: Nightly Compaction

LaunchAgent: com.offlineai.memorycompact.v2.plist

  • Schedule: Daily at 02:05 AM + on system load
  • Script: compact_memory.py
  • Function: Consolidates entries, removes duplicates, creates summary (see the sketch after this list)
  • Output: memory_compact.jsonl (lightweight version for loading)
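
A minimal sketch of such a compaction pass is shown below; the ledger filename and the ts/text field names are assumptions for illustration, not the actual RT-Assistant schema:

  import json
  from pathlib import Path

  LEDGER = Path("memory.jsonl")            # full append-only ledger (assumed name)
  COMPACT = Path("memory_compact.jsonl")   # lightweight output loaded at session start

  def compact() -> None:
      seen: set[str] = set()
      kept: list[dict] = []
      for line in LEDGER.read_text(encoding="utf-8").splitlines():
          if not line.strip():
              continue
          entry = json.loads(line)
          key = entry.get("text", "")      # dedupe on entry text (assumed field)
          if key in seen:
              continue
          seen.add(key)
          kept.append(entry)
      # Keep entries in chronological order and rewrite the compact ledger.
      kept.sort(key=lambda e: e.get("ts", ""))
      with COMPACT.open("w", encoding="utf-8") as f:
          for entry in kept:
              f.write(json.dumps(entry, ensure_ascii=False) + "\n")

  if __name__ == "__main__":
      compact()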

Impact: Persistent memory across all AI interactions, automatically maintained.


Development Process

Timeline

Planning & Research: August 2025 (2-3 weeks)

Initial Setup: September 2025 (1 week)

Open WebUI Configuration: September 2025 (1 week)

MCP Integration: September-October 2025 (2 weeks)

Automation Development: October 2025 (2 weeks)

Total Time: ~6-8 weeks

Effort: 10-20 hours/week (part-time alongside other projects)


Configuration Challenges Overcome

  1. Open WebUI Learning Curve (~1 week)

Challenge: Documentation unclear for advanced settings

Approach:

  • Trial-and-error experimentation
  • Community forum research
  • Testing different configurations

Outcome: Mastered configuration, but time-intensive


  2. Model Context Protocol Integration (~2 weeks)

Challenges:

  • MCP: New technology with limited documentation
  • mcpo: STDIO to OpenAPI translation not intuitive
  • Relative path conventions confusing
  • Tool invocation failures difficult to debug

Solution Process:

  • Read the mcpo source code
  • Experiment with different path configurations
  • Monitor logs for errors
  • Iterate until filesystem access worked

Outcome: Fully functional MCP integration enabling filesystem R/W


  3. LaunchAgent Auto-Start (~1 week)

Challenge: Three services must start in correct order with dependencies

Solution:

  • Created three separate LaunchAgent plists
  • Added KeepAlive for auto-restart
  • Configured startup delays for dependencies
  • Tested boot sequence reliability

Outcome: Reliable auto-start with zero manual intervention


Results & Validation

Technical Success ✅

Achieved:

  • ✅ Fully functional self-hosted AI infrastructure
  • ✅ Complete automation (boot → ready in 60 seconds)
  • ✅ Cross-platform memory integration working
  • ✅ RAG collection with 300+ articles searchable
  • ✅ Model Context Protocol filesystem access
  • ✅ Zero ongoing API costs for local inference
  • ✅ Production-ready deployment practices

Performance Metrics:

  • Inference speed: Medium-slow compared to cloud APIs
  • RAM usage: Comfortable headroom with 24GB (14B model uses ~18GB)
  • Concurrent usage: Multiple sessions work seamlessly
  • Uptime: Reliable 24/7 operation

Cost Reality: No Savings (Yet) 💰

Initial Goal: Reduce cloud API spending

Reality: Cloud API costs actually increased

  • ChatGPT usage: Decreased significantly
  • Claude Code usage: Increased substantially (heavy development work)
  • Local LLM usage: Decreased from initial period

Net effect: No immediate cost savings achieved

BUT: Future value unlocked:

  • Foundation to deploy cost-effective production alternatives
  • Capability to prototype without API billing concerns
  • Skills to set up systems for clients without recurring costs

Conclusion: This was never about immediate ROI—it was about capability building.


Capabilities Unlocked 🚀

1. Infrastructure Deployment Skills

Can now confidently:

  • Set up LLM systems locally or in the cloud
  • Deploy for personal use or client projects
  • Configure and tune model serving infrastructure
  • Troubleshoot production AI deployments

2. Prototype Development Without API Costs

Value: Test AI ideas without worrying about API billing

3. Decision Tracking & Organizational Memory

Memory system provides:

  • Capture of decision rationale across projects
  • Ability to review past choices months later
  • Foundation for potential enterprise knowledge management

4. Foundation for Enterprise Implementation

Skills gained position me for:

  • Client deployments of AI infrastructure
  • Enterprise knowledge management solutions
  • Organizational decision tracking systems
  • Privacy-preserving AI implementations

5. Preparation for Agentic AI Systems

Local LLM experience provides foundation for:

  • Understanding agentic system implementation
  • Configuring agents securely
  • Ensuring proper operation and constraints
  • Managing autonomous AI workflows


Technical Learnings

What Worked Well

1. Learning Experience

  • Exceeded expectations for educational value
  • Hands-on knowledge more valuable than theoretical study
  • Foundation for future AI implementation work

2. Open Source Ecosystem

  • Ollama: Easy to get started, powerful capabilities
  • Open WebUI: Feature-rich interface once configured
  • Community support: Active development and documentation

3. Hardware Choice

  • M4 Pro with 24GB unified memory handled the 14B model comfortably, with headroom to spare