Local LLM Infrastructure: Self-Hosted AI on Mac Mini
AI Infrastructure | Privacy & Sovereignty | Enterprise-Ready Deployment
Project Results:
- Complete self-hosted AI infrastructure on Mac Mini M4 Pro
- 3-service architecture: Ollama + Open WebUI + MCP bridge
- Qwen 2.5:14B primary model with RAG integration
- Full automation with LaunchAgents and Python workflows
- Cross-platform memory system (RT-Assistant JSONL ledger)
- Zero ongoing API costs for local inference
- Foundation for enterprise AI deployments
Overview
What if you could run ChatGPT-level AI on your desk—completely private, fully automated, and ready for prototype development—while building deep expertise in LLM infrastructure?
In 2025, I built a production-grade self-hosted AI infrastructure on a Mac Mini M4 Pro. The project demonstrates sophisticated AI deployment, automation engineering, and systems integration—skills that position me for enterprise AI implementations.
This wasn't about saving money on API costs. (Spoiler: I didn't.) This was about capability building: gaining hands-on expertise in LLM operations, deployment, and infrastructure that would translate to client work and enterprise implementations.
The Problem
Three Interconnected Challenges
1. Privacy & Data Sovereignty
Working with sensitive information (client data, strategic planning, personal notes) through cloud APIs means:
- All data passes through third-party servers
- Content subject to provider terms of service
- No control over data retention or processing
- Dependency on external service availability
Goal: Complete data sovereignty—all processing happens locally, zero data leaves the machine.
2. Context Loss Across Platforms
Each AI platform (Claude, ChatGPT, local tools) maintains separate conversation histories:
- Re-explaining context every new session
- No shared memory across tools
- Losing decision rationale over time
- Inability to reference past conversations months later
Goal: Unified memory system accessible across all AI platforms.
3. Learning Challenge: Understanding LLM Operations
Using cloud APIs is easy—but provides zero insight into how LLMs actually work:
- What happens during inference?
- How do context windows and chunking work?
- What's involved in model serving at scale?
- How do you configure and tune production systems?
Goal: Hands-on operational experience with LLM infrastructure to enable future client deployments.
Technical Approach
Hardware Selection: Mac Mini M4 Pro
Rationale:
- 24GB unified memory: Sufficient for 14B parameter models with headroom
- M4 Pro chip: Excellent performance per watt for inference
- 4TB external storage: Room for multiple models and knowledge bases
- Always-on capability: Reliable 24/7 operation
- Quiet operation: Suitable for home office desk placement
Cost: ~$1,400 (Mac Mini) + storage
Alternative considered: Cloud GPU instances (rejected due to ongoing costs and complexity)
Architecture Overview
Three-Service Architecture
The system consists of three integrated services running on macOS:
1. Ollama (Port 11434)
- Role: Model inference backend
- Function: Serves LLM completions via OpenAI-compatible API
- Primary model: Qwen 2.5:14B-instruct
- Embeddings: nomic-embed-text (for RAG)
2. Open WebUI (Port 3000)
- Role: Browser-based interface
- Function: ChatGPT-like interface with advanced features
- Deployment: Docker container with persistent data volumes
- Features: RAG collections, workspaces, tool invocation, memory integration
3. mcpo MCP Bridge (Port 11620)
- Role: Model Context Protocol integration
- Function: Translates MCP STDIO tools to OpenAPI for Open WebUI
- Capability: Enables LLM filesystem read/write access
- Integration: Connects RT-Assistant memory system to Open WebUI
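To make the wiring concrete, here is a minimal sketch of pulling the models, querying Ollama's OpenAI-compatible endpoint, and running Open WebUI in Docker. The model tags, container flags, and volume name are illustrative assumptions rather than the exact production configuration.

```bash
# Pull the primary and embedding models (tags assumed from the setup above)
ollama pull qwen2.5:14b-instruct
ollama pull nomic-embed-text

# Query Ollama's OpenAI-compatible chat endpoint on port 11434
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:14b-instruct",
       "messages": [{"role": "user", "content": "In one sentence, what is RAG?"}]}'

# Run Open WebUI on port 3000 with a persistent data volume
# (volume and container names are illustrative)
docker run -d --name open-webui --restart always \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```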
Key Technical Features
1. Complete Auto-Start Infrastructure
Challenge: Three services must start automatically on boot in correct order.
Solution: macOS LaunchAgents with dependency management
Boot sequence:
1. User logs in
2. Docker Desktop starts (if auto-start enabled)
3. Ollama starts via LaunchAgent
4. Open WebUI waits for Docker readiness, then starts container
5. mcpo starts the MCP bridge
6. All services ready in ~30-60 seconds
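As a sketch of what one of these LaunchAgents might look like (the label, binary path, and filename are placeholders, not the plists actually deployed):

```bash
# Write and load a LaunchAgent for the Ollama backend (paths/labels are placeholders)
cat > ~/Library/LaunchAgents/com.example.ollama.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.ollama</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>serve</string>
  </array>
  <key>RunAtLoad</key><true/>   <!-- start at login -->
  <key>KeepAlive</key><true/>   <!-- restart automatically if it exits -->
</dict>
</plist>
EOF

# Register the agent for the current user session
launchctl load ~/Library/LaunchAgents/com.example.ollama.plist
```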
Management CLI: rtai
Created unified management tool for all services:
rtai start     # Start all services
rtai stop      # Stop all services
rtai restart   # Restart all services
rtai status    # Check health of all services
rtai logs      # View recent logs
Impact: Zero-touch operation—system runs reliably without manual intervention.
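The real script isn't reproduced here, but an rtai-style wrapper can be as simple as a case statement fanning out to launchctl; the agent labels and log path below are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an rtai-style management wrapper
set -euo pipefail

AGENTS=(com.example.ollama com.example.openwebui com.example.mcpo)  # placeholder labels

case "${1:-}" in
  start)   for a in "${AGENTS[@]}"; do launchctl load   ~/Library/LaunchAgents/"$a.plist"; done ;;
  stop)    for a in "${AGENTS[@]}"; do launchctl unload ~/Library/LaunchAgents/"$a.plist"; done ;;
  restart) "$0" stop; "$0" start ;;
  logs)    tail -n 50 ~/Library/Logs/rtai/*.log ;;  # log location is an assumption
  *)       echo "usage: rtai {start|stop|restart|status|logs}" >&2; exit 1 ;;
esac
```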
2. Model Context Protocol (MCP) Integration
The Innovation: Giving the LLM active filesystem access via Model Context Protocol.
What is MCP?
- New protocol for connecting AI systems to external tools
- Enables LLMs to read/write files, execute commands, access APIs
- STDIO-based communication between LLM and tool servers
The Challenge: Open WebUI doesn't natively support MCP STDIO tools.
The Solution: mcpo Bridge
mcpo translates MCP STDIO tools into OpenAPI format that Open WebUI understands.
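In practice the bridge is a single process. A hedged sketch, assuming the reference MCP filesystem server and an illustrative path (the actual tool servers and paths behind this setup aren't shown here):

```bash
# Expose an MCP STDIO tool server as an OpenAPI service on port 11620,
# which Open WebUI can then register as an external tool server.
# The filesystem server and the path are illustrative assumptions.
uvx mcpo --port 11620 -- \
  npx -y @modelcontextprotocol/server-filesystem /path/to/rt-assistant
```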
What This Enables:
- LLM can read files from RT-Assistant knowledge base
- LLM can write new entries to memory ledger
- Direct access to 300+ articles in RAG collection
- Active memory system integration (not just passive context)
Impact: Cross-platform memory system becomes actively accessible during conversations.
3. RAG Implementation with Knowledge Collections
Open WebUI RAG Collection:
- Collection ID: RT_Articles (300+ Substack posts)
- Embeddings: nomic-embed-text model
- Search: Semantic similarity search during conversations
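The embeddings that power semantic search come from the local Ollama instance. A quick way to see one being generated (the text is arbitrary; older Ollama versions expose /api/embeddings, newer ones also offer /api/embed):

```bash
# Generate an embedding for a snippet of text with the local embedding model
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Self-hosted AI on a Mac Mini"}'
```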
Article Sync Automation:
Watcher script: rt_watch_poll.sh
- Monitors the /articles/ directory for changes
- Detects new/modified markdown files
- Automatically uploads to Open WebUI
- Adds to the RT_Articles collection
- Runs continuously in background
Impact: RAG collection stays current with latest writing automatically.
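The watcher follows a simple poll-and-upload loop. The sketch below only captures the idea; rt_watch_poll.sh is not reproduced here, and the Open WebUI upload endpoint and token are placeholders.

```bash
#!/usr/bin/env bash
# Minimal polling-watcher sketch (not the actual rt_watch_poll.sh)
set -euo pipefail

WATCH_DIR="$HOME/articles"                 # assumed article location
STATE_FILE="$HOME/.rt_watch_last_sync"     # mtime marks the last successful pass
UPLOAD_URL="http://localhost:3000/REPLACE_WITH_UPLOAD_ENDPOINT"   # placeholder
TOKEN="REPLACE_WITH_API_TOKEN"                                    # placeholder

touch "$STATE_FILE"
while true; do
  # Markdown files changed since the last pass
  find "$WATCH_DIR" -name '*.md' -newer "$STATE_FILE" -print0 |
    while IFS= read -r -d '' f; do
      echo "Uploading: $f"
      curl -s -H "Authorization: Bearer $TOKEN" -F "file=@$f" "$UPLOAD_URL" >/dev/null
    done
  touch "$STATE_FILE"
  sleep 60   # poll once per minute
done
```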
4. Cross-Platform Memory System Integration
RT-Assistant Memory Ledger:
Format: JSONL (JSON Lines) for human-readable, AI-compatible entries
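For illustration, an appended ledger entry could look like the line below; the file name and field names are assumptions, not the actual RT-Assistant schema.

```bash
# Append one illustrative entry to the memory ledger (schema is assumed)
cat >> rt_assistant_memory.jsonl <<'EOF'
{"ts": "2025-10-12T14:30:00Z", "source": "open-webui", "type": "decision", "summary": "Chose Qwen 2.5:14B as the primary local model", "tags": ["local-llm", "models"]}
EOF
```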
Cross-Platform Access:
- Claude Code: Via MCP filesystem integration
- ChatGPT: Manual session rendering
- Local LLM (Open WebUI): Via mcpo MCP bridge
Automation: Nightly Compaction
LaunchAgent: com.offlineai.memorycompact.v2.plist
- Schedule: Daily at 02:05 AM + on system load
- Script: compact_memory.py
- Function: Consolidates entries, removes duplicates, creates summary
- Output: memory_compact.jsonl (lightweight version for loading)
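Conceptually, the nightly pass boils down to deduplicating and slimming the ledger. The real compact_memory.py is a Python script and is not shown here; the jq one-liner below only illustrates the idea, with assumed field names.

```bash
# Deduplicate ledger entries by timestamp + summary and write a compact copy
jq -c -s 'unique_by([.ts, .summary]) | .[]' rt_assistant_memory.jsonl > memory_compact.jsonl
```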
Impact: Persistent memory across all AI interactions, automatically maintained.
Development Process
Timeline
Planning & Research: August 2025 (2-3 weeks)
Initial Setup: September 2025 (1 week)
Open WebUI Configuration: September 2025 (1 week)
MCP Integration: September-October 2025 (2 weeks)
Automation Development: October 2025 (2 weeks)
Total Time: ~6-8 weeks
Effort: 10-20 hours/week (part-time alongside other projects)
Configuration Challenges Overcome
1. Open WebUI Learning Curve (~1 week)
Challenge: Documentation unclear for advanced settings
Approach:
- Trial-and-error experimentation
- Community forum research
- Testing different configurations
Outcome: Mastered configuration, but time-intensive
2. Model Context Protocol Integration (~2 weeks)
Challenges:
- MCP: New technology with limited documentation
- mcpo: STDIO-to-OpenAPI translation not intuitive
- Relative path conventions confusing
- Tool invocation failures difficult to debug
Solution Process:
- Read mcpo source code
- Experiment with different path configurations
- Monitor logs for errors
- Iterate until filesystem access worked
Outcome: Fully functional MCP integration enabling filesystem R/W
3. LaunchAgent Auto-Start (~1 week)
Challenge: Three services must start in correct order with dependencies
Solution:
- Created three separate LaunchAgent plists
- Added KeepAlive for auto-restart
- Configured startup delays for dependencies
- Tested boot sequence reliability
Outcome: Reliable auto-start with zero manual intervention
Results & Validation
Technical Success ✅
Achieved:
- ✅ Fully functional self-hosted AI infrastructure
- ✅ Complete automation (boot → ready in 60 seconds)
- ✅ Cross-platform memory integration working
- ✅ RAG collection with 300+ articles searchable
- ✅ Model Context Protocol filesystem access
- ✅ Zero ongoing API costs for local inference
- ✅ Production-ready deployment practices
Performance Metrics:
- Inference speed: Noticeably slower than cloud APIs
- RAM usage: Comfortable headroom with 24GB (14B model uses ~18GB)
- Concurrent usage: Multiple sessions work seamlessly
- Uptime: Reliable 24/7 operation
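For anyone who wants to put a number on "slower", Ollama reports token throughput when run with its --verbose flag; the model tag below is assumed from the setup above.

```bash
# Start an interactive session that prints timing stats (eval rate in tokens/s) after each reply
ollama run qwen2.5:14b-instruct --verbose
```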
Cost Reality: No Savings (Yet) 💰
Initial Goal: Reduce cloud API spending
Reality: Cloud API costs actually increased
- ChatGPT usage: Decreased significantly
- Claude Code usage: Increased substantially (heavy development work)
- Local LLM usage: Decreased from initial period
Net effect: No immediate cost savings achieved
BUT: Future value unlocked:
- Foundation to deploy cost-effective production alternatives
- Capability to prototype without API billing concerns
- Skills to set up systems for clients without recurring costs
Conclusion: This was never about immediate ROI—it was about capability building.
Capabilities Unlocked 🚀
1. Infrastructure Deployment Skills
Can now confidently:
- Set up LLM systems locally or in the cloud
- Deploy for personal use or client projects
- Configure and tune model serving infrastructure
- Troubleshoot production AI deployments
2. Prototype Development Without API Costs
Value: Test AI ideas without worrying about API billing
3. Decision Tracking & Organizational Memory
Memory system provides:
- Capture of decision rationale across projects
- Ability to review past choices months later
- Foundation for potential enterprise knowledge management
4. Foundation for Enterprise Implementation
Skills gained position me for:
- Client deployments of AI infrastructure
- Enterprise knowledge management solutions
- Organizational decision tracking systems
- Privacy-preserving AI implementations
5. Preparation for Agentic AI Systems
Local LLM experience provides foundation for:
- Understanding agentic system implementation
- Configuring agents securely
- Ensuring proper operation and constraints
- Managing autonomous AI workflows
Technical Learnings
What Worked Well
1. Learning Experience
- Exceeded expectations for educational value
- Hands-on knowledge more valuable than theoretical study
- Foundation for future AI implementation work
2. Open Source Ecosystem
- Ollama: Easy to get started, powerful capabilities
- Open WebUI: Feature-rich interface once configured
- Community support: Active development and documentation
3. Hardware Choice
- M4 Pro with 24GB unified memory handled 14B-class models with headroom