System Health Monitoring
Mission Control provides comprehensive system health monitoring across agents, infrastructure, and automated processes. The health system gives Apple and the team real-time visibility into operational status and potential issues.
Health Indicators
Status Levels
Mission Control uses a three-tier health system:
- 🟢 Green — Operational, no issues
- 🟡 Yellow — Minor issues or warnings, monitoring required
- 🔴 Red — Critical issues requiring immediate attention
Home Dashboard Health Strip
The main dashboard displays three key health categories:
🤖 Agents
Monitors: Ari (main agent), Arlo (sales agent), Axel (dev agent)
Green indicators:
- Recent successful task completion
- No error messages in logs
- Responsive to commands
- Normal processing times
Yellow indicators:
- Occasional errors or timeouts
- Slower than normal response times
- Minor configuration warnings
- Non-critical failures in secondary tasks
Red indicators:
- Agent completely unresponsive
- Critical task failures
- Authentication errors
- System crashes or exceptions
⏰ Crons
Monitors: All scheduled background jobs
Health calculation:
- Success rate over last 24 hours
- Average execution time vs. baseline
- Number of failed jobs
- Critical vs. non-critical job status
Green indicators:
- 95%+ success rate
- All critical jobs completing
- Normal execution times
- No stuck or hanging processes
Yellow indicators:
- 85% to just under 95% success rate
- Some non-critical job failures
- Slower than normal execution
- Occasional timeouts
Red indicators:
- <85% success rate
- Critical job failures
- Multiple consecutive failures
- System-wide cron problems
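The success-rate thresholds above can be sketched as a small classifier. This is an illustration under stated assumptions, not Mission Control's actual implementation: the `JobRun` record, its `critical` flag, and the function name `cron_health` are hypothetical, and treating an empty 24-hour window as yellow is an assumption. Exactly 95% counts as green, following the "95%+" rule. Execution time and consecutive-failure checks are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    """One finished cron run (hypothetical record shape)."""
    job_name: str
    finished_at: datetime
    succeeded: bool
    critical: bool = False  # assumed flag marking critical jobs


def cron_health(runs: list[JobRun], now: datetime) -> str:
    """Classify cron health from the runs that finished in the last 24 hours."""
    window_start = now - timedelta(hours=24)
    recent = [r for r in runs if r.finished_at >= window_start]
    if not recent:
        return "yellow"  # assumption: no data is a warning, not a pass

    successes = sum(1 for r in recent if r.succeeded)
    total = len(recent)
    critical_failed = any(r.critical and not r.succeeded for r in recent)

    # Integer arithmetic avoids floating-point surprises at the boundaries.
    if critical_failed or successes * 100 < 85 * total:
        return "red"
    if successes * 100 < 95 * total:
        return "yellow"
    return "green"
```

Any failed critical job forces red regardless of the overall rate, which mirrors the "Critical job failures" indicator.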
🏗️ Infrastructure
Monitors: Server resources, network, and core services
Tracked metrics:
- CPU usage and load average
- Memory utilization
- Disk space availability
- Network connectivity
- Service uptime
Green indicators:
- CPU < 70%, Memory < 80%
- All services responding
- Network latency normal
- No disk space warnings
Yellow indicators:
- CPU 70-85%, Memory 80-90%
- Intermittent service issues
- Elevated network latency
- Disk space 85-95% full
Red indicators:
- CPU > 85%, Memory > 90%
- Services down or unresponsive
- Network connectivity issues
- Disk space > 95% full
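The resource thresholds above can be expressed as a worst-level-wins check. This is a minimal sketch: the function names are hypothetical, combining metrics by taking the worst level is an assumption, and disk usage below 85% is assumed to be green because the source only says "no disk space warnings". Service and network indicators are not modeled.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def level(value: float, yellow_from: float, red_above: float) -> str:
    """Map one usage percentage to a status level."""
    if value > red_above:
        return "red"
    if value >= yellow_from:
        return "yellow"
    return "green"


def infra_health(cpu_pct: float, mem_pct: float, disk_pct: float) -> str:
    """Return the worst level across CPU, memory, and disk usage."""
    levels = [
        level(cpu_pct, 70, 85),   # CPU: < 70 green, 70-85 yellow, > 85 red
        level(mem_pct, 80, 90),   # memory: < 80 green, 80-90 yellow, > 90 red
        level(disk_pct, 85, 95),  # disk: 85-95 yellow, > 95 red
    ]
    return max(levels, key=SEVERITY.__getitem__)
```

For example, `infra_health(75, 60, 40)` is yellow because CPU alone crosses its warning threshold.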
Detailed Health Pages
Infrastructure Page (/infrastructure)
Comprehensive server monitoring with:
Top Processes Table:
- Process name and PID
- CPU usage percentage
- Memory consumption
- Runtime duration
- Process status
System Metrics:
- Server uptime
- Load averages (1m, 5m, 15m)
- Memory usage breakdown
- Disk usage by volume
- Network interface status
Resource Alerts:
- Processes consuming excessive resources
- Memory leaks or runaway processes
- Disk space warnings
- Network connectivity issues
Crons Page (/crons)
Detailed scheduled job monitoring:
Job Table Columns:
- Job name and description
- Schedule frequency
- Last run timestamp
- Status (success/failure/running)
- Execution duration
- Next scheduled run
- Error messages (if any)
Filtering Options:
- Status filtering (all/success/failed/running)
- Search by job name
- Sort by last run, duration, or status
Job Categories:
- Revenue collection — Daily financial data aggregation
- Data pipeline — Creator and brand data processing
- Health checks — System monitoring and alerting
- Backup and maintenance — Data backup and cleanup
- Notifications — Alert and report generation
Agents Page (/agents)
Individual agent health monitoring:
Per Agent Display:
- Current operational status
- Recent task completion rate
- Error log highlights
- Assigned cron jobs
- Performance metrics
Agent-Specific Metrics:
Ari (Main Agent):
- Email processing rate
- API call success rate
- Data pipeline completion
- Integration health
Arlo (Sales Agent):
- CRM sync success
- Proposal generation rate
- Client communication metrics
- Revenue tracking accuracy
Axel (Dev Agent):
- Build success rate
- Deployment frequency
- Error resolution time
- Code quality metrics
Health Data Sources
Real-Time Monitoring
- System logs: /var/log/ and application logs
- Process monitoring: ps, top, htop system commands
- Network status: ping, curl connectivity tests
- Service status: systemctl and service-specific health endpoints
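As a rough illustration of the kind of data these sources yield, the sketch below reads load averages and disk usage with Python's standard library. Note the swap: it uses `os.getloadavg` and `shutil.disk_usage` rather than shelling out to `ps`, `top`, or `htop`, and it only works on Unix-like systems. The function name and the returned dictionary keys are hypothetical.

```python
import os
import shutil


def collect_basic_metrics(path: str = "/") -> dict:
    """Read load averages and disk usage for one volume (Unix only)."""
    load_1m, load_5m, load_15m = os.getloadavg()
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "load_avg": (load_1m, load_5m, load_15m),
        "cpu_count": os.cpu_count(),
        "disk_used_pct": round(usage.used / usage.total * 100, 1),
        "disk_free_bytes": usage.free,
    }
```

Memory usage and per-process figures are not available from the standard library alone, so they are omitted here.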
Cron Job Monitoring
- Execution logs: Individual job output and error logs
- Timing data: Start time, duration, completion status
- Resource usage: CPU and memory consumption during execution
- Dependencies: Service and data dependencies status
Agent Health Checks
- Heartbeat monitoring: Regular ping/pong with agents
- Task completion tracking: Success/failure rates for assigned tasks
- Error log analysis: Parsing error messages for patterns
- Performance benchmarking: Response time and throughput metrics
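Heartbeat monitoring can be sketched as a table of last-seen timestamps. The class and method names here are hypothetical, and how agents actually report heartbeats is not specified in this document. The 30-minute default matches the "Agent unresponsive for 30 minutes" alert rule.

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Tracks the most recent heartbeat seen from each agent."""

    def __init__(self, timeout: timedelta = timedelta(minutes=30)):
        self.timeout = timeout
        self.last_seen: dict[str, datetime] = {}

    def record(self, agent: str, at: datetime) -> None:
        """Store the time of the latest heartbeat for an agent."""
        self.last_seen[agent] = at

    def unresponsive(self, now: datetime) -> list[str]:
        """Return agents whose last heartbeat is at least `timeout` old."""
        return sorted(
            agent
            for agent, seen in self.last_seen.items()
            if now - seen >= self.timeout
        )
```

An agent that has never sent a heartbeat is not tracked at all in this sketch, so a real monitor would also need a list of expected agents.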
Alerting System
Alert Levels
- Info: General status updates, successful completions
- Warning: Non-critical issues requiring attention
- Error: Failures that impact functionality
- Critical: System-wide issues requiring immediate response
Alert Channels
- Mission Control dashboard — Visual indicators and notifications
- Slack alerts — Automated messages to #operations channel
- Email notifications — Critical alerts to Apple and team
- SMS alerts — Emergency notifications for critical failures
Alert Rules
- CPU > 90% for 10 minutes → Critical infrastructure alert
- Cron job failed 3 consecutive times → Error alert
- Agent unresponsive for 30 minutes → Critical agent alert
- Free disk space < 5% → Critical infrastructure alert
- Revenue collection job failed → Error alert
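Three of the rules above (sustained CPU, consecutive cron failures, low disk space) can be sketched as follows. All function names and input shapes are hypothetical. "For 10 minutes" is interpreted here as an unbroken run of above-threshold samples that started at least 10 minutes ago, which is an assumption about how the rule is meant to be evaluated.

```python
from datetime import datetime, timedelta


def trailing_failures(outcomes: list[bool]) -> int:
    """Count consecutive failures at the end of a run history (oldest first)."""
    count = 0
    for succeeded in reversed(outcomes):
        if succeeded:
            break
        count += 1
    return count


def sustained_above(samples: list[tuple[datetime, float]], now: datetime,
                    threshold: float, duration: timedelta) -> bool:
    """True if the unbroken run of above-threshold samples ending at the
    newest sample began at least `duration` before `now`.

    `samples` must be sorted oldest first.
    """
    run_start = None
    for taken_at, value in reversed(samples):
        if value <= threshold:
            break
        run_start = taken_at
    return run_start is not None and now - run_start >= duration


def evaluate_rules(cpu_samples, job_outcomes, disk_free_pct, now):
    """Return (level, message) pairs for every rule that fires."""
    alerts = []
    if sustained_above(cpu_samples, now, 90.0, timedelta(minutes=10)):
        alerts.append(("critical", "CPU > 90% for 10 minutes"))
    for job, outcomes in job_outcomes.items():
        if trailing_failures(outcomes) >= 3:
            alerts.append(("error", f"{job} failed 3 consecutive times"))
    if disk_free_pct < 5:
        alerts.append(("critical", "Free disk space < 5%"))
    return alerts
```

A single cool sample inside the window resets the CPU rule, so brief spikes do not raise a critical alert.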
Historical Health Data
Metrics Storage
Health metrics are stored in Convex for historical analysis:
- System metrics — CPU, memory, disk usage over time
- Cron performance — Success rates, execution times, failure patterns
- Agent metrics — Task completion rates, error frequencies
- Uptime tracking — Service availability percentages
Trend Analysis
- Performance trends — Identify degradation patterns
- Failure correlation — Link failures to system changes
- Capacity planning — Resource usage growth patterns
- Optimization opportunities — Identify inefficient processes
Health Automation
Auto-Remediation
Certain issues trigger automatic fixes:
- Disk cleanup — Automatic log rotation and temp file removal
- Process restart — Restart failed critical services
- Memory cleanup — Clear caches when memory usage is high
- Network retry — Retry failed network operations
Health Check Automation
A system health cron job runs every 15 minutes and performs these steps:
- Collect system metrics
- Check service status
- Analyze recent logs
- Update health indicators
- Trigger alerts if needed
- Store metrics in database
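The six steps above can be sketched as one function that receives each step as a callable. Every name here is hypothetical, and the real job's structure is not described in this document. Injecting the dependencies keeps the sketch independent of any particular metrics source, alert channel, or database.

```python
from typing import Callable


def run_health_cycle(
    collect_metrics: Callable[[], dict],
    check_services: Callable[[], dict],
    analyze_logs: Callable[[], list],
    classify: Callable[[dict, dict, list], str],
    send_alert: Callable[[str, dict], None],
    store: Callable[[dict], None],
) -> dict:
    """Run one health check pass; every dependency is supplied by the caller."""
    metrics = collect_metrics()        # 1. collect system metrics
    services = check_services()        # 2. check service status
    log_errors = analyze_logs()        # 3. analyze recent logs
    status = classify(metrics, services, log_errors)  # 4. update indicator
    record = {
        "status": status,
        "metrics": metrics,
        "services": services,
        "log_errors": log_errors,
    }
    if status != "green":              # 5. trigger alerts if needed
        send_alert(status, record)
    store(record)                      # 6. store metrics in database
    return record
```

In standard cron syntax, the schedule expression `*/15 * * * *` runs a job every 15 minutes.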
Predictive Monitoring
- Resource exhaustion warnings — Alert before disk or memory fills up
- Performance degradation detection — Identify slowdowns early
- Failure pattern recognition — Predict likely failures based on history
- Capacity threshold alerts — Warn when approaching limits
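One simple way to produce a resource exhaustion warning is to fit a straight line to recent usage samples and extrapolate to the limit. The sketch below uses ordinary least squares for that. The document does not say which prediction method Mission Control uses, so the technique, the function name, and the sample format are all assumptions.

```python
from typing import Optional


def hours_until_limit(samples: list[tuple[float, float]],
                      limit: float = 100.0) -> Optional[float]:
    """Estimate hours from the latest sample until usage reaches `limit`.

    `samples` is a list of (hours, usage_pct) pairs. Returns None when there
    are too few samples or usage is flat or shrinking.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None  # no growth, so no predicted exhaustion
    intercept = mean_u - slope * mean_t
    t_limit = (limit - intercept) / slope  # time at which the fit hits limit
    last_t = max(t for t, _ in samples)
    return max(0.0, t_limit - last_t)
```

With disk usage growing one percentage point per hour from 80% to 82%, `hours_until_limit([(0, 80), (1, 81), (2, 82)])` estimates 18 hours until the disk is full.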
Integration Points
With Mission Control Dashboard
- Real-time status — Live health indicators on home page
- Detailed views — Drill-down pages for each health category
- Historical charts — Graphs showing health trends over time
- Alert integration — Notifications within dashboard UI
With External Systems
- Slack integration — Health alerts in team channels
- Email alerts — Critical notifications via email
- SMS alerting — Emergency notifications for critical issues
- PagerDuty integration — On-call escalation for critical failures
With Other Tools
- GitHub Actions — Build and deployment health
- Vercel monitoring — Application performance and uptime
- Database monitoring — Query performance and connection health
- CDN monitoring — Content delivery performance
Troubleshooting Guides
Common Issues
High CPU usage:
- Check the top processes on the Infrastructure page (/infrastructure)
- Identify resource-intensive jobs
- Consider job scheduling optimization
- Scale infrastructure if needed
Cron job failures:
- Check the failing job's logs on the Crons page (/crons)
- Verify dependencies and credentials
- Test job manually if possible
- Check system resources during failure time
Agent unresponsiveness:
- Check agent logs for error messages
- Verify network connectivity
- Check resource availability
- Restart agent if necessary
Infrastructure alerts:
- Verify alert accuracy with manual checks
- Identify root cause (resource, network, service)
- Apply appropriate remediation
- Monitor for resolution
This health monitoring system is designed to surface and resolve issues before they affect agency operations or client services.