System Health Monitoring

Mission Control provides comprehensive system health monitoring across agents, infrastructure, and automated processes. The health system gives Apple and the team real-time visibility into operational status and potential issues.

Health Indicators

Status Levels

Mission Control uses a three-tier health system:

🟢 Green — Operational, no issues
🟡 Yellow — Minor issues or warnings, monitoring required
🔴 Red — Critical issues requiring immediate attention
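
The three tiers can be modelled as an ordered value so that several component statuses reduce to one. The sketch below is illustrative only: the `Health` and `rollup` names are hypothetical, and the "worst status wins" rule is an assumption, not something this page specifies.

```python
from enum import IntEnum


class Health(IntEnum):
    """Three-tier status; a higher value is more severe."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def rollup(statuses):
    """Return the most severe status, or GREEN when nothing is reported.

    Assumption: a category is only as healthy as its worst component.
    """
    return max(statuses, default=Health.GREEN)


print(rollup([Health.GREEN, Health.YELLOW, Health.GREEN]).name)  # prints YELLOW
```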

Home Dashboard Health Strip

The main dashboard displays three key health categories:

🤖 Agents

Monitors: Ari (main agent), Arlo (sales agent), Axel (dev agent)

Green indicators:

Recent successful task completion
No error messages in logs
Responsive to commands
Normal processing times

Yellow indicators:

Occasional errors or timeouts
Slower than normal response times
Minor configuration warnings
Non-critical failures in secondary tasks

Red indicators:

Agent completely unresponsive
Critical task failures
Authentication errors
System crashes or exceptions

⏰ Crons

Monitors: All scheduled background jobs

Health calculation:

Success rate over last 24 hours
Average execution time vs. baseline
Number of failed jobs
Critical vs. non-critical job status

Green indicators:

95%+ success rate
All critical jobs completing
Normal execution times
No stuck or hanging processes

Yellow indicators:

85% to under 95% success rate
Some non-critical job failures
Slower than normal execution
Occasional timeouts

Red indicators:

<85% success rate
Critical job failures
Multiple consecutive failures
System-wide cron problems
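
As a rough illustration of how the success-rate thresholds above might be applied, the sketch below classifies a 24-hour window of runs. `CronRun` and `cron_health` are hypothetical names, and the handling of an empty window and of critical failures is an assumption; execution time and stuck processes are not modelled.

```python
from dataclasses import dataclass


@dataclass
class CronRun:
    job: str
    succeeded: bool
    critical: bool = False


def cron_health(runs):
    """Classify cron health from the last 24 hours of runs."""
    if not runs:
        return "green"  # assumption: no runs means nothing has failed
    # Any failed critical job is treated as red regardless of the rate.
    if any(r.critical and not r.succeeded for r in runs):
        return "red"
    rate = sum(r.succeeded for r in runs) / len(runs)
    if rate >= 0.95:
        return "green"
    if rate >= 0.85:
        return "yellow"
    return "red"
```

For example, 18 successes out of 20 non-critical runs is a 90% success rate, which this sketch reports as yellow.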

🏗️ Infrastructure

Monitors: Server resources, network, and core services

Tracked metrics:

CPU usage and load average
Memory utilization
Disk space availability
Network connectivity
Service uptime

Green indicators:

CPU < 70%, Memory < 80%
All services responding
Network latency normal
No disk space warnings

Yellow indicators:

CPU 70-85%, Memory 80-90%
Intermittent service issues
Elevated network latency
Disk space 85-95% full

Red indicators:

CPU > 85%, Memory > 90%
Services down or unresponsive
Network connectivity issues
Disk space > 95% full
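
The resource thresholds above translate directly into a small classifier. This is a sketch under stated assumptions: the function name `infra_health` is hypothetical, the worst single metric decides the result, and service and network checks are left out.

```python
def infra_health(cpu_pct, mem_pct, disk_pct):
    """Map CPU, memory, and disk usage percentages to a status level."""
    # Red: any metric past its critical threshold.
    if cpu_pct > 85 or mem_pct > 90 or disk_pct > 95:
        return "red"
    # Yellow: any metric inside its warning band.
    if cpu_pct >= 70 or mem_pct >= 80 or disk_pct >= 85:
        return "yellow"
    return "green"


print(infra_health(cpu_pct=72.0, mem_pct=61.0, disk_pct=40.0))  # prints yellow
```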

Detailed Health Pages

Infrastructure Page (`/infrastructure`)

Comprehensive server monitoring with:

Top Processes Table:

Process name and PID
CPU usage percentage
Memory consumption
Runtime duration
Process status

System Metrics:

Server uptime
Load averages (1m, 5m, 15m)
Memory usage breakdown
Disk usage by volume
Network interface status

Resource Alerts:

Processes consuming excessive resources
Memory leaks or runaway processes
Disk space warnings
Network connectivity issues

Crons Page (`/crons`)

Detailed scheduled job monitoring:

Job Table Columns:

Job name and description
Schedule frequency
Last run timestamp
Status (success/failure/running)
Execution duration
Next scheduled run
Error messages (if any)

Filtering Options:

Status filtering (all/success/failed/running)
Search by job name
Sort by last run, duration, or status

Job Categories:

Revenue collection — Daily financial data aggregation
Data pipeline — Creator and brand data processing
Health checks — System monitoring and alerting
Backup and maintenance — Data backup and cleanup
Notifications — Alert and report generation

Agents Page (`/agents`)

Individual agent health monitoring:

Per Agent Display:

Current operational status
Recent task completion rate
Error log highlights
Assigned cron jobs
Performance metrics

Agent-Specific Metrics:

Ari (Main Agent):
- Email processing rate
- API call success rate
- Data pipeline completion
- Integration health
Arlo (Sales Agent):
- CRM sync success
- Proposal generation rate
- Client communication metrics
- Revenue tracking accuracy
Axel (Dev Agent):
- Build success rate
- Deployment frequency
- Error resolution time
- Code quality metrics

Health Data Sources

Real-Time Monitoring

System logs: `/var/log/` and application logs
Process monitoring: `ps`, `top`, `htop` system commands
Network status: `ping`, `curl` connectivity tests
Service status: `systemctl` and service-specific health endpoints
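
Some of these readings can also be taken without shelling out. The following minimal sketch uses only the Python standard library to sample disk usage and load averages; the function name `collect_metrics` and the returned key names are hypothetical, and nothing here is claimed to be how Mission Control collects its data.

```python
import os
import shutil


def collect_metrics(path="/"):
    """Collect a few host metrics using only the standard library."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100 if usage.total else 0.0
    metrics = {"disk_used_pct": round(used_pct, 1)}
    try:
        # os.getloadavg() exists on Unix only and can raise OSError.
        load_1m, load_5m, load_15m = os.getloadavg()
        metrics.update(load_1m=load_1m, load_5m=load_5m, load_15m=load_15m)
    except (AttributeError, OSError):
        pass
    return metrics


print(collect_metrics())
```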

Cron Job Monitoring

Execution logs: Individual job output and error logs
Timing data: Start time, duration, completion status
Resource usage: CPU and memory consumption during execution
Dependencies: Service and data dependencies status

Agent Health Checks

Heartbeat monitoring: Regular ping/pong with agents
Task completion tracking: Success/failure rates for assigned tasks
Error log analysis: Parsing error messages for patterns
Performance benchmarking: Response time and throughput metrics
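
Heartbeat monitoring usually comes down to comparing each agent's last-seen timestamp with a staleness limit. The sketch below assumes a 30-minute limit, taken from the agent alert rule on this page; the function name and the shape of the `last_seen` mapping are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UNRESPONSIVE_AFTER = timedelta(minutes=30)  # assumed from the agent alert rule


def unresponsive_agents(last_seen, now=None):
    """Return names of agents whose last heartbeat is older than the limit.

    last_seen maps an agent name to a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, ts in last_seen.items()
        if now - ts > UNRESPONSIVE_AFTER
    )
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test against recorded heartbeat data.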

Alerting System

Alert Levels

Info: General status updates, successful completions
Warning: Non-critical issues requiring attention
Error: Failures that impact functionality
Critical: System-wide issues requiring immediate response

Alert Channels

Mission Control dashboard — Visual indicators and notifications
Slack alerts — Automated messages to #operations channel
Email notifications — Critical alerts to Apple and team
SMS alerts — Emergency notifications for critical failures

Alert Rules

CPU > 90% for 10 minutes → Critical infrastructure alert
Cron job failed 3 consecutive times → Error alert
Agent unresponsive for 30 minutes → Critical agent alert
Free disk space < 5% → Critical infrastructure alert
Revenue collection job failed → Error alert
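
Two of these rules depend on a condition persisting rather than on a single reading. The sketch below shows one way to evaluate them, assuming one CPU sample per minute and a job history ordered oldest first; the function names and input shapes are hypothetical.

```python
def consecutive_failures(results):
    """Count failed runs at the end of a history (oldest first, True = success)."""
    count = 0
    for ok in reversed(results):
        if ok:
            break
        count += 1
    return count


def cron_alert(results, threshold=3):
    """Return "error" once the most recent runs have failed `threshold` times in a row."""
    return "error" if consecutive_failures(results) >= threshold else None


def cpu_critical(samples, limit=90.0, minutes=10):
    """True when every one of the last `minutes` per-minute samples exceeds `limit`."""
    recent = samples[-minutes:]
    return len(recent) >= minutes and all(s > limit for s in recent)
```

Requiring a full window of samples avoids raising a critical alert from a short spike or from a freshly started collector.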

Historical Health Data

Metrics Storage

Health metrics are stored in Convex for historical analysis:

System metrics — CPU, memory, disk usage over time
Cron performance — Success rates, execution times, failure patterns
Agent metrics — Task completion rates, error frequencies
Uptime tracking — Service availability percentages

Trend Analysis

Performance trends — Identify degradation patterns
Failure correlation — Link failures to system changes
Capacity planning — Resource usage growth patterns
Optimization opportunities — Identify inefficient processes

Health Automation

Auto-Remediation

Certain issues trigger automatic fixes:

Disk cleanup — Automatic log rotation and temp file removal
Process restart — Restart failed critical services
Memory cleanup — Clear caches when memory usage is high
Network retry — Retry failed network operations

Health Check Automation

The system health cron runs every 15 minutes and performs these steps:

Collect system metrics
Check service status
Analyze recent logs
Update health indicators
Trigger alerts if needed
Store metrics in database
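
The cycle above can be sketched as a single function that takes its collaborators as arguments. This is an outline under assumptions, not the actual job: `run_health_check` and its four callables are hypothetical, and the service and log checks are folded into `collect` for brevity.

```python
def run_health_check(collect, classify, alert, store):
    """Run one health-check cycle and return the resulting status.

    collect()              -> dict of metrics (services and logs included)
    classify(metrics)      -> "green", "yellow", or "red"
    alert(status, metrics) -> called only when the status is not green
    store(record)          -> persists the status alongside the metrics
    """
    metrics = collect()
    status = classify(metrics)
    if status != "green":
        alert(status, metrics)
    store({"status": status, **metrics})
    return status
```

A crontab schedule of `*/15 * * * *` matches the 15-minute cadence; the command it invokes depends on the deployment.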

Predictive Monitoring

Resource exhaustion warnings — Alert before disk/memory full
Performance degradation detection — Identify slowdowns early
Failure pattern recognition — Predict likely failures based on history
Capacity threshold alerts — Warn when approaching limits
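
One simple way to produce a resource exhaustion warning is to fit a straight line to recent usage readings and extrapolate. The sketch below does this with a least-squares slope; the function name and sample format are hypothetical, and linear growth is an assumption that real usage may not follow.

```python
def hours_until_full(samples, full_pct=100.0):
    """Estimate hours until usage reaches full_pct.

    samples: list of (hour, used_pct) pairs, oldest first.
    Returns None when there are too few points or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    # Least-squares slope: growth in percentage points per hour.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    _, latest_pct = samples[-1]
    # Project forward from the latest reading at the fitted growth rate.
    return (full_pct - latest_pct) / slope


print(hours_until_full([(0, 80.0), (1, 82.0), (2, 84.0)]))  # prints 8.0
```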

Integration Points

With Mission Control Dashboard

Real-time status — Live health indicators on home page
Detailed views — Drill-down pages for each health category
Historical charts — Graphs showing health trends over time
Alert integration — Notifications within dashboard UI

With External Systems

Slack integration — Health alerts in team channels
Email alerts — Critical notifications via email
SMS alerting — Emergency notifications for critical issues
PagerDuty integration — On-call escalation for critical failures

With Other Tools

GitHub Actions — Build and deployment health
Vercel monitoring — Application performance and uptime
Database monitoring — Query performance and connection health
CDN monitoring — Content delivery performance

Troubleshooting Guides

Common Issues

High CPU usage:

Check the Top Processes table on the Infrastructure page
Identify resource-intensive jobs
Consider job scheduling optimization
Scale infrastructure if needed

Cron job failures:

Check the failing job's logs on the Crons page
Verify dependencies and credentials
Test job manually if possible
Check system resources during failure time

Agent unresponsiveness:

Check agent logs for error messages
Verify network connectivity
Check resource availability
Restart agent if necessary

Infrastructure alerts:

Verify alert accuracy with manual checks
Identify root cause (resource, network, service)
Apply appropriate remediation
Monitor for resolution

This health monitoring system is designed to surface and resolve issues before they affect agency operations or client services.
