System Health Monitoring

Mission Control provides comprehensive system health monitoring across agents, infrastructure, and automated processes. The health system gives Apple and the team real-time visibility into operational status and potential issues.

Health Indicators

Status Levels

Mission Control uses a three-tier health system:

  • 🟢 Green — Operational, no issues
  • 🟡 Yellow — Minor issues or warnings, monitoring required
  • 🔴 Red — Critical issues requiring immediate attention

Home Dashboard Health Strip

The main dashboard displays three key health categories:

🤖 Agents

Monitors: Ari (main agent), Arlo (sales agent), Axel (dev agent)

Green indicators:

  • Recent successful task completion
  • No error messages in logs
  • Responsive to commands
  • Normal processing times

Yellow indicators:

  • Occasional errors or timeouts
  • Slower than normal response times
  • Minor configuration warnings
  • Non-critical failures in secondary tasks

Red indicators:

  • Agent completely unresponsive
  • Critical task failures
  • Authentication errors
  • System crashes or exceptions

⏰ Crons

Monitors: All scheduled background jobs

Health calculation:

  • Success rate over last 24 hours
  • Average execution time vs. baseline
  • Number of failed jobs
  • Critical vs. non-critical job status

Green indicators:

  • 95%+ success rate
  • All critical jobs completing
  • Normal execution times
  • No stuck or hanging processes

Yellow indicators:

  • 85% to just under 95% success rate
  • Some non-critical job failures
  • Slower than normal execution
  • Occasional timeouts

Red indicators:

  • <85% success rate
  • Critical job failures
  • Multiple consecutive failures
  • System-wide cron problems
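
The success-rate bands above can be sketched as a small classifier. This is an illustrative sketch, not Mission Control's actual code: the `CronWindow` shape and the `cron_health` name are hypothetical, treating three or more consecutive failures as "multiple" is an assumption borrowed from the alert rules, and execution-time drift is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class CronWindow:
    """Summary of cron runs over the last 24 hours (hypothetical shape)."""
    total_runs: int
    failed_runs: int
    critical_failures: int
    max_consecutive_failures: int


def cron_health(w: CronWindow) -> str:
    """Map a 24-hour window of cron runs to green, yellow, or red."""
    if w.total_runs == 0:
        return "yellow"  # assumption: no data is worth a warning
    successes = w.total_runs - w.failed_runs
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    below_85 = successes * 100 < 85 * w.total_runs
    below_95 = successes * 100 < 95 * w.total_runs
    if below_85 or w.critical_failures > 0 or w.max_consecutive_failures >= 3:
        return "red"
    if below_95:
        return "yellow"
    return "green"
```

With this sketch, a window of 100 runs with 10 non-critical failures is yellow, and a single critical failure is red regardless of the overall rate.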

🏗️ Infrastructure

Monitors: Server resources, network, and core services

Tracked metrics:

  • CPU usage and load average
  • Memory utilization
  • Disk space availability
  • Network connectivity
  • Service uptime

Green indicators:

  • CPU < 70%, Memory < 80%
  • All services responding
  • Network latency normal
  • No disk space warnings

Yellow indicators:

  • CPU 70-85%, Memory 80-90%
  • Intermittent service issues
  • Elevated network latency
  • Disk space 85-95% full

Red indicators:

  • CPU > 85%, Memory > 90%
  • Services down or unresponsive
  • Network connectivity issues
  • Disk space > 95% full
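
One way to combine these thresholds into a single indicator is to classify each metric separately and report the worst result. The sketch below assumes that worst-of roll-up; the document does not say how the signals are combined, and the `band` and `infra_health` names are hypothetical.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def band(value: float, yellow_from: float, red_above: float) -> str:
    """Classify one percentage against a yellow floor and a red ceiling."""
    if value > red_above:
        return "red"
    if value >= yellow_from:
        return "yellow"
    return "green"


def infra_health(cpu_pct: float, mem_pct: float, disk_pct: float,
                 services_up: bool = True) -> str:
    """Return the worst status across CPU, memory, disk, and services."""
    levels = [
        band(cpu_pct, 70, 85),    # CPU: yellow 70-85%, red above 85%
        band(mem_pct, 80, 90),    # memory: yellow 80-90%, red above 90%
        band(disk_pct, 85, 95),   # disk: yellow 85-95% full, red above 95%
        "green" if services_up else "red",
    ]
    return max(levels, key=SEVERITY.__getitem__)
```

For example, `infra_health(75, 60, 40)` is yellow because of CPU alone, while `infra_health(50, 60, 96)` is red because of disk.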

Detailed Health Pages

Infrastructure Page (/infrastructure)

Comprehensive server monitoring with:

Top Processes Table:

  • Process name and PID
  • CPU usage percentage
  • Memory consumption
  • Runtime duration
  • Process status

System Metrics:

  • Server uptime
  • Load averages (1m, 5m, 15m)
  • Memory usage breakdown
  • Disk usage by volume
  • Network interface status

Resource Alerts:

  • Processes consuming excessive resources
  • Memory leaks or runaway processes
  • Disk space warnings
  • Network connectivity issues

Crons Page (/crons)

Detailed scheduled job monitoring:

Job Table Columns:

  • Job name and description
  • Schedule frequency
  • Last run timestamp
  • Status (success/failure/running)
  • Execution duration
  • Next scheduled run
  • Error messages (if any)

Filtering Options:

  • Status filtering (all/success/failed/running)
  • Search by job name
  • Sort by last run, duration, or status

Job Categories:

  • Revenue collection — Daily financial data aggregation
  • Data pipeline — Creator and brand data processing
  • Health checks — System monitoring and alerting
  • Backup and maintenance — Data backup and cleanup
  • Notifications — Alert and report generation

Agents Page (/agents)

Individual agent health monitoring:

Per Agent Display:

  • Current operational status
  • Recent task completion rate
  • Error log highlights
  • Assigned cron jobs
  • Performance metrics

Agent-Specific Metrics:

  • Ari (Main Agent):

    • Email processing rate
    • API call success rate
    • Data pipeline completion
    • Integration health
  • Arlo (Sales Agent):

    • CRM sync success
    • Proposal generation rate
    • Client communication metrics
    • Revenue tracking accuracy
  • Axel (Dev Agent):

    • Build success rate
    • Deployment frequency
    • Error resolution time
    • Code quality metrics

Health Data Sources

Real-Time Monitoring

  • System logs: /var/log/ and application logs
  • Process monitoring: ps, top, and htop system commands
  • Network status: ping and curl connectivity tests
  • Service status: systemctl and service-specific health endpoints
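
As a rough illustration of metric collection, the sketch below reads disk usage and load averages with the Python standard library rather than shelling out to the commands listed above. The `collect_metrics` name and the returned dictionary shape are hypothetical, and the real collector may work differently.

```python
import os
import shutil


def collect_metrics(path: str = "/") -> dict:
    """Return a small snapshot of disk usage and load averages."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    try:
        load_avg = os.getloadavg()  # (1m, 5m, 15m); Unix only
    except (AttributeError, OSError):
        load_avg = None  # platform does not expose load averages
    return {
        "disk_used_pct": 100 * usage.used / usage.total,
        "load_avg": load_avg,
    }
```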

Cron Job Monitoring

  • Execution logs: Individual job output and error logs
  • Timing data: Start time, duration, completion status
  • Resource usage: CPU and memory consumption during execution
  • Dependencies: Status of service and data dependencies

Agent Health Checks

  • Heartbeat monitoring: Regular ping/pong with agents
  • Task completion tracking: Success/failure rates for assigned tasks
  • Error log analysis: Parsing error messages for patterns
  • Performance benchmarking: Response time and throughput metrics

Alerting System

Alert Levels

  • Info: General status updates, successful completions
  • Warning: Non-critical issues requiring attention
  • Error: Failures that impact functionality
  • Critical: System-wide issues requiring immediate response

Alert Channels

  • Mission Control dashboard — Visual indicators and notifications
  • Slack alerts — Automated messages to #operations channel
  • Email notifications — Critical alerts to Apple and team
  • SMS alerts — Emergency notifications for critical failures

Alert Rules

  • CPU > 90% for 10 minutes → Critical infrastructure alert
  • Cron job failed 3 consecutive times → Error alert
  • Agent unresponsive for 30 minutes → Critical agent alert
  • Free disk space < 5% → Critical infrastructure alert
  • Revenue collection job failed → Error alert
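
Two of these rules depend on history rather than a single reading. The sketch below shows one possible way to evaluate them. The function names, the sample format, and the reading of "for 10 minutes" as an unbroken run of samples above the threshold are all assumptions.

```python
from datetime import datetime, timedelta


def consecutive_failures(statuses: list) -> int:
    """Count trailing "failure" entries in a chronological status list."""
    count = 0
    for status in reversed(statuses):
        if status != "failure":
            break
        count += 1
    return count


def cpu_sustained_above(samples: list, threshold: float = 90.0,
                        window: timedelta = timedelta(minutes=10)) -> bool:
    """True if the latest unbroken run of samples above `threshold`
    spans at least `window`. `samples` is a chronological list of
    (timestamp, cpu_percent) tuples."""
    streak_start = None
    for ts, cpu in reversed(samples):
        if cpu <= threshold:
            break
        streak_start = ts  # walk back to the start of the hot streak
    if streak_start is None:
        return False
    return samples[-1][0] - streak_start >= window
```

An error alert for the cron rule would then fire when `consecutive_failures(...) >= 3`. Note that a sustained-CPU check needs samples taken more often than the window it measures.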

Historical Health Data

Metrics Storage

Health metrics are stored in Convex for historical analysis:

  • System metrics — CPU, memory, disk usage over time
  • Cron performance — Success rates, execution times, failure patterns
  • Agent metrics — Task completion rates, error frequencies
  • Uptime tracking — Service availability percentages

Trend Analysis

  • Performance trends — Identify degradation patterns
  • Failure correlation — Link failures to system changes
  • Capacity planning — Resource usage growth patterns
  • Optimization opportunities — Identify inefficient processes

Health Automation

Auto-Remediation

Certain issues trigger automatic fixes:

  • Disk cleanup — Automatic log rotation and temp file removal
  • Process restart — Restart failed critical services
  • Memory cleanup — Clear caches when memory usage is high
  • Network retry — Retry failed network operations
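
The network retry item could look something like the sketch below, which retries with exponential backoff. The backoff strategy, the attempt count, and the `retry_network` name are assumptions; the document only states that failed network operations are retried.

```python
import time


def retry_network(operation, attempts: int = 3, base_delay: float = 1.0,
                  sleep=time.sleep):
    """Call `operation`, retrying on OSError with exponential backoff.

    ConnectionError and TimeoutError are subclasses of OSError, so common
    network failures are covered. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the delays can be observed or skipped in tests.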

Health Check Automation

The system health cron runs every 15 minutes:

  1. Collect system metrics
  2. Check service status
  3. Analyze recent logs
  4. Update health indicators
  5. Trigger alerts if needed
  6. Store metrics in database
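
The six steps above can be sketched as one function that takes each step as a callable. Everything here is hypothetical scaffolding (the `Snapshot` shape, the `run_cycle` name, and the rule that any non-green status triggers an alert); it shows only the order of operations, not the real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Data gathered during one health check cycle (hypothetical shape)."""
    metrics: dict = field(default_factory=dict)
    services: dict = field(default_factory=dict)
    log_findings: list = field(default_factory=list)
    status: str = "green"


def run_cycle(collect_metrics, check_services, analyze_logs,
              classify, alert, store) -> Snapshot:
    """Run one health check cycle in the documented order."""
    snap = Snapshot()
    snap.metrics = collect_metrics()     # 1. collect system metrics
    snap.services = check_services()     # 2. check service status
    snap.log_findings = analyze_logs()   # 3. analyze recent logs
    snap.status = classify(snap)         # 4. update health indicators
    if snap.status != "green":           # 5. trigger alerts if needed
        alert(snap)
    store(snap)                          # 6. store metrics in database
    return snap
```

Injecting the steps keeps the cycle testable without a live server, Slack channel, or database.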

Predictive Monitoring

  • Resource exhaustion warnings — Alert before disk/memory full
  • Performance degradation detection — Identify slowdowns early
  • Failure pattern recognition — Predict likely failures based on history
  • Capacity threshold alerts — Warn when approaching limits
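
A resource exhaustion warning can be approximated by fitting a straight line to recent usage samples and extrapolating to the limit. The sketch below uses an ordinary least-squares fit. The document does not say which prediction method is used, so the linear model and the `hours_until_full` name are assumptions.

```python
from typing import Optional


def hours_until_full(samples: list, limit: float = 100.0) -> Optional[float]:
    """Estimate hours until usage reaches `limit`.

    `samples` is a chronological list of (hours, percent_used) tuples.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never reaches the limit
    intercept = mean_u - slope * mean_t
    t_full = (limit - intercept) / slope
    return max(0.0, t_full - samples[-1][0])
```

For example, disk usage growing one point per hour from 80% gives about 18 hours of headroom after the third hourly sample, which leaves time to alert before the disk is full.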

Integration Points

With Mission Control Dashboard

  • Real-time status — Live health indicators on home page
  • Detailed views — Drill-down pages for each health category
  • Historical charts — Graphs showing health trends over time
  • Alert integration — Notifications within dashboard UI

With External Systems

  • Slack integration — Health alerts in team channels
  • Email alerts — Critical notifications via email
  • SMS alerting — Emergency notifications for critical issues
  • PagerDuty integration — On-call escalation for critical failures

With Other Tools

  • GitHub Actions — Build and deployment health
  • Vercel monitoring — Application performance and uptime
  • Database monitoring — Query performance and connection health
  • CDN monitoring — Content delivery performance

Troubleshooting Guides

Common Issues

High CPU usage:

  1. Check the Top Processes table on the Infrastructure page (/infrastructure)
  2. Identify resource-intensive jobs
  3. Consider job scheduling optimization
  4. Scale infrastructure if needed

Cron job failures:

  1. Check the failing job's error messages and logs on the Crons page (/crons)
  2. Verify dependencies and credentials
  3. Test job manually if possible
  4. Check system resources during failure time

Agent unresponsiveness:

  1. Check agent logs for error messages
  2. Verify network connectivity
  3. Check resource availability
  4. Restart agent if necessary

Infrastructure alerts:

  1. Verify alert accuracy with manual checks
  2. Identify root cause (resource, network, service)
  3. Apply appropriate remediation
  4. Monitor for resolution

This health monitoring system is designed to identify and resolve issues before they affect agency operations or client services.