Security Monitoring Best Practices: Alerts, Dashboards, and Incident Detection

TL;DR

The most important security monitoring practice is to alert on authentication failures and unusual patterns before they become incidents. Track authentication failures, authorization denials, and error rates. Set up alerts for brute force attempts, account takeover indicators, and system anomalies, and keep runbooks ready for the common ones. A false positive costs less than a missed attack, but every alert should still be actionable.

"You cannot defend what you cannot see. Security monitoring transforms invisible attacks into actionable alerts."

Best Practice 1: Monitor Security-Critical Metrics (10 min)

Track these metrics for security visibility:

Metric                | Normal Range | Alert Threshold | Indicates
Failed logins/min     | 0-10         | >50             | Brute force attack
403 responses/min     | Baseline     | >3x baseline    | Privilege escalation
Account lockouts/hour | 0-5          | >20             | Credential stuffing
Password resets/hour  | Baseline     | >5x baseline    | Account takeover attempt
New admin users/day   | 0-1          | >2              | Privilege abuse

Prometheus metrics for security monitoring
import { Counter, Histogram } from 'prom-client';

// Authentication metrics
const authAttempts = new Counter({
  name: 'auth_attempts_total',
  help: 'Total authentication attempts',
  labelNames: ['result', 'method'],  // success/failure, password/oauth/mfa
});

const authLatency = new Histogram({
  name: 'auth_latency_seconds',
  help: 'Authentication request latency',
  buckets: [0.1, 0.5, 1, 2, 5],
});

// Authorization metrics
const authzDenials = new Counter({
  name: 'authorization_denials_total',
  help: 'Authorization denials',
  labelNames: ['resource', 'action', 'reason'],
});

// Usage in middleware
authAttempts.inc({ result: 'failure', method: 'password' });
authzDenials.inc({ resource: 'admin', action: 'delete', reason: 'insufficient_role' });
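
The thresholds in the table above can also be kept as data and checked in one place. The following is a minimal sketch, assuming the current per-window values and baselines have already been read from your metrics store. The `THRESHOLDS` table, the `evaluateMetrics` function, and the metric keys are hypothetical names for illustration, not part of prom-client.

```javascript
// Hypothetical threshold table mirroring the metrics table above.
// `limit` is an absolute count; `baselineMultiple` is relative to a measured baseline.
const THRESHOLDS = {
  failedLoginsPerMin: { limit: 50, indicates: 'Brute force attack' },
  forbiddenPerMin: { baselineMultiple: 3, indicates: 'Privilege escalation' },
  lockoutsPerHour: { limit: 20, indicates: 'Credential stuffing' },
  passwordResetsPerHour: { baselineMultiple: 5, indicates: 'Account takeover attempt' },
  newAdminsPerDay: { limit: 2, indicates: 'Privilege abuse' },
};

// Returns one entry per metric whose current value exceeds its alert threshold.
function evaluateMetrics(current, baselines = {}) {
  const breaches = [];
  for (const [metric, rule] of Object.entries(THRESHOLDS)) {
    const value = current[metric];
    if (value === undefined) continue; // metric not reported in this window

    let threshold;
    if (rule.limit !== undefined) {
      threshold = rule.limit;
    } else {
      const baseline = baselines[metric];
      if (baseline === undefined) continue; // no baseline yet, cannot compare
      threshold = baseline * rule.baselineMultiple;
    }

    if (value > threshold) {
      breaches.push({ metric, value, threshold, indicates: rule.indicates });
    }
  }
  return breaches;
}
```

Calling `evaluateMetrics({ failedLoginsPerMin: 75 })` would return a single breach labeled "Brute force attack". Baseline-relative metrics are skipped until a baseline is supplied, which avoids alerting on guesses.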

Best Practice 2: Set Up Meaningful Alerts (10 min)

Alerts should be actionable, not noisy:

Prometheus alert rules
# Brute force detection
# rate() returns a per-second value, so multiply by 60 to compare failures per minute
- alert: BruteForceAttempt
  expr: sum(rate(auth_attempts_total{result="failure"}[5m])) * 60 > 10
  for: 2m
  labels:
    severity: high
  annotations:
    summary: "Possible brute force attack"
    runbook: "https://wiki.example.com/runbooks/brute-force"

# Unusual admin activity
- alert: UnusualAdminActivity
  expr: |
    sum(rate(admin_actions_total[1h])) >
    2 * avg_over_time(sum(rate(admin_actions_total[1h]))[7d:1h])
  labels:
    severity: medium
  annotations:
    summary: "Unusual admin activity detected"

# Error rate spike
- alert: ErrorRateSpike
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Error rate above 5%"

Guidelines for rolling out alerts:

  • Start with high-severity alerts only
  • Include runbook links in every alert
  • Set appropriate thresholds to minimize false positives
  • Route alerts to the right team (security vs ops)
  • Review and tune alerts weekly
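
The `for: 2m` clause in the rules above is what keeps a momentary spike from paging anyone: the expression has to stay above the threshold for the whole duration before the alert fires. Below is a rough sketch of that pending-then-firing behavior. The `evaluateAlert` function and the sample shape are hypothetical, and the model only approximates what Prometheus does internally.

```javascript
// Simplified model of an alert rule with a `for` duration.
// samples: [{ t: secondsSinceStart, value: number }], ordered by time.
function evaluateAlert(samples, threshold, forSeconds) {
  let pendingSince = null; // time the condition first became true
  const states = [];

  for (const { t, value } of samples) {
    if (value > threshold) {
      if (pendingSince === null) pendingSince = t;
      const held = t - pendingSince;
      states.push({ t, state: held >= forSeconds ? 'firing' : 'pending' });
    } else {
      pendingSince = null; // any dip below the threshold resets the timer
      states.push({ t, state: 'inactive' });
    }
  }
  return states;
}

// Failed logins per minute, evaluated every 60 seconds.
const samples = [
  { t: 0, value: 4 },
  { t: 60, value: 30 },  // spike starts
  { t: 120, value: 5 },  // dips below threshold: timer resets
  { t: 180, value: 25 },
  { t: 240, value: 40 },
  { t: 300, value: 35 }, // above threshold for 120 seconds
];
console.log(evaluateAlert(samples, 10, 120).map(s => s.state).join(' '));
// prints "inactive pending inactive pending pending firing"
```

The one-minute spike at t=60 never leaves the pending state, which is the kind of noise the `for` clause is meant to filter out.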

Best Practice 3: Build Security Dashboards (15 min)

Visualize security posture at a glance:

Key dashboard panels
// Dashboard sections to include:

// 1. Authentication Overview
- Login success/failure rate (time series)
- Top IPs by failed logins (table)
- Geographic login distribution (map)
- MFA adoption rate (gauge)

// 2. Authorization & Access
- Permission denials by resource (bar chart)
- Unusual access patterns (heatmap by hour)
- Privilege escalation attempts (counter)

// 3. Application Security
- Error rates by endpoint (time series)
- Rate limit hits (counter)
- Blocked requests by WAF (time series)

// 4. User Behavior
- New account registrations (time series)
- Password reset requests (time series)
- Session anomalies (list)

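// Grafana dashboard JSON structure example
// (assumes the failure counter also carries an `ip` label, which the metrics defined earlier do not include)
{
  "panels": [
    {
      "title": "Failed Logins by IP",
      "type": "table",
      "targets": [{
        "expr": "topk(10, sum by (ip) (increase(auth_attempts_total{result=\"failure\"}[1h])))"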
      }]
    }
  ]
}
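
The `topk(10, sum by (ip) (...))` query groups failures by source IP and keeps the largest groups. The same aggregation can be sketched over raw log events, which is often where per-IP data lives, since IP addresses make for high-cardinality metric labels. The event shape and the `topFailedLoginIps` function are hypothetical.

```javascript
// events: [{ ip: string, success: boolean, timestamp: number (ms since epoch) }]
// Returns the `limit` IPs with the most failed logins since `sinceMs`.
function topFailedLoginIps(events, { sinceMs, limit = 10 }) {
  const counts = new Map();
  for (const e of events) {
    if (e.success || e.timestamp < sinceMs) continue; // only recent failures count
    counts.set(e.ip, (counts.get(e.ip) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([ip, failures]) => ({ ip, failures }))
    .sort((a, b) => b.failures - a.failures)
    .slice(0, limit);
}
```

A table panel could render the returned rows directly, with one row per IP and its failure count.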

Best Practice 4: Implement Anomaly Detection (20 min)

Detect unusual patterns automatically:

User behavior anomaly detection
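// Track user behavior baselines.
// getUserBaseline, getCurrentBehavior, logSecurityEvent, and sendSecurityAlert
// are application-specific helpers assumed to be defined elsewhere.
async function checkUserAnomaly(userId, action) {
  const baseline = await getUserBaseline(userId);
  const current = await getCurrentBehavior(userId);

  // Assumes getUserBaseline returns null for users with no history yet;
  // skip the comparison rather than flag everything as anomalous
  if (!baseline) return [];

  const anomalies = [];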

  // Unusual time
  if (!baseline.activeHours.includes(current.hour)) {
    anomalies.push({
      type: 'unusual_time',
      severity: 'low',
      detail: `Login at ${current.hour}:00, usually active ${baseline.activeHours.join(', ')}`,
    });
  }

  // New location
  if (!baseline.locations.includes(current.location)) {
    anomalies.push({
      type: 'new_location',
      severity: 'medium',
      detail: `Login from ${current.location}, new location`,
    });
  }

  // Unusual velocity
  if (current.actionsPerMinute > baseline.avgActionsPerMinute * 3) {
    anomalies.push({
      type: 'high_velocity',
      severity: 'high',
      detail: `${current.actionsPerMinute} actions/min, baseline is ${baseline.avgActionsPerMinute}`,
    });
  }

  if (anomalies.length > 0) {
    await logSecurityEvent('user_anomaly', {
      userId,
      action,
      anomalies,
    });

    // High severity triggers immediate alert
    if (anomalies.some(a => a.severity === 'high')) {
      await sendSecurityAlert('user_anomaly', { userId, anomalies });
    }
  }

  return anomalies;
}
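
`checkUserAnomaly` depends on a per-user baseline with `activeHours`, `locations`, and `avgActionsPerMinute`. One way such a baseline could be derived from historical activity is sketched below. The `buildBaseline` function, the session shape, and the `minHourShare` cutoff are assumptions for illustration, not the article's implementation.

```javascript
// history: [{ hour: 0-23, location: string, actionsPerMinute: number }]
// One entry per past session for a single user.
function buildBaseline(history, { minHourShare = 0.05 } = {}) {
  if (history.length === 0) return null; // no data: caller should skip anomaly checks

  const hourCounts = new Map();
  const locations = new Set();
  let totalRate = 0;

  for (const session of history) {
    hourCounts.set(session.hour, (hourCounts.get(session.hour) || 0) + 1);
    locations.add(session.location);
    totalRate += session.actionsPerMinute;
  }

  // Keep only hours that account for a meaningful share of sessions,
  // so a single 3 a.m. login does not become "normal".
  const activeHours = [...hourCounts.entries()]
    .filter(([, count]) => count / history.length >= minHourShare)
    .map(([hour]) => hour)
    .sort((a, b) => a - b);

  return {
    activeHours,
    locations: [...locations],
    avgActionsPerMinute: totalRate / history.length,
  };
}
```

Recomputing the baseline on a schedule (for example nightly) rather than on every request keeps the anomaly check cheap. Note that every location ever seen is kept here, so a stricter variant might expire locations that have not been used recently.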

Best Practice 5: Create Incident Response Runbooks (15 min)

Document response procedures before you need them:

Example runbook structure
# Runbook: Brute Force Attack Response

## Detection
- Alert: BruteForceAttempt fired
- Metric: auth_attempts_total{result="failure"} above 10/min for 2+ minutes

## Triage (5 minutes)
1. Check affected accounts: SELECT user_id, count(*) FROM auth_logs
   WHERE success = false AND time > now() - interval '10 minutes'
   GROUP BY user_id ORDER BY count(*) DESC
2. Check source IPs: Are they from known bad ranges?
3. Check if MFA is protecting affected accounts

## Containment (15 minutes)
- If single IP: Add to WAF blocklist
- If distributed: Enable CAPTCHA for login
- If targeting specific accounts: Force password reset

## Investigation
- Pull full logs for attack period
- Check for successful logins from attack IPs
- Review for account compromise indicators

## Recovery
- Remove temporary blocks after attack stops
- Reset passwords for compromised accounts
- Notify affected users

## Post-Incident
- Update alert thresholds if needed
- Document lessons learned
- Schedule security review
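
The containment branches in the runbook (single IP, distributed, or targeted accounts) can be chosen from the same failed-login rows the triage step inspects. This is a minimal sketch, assuming rows with `user_id` and `ip` fields. The `classifyAttack` function and its cutoffs are hypothetical and should be tuned to your own traffic.

```javascript
// failures: [{ user_id: string, ip: string }] from the last 10 minutes of failed logins.
function classifyAttack(failures, { dominantShare = 0.8, targetedAccounts = 3 } = {}) {
  if (failures.length === 0) return { pattern: 'none', action: 'No action needed' };

  const byIp = new Map();
  const byUser = new Map();
  for (const f of failures) {
    byIp.set(f.ip, (byIp.get(f.ip) || 0) + 1);
    byUser.set(f.user_id, (byUser.get(f.user_id) || 0) + 1);
  }

  const [topIp, topIpCount] = [...byIp.entries()].sort((a, b) => b[1] - a[1])[0];

  // One IP responsible for most failures: block it at the edge.
  if (topIpCount / failures.length >= dominantShare) {
    return { pattern: 'single_ip', action: `Add ${topIp} to WAF blocklist` };
  }
  // Many IPs but few accounts: the attacker wants those accounts specifically.
  if (byUser.size <= targetedAccounts) {
    return {
      pattern: 'targeted_accounts',
      action: `Force password reset for: ${[...byUser.keys()].join(', ')}`,
    };
  }
  // Many IPs and many accounts: distributed attack.
  return { pattern: 'distributed', action: 'Enable CAPTCHA for login' };
}
```

A function like this is best treated as a triage aid that suggests the runbook branch. The responder still confirms the pattern before blocking anything.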

Best Practice 6: Monitor Third-Party Dependencies (5 min)

Your security depends on your dependencies:

  • Monitor dependency vulnerability databases
  • Alert on new critical CVEs affecting your stack
  • Track third-party service status pages
  • Monitor for credential leaks mentioning your domain
  • Set up alerts for SSL certificate expiration
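
Certificate expiration is the easiest item on this list to automate, since it reduces to date arithmetic. Here is a minimal sketch, assuming the expiry dates have already been collected by whatever inventory or probing tool you use. The `certificatesNeedingRenewal` function, the 30-day warning window, and the 7-day high-severity cutoff are illustrative choices.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// certs: [{ host: string, notAfter: Date }]
// Returns certificates expiring within `warnDays`, most urgent first.
function certificatesNeedingRenewal(certs, now = new Date(), warnDays = 30) {
  return certs
    .map(c => ({
      host: c.host,
      daysLeft: Math.floor((c.notAfter.getTime() - now.getTime()) / MS_PER_DAY),
    }))
    .filter(c => c.daysLeft <= warnDays)
    .sort((a, b) => a.daysLeft - b.daysLeft)
    .map(c => ({ ...c, severity: c.daysLeft <= 7 ? 'high' : 'medium' }));
}
```

Already-expired certificates come back with a negative `daysLeft` and sort to the top of the list.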

Alert Fatigue: Too many alerts lead to ignored alerts. Start with a few high-confidence alerts and expand from there. Every alert should have a clear action. If you are ignoring an alert, fix it or remove it.

Official Resources: For comprehensive monitoring guidance, see Prometheus Alerting Best Practices, Grafana Dashboard Documentation, and OWASP Logging Cheat Sheet.

What monitoring tools should I use?

For metrics, Prometheus with Grafana or Datadog. For logs, ELK stack, Loki, or CloudWatch. For alerts, PagerDuty or Opsgenie. For SIEM, Splunk or Elastic Security. Choose based on your scale and budget.

How do I reduce false positives?

Start with high thresholds and lower them over time. Use rate-based alerts instead of absolute numbers. Correlate multiple signals before alerting. Regularly review and tune alert rules based on false positive rates.
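
Correlating signals means an alert fires only when several independent indicators agree within a short window. Below is a sketch of that idea, assuming each detector emits events with a `userId`, `type`, and `timestamp`. The `correlateSignals` function and its defaults (two distinct signal types within ten minutes) are hypothetical.

```javascript
// signals: [{ userId, type, timestamp (ms) }], e.g. type = 'new_location' or 'high_velocity'
function correlateSignals(signals, { windowMs = 10 * 60 * 1000, minDistinctTypes = 2 } = {}) {
  // Group signals per user so unrelated users never correlate with each other.
  const byUser = new Map();
  for (const s of signals) {
    if (!byUser.has(s.userId)) byUser.set(s.userId, []);
    byUser.get(s.userId).push(s);
  }

  const alerts = [];
  for (const [userId, events] of byUser) {
    events.sort((a, b) => a.timestamp - b.timestamp);
    // Treat each event as a window start and collect distinct signal types inside it.
    for (let i = 0; i < events.length; i++) {
      const types = new Set();
      for (let j = i; j < events.length; j++) {
        if (events[j].timestamp - events[i].timestamp > windowMs) break;
        types.add(events[j].type);
      }
      if (types.size >= minDistinctTypes) {
        alerts.push({ userId, types: [...types] });
        break; // one alert per user is enough
      }
    }
  }
  return alerts;
}
```

A lone `new_location` signal would be logged without paging anyone, while `new_location` followed by `high_velocity` a few minutes later for the same user would produce an alert.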

Should I monitor in real-time or batch?

Security-critical events (auth failures, errors) need real-time monitoring. Less urgent metrics (usage trends, compliance checks) can be batch processed. Aim for alerting within 1-2 minutes for security events.
