Infrastructure & DevOps Issues
The following examples demonstrate common infrastructure and DevOps problems:
1. No CI/CD Pipeline
Manual deployment process:
# VIOLATION: No CI/CD
# Deployment process:
1. Developer makes changes
2. Manually runs tests (maybe)
3. Manually builds application
4. Manually copies files to server
5. Manually restarts services
6. Manually checks if it works
# Problems:
# - Inconsistent deployments
# - Human error
# - No automated testing
# - No rollback capability
# - Slow deployment process
Problem: Manual process, Error-prone, Slow
Deployments are inconsistent and risky
2. No Infrastructure as Code
Infrastructure configured manually:
# VIOLATION: No Infrastructure as Code
# Infrastructure setup:
1. Manually create servers in AWS console
2. Manually configure security groups
3. Manually install software
4. Manually configure databases
5. Manually set up load balancers
6. No version control
7. No reproducibility
# Problems:
# - Can't reproduce environments
# - Configuration drift
# - No audit trail
# - Hard to scale
# - Manual errors
Problem: Manual config, Not reproducible, No version control
Can't recreate or scale infrastructure reliably
3. No Monitoring or Logging
No visibility into system health:
# VIOLATION: No monitoring
# No monitoring tools:
# - No application performance monitoring
# - No error tracking
# - No log aggregation
# - No metrics collection
# - No alerting
# - No dashboards
# Problems:
# - Don't know when system fails
# - Can't debug issues
# - No performance visibility
# - Reactive instead of proactive
Problem: No visibility, Blind to issues
Problems discovered by users, not monitoring
4. No Backup Strategy
No backups or unreliable backups:
# VIOLATION: No backup strategy
# Backup situation:
# - No automated backups
# - Manual backups (if remembered)
# - Backups not tested
# - No backup retention policy
# - No disaster recovery plan
# - Backups stored on same server
# Problems:
# - Data loss risk
# - Can't recover from disasters
# - No recovery time objective
# - No recovery point objective
Problem: No backups, Data loss risk
One failure could mean permanent data loss
5. Hardcoded Configuration
Configuration values hardcoded in code:
// VIOLATION: Hardcoded configuration
const config = {
  database: {
    host: 'production-db.example.com',
    port: 5432,
    username: 'admin',
    password: 'hardcoded-password',
    database: 'production'
  },
  api: {
    url: 'https://api.production.com',
    key: 'hardcoded-api-key'
  }
};
// Can't change without code changes
// Same config for all environments
// Security risk
Problem: Hardcoded values, Security risk, Not flexible
Can't use different configs for different environments
6. No Environment Separation
Development and production use same resources:
# VIOLATION: No environment separation
# All environments share:
# - Same database
# - Same API keys
# - Same servers
# - Same configuration
# Problems:
# - Development breaks production
# - Can't test safely
# - Data mixing
# - Security issues
# - No staging environment
Problem: Shared resources, Risk to production
Testing could break production data
7. No Containerization
Applications deployed without containers:
# VIOLATION: No containerization
# Deployment:
# - Install dependencies on server
# - Configure environment manually
# - Hope it works the same everywhere
# - "Works on my machine" problems
# - Can't scale easily
# - Environment inconsistencies
# Problems:
# - Environment drift
# - Hard to reproduce
# - Difficult to scale
# - Deployment inconsistencies
Problem: Environment drift, Not portable
Application behavior differs across environments
8. No Auto-Scaling
Manual scaling or no scaling capability:
# VIOLATION: No auto-scaling
# Scaling process:
1. Monitor traffic manually
2. Notice high load
3. Manually provision new servers
4. Manually configure load balancer
5. Manually deploy to new servers
6. Hope it works
# Problems:
# - Slow response to traffic spikes
# - Over-provisioning (waste money)
# - Under-provisioning (poor performance)
# - Manual intervention required
Problem: Manual scaling, Slow response
System can't handle traffic spikes automatically
9. No Health Checks
No way to verify system health:
# VIOLATION: No health checks
# No health endpoints:
# - No /health endpoint
# - No /ready endpoint
# - No /live endpoint
# - Load balancer doesn't know if service is healthy
# - Can't detect failures automatically
# - Unhealthy instances serve traffic
# Problems:
# - Traffic routed to broken instances
# - No automatic recovery
# - Poor user experience
Problem: No health checks, No failure detection
Broken instances continue serving traffic
10. No Secrets Management
Secrets stored in code or config files:
// VIOLATION: Secrets in code
const secrets = {
  apiKey: 'sk_live_1234567890abcdef',
  dbPassword: 'super-secret-password',
  jwtSecret: 'my-secret-key',
  awsAccessKey: 'AKIAIOSFODNN7EXAMPLE',
  awsSecretKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
};
// Committed to git
// Visible in code
// Security risk
Problem: Secrets in code, Security risk, Version controlled
Secrets exposed in repository history
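Because a committed secret stays in repository history, it helps to catch it before the commit lands. The sketch below is one illustrative way to flag likely secrets in a block of text. `SECRET_PATTERNS` and `findSecrets` are hypothetical names, and the patterns are rough heuristics modeled on the values in the example above, not the rule set of any real scanner.

```javascript
// Illustrative heuristics only; dedicated scanners ship far larger rule sets.
const SECRET_PATTERNS = [
  { name: 'aws-access-key-id', regex: /\bAKIA[0-9A-Z]{16}\b/g },
  { name: 'live-api-key', regex: /\bsk_live_[0-9a-zA-Z]{16,}\b/g },
  { name: 'hardcoded-password', regex: /password\s*[:=]\s*['"][^'"]+['"]/gi },
];

// Return one finding per matching rule, with the 1-based line number.
function findSecrets(text) {
  const findings = [];
  text.split('\n').forEach((line, index) => {
    for (const { name, regex } of SECRET_PATTERNS) {
      regex.lastIndex = 0; // global regexes keep state between test() calls
      if (regex.test(line)) {
        findings.push({ rule: name, line: index + 1 });
      }
    }
  });
  return findings;
}
```

A pre-commit hook or CI step could run such a check over staged files and refuse the commit when the result is non-empty.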
11. No Rollback Strategy
No way to rollback deployments:
# VIOLATION: No rollback
# Deployment process:
1. Deploy new version
2. If something breaks:
   - Panic
   - Manually fix code
   - Redeploy
   - Hope it works
   - Or restore from backup (slow)
# Problems:
# - Can't quickly revert
# - Long downtime
# - Manual intervention required
# - No blue-green deployment
# - No canary releases
Problem: No rollback, Slow recovery
Broken deployments cause extended downtime
12. No Disaster Recovery Plan
No plan for handling disasters:
# VIOLATION: No disaster recovery
# No DR plan:
# - No backup data center
# - No failover strategy
# - No RTO (Recovery Time Objective)
# - No RPO (Recovery Point Objective)
# - No tested recovery procedures
# - Single point of failure
# Problems:
# - Extended downtime
# - Data loss
# - No recovery procedures
# - Business continuity risk
Problem: No DR plan, Business risk
Disaster could mean permanent service loss
13. No Dependency Management
Dependencies not managed or tracked:
# VIOLATION: No dependency management
# Dependencies:
# - Manually installed on servers
# - No version control
# - No dependency scanning
# - No security updates
# - Outdated packages
# - Vulnerable dependencies
# Problems:
# - Security vulnerabilities
# - Inconsistent environments
# - Hard to update
# - No audit trail
Problem: No tracking, Security risk, Outdated
Vulnerable dependencies not identified or updated
14. No Logging Strategy
No centralized logging or log management:
# VIOLATION: No logging strategy
# Logging situation:
# - Logs only on local files
# - No log aggregation
# - No log retention policy
# - Can't search logs
# - No structured logging
# - Logs lost when a server is replaced or its disk fails
# Problems:
# - Can't debug issues
# - No audit trail
# - Logs not accessible
# - No correlation between logs
Problem: No aggregation, Hard to debug
Can't trace issues across services
15. No Security Scanning
No automated security scanning:
# VIOLATION: No security scanning
# No security tools:
# - No vulnerability scanning
# - No dependency scanning
# - No container scanning
# - No infrastructure scanning
# - No penetration testing
# - No security audits
# Problems:
# - Vulnerabilities go undetected
# - Security issues in production
# - Compliance issues
# - No security posture visibility
Problem: No scanning, Vulnerabilities undetected
Security issues discovered after exploitation
16. No Performance Testing in CI/CD
Performance not tested before deployment:
# VIOLATION: No performance testing
# CI/CD pipeline:
1. Run unit tests
2. Build application
3. Deploy to production
# No performance tests
# No load tests
# No stress tests
# Performance issues discovered in production
# Problems:
# - Slow code ships to users unnoticed
# - Performance regressions
# - No performance baselines
# - Production performance issues
Problem: No performance tests, Regressions undetected
Performance issues discovered by users
Infrastructure & DevOps Best Practices
The following examples demonstrate proper infrastructure and DevOps practices:
1. Automated CI/CD Pipeline
# Compliant: CI/CD pipeline
# .github/workflows/deploy.yml
name: Deploy
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # npm ci installs exactly what package-lock.json pins
      - run: npm ci
      - run: npm run lint
      - run: npm test
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes the runner is already authenticated to the image registry
      - run: docker build -t app:${{ github.sha }} .
      - run: docker push app:${{ github.sha }}
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      # Assumes kubectl is already configured with cluster credentials
      - run: kubectl set image deployment/app app=app:${{ github.sha }}
# Automated, consistent, reliable deployments
✓ Benefits: Automated, Consistent, Fast
2. Infrastructure as Code
# Compliant: Terraform Infrastructure as Code
# infrastructure/main.tf
resource "aws_instance" "app_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
  tags = {
    Name        = "app-server"
    Environment = "production"
  }
}
resource "aws_security_group" "app_sg" {
  name = "app-security-group"
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
# Version controlled, reproducible, auditable
✓ Benefits: Version controlled, Reproducible, Auditable
3. Comprehensive Monitoring
# Compliant: Monitoring stack
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
  alertmanager:
    image: prom/alertmanager
  loki:
    image: grafana/loki
# Monitoring:
# - Metrics (Prometheus)
# - Logs (Loki)
# - Dashboards (Grafana)
# - Alerts (Alertmanager)
# Full visibility into system health
✓ Benefits: Full visibility, Proactive alerts, Performance tracking
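Prometheus collects metrics by scraping a plain-text endpoint that each service exposes. As a rough illustration of what the application side provides, here is a minimal counter rendered in that text exposition format without any client library. The `Counter` class and the metric name are hypothetical, and a real service would more likely rely on an existing Prometheus client library.

```javascript
// Escape backslashes, quotes and newlines inside label values.
const escapeLabel = (value) =>
  String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');

// Hypothetical minimal counter; one running total per distinct label set.
class Counter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.values = new Map();
  }

  inc(labels = {}, amount = 1) {
    // Sort entries so { a, b } and { b, a } share one series.
    const sorted = Object.entries(labels).sort();
    const key = JSON.stringify(sorted);
    const entry = this.values.get(key) || { labels: sorted, value: 0 };
    entry.value += amount;
    this.values.set(key, entry);
  }

  // Render HELP and TYPE headers followed by one sample line per label set.
  render() {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const { labels, value } of this.values.values()) {
      const pairs = labels.map(([k, v]) => `${k}="${escapeLabel(v)}"`).join(',');
      lines.push(pairs ? `${this.name}{${pairs}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join('\n') + '\n';
  }
}
```

Serving the output of `render()` from a `/metrics` route is the usual convention; the scrape target itself would be listed in the `prometheus.yml` mounted above.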
4. Automated Backup Strategy
# Compliant: Automated backups
# backup-policy.yml
backup:
  schedule: "0 2 * * *" # Daily at 2 AM
  retention: 30 days
  destinations:
    - s3://backups/database/
    - s3://backups/files/
  verification: true
  restore_testing: weekly
disaster_recovery:
  rto: 4 hours # Recovery Time Objective
  # Recovery Point Objective. A daily backup alone cannot meet a 1-hour RPO;
  # it also needs hourly incrementals or continuous log archiving.
  rpo: 1 hour
  procedures:
    - automated_failover
    - data_restore
# Automated, tested, reliable backups
✓ Benefits: Automated, Tested, Reliable
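To make the retention and RPO settings concrete, the following sketch shows how a backup job might decide which backups have expired and whether the newest one still satisfies the recovery point objective. The function names and the `{ key, createdAt }` record shape are assumptions for illustration, not part of any backup tool.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Split backups into those inside the retention window and those past it.
function applyRetention(backups, now, retentionDays = 30) {
  const cutoff = now.getTime() - retentionDays * DAY_MS;
  return {
    keep: backups.filter((b) => b.createdAt.getTime() >= cutoff),
    expired: backups.filter((b) => b.createdAt.getTime() < cutoff),
  };
}

// True when the newest backup is older than the RPO allows (or none exists).
function violatesRpo(backups, now, rpoMinutes) {
  if (backups.length === 0) return true;
  const newest = Math.max(...backups.map((b) => b.createdAt.getTime()));
  return now.getTime() - newest > rpoMinutes * 60 * 1000;
}
```

With only the daily 2 AM schedule, `violatesRpo(backups, now, 60)` would return true for most of the day, which is a useful signal to alert on.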
5. Environment-Based Configuration
// Compliant: Environment-based config
const config = {
  database: {
    host: process.env.DB_HOST,
    // Always pass a radix so the port is parsed as decimal
    port: parseInt(process.env.DB_PORT || '5432', 10),
    username: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME
  },
  api: {
    url: process.env.API_URL,
    key: process.env.API_KEY
  }
};
// Different configs for dev, staging, production
// No secrets in code
✓ Benefits: Environment-specific, Secure, Flexible
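A config object built directly from `process.env` silently contains `undefined` whenever a variable is missing, and the failure only surfaces later at connection time. A minimal sketch of fail-fast validation at startup follows, using the variable names from the example above. `REQUIRED_VARS` and `loadConfig` are hypothetical names.

```javascript
// Variables the application cannot start without.
const REQUIRED_VARS = ['DB_HOST', 'DB_USER', 'DB_PASSWORD', 'DB_NAME', 'API_URL', 'API_KEY'];

// Build the config from an environment object, throwing on missing or
// malformed values instead of returning undefined fields.
function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  const port = Number.parseInt(env.DB_PORT || '5432', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`DB_PORT is not a valid port: ${env.DB_PORT}`);
  }
  return {
    database: {
      host: env.DB_HOST,
      port,
      username: env.DB_USER,
      password: env.DB_PASSWORD,
      database: env.DB_NAME,
    },
    api: { url: env.API_URL, key: env.API_KEY },
  };
}
```

Taking the environment as a parameter keeps the function easy to test: a plain object can stand in for `process.env`.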
6. Environment Separation
# Compliant: Environment separation
Environments:
- Development: dev.example.com
- Staging: staging.example.com
- Production: example.com
Each environment has:
- Separate database
- Separate API keys
- Separate servers/resources
- Separate configuration
- Isolated network
# Benefits:
# - Safe testing
# - No production risk
# - Independent scaling
# - Security isolation
✓ Benefits: Isolated, Safe testing, Independent
7. Containerization
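# Compliant: Docker containerization
# Dockerfile
# Build stage: the build step usually needs devDependencies, so install everything
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and build output only
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
EXPOSE 3000
CMD ["node", "dist/index.js"]
# Benefits:
# - Consistent environments
# - Portable
# - Easy to scale
# - Reproducible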
✓ Benefits: Consistent, Portable, Scalable
8. Auto-Scaling
# Compliant: Auto-scaling configuration
# kubernetes/autoscaling.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
# Automatically scales based on load
✓ Benefits: Automatic, Cost-effective, Responsive
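The Kubernetes documentation describes the autoscaler's core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that arithmetic plus the min/max clamp, to show how the 70% target above translates into replica counts. The real controller also accounts for pod readiness, missing metrics, and stabilization windows, all of which are omitted here.

```javascript
// Simplified version of the HorizontalPodAutoscaler replica calculation.
function desiredReplicas(current, currentUtilization, targetUtilization, min, max, tolerance = 0.1) {
  const ratio = currentUtilization / targetUtilization;
  // Within tolerance of the target: leave the replica count alone.
  if (Math.abs(ratio - 1) <= tolerance) return current;
  const desired = Math.ceil(current * ratio);
  // Never go below minReplicas or above maxReplicas.
  return Math.min(max, Math.max(min, desired));
}
```

For example, 4 replicas averaging 90% CPU against the 70% target gives ceil(4 × 90 / 70) = 6 replicas, while 8 replicas at 140% would ask for 16 and be capped at the configured maximum of 10.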
9. Health Checks
// Compliant: Health check endpoints
app.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});
app.get('/ready', async (req, res) => {
  try {
    const dbHealthy = await checkDatabase();
    const cacheHealthy = await checkCache();
    if (dbHealthy && cacheHealthy) {
      res.json({ status: 'ready' });
    } else {
      res.status(503).json({ status: 'not ready' });
    }
  } catch (err) {
    // A check that throws must still produce a response
    res.status(503).json({ status: 'not ready' });
  }
});
app.get('/live', (req, res) => {
  res.json({ status: 'alive' });
});
// Load balancer can check health and route traffic
✓ Benefits: Automatic failure detection, Traffic routing
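The `/ready` handler depends on `checkDatabase()` and `checkCache()`, which the example leaves undefined. One concern with such checks is that a hung dependency can make the readiness probe itself hang. The sketch below shows one possible way to wrap arbitrary probes with a timeout and combine their results; `probeWithTimeout` and `readiness` are hypothetical helpers, and the probe functions passed in are placeholders for real database or cache pings.

```javascript
// Resolve to true if the probe succeeds in time, false if it throws,
// rejects, or takes longer than `ms` milliseconds.
function probeWithTimeout(probe, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  const attempt = Promise.resolve()
    .then(probe)
    .then(() => true, () => false);
  return Promise.race([attempt, timeout]).finally(() => clearTimeout(timer));
}

// Run all named probes in parallel and build the status code and body
// that a /ready handler could send.
async function readiness(probes, ms = 1000) {
  const names = Object.keys(probes);
  const results = await Promise.all(names.map((name) => probeWithTimeout(probes[name], ms)));
  const checks = Object.fromEntries(names.map((name, i) => [name, results[i]]));
  const ready = results.every(Boolean);
  return { httpStatus: ready ? 200 : 503, body: { status: ready ? 'ready' : 'not ready', checks } };
}
```

Inside a route handler the result could be sent with `res.status(result.httpStatus).json(result.body)`. Reporting each check by name makes it quicker to see which dependency is failing.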
10. Secrets Management
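# Compliant: Secrets management
# Using Kubernetes secrets or AWS Secrets Manager
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  # Values under `data` must be base64-encoded; they are left empty here
  api-key:
  db-password:
# Or use AWS Secrets Manager (AWS SDK for JavaScript v2 syntax)
# secrets = await secretsManager.getSecretValue({
#   SecretId: 'production/secrets'
# }).promise();
# Secrets stay out of code. Note that Kubernetes only base64-encodes Secret
# values by default; encryption at rest and rotation must be set up separately.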
✓ Benefits: Secure, Encrypted, Rotatable
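On the application side, a Kubernetes Secret mounted as a volume appears as one file per key. A minimal sketch of reading such a file at startup is shown below. The `readSecret` helper, the `SECRETS_DIR` variable, and the default mount path are assumptions made for illustration; the actual path depends on the pod's volume configuration.

```javascript
const fs = require('fs');
const path = require('path');

// Read one secret from a mounted directory (one file per secret key).
// The default directory is a hypothetical mount path.
function readSecret(name, dir = process.env.SECRETS_DIR || '/var/run/secrets/app') {
  const file = path.join(dir, name);
  try {
    // Trim the trailing newline that editors and `echo` tend to add.
    return fs.readFileSync(file, 'utf8').trim();
  } catch (err) {
    // Report the location, never the secret value itself.
    throw new Error(`Secret "${name}" is not available at ${file}`);
  }
}
```

Reading from files rather than baking values into the image keeps the secret out of both the repository and the container layers.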
11. Rollback Strategy
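# Compliant: Blue-green deployment
# deployment.yml
# apps/v1 Deployments require a selector and matching pod labels;
# the `app` and `version` label names below are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
      version: blue
  template:
    metadata:
      labels:
        app: app
        version: blue
    spec:
      containers:
        - name: app
          image: app:v1.0.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
      version: green
  template:
    metadata:
      labels:
        app: app
        version: green
    spec:
      containers:
        - name: app
          image: app:v1.1.0
# A Service selecting one version label decides which Deployment gets traffic
# Can instantly switch between blue/green
# Quick rollback if issues detected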
✓ Benefits: Instant rollback, Zero downtime, Safe deployments
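The switch between blue and green is typically made by changing which pods a Kubernetes Service selects. The sketch below models that step as a pure function over a Service manifest object, assuming each Deployment's pods carry a hypothetical `version` label set to `blue` or `green`. `routeTo` and `rollback` are illustrative names, and applying the resulting manifest to the cluster is left out.

```javascript
// Return a copy of the Service manifest whose selector targets `color`.
function routeTo(service, color) {
  if (color !== 'blue' && color !== 'green') {
    throw new Error(`Unknown color: ${color}`);
  }
  return {
    ...service,
    spec: {
      ...service.spec,
      selector: { ...service.spec.selector, version: color },
    },
  };
}

// Rolling back is the same switch in the opposite direction.
function rollback(service) {
  const active = service.spec.selector.version;
  return routeTo(service, active === 'green' ? 'blue' : 'green');
}
```

Because the previous Deployment keeps running after the switch, a rollback only changes the selector and does not need a rebuild or redeploy.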
12. Disaster Recovery Plan
# Compliant: Disaster recovery plan
disaster_recovery:
  rto: 1 hour # Recovery Time Objective
  rpo: 15 minutes # Recovery Point Objective
  backup_data_center:
    location: us-west-2
    replication: real-time
  failover_procedures:
    - automated_dns_failover
    - database_replication_switch
    - load_balancer_redirect
  testing:
    frequency: monthly
    last_test: 2024-12-01
    result: passed
  contacts:
    on_call_engineer: +1-555-0100
    escalation: +1-555-0101
# Tested, documented, automated DR plan
✓ Benefits: Tested, Documented, Automated
13. Dependency Management
# Compliant: Dependency management
# CI/CD pipeline includes:
- Dependency scanning (Snyk, Dependabot)
- Security vulnerability checks
- License compliance checks
- Automated updates (with tests)
- Dependency lock files (package-lock.json)
# Automated workflow:
1. Scan dependencies for vulnerabilities
2. Alert on high-severity issues
3. Create PR for security updates
4. Run tests on updates
5. Auto-merge if tests pass
# Automated, secure, up-to-date dependencies
✓ Benefits: Automated scanning, Security updates, Compliance
14. Centralized Logging
# Compliant: Centralized logging
# ELK Stack or similar
logging:
  aggregation: elasticsearch
  visualization: kibana
  collection: filebeat
  retention: 90 days
  indexing: daily
  search: full-text
  structured_logging: true
  log_levels:
    - error
    - warn
    - info
    - debug
  correlation: trace_id
# Centralized, searchable, structured logs
✓ Benefits: Centralized, Searchable, Correlated
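Structured logging and `trace_id` correlation start in the application: each log line is emitted as one JSON object, and every line written while handling a request carries the same trace ID so the aggregator can join them. A minimal sketch, assuming a hypothetical `x-trace-id` request header and hypothetical helper names (`createLogger`, `traceIdFrom`):

```javascript
const { randomUUID } = require('crypto');

// Lower number = more severe; lines above the configured level are dropped.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

function createLogger(options, bound = {}) {
  const { service, level = 'info', write = (line) => process.stdout.write(`${line}\n`) } = options;
  const emit = (lvl, message, fields = {}) => {
    if (LEVELS[lvl] > LEVELS[level]) return;
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: lvl,
      service,
      message,
      ...bound,
      ...fields,
    }));
  };
  return {
    error: (message, fields) => emit('error', message, fields),
    warn: (message, fields) => emit('warn', message, fields),
    info: (message, fields) => emit('info', message, fields),
    debug: (message, fields) => emit('debug', message, fields),
    // Child logger that stamps every line with the same bound fields.
    child: (extra) => createLogger(options, { ...bound, ...extra }),
  };
}

// Reuse an incoming trace ID when present, otherwise start a new trace.
function traceIdFrom(headers = {}) {
  return headers['x-trace-id'] || randomUUID();
}
```

A request handler would create `log.child({ trace_id: traceIdFrom(req.headers) })` once and pass that logger down, so a collector such as Filebeat can ship lines that are already searchable by trace.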
15. Security Scanning
# Compliant: Security scanning
# CI/CD security pipeline:
security_scanning:
  - dependency_scanning: snyk
  - container_scanning: trivy
  - infrastructure_scanning: checkov
  - secret_scanning: gitguardian
  - sast: sonarqube
  - dast: owasp_zap
frequency: on_every_commit
blocking: true # Block deployment on high-severity issues
reporting:
  - security_dashboard
  - slack_alerts
  - jira_tickets
# Comprehensive, automated security scanning
✓ Benefits: Comprehensive, Automated, Early detection
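The `blocking: true` setting implies a gate that turns scanner findings into a pass or fail decision. Scanner output formats differ, so the sketch below assumes findings have already been normalized to a hypothetical `{ id, severity }` shape; `evaluateFindings` is an illustrative name.

```javascript
// Ordered from least to most severe.
const SEVERITIES = ['low', 'medium', 'high', 'critical'];

// Decide whether a set of findings should block the deployment.
function evaluateFindings(findings, blockAt = 'high') {
  const threshold = SEVERITIES.indexOf(blockAt);
  if (threshold === -1) {
    throw new Error(`Unknown severity threshold: ${blockAt}`);
  }
  const blocking = findings.filter((finding) => {
    const rank = SEVERITIES.indexOf(String(finding.severity).toLowerCase());
    // Unrecognized severities fail closed rather than slipping through.
    return rank === -1 || rank >= threshold;
  });
  return { blocked: blocking.length > 0, blocking };
}
```

A CI step could set `process.exitCode = 1` when `blocked` is true, which fails the job and stops the later deploy stage.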
16. Performance Testing in CI/CD
# Compliant: Performance testing
# CI/CD pipeline includes:
performance_tests:
  - load_testing: k6
  - stress_testing: artillery
  - performance_baseline: lighthouse
  - regression_detection: automated
# Values are quoted because a leading ">" would otherwise start a YAML block scalar
thresholds:
  - response_time: "< 200ms (p95)"
  - error_rate: "< 0.1%"
  - throughput: "> 1000 req/s"
blocking: true # Block if performance degrades
reporting:
  - performance_dashboard
  - trend_analysis
# Performance tested before deployment
✓ Benefits: Performance verified, Regression detection, Baseline tracking
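The thresholds above can be checked mechanically once a load test has produced raw numbers. The following sketch computes a nearest-rank p95 and compares a run against the three limits from the example. The `run` object shape and the function names are assumptions; tools such as k6 have their own built-in threshold mechanisms that would normally be used instead.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Compare one load-test run against the pipeline thresholds.
function evaluateRun(run, limits = { p95Ms: 200, maxErrorRate: 0.001, minThroughput: 1000 }) {
  const p95 = percentile(run.latenciesMs, 95);
  const errorRate = run.errors / run.requests;
  const throughput = run.requests / run.durationSeconds;
  const failures = [];
  if (p95 >= limits.p95Ms) failures.push('response_time');
  if (errorRate >= limits.maxErrorRate) failures.push('error_rate');
  if (throughput <= limits.minThroughput) failures.push('throughput');
  return { p95, errorRate, throughput, passed: failures.length === 0, failures };
}
```

Returning the list of failed thresholds, rather than a bare boolean, lets the pipeline report which limit regressed.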