Responsibilities
Position:
- Junior DevSecOps Engineer/Junior Site Reliability Engineer
- Focus: day-to-day infrastructure operations, reliability, and baseline security for production environments.
Role Purpose
Operate and maintain production systems to ensure availability, performance, and security hygiene across:
- Linux-hosted web applications (primarily Python-based, plus supporting API/services)
- Integration services (e.g., compiled Go/Java components) and workflow orchestration/scheduling
- Relational databases (e.g., MySQL/MariaDB and PostgreSQL)
- Observability tooling (dashboards, metrics, alerts, and logs; e.g., Grafana or equivalent)
This is an execution-oriented role: monitoring, incident response, routine maintenance, safe changes, and continuous operational improvement.
Key Responsibilities
- Production Reliability & Operations
- Perform daily health checks for applications, services, and infrastructure (CPU/memory/disk/network).
- Monitor metrics and alerts; identify anomalies and take first-response actions based on runbooks.
- Troubleshoot incidents using a structured approach (symptom → scope → evidence → mitigation → escalation).
- Maintain service availability by applying safe operational actions (restart, rollback, failover steps where applicable).
- Observability (Metrics, Logs, Alerts)
- Read and interpret dashboards (latency, throughput, error rate, saturation, DB connections, queue depth).
- Investigate issues using common log sources and system logs; collect evidence for post-incident review.
- Maintain alert hygiene: reduce noise, validate thresholds, ensure alerts map to actionable playbooks.
- Support creation/updating of operational dashboards and basic SLO/SLI tracking (where required).
- Platform/Application Operations
- Support application runtimes and background processing:
- process/worker health, schedulers, job queues, cron/systemd services
- configuration verification and environment consistency checks
- Assist with deployments in collaboration with developers:
- pre-checks, smoke tests, rollback readiness, and post-deploy monitoring
- Perform routine maintenance: log rotation validation, cleanup tasks, certificate checks, and capacity housekeeping.
- Database Operational Support (MySQL/MariaDB & PostgreSQL)
- Perform operational checks:
- connectivity, replication/HA status (if used), storage growth, connection usage
- Verify and test backups:
- ensure backups run, validate restore procedures periodically (under supervision)
- Support troubleshooting:
- identify symptoms of locking/contention, slow queries, and connection exhaustion
- collect evidence (process list/activity, slow logs, relevant metrics)
- DevSecOps Baseline Security Hygiene
- Apply standard security practices:
- SSH key management, least privilege, secure access patterns, secrets hygiene
- Support patching routines:
- OS updates, package updates, vulnerability remediation scheduling
- Assist with hardening and exposure checks:
- firewall/security group rules, port exposure reviews, TLS/cert validity checks
- Ensure operational compliance basics:
- access logs, change tracking, and minimal audit readiness.
- Documentation & Continuous Improvement
- Maintain and follow runbooks/SOPs for recurring tasks and incidents.
- Write short incident notes (what happened, impact, mitigation, follow-up actions).
- Automate repetitive checks with scripts (Bash/Python) and simple tooling.
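As an illustration of the kind of scripted check meant here, the following is a minimal sketch of a disk-usage check in Python. The function names (`disk_used_percent`, `check_disk`) and the 85% threshold are assumptions chosen for the example, not values taken from this role description; a real threshold would come from the team's runbooks.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a zero size; treat them as empty.
        return 0.0
    # Note: this can differ slightly from `df`, which accounts for reserved blocks.
    return usage.used / usage.total * 100


def check_disk(path: str = "/", threshold: float = 85.0) -> str:
    """Build a one-line status message; `threshold` is a hypothetical default."""
    pct = disk_used_percent(path)
    status = "ALERT" if pct >= threshold else "OK"
    return f"{status} {path} {pct:.1f}% used (threshold {threshold:.0f}%)"


if __name__ == "__main__":
    print(check_disk("/"))
```

A script like this could be run from cron or a systemd timer, with the output forwarded to whatever alerting channel the team already uses.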
Must-Have Requirements
Technical
- Comfortable operating Linux servers via SSH (strong terminal usage is mandatory; Termius or similar SSH client experience preferred).
- Able to read dashboards and interpret metrics using Grafana or equivalent (Datadog/New Relic/Prometheus UI etc.).
- Basic understanding of web/service fundamentals:
- HTTP/HTTPS, TLS basics, reverse proxy concepts, ports, DNS basics
- Basic operational knowledge of relational databases:
- MySQL/MariaDB or PostgreSQL concepts (connections, queries, backups, locking symptoms)
- Basic scripting:
- Bash and/or Python for routine automation and checks
- Familiar with version control basics (Git) and disciplined change practices.
Behavioral
- Strong operational mindset: careful, systematic, and calm during incidents.
- Can follow runbooks, communicate clearly, and escalate with context.
- Willing to join an on-call/standby rotation (if applicable).
Nice-to-Have (Preferred)
- Exposure to any orchestration/scheduling tool (Airflow, cron-based platforms, CI schedulers, etc.).
- Experience with containers (Docker) and/or virtualization (VMware/Proxmox).
- Familiarity with common components:
- Nginx/HAProxy, Redis/queues, message brokers, object storage
- Basic security tooling familiarity:
- Vulnerability scanning concepts, CIS-style hardening, MFA/SSO integration awareness
- Any experience building/maintaining monitoring/alert rules and dashboards.
Tools & Practical Skills We Expect
- SSH and Linux triage commands: systemctl, journalctl, top/htop, free, df/du, iostat, ss/netstat, curl, tail, grep
- Working with monitoring:
- understanding of latency, error rate, saturation, throughput, and resource bottlenecks
- Basic DB checks:
- connection count, active queries, long-running queries, storage growth
- Communication and documentation:
- Incident updates, handover notes, minimal post-incident summary
Recommended Screening (Hands-on)
- Linux triage: given a service-down or high-latency scenario, show step-by-step checks and safe actions.
- Dashboard reading: interpret a scenario (latency spike + error increase) and propose likely causes + next checks.
- DB ops basics: how to detect connection exhaustion, locking symptoms, and what evidence to collect.
- Security hygiene: explain safe SSH access, key handling, secrets, and patching routines.
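For the DB ops item above, one way a candidate might reason about connection exhaustion is sketched below in Python. The function names and the 80%/95% thresholds are assumptions for illustration only. The input numbers would come from the database itself, for example `SHOW STATUS LIKE 'Threads_connected'` and `SHOW VARIABLES LIKE 'max_connections'` on MySQL/MariaDB, or `SELECT count(*) FROM pg_stat_activity` and `SHOW max_connections` on PostgreSQL.

```python
def connection_usage(current: int, maximum: int) -> float:
    """Return the fraction of the connection limit currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return current / maximum


def classify_connections(current: int, maximum: int,
                         warn: float = 0.80, crit: float = 0.95) -> str:
    """Classify connection usage; `warn` and `crit` are hypothetical thresholds."""
    ratio = connection_usage(current, maximum)
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example values only; real numbers come from the queries named above.
    print(classify_connections(current=120, maximum=151))
```

A "warning" or "critical" result is a prompt to collect evidence (process list or activity view, slow logs, application pool settings) before taking action, not a diagnosis in itself.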
Experience & Education Guidance
- 0-2 years of relevant experience (DevSecOps/SRE/Infra Support) or strong internship/homelab proof.
- Fresh graduates are acceptable if they demonstrate strong terminal skills + monitoring literacy.