About the role
The Senior Service Engineer (Lead) is the most senior technical operations role within the Service Engineering team. This individual leads production reliability strategy, defines operational standards, and provides technical and organizational leadership to the team. The Senior Service Engineer (Lead) bridges the gap between hands-on technical operations and business stakeholder communication, ensuring the data and analytics platform runs at the highest standards of reliability and efficiency.
Key Responsibilities
Production Monitoring & Reliability
- Define and own monitoring standards, alerting thresholds, and reliability strategies across ADF pipelines and Databricks environments
- Lead continuous improvement initiatives to enhance system observability and reduce mean time to resolution (MTTR)
- Review and approve monitoring frameworks proposed by the team
Incident Management
- Lead resolution of major incidents (P1/P2), including coordination across teams and stakeholder communication during outages
- Own the post-mortem process: facilitate blameless reviews and drive preventive action plans
- Define escalation policies and incident severity classifications
Ticket & SLA Management (UV Desk)
- Manage and optimize the end-to-end workflow for tickets raised by Business Units
- Define, monitor, and enforce SLA policies for all operational tickets
- Produce weekly/monthly SLA performance reports for management
Data Issue Support & Process Improvement
- Lead initiatives to reduce recurring data quality issues reported by Business Units
- Approve RCA findings and oversee corrective action implementation
- Collaborate with Data Engineering to introduce upstream fixes
ADF & Databricks Operations
- Define operational standards for ADF pipeline management and Databricks job governance
- Review and sign off on operational runbooks created by the team
- Drive adoption of best practices in pipeline monitoring and failure recovery
Automation Roadmap
- Define and own the automation roadmap for operational tasks (monitoring, alerting, self-healing)
- Evaluate and approve automation tools and frameworks built by the team
- Track automation ROI and report efficiency gains to leadership
Power BI Operational Governance
- Define and enforce BI operational governance policies (dataset refresh SLAs, failure procedures)
- Coordinate with BI developers on refresh pipeline improvements
Documentation Standards
- Own and enforce operational documentation standards across the team
- Ensure all runbooks, playbooks, and knowledge base articles meet quality and completeness requirements
Stakeholder Communication
- Serve as the primary escalation point for Business Units and senior management on operational issues
- Communicate system status, incident updates, and platform reliability metrics to leadership
- Build and maintain relationships with Data Engineering, DevOps, and Analytics teams
Team Leadership & Development
- Lead, coach, and develop the Service Engineering team
- Assign tasks, set priorities, and manage team workload to ensure operational coverage
- Conduct performance reviews and provide career development guidance
- Define team goals and OKRs, and track progress against operational KPIs
Operational Strategy
- Define the multi-quarter roadmap for platform operations, reliability, and automation
- Identify and prioritize platform risks and drive remediation planning
- Provide strategic recommendations on tooling, processes, and team structure
Requirements
- 3+ years of experience in service engineering, site reliability engineering, or production operations
- 1+ years in a lead or team lead capacity in a technical operations environment
- Proven track record of managing high-severity incidents and driving post-mortem processes
- Technical skills: Azure Data Factory (ADF), Databricks, Monitoring & Observability, Automation/Scripting, Power BI Operations, and Ticket Management (UV Desk)
- Strong leadership and people management skills with the ability to motivate and develop a technical team
- Excellent communication skills for both technical and executive audiences
- Strategic thinking with the ability to translate operational insights into platform improvement roadmaps
- Conflict resolution and stakeholder management skills