Mitek Systems is seeking a Sr. Application Operations Engineer to join us in building our global Application Operations Team. The Application Operations team is responsible for ensuring Mitek's customer-facing SaaS products meet our high standards for reliability and availability. The Sr. Application Operations Engineer will design, implement, and deliver operations activities including incident/outage management, problem management, and enterprise monitoring. This person will respond to incidents, conduct break-fix operations, act as the first responder, and support infrastructure/application change requests from our internal teams. As a part of this team, there will be opportunities to build and improve skills related to cloud infrastructure (AWS), Docker, monitoring, and automation.
What you will do:
- In a player/coach role, train and mentor a team of first-responder Application Operations Specialists.
- Monitor and respond to incidents relating to all Mitek SaaS products and critical services.
- Escalate incidents and issues outside of the Application Operations Team, and take ownership of the escalation process.
- Assist in implementing, modifying, and tuning application monitoring based on Cloud Engineering or Software Engineering recommendations.
- Assist with production deployments and system upgrades.
- Monitor systems and applications to proactively identify problems and perform periodic health checks.
- Communicate problem and incident management updates to impacted business users, including the actions taken to resolve them.
- Maintain a knowledge base of common resolution and recovery actions for all critical systems and applications.
- Respond to internal customers' trouble, request, or break/fix tickets in a timely fashion and in compliance with NOC and Cloud Operations team standards.
- Create/develop automation or procedures to address incidents or requests.
- Assist in the development, improvement, and implementation of Problem and Incident Management processes consistent with ITIL and COBIT best practices.
- Measure and report monthly on production metrics, including but not limited to uptime, using metrics and SLAs for each technology area.
- Establish minimum Runbook requirements for all critical systems and applications, and establish a process to keep Runbooks current.
- Provide support for root cause analysis and preventative analysis of incidents.
- Assist leadership in the development of training documents and tutorials.
- 5-8 years of IT/Development experience, including experience in a Network Operations Center and a 24/7 environment.
- Bachelor's Degree in Computer Science, Engineering, Information Technology, or related field preferred.
- Excellent written and verbal English communication skills.
- Ability to lead complex, evidence-based troubleshooting efforts.
- Excellent documentation skills regarding system issues, troubleshooting steps, resolution, and communication with stakeholders.
- Experience with Software Change Management, Production Incident Management, Problem Management, System & Application Monitoring and Logging.
- Willing to work flexible hours including night and/or swing shifts and to be part of an on-call rotation.
Skills you bring:
- Experience with both Linux and Windows operating systems administration.
- Experience with system and application health monitoring and alerting tools such as Grafana, Zabbix, Elasticsearch, Nagios, and Kibana.
- Working knowledge of basic network and routing concepts.
- Experience in a scripting language such as Bash, Python, or PowerShell.
- Experience and proven success working in a highly collaborative environment.
- May be required to lift up to 30 pounds.
Nice skills to bring:
- Knowledge of ITIL and COBIT reference frameworks.
- Experience with Configuration Management tools such as Chef, Ansible, or Puppet.
- Experience with Cloud Service Providers such as AWS.
- Event Log Correlation / Security Event & Incident Management.
- Knowledge of REST APIs.
- Experience operating SaaS products.