
Job Information
The Walt Disney Company System Reliability Operations Engineer in Lake Buena Vista, Florida
Within Disney Enterprise Technology, the Disney Technology Operations Command Center (DTOC) is a 24x7x365 critical services operation center responsible for service availability, with a main focus on rapidly responding to, correlating, and reducing the impact of outages. We are accountable for identifying and facilitating the resolution of service-impacting events, and for collaborating with other technology teams to prevent future impact through proactive event management and incident and problem analysis. DTOC drives the execution of the major incident process, including communication to executives and key partners, and owns and implements Crisis Management plans and processes. DTOC also provides ongoing first- and second-level technical support for requests, performs validation procedures for routine system/service checks, and carries out proactive monitoring of significant business events.
System Reliability Operations (SRO) Engineers ensure all processes and functions within our environment operate correctly and efficiently by monitoring, identifying issues, and coordinating with other technologists across segments to fine-tune system operations and resolve service interruptions. This role is responsible for the end-to-end reliability and operations of IT services and for providing consultation and training to other clients and segments across Disney. SROs consistently and reliably triage reported or automated incidents, apply recovery procedures, and engage domain experts to restore steady-state operations. Additionally, this position will drive service improvement initiatives through proactive monitoring and enhancement actions that address gaps identified through analytics and problem management.
Responsibilities:
Monitor the performance and availability of enterprise applications, systems, and infrastructure, ensuring they meet or exceed established service level objectives (SLOs)
Proactively identify, diagnose, and resolve infrastructure, application, and IT operations issues in collaboration with other IT support teams
Develop, implement, and maintain automation tools and scripts to improve the efficiency and reliability of IT operations and infrastructure
Implement and maintain technology observability and alerting solutions to provide real-time insights into system health, performance, and compliance
Effectively apply Problem & Incident Analysis techniques during and after an incident
Address outages in a timely fashion, ensuring work streams progress toward resolution in line with department procedures while communicating business impacts
Analyze and publish operational utilization and service performance metrics
Identify and drive service availability improvements by promoting leading practices
Ensure that all DTOC services are designed to deliver the levels of availability required by the business
Perform DR/BCP activities for critical events and emergency onsite response
Identify service improvement opportunities through trend analysis, proactive techniques, and after-action reviews
Required
2+ years of experience supporting converged infrastructure stacks including application, compute, storage, and networking
2+ years of incident recovery, with demonstrated experience using Service and Event Management tools
Proficiency in one or more scripting/automation languages (e.g., Python, PowerShell, Bash, Ruby)
Experience with network technologies (WAN/LAN, wireless infrastructure, DNS/DHCP, load balancers, accelerators)
Solid understanding of observability, monitoring, and alerting tools (e.g., Splunk, New Relic, Grafana, ELK Stack, Datadog)
Demonstrated experience in systems integration, application infrastructure support, and middleware operations
Experience with hands-on support of cloud operations (AWS, Google Cloud, Azure)
Experience with x86 hardware technology, Windows, Linux, RISC operating systems, P-Series hardware, SAN, NAS, and data protection technologies
Experience in enterprise IT operations including system administration, application platforms, infrastructure, networking fundamentals, and IT service management
Experience working in a 24x7 IT operations environment
Strong technology problem-solving and analytical skills, with the ability to quickly diagnose and resolve technical issues
BA/BS in Computer Science, Engineering or related field; or equivalent work experience
Preferred
Master’s degree in a technical field
Certification(s) in Kepner-Tregoe, ITIL Foundations (V3), operating systems, visualization, and/or hardware platforms
Job ID: 10050587
Location: Lake Buena Vista, Florida
Job Posting Company: The Walt Disney Company (Corporate)
The Walt Disney Company and its Affiliated Companies are Equal Employment Opportunity employers and welcome all job seekers including individuals with disabilities and veterans with disabilities. If you have a disability and believe you need a reasonable accommodation in order to search for a job opening or apply for a position, email Candidate.Accommodations@Disney.com with your request. This email address is not for general employment inquiries or correspondence. We will only respond to those requests that are related to the accessibility of the online application system due to a disability.