Overview
The Internet Computer is the fastest and only infinitely scalable general-purpose blockchain, incubated and launched by the DFINITY Foundation in May 2021. A team of over 200 world-renowned cryptographers, distributed systems engineers, and programming language experts has taken on the massive technological challenge of building, maintaining, and continuously improving a ‘world computer’ powerful enough to host Web3 dApps, DeFi, games, NFTs, social media, and metaverse projects.
DFINITY was founded in 2016 by entrepreneur and crypto theoretician Dominic Williams and attracted interest and financial contributions from early members of the Ethereum community. Later, top-tier institutions such as Andreessen Horowitz, Polychain Capital, and SV Angel backed the Internet Computer in a collective effort to help build out Web3.
The SRE team at DFINITY is charged with creating tools, processes, and frameworks that ensure the stability of the Internet Computer, which is distributed and scalable. As a member of the team, you will work with engineering, infrastructure, and security teams to bake reliability and operability into the product from the start by participating in design and code reviews and by identifying risks, problems, and mitigations. This is not a team that exists to be on-call; this is a team that elects to be on-call because it helps do the job better.
Responsibilities:
- Implement tools that ensure high availability of DFINITY’s product
- Gain deep knowledge of DFINITY’s complex applications
- Identify opportunities to automate or improve processes and then implement the automation
- Coordinate incident response across multiple teams, clearly understanding and communicating what is happening, what the next steps are, and who is responsible for what
- Implement observability tools to ensure visibility into service stability and performance
- Be on-call for production services
- Operate, troubleshoot, and deploy software on Unix systems
- Think about problems in a systemic, methodical way, especially when troubleshooting
Required Skills:
- Expertise in observability and monitoring of applications, services, and networks, using tools such as Prometheus/Grafana and ELK logging
- Strong Unix/Linux experience, including application installation, configuration, and maintenance
- Significant experience with site reliability, developer productivity, DevOps, or server infrastructure engineering (including on-call incident response)
- Understanding of Internet networking protocols: TCP/IP, TLS, DNS, HTTP/S, SSH, SMTP
- Experience troubleshooting issues across the entire stack (hardware, software, network, etc.)
- Experience writing automation scripts and utilities in a scripting language such as Python or shell
- Experience with incident and problem management
- Strong communication and interpersonal skills
Desired Skills:
- Experience coding in Rust or C++
- Experience supporting large-scale, mission-critical services
- Experience with CI/CD pipelines
- Experience with Kubernetes
All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.