Site Reliability Engineer

Remote

Technical Support and SysAdmin

Our Technical Support department acts as a liaison between the essential components of our clients’ businesses. We provide round-the-clock troubleshooting and debugging assistance so that our customers can rest assured that their products are always taken care of.

Job Description

We are looking for an SRE who is ready to help us improve the organization’s CI/CD implementations by building functional systems that elevate software delivery and observability.
The ideal candidate will be an energetic person who will drive how the organization deploys, verifies, and monitors applications and services.
This person will interface with all key stakeholders to define SRE practices and help shape the development culture.
This role also helps ensure that appropriate levels of observability (monitoring and alerting) are set up for all applications across a mix of physical servers, Kubernetes clusters, and both private and AWS cloud environments.

Responsibilities:

Building a strong relationship with the development teams to understand the code, its dependencies, and the infrastructure on which it runs;
Strengthening and maintaining the monitoring and management of the build and deployment processes and the infrastructure;
Assisting the development team with capacity planning across development, QA, staging, and production environments;
Building and supporting a reliable telemetry system to monitor the infrastructure and application services and to drive incident management;
Building and maintaining accurate, up-to-date documentation reflecting configuration;
Providing support for application and network latency issues, as well as various operational tasks such as management and deployment of the application;
Configuring, deploying, and monitoring applications and tools on edge devices and configuring connectivity to cloud-based platforms.

Requirements:

Strong communication skills and the ability to explain protocols and processes to the team and management;
A strong understanding of observability frameworks (knowledge of Prometheus is a must);
A clear understanding of SRE concepts such as SLAs, SLOs, and the golden signals;
Observability dashboarding skills with strong knowledge of Splunk, Grafana, PromQL, etc.;
Problem-solving, hacking, and debugging skills;
Experience in deploying and administering Continuous Integration tools such as Jenkins and TeamCity;
Experience with cloud infrastructure tools such as Terraform and Docker;
An understanding of strategies for providing high availability and security;
Experience with automated testing solutions for unit testing, integration testing, and system testing is a plus.

Team:

7 engineers (3 in the US, 2 in India, 1 in Europe, 1 at Bytex), with 2 more joining from Bytex

Methodology:

Two-week sprints, though not necessarily following agile methodology.

Working schedule:

Daily standup is at 8:30 AM, but you can update via Slack if you can’t make it.
Flexible EU schedule with at least 2-3 hours of overlap with PST/IST

Recruitment process:

Short Bytex HR introductory discussion
1-2h Bytex technical discussion with one of our senior engineers
Apple interviews: 3 rounds of discussions with team members in India and the US
- Structure: a mix of technical questions involving practical knowledge and application of the required skills. Questions will be tailored to the candidate’s background.
Offer presentation

WANNA WORK WITH US?

If you want to apply for this position, please fill in the form or send us an email at careers@bytex.ro.