Senior Software Engineer / Reliability Engineering - Real-time Data

Location

London

Business Area

Engineering and CTO

Ref #

10049067

Description & Requirements

Our department is responsible for efficiently distributing financial data, such as stock prices and foreign exchange rates, from its source to interested users all around the world. Data can either be served in response to a request or streamed in real time.

The group owns:

The distribution software and infrastructure
A range of different sources of data
Supporting services to administer and manage the system, including permissioning and metering

The team is also responsible for the Enterprise endpoint (“B-PIPE”), which allows end-users to programmatically consume data via our SDK. Data is also available through the Bloomberg Terminal and Microsoft Excel.

The main challenge faced by the group is one of scale. Data is sourced from more than 370 global exchanges, with a combined volume in excess of 60 billion messages each day. We deliver this data to hundreds of thousands of terminals and thousands of B-PIPEs. Handling this volume requires significant infrastructure: we manage multiple clusters in our main data centres, as well as a network of many thousands of servers around the world.

Group Overview

The RD Reliability Engineering group comprises three sub-teams located in Tokyo, London, and New York, providing follow-the-sun support.

Our mission is to ensure systems are reliable, scalable, and observable through software engineering, while continuously improving how systems behave under load and failure conditions. We work in an outcome-driven model, focusing on measurable improvements in availability, latency, capacity, and recovery. Our goal is to ensure systems meet defined service level objectives while minimising manual operational effort through automation and software solutions.

The systems we support must behave predictably under extreme load, recover quickly from failures, and continue to evolve without compromising stability. These are the core challenges we solve.

London Team Focus – Availability & Resiliency

The London team plays a key role in ensuring the availability and resiliency of RD infrastructure globally.

We focus on:

Detecting and preventing failures across large-scale distributed systems
Ensuring infrastructure demonstrates sufficient capacity and failover capability during site-loss scenarios
Reducing time to detect, diagnose, and recover from incidents
Ensuring systems behave predictably under both normal and adverse conditions

This role provides the opportunity to influence how reliability is engineered across the platform, working closely with teams globally to improve system behaviour and design.

What You’ll Do

Build and maintain production-grade software supporting Bloomberg’s global distribution infrastructure
Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation
Analyse system behaviour under real-world and failure scenarios to validate that capacity, failover, and recovery meet resilience objectives
Identify bottlenecks, scaling limits, and reliability risks across distributed systems
Improve detection, diagnosis, and prevention of production issues
Build tools and frameworks to increase system visibility and reduce time to detect and resolve incidents
Automate operational workflows to reduce manual effort and improve system reliability
Partner with application and infrastructure teams to improve system design, resilience, and performance
Contribute to design discussions, incident reviews, and reliability improvements across the platform

Systems You’ll Work With

Configuration systems serving thousands of servers across the global network
Service discovery and clustering systems for distributed infrastructure
Monitoring and observability frameworks for large-scale server estates
Tooling for diagnosing data quality and distribution issues
Ownership of systems may evolve over time as the team focuses on areas of highest impact.

What Success Looks Like

Systems consistently meet defined reliability, latency, and capacity objectives
Issues are detected and mitigated before significant customer impact
Systems are demonstrably resilient, with proven failover capability and sufficient capacity under failure conditions
Operational processes are automated and scalable
Reliability is achieved through engineering improvements rather than manual intervention

What We’re Looking For

We’re not a traditional SRE team. We engineer reliability through software, building solutions that automate operations and improve system resilience by design.

Experience with an object-oriented programming language (preferably Python or C++)
Strong focus on building reliable, observable distributed systems
Experience working with SLOs, SLIs, and production reliability metrics
Proven ability to triage and resolve live production problems
A mindset focused on automation and reducing operational toil
A strength in collaborating within an inclusive team environment
The ability to work across departments and build strong relationships with both technical and non-technical partners

Why Join Us

You’ll work on systems that sit at the core of Bloomberg’s real-time data platform, operating at global scale and under demanding performance and reliability requirements.

This is an opportunity to:

Solve complex distributed systems problems with real-world impact
Influence how reliability is engineered across a critical platform
Work with teams across multiple regions and technical domains
Build systems that are resilient by design and operate at massive scale

Where years of experience are indicated, please note that they are a guide; we will consider applications from all candidates who can demonstrate the skills necessary for the role.

Discover what makes Bloomberg unique - watch our podcast series for an inside look at our culture, values, and the people behind our success.