Senior Software Engineer / SRE - Real-time Data
Location: London
Business Area: Engineering and CTO
Ref #: 10049067

Description & Requirements

Our department is responsible for efficiently distributing financial data from its source to interested users all around the world. This data includes, for example, stock prices and foreign exchange rates. Data can either be served in response to a request or streamed in real time.

The group owns:
  • the distribution software and infrastructure
  • a range of different sources of data
  • the supporting services to be able to administer and manage the system, including permissioning and metering
The team is also responsible for the Enterprise endpoint (“B-PIPE”) which allows end-users to programmatically consume data via our SDK.  Data is also available through the Bloomberg Terminal or Microsoft Excel.

The main challenge faced by the group is one of scale: data is sourced from more than 370 global exchanges with a combined volume in excess of 60 billion messages each day. We deliver this data to hundreds of thousands of terminals and thousands of B-PIPEs. Handling this volume takes a lot of hardware! We manage multiple clusters in our main data centres as well as a network of many thousands of servers all around the world.

Team Overview

The SRE group comprises three sub-teams: one in Tokyo, one in London and one in New York.  This allows us to provide on-call coverage following the sun.  This role is in the London team.  The mission of the group as a whole is to ensure our systems are reliable, scalable and observable by using software engineering practices. The group’s responsibilities fall under five main pillars:

Latency Monitoring and Management
  • Defining, measuring and viewing service level indicators (SLIs) for latency.
  • Defining service level objectives (SLOs) for latency and alerting on breaches.
  • Building tools to accurately and quickly identify the sources of latency.
Capacity Management
  • Ensuring all subsystems can scale horizontally.
  • Maintaining sufficient capacity to withstand a disaster and demonstrating compliance.
  • Building tools to understand the current utilisation and capacity of the system and predict the impact of load increases and new use cases.
System Monitoring and Observability
  • Building tools to detect problems proactively, before they are reported by customers.
  • Providing information on the overall health of the system from a single, well-known entry point.
  • Putting alarms in place, with actionable run-books, for all critical issues.
Production Risk Management
  • Ensuring business-as-usual changes are safely released to production.
  • Planning and/or executing the release of more complex changes.
  • Reviewing and re-architecting the infrastructure to improve resiliency and performance.
Incident Response
  • Managing live incidents to diagnose and remediate issues, mitigating customer impact as quickly as possible.
  • Building tools to diagnose issues and run manual operational responses safely and correctly.
  • Providing automated responses to standard problems.

The London team is currently responsible for two critical parts of the distribution system.

First of all, we own the system which serves configuration to the thousands of servers in the distribution network and the B-PIPEs. These servers “call home” when they start up and the system has the responsibility for delivering the proper settings to them. The wide reach of this system means that correctness is extremely important.

We also own the mechanism which allows servers to be grouped together in discoverable clusters of peers. This comprises a back-end service to query for peers and also the UI to manage the groupings.

Changes to these systems often include developing business functionality, as well as technical enhancements.

We have also built a framework to flexibly and regularly monitor our estate of servers to ensure that they are operating properly at all times.

We also own the main tool used to diagnose data quality issues in the distribution network.

Finally, the team makes changes to other core subsystems (outside of our formal ownership) in order to improve the reliability of the wider system.

All of this gives us a very strong focus on software development in addition to our operational responsibilities.

You'll need to have:
  • Experience with an object-oriented programming language (preferably Python or C++).
  • A focus on delivering good-quality, well-tested software and safely releasing it to a mission-critical production environment.
  • A proven track record of triaging and resolving live production problems.
  • A strength in cooperating in a collaborative and inclusive team environment.
  • The capability to work across departments, building good relationships with both technical and non-technical partners.

Discover what makes Bloomberg unique - watch our podcast series for an inside look at our culture, values, and the people behind our success.