Description & Requirements
Bloomberg’s Core Communications platforms power real-time messaging across the global financial industry. Systems like Instant Bloomberg (IB) and MSG handle billions of messages daily, supporting time-sensitive conversations where reliability and latency directly impact trading and decision making.
We are looking for software engineers to help improve the reliability of these systems at scale. This role focuses on building tools, improving system design, and ensuring our platforms behave predictably under load and failure.
What you’ll work on
You will work on large-scale distributed systems with high availability and low latency requirements. Our goal is to reduce friction for product engineers, improve system reliability, and provide confidence in how our platforms behave under real-world conditions.
You will:
Build tools and automation to improve how our distributed systems are operated and debugged
Define and implement service level objectives (SLOs) that reflect real user impact
Identify and continuously assess reliability risks across services, infrastructure, and workflows, helping teams prioritise work based on real impact
Improve development and deployment workflows, driving more consistent and reliable paths to production
Reduce time to recovery and triage effort by improving diagnostics, alerting, and system-level visibility
Design and validate failure scenarios and resilience testing practices, ensuring systems behave predictably under stress
You will collaborate closely with software engineers and product teams to influence how systems are designed, built, and operated.
Why this role
Work on systems operating at very high scale, with billions of messages processed daily
Tackle complex distributed systems challenges involving latency, consistency, and failure handling
Build tooling and frameworks used across multiple teams
Have direct impact on systems relied upon by the global financial industry
What we’re looking for
4+ years of experience in software engineering
Proficiency in Python and proven experience with C++
Experience working with distributed systems
Strong understanding of system reliability, observability, and performance
Familiarity with SLOs, SLIs, and SLAs, and with relating system performance back to client impact
Strong collaboration and communication skills
A degree in Computer Science, Engineering, or equivalent practical experience
We would love to see
Experience with monitoring or tracing tools such as Grafana, Humio, or distributed tracing systems
Familiarity with Kafka, Java, or large-scale data systems
Experience with chaos engineering, failure injection, or resilience testing frameworks
Exposure to capacity planning and scaling analysis
Contributions to open source or involvement in SRE communities
Experience with big data technologies such as Apache Spark or Amazon S3
We offer one of the most comprehensive and generous benefits plans available, along with a range of total rewards that may include merit increases, incentive compensation (exempt roles only), paid holidays, paid time off, medical, dental, vision, short- and long-term disability benefits, 401(k) + match, life insurance, and various wellness programs, among others. The Company does not provide benefits directly to contingent workers/contractors and interns.