SREcon24 Europe/Middle East/Africa

AI & ML (1 video)
Databases (1 video)
Developer Experience (11 videos)
HPC (1 video)
Keynote (1 video)
Networking (2 videos)
Observability (19 videos)
Operations (4 videos)
Performance Engineering (3 videos)
Scaling (4 videos)
Scheduling (1 video)
Security (6 videos)
Storage (2 videos)
Sustainability (1 video)

AI & ML

SREcon24 Europe/Middle East/Africa - Generative AI: Beyond (Just) Hype

The speaker discusses the hype surrounding generative AI and machine learning, reflecting on their past predictions about the capabilities of these technologies in operational settings. They highlight a few promising use cases, such as knowledge base search and agent-based systems, while cautioning against exaggerated claims about AI's abilities to replace human operators.

Databases

SREcon24 Europe/Middle East/Africa - Survivor: MySQL Island – Outwit, Outplay, Outlast Metadata...

The presentation explores the complexities of metadata locks in MySQL, highlighting how they can impact application performance and availability, especially during schema changes. The speaker shares real-world scenarios, insights into the underlying mechanics, and practical strategies to proactively manage and mitigate issues related to metadata locks.

Developer Experience

SREcon24 Europe/Middle East/Africa - I Can OIDC You Clearly Now: How We Made Static Credentials a...

The talk discusses how Grafana Labs has addressed the challenge of accessing cloud resources from CI/CD workflows using OpenID Connect (OIDC) instead of static credentials. The presenters share their approach, which involves using workload identity federation in Google Cloud and AWS, and showcasing their open-source solutions to encourage adoption and collaboration.

SREcon24 Europe/Middle East/Africa - Why You’re (Probably) Doing Service Catalogs Wrong

The talk explores the challenges and pitfalls of implementing a service catalog, highlighting the importance of clarity, incremental value delivery, focused engagement, and data ownership and maintenance. The speaker shares insights and strategies to help organizations build effective and sustainable service catalogs that deliver real value to their teams.

SREcon24 Europe/Middle East/Africa - SRE Stakeholders: A Spotter’s Guide

This talk discusses the importance of effectively managing stakeholders in the context of Site Reliability Engineering (SRE) teams. The speaker emphasizes the need to clearly define the team's purpose, establish shared goals with peer teams, and regularly communicate the value of the SRE function to key stakeholders, including sponsors, consumers, and the SRE team itself.

SREcon24 Europe/Middle East/Africa - Selective Reliability Engineering: There Is No Single Source...

The talk explores the role of Selective Reliability Engineering, where engineers make design decisions that balance technical constraints, user needs, and organizational goals. The speaker emphasizes the importance of considering the human impact of these decisions and selecting for empathy in system design.

SREcon24 Europe/Middle East/Africa - Lessons from Unix History

The presentation explores the rich history and architectural evolution of the Unix operating system, highlighting key design principles and lessons that have shaped modern software development and IT infrastructure. The speaker delves into the origins of Unix at Bell Labs, its gradual development and adoption, and the enduring architectural decisions that have stood the test of time, offering valuable insights for building robust and extensible systems.

SREcon24 Europe/Middle East/Africa - Mnemonic Rules for Eponymous Laws or: There’s a Law for That!

The talk explores mnemonic rules for eponymous laws in the field of software engineering and systems design, providing practical insights and memorable associations to help attendees better understand and apply these principles. The speaker covers a range of laws, from Gall's Law and Conway's Law to Jevons' Paradox and the Pareto Principle, offering an engaging and informative tour of the principles that underpin effective system design and development.

SREcon24 Europe/Middle East/Africa - Treat Your Code as a Crime Scene

The presentation explores how software developers can treat their code like a crime scene, using techniques from forensic psychology to identify and address technical debt and organizational risks. By analyzing the behavioral patterns of developers interacting with the codebase, the speaker demonstrates how to prioritize improvements and reduce the onboarding cost of complex, hard-to-understand code.

SREcon24 Europe/Middle East/Africa - Configuration Languages Are the Bane of Our Existence

The talk explores the challenges and shortcomings of configuration management, highlighting how configuration languages have evolved over time yet still remain a pain point for software engineers. The speaker proposes treating configuration as an integral part of the system design, emphasizing the need to approach it as an API design problem and implement robust testing and encapsulation strategies.

SREcon24 Europe/Middle East/Africa - A Brief History of Release Engineering

This talk provides a concise history of release engineering, tracing its evolution from the command line and punch cards to the sophisticated tooling and continuous integration/deployment practices of today. The speaker highlights the significant changes in the field, including the shift from physical media distribution to instant software downloads, and the impact of faster hardware and the internet on enabling new approaches to release engineering.

SREcon24 Europe/Middle East/Africa - Are We Really Engineers?

This talk explores the question of whether software development and site reliability engineering can be considered branches of engineering. The speaker interviews 17 professionals who have experience in both software and traditional engineering fields to understand the similarities and differences between the two disciplines.

SREcon24 Europe/Middle East/Africa - Get Your Non-SREs Oncall Ready!

The presentation discusses a scalable approach to preparing non-SREs for on-call duties by providing a hands-on, self-guided training experience using a safe, pre-broken application. The team has successfully implemented this approach at Google, leading to high satisfaction among participants and reduced overhead for the SRE team.

HPC

SREcon24 Europe/Middle East/Africa - Science Reliability Engineering for High Performance Computing

The speaker discusses how they applied SRE (Site Reliability Engineering) concepts to the domain of high-performance computing (HPC) to address the unique challenges and requirements of running supercomputers. They describe the development of an open-source software stack called OpenCHAMI that aims to improve the reliability, scalability, and maintainability of HPC systems.

Keynote

SREcon24 Europe/Middle East/Africa - SRE Saga: The Song of Heroes and Villains

This talk explores the parallels between the journey of a Site Reliability Engineer (SRE) and the adventures of a Dungeons and Dragons (D&D) party. It emphasizes the importance of building a diverse, adaptable, and resilient team to tackle the challenges and adversities inherent in complex systems.

Networking

SREcon24 Europe/Middle East/Africa - Rock around the Clock (Synchronization): Improve Performance...

This talk explores how to improve performance and minimize tail latencies by accurately synchronizing clocks and detecting and controlling network congestion. It discusses the limitations of Network Time Protocol (NTP) and introduces Precision Time Protocol (PTP) and the Huygens clock synchronization algorithm as more accurate alternatives for high-precision time synchronization.

SREcon24 Europe/Middle East/Africa - Noisy Neighbors, through Networking

The talk discusses the challenges faced by Reddit's infrastructure team in dealing with noisy neighbors, particularly in the context of networking. It covers various incidents related to network traffic, CPU utilization, and connection tracking, and the solutions implemented to address these issues.

Observability

SREcon24 Europe/Middle East/Africa - Fixing Your Noisy Pager in 500 Easy Steps

This talk discusses techniques for reducing the noise and frequency of pager alerts, which can lead to on-call fatigue and reduced productivity. The speaker presents a three-step process of analyzing alert data, categorizing alerts, and implementing automated remediation strategies to address the most frequent and disruptive issues.

SREcon24 Europe/Middle East/Africa - Achieving Excellence: SLO Thresholds That Transform Service...

This talk explores how Netflix's Content Delivery Network (CDN) team defines and tracks Service Level Objectives (SLOs) to ensure exceptional quality of experience for their 230+ million members. The speaker discusses various approaches to setting SLO thresholds, from leveraging intuition to conducting surveys and A/B testing, and highlights the importance of proactively collaborating with internet service providers to address network degradations.

SREcon24 Europe/Middle East/Africa - You Depend on Time, This Is How It Works and You Won’t...

This talk explores the intricacies of timekeeping, from the history of calendars and clocks to the modern synchronization of time across computer systems and networks. It delves into the evolution of time measurement, the challenges of maintaining accurate time, and the various technologies and protocols used to ensure precise timekeeping in our digital age.

SREcon24 Europe/Middle East/Africa - Enhancing Elasticsearch Performance: Innovative Reindexing...

This talk discusses how Shopify, a leading global e-commerce platform, improved the performance and reliability of its Elasticsearch search infrastructure by using dedicated node pools for real-time and reindexing workloads, and leveraging Kubernetes-based autoscaling to dynamically scale the reindexing node pool. The solution resulted in significant improvements in real-time indexing performance and reduced infrastructure costs.

SREcon24 Europe/Middle East/Africa - Exploring the Unintended Consequences of Automation in Software

This talk explores the unintended consequences of automation in software systems, highlighting how automation can contribute to incidents and make them more complex to resolve. The speaker advocates a paradigm shift in how we view automation, proposing a joint cognitive systems approach where automation and humans work collaboratively to enhance resilience and safety.

SREcon24 Europe/Middle East/Africa - Incident Groundhog Day

This talk explores the challenges of incident response and how to learn from incidents through staged war exercises. The speaker discusses the cognitive load on incident responders, the importance of teamwork, and the need to move beyond heroism towards building resilient systems and teams.

SREcon24 Europe/Middle East/Africa - From PIDs to Pods: The Life Cycle of an eBPF-Autoinstrumented...

The presentation discusses the challenges of manual instrumentation and how eBPF, a virtual machine built into the Linux kernel, can be used for automatic instrumentation. The speaker also talks about the journey of implementing eBPF-based automatic instrumentation (BAA) in a Kubernetes environment, including the challenges of mapping process IDs to Kubernetes metadata and the future roadmap for improving BAA's performance and reducing the need for privileged containers.

SREcon24 Europe/Middle East/Africa - Anomaly Detection in Time Series from Scratch Using...

The talk discusses a team's journey in implementing an anomaly detection solution for business metrics using basic statistical tools. The speaker highlights the challenges faced, such as handling past anomalies, daylight saving time changes, and scaling the solution to multiple metrics, and shares the approach they developed to address these issues.
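
As a rough illustration of what "basic statistical tools" can look like (not the speaker's actual method, which this summary does not describe), the sketch below scores a new data point against its history using the median and the median absolute deviation (MAD). These are less distorted by past anomalies in the history than a mean and standard deviation would be. The function names, the sample data, and the 3.5 threshold are assumptions made for the example.

```python
from statistics import median


def robust_zscore(history, value):
    """Score `value` against `history` using median and MAD (hypothetical helper)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # All history values are (nearly) identical; any deviation is extreme.
        return 0.0 if value == med else float("inf")
    # 1.4826 rescales MAD so it is comparable to a standard deviation
    # when the data is roughly normally distributed.
    return (value - med) / (1.4826 * mad)


def is_anomaly(history, value, threshold=3.5):
    """Flag `value` when its robust z-score exceeds `threshold` (assumed default)."""
    return abs(robust_zscore(history, value)) > threshold


# Illustrative history that already contains one past anomaly (500).
history = [100, 102, 98, 101, 99, 500, 100]
print(is_anomaly(history, 103))
print(is_anomaly(history, 150))
```

With a mean and standard deviation, the single 500 in the history would widen the accepted range considerably; the median-based score is largely unaffected by it.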

SREcon24 Europe/Middle East/Africa - Finding the Capacity to Grieve Once More

This talk explores how the Wikimedia Foundation navigated the challenges of managing sudden spikes in traffic and activity following the deaths of prominent public figures. The speaker shares the technical and cultural changes the organization implemented to better prepare for and respond to these high-impact events, highlighting the importance of investing in infrastructure, improving incident response processes, and fostering a supportive community.

SREcon24 Europe/Middle East/Africa - How a Single API Endpoint Saved Us 3000 CPU

The talk discusses how adding a single API endpoint saved 3,000 CPU for the speaker's company, MK. The speaker explains how the problem stemmed from the consistent hashing behavior of the Mimir time series database, and how conditionally leaving the ring on shutdown resolved it.
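
For readers unfamiliar with hash rings, the following is a minimal, generic sketch of consistent hashing and of what happens to key ownership when a node leaves the ring. It is not Mimir's implementation; the class, the `ingester-*` node names, and the token count are hypothetical, and the summary does not say how ring membership translated into CPU cost in the talk.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring; names and token count are illustrative."""

    def __init__(self, nodes, tokens_per_node=64):
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.join(node, tokens_per_node)

    @staticmethod
    def _hash(text):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

    def join(self, node, tokens_per_node=64):
        for i in range(tokens_per_node):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def leave(self, node):
        self._ring = [(token, owner) for token, owner in self._ring if owner != node]

    def owner(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First token at or after the key's hash, wrapping past the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["ingester-a", "ingester-b", "ingester-c"])
keys = [f"series-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.leave("ingester-c")  # the node deregisters, e.g. on shutdown
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Only keys that belonged to the departing node are reassigned; everything else stays put. Whether a node should leave the ring on shutdown therefore decides whether its keys are redistributed while it is down, which is the kind of trade-off a "conditional" leave would control.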

SREcon24 Europe/Middle East/Africa - How Snowflake Migrated All Alerts and Dashboards to a...

The talk discusses Snowflake's migration from a previous SaaS-based metrics vendor to a Prometheus-based solution, focusing on the process of migrating dashboards and alerts. The migration faced challenges around scaling, reliability, and cost, leading to the adoption of Prometheus and a code-based approach for managing alerts and dashboards, which enabled better testing, ownership, and flexibility.

SREcon24 Europe/Middle East/Africa - A Powerful Logs Management Solution We All Have and Use but...

This talk presents a comprehensive analysis of the strengths and weaknesses of various log management solutions, including Loki, Elasticsearch, and systemd Journal. The speaker highlights the unique features and tradeoffs of each solution, making a compelling case for the adoption of systemd Journal as a powerful and efficient log management tool.

SREcon24 Europe/Middle East/Africa - Riot Games: Evolution of Observability at the Gaming Company

The video discusses the evolution of observability at Riot Games, a leading gaming company. It highlights the challenges faced by Riot Games in maintaining observability across its growing portfolio of online, competitive games, and the solutions they implemented to address issues such as data fragmentation, cost, and governance.

SREcon24 Europe/Middle East/Africa - Red Tide Revert

This talk discusses the challenges of rapid iteration and deployment at scale, and the efforts to develop an AI-powered tool called 'Dr. Fix It' to help engineers manage and revert incidents efficiently. The speaker shares the learnings and insights gained from the development process, highlighting the importance of embracing non-determinism, choosing a single capable language model, and taking an iterative approach to building the system.

SREcon24 Europe/Middle East/Africa - Opening the Box: Diagnosing Operating-System Task-Scheduler...

This talk explores the impact of the Linux kernel's task scheduler on application performance, particularly in large multicore systems. The speaker presents several tools to help diagnose and understand the behavior of the task scheduler, allowing developers to identify and address performance issues related to task placement and load balancing.

SREcon24 Europe/Middle East/Africa - Lightning Talks

The video discusses the importance of engineer well-being and recovery time in maintaining highly available systems. It highlights the role of managers in creating sustainable on-call rotations, protecting engineer mental health, and building resilient teams alongside resilient systems.

SREcon24 Europe/Middle East/Africa - Synthetic Monitoring and E2E Testing: 2 Sides of the Same Coin

The presentation discusses how synthetic monitoring and end-to-end testing can be unified using a common automation tool, such as Playwright, to improve collaboration, visibility, and efficiency across software development and operations teams. By leveraging a shared artifact for both pre-production testing and production monitoring, organizations can shift left, enhance cross-team empathy, and streamline the identification and resolution of issues.

SREcon24 Europe/Middle East/Africa - Monitoring Systems as a Service – Walking the Line between...

This talk discusses the challenges and best practices for managing monitoring systems as a service, including strategies for optimizing costs, controlling access, and aligning monitoring capabilities with the evolving needs of a growing organization. The speaker shares their experience at Udemy, highlighting the journey from a small startup to a public company and the lessons learned in maintaining a cost-effective and efficient monitoring system.

SREcon24 Europe/Middle East/Africa - Transforming Production Readiness

The talk discusses the transformation of Elastic's production readiness process, including the shift in operational responsibilities, the development of new systems and processes, and the empowerment of engineering teams to own on-call duties. The speaker shares insights on the importance of aligning with leadership, simplifying communication, and providing supporting tools and systems to enable engineers to take on the operational responsibilities.

Operations

SREcon24 Europe/Middle East/Africa - Dude, You Forgot the Feedback: How Your Open Loop Control...

This talk discusses the importance of providing appropriate and timely user feedback in control planes to prevent outages. The speaker highlights various examples of how the lack of feedback in control plane design can lead to unexpected system behavior and user errors, and suggests strategies to improve feedback mechanisms and validation in control plane development.

SREcon24 Europe/Middle East/Africa - The Frontiers of Reliability Engineering

This talk explores the evolution of reliability engineering over the past decade, highlighting advancements in hardware provisioning, monitoring, observability, and principles. The speaker also discusses three frontiers for the next 5-10 years: managing for reliability, mobile observability, and data operations, emphasizing the importance of people, communication, and feedback loops in driving reliability at scale.

SREcon24 Europe/Middle East/Africa - Sailing the Database Seas: Applying SRE Principles at Scale

This talk presents how Booking.com has applied SRE principles to manage their large-scale distributed database systems, focusing on defining SLIs and SLOs, automating capacity planning, and embracing a postmortem culture to improve reliability and scalability. The speakers share their experiences and lessons learned in implementing these practices to effectively operate their MySQL fleet across multi-cloud and multi-region environments.

SREcon24 Europe/Middle East/Africa - Panel Discussion: Is Reliability a Luxury Good?

The panel discussion explores the concept of reliability becoming a luxury good, as technology companies face challenges in maintaining reliability in a post-zero-interest-rate world. The panelists discuss the need for regulation, the difficulty in communicating the value of reliability, and the hidden constraints that can hinder the advocacy for continuous investment in reliability.

Performance Engineering

SREcon24 Europe/Middle East/Africa - The Silent Performance Killers: BIOS and Firmware Updates

The presentation discusses the importance of testing BIOS and firmware updates to ensure they do not degrade the performance of computational infrastructure. The speaker proposes a framework for establishing performance baselines, conducting pre- and post-update testing, and monitoring performance changes over time to identify and address silent performance killers.

SREcon24 Europe/Middle East/Africa - Enabling Product Scalability through Load Testing

This talk discusses how the Bloomberg team utilized load testing to enable product scalability and release new features for their instant messaging application. The speakers highlight the importance of planning the testing process, identifying stakeholders, isolating test traffic, and iteratively improving the system based on distributed tracing and end-to-end testing.

SREcon24 Europe/Middle East/Africa - Taming Noisy Benchmark Results Using Change Point Detection

This talk discusses the challenges of obtaining reliable benchmark results due to the inherent noise in modern systems. It introduces the use of change point detection techniques as a powerful tool to identify significant shifts in benchmark performance, providing a way to tame the noisy benchmark results and gain meaningful insights.
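
To make the idea concrete, here is a deliberately simple sketch of single change point detection: it tries every split of a benchmark series and keeps the one that most reduces the within-segment squared error. This is a generic textbook approach chosen for illustration, not necessarily the algorithm presented in the talk; the function name and sample timings are made up.

```python
def best_change_point(samples):
    """Return (index, gain) for the split that best separates two mean levels.

    `index` is the position where the second segment starts; `gain` is the
    reduction in total squared error compared with treating the series as
    one segment. Hypothetical helper for illustration only.
    """
    n = len(samples)
    if n < 4:
        raise ValueError("need at least 4 samples")

    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    total = sse(samples)
    best_index, best_gain = None, 0.0
    # Keep at least two points on each side of the split.
    for i in range(2, n - 1):
        gain = total - (sse(samples[:i]) + sse(samples[i:]))
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


# Illustrative benchmark timings: a noisy level near 10, then near 12.
timings = [10.1, 9.9, 10.0, 10.2, 9.8, 12.1, 11.9, 12.0, 12.2, 11.8]
index, gain = best_change_point(timings)
print(index, round(gain, 2))
```

A real pipeline would also need a significance test or threshold on the gain so that pure noise is not reported as a shift, and a way to find more than one change point.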

Scaling

SREcon24 Europe/Middle East/Africa - How to Host a (Very) Popular Website for 30 Altairian...

The talk discusses the challenges and strategies of hosting a popular website with a large user base and limited resources, focusing on the technical and organizational decisions made by the volunteer-run organization behind the Archive of Our Own (AO3). It highlights the importance of simplicity, scalability, and resilience in the face of growing demand and limited budgets.

SREcon24 Europe/Middle East/Africa - Embrace Fleet Reboots and Make Them Boring

This talk discusses how Cloudflare automated the process of rebooting servers in their edge network, transitioning from a manual and disruptive process to a fully automated and 'boring' one. The speaker shares the technical details of the tools and processes they developed to enable reliable, scheduled, and minimally impactful reboots across their global network.

SREcon24 Europe/Middle East/Africa - AppStack: An Open Source Cloud Native Platform for Running...

The presentation discusses the development of AppStack, an open-source, cloud-native platform for running digital public services in Greece. The platform was built to address challenges such as faster development and deployment cycles, scalability, security, and public perception, and has enabled the efficient operation of critical government services during the COVID-19 pandemic.

SREcon24 Europe/Middle East/Africa - Blast Radius Reduction for Large-Scale Distributed Systems

This talk discusses techniques for reducing the blast radius of large-scale distributed systems, focusing on the use of cell-based architecture, self-healing mechanisms, and formal methods for reliability verification. The speaker shares insights from their experience working on highly reliable cloud services at Huawei, highlighting the importance of designing for failure and balancing blast radius reduction with cost and elasticity.

Scheduling

SREcon24 Europe/Middle East/Africa - Scheduling at Scale: eBPF Schedulers with Sched_ext

This talk provides an overview of building schedulers using eBPF, including the key building blocks like BPF helpers, kfuncs, and structs. It also discusses the challenges of deploying eBPF schedulers at scale, such as kernel backports, complex hardware architectures, and testing and observability.

Security

SREcon24 Europe/Middle East/Africa - OMG WTF SSO: A Beginner’s Guide to Single Sign-On...

This talk provides a beginner's guide to the potential pitfalls of implementing single sign-on (SSO) from the perspective of a company purchasing software services. The speaker emphasizes the importance of asking the right questions to understand the vendor's capabilities and ensure the SSO implementation aligns with the company's security needs, drawing on real-world examples of SSO misconfigurations that can lead to significant security breaches.

SREcon24 Europe/Middle East/Africa - Just Buy the Printer: Resilience in Action

This talk discusses the resilience of a company's continuous deployment tool in the face of a surprising and time-sensitive issue with their code signing certificate. The speaker highlights the organization's ability to adapt and overcome the challenge through improvisation, cross-team collaboration, and a willingness to share stories of near-misses to foster collective learning.

SREcon24 Europe/Middle East/Africa - When Your SaaS Provider Goes out of Business – Lessons from...

When a SaaS provider goes out of business, organizations must act quickly to sustain operations, communicate with stakeholders, and integrate a new solution. This case study from Open Systems demonstrates how a well-coordinated crisis response, clear communication, and existing technical capabilities can help navigate such a challenging situation.

SREcon24 Europe/Middle East/Africa - When SRE and Security Teams Meet to Face a Crisis

The speaker shares their experiences on how security and site reliability engineering (SRE) teams can collaborate to effectively respond to security incidents. The talk highlights the importance of aligning priorities, developing common language, and building rapid response capabilities to address complex issues that arise at the intersection of security and reliability.

SREcon24 Europe/Middle East/Africa - What If We Ask Linux to Do Cryptography for Us?

This talk explores how the Linux kernel can be leveraged to handle cryptographic operations, providing a more secure alternative to relying on user-space libraries. The speaker discusses two key Linux subsystems, the Linux Kernel Key Retention Service and the Linux Crypto API, and demonstrates practical examples of integrating these systems into applications written in Go and Rust.

SREcon24 Europe/Middle East/Africa - Managing the Risk of Software Supply Chain Attacks

The talk discusses the growing threat of software supply chain attacks, highlighting the need for comprehensive security measures across the entire software development lifecycle, from code repositories to production systems and developer workstations. The speaker emphasizes the importance of understanding the breadth of the software supply chain, adopting security best practices, and staying vigilant against evolving attack vectors.

Storage

SREcon24 Europe/Middle East/Africa - NVMe/TCP Makes iSCSI Look like Fortran

The talk discusses the evolution of storage protocols, from the aging iSCSI to the more modern and efficient NVMe/TCP. The speaker highlights the performance and flexibility advantages of NVMe/TCP, making a compelling case for its adoption in various use cases, particularly in the database and virtualization domains.

SREcon24 Europe/Middle East/Africa - An Exploration in Storing Telemetry in Cloud Object Storage

This talk explores the challenges of storing and querying large volumes of telemetry data, such as logs, metrics, and traces, in cloud object storage. The speakers propose a data lake architecture that leverages efficient file formats like Parquet and table formats like Apache Iceberg to enable cost-effective and performant storage and analysis of this data.

Sustainability

SREcon24 Europe/Middle East/Africa - Energy Consumption of Datacenters

This talk examines the rapidly growing energy consumption of data centers, highlighting the unrealistic projections and challenges associated with meeting this demand through nuclear power or other conventional means. The speaker proposes alternative solutions, such as neuromorphic computing and optical fiber-based neural networks, as more sustainable approaches to addressing the energy and resource constraints faced by the AI industry.