SREcon24 Americas

Architecture (2 videos)
Developer Experience (6 videos)
Observability (23 videos)
Operations (2 videos)
Performance Engineering (1 video)
Scaling (6 videos)
Security (4 videos)

Architecture

SREcon24 Americas - Scam or Savings? A Cloud vs. On-Prem Economic Slapfight

The talk explores the nuanced economic considerations surrounding the choice between cloud and on-premises infrastructure, highlighting that there is no one-size-fits-all solution and that the decision depends on the specific needs and constraints of the organization. The speaker emphasizes the importance of understanding the underlying drivers and tradeoffs, rather than relying on simplistic comparisons or industry hype.

SREcon24 Americas - What Can You See from Here?

The speaker discusses the importance of recognizing our own biases and perspectives, and how they can lead to unproductive fights and missed opportunities. They encourage the audience to seek out diverse viewpoints, stay adaptable, and make deliberate decisions that consider the broader impact of their work.

Developer Experience

SREcon24 Americas - Thawing the Great Code Slush

The talk discusses how Slack addressed the challenges of managing infrastructure changes through the creation of a 'Code Deputy' program. The program aimed to distribute the responsibility of code review, improve visibility and accountability, and ultimately restore engineering velocity while maintaining reliability and safety.

SREcon24 Americas - Lightning Talks

The talk discusses how cognitive models like Rasmussen's and Klein's can help understand how experts make decisions under pressure. It highlights the importance of planning, understanding systems, and developing metacognition skills to improve one's ability to handle novel and stressful situations.

SREcon24 Americas - The Art of SRE: Building People Networks to Amplify Impact

This talk explores how the art of building people networks can amplify the impact of Site Reliability Engineering (SRE). The speaker highlights the importance of soft skills, diverse career paths, and collaborative learning approaches, drawing parallels between engineering teams and choirs to demonstrate the power of collective practice and feedback in driving individual and organizational growth.

SREcon24 Americas - Frontend Design in SRE

The talk discusses the challenges of building intuitive and user-friendly front-end interfaces for Site Reliability Engineering (SRE) tools, which often require high-density data presentation and a deep understanding of complex systems. The speaker emphasizes the importance of designing for information density, explainability, and adaptability to cater to the diverse needs and perspectives of SREs and other stakeholders in the organization.

SREcon24 Americas - Teaching SRE

The speaker discusses his experience teaching an unconventional SRE course at a liberal arts college, focusing on hands-on learning, problem-solving, and developing systems thinking. The course aims to bridge the gap between academic computer science and real-world software engineering, challenging students with a series of practical assignments and unexpected challenges.

SREcon24 Americas - Measuring Reliability Culture to Optimize Tradeoffs: Perspectives from an...

The speaker, an anthropologist, discusses the importance of understanding and measuring reliability culture to optimize trade-offs between speed and stability in infrastructure development. She shares insights from interviews and surveys conducted at Meta, highlighting the need for incentives, clear processes, and a balance between reliability and innovation.

Observability

SREcon24 Americas - Product Reliability for Google Maps

The talk discusses Google Maps' approach to improving product reliability by shifting focus towards user-centric monitoring and incident prevention. The speakers share their journey in implementing a multi-layered defense system that measures feature availability, latency, and data quality to catch and mitigate issues that would have otherwise gone undetected by traditional server-side monitoring.

SREcon24 Americas - Synthesizing Sanity with, and in Spite of, Synthetic Monitoring

This talk discusses the benefits, challenges, and best practices of using synthetic monitoring to improve the reliability and visibility of complex applications like Jira. The speaker highlights the importance of balancing synthetic monitoring with other forms of testing, managing the flakiness of browser-based tests, and establishing clear ownership and maintenance of the synthetic monitoring system.

SREcon24 Americas - Build vs. Buy in the Midst of Armageddon

This talk explores the journey of a team at Elastic that initially tried to build an internal incident management tool, but faced challenges due to high turnover, fragmented culture, and technical debt. The team eventually pivoted to a vendor-provided solution, Blameless, which allowed them to focus on delivering value to stakeholders rather than maintaining their own codebase.

SREcon24 Americas - Is It Already Time To Version Observability? (Signs Point To Yes.)

The speaker argues that it is time to version observability, moving from observability 1.0 (based on metrics, logs, and traces) to observability 2.0 (a single source of truth with high-cardinality, high-dimensionality data). This transition enables a shift from an operations-focused approach to one that supports observability-driven development and tight feedback loops, ultimately improving software engineering practices.

SREcon24 Americas - The Ticking Time Bomb of Observability Expectations

This talk discusses the challenges of observability in modern distributed systems, including the unrealistic expectations, low-quality data, architectural complexity, and cognitive load associated with managing observability data. The speaker proposes solutions focused on constructing meaning from the data, managing cognitive load, and aligning observability costs with business value.

SREcon24 Americas - The Sins of High Cardinality

The talk discusses the challenges of high-cardinality metrics, particularly in the context of Prometheus and Kubernetes. The speaker proposes strategies for managing them, including tiered metric infrastructure, aggregation rules, and stream processing that reduces cardinality before the data reaches the underlying time series database.
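
As a rough illustration of the aggregation idea (not the speaker's implementation), the sketch below sums samples after dropping a high-cardinality label before they would be stored. The function name `aggregate_away`, the label names, and the sample values are all hypothetical.

```python
from collections import defaultdict


def aggregate_away(samples, drop_labels):
    """Sum sample values after removing the given labels from each series.

    `samples` is a list of (labels_dict, value) pairs. This is an
    illustrative stand-in for an aggregation step, not a real
    Prometheus API.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels, sorted, identify the aggregated series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


# Hypothetical samples: one series per pod.
samples = [
    ({"service": "api", "pod": "api-1"}, 10.0),
    ({"service": "api", "pod": "api-2"}, 5.0),
    ({"service": "web", "pod": "web-1"}, 7.0),
]

# Dropping the per-pod label collapses three series into two.
aggregated = aggregate_away(samples, drop_labels={"pod"})
```

The stored series count then grows with the number of services rather than the number of pods, at the cost of losing per-pod detail.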

SREcon24 Americas - Optimizing Resilience and Availability by Migrating from JupyterHub to the...

This presentation explores the migration from JupyterHub to Kubeflow for optimizing resilience and availability in a machine learning platform. It highlights the trade-offs between backend agnosticism and scalability, the importance of modular and independently scalable components, and the need to make progressive architectural decisions based on the evolving ecosystem and requirements.

SREcon24 Americas - 99.99% of Your Traces Are (Probably) Trash

This talk explores the challenges of using distributed tracing effectively, focusing on the importance of sampling strategies to optimize the value of trace data while managing costs. The speaker discusses the trade-offs between head-based and tail-based sampling, providing examples and practical advice for implementing a sampling approach that balances data volume, interestingness, and business priorities.
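
To make the head-based versus tail-based distinction concrete, here is a minimal sketch under stated assumptions. It is not taken from the talk: `head_sample`, `tail_sample`, and the span dictionary fields (`error`, `start_ms`, `end_ms`) are hypothetical names chosen for illustration.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide when the trace starts, using only the trace ID.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def tail_sample(spans, latency_threshold_ms: float) -> bool:
    """Tail-based: decide after the trace completes.

    Keeps a trace if any span reported an error or if the whole trace
    took at least `latency_threshold_ms`.
    """
    has_error = any(span["error"] for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration >= latency_threshold_ms


# Hypothetical traces: one fast and healthy, one fast but failing.
healthy = [{"error": False, "start_ms": 0, "end_ms": 40}]
failing = [
    {"error": False, "start_ms": 0, "end_ms": 30},
    {"error": True, "start_ms": 5, "end_ms": 20},
]
```

Head-based sampling is cheap because nothing has to be buffered, but it cannot know which traces will turn out to be interesting. Tail-based sampling can keep the failing trace above and drop the healthy one, at the cost of holding spans until the trace is complete.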

SREcon24 Americas - Resilience in Action

This talk explores the concept of resilience engineering, emphasizing the importance of adapting to the unexpected and developing adaptive capacity in socio-technical systems. The speaker encourages the audience to focus on everyday work, identifying invisible patterns, and sharing expertise to enhance system resilience and prevent major outages.

SREcon24 Americas - Meeting the Challenge of Burnout

The presentation explores the concept of job burnout, highlighting its multidimensional nature and the need to shift the focus from individual coping strategies to addressing the underlying organizational factors. The speaker emphasizes the importance of creating a healthy job environment that takes care of both the workers and the workplace, fostering thriving individuals and successful organizations.

SREcon24 Americas - "Logs Told Us It Was Kernel – It Wasn't"

The presentation examines the common misconception that the Linux kernel is to blame for poor application performance. It highlights the importance of a holistic approach, utilizing various performance tools and investigating potential bottlenecks in the application code to effectively diagnose and resolve performance issues.

SREcon24 Americas - What Is Incident Severity, but a Lie Agreed Upon?

The talk explores the concept of incident severity, highlighting that it is a subjective construct that requires agreement among teams. The speaker offers a series of questions to help organizations assess and improve their use of incident severity to better serve their needs and address underlying organizational issues.

SREcon24 Americas - Triage with Mental Models

This talk explores the use of mental models in system triage and troubleshooting. The speaker discusses how experts leverage a library of simple models to quickly match observed behavior to the closest model, make predictions, and iteratively refine their understanding of the system.

SREcon24 Americas - Hard Choices, Tight Timelines: A Closer Look at Skip-level Tradeoff Decisions...

This talk explores the complex nature of trade-off decisions in incident response, highlighting the emotional, organizational, and cross-boundary challenges faced by site reliability engineers. The presenters discuss their research methods, key findings, and recommendations for recognizing, developing, and practicing the skills needed to navigate these difficult decisions.

SREcon24 Americas - It Is OK to Be Metastable

The talk discusses the concept of metastable failures, which are a class of performance failures where systems can become trapped in a self-reinforcing loop of degradation even after the initial trigger is removed. The speaker outlines three key strategies for managing metastable failures: understanding the system's environment and workloads, designing for trigger resistance, and protecting vulnerable components.

SREcon24 Americas - Automating Disaster Recovery: The Ultimate Reliability Challenge

The talk discusses the challenges of automating system recovery, emphasizing that reliability should be a primary concern from the inception of a system's development. It highlights the importance of considering architectural decisions and their long-term implications, as well as the need for a cultural shift towards reliability-driven development.

SREcon24 Americas - The Invisible Door: Reliability Gaps in the Front End

The talk discusses the importance of front-end reliability, the challenges in measuring and improving it, and how SREs can collaborate with front-end teams to address these issues. It highlights the need for better observability, tracing, and sampling techniques to gain visibility into front-end performance and user experience.

SREcon24 Americas - From Chaos to Clarity: Deciphering Cache Inconsistencies in a Distributed...

The presentation describes the challenges faced by the Netflix engineering team in resolving a cache inconsistency issue that threatened the launch of a major feature. It highlights the team's efforts to debug the problem, isolate the root cause, and implement a robust solution to ensure the timely and successful rollout of the feature.

SREcon24 Americas - Cross-System Interaction Failures: Don't Fail through the Cracks

The talk presents research on addressing cross-system interaction failures, which can occur when multiple systems interact in complex ways. The presenters discuss tools and techniques they have developed to improve the reliability of such interactions, including automated testing and formal verification approaches.

SREcon24 Americas - Strengthening Apache Pinot's Query Processing Engine with Adaptive Server...

This talk presents two key approaches to strengthen the query processing engine of Apache Pinot at LinkedIn: Adaptive Server Selection and Automatic Query Killing. These techniques enhance the reliability and resilience of Pinot deployments by intelligently routing queries to the best-performing servers and proactively killing high-risk queries to prevent cascading failures.

SREcon24 Americas - Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

This talk discusses the problem of gray failures in cloud-scale systems, presenting an abstract model of differential observability and four principles for addressing this challenge. The speaker also shares case studies from Microsoft's Azure platform demonstrating how these principles have been applied to improve cloud reliability and availability.

SREcon24 Americas - Real Talk: What We Think We Know — That Just Ain’t So

The speaker discusses the importance of questioning common beliefs and misconceptions in the field of site reliability engineering (SRE). They present examples of assumptions that have been challenged and debunked over time, highlighting the need for critical thinking and a scientific approach to understanding our work.

SREcon24 Americas - Storytelling as an Incident Management Skill

Storytelling is a crucial incident management skill that helps communicate the logical progression of events in a system failure, enabling better analysis, mitigation, and prevention. This talk explores how to craft engaging narratives that convey the cause-and-effect relationships within an incident, enhancing incident response and post-mortem documentation.

Operations

SREcon24 Americas - 20 Years of SRE: Highs and Lows

This talk reflects on the 20-year history of Site Reliability Engineering (SRE), highlighting its evolution from startup roots to widespread adoption, the permeation of SRE ideas into the broader engineering and business landscape, and the ongoing challenges in areas such as the career pipeline, quantitative models, and the persistent perception of operations as low-status work.

SREcon24 Americas - Defence at the Boundary of Acceptable Performance

The talk explores the concept of the 'boundary of acceptable performance' in complex socio-technical systems, using a model developed by Jens Rasmussen. The speaker discusses how internal and external forces act on organizations, and how understanding these forces can help operational teams better anticipate and mitigate performance issues.

Performance Engineering

SREcon24 Americas - System Performance and Queuing Theory - Concepts and Application

This talk provides an overview of queuing theory and its application to understanding system performance. The speaker discusses key concepts, such as utilization, arrival rate, and service time, and how they impact system latency, and presents real-world examples to demonstrate the practical implications of queuing theory in software engineering.
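
The relationship between utilization, arrival rate, and service time can be sketched with the textbook M/M/1 queue. The summary does not say which model the speaker used, so the M/M/1 assumption, the function name `mm1_response_time`, and the numbers below are illustrative only. For a single server with mean service time S and arrival rate λ, utilization is ρ = λS and mean response time is S / (1 - ρ).

```python
def mm1_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time (queueing plus service) for an M/M/1 queue.

    arrival_rate: requests per second (lambda)
    service_time: mean seconds of service per request (S)
    """
    utilization = arrival_rate * service_time  # rho = lambda * S
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1 - utilization)


# Hypothetical server with a 10 ms mean service time.
at_half_load = mm1_response_time(arrival_rate=50, service_time=0.01)
at_high_load = mm1_response_time(arrival_rate=90, service_time=0.01)
```

Under these assumptions, raising utilization from 50% to 90% takes mean response time from about 20 ms to about 100 ms, which is why latency climbs steeply as a system nears saturation.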

Scaling

SREcon24 Americas - Sharding: Growing Systems from Node-scale to Planet-scale

The talk discusses the challenges of scaling systems by sharding workloads, and presents a set of patterns and strategies for effectively implementing sharding in both monolithic and microservices architectures. The speaker shares their experience in navigating the tradeoffs and complexities involved in transitioning from a single-node system to a planet-scale distributed system.
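
As a minimal sketch of one common sharding approach (hash-based key assignment, which may or may not be among the patterns the speaker covers), the function below maps a key to a shard. The name `shard_for` and the key format are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash.

    A cryptographic hash is used here only because it is stable across
    processes, unlike Python's built-in hash() for strings.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical keys spread across four shards.
assignments = {f"user-{i}": shard_for(f"user-{i}", 4) for i in range(1000)}
```

Plain modulo hashing is simple, but changing `num_shards` remaps most keys. That is one of the tradeoffs that techniques such as consistent hashing or directory-based lookup aim to soften.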

SREcon24 Americas - Capacity Constraints Unveiled: Navigating Cloud Scaling Realities

This talk discusses the capacity challenges faced by Elastic Cloud, a large-scale cloud platform, and the strategies they have developed to navigate cloud scaling realities. The presenters share insights on the importance of capacity planning, communication with cloud service providers, and designing for flexibility to address the limitations and constraints of the cloud infrastructure.

SREcon24 Americas - Migrating a Large Scale Search Dataset in Production in a Highly Available...

This talk discusses the challenges faced by Shopify's search platform team in migrating a large-scale search dataset in a highly available manner. The presentation covers the team's approach to addressing the challenges, including enabling cross-jurisdictional rights between regions, implementing automated shop backfills, and optimizing infrastructure to avoid unnecessary costs.

SREcon24 Americas - Navigating the Kubernetes Odyssey: Lessons from Early Adoption and Sustained...

The video chronicles the journey of ThousandEyes, a network intelligence platform, as it navigated the transition from a humble on-premises infrastructure to a cloud-native, multi-region Kubernetes platform. The presentation highlights the challenges faced, the lessons learned, and the strategies employed to achieve a sustainable and scalable infrastructure while balancing day-to-day operations with modernization efforts.

SREcon24 Americas - Taming the Linux Distribution Sprawl: A Journey to Standardization and...

The talk discusses Quantcast's journey to standardize their Linux distribution across their infrastructure, driven by operational challenges and a desire for greater consistency. The speakers share their approach to identifying and addressing these challenges, including engaging with engineers, implementing a discovery process, and leveraging tools like Packer and Terraform to streamline the migration.

SREcon24 Americas - Patching Your Way to Compliance with a Small Team and a Pile of Technical Debt

This talk discusses how a small engineering team at Udi overcame technical debt and compliance challenges by focusing on minimizing the infrastructure they managed, decommissioning unnecessary systems, and leveraging cloud services and Kubernetes to simplify patching and upgrades. The team faced challenges such as lack of immediate results, resistance to change, and coordinating with dependent teams, but ultimately achieved a more sustainable and manageable infrastructure.

Security

SREcon24 Americas - Compliance & Regulatory Standards Are NOT Incompatible with Modern Development...

The talk explores how modern software development practices, such as fast feedback loops and continuous deployment, can be compatible with compliance and regulatory standards. It emphasizes the importance of building relationships, understanding constraints, and finding creative solutions to achieve both security and agility in software engineering.

SREcon24 Americas - When Your Open Source Turns to the Dark Side

The presentation explores the challenges of open-source software projects turning to the dark side, with case studies of projects like Elastic, Grafana, and npm packages. The speaker provides insights and recommendations for building, selecting, and using open-source software wisely to avoid potential pitfalls.

SREcon24 Americas - OIDC and CICD: Why Your CI Pipeline Is Your Greatest Security Threat

The presentation explores the security challenges posed by CI/CD pipelines and how OIDC (OpenID Connect) can be used to mitigate these risks. The speakers discuss the history of credential management in CI/CD, the benefits of OIDC over traditional access tokens, and provide practical guidance on implementing OIDC-based role-based access control in cloud infrastructure.

SREcon24 Americas - What We Want Is 90% the Same: Using Your Relationship with Security for Fun...

This talk explores the shared goals and interests between Site Reliability Engineering (SRE) and Security teams, and how leveraging these similarities can lead to more efficient and secure systems. The speaker highlights key areas such as access controls, observability, releases, and incident response where SRE and Security can collaborate effectively to build robust and reliable infrastructure.