So happy to have had the opportunity to speak at @devopsdaysnyc yesterday! I gave out 9 tips for folks starting to use SLOs and how to get more value out of the SLOs they already have. #sre #reliability #slos
the gap between what is written here and what is lived by many companies is huge
Here are a few things that the term SRE means in the outside world in our personal experience:
- Expensive and good at on-call
- Distributed systems consultant
- Platform engineer
- Rebranded ops group member
from *SRE in the Real World: for Xooglers*
Up til 2am working on a mysql replication issue.
The combination of factors that led to the replica getting 24 hours behind could probably be summed up as "the system was left in the first configuration that barely worked". Row-based replication choked on batched deletes.
Always good to be reminded that any time you do something unusual, the system can behave in unexpected ways.
And however many prior failure conditions you account for, it can always find one more.
Uptime SLOs based on error %s are an easy place to start, but we need to go beyond that as developers.
I used to run a 99.95% uptime SLO system, everything green, we're all proud. But we learned from customers that a <0.01% failure was causing data loss with major real world implications for those affected.
Look at your golden flows, the parts of the product that matter most to users. You might not get the data from HTTP response codes; new measurements may be needed.
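A minimal sketch of that point, assuming hypothetical per-flow request counts (the flow names and numbers below are invented for illustration, not taken from the system described): an aggregate availability figure can sit above a 99.95% target while one golden flow falls well short of it.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    name: str
    total: int   # requests observed in the window
    failed: int  # requests that did not do what the user needed


def availability(total: int, failed: int) -> float:
    """Fraction of good requests; an empty window counts as fully available."""
    return 1.0 if total == 0 else (total - failed) / total


def flows_below_target(flows, target):
    """Names of flows whose own availability misses the target."""
    return [f.name for f in flows if availability(f.total, f.failed) < target]


TARGET = 0.9995  # the 99.95% SLO from the post

# Hypothetical traffic: the failing flow is under 0.01% of all requests.
flows = [
    FlowStats("browse", 9_000_000, 900),
    FlowStats("search", 990_000, 99),
    FlowStats("save_document", 10_000, 80),
]

overall = availability(sum(f.total for f in flows), sum(f.failed for f in flows))
print(f"overall availability: {overall:.5f}")
print("flows below target:", flows_below_target(flows, TARGET))
```

The aggregate number stays green because the failing flow is a tiny share of traffic, which is why a per-flow measurement (possibly something other than HTTP status codes) is needed to see it.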
Check out my article at APM Digest about the growing complexity of #observability and rising MTTR (mean time to remediation), with interesting stats from the latest #DevOps Pulse survey.
#DevOps #SRE #APM
Introducing a tool for running diagnostic and administrative tools locally on your machine, but with outgoing network connectivity as if they're running in your k8s cluster.
Hello networks, @cooptilleuls is looking for a Site Reliability Engineer to grow its three-person #SRE team.
Details are here: https://les-tilleuls.coop/la-cooperative/on-recrute/site-reliability-engineer-h-f
We have three offerings called PtiKube, GroKube and Kustomize (and we think we're hilarious, and we're nice 😄).
As you may have guessed, we do quite a lot of Kube! #k8s
We're a SCOP, a cooperative of 70 employees where decisions are made collectively and democratically :)
(Feel free to boost! <3)
Over my morning coffee, I enjoyed @Ilovelemons #SLOconf talk examining SLO practice through the lens of the book "Seeing Like A State".
I keep seeing that book come up so it's probably time for me to read it... I think there's a lot we can untangle in the greater #SRE discipline using social and political thinking.
Dan's language around using #SLOs as an interface between the company and the team feels really powerful!
Tonight's pick of well-written technical articles in PT-BR: https://lobocode.github.io/2023/06/01/chaos/ #dev #chaosengineering #devops #sre #sysadmin #linux #developer #bolhadev
Nervous and excited about my big trip to the UK and Singapore, I fly out on Monday.
If you're at #awssummit next week or #srecon APAC the following week, come say hi!
Unfortunately there is no #SlightReliability episode this week... So as is tradition, I have a haiku for you. #sre
Wrote a blog post on understanding Metrics, Logs, Traces, Events https://last9.io/blog/understanding-metrics-events-logs-traces-key-pillars-of-observability/
Optimizing a software delivery organization holistically is a complex endeavor.
Engaging teams and middle & upper management in #DataDrivenDecisionMaking enables holistic #ContinuousImprovement of a software delivery organization.
More on #InfoQ: https://bit.ly/43vQtot
Event report from the recently concluded #SRE Meetup in Bengaluru
You know, I say in my bio and #intro that I used to work in #tech, but I haven't really actually said very much about what that was aside from comments here and there on other people's toots. (Edit: oh, and boosts are appreciated but not requested of this.)
So, here's a brief summary: I worked on two things primarily in my former career. The first was observability (more traditionally known as monitoring) systems, and second was realtime stream processing systems. In both cases the systems were BIG. Like, big. No, bigger.
The #observability systems processed hundreds of millions of datapoints/second, and the stream processing systems did terabytes/second of data. They were both globally distributed with the core backends running on thousands of machines collecting data from fleets of millions of machines. Because of the scale of them, I used to joke that I didn't count in anything smaller than a petabyte, which was honestly more true than joke.
I did a lot of different things while working on them, from core engineering (I'm even academically published despite never having attended college, which is quite a point of pride for me even years later), to product management for their user interfaces, to #SRE type work.
As I find ways to talk about what I did more obliquely, I hope to muse on this stuff more because I still find it really cool to think about even though I've left the industry to make porn instead.
We use #Prometheus Remote Write every day so we wrote a post on it
Streaming Aggregation vs Recording Rules
So apparently all I needed to get Incident Management product folks to pay attention to me is to become a customer of one instead of an employee of one. Who knew?
In related news, I just got a vendor signed that took me 3 months of difficult maneuvering, research, creative presentation work, and intense collaboration with others. Thanks to all who helped me, you are also helping me soak in a victory I so desperately needed.
I came here to kick ass and chew gum and tbh I don't like gum.
Event Driven Ansible is Here.
#Ansible #Automation #RedHat #OpenSource #DevOps #PlatformOps #SRE #RHEL #Linux #NeedMoreHashTags
[Job] We're hiring a Site Reliability Engineer (#SRE) on a permanent contract (CDI). We have offices in Lille, Paris, Nantes, Lyon, Oujda and Tunis, but you can also join us remotely.
Job info, salary grid and salary simulator on this page ⬇️ https://les-tilleuls.coop/la-cooperative/on-recrute/site-reliability-engineer-h-f
Should We Run Databases In Kubernetes? CloudNativePG PostgreSQL
Have been busting my ass working on this for the last ~8 months and we're finally live!
If it's broken it ain't me :kubernetes: :Blobhaj_Innocent:
#Kubernetes #SRE #Max #HBOMax #HBO
Any recommendation on a service to host npm, nuget, pypi, rubygems and rpm private packages ? Something like Nexus or Artifactory but managed.
We found https://fury.co/ but checking for other options.
#developers #dotnet #nodejs #ruby #python #sre #linux
What's the best definition of uptime or availability?
I'm trying to crowdsource others' thoughts in the space and looking for more ideas.
#Reliability #Update #Availability #SRE #PlatformEngineering
Just read an insightful article exploring the evolution of software and hardware strategies over the past two decades! It delves into the shift from scaling out to scaling up, highlighting the benefits of leveraging single-node architectures. If you're interested in simplicity, performance, and cost-effectiveness, this is a must-read. Kudos to the author for shedding light on this fascinating topic! #technology #scalingup #sre
Emergency procedures used in incident response help stabilize the system in a degraded state. When leveraged properly, they result in faster incident response and become a foundation for further resiliency improvements in your system.
It's certainly more fun to anticipate the ways that the system can fail and recover than inventing measures on the spot.
#SRE Story with Matthew Iselin from @replit
Had a blast speaking in my hometown at #KubeHuddle Toronto this week!
I had Covid-19 a couple of months ago for the first time, and while in my recovery state, I started watching Grey's Anatomy, which my friends have had a lot of fun teasing me about. I'm still going through it now (there are a lot of seasons), and it only just struck me this morning why I might still be guilty-pleasure watching it... The hospital is a lot like IT engineering groups. There are those that show more empathy for the patients (users) than others. There are those that are almost single-mindedly focussed on the deep technical challenges (jargon, processes). There are those that take risks and are celebrated when they succeed. And, there's the need for being on-call and responding with calm in emergency situations. The list goes on... I guess I'm enjoying it as a parallel to the work that I've done for many years and the types of engineers I saw. (Note that I'm not comparing saving lives through tricky brain surgery as equal to ensuring that mail servers are running smoothly). It makes me ponder what each industry could learn from the other. One is that there appears to be a systemic and formalised path for juniors (interns) to be taught and grown to higher skill levels. I've done a lot of mentoring in my IT career, but I don't think it's nearly as common as it should be. #medical #engineering #sre #greysanatomy
If we bury our heads in the sand of technology we leave ourselves at the mercy of the wolves. #SRE is as much about changing people and organisations as it is about technology. If we don't measure these other things, how do we know if we're succeeding?
Marcelo Ebrard's vision has generated expectations and investment for the country's municipalities
#marcelo #ebrard #passport #pasaporte #sre #canciller #mexico #news #noticias #texas #guanajuato #estadosunidos #inmigracion #paisanos
**NEW FROM SAGEABLE**
Sageable #Innovation Brief: @PagerDuty #AIOPs is a Powerful New ‘System of Action’ for #DevOps & #SRE
https://sageable.com/research-library/ (login req'd)
Not a subscriber? Contact @Sageable for a **FREE TRIAL** and to learn about flexible subscription options!
I'm a little triggered by the word "emergent" wrt large distributed systems. Indulge me.
"Oh, this system is emergent" just implies "whelp, something is weird and no point investigating BECAUSE GHOSTS"
In my experience, it is never ghosts. Your system is emergent because your predecessor decided to implement split-horizon DNS in 2003. You will discover this after a long, painful journey in which you *will* contemplate the meaning of sentience, just not in the way you initially thought.
SREcon23 EMEA's deadline is getting closer! Please submit your talks and make this an awesome conference! #srecon #sre
Nice postmortem for recent Github outages https://github.blog/2023-05-16-addressing-githubs-recent-availability-issues/ #postmortem #github #sre #devops
Struggling to get #SRE traction within engineering teams? Gwen Berry and I share our "reliability benchmarking" approach to start the SRE conversation as part of #SLOconf here: https://www.youtube.com/watch?v=pGL69abT7r4
Do you like the mspaint style artwork of Slight Reliability? Now you can peruse it at your leisure on Instagram: https://www.instagram.com/slight_reliability/
(Trying something new... I will continue to upload new artwork from social posts, events, and the podcast + take photos and videos at events to add here) #sre #observability
Are you looking for open source talent? Check out our quick start guide and get your jobs posted in minutes on #OSJobHub https://opensourcejobhub.com/quickstart/ #hiring #OpenSource #tech #jobs #career #developer #Linux #FOSS #SRE #engineer
If you don't have anything better to do (and even if you do!) Check out #SLOconf23 because it's awesome and I have a talk in it!
It's free, it's cool, it's got all the talks available on YouTube so you don't even have to shitpost in slack (although I'd love the engagement tbh)
My talk is on Motivating SLOs Mathematically and it is DERANGED in the best way possible
Hi Friends! Like #DevOps, #SRE, or #PlatformEngineering ? Want to share with the community something you've learned? Submit to our local conference, #DevopsDays #TampaBay happening on September 21st at the #Tampa Riverwalk at Armature Works! #cfp https://sessionize.com/devopsdays-tampa-bay-2023/
One of my favourite reminders, as much as I struggle to apply it, is:
"That person is bound to do that, you might as well resent a fig tree for secreting juice." – Marcus Aurelius.
Also an old #SRE saying.
so is zabbix the go to opensource system monitoring tool now? or is it observium?
Brand new server, new #introduction!
Hi I'm Lily, a proud #trans #lesbian from #denver, a #mom of a 12 year old, and am #engayged to the love of my life.
This is the :calckey: alt for @firstname.lastname@example.org
I'm a #neurodivergent #WomanInTech working as an #SRE for Warner Bros Discovery, the creator of #FediHost (the platform hosting this and other servers), admin of outdoors.lgbt :mastodon:, and the newly launched calckey.lgbt :calckey:
For fun I like to #snowboard, #longboard, go #camping, play #videogames, hang out at the lesbian bar, play #guitar, and watch #tiktok in bed w/ my partner.
How did I miss this? Legit one of the best presentations I've seen in my life: https://www.youtube.com/watch?v=BEs6j-BOl20 #sre #srecon
Fighting for a healthy Internet for 20+ years, @mozilla has open positions for software and security engineers, marketing managers, SREs, and more. Check out the jobs now on #OSJobHub https://opensourcejobhub.com/company/488/ #jobs #career #engineer #software #security #marketing #finance #SRE #DataScientist
Abby Bangser shares her journey from a QA to a platform engineer in this episode. We discuss the similarities between testing and infrastructure-related areas like site reliability engineering. In addition, Abby explains observability and shares some sound advice regarding implementing it.
@abangser #SRE #Observability
Think you're not an open source professional? We have technical and non-technical positions at top open source companies. Browse jobs now on #OSJobHub https://opensourcejobhub.com/ #jobs #career #FOSS #engineer #SoftwareDeveloper #sales #marketing #SRE #DevOps #sysadmin
**Comparing the Top Eight Managed Kubernetes Providers** https://medium.com/@elliotgraebert/comparing-the-top-eight-managed-kubernetes-providers-2ae39662391b #sre #cloud #k8s
Made a crossword on common DevOps terminology.
when I was a practicing #SRE spending a couple hours spelunking in production observability data was my fave way to expand my system knowledge
Give my Production Scavenger Hunt a try if you've recently adopted a new #o11y tool, joined a new team, or feel the call to explore this afternoon
OH: "[...] It is a micro-service architecture with multiple services"...
🤔 Well, sir, that sounds like a tautology! #SoftwareEngineering #SRE
We are currently building up our #DevOps / #SRE / #Operations team.
Are you keen on #AWS, #Kubernetes, #Terraform, #PlatformEngineering or #CICD? Whether junior or señor, feel free to get in touch with me!
We're celebrating 1 year of Open Source JobHub! Let us help you cut through the noise and find the job that's right for you. Check out open positions on #OSJobHub https://opensourcejobhub.com/ #jobs #hiring #career #OpenSource #developer #engineer #SRE #Linux #FOSS #sales #marketing
Alright for config management what've we got that people like besides:
- Ansible (barely counts)
- CloudInit (barely counts)
- /bin/sh (blessed? cursed? you decide)
👍🏻 New Relic Grok - The first GenAI observability assistant. Get deep insights from heaps of telemetry data using natural language via a chat interface https://newrelic.com/platform/new-relic-grok #machinelearning #sre
I loved the opportunity to share my wisdom with such a thoughtful, engaged audience (and to meet likeminded nerds like @hazelweakly!)
I sincerely believe #SRE and #platform engineering teams are critical for sustaining software #resilience (and #security).
PS if you vibe with my talk and this thread, read my new book: https://securitychaoseng.com
The video of my #SREcon talk is live: https://youtu.be/DGdtfB1eY98
It's all about how SREs can align their mental models of a system with reality to sustain software #resilience -- because SREs are a critical mechanism of adaptation in our systems.
If you're an #SRE you're probably not like, waking up thinking, "How will I be the mechanism of adaptation today?" so I wanted to provide some scaffolding around the concept in the talk.
This will be a 🧵of five key takeaways:
Want to work alongside me? (you fool if you do). We have an opening for a Principal Architect on our Engineering Enablement leadership team.
If you've got a strategic and very human eye for platform services and infrastructure then please apply!
How do large organizations that use public #cloud services (like #aws and #gcp) optimise their costs? Are there off the shelf tools for cost optimisation and cost analysis? I imagine when people reference #finops, it is referencing a #sre function of rearchitecting and performance vs reliability tradeoffs, etc.
Why Cloud Zombies Are Destroying the Planet and How You Can Stop Them - Holly Cummins (email@example.com) https://hollycummins.com/cloud-zombies-qcon-london/
No wizardry needed to use Ansible's magic variable 'hostvars'
#Ansible #RedHat #SysAdmin #DevOps #SRE #Automation #PlatformEngineering #OpenSource
Looking to find your place in the open source ecosystem? Check out all jobs in #SoftwareDevelopment #sales #security #marketing #DevOps #LinuxDevelopment and more on #OSJobHub https://opensourcejobhub.com #jobs #career #OpenSource #SRE #cloud #engineer
It's not data exfiltration, it's an unscheduled offsite backup.
I'm speaking at SLOconf 2023! It'll be May 15-18th, and the talk is titled "Motivating SLOs Mathematically"
It's going to sound more like an unhinged late night rant than a carefully collected presentation, but I hope it gets people thinking! ❤️
"Have you ever wondered if there's something behind the experiential knowledge that we hold as best practices?" 👀
Also, big shout-out to @ahidalgosre for continually pestering me until I agreed to give a talk :ablobfoxbongo:
Dev[Sec]OpsDays is coming to #Prague again in May. Why don't you join me? https://talkweb.eu/openweb/3775/
#devOps #devsecops #ThreatModeling #Prague #techops #platform #sre
I'm an #SRE at a mid-sized #BayArea tech company, looking for resources, advice, and support on unionization drives.
This is an anonymous account, for now, because I am concerned about illegal and unethical reprisals from the leadership of my organization if it gets out that I am involved in unionization before we're able to get the NLRB involved.
Expect me to be talking about the tech industry's technical and social barriers to organization.
A few years ago, a young help desk admin flew across the country for a final #interview with me for a (remote) Sysadmin role. He was nervous to be interviewing for his dream job.
The guy next to him on the hop to #Boise was an engineer at some #Idaho #Tech company. When he learned about the interview, he spent the whole damn flight doing interview prep with him.
I wish I knew who that engineer was so I could thank him, kid nailed his interview. And I recently promoted him to #SRE.
Okay #linux #sre and #security nerds. I have an IP on my network. It's showing up a lot in pi-hole as a very active device.
Problem is, I can't identify it.
nmap -O sez "Too many fingerprints match this host to give specific OS details"
nmap alone says no ports are open.
But I can ping it.
MAC address says there's no vendor id for it (it starts with ae:1e:79)
So... what is this thing?
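One thing that can be checked from the prefix alone: in a MAC address, the 0x02 bit of the first octet marks the address as locally administered, meaning it was not taken from an IEEE-assigned vendor block, so a vendor lookup has nothing to find. Phones and laptops commonly use such randomized addresses on Wi-Fi, and so do virtual interfaces. A small sketch (the helper name `mac_flags` is made up for illustration) that tests the prefix quoted above:

```python
def mac_flags(mac_prefix: str) -> dict:
    """Decode the two flag bits carried in the first octet of a MAC address."""
    first_octet = int(mac_prefix.split(":")[0], 16)
    return {
        # 0x02 set: locally administered (no IEEE-assigned vendor prefix)
        "locally_administered": bool(first_octet & 0x02),
        # 0x01 set: group (multicast) address rather than a single device
        "multicast": bool(first_octet & 0x01),
    }


flags = mac_flags("ae:1e:79")
print(flags)
```

This only narrows things down. It suggests a device using a private or software-assigned address, not which device it is; matching the address against the DHCP lease table or the Wi-Fi client list on the router is the likely next step.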
"Eventually this customer has had enough. They leave. This represents both a sizable blow to revenue and a scathing indictment of your product’s reliability at scale. But, on the bright side, both MTTR and MTBF benefit enormously! That’ll look great on the quarterly slide deck." (~700w)
https://blog.danslimmon.com/2023/04/04/incident-metrics-tell-you-nothing-about-reliability/ #sre #devops #incidentresponse #postmortems
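To make the quoted scenario concrete, here is a rough sketch with invented numbers (the incident list, the 90-day window, and the customer labels are all hypothetical). MTTR is taken as the mean recovery time and MTBF as uptime in the window divided by the number of incidents, which is one common simplification rather than the article's own definition.

```python
from statistics import mean

WINDOW_HOURS = 90 * 24  # hypothetical 90-day reporting window

# (customer, hours to recover): invented numbers for illustration only
incidents = [
    ("big_customer", 6.0),
    ("big_customer", 8.0),
    ("big_customer", 5.0),
    ("big_customer", 9.0),
    ("others", 1.0),
    ("others", 2.0),
]


def mttr(durations):
    """Mean time to recover: average incident duration in hours."""
    return mean(durations)


def mtbf(durations, window_hours=WINDOW_HOURS):
    """Mean time between failures: uptime in the window per incident."""
    return (window_hours - sum(durations)) / len(durations)


before = [d for _, d in incidents]
# The big customer leaves, taking their incidents with them.
after = [d for c, d in incidents if c != "big_customer"]

print(f"MTTR before: {mttr(before):.2f}h, after: {mttr(after):.2f}h")
print(f"MTBF before: {mtbf(before):.1f}h, after: {mtbf(after):.1f}h")
```

Both numbers move in the "good" direction even though the only thing that changed is that the most demanding customer stopped using the product.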
@timo are there any good alternatives? We use OpsGenie but suspect it is pretty much the same thing. Any disruptive new players in the market?
🤔 #Google plans to reduce its ratio of site reliability engineers to #developers by using automation. There is currently one site reliability engineer for every 10 software engineers, and the target would bring this to one #SRE for every 20 software engineers. Some Googlers are interpreting this as a sign of impending #layoffs to meet that goal, or it could mean that Google will hire half as many site reliability engineers as before, the source said https://www.businessinsider.com/google-layoffs-engineering-leader-urs-holzle-efficiency-2023-3
I've been on a slow-moving mission against reliance on manual runbooks, in favour of automation.
Yes people will skip steps when doing your manual runbook.
Another drawback is that your manual runbooks have no regression test against them. The steps will absolutely break.
(And no, a wiki stringing 6 different scripts together in the right order and asking people to run them on the right hosts is not automation.)
I should make more slide decks. Enjoying a wee trip down memory lane with some past team members.
Did you know that server actions taken by #Wikimedia sysadmins are logged on-wiki, with history going back to 2004?
And since 2019 have been broadcast to the fediverse via @wikimedia_sal? (which moved instances today)
Details about how it works technically are at https://wikitech.wikimedia.org/wiki/Tool:Stashbot - it's written and maintained by @bd808
#observability related question for anyone who happens to build or run a tracing backend: what happens if I send duplicate spans?
In this article I explain the difference between #metrics, #logs, #traces, and #profiles as simply as possible (using the metaphor of a café). I also discuss the relative strengths and weaknesses of each telemetry type. https://squaredup.com/blog/metrics-vs-logs-vs-traces-vs-profiles/ #observability #sre