https://github.com/abhishektiwari/operational-excellence-primer
Operational Excellence Primer
Basic Info
- Host: GitHub
- Owner: abhishektiwari
- License: mit
- Default Branch: main
- Size: 6.84 KB
Created about 4 years ago
· Last pushed about 4 years ago
Operational Excellence Primer
Design Principles
Perform operations as code
- Everything as source code
- Compliance, Infrastructure, Delivery Pipeline, Monitoring, Alerting, Playbook
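As a minimal sketch of treating alerting as code, the rule below is plain data that can live in version control next to the service and be reviewed like any other change. The `AlertRule` type, its field names, the metric name, and the threshold are illustrative assumptions, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition kept in version control."""
    name: str
    metric: str
    threshold: float
    playbook_path: str  # where the matching playbook entry lives


def firing(rule: AlertRule, samples: dict[str, float]) -> bool:
    """Return True when the latest sample for the rule's metric exceeds its threshold."""
    value = samples.get(rule.metric)
    return value is not None and value > rule.threshold


# Assumed example values, for illustration only.
HIGH_ERROR_RATE = AlertRule(
    name="high-error-rate",
    metric="http_5xx_ratio",
    threshold=0.05,
    playbook_path="playbooks/high-error-rate.md",
)
```

Because the rule carries a pointer to its playbook, a change to either one shows up in the same review.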
Make frequent, small, reversible changes
- Feature toggles
- Vertical vs horizontal slicing
- Trunk-based development
- Blue-green deployments
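One way to make a change small and reversible is to ship it behind a feature toggle with a gradual rollout. The sketch below assumes a hypothetical `is_enabled` helper that hashes the flag name and user ID into a stable bucket; the function name and the 0-99 bucketing scheme are assumptions for illustration.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Place a user in a stable 0-99 bucket and enable the flag for buckets below the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Setting `rollout_percent` back to 0 reverses the change without a deployment, and a user who is enabled at a given percentage stays enabled as the percentage grows.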
Refine operations procedures frequently
- Run regular game days
- Keep the playbook up to date
- Perform chaos engineering experiments
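A chaos experiment can start very small: wrap a dependency call so that a controlled fraction of calls fail, then check that the caller degrades the way the playbook says it should. Everything in this sketch (`with_fault_injection`, `fetch_price`, `price_with_fallback`) is a hypothetical stand-in, not a real chaos tool.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item: str) -> float:
    """Stand-in for a call to a downstream dependency."""
    return 9.99


def price_with_fallback(fetch, item: str, default: float = 0.0) -> float:
    """Behavior under test: fall back to a default when the dependency fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default
```

Passing a seeded `random.Random` keeps a game-day run repeatable.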
Anticipate failure
- Perform pre-mortem exercises to identify sources of failure
- Murphy's Law - Expecting the Unexpected
- Test your failure scenarios and validate impact
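The output of a pre-mortem is a list of ways things could fail, and it helps to decide which ones to test first. The sketch below ranks scenarios by likelihood times impact, a common risk-matrix convention that is assumed here rather than taken from this primer; the 1-5 scales and the `FailureScenario` type are also assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    description: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def rank(scenarios: list[FailureScenario]) -> list[FailureScenario]:
    """Order scenarios from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)
```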
Learn from all operational failures
- Run blameless postmortems
- Create organizational memory
- Cause and effect analysis - 5 Whys, Fishbone diagram
How to do it
Preparation
- Pre-mortem: what could go wrong
- Game days: chaos experiments
  - Active failure injection vs. team tabletop simulation
  - Build a fresh understanding of the system
- Read and update playbooks
Risk Management
- Frequent, small, reversible changes
- Feature toggling
- Blue-green/canary/rolling deployments
- Bring high-risk items ahead in the project timeline
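For a canary deployment, the decision to promote or roll back can be written down as a small, testable rule. This sketch compares the canary's error rate with the baseline's; the function name, the `max_ratio` tolerance of 1.5, and the `min_requests` floor of 100 are assumed values to be tuned per service.

```python
def canary_is_healthy(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 1.5,
    min_requests: int = 100,
) -> bool:
    """Return True only when the canary has seen enough traffic and its error
    rate is at most max_ratio times the baseline error rate."""
    if canary_requests < min_requests or baseline_requests == 0:
        return False  # not enough evidence to promote yet
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate * max_ratio
```

Returning `False` on thin evidence keeps the default conservative: the canary keeps its small share of traffic until there is enough data.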
Troubleshooting
- Triage first
  - Make the system work as well as it can under the circumstances
- Then examine
  - Each component of the system
  - Metrics plotted as time series
  - Logs, particularly exceptions and errors
- Then diagnose
  - What changed: deployment or config changes
    - See if changes correlate with system behaviour
  - Divide and conquer
    - Data flow between components: distributed tracing
    - Divide diagnosis by layers or steps
- Finally, test and treat
- Pitfalls
  - Looking at symptoms that aren’t relevant
  - Misunderstanding the meaning of a system metric
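The "what changed" step can be sketched as a lookup: given the time an anomaly started, list the deployments and config changes that landed shortly before it. The `changes_before` helper and the 30-minute default window are assumptions for illustration; a change log in this `(timestamp, description)` shape would have to come from your own delivery tooling.

```python
from datetime import datetime, timedelta


def changes_before(
    anomaly_start: datetime,
    changes: list[tuple[datetime, str]],
    window: timedelta = timedelta(minutes=30),
) -> list[str]:
    """Return descriptions of changes made within `window` before the anomaly,
    most recent first."""
    suspects = [
        (when, what)
        for when, what in changes
        if anomaly_start - window <= when <= anomaly_start
    ]
    return [what for _, what in sorted(suspects, reverse=True)]
```

A change inside the window is a suspect, not a verdict; correlation only tells you where to look first.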
Event Response
- A clear and well-defined line of command
- Delegated roles and responsibilities
  - Incident commander
  - Response team
  - Communication lead
  - Planning lead
- Record the state and actions
  - All details of an incident
  - Every action on debugging and mitigation
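Recording state and actions is easier when the record is append-only and timestamped. The `IncidentLog` type below is a hypothetical minimal structure, not a real incident-management API; it stores who did what and when, and renders a timeline for the postmortem.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentLog:
    title: str
    commander: str
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, actor: str, action: str, when: Optional[datetime] = None) -> None:
        """Append a timestamped entry; defaults to the current UTC time."""
        self.entries.append((when or datetime.now(timezone.utc), actor, action))

    def timeline(self) -> list[str]:
        """Return entries in chronological order as readable lines."""
        return [
            f"{when.isoformat()} {actor}: {action}"
            for when, actor, action in sorted(self.entries)
        ]
```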
Root Cause Analysis
- 5 Whys - cause and effect
- Fishbone diagram
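A 5 Whys analysis is a chain in which each answer becomes the next question, and the last answer is the candidate root cause. The `five_whys` helper below is a hypothetical sketch that only renders such a chain as indented text; the incident in the usage example is invented.

```python
def five_whys(symptom: str, answers: list[str]) -> str:
    """Render a symptom and its chain of 'why' answers as indented text."""
    lines = [f"Problem: {symptom}"]
    for depth, answer in enumerate(answers, start=1):
        lines.append(f"{'  ' * depth}Why {depth}? {answer}")
    return "\n".join(lines)


print(five_whys(
    "Checkout returned errors",
    [
        "The API ran out of database connections",
        "A deploy doubled the worker count",
        "Pool size is not tied to worker count",
    ],
))
```

The chain does not have to reach exactly five levels; it stops when the answer is something the team can change.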
Organizational Learning
- Creating, retaining, and transferring knowledge within an organization
- Every problem as an opportunity to build a better organization response
- Sharing and transparency
- applied to all systems and teams organization-wide
- Conduct cross-team reviews of postmortems
- Postmortems as source code (linking to improvements)
- Run blameless postmortems
- Review postmortem culture
- Corrective or Preventive Actions
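Treating a postmortem as source code can mean keeping it as structured data whose action items point at the changes that implement them. The `Postmortem` and `ActionItem` types, and the `tracking_ref` field in particular, are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    kind: str          # "corrective" or "preventive"
    tracking_ref: str  # hypothetical reference to the change that implements it
    done: bool = False


@dataclass
class Postmortem:
    title: str
    summary: str
    actions: list[ActionItem] = field(default_factory=list)

    def open_actions(self) -> list[ActionItem]:
        """Return action items that have not been completed yet."""
        return [a for a in self.actions if not a.done]
```

A cross-team review can then start from `open_actions()` across all postmortems instead of rereading each document.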
Owner
- Name: Abhishek Tiwari
- Login: abhishektiwari
- Kind: user
- Location: NY
- Company: Amazon
- Website: https://www.abhishek-tiwari.com/
- Twitter: abhishektiwari
- Repositories: 35
- Profile: https://github.com/abhishektiwari
Tech Savant, Servant Leade.
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Abhishek Tiwari | a****k@a****m | 4 |