2026-02-02 - 2026-02-22-DDIA_5
chp 2 - section 3:
- reliability - the system keeps working correctly even when something goes wrong, instead of grinding to a halt
- a fault - one part of the system stops working correctly
- a failure - the system as a whole stops working
- SPOF - single point of failure: a component whose fault takes the whole system down
- chaos engineering - deliberately causing faults to test fault tolerance
- tolerate faults instead of trying to prevent them
- hardware faults:
- increase hardware fault tolerance through redundancy - so that if one hard drive/RAM module/whatever fails, there are others already in place to take over the work
- if your system is a single node - it will have to come down for upgrades and patches
- if your system is multi-node you can do rolling upgrades: patch one node at a time while the others keep serving
- software faults: a bug in the software affects every node it is deployed to, so these faults are correlated across nodes
- sociotechnical: interplay between social and technical aspects in complex systems.
- humans:
- make mistakes
- blaming them is counter-productive
- design systems and their configuration so that doing the right thing is easy, the same way you would design for customers
- preventable mistakes should be seen as a problem of the system that allowed them, not of the human, who may have been rushed and overwhelmed
- unreliable software has far-reaching consequences beyond the app being down and loss of productivity: people have been wrongly sent to jail because of bugs
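
To make the fault vs failure distinction from the notes above concrete, here is a minimal sketch in Python. It is not from the book: `Replica`, `ReplicatedStore`, and `inject_fault` are hypothetical names made up for illustration. The idea is that redundancy lets a read survive one broken replica (a fault), and only when every replica is down does the store itself stop working (a failure). `inject_fault` is a toy stand-in for chaos engineering: it breaks a random replica on purpose so the fallback path actually gets exercised.

```python
import random


class Replica:
    """Hypothetical storage node that can be switched off on purpose."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.data = {}

    def read(self, key):
        if not self.healthy:
            # A fault: this one component is not working.
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Keeps a copy of every value on every replica (redundancy)."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Simplification: writes go straight to every replica's memory,
        # ignoring health, so the sketch stays focused on reads.
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        # Try each replica in turn; one faulty replica is tolerated.
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue
        # A failure: no replica could answer, so the whole store is down.
        raise RuntimeError("all replicas are down")


def inject_fault(store, rng):
    """Chaos-style test helper: deliberately break one random replica."""
    victim = rng.choice(store.replicas)
    victim.healthy = False
    return victim


store = ReplicatedStore([Replica("a"), Replica("b"), Replica("c")])
store.write("greeting", "hello")
victim = inject_fault(store, random.Random(0))
print(f"broke replica {victim.name}, read still returns:", store.read("greeting"))
```

With a single replica the same injected fault would turn straight into a failure, which is what makes that replica a SPOF.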