2026-02-02 - 2026-02-22-DDIA_5

chp 2 - section 3:

reliability - the system keeps working correctly even when something goes wrong, instead of the entire system grinding to a halt
a fault - a part of the system not working
a failure - the whole system is not working
SPOF - single point of failure
chaos engineering - purposely causing faults to test fault-tolerance
tolerate faults instead of preventing them
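
quick sketch of the fault vs failure idea with injected faults, chaos-engineering style. Everything here is made up for illustration: `FlakyReplica`, `tolerant_read` and the `fault_rate` knob are hypothetical names, not from the book or any library. A replica raising an error is a fault; the read only becomes a failure if every replica faults.

```python
import random


class FlakyReplica:
    """Hypothetical replica that fails a read with a given probability."""

    def __init__(self, name, fault_rate, rng):
        self.name = name
        self.fault_rate = fault_rate  # 0.0 = never faults, 1.0 = always faults
        self.rng = rng

    def read(self, key):
        # Injected fault: this one component stops working for this call.
        if self.rng.random() < self.fault_rate:
            raise ConnectionError(f"{self.name} is down")
        return f"value-of-{key}"


def tolerant_read(replicas, key):
    """Try each replica in turn; raise (a failure) only if all of them fault."""
    errors = []
    for replica in replicas:
        try:
            return replica.read(key)
        except ConnectionError as exc:
            errors.append(str(exc))  # a fault was tolerated, move on
    raise RuntimeError(f"all replicas failed: {errors}")


rng = random.Random(0)
replicas = [
    FlakyReplica("a", 1.0, rng),  # always faulty
    FlakyReplica("b", 1.0, rng),  # always faulty
    FlakyReplica("c", 0.0, rng),  # healthy
]
print(tolerant_read(replicas, "user:1"))
```

two of the three replicas fault on every call, but the read still succeeds because the third one is healthy. Turning the fault rates up on purpose in a test is a small-scale version of purposely causing faults to check fault-tolerance.
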
hardware faults:
- increase hardware fault tolerance through redundancy - so that if one hard drive/RAM/whatever fails there are others already in place to do the work
- if your system is a single node - it will have to come down for upgrades and patches
- if your system is multi-node you can have rolling upgrades
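
rough sketch of what a rolling upgrade could look like, assuming nodes are plain dicts and `is_healthy` is some health-check callback (both are hypothetical, made up for these notes). The point is that only one node is out of rotation at a time, so the rest keep serving.

```python
def rolling_upgrade(nodes, new_version, is_healthy):
    """Upgrade nodes one at a time so the others keep serving traffic.

    nodes: list of dicts like {"name": "n1", "version": "1.0", "serving": True}
    is_healthy: hypothetical health-check callback, node -> bool
    Returns True if every node was upgraded, False if the rollout was halted.
    """
    for node in nodes:
        node["serving"] = False            # take this one node out of rotation
        old_version = node["version"]
        node["version"] = new_version      # patch / upgrade it
        if not is_healthy(node):
            node["version"] = old_version  # roll this node back
            node["serving"] = True
            return False                   # stop before touching more nodes
        node["serving"] = True             # back in rotation, then next node
    return True


nodes = [{"name": f"n{i}", "version": "1.0", "serving": True} for i in range(3)]
ok = rolling_upgrade(nodes, "1.1", is_healthy=lambda node: True)
print(ok, [n["version"] for n in nodes])
```

halting on the first unhealthy node also limits how far a bad release spreads, which matters because a software fault goes to every node it is deployed to.
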
software faults: a fault in software propagates to all nodes it's deployed to
sociotechnical: interplay between social and technical aspects in complex systems.
humans:
- make mistakes
- blaming them is counter-productive
- design systems and their configuration to encourage doing the right thing, the same way you design for customers
- preventable mistakes should be seen as a problem of the system that allowed them, not of the human, who might have made them while rushed and overwhelmed
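
one way to picture "the system should make the right thing easy" is a guardrail on a destructive action. This is only an illustration: `delete_database`, `confirm_name` and `dry_run` are hypothetical names, not a real API. The safe path is the default, and a real delete needs the name retyped, so a rushed operator is less likely to slip.

```python
def delete_database(name, confirm_name=None, dry_run=True):
    """Hypothetical admin action with guardrails against slips.

    - dry_run defaults to True, so the safe path is the easy path
    - a real delete requires retyping the exact database name
    """
    if dry_run:
        return f"DRY RUN: would delete {name}"
    if confirm_name != name:
        raise ValueError(f"confirmation mismatch: retype '{name}' to delete it")
    return f"deleted {name}"


print(delete_database("orders-prod"))
print(delete_database("orders-staging", confirm_name="orders-staging", dry_run=False))
```
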
unreliable software has far-reaching consequences beyond the app being down and loss of productivity: people have been wrongly sent to jail due to bugs