Production Trenches: Pitfalls and Pratfalls
Bri Hatch
Personal: Onsight, Inc, bri@ifokr.org
Work: ExtraHop Networks, bri@extrahop.com
Copyright 2015, Bri Hatch,
Creative Commons BY-NC-SA License
Audience
Who should be here?
- People who aren't interested in Software Patent Litigation
- People who want to make new mistakes
- People who want to know the difference between an SLA and the TSA
- People who don't want to see any code
- $ egrep '^([j-m].ed|I)$' /usr/share/dict/words
Background
Who's this Bri guy?
Importance of Analogy
The Datacenter Upgrade
- "That sounds easy!"
- Airplanes
A Fail Storm
"Apache was returning blank pages Sunday morning starting at 5:23 - what was wrong?"
A Fail Storm (cont)
Looking at the logs
06/Nov 05:23:59 "GET /index.html HTTP/1.1" 200 42331 ""
06/Nov 05:24:03 "GET /thing1/ HTTP/1.1" 200 76442 "http://example.com/index.html"
06/Nov 05:24:12 "GET /thing2/ HTTP/1.1" 200 65232 "http://example.com/thing1/"
....
Looking at logs
Realized - the reporter is in Eastern time, we're in Central
A Fail Storm (cont)
Wait - these look the same!
Looking at the right logs
06/Nov 04:23:59 "GET /stuff/ HTTP/1.1" 200 61472 "http://example.com/"
06/Nov 04:24:03 "GET /thing0/ HTTP/1.1" 200 86442 "http://example.com/index.html"
06/Nov 04:24:12 "GET /about/index.html HTTP/1.1" 200 57774 ""
....
Realize the local clock is also wrong!
A Fail Storm (cont)
Looking at the right logs - really this time!
Still everything looks good... :-(
06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
06/Nov 02:00:03 "GET /search/ HTTP/1.1" 200 92186 "http://example.com/about/index.html"
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"
....
A Fail Storm (cont)
Looking at the right logs - fourth time's the charm!
Whoops - this was the DST change, so the same hour shows up twice in the log
06/Nov 02:59:58 "GET /cart/ HTTP/1.1" 200 42331 "http://example.com/toolbox/"
06/Nov 02:00:03 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/"
06/Nov 02:00:03 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/checkout/"
06/Nov 02:00:05 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/checkout/"
06/Nov 02:00:22 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/checkout/"
....
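A backwards jump like the one above is easy to miss by eye. A minimal sketch of a scan for it, assuming the log layout shown on these slides (time of day in the second whitespace-separated field) and a single day of logs; `find_time_jumps` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: report lines whose HH:MM:SS is earlier than the
# previous line's. Assumes the time is field 2, as in the slide logs.
find_time_jumps() {
    awk 'NR > 1 && ($2 "") < (prev "") {
             # Appending "" forces a string comparison of HH:MM:SS values
             printf "line %d: clock went backwards (%s -> %s)\n", NR, prev, $2
         }
         { prev = $2 }'
}
```

Usage: `find_time_jumps < access.log`. A midnight rollover would also be flagged, since the sketch ignores the date field.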
A Fail Storm (cont)
Why did our monitoring not catch this issue?
$ /usr/lib/nagios/plugins/check_http -I example.com
HTTP OK: HTTP/1.1 200 OK - 0 bytes in 0.006 second response time
Serving errors is really fast!
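The check passed because a 200 status with an empty body still counts as OK. check_http has a page-size option (`-m`) and an expected-string option (`-s`) for this case. The logic they add can be sketched as below; `classify_response` and its three arguments are hypothetical, used only to illustrate the idea with Nagios-style exit codes:

```shell
#!/bin/sh
# Hypothetical helper: classify an HTTP result the way a Nagios plugin would.
# Usage: classify_response STATUS_CODE BODY_BYTES MIN_BYTES
classify_response() {
    status=$1
    bytes=$2
    min=$3
    if [ "$status" -ne 200 ]; then
        echo "CRITICAL: HTTP status $status"
        return 2
    fi
    # A 200 with a tiny or empty body is treated as a failure, not a success
    if [ "$bytes" -lt "$min" ]; then
        echo "CRITICAL: HTTP 200 but only $bytes bytes (expected at least $min)"
        return 2
    fi
    echo "OK: HTTP 200, $bytes bytes"
    return 0
}
```

With the numbers from this incident, `classify_response 200 0 1000` reports CRITICAL and returns 2, where the status-only check said OK.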
A Fail Storm (cont)
Happened to have dumps on disk
$ ps -ef | grep tcpdump
tcpdump -n -s 9999 -w /bigdisk/dump.out -G 3600
Wireshark time!
Takeaways
- Time synchronization with NTP
- Timezone standardization with UTC
- Better Monitoring
- Off-box logs / Aggregation
- Wire Data
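For the UTC takeaway, one possible convention is to stamp every log line in UTC, so that neither the reader's timezone nor a DST change alters the timestamp. A minimal sketch; `log_utc` is a hypothetical helper, and the ISO 8601 layout is an assumption rather than anything from the original setup:

```shell
#!/bin/sh
# Hypothetical helper: prefix a message with an ISO 8601 UTC timestamp.
# date -u ignores the local timezone, so there is no repeated DST hour.
log_utc() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
}
```

Usage: `log_utc "checkout returned 0 bytes" >> /tmp/example.log` (the path is a placeholder).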
Logs Lie
Logs Lie - WTF?
- Say what the system thought it did
- Only log what the programmer(s) thought to log
- Get multiple sources of the truth
Handoff to kernel
Negotiated and remote dropped
Monitoring
- Light vs heavy
- Targeted vs generic
- Smart vs stupid
Made very smart checks with WWW::Mechanize
Every push required new logic
Slowed down bip checks
Monitoring (cont)
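check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
       [-J <client certificate file>] [-K <private key>]
       [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L] [-E] [-a auth]
       [-b proxy_auth] [-f <ok|warning|critical|follow|sticky|stickyport>]
       [-e <expect>] [-d string] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
       [-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
       [-A string] [-k string] [-S <version>] [--sni] [-C <warn_age>[,<crit_age>]]
       [-T <content-type>] [-j method]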
Monitoring (cont)
Catch the known
- /fasthealth?site=www
- /fullhealth?site=www
Monitoring (cont)
Catch the unknown
- RUM
- Wire Data
- Angry Users
Alerting
Alert based on
- Known-bad
- Percentages
- Trends
- Rapid Changes
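As an illustration of alerting on a percentage rather than a single known-bad event, the sketch below computes the share of "200 with 0 bytes" responses in a batch of log lines. It assumes the log layout used earlier in this deck (status in field 6, byte count in field 7); `empty_response_pct` is a hypothetical helper, and any alert threshold is left to the caller:

```shell
#!/bin/sh
# Hypothetical helper: percentage of requests answered "200" with 0 bytes.
# Reads log lines on stdin; assumes status is field 6 and bytes is field 7.
empty_response_pct() {
    awk '$6 == 200 && $7 == 0 { bad++ }
         END {
             if (NR > 0) printf "%.0f\n", 100 * bad / NR
             else print 0
         }'
}
```

Usage: `tail -n 1000 access.log | empty_response_pct`, then compare the result against a threshold, or against the previous run to catch a rapid change.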
SLAs
SLA: Measure of uptime
- 99% == 3.7 d/y, 7 h/mo
- 99.9% == 8.8 h/y, 44 m/mo
- 99.99% == 53 m/y, 4 m/mo
9 8s
SLAs (cont)
What really is an SLA?
- Measure of risk-taking potential!
- 99% == 3.7 d/y, 7 h/mo
- 99.9% == 8.8 h/y, 44 m/mo
- 99.99% == 53 m/y, 4 m/mo
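The downtime budgets follow from simple arithmetic: allowed downtime is (100 - SLA%) of the period. A minimal sketch, assuming a 365-day year and a month of one twelfth of a year; `downtime_budget` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: allowed downtime for a given SLA percentage.
# Assumes a 365-day year and a month of exactly 1/12 year.
downtime_budget() {
    LC_ALL=C awk -v pct="$1" 'BEGIN {
        down = (100 - pct) / 100          # fraction of time allowed down
        per_year = down * 365 * 24 * 60   # minutes in a 365-day year
        printf "%.1f min/year, %.1f min/month\n", per_year, per_year / 12
    }'
}
```

For example, `downtime_budget 99.99` prints `52.6 min/year, 4.4 min/month`: that is the whole budget for risky changes at four nines.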
Limit yourself
Restrictions are freeing
- Limited number of languages
- Rigid style guides
- Hypervisors / Clouds
- Code Review
- PM
- Tools
esxi, hyperv, kvm, virtualbox, aws, azure, openstack
gerrit, stash, gitolite
Use the tools as designed
- The tool author wrote it that way for a reason
- You are not smarter than the tool author
- If you are, use a different tool
Git is written in C, so contributors face a high bar
Seen in the wild: git bent into svn, with incrementing revision numbers and single commits only
The team did a lot of work on a new branch, then could not push it
Don't be clever
- More readable code is better than shorter line noise (see: perl)
- You are not smarter than your coworkers
- You are smarter than your 6 month younger self
Don't be clever (cont)
What does this do?
rsync data dns_server:/var/tinydns/data
Don't be clever (cont)
$ cat ~/.ssh/authorized_keys
command="/opt/bin/syncw" ssh-rsa AAAAB3NzaC1yc2EAAAAB....
Don't be clever (cont)
$ cat /opt/bin/syncw
#!/bin/sh
rsync --server . /var/dns/upstream_data
/home/bri/bin/makedns
Don't be clever (cont)
$ cat /home/bri/bin/makedns
#!/bin/sh
for dir in /var/dns/tinydns-[0-9]*/root
do
    cd $dir
    make
done
Don't be clever (cont)
$ cat /var/dns/tinydns-[0-9]*/root/Makefile
SRC_DATA=/var/dns/upstream_data
LOCAL_ZONES=data.local
data.cdb: $(LOCAL_ZONES) $(SRC_DATA) $(DIRS)
	sort -u $(LOCAL_ZONES) $(SRC_DATA) >> data
	/usr/bin/tinydns-data
Don't be clever (cont)
This could have all been written as:
rsync data dns_server:/var/tinydns/data
ssh dns_server 'cd /var/tinydns/data && /usr/bin/tinydns-data'
Find your own Fails!
Any questions?