Production Trenches: Pitfalls and Pratfalls
Bri Hatch |
Personal |
Work |
Onsight, Inc
bri@ifokr.org |
ExtraHop Networks
bri@extrahop.com |
Copyright 2016, Bri Hatch,
Creative Commons BY-NC-SA License
Welcome To
Audience
Who should be here?
- People who aren't interested in the current state of
Software Patent Litigation
- People who want to make new mistakes
- People who want to know the difference between an SLA and the TSA
- People who don't want to see code in presentations
- $ egrep '^([j-m].ed|I)$' /usr/share/dict/words
- I
lied
A Fail Storm
"Apache was returning blank pages Sunday morning starting at 5:23 - what was wrong?"
A Fail Storm (cont)
Looking at the logs
06/Nov 05:23:59 "GET /index.html HTTP/1.1" 200 42331 ""
06/Nov 05:24:03 "GET /thing1/ HTTP/1.1" 200 76442 "http://example.com/index.html"
06/Nov 05:24:12 "GET /thing2/ HTTP/1.1" 200 65232 "http://example.com/thing1/"
....
But he was in Eastern, server is in Central!
A Fail Storm (cont)
Looking at the right logs
06/Nov 04:23:59 "GET /stuff/ HTTP/1.1" 200 61472 "http://example.com/"
06/Nov 04:24:03 "GET /thing0/ HTTP/1.1" 200 86442 "http://example.com/index.html"
06/Nov 04:24:12 "GET /about/index.html HTTP/1.1" 200 57774 ""
....
But the local clock is also wrong!
A Fail Storm (cont)
Looking at the right logs - really this time!
06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
06/Nov 02:00:03 "GET /search/ HTTP/1.1" 200 92186 "http://example.com/about/index.html"
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"
....
Still everything looks good - WTF?
A Fail Storm (cont)
06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
Let's parse!
$ awk '{print $7, $2, $4}' access.log | sort -n
0 02:00:04 /thing3/
0 02:00:09 /thing0/
0 02:00:24 /checkout/
0 02:00:32 /checkout/
...
55424 02:00:01 /thing0/
...
78505 02:00:42 /thing3/
78505 02:59:42 /thing3/
...
92186 02:00:03 /search/
But... But...
A Fail Storm (cont)
$ grep '06/Nov 02:.* /thing3/.* 0 ' access.log | wc -l
29
$ grep '06/Nov 02:.* /thing3/.* 0 ' access.log | head -1
06/Nov 02:00:04 "GET /thing3/ HTTP/1.1" 200 0 "http://example.com/thing1/"
$ grep '06/Nov 02:.* /thing3/.* ' access.log | head -1
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"
A Fail Storm (cont)
So it has come to this...
$ less access.log
...
06/Nov 02:59:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/"
06/Nov 02:59:58 "GET /cart/ HTTP/1.1" 200 42331 "http://example.com/toolbox/"
06/Nov 02:00:04 "GET /thing3/ HTTP/1.1" 200 0 "http://example.com/thing1/"
06/Nov 02:00:09 "GET /thing0/ HTTP/1.1" 200 0 "http://example.com/"
06/Nov 02:00:24 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/"
06/Nov 02:00:32 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/"
...
DST change! You Bastard!
A Fail Storm (cont)
Why did our monitoring not catch this issue?
$ /usr/lib/nagios/plugins/check_http -I example.com
HTTP OK: HTTP/1.1 200 OK - 0 bytes in 0.006 second response time
A Fail Storm (cont)
We happened to have dumps on disk - Wireshark Time!
$ ps -ef | grep tcpdump
tcpdump -n -s 9999 -w /bigdisk/dump.out -G 3600
Blame enough to go around.
- The backend DB was at fault - it returned zero rows for all queries.
- The application was at fault - it did not handle this correctly.
- The monitoring was at fault - it thought things were fine.
Takeaways
Takeaways
- Time sychronization with NTP
- Timezone standardization with UTC
- Better Monitoring
- Off-box logs / Aggregation
- Serving errors is really easy and fast!
- Never 200 for errors!
- Wire Data
Logs Lie
Logs Lie - WTF?
- Say what the system thought it did
- Only log what the programmer(s) thought to log
- Hard to correlate
- Always get multiple sources to reach the truth!
Examples:
- poor elapsed time computations
- write(2) handoff to kernel
- logs with ugly date formats (Y U No Year, syslog?)
Oncall
Typical Oncall Week
Pageworthy?
Pages must
- Go to the right person
- Be time critical
- Be actionable
- Require human intelligence to remediate
- Have an escalation path
Not pagewothy!
If you say any of these, it should not page.
- "I can finish lunch first."
- "This always happens this time of day."
- "It typically clears itself up."
- "I just run this script and it's fine again."
- "Only so-and-so can fix this."
Alert Criteria
Alert based on
- Known-bad
- Percentages
- Trends
- Rapid Changes
Don't alert based on
- Thresholds (usually)
- Partial impairment
SLAs
- SLI
- Service Level Indicator - something you can measure
- SLO
- Service Level Objective - a value to obtain
- SLA
- Service Level Agreement - an incentive to meet your SLO
SLAs (cont)
Example SLO: Measure of uptime
- 99% == 3.7 d/y, 7 h/mo
- 99.9% == 8.7 h/y, 44 m/mo
- 99.99% == 56 m/y, 4 m/mo
- 99.999% == 5 m/y, 30 s/mo
- 55.5555555% == 162 d/y, 324 h/mo
SLAs (cont)
SLOs can be any sort of measure, such as
- X% served at less than Y ms
- Error ratio never greater than X%
- Serve small files under 500ms, medium files under 1s, large files under 3s
Waffly SLOs
- Never down for more than 5 consecutive minutes in a one hour window
- Less than 0.1% unscheduled downtime
- Message guaranteed to be delivered no more than 72 hours after being sent
SLAs (cont)
What really is an SLA?
Measure of risk-taking potential!
- 99% == 3.7 d/y, 7 h/mo
- 99.9% == 8.7 h/y, 44 m/mo
- 99.99% == 56 m/y, 4 m/mo
- 99.999% == 5 m/y, 30 s/mo
When SLA violated...
There are many things you can do when SLA is not met
- Dedicate a person to improving the system
- Block all feature work until back in SLA
- Give the pager to the devs
Oncall
Later Oncall Weeks
Oncall Tools
Oncall Tools
- Alertless monitors
- Documentation
- Oncall Playbook
- Call Trees
Monitoring Evolution
Monitoring is an ongoing process
- check_http - but only showed up/down of some pages
- WWW::Mechanize - but never in sync w/ product changes
- Server self-check - but broke LB timeouts
- /fasthealth and /fullhealth - but took time to roll out everywhere
- Standardized URL - but pool misconfigs possible
- Standardized on /fasthealth?site=www and /fullhealth?site=www
Monitoring (cont)
Catch the unknown
- RUM
- Trending
- Wire Data
- Angry Users
Monitoring (cont)
Monitor the right things
define command {
command_name http_port_grep_header
command_line /usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$'
-I '$HOSTADDRESS$' -p '$ARG1$' -u '$ARG2$' -N
}
...
define service {
check_command http_port_grep_header!3142!/acng-report.html
hostgroup_name apt-cacher-ng
service_description apt_cacher_ng
use generic_service_template
}
...
define host {
host_name vm-103
address 192.168.253.288
hostgroups apt-cacher-ng
...
Monitoring (cont)
Monitor the right things
- Monitor the user-facing name, e.g. the CNAME
- Use and rely on DNS
- Set up dependencies
Backups
Backups
- Daily/weekly/monthly
- Local? Remote? Cloud?
- Yes! The more the better!
- Never underestimate the bandwidth of a pack of Emus
- If you haven't restored, you're not backing up
Backups (cont)
From my pager conversation this am:
- The website is broken!
- We turned off local backups - we needed the space!
- Remote rsync backups have been broken for months now - haven't had time to figure out why!
- The cloud backups stopped when we stopped paying for the service!
Backups (cont)
Pager conversation continued:
- We removed the old logs - we needed more space!
- We turned off new logs - they slowed down the computer!
- We didn't maintain your monitoring - takes too much time to keep current!
- We never implemented any other monitoring systems - costs money!
So... can you figure out what's wrong?
Testing
Testing
- Unit Tests!
- Good test data
- Fuzzing
- Shakespeare
- ... questionable content
Limit yourself
Restrictions are freeing
- Limited number of languages
- Rigid style guides
- Hypervisors / Clouds
- Code Review
- PM
- Tools and infrastructure
Use the tools as designed
Use the tools as designed
- The tool author wrote it that way for a reason
- You are not smarter than the tool author
- If you are, use a different tool
git is in C so contributors have a high bar
git becoming svn w/ incrementing revision numbers, single commits only
team wrote a lot of work on new branch, could not push
Don't be clever
Don't be clever
- Longer but more readable code is better than shorter line noise (see: perl)
- You are not smarter than your coworkers
- You are smarter than your 6 month older self
Don't be clever (cont)
What does this do?
#!/bin/bash
set -e
set -u
cd $(dirname $0)
git pull --rebase
cd path/to/dnsstuff
export SSH_AUTH_SOCK=/path/to/ssh-agent
rsync data dns_server:/var/tinydns/data
And why is the new DNS server not getting updates?
Don't be clever (cont)
server$ rsync data new_dns_server:/var/tinydns/data
rsync: mkstemp "/var/tinydns/.tpmuc.lch0gC" failed: Permission denied (13)
new_dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory
Don't be clever (cont)
server$ rsync data new_dns_server:/var/tinydns/data
rsync: mkstemp "/var/tinydns/.tpmuc.lch0gC" failed: Permission denied (13)
new_dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory
server$ rsync data dns_server:/var/tinydns/data
New file uploaded
Merging local changes
Compiling
HAND!
dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory
Don't be clever (cont)
server$ rsync data dns_server:/var/tinydns/data
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
...
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Offering RSA public key: dns_magic
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Entering interactive session.
debug1: Remote: Forced command.
debug1: Sending environment.
New file uploaded
Merging local changes
Compiling
HAND!
Connection to dns_server closed.
server$
Don't be clever (cont)
server$ rsync data dns_server:/var/tinydns/data
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
...
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Offering RSA public key: dns_magic
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Entering interactive session.
debug1: Remote: Forced command.
debug1: Sending environment.
New file uploaded
Merging local changes
Compiling
HAND!
Connection to dns_server closed.
server$
Don't be clever (cont)
$ cat ~/.ssh/authorized_keys
command="/opt/bin/syncw" ssh-rsa AAAAB3NzaC1yc2EAAAAB....
Don't be clever (cont)
$ cat /opt/bin/syncw
#!/bin/sh
rsync --server . /var/dns/upstream_data
/home/bri/bin/makedns
Don't be clever (cont)
$ cat /home/bri/bin/makedns
#!/bin/sh
for dir in /var/dns/tinydns-[0-9]*/root
do
cd $dir
make
done
Don't be clever (cont)
$ cat /var/dns/tinydns-[0-9]*/root/Makefile
SRC_DATA=/var/dns/upstream_data
LOCAL_ZONES=data.local
data.cdb: $(LOCAL_ZONES) $(SRC_DATA) $(DIRS)
sort -u $(LOCAL_ZONES) $(SRC_DATA) >> data
/usr/bin/tinydns-data
Don't be clever (cont)
This could have all been written explicitly
$ rsync data dns_server:/var/tinydns/data
$ ssh dns_server 'cd /var/tinydns && /usr/bin/tinydns-data'
Importance of Analogy
The Datacenter Upgrade
- "That sounds be easy!"
- Airplanes
Find your own Fails!
Any questions?