Production Trenches: Pitfalls and Pratfalls

Bri Hatch	Personal	Work
Bri Hatch	Onsight, Inc bri@ifokr.org	ExtraHop Networks bri@extrahop.com

Welcome To

Audience

Who should be here?

People who aren't interested in the current state of Software Patent Litigation
People who want to make new mistakes
People who want to know the difference between an SLA and the TSA
People who don't want to see code in presentations
$ egrep '^([j-m].ed|I)$' /usr/share/dict/words
I
lied

A Fail Storm

"Apache was returning blank pages Sunday morning starting at 5:23 - what was wrong?"

A Fail Storm (cont)

Looking at the logs

06/Nov 05:23:59 "GET /index.html HTTP/1.1" 200 42331 ""
06/Nov 05:24:03 "GET /thing1/ HTTP/1.1" 200 76442 "http://example.com/index.html"
06/Nov 05:24:12 "GET /thing2/ HTTP/1.1" 200 65232 "http://example.com/thing1/"
....

But he was in Eastern, server is in Central!

A Fail Storm (cont)

Looking at the right logs

06/Nov 04:23:59 "GET /stuff/ HTTP/1.1" 200 61472 "http://example.com/"
06/Nov 04:24:03 "GET /thing0/ HTTP/1.1" 200 86442 "http://example.com/index.html"
06/Nov 04:24:12 "GET /about/index.html HTTP/1.1" 200 57774 ""
....

But the local clock is also wrong!

A Fail Storm (cont)

Looking at the right logs - really this time!

06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
06/Nov 02:00:03 "GET /search/ HTTP/1.1" 200 92186 "http://example.com/about/index.html"
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"
....

Still everything looks good - WTF?

A Fail Storm (cont)

06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"

Let's parse!

$  awk '{print $7, $2, $4}' access.log | sort -n

0 02:00:04 /thing3/
0 02:00:09 /thing0/
0 02:00:24 /checkout/
0 02:00:32 /checkout/
...
55424 02:00:01 /thing0/
...
78505 02:00:42 /thing3/
78505 02:59:42 /thing3/
...
92186 02:00:03 /search/

 But... But...

A Fail Storm (cont)

$ grep '06/Nov 02:.* /thing3/.* 0 ' access.log | wc -l
29


$ grep '06/Nov 02:.* /thing3/.* 0 ' access.log | head -1
06/Nov 02:00:04 "GET /thing3/ HTTP/1.1" 200 0 "http://example.com/thing1/"



$ grep '06/Nov 02:.* /thing3/.* ' access.log | head -1
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"

A Fail Storm (cont)

So it has come to this...


$ less access.log
...
06/Nov 02:59:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/"
06/Nov 02:59:58 "GET /cart/ HTTP/1.1" 200 42331 "http://example.com/toolbox/"


06/Nov 02:00:04 "GET /thing3/ HTTP/1.1" 200 0 "http://example.com/thing1/"
06/Nov 02:00:09 "GET /thing0/ HTTP/1.1" 200 0 "http://example.com/"
06/Nov 02:00:24 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/"
06/Nov 02:00:32 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/"
...

 DST change! You Bastard!

A Fail Storm (cont)

Why did our monitoring not catch this issue?

$ /usr/lib/nagios/plugins/check_http  -I example.com
HTTP OK: HTTP/1.1 200 OK - 0 bytes in 0.006 second response time

A Fail Storm (cont)

We happened to have dumps on disk - Wireshark Time!

$ ps -ef | grep tcpdump
tcpdump -n -s 9999 -w /bigdisk/dump.out -G 3600

Blame enough to go around.

The backend DB was at fault - it returned zero rows for all queries.
The application was at fault - it did not handle this correctly.
The monitoring was at fault - it thought things were fine.

Takeaways

Time sychronization with NTP
Timezone standardization with UTC
Better Monitoring
Off-box logs / Aggregation
Serving errors is really easy and fast!
Never 200 for errors!
Wire Data

Logs Lie

Logs Lie - WTF?

Say what the system thought it did
Only log what the programmer(s) thought to log
Hard to correlate
Always get multiple sources to reach the truth!

Examples:

poor elapsed time computations
write(2) handoff to kernel
logs with ugly date formats (Y U No Year, syslog?)

Oncall

Typical Oncall Week

Alerting

Page
Ticket
Log

Pageworthy?

Pages must

Go to the right person
Be time critical
Be actionable
Require human intelligence to remediate
Have an escalation path

Not pagewothy!

If you say any of these, it should not page.

"I can finish lunch first."
"This always happens this time of day."
"It typically clears itself up."
"I just run this script and it's fine again."
"Only so-and-so can fix this."

Alert Criteria

Alert based on

Known-bad
Percentages
Trends
Rapid Changes

Don't alert based on

Thresholds (usually)
Partial impairment

SLAs

Service Level Indicator - something you can measure

SLO
- Service Level Objective - a value to obtain
SLA
- Service Level Agreement - an incentive to meet your SLO

SLAs (cont)

Example SLO: Measure of uptime

99% == 3.7 d/y, 7 h/mo
99.9% == 8.7 h/y, 44 m/mo
99.99% == 56 m/y, 4 m/mo
99.999% == 5 m/y, 30 s/mo

55.5555555% == 162 d/y, 324 h/mo

SLAs (cont)

SLOs can be any sort of measure, such as

X% served at less than Y ms
Error ratio never greater than X%
Serve small files under 500ms, medium files under 1s, large files under 3s

Waffly SLOs

Never down for more than 5 consecutive minutes in a one hour window
Less than 0.1% unscheduled downtime
Message guaranteed to be delivered no more than 72 hours after being sent

SLAs (cont)

What really is an SLA?

Measure of risk-taking potential!

99% == 3.7 d/y, 7 h/mo
99.9% == 8.7 h/y, 44 m/mo
99.99% == 56 m/y, 4 m/mo
99.999% == 5 m/y, 30 s/mo

When SLA violated...

There are many things you can do when SLA is not met

Dedicate a person to improving the system
Block all feature work until back in SLA
Give the pager to the devs

Oncall

Later Oncall Weeks

Alerting

Page
Ticket
Log

Oncall Tools

Alertless monitors
Documentation
Oncall Playbook
Call Trees

Monitoring Evolution

Monitoring is an ongoing process

check_http - but only showed up/down of some pages
WWW::Mechanize - but never in sync w/ product changes
Server self-check - but broke LB timeouts
/fasthealth and /fullhealth - but took time to roll out everywhere
Standardized URL - but pool misconfigs possible
Standardized on /fasthealth?site=www and /fullhealth?site=www

Monitoring (cont)

Catch the unknown

RUM
Trending
Wire Data
Angry Users

Monitoring (cont)

Monitor the right things

define command {
    command_name http_port_grep_header  
    command_line /usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$'
                -I '$HOSTADDRESS$' -p '$ARG1$' -u '$ARG2$' -N
}
...
define service {
    check_command http_port_grep_header!3142!/acng-report.html
    hostgroup_name apt-cacher-ng
    service_description apt_cacher_ng
    use generic_service_template
}
...
define host {
    host_name vm-103
    address 192.168.253.288
    hostgroups apt-cacher-ng
    ...

Monitoring (cont)

Monitor the right things

Monitor the user-facing name, e.g. the CNAME
Use and rely on DNS
Set up dependencies

Backups

Daily/weekly/monthly
Local? Remote? Cloud?
Yes! The more the better!
Never underestimate the bandwidth of a pack of Emus
If you haven't restored, you're not backing up

Backups (cont)

From my pager conversation this am:

The website is broken!
We turned off local backups - we needed the space!
Remote rsync backups have been broken for months now - haven't had time to figure out why!
The cloud backups stopped when we stopped paying for the service!

Backups (cont)

Pager conversation continued:

We removed the old logs - we needed more space!
We turned off new logs - they slowed down the computer!
We didn't maintain your monitoring - takes too much time to keep current!
We never implemented any other monitoring systems - costs money!

So... can you figure out what's wrong?

Testing

Unit Tests!
Good test data

Fuzzing
Shakespeare
... questionable content

Limit yourself

Restrictions are freeing

Limited number of languages
Rigid style guides
Hypervisors / Clouds
Code Review
PM
Tools and infrastructure

Use the tools as designed

The tool author wrote it that way for a reason
You are not smarter than the tool author
If you are, use a different tool

git is in C so contributors have a high bar

git becoming svn w/ incrementing revision numbers, single commits only

team wrote a lot of work on new branch, could not push

Don't be clever

Longer but more readable code is better than shorter line noise (see: perl)
You are not smarter than your coworkers
You are smarter than your 6 month older self

Don't be clever (cont)

What does this do?

#!/bin/bash

set -e
set -u

cd $(dirname $0)
git pull --rebase
cd path/to/dnsstuff
export SSH_AUTH_SOCK=/path/to/ssh-agent

rsync data dns_server:/var/tinydns/data

And why is the new DNS server not getting updates?

Don't be clever (cont)

server$ rsync data new_dns_server:/var/tinydns/data
rsync: mkstemp "/var/tinydns/.tpmuc.lch0gC" failed: Permission denied (13)

new_dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory

Don't be clever (cont)

server$ rsync data new_dns_server:/var/tinydns/data
rsync: mkstemp "/var/tinydns/.tpmuc.lch0gC" failed: Permission denied (13)

new_dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory


server$ rsync data dns_server:/var/tinydns/data
New file uploaded
Merging local changes
Compiling
HAND!

dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory

Don't be clever (cont)

server$ rsync data dns_server:/var/tinydns/data
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
...
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Offering RSA public key: dns_magic
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Entering interactive session.
debug1: Remote: Forced command.
debug1: Sending environment.
New file uploaded
Merging local changes
Compiling
HAND!
Connection to dns_server closed.
server$

Don't be clever (cont)

server$ rsync data dns_server:/var/tinydns/data
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
...
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Offering RSA public key: dns_magic
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Entering interactive session.
debug1: Remote: Forced command.
debug1: Sending environment.
New file uploaded
Merging local changes
Compiling
HAND!
Connection to dns_server closed.
server$

Don't be clever (cont)

$ cat ~/.ssh/authorized_keys
command="/opt/bin/syncw" ssh-rsa AAAAB3NzaC1yc2EAAAAB....

Don't be clever (cont)

$ cat /opt/bin/syncw
#!/bin/sh
rsync --server . /var/dns/upstream_data
/home/bri/bin/makedns

Don't be clever (cont)

$ cat /home/bri/bin/makedns
#!/bin/sh
for dir in /var/dns/tinydns-[0-9]*/root
do
  cd $dir
  make
done

Don't be clever (cont)

$ cat /var/dns/tinydns-[0-9]*/root/Makefile
SRC_DATA=/var/dns/upstream_data
LOCAL_ZONES=data.local

data.cdb: $(LOCAL_ZONES) $(SRC_DATA)  $(DIRS)
    sort -u $(LOCAL_ZONES) $(SRC_DATA) >> data
    /usr/bin/tinydns-data

Don't be clever (cont)

This could have all been written explicitly

$ rsync data dns_server:/var/tinydns/data
$ ssh dns_server 'cd /var/tinydns && /usr/bin/tinydns-data'

Importance of Analogy

The Datacenter Upgrade

"That sounds be easy!"
Airplanes

Find your own Fails!

Any questions?

Thanks!

LFNW Talk/Evaluation Page: http://lfnw.org/node/3898
Presentation Video on YouTube

Personal	Work
Bri Hatch Onsight, Inc bri@ifokr.org	Bri Hatch ExtraHop Networks bri@extrahop.com

LFNW, 2016-04-23