07 Oct 2011

Unix home field advantage - vim and awk

Baseball and Unix

I've been a baseball fan my entire life. Back in 1995, I saved up my money and bought the Strat-O-Matic player cards for 10 different teams. For years I rolled those dice and battled wits with my stepdad, my friend Stevie down the street, and anyone else I could rope into playing with me. I read Bill James and bought into sabermetrics as a middle schooler. A half-dozen years later I found CSFBL and tested my managerial skills against other players.

What does this have to do with Unix? Tonight I watched the Tigers beat the Yankees in a close Game 5. I didn't follow much baseball this year, so I checked out the final standings to see how the season wrapped up. While looking at them, I noticed that the home team seemed to win a disproportionately high number of games, and I was curious what the exact percentage was. Unix to the rescue!

Sometimes copy-paste and vim is the easiest way

I usually use Python's Beautiful Soup library to extract the content I want from webpages. In this case, unfortunately, the data I wanted was not part of a static webpage, so I fell back on the easiest extraction method of all: copy-paste.

I pasted it into a vim buffer as follows (anything after ';' is a comment):

:set paste<RET>  ; set to paste verbatim mode (to avoid auto-formatting)
i<C-S-V><ESC>    ; Insert, paste with the terminal's paste shortcut (Ctrl-Shift-V in gnome-terminal), Escape

The result was several columns, and I was only concerned with the Home and Road records. With vim, it's easy enough to isolate these two columns using macro editing.

gg    ; return to the beginning of the file
qq    ; begin recording into the 'q' register
$     ; move to the end of the line
5dB   ; delete back to the beginning of the (space-delimited) word 5x
4x    ; delete 4 characters
BB    ; go back two (space-delimited) words
d^    ; delete from cursor to the beginning of the line
jq    ; proceed to the following line and end macro recording
@q    ; repeat 'q' macro once (confirm it did what you expected)
99@@  ; repeat the last macro up to 99 more times (it stops at the end of the file)

Now we just need to clean up the file by removing the non-record lines (e.g., heading lines like "WCGB L10") using a simple dd on each one. Once we've done that, we can save the file (:w records). The result is a file that looks like this:

52-29   45-36
47-34   44-37
45-36   45-36
42-39   39-42
39-42   30-51
50-31   45-36
44-37   36-45
36-45   43-38
40-41   31-50
33-48   30-51
52-29   44-37
45-36   41-40
43-38   31-50
39-45   28-50
52-29   50-31
47-34   42-39
44-36   36-45
34-47   43-38
31-47   41-43
57-24   39-42
45-36   45-36
42-39   37-44
36-45   36-45
39-42   32-49
31-50   25-56
51-30   43-38
46-35   40-41
42-39   40-40
38-43   35-46
35-46   36-45

sed and awk finish the job

But how to determine the home win percentage? Time to whip out Python? I elected to use awk instead:

sed 's/-/ /g' records |
  awk '{hw += $1; hl += $2; aw += $3; al += $4}
    END{printf "HOME\n  Wins: %i\n  Losses: %i\n  Win %%: %.3f\nAWAY\n  Wins: %i\n  Losses: %i\n", hw, hl, hw*1.0/(hw+hl), aw, al}'

And the resulting output:

HOME
  Wins: 1277
  Losses: 1152
  Win %: 0.526
AWAY
  Wins: 1152
  Losses: 1277

We can do a quick sanity check by comparing Home W-L to Away W-L. As expected, their values are complements (for every home win, there is an away loss, and vice versa). And here we get the answer we were looking for: the home winning percentage for all Major League teams in 2011 was .526.
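
If you want the pipeline to make that comparison for you, a small variation on the END block will do it (a quick sketch, reusing the same records file and field order):

sed 's/-/ /g' records |
  awk '{hw += $1; hl += $2; aw += $3; al += $4}
    END{if (hw == al && hl == aw) print "home/away totals are complementary"; else print "mismatch: check the pasted data"}'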

The sed command is pretty basic: it simply replaces the '-' in the W-L records with a space. The 'g' flag tells sed to replace every occurrence on a line, rather than stopping at the first match (the default). We use sed in this way because awk splits on whitespace sequences by default, so this step ensures that awk recognizes four fields in each row: Home Wins, Home Losses, Away Wins, and Away Losses.
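
To see the effect on a single record line, you can run the substitution by hand:

echo '52-29   45-36' | sed 's/-/ /g'
# prints: 52 29   45 36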

The awk command is where it gets fun. The first thing to note is that the first {...} portion is executed once for each line in the file. Each $ variable represents a field in the row, so $1 is Home Wins, $2 is Home Losses, and so on. I'm totaling these values by accumulating them in the variables hw, hl, aw, and al. Variables in awk are automatically initialized to 0.

The END pattern separates line-by-line processing from post-processing: the second {...} portion is executed once, after the entire file has been read. Inside it we use printf much as it is used in C, with a format string followed by the values to substitute.
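
Here is the same accumulate-then-report pattern in miniature, with made-up numbers in place of the standings data:

printf '1 2\n3 4\n' |
  awk '{a += $1; b += $2} END{printf "totals: %d %d\n", a, b}'
# prints: totals: 4 6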

Conclusion

And that's that! With just a few dozen keystrokes in vim, it was possible to transform a copy-paste mess into two clean, consistent columns. Using sed for preprocessing, I passed the result to an awk script that calculated and reported the data I was looking for.

This is where Unix really shines. Textual data is only a few minutes away from being useful information. I love using Unix tools for one-off tasks; often they later transform into more general, longer-lived bash scripts. Once a script reaches a certain size, however, it usually makes sense to either break it up or translate it into a more general programming language like Python.
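
To illustrate that last point, here is roughly what this one-off might look like if it graduated into a small script. This is only a sketch; the name homefield.sh and its single file argument are my own choices, not part of anything above.

#!/bin/sh
# homefield.sh -- hypothetical wrapper around the pipeline above
# Usage: ./homefield.sh records
# Expects rows of "HW-HL  AW-AL", as produced by the vim editing session.

file=${1:?usage: homefield.sh records-file}

sed 's/-/ /g' "$file" |
  awk '{hw += $1; hl += $2; aw += $3; al += $4}
    END{printf "HOME\n  Wins: %i\n  Losses: %i\n  Win %%: %.3f\nAWAY\n  Wins: %i\n  Losses: %i\n",
        hw, hl, hw*1.0/(hw+hl), aw, al}'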