stem-and-leaf plots in a tweet


Summary: I describe stem plots, how to read them, and how to make them in Python, using 140 characters.

My friend @JarrodMillman, whose office is across the hall, is teaching a computational statistics course that involves a fair amount of programming. He's been grading the homework semi-automatically - with Python scripts that pull the students' latest changes from GitHub, run some tests, write the grade to a JSON file for the student, check it in, and update a master JSON file that's only accessible to Jarrod. It's been fun periodically tagging along and watching his suite of little programs develop. He came in the other day and said "Do you know of any stem plot implementation in python? I found a few, and I'm using one that's ok, but it looks too complicated."

For those unfamiliar - a stem plot, or stem-and-leaf plot, is a more detailed kind of histogram. On the left you have the stem, which is a prefix shared by all entries on the right. To the right of the stem, each entry takes up one character, just like one unit of a bar in a bar chart, but still retains information about its actual value.

So a stem plot of the numbers 31, 41, 59, 26, 53, 58 looks like this:

 2|6
 3|1
 4|1
 5|389

That last line is hard to parse for the uninitiated. There are three entries to the right of the 5 stem, and these three entries (3, 8, and 9) are how the numbers 53, 58, and 59 are concisely represented in a stem plot.
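To make the reading rule concrete, here is a minimal sketch of how each number splits into a stem and a leaf. It assumes non-negative integers and a stem width of 10; the names values, pairs, and leaves_by_stem are just illustrative.

```python
# Split each value into a stem (tens and above) and a leaf (ones digit).
values = [31, 41, 59, 26, 53, 58]

# divmod(59, 10) gives (5, 9): stem 5, leaf 9.
pairs = [divmod(v, 10) for v in sorted(values)]

# Collect the leaves that share a stem, in sorted order.
leaves_by_stem = {}
for stem_value, leaf in pairs:
    leaves_by_stem.setdefault(stem_value, []).append(leaf)

print(leaves_by_stem)
```

Each key is a stem and each list holds that stem's leaves, so the entry for stem 5 carries the leaves 3, 8, and 9.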

As an instructor, you can quickly get a sense of the distribution of grades, without fearing the binning artifacts caused by standard histograms. A stem plot can reveal subtle patterns in the data that are easy to miss with the usual grading histograms that have a bin width of 10. Take this distribution, for example:

70:XXXXXXX
80:XXXXXXXXXXX
90:XXXXXXX

Below are two stem plots which have the same profile as the above, but tell a different story:

 7|7888999
 8|01123477899
 9|3467888

Above is a class that has a rather typical grade distribution that sort of clumps together. But a histogram of the same shape might come from data like this:

 7|0000223
 8|78888999999
 9|0255589

This is a class with 7 students clearly struggling compared to the rest.
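As a quick check of the claim that both classes produce the same histogram, here is a small sketch that bins the grades read off the two stem plots above. The helper binned_counts is a hypothetical name, and the bin width of 10 matches the histogram shown earlier.

```python
from collections import Counter

# Grades read off the two stem plots above.
class_a = [77, 78, 78, 78, 79, 79, 79,
           80, 81, 81, 82, 83, 84, 87, 87, 88, 89, 89,
           93, 94, 96, 97, 98, 98, 98]
class_b = [70, 70, 70, 70, 72, 72, 73,
           87, 88, 88, 88, 88, 89, 89, 89, 89, 89, 89,
           90, 92, 95, 95, 95, 98, 99]

def binned_counts(grades, width=10):
    """Count grades per bin, keyed by the lower edge of each bin."""
    counts = Counter(g - g % width for g in grades)
    return dict(sorted(counts.items()))

counts_a = binned_counts(class_a)
counts_b = binned_counts(class_b)
print(counts_a == counts_b, counts_a)
```

The counts per bin come out identical for the two classes, even though the grades inside each bin are distributed very differently.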

So here's the code for making a stem plot in Python using NumPy. stem() expects an array or list of integers, and prints all stems that span the range of the data provided.

from __future__ import print_function
import numpy as np
def stem(d):
    "A stem-and-leaf plot that fits in a tweet by @ivanov"
    l,t=np.sort(d),10
    O=range(l[0]-l[0]%t,l[-1]+11,t)
    I=np.searchsorted(l,O)
    for e,a,f in zip(I,I[1:],O): print('%3d|'%(f/t),*(l[e:a]-f),sep='')

Yes, it isn't pretty: a fair amount of code golfing went into making this work. It is a good example of the kind of code you should not write, especially since I had a little bit of fun with the variable names, using characters that look similar to others, particularly in sans-serif typefaces (lI10O). Nevertheless, it's kind of fun to fit this much functionality into 140 characters.
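For comparison, here is a sketch of the same idea written for readability rather than for character count. It is not the tweeted code: it swaps np.searchsorted for the standard library's bisect_left, returns the lines instead of printing them, and the names stem_lines, edges, and cuts are my own. It assumes a non-empty sequence of non-negative integers.

```python
from bisect import bisect_left

def stem_lines(data, width=10):
    """Return the lines of a stem-and-leaf plot as a list of strings."""
    values = sorted(data)
    first_stem = values[0] - values[0] % width
    last_stem = values[-1] - values[-1] % width
    # Stem boundaries, plus one extra edge past the last stem so that
    # every stem has both a start and an end.
    edges = list(range(first_stem, last_stem + 2 * width, width))
    # Position in the sorted values where each boundary would be inserted.
    cuts = [bisect_left(values, edge) for edge in edges]
    lines = []
    for start, end, edge in zip(cuts, cuts[1:], edges):
        # Leaves are the offsets of the values from the start of their stem.
        leaves = ''.join(str(v - edge) for v in values[start:end])
        lines.append('%3d|%s' % (edge // width, leaves))
    return lines

for line in stem_lines([31, 41, 59, 26, 53, 58]):
    print(line)
```

The structure mirrors the golfed version: sort, compute the stem boundaries, find where each boundary falls in the sorted data, then print the slice between consecutive boundaries.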

Here's my original tweet: @ivanov/status/443980372192137216

You can test it by running it on some generated data:

>>> data = np.random.poisson(355, 113)
>>> data
array([367, 334, 317, 351, 375, 372, 350, 352, 350, 344, 359, 355, 358,
   389, 335, 361, 363, 343, 340, 337, 378, 336, 382, 344, 359, 366,
   368, 327, 364, 365, 347, 328, 331, 358, 370, 346, 325, 332, 387,
   355, 359, 342, 353, 367, 389, 390, 337, 364, 346, 346, 346, 365,
   330, 363, 370, 388, 380, 332, 369, 347, 370, 366, 372, 310, 348,
   355, 408, 349, 326, 334, 355, 329, 363, 337, 330, 355, 367, 333,
   298, 387, 342, 337, 362, 337, 378, 326, 349, 357, 338, 349, 366,
   339, 362, 371, 357, 358, 316, 336, 374, 336, 354, 374, 366, 352,
   374, 339, 336, 354, 338, 348, 366, 370, 333])
>>> stem(data)
 29|8
 30|
 31|067
 32|566789
 33|00122334456666777778899
 34|02234466667788999
 35|001223445555577888999
 36|12233344556666677789
 37|0000122444588
 38|0277899
 39|0
 40|8

If you prefer to have spaces between entries, take out the sep='' from the last line.

>>> stem(data)
 29| 8
 30|
 31| 0 6 7
 32| 5 6 6 7 8 9
 33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
 34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
 35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
 36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
 37| 0 0 0 0 1 2 2 4 4 4 5 8 8
 38| 0 2 7 7 8 9 9
 39| 0
 40| 8

To skip over empty stems, add "e!=a and " in front of print, so that print is only called when the stem has at least one entry. This will remove the empty 30 stem from the output (useful for data with lots of gaps).

>>> stem(data)
 29| 8
 31| 0 6 7
 32| 5 6 6 7 8 9
 33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
 34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
 35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
 36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
 37| 0 0 0 0 1 2 2 4 4 4 5 8 8
 38| 0 2 7 7 8 9 9
 39| 0
 40| 8
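If you would rather have both tweaks as options than edit the one-liner, here is one possible sketch. It uses a different technique from the tweet: itertools.groupby over the sorted values only ever yields stems that actually occur, so empty stems are skipped without an explicit check. The function name stem_lines_compact and its sep and width parameters are hypothetical, and non-negative integers are assumed.

```python
from itertools import groupby

def stem_lines_compact(data, sep=' ', width=10):
    """Stem-and-leaf lines listing only stems that have at least one leaf."""
    lines = []
    # groupby needs sorted input; it emits one group per stem present.
    for stem_value, group in groupby(sorted(data), key=lambda v: v // width):
        leaves = sep.join(str(v % width) for v in group)
        # Put sep after the bar too, matching print()'s spacing behaviour.
        lines.append('%3d|%s%s' % (stem_value, sep, leaves))
    return lines

for line in stem_lines_compact([298, 310, 316, 317, 408]):
    print(line)
```

Passing sep='' gives the compact form without spaces between leaves.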

Thanks for reading.

@ivanov/status/443981782635921408