Summary: I describe stem plots, how to read them, and how to make them in Python, using 140 characters.
My friend @JarrodMillman, whose office is across the hall, is teaching a computational statistics course that involves a fair amount of programming. He's been grading these homeworks semi-automatically, with Python scripts that pull the students' latest changes from GitHub, run some tests, spit out the grade to a JSON file for the student, check it in, and update a master JSON file that's only accessible to Jarrod. It's been fun periodically tagging along and watching his suite of little programs develop. He came in the other day and said, "Do you know of any stem plot implementation in Python? I found a few, and I'm using one that's ok, but it looks too complicated."
For those unfamiliar: a stem plot, or stem-and-leaf plot, is a more detailed kind of histogram. On the left you have the stem, which is a prefix to all entries on the right. To the right of the stem, each entry takes up one space, just like in a bar chart, but still retains information about its actual value.
So a stem plot of the numbers 31, 41, 59, 26, 53, 58 looks like this:
2|6
3|1
4|1
5|389
That last line is hard to parse for the uninitiated. There are three entries to the right of the 50 stem, and these three entries, 3, 8, and 9, are how the numbers 53, 58, and 59 are concisely represented in a stem plot.
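If it helps to think of it in code, splitting a number into its stem and leaf is just integer division and remainder by 10. Here's a small sketch of that idea. The helper name group_leaves is made up for illustration, and this is not the tweet-sized implementation that comes later in this post:

```python
from collections import defaultdict

def group_leaves(numbers):
    """Map each stem (the tens) to the sorted list of its leaves (the ones)."""
    groups = defaultdict(list)
    for n in sorted(numbers):
        stem_part, leaf = divmod(n, 10)  # e.g. 59 -> stem 5, leaf 9
        groups[stem_part].append(leaf)
    return dict(groups)

print(group_leaves([31, 41, 59, 26, 53, 58]))
# {2: [6], 3: [1], 4: [1], 5: [3, 8, 9]}
```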
As an instructor, you can quickly get a sense of the distribution of grades, without fearing the binning artifacts caused by standard histograms. A stem plot can reveal subtle patterns in the data that are easy to miss with the usual grading histograms that have a bin width of 10. Take this distribution, for example:
70:XXXXXXX
80:XXXXXXXXXXX
90:XXXXXXX
Below are two stem plots that have the same profile as the histogram above, but tell different stories:
7|7888999
8|01123477899
9|3467888
Above is a class that has a rather typical grade distribution that sort of clumps together. But a histogram of the same shape might come from data like this:
7|0000223
8|78888999999
9|0255589
This is a class with 7 students clearly struggling compared to the rest.
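To convince yourself that the two classes really do collapse to the same histogram, here's a quick sketch that expands each stem plot back into grades and counts them in bins of width 10. The helper names expand and bin_counts are my own inventions for this illustration:

```python
from collections import Counter

def expand(plot):
    """Turn {stem: 'leaf digits'} back into a flat list of integer grades."""
    return [stem * 10 + int(leaf) for stem, leaves in plot.items() for leaf in leaves]

def bin_counts(grades, width=10):
    """Count grades per bin, keyed by the lower edge of each bin."""
    return Counter(g - g % width for g in grades)

clumped = expand({7: "7888999", 8: "01123477899", 9: "3467888"})
struggling = expand({7: "0000223", 8: "78888999999", 9: "0255589"})

print(sorted(bin_counts(clumped).items()))              # [(70, 7), (80, 11), (90, 7)]
print(bin_counts(clumped) == bin_counts(struggling))    # True
```

The bin counts are identical, so any difference between the two classes is invisible until you look at the leaves.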
So here's the code for making a stem plot in Python using NumPy. stem() expects an array or list of integers, and prints all stems that span the range of the data provided.
from __future__ import print_function
import numpy as np
def stem(d):
    "A stem-and-leaf plot that fits in a tweet by @ivanov"
    l,t=np.sort(d),10
    O=range(l[0]-l[0]%t,l[-1]+11,t)
    I=np.searchsorted(l,O)
    for e,a,f in zip(I,I[1:],O): print('%3d|'%(f/t),*(l[e:a]-f),sep='')
Yes, it isn't pretty; a fair amount of code golfing went into making this work. It is a good example of the kind of code you should not write, especially since I had a little bit of fun with the variable names, using characters that look similar to others, especially in sans-serif typefaces (lI10O).
Nevertheless, it's kind of fun to fit this much functionality into 140 characters.
Here's my original tweet: @ivanov/status/443980372192137216
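For anyone who would rather not squint at lI10O, here is roughly the same logic with readable names. Treat it as a sketch under a few assumptions: it targets Python 3, the name stem_readable is mine, and it returns the lines as a list instead of printing them so they are easier to inspect.

```python
import numpy as np

def stem_readable(data):
    """Return the lines of a stem-and-leaf plot for a sequence of integers."""
    width = 10
    values = np.sort(np.asarray(data))
    first_edge = values[0] - values[0] % width
    # Lower edge of every stem, plus one extra edge past the maximum
    # so that the last stem has an upper boundary.
    edges = np.arange(first_edge, values[-1] + width + 1, width)
    # For each edge, the position in `values` where that stem's entries begin.
    starts = np.searchsorted(values, edges)
    lines = []
    for lo, hi, edge in zip(starts, starts[1:], edges):
        leaves = "".join(str(v - edge) for v in values[lo:hi])
        lines.append("%3d|%s" % (edge // width, leaves))
    return lines

print("\n".join(stem_readable([31, 41, 59, 26, 53, 58])))
```

The one trick worth noticing is np.searchsorted: because the data are sorted, looking up each stem's lower edge gives the slice boundaries for every stem in a single call.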
You can test stem() by running it on some generated data:
>>> data = np.random.poisson(355, 113)
>>> data
array([367, 334, 317, 351, 375, 372, 350, 352, 350, 344, 359, 355, 358,
       389, 335, 361, 363, 343, 340, 337, 378, 336, 382, 344, 359, 366,
       368, 327, 364, 365, 347, 328, 331, 358, 370, 346, 325, 332, 387,
       355, 359, 342, 353, 367, 389, 390, 337, 364, 346, 346, 346, 365,
       330, 363, 370, 388, 380, 332, 369, 347, 370, 366, 372, 310, 348,
       355, 408, 349, 326, 334, 355, 329, 363, 337, 330, 355, 367, 333,
       298, 387, 342, 337, 362, 337, 378, 326, 349, 357, 338, 349, 366,
       339, 362, 371, 357, 358, 316, 336, 374, 336, 354, 374, 366, 352,
       374, 339, 336, 354, 338, 348, 366, 370, 333])
>>> stem(data)
29|8
30|
31|067
32|566789
33|00122334456666777778899
34|02234466667788999
35|001223445555577888999
36|12233344556666677789
37|0000122444588
38|0277899
39|0
40|8
If you prefer to have spaces between entries, take out the sep='' from the last line.
>>> stem(data)
29| 8
30|
31| 0 6 7
32| 5 6 6 7 8 9
33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
37| 0 0 0 0 1 2 2 4 4 4 5 8 8
38| 0 2 7 7 8 9 9
39| 0
40| 8
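To skip over empty stems, add e!=a and in front of print, so that the loop body reads e!=a and print(...). The print call then only runs when a stem has at least one entry. This will remove the 300 stem from the output (useful for data with lots of gaps).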
>>> stem(data)
29| 8
31| 0 6 7
32| 5 6 6 7 8 9
33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
37| 0 0 0 0 1 2 2 4 4 4 5 8 8
38| 0 2 7 7 8 9 9
39| 0
40| 8
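Spelled out in full, a version with both tweaks applied (no sep='' and the e!=a guard) might look like the sketch below. I'm assuming Python 3 here, where print is a function and can sit on the right-hand side of and; the name stem_skip_empty is only for illustration.

```python
import numpy as np

def stem_skip_empty(d):
    "Stem-and-leaf plot with spaced leaves that skips empty stems"
    l,t=np.sort(d),10
    O=range(l[0]-l[0]%t,l[-1]+11,t)
    I=np.searchsorted(l,O)
    # e and a are the start and end indices of a stem's entries;
    # when they are equal the stem is empty and print is never called.
    for e,a,f in zip(I,I[1:],O): e!=a and print('%3d|'%(f/t),*(l[e:a]-f))

stem_skip_empty([298, 310, 316, 408])
```

The guard relies on short-circuit evaluation: when e!=a is false, Python never evaluates the right-hand side of and, so nothing is printed for that stem.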
Thanks for reading.
@ivanov/status/443981782635921408