Other articles


  1. starting my job search

    I am starting to look for a job in the San Francisco Bay Area.

    Since many recruiters ask for and presumably look at GitHub profiles, I decided to give mine a little facelift:

    Smart and Gets Things Done GitHub Contribution Graph

    In case you aren't familiar, that banner was motivated by Joel Spolsky's Smart and Gets Things Done, which is a book about hiring good developers. So I decided to tweet it out, mentioning @spolsky, and he favorited it!

    @ivanov/status/476932602587123712

    Yesterday, I decided to tweet out an image that's at the top of my resume as a standalone tweet, mentioning Joel Spolsky again. He liked it well enough to retweet it to his 90 thousand followers, so it's been getting plenty of love.

    Paul Ivanov's Visual Resume

    @ivanov/status/477477547957944321

    @ivanov/status/477520571907842048

    Perhaps unsurprisingly, the only person to contact me as a result of this so far is a reporter from Business Insider:

    My editor would like to post it on our site as an example of a creative way to format a resume... I'm wondering if we can get your permission to do this?

    So that's what prompted this post: I simply added my name and a Creative Commons Attribution Licence (CC-BY) to the two images, and then sent my permission along.

    Outside of that, no prospective employers have gotten in touch. But like I always say: you can't win the lottery if you don't buy a ticket. And since I also enjoy mixing metaphors, I'll just keep on fishing!

    permalink
  2. stem-and-leaf plots in a tweet


    Summary: I describe stem plots, how to read them, and how to make them in Python, using 140 characters.

    My friend @JarrodMillman, whose office is across the hall, is teaching a computational statistics course that involves a fair amount of programming. He's been grading these homeworks semi-automatically, with Python scripts that pull the students' latest changes from GitHub, run some tests, write the grade to a JSON file for the student, check it in, and update a master JSON file that's only accessible to Jarrod. It's been fun periodically tagging along and watching his suite of little programs develop. He came in the other day and said, "Do you know of any stem plot implementation in python? I found a few, and I'm using one that's ok, but it looks too complicated."

    For those unfamiliar: a stem plot, or stem-and-leaf plot, is a more detailed kind of histogram. On the left you have the stem, which is a prefix to all entries on the right. To the right of the stem, each entry takes up one space, just like in a bar chart, but still retains information about its actual value.

    So a stem plot of the numbers 31, 41, 59, 26, 53, 58 looks like this:

     2|6
     3|1
     4|1
     5|389
    

    That last line is hard to parse for the uninitiated. There are three entries to the right of the 50 stem, and these three entries (3, 8, and 9) are how the numbers 53, 58, and 59 are concisely represented in a stem plot.

    As an instructor, you can quickly get a sense of the distribution of grades, without fearing the binning artifacts caused by standard histograms. A stem plot can reveal subtle patterns in the data that are easy to miss with the usual grading histograms that have a bin width of 10. Take this distribution, for example:

    70:XXXXXXX
    80:XXXXXXXXXXX
    90:XXXXXXX
    

    Below are two stem plots which have the same profile as the above, but tell a different story:

     7|7888999
     8|01123477899
     9|3467888
    

    Above is a class that has a rather typical grade distribution that sort of clumps together. But a histogram of the same shape might come from data like this:

     7|0000223
     8|78888999999
     9|0255589
    

    This is a class with 7 students clearly struggling compared to the rest.
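    To double-check that the two classes above really do land in identical bins, here is a small sketch that reads both stem plots back into scores and counts them in bins of width 10. The helper names `parse_stem_plot` and `histogram_by_tens` are made up for this illustration and are not part of any library.

```python
from collections import Counter

def parse_stem_plot(text):
    """Turn rows like ' 7|7888999' back into a list of integers."""
    values = []
    for line in text.strip().splitlines():
        stem, leaves = line.split('|')
        # every leaf is a single digit appended to its stem
        values.extend(int(stem) * 10 + int(leaf) for leaf in leaves.strip())
    return values

def histogram_by_tens(values):
    """Count how many values fall into each bin of width 10."""
    return dict(sorted(Counter(v // 10 * 10 for v in values).items()))

clumped = parse_stem_plot("""
 7|7888999
 8|01123477899
 9|3467888
""")
struggling = parse_stem_plot("""
 7|0000223
 8|78888999999
 9|0255589
""")

print(histogram_by_tens(clumped))      # {70: 7, 80: 11, 90: 7}
print(histogram_by_tens(struggling))   # {70: 7, 80: 11, 90: 7}
print(min(clumped), min(struggling))   # 77 70
```

    The binned counts are the same for both classes, while the raw scores (for instance the lowest one) are not, which is exactly the information the stem plot keeps.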

    So here's the code for making a stem plot in Python using NumPy. stem() expects an array or list of integers, and prints all stems that span the range of the data provided.

    from __future__ import print_function
    import numpy as np
    def stem(d):
        "A stem-and-leaf plot that fits in a tweet by @ivanov"
        l,t=np.sort(d),10
        O=range(l[0]-l[0]%t,l[-1]+11,t)
        I=np.searchsorted(l,O)
        for e,a,f in zip(I,I[1:],O): print('%3d|'%(f/t),*(l[e:a]-f),sep='')
    

    Yes, it isn't pretty; a fair amount of code golfing went into making this work. It is a good example of the kind of code you should not write, especially since I had a little bit of fun with the variable names, using characters that look similar to others, particularly in sans-serif typefaces (lI10O). Nevertheless, it's kind of fun to fit this much functionality into 140 characters.
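    For comparison, here is a rough sketch of what a non-golfed version might look like. It is not the tweeted code: `stem_lines` and its `skip_empty` flag are names invented for this sketch, it uses only the standard library, it assumes non-negative integers, and it returns the rows instead of printing them.

```python
def stem_lines(data, skip_empty=False):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    values = sorted(int(v) for v in data)
    if not values:
        return []

    # Group the last digit (the leaf) of each value under its stem.
    leaves_by_stem = {}
    for value in values:
        stem, leaf = divmod(value, 10)
        leaves_by_stem.setdefault(stem, []).append(str(leaf))

    rows = []
    # Walk every stem in the data range so that gaps stay visible.
    for stem in range(values[0] // 10, values[-1] // 10 + 1):
        leaves = leaves_by_stem.get(stem, [])
        if skip_empty and not leaves:
            continue
        rows.append('%3d|%s' % (stem, ''.join(leaves)))
    return rows

for row in stem_lines([31, 41, 59, 26, 53, 58]):
    print(row)
```

    Returning a list of strings rather than printing makes the function easy to test, at the cost of many more than 140 characters.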

    Here's my original tweet: @ivanov/status/443980372192137216

    You can test it by running it on some generated data:

    >>> data = np.random.poisson(355, 113)
    >>> data
    array([367, 334, 317, 351, 375, 372, 350, 352, 350, 344, 359, 355, 358,
       389, 335, 361, 363, 343, 340, 337, 378, 336, 382, 344, 359, 366,
       368, 327, 364, 365, 347, 328, 331, 358, 370, 346, 325, 332, 387,
       355, 359, 342, 353, 367, 389, 390, 337, 364, 346, 346, 346, 365,
       330, 363, 370, 388, 380, 332, 369, 347, 370, 366, 372, 310, 348,
       355, 408, 349, 326, 334, 355, 329, 363, 337, 330, 355, 367, 333,
       298, 387, 342, 337, 362, 337, 378, 326, 349, 357, 338, 349, 366,
       339, 362, 371, 357, 358, 316, 336, 374, 336, 354, 374, 366, 352,
       374, 339, 336, 354, 338, 348, 366, 370, 333])
    >>> stem(data)
     29|8
     30|
     31|067
     32|566789
     33|00122334456666777778899
     34|02234466667788999
     35|001223445555577888999
     36|12233344556666677789
     37|0000122444588
     38|0277899
     39|0
     40|8
    

    If you prefer to have spaces between entries, take out the sep='' from the last line.

    >>> stem(data)
     29| 8
     30|
     31| 0 6 7
     32| 5 6 6 7 8 9
     33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
     34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
     35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
     36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
     37| 0 0 0 0 1 2 2 4 4 4 5 8 8
     38| 0 2 7 7 8 9 9
     39| 0
     40| 8
    

    To skip over empty stems, add e!=a and in front of print. This will remove the 300 stem from the output (useful for data with lots of gaps).

    >>> stem(data)
     29| 8
     31| 0 6 7
     32| 5 6 6 7 8 9
     33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
     34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
     35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
     36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
     37| 0 0 0 0 1 2 2 4 4 4 5 8 8
     38| 0 2 7 7 8 9 9
     39| 0
     40| 8
    

    Thanks for reading.

    @ivanov/status/443981782635921408

    permalink
  3. Money and CA Propositions

    Since tomorrow we'll be having another one of those practice democracy drills here in California, I thought I'd put together a few bar charts.

    There are five propositions on tomorrow's ballot. In researching them, Lena came across the Cal-Access Campaign Finance Activity: Propositions & Ballot Measures.

    Unfortunately, for each proposition, you have to click through each committee to get the details for the amount of money they've raised and spent. Here's a run-down in visual form; the only data manipulation I did was rounding to the nearest dollar. Note: no committees formed to support or oppose Proposition 13.

    Here's how much money was raised, by proposition:

    Money Raised

    Just in case you didn't get the full picture, here is the same data plotted on a common scale:

    Money Raised (common scale)

    And the same two plots for money spent (I don't fully understand what these numbers mean, as some groups' "Total Expenditures" exceed their "Total Contributions" and still had positive "Ending Cash"):

    Money Spent

    Money Spent (common scale)

    It could just be my perception of things, but I get pretty suspicious when there's a ton of money involved in politics, especially when it's this lopsided.

    The only thing I have to add is you should Vote "YES" on Prop 15, because Larry Lessig says so, and so do the Alameda County Greens!

    Update #1: Let me write it out in text, so that the search engines have an easier time finding this. According to the official record from Cal-Access (Secretary of State), as of May 22nd, 2010, there was $54.4 million spent in support of various propositions, most notably $40.5 million on Prop 16, $8.9 million on Prop 17, and $4.6 million on Prop 14. Compare that with a "grand" total of less than $1.2 million spent to oppose them, with a trivial $78 thousand (!!) to oppose Prop 16's $40.5 million deep pockets.
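    The totals quoted in this update can be re-derived from the per-proposition expenditure figures that appear in the plotting script further down. The snippet is just a quick arithmetic check; the dictionary layout is my own choice for illustration.

```python
# Expenditures per proposition as (support, oppose), copied from the
# Cal-Access figures hard-coded in the plotting script in this post.
spent = {
    14: (4623830.07, 52796.71),
    15: (264136.30, 86822.79),
    16: (40582036.58, 78063.91),
    17: (8932786.06, 965218.48),
}

support_total = sum(yes for yes, no in spent.values())
oppose_total = sum(no for yes, no in spent.values())

print('support: $%.1f million' % (support_total / 1e6))  # support: $54.4 million
print('oppose:  $%.2f million' % (oppose_total / 1e6))   # oppose:  $1.18 million
```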

    Update #2: The California Voter Foundation included more recent totals (they don't seem to be that different), as well as a listing of the top 5 donors for each side of a proposition in their Online Voter Guide.

    Also, here's the Python code used to generate these plots:

    # Create contributions and expenditures bar charts of committees supporting and
    # opposing various propositions on the California Ballot for June 8th, 2010
    # created by Paul Ivanov (http://pirsquared.org)
    
    # figure(0) - Contributions by Proposition (as subplots)
    # figure(1) - Expenditures by Proposition (as subplots)
    # figure(2) - Contributions on a common scale
    # figure(3) - Expenditures on a common scale
    
    import numpy as np
    from matplotlib import pyplot as plt
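    import locale
    # locale.currency() raises ValueError under the default 'C' locale, so
    # switch to the user's locale (assumed here to be a US-dollar one, e.g. en_US)
    locale.setlocale(locale.LC_ALL, '')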
    
    # This part was done by hand by collecting data from CalAccess:
    # http://cal-access.sos.ca.gov/Campaign/Measures/
    prop = np.array([
         4650694.66, 4623830.07    # Yes on 14 Contributions, Expenditures
        , 216050, 52796.71         # No  on 14 Contributions, Expenditures
        , 118807.45, 264136.30     # Yes on 15 Contributions, Expenditures
        , 200750.01, 86822.79      # No  on 15 Contributions, Expenditures
        , 40706258.17, 40582036.58 # Yes on 16 Contributions, Expenditures
        , 83187.29, 78063.91       # No  on 16 Contributions, Expenditures
        , 10328675.12, 8932786.06  # Yes on 17 Contributions, Expenditures
        , 1229783.79, 965218.48    # No  on 17 Contributions, Expenditures
        ])
    prop.shape = -1,2,2
    
    def currency(x, pos):
        """The two args are the value and tick position"""
        if x==0:
            return "$0"
        if x < 1e3:
            return '$%1.0f' % (x)  # whole dollars, not six decimal places
        elif x< 1e6:
            return '$%1.0fK' % (x*1e-3)
        return '$%1.0fM' % (x*1e-6)
    
    from matplotlib.ticker import FuncFormatter
    formatter = FuncFormatter(currency)
    
    yes,no = range(2)
    c = [(1.,.5,0),'blue']  # color for yes/no stance
    a = [.6,.5]             # alpha for yes/no stance
    t = ['Yes','No ']       # text  for yes/no stance
    
    raised,spent = range(2)
    title = ["Raised for", "Spent on" ] # reuse code by injecting title specifics
    field = ['Contributions', 'Expenditures']
    
    footer ="""
    Data from CalAccess: http://cal-access.sos.ca.gov/Campaign/Measures/
    'Total %s 1/1/2010-05/22/2010' field extracted for every committee
    and summed by position ('Support' or 'Oppose').  No committees formed to
    support or oppose Proposition 13. cc-by Paul Ivanov (http://pirsquared.org).
    """ # will inject field[col] in all plots
    
    color = np.array((.9,.9,.34))*.9 # spine/ticklabel color
    plt.rcParams['savefig.dpi'] = 100
    
    def fixup_subplot(ax,color):
        """ Tufte-fy the axis labels - use different color than data"""
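        spines = list(ax.spines.values())
        # liberate the data! hide right and top spines (looked up by name,
        # since the order of ax.spines is not something to rely on)
        [ax.spines[side].set_visible(False) for side in ('right', 'top')]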
        ax.yaxis.tick_left() # don't tick on the right
    
        # there's gotta be a better way to set all of these colors, but I don't
        # know that way, I only know the hard way
        [s.set_color(color) for s in spines]
        [s.set_color(color) for s in ax.yaxis.get_ticklines()]
        [s.set_visible(False) for s in ax.xaxis.get_ticklines()]
        [(s.set_color(color),s.set_size(8)) for s in ax.xaxis.get_ticklabels()]
        [(s.set_color(color),s.set_size(8)) for s in ax.yaxis.get_ticklabels()]
        ax.yaxis.grid(which='major',linestyle='-',color=color,alpha=.3)
    
    # for subplot spacing, I fiddle around using the f.subplot_tool(), then get
    # this dict by doing something like:
    #    f = plt.gcf()
    #    adjust_dict= f.subplotpars.__dict__.copy()
    #    del(adjust_dict['validate'])
    #    f.subplots_adjust(**adjust_dict)
    
    adjust_dict = {'bottom': 0.12129189716889031, 'hspace': 0.646815834767644,
     'left': 0.13732508948909858, 'right': 0.92971038073543777,
     'top': 0.91082616179001742, 'wspace': 0.084337349397590383}
    
    for col in [raised, spent]: #column to plot - money spent or money raised
        # subplots for each proposition (Fig 0 and Fig 1)
        f = plt.figure(col); f.clf(); f.dpi=100;
        for i in range(len(prop)):
            ax = plt.subplot(len(prop),1, i+1)
            ax.clear()
            p = i+14    #prop number
            for stance in [yes,no]:
                plt.bar(stance, prop[i,stance,col], color=c[stance], linewidth=0,
                        align='center', width=.1, alpha=a[stance])
                lbl = locale.currency(round(prop[i,stance,col]), symbol=True, grouping=True)
                lbl = lbl[:-3] # drop the cents, since we've rounded
                ax.text(stance, prop[i,stance,col], lbl , ha='center', size=8)
    
            ax.set_xlim(-.3,1.3)
            ax.xaxis.set_ticks([0,1])
            ax.xaxis.set_ticklabels(["Yes on %d"%p, "No on %d"%p])
    
            # put a big (but faded) "Proposition X" in the center of this subplot
            common=dict(alpha=.1, color='k', ha='center', va='center', transform = ax.transAxes)
            ax.text(0.5, .9,"Proposition", size=8, weight=600, **common)
            ax.text(0.5, .50,"%d"%p, size=50, weight=300, **common)
    
            ax.yaxis.set_major_formatter(formatter) # plugin our currency labeler
            ax.locator_params(axis='y', nbins=5) # put fewer tickmarks/labels
    
            fixup_subplot(ax,color)
    
        adjust_dict.update(left=0.13732508948909858,right=0.92971038073543777)
        f.subplots_adjust( **adjust_dict)
    
        # Figure title, subtitle
        extra_args = dict(family='serif', ha='center', va='top', transform=f.transFigure)
        f.text(.5,.99,"Money %s CA Propositions"%title[col], size=12, **extra_args)
        f.text(.5,.96,"June 8th, 2010 Primary", size=9, **extra_args)
    
        #footer
        extra_args.update(va='bottom', size=6,ma='left')
        f.text(.5,0.0,footer%field[col], **extra_args)
    
        f.set_figheight(6.); f.set_figwidth(3.6); f.canvas.draw()
        f.savefig('CA-Props-June8th2010-%s-Subplots.png'%field[col])
    
        # all props on one figure (Fig 2 and Fig 3)
        f = plt.figure(col+2); f.clf()
        adjust_dict.update(left= 0.06,right=.96)
        f.subplots_adjust( **adjust_dict)
        f.set_figheight(6.)
        f.set_figwidth(7.6)
    
        extra_args = dict(family='serif', ha='center', va='top', transform=f.transFigure)
        f.text(.5,.99,"Money %s CA Propositions"%title[col], size=12, **extra_args)
        f.text(.5,.96,"June 8th, 2010 Primary", size=9, **extra_args)
    
        extra_args.update(ha='left', va='bottom', size=6,ma='left')
        f.text(adjust_dict['left'],0.0,footer%field[col], **extra_args)
    
        ax = plt.subplot(111)
        for stance in [yes,no]:
            abscissa=np.arange(0+stance*.30,4,1)
            lbl = locale.currency(round(prop[:,stance,col].sum()),True,True)
            lbl = lbl[:-3] # drop the cents, since we've rounded
            lbl = t[stance]+" Total"+ lbl.rjust(12)
            plt.bar(abscissa,prop[:,stance,col], width=.1, color=c[stance],
                    alpha=a[stance],align='center',linewidth=0, label=lbl)
            for i in range(len(prop)):
                lbl = locale.currency(round(prop[i,stance,col]), symbol=True, grouping=True)
                lbl = lbl[:-3] # drop the cents, since we've rounded
                ax.text(abscissa[i], prop[i,stance,col], lbl , ha='center',
                        size=8,rotation=00)
    
        ax.set_xlim(-.3)  # set only the left limit (the xmin= keyword is gone in newer matplotlib)
        ax.xaxis.set_ticks(np.arange(.15,4,1))
        ax.xaxis.set_ticklabels(["Proposition %d"%(i+14) for i in range(4)])
        fixup_subplot(ax,color)
    
        # plt.legend(prop=dict(family='monospace',size=9)) # this makes legend tied
        # to the subplot, tie it to the figure, instead
        handles, labels = ax.get_legend_handles_labels()
        l = plt.figlegend(handles, labels,loc='lower right',prop=dict(family='monospace',size=9))
        l.get_frame().set_visible(False)
        ax.yaxis.set_major_formatter(formatter) # plugin our currency labeler
        f.canvas.draw()
        f.savefig('CA-Props-June8th2010-%s.png'%field[col])
    
    plt.show()
    
    permalink
  4. Immigration in the US, contextualized (with pictures)

    So I probably don't need to tell you this since you already know, but

    Arizona sucks!

    It turns out that even documented immigrants agree, and I have the graphs to prove it!

    You see, it all started when I took a great Visualization course this past term, which was taught by Maneesh Agrawala. Maneesh gave enough structure for the assignments, but also left some aspect of each open-ended. For example, our first assignment had a fixed dataset which everyone had to make a static visualization of, but the means by which we did that was entirely up to us. A lot of people used Excel (in a graduate-level CS class? gross!), some people wrote little programs (I wrote mine in Python using matplotlib and numpy, and did some cool stuff that I will have to post about another time and contribute back to matplotlib), and there was even a poor sap who did it all in Photoshop, as I recall, but anything was fair game. Turns out we could even just draw or make something by hand and turn it in!

    The second assignment, the source of my graphs which quantitatively demonstrate the suckiness of Arizona, required us to use interactive visualization software to iteratively develop a visualization by first asking a question, then making a visualization to address this question, and then going back several times to refine the question and make successive visualizations.

    One thing to keep in mind is that, overall, naturalized citizens are both an exclusive and a discerning lot. In most cases, you have to be a permanent resident (have a Green card) for 5 years before you can apply. And there are quotas for how many people can get a Green card every year, so there are lots of hoops to jump through. Given the amount of effort involved, wouldn't it be nice to look at a breakdown of naturalized citizens by state? Because that would give us an idea about which states immigrants perceive as, for lack of a better word, "awesome", or if you're not so generous, "least sucky". I bet you'll feel that this second description is more appropriate once you take a look at the data, but keep my "least sucky" premise in mind as you read my original write-up, which focused on a different angle (but from which we can still draw some reasonable conclusions). I'll return to make a few more comments about the title of this post after the copy-pasted portion.

    Here's my original write-up:

    begin cut --->

    There are three kinds of lies: lies, damned lies, and statistics.

    As an immigrant, I've always had the subjective feeling that about half of the people I'm acquainted with are either themselves immigrants, or the children of immigrants. The US prides itself in being a melting pot, a country built by immigrants, so I wanted to dive into the data that would help me understand just how large of a role immigration plays in terms of the entire country. The question I started with, for the purpose of this assignment is this:

    What's the relationship between naturalizations and births in the US?

    But what I really wanted to find out was what kind of question I need to ask to get the answer that would be consistent with my world view. :)

    To do this, I started with the DHS 2008 Yearbook of Immigration Statistics, which was linked from the class website.

    The file I started with was natzsuptable1d.xls, which required cleanup before I could read it into Tableau. Turns out that even though "importing" to Tableau format is supposed to speed things up, it seems very fragile and would regularly fail when I tried converting type to Number (there were some non-numeric codes, like 'D' for 'Data withheld to limit disclosure'). *NOT* importing to Tableau's desired format also had the added benefit of allowing me to change the .xls files externally and having all the adjustments made in Tableau, without having to re-import the data source.

    Frustratingly, the last column and last row kept not getting loaded in Tableau! I also ran into an issue which I think had to do with the 'Unknown' country of origin and 'Unknown' state of naturalization which made the totals funky. It took a while to figure out, but there was a problem with Korea, because there was a superscript 1 by it, indicating that data from North and South Korea were combined.
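    In hindsight, this kind of cleanup could also be scripted before the data ever reaches Tableau. The sketch below is only an illustration: `clean_count` and `clean_label` are hypothetical helpers, and the exact cell contents (a bare 'D' for withheld values, a footnote digit stuck to the end of a country name such as 'Korea1') are assumptions about how the spreadsheet text comes through, not something I verified against the file.

```python
import re

def clean_count(cell):
    """Convert a spreadsheet cell to an int, or None for withheld/blank data."""
    text = str(cell).strip().replace(',', '')
    if text in ('', 'D'):  # 'D' = data withheld to limit disclosure
        return None
    return int(float(text))

def clean_label(label):
    """Drop a trailing footnote marker, e.g. 'Korea1' -> 'Korea' (assumed format)."""
    return re.sub(r'\s*\d+$', '', str(label).strip())

print(clean_count('D'))         # None
print(clean_count(' 1,234 '))   # 1234
print(clean_label('Korea1'))    # Korea
```

    Mapping withheld values to None instead of 0 keeps them from silently dragging down any totals computed later.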

    I was trying to use the freshest data possible, so I used the CDC's National Vital Statistics System report titled Births: Preliminary Data for 2007. I just had to copy-paste the desired data and massage it to fit the column order of the Excel table I already had handy. I put zeros for U.S. Armed Services Posts and similar territories, which is probably not accurate, but this data was not available in the reports that I found. Interesting factoid: according to NVSS (CDC), in 2007 there were more people born in NYC than in the rest of the state combined (about 129K vs 126.5K). The only caveat with this data is that it contains only 98.7% of the data. The states with some missing portion of their data tabulated are Michigan (at 80.2% completeness), Georgia (86.4%), Louisiana (91.4%), Texas (99.4%), Alaska (99.7%), Nevada (99.7%), and Delaware (99.9%). Thus, state-level analysis for MI, GA, and LA may be distorted.

    The data I had from DHS is for Fiscal Year 2008, which, as it turns out, goes from October 1st, 2007 - Sept 30th, 2008. Thus, no matter which combination of NVSS and DHS datasets I used, there would necessarily be a mismatch in the date range covered by each, so I settled with describing my visualization as "using the latest available data", noting the actual dates for each dataset in the captions. Also, the NVSS report contained a graph of births over time, which fluctuates very modestly from year-to-year, thus the visualization would not change qualitatively if I had 2008 birth data on hand.

    I was having a really hard time trying to get a look at the data I wanted to see in one sheet, and ended up trying to make a dashboard that combined several sheets. I couldn't figure out a good way to link the different states across datasets. I struggled for quite a while to pull out the data that I wanted to look at, and ended up having to copy-paste everything from DHS and NVSS (transposed) onto a new sheet in Gnumeric.

    Here's the result:

    [Figure: Initial visualization]

    So, in all of the US, about 1 in 5 new American citizens is an immigrant, or for every four births, we have one naturalization. That was kind of unsatisfying. I've lived in California the entire time I've been in the US, and I feel that at least California is more diverse than that. There's all those states in the middle of the country that few people from the rest of the world would want to immigrate to, yet the people living in them are still having babies, throwing off the numbers which would otherwise support my subjective world view...

    So I decided to look at the breakdown by state.

    Broken down by state, what's the relationship between naturalizations and births in the US?

    [Figure: my second iteration]

    I added the reference lines so that you could both read off the approximate total more easily, and be able to do proportion calculations visually, instead of mentally. This started looking promising, as I've only lived in California, and it looks like it's got quite a lot of immigrants as a portion of total new citizens.

    It was still kind of hard to see the totals, so I decided to create my very first calculated field, which had the very simple formula [Births in 2007]+[Total Naturalized]. Using this new field, I could now make a map, to see the growth broken down geographically. This was just a way of reaffirming my earlier bias against the middle states having babies without attracting a sufficient number of immigrants to conform to my world view.

    [Figure: gratuitous map (was too easy to do using the software)]

    In the breakdown by state bar graph, it was also difficult to visually compare the total births by state, because they all started at a different place, depending on the number of naturalizations for that state. So I decided to split the single bar and make small multiples for each state.

    [Figure: back to something more interpretable]

    It's interesting that the contribution of naturalizations slightly changes the ordering of the growth of states. For example, Florida has fewer births than New York, yet its total growth is larger, because it naturalized 30,000 more people than New York. With this small multiples arrangement, it was now possible to do positional comparisons across categories, not just between naturalizations and totals. Turns out that more people get naturalized in California than are born in the entire state of New York. And since New York has the third highest number of births annually, more people got naturalized in California than are born in any state other than CA and TX.

    This was too large of a graph, and the story I'm interested in is really the ratio between births and naturalizations (the closer to 1:1, the better), so I made another calculated field, which is exactly such a ratio, multiplied by a factor of a thousand, so I could give it a sensible description (Naturalizations per 1000 births). This refines my question:

    For every 1000 people born in the US, how many immigrants become naturalized?

    I then ordered on these ratios, and decided to filter the top states. Guam would have made the cut, but it is not a state, and (though I didn't mention it earlier) its NVSS birth data was only 77% complete, so I excluded it. Fifteen is a nice odd number, but it also marked a natural transition: after Texas, everything else is less than 200 naturalizations per 1,000 births.
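    Outside of Tableau, the same calculated field and top-N filter could be sketched in a few lines of Python. The function names are invented for this sketch, and the three "states" use made-up numbers purely for illustration; they are not the DHS or NVSS figures.

```python
def naturalizations_per_1000_births(naturalized, births):
    """The calculated field: naturalizations for every 1000 births."""
    return 1000.0 * naturalized / births

def top_states(table, n=15):
    """Rank states by the ratio; `table` maps state -> (naturalized, births)."""
    ratios = {
        state: naturalizations_per_1000_births(naturalized, births)
        for state, (naturalized, births) in table.items()
        if births > 0  # skip rows where births were entered as zero
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up numbers, for illustration only.
example = {
    'State A': (50000, 100000),
    'State B': (10000, 80000),
    'State C': (30000, 90000),
    'Territory X': (500, 0),
}

for state, ratio in top_states(example, n=2):
    print('%s: %.0f' % (state, ratio))
# State A: 500
# State C: 333
```

    The `births > 0` guard matters for rows like the Armed Services Posts, where zeros were filled in for missing birth data and the ratio would otherwise be undefined.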

    The small multiples bar graphs still looked too busy, and there was redundancy in the data, which didn't tell a succinct story. So I switched to looking at the ratios alone. This revealed that, indeed, the fact that I've been living in California makes my perspective quite unique, as it is one of three states, along with Florida and New Jersey, to have an outstandingly large number of naturalizations compared to births. It is so high, indeed, that it puts the naturalizations per births rate in these three states at more than twice the national average!

    Looking at the ratio alone tells us about the diversity in each state's growth, but it carries more meaning in the context of total growth. Thus, I added the combined totals (naturalizations and births) as a size variable, for context. The alternating bands both make it easier to read off the rows, and aid the comparison of sizes by framing every data point in a common reference window. It makes it obvious that California is the state with 864,261 new citizens, because it fills the frame completely.

    Final question: What are the Top 15 "Melting Pot" States?

    [Figure: almost done, would be nice to include context from the visualization I started with]

    Ordering the data in this way also shed light on the small but still very diverse states that would not have otherwise made the cut (and did not pop out in any manner on my previous bar graphs). Rhode Island and Hawaii got it going on, in terms of attracting immigrants.

    Certainly the fact that I'm an immigrant myself also greatly influences whom I associate with, further skewing my world view towards a 1:1 ratio, but I'm actually quite impressed with just how close to that the ratio is in California: 1:1.9. Of course, the data I've analyzed does not include the American-born 1st generation of children, nor does it take into account the number of immigrants living in the US that do not have citizenship. All of these factors would surely push the ratio even closer toward 1:1.

    I decided to combine the US total growth information, since it gives further perspective on the entire data set, such as the fact that California accounts for about 16% of total US growth. It also sheds light on how the US average was calculated. A new "twice the nat'l avg" line makes explicit the three most diverse outlier states mentioned before. I also changed the colors to match the convention used in the bar charts made earlier. The US combined total line semantically links the data plotted with the national growth bar chart, i.e. the green dots are formed by the sum of born and naturalized citizens.

    [Figure: What are the Top 15 "Melting Pot" States?]
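    To relate the different ways these ratios are quoted in this write-up, here is a small arithmetic sketch. It only converts between "one naturalization for every N births", "naturalizations per 1000 births", and "share of new citizens who are naturalized"; the function names are invented for this sketch, and the inputs are the rounded ratios quoted in the text (about 1:4 nationally, 1:1.9 for California), not the underlying counts.

```python
def per_1000_births(births_per_naturalization):
    """A 1:N ratio expressed as naturalizations per 1000 births."""
    return 1000.0 / births_per_naturalization

def naturalized_share(births_per_naturalization):
    """Fraction of new citizens (births + naturalizations) who are naturalized."""
    return 1.0 / (1.0 + births_per_naturalization)

for label, ratio in [('US, about 1:4', 4.0), ('California, 1:1.9', 1.9)]:
    print('%s -> %.0f per 1000 births, %.0f%% naturalized'
          % (label, per_1000_births(ratio), 100 * naturalized_share(ratio)))
# US, about 1:4 -> 250 per 1000 births, 20% naturalized
# California, 1:1.9 -> 526 per 1000 births, 34% naturalized
```

    With these rounded inputs, California's roughly 526 per 1000 births sits just above twice the approximate national figure of 250, which is consistent with the "more than twice the national average" observation.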

    <---- end of cut

    Ok, so, to be honest, it turns out that I wrote a large chunk of this post (Arizona suckage included) before I actually looked back at my visualizations, only going off my memory that it wasn't in the top 10. So Arizona is just below the national average in this "Melting Pot" ratio (a measure I made up, the number of naturalizations per 1000 births). Since it is #12, some might say, "Paul, Arizona's on your top 15 list", to which I'll reply: "So's Texas."

    I guess I just wanted to share these purdy graphs I made a few months back, and it seemed like there was a somewhat topical angle on them a few weeks back, when I remembered that I hadn't posted them on here yet. Anyway, I'd love to hear back your thoughts.

    permalink
  5. visualizing world statistics (Gapminder - Hans Rosling)

    Graph: CO2 emissions per capita versus Time

    CO2 vs Time - Gapminder

    Above: a plot I made using Gapminder. When I first tried this tool a few months ago, I was left confused and unimpressed. Luckily, since then, I've stumbled upon the following two explanatory videos (~20 min each).

    last year and this year.

    After watching the videos, you can play with Gapminder yourself as it is a web-based tool.

    More info and tool links at gapminder.org.

    permalink
  6. SLC punk'd!

    2007 02 22 life

    visualization

    I'm going to a conference for a week in Utah. It'll be my first time in Park City, and my second time in Salt Lake City. Here's a map of downtown SLC, color-coded to emphasize the insanity:

    SLC Punk'd (by Paul Ivanov)

    worst. planning. ever. (...and I don't buy their propaganda - next to the green arrow on the map above, you could be on N W Temple, between W N Temple and W S Temple [or is it E S Temple?])

    permalink