Visualizing Census Data

Map-based visualizations and infographics interest me. I especially like the work of the New York Times' visualization team. For example, look at the US census explorer, which maps income, housing and education data, based on surveys from 2005-2009, down to every block of every city in the US. I really wish the Indian media did some equivalent work.

I wanted to see if I could construct a map visualizing some census data that is of interest to us. I don't have the skills to make an interactive map, so my goal was to build a static image depicting specific census data. It took me a week to put together the following image, which shows the change in sex ratio (measured as the number of females per thousand males) for the districts of Tamil Nadu, based on data from the 1951, 1981 and 2011 censuses.

Change in Sex Ratio from 1951 to 2011

Two caveats: One, the districts of Tiruppur and Krishnagiri are missing, because I didn't have the required geographical information. Two, the intervals are defined such that approximately the same number of data entries fall within each interval. In other words, if you count the regions coloured with each shade, the counts should be approximately the same.
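
Intervals of this kind are usually called quantile classes. The following is a minimal sketch of how such intervals can be computed in base R; it is not the code used for the image, and the values in `sex_ratio` are made up for illustration rather than taken from the census.

```r
# Hypothetical sex-ratio values (females per 1000 males); NOT real census figures.
sex_ratio <- c(952, 968, 975, 981, 987, 990, 996, 1003, 1010, 1018, 1024, 1031)

# Five classes: put the breaks at the 0%, 20%, ..., 100% quantiles.
n_classes <- 5
breaks <- quantile(sex_ratio, probs = seq(0, 1, length.out = n_classes + 1))

# Assign each value to its interval; include.lowest keeps the minimum
# inside the first class instead of turning it into NA.
classes <- cut(sex_ratio, breaks = breaks, include.lowest = TRUE, dig.lab = 5)

# Each class should hold roughly the same number of values.
class_counts <- table(classes)
print(class_counts)
```

Because the breaks follow the data, the classes are not equally wide: a shade can cover a narrow or a wide range of sex ratios depending on how the values are spread.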

Still, there are a few interesting things to note in this visualization. As a whole, we can see that the sex ratio in 1951 was better than in 1981. Though 2011 is better than 1981, it is not as good as 1951. Note how the sex ratio increases for the Nilgiris district, while that of Ramanathapuram district goes in quite the opposite direction. I gladly point out ('cause it is my home) that the southern districts of Tirunelveli and Thoothukudi have fared better throughout the years, and their sex ratio remains above 1015 the whole time! A more colourful version of the above image is shown below:

Change in Sex Ratio from 1951 to 2011

The following is an account of how I made this map. I started out with a search for the geographical data (i.e., latitude and longitude) for the boundaries of India's administrative regions. I find the lack of easily accessible geographical data for India frustrating. For the United States, you get the information for every county and every block from the US census website. India has a GIS website, and there is some mapping information available at the National Atlas, but it is in PDF format, which is not amenable to building our own visualizations using open source tools.

I then tried to retrieve the data from OpenStreetMap (OSM). As awesome as OpenStreetMap is, retrieving the district boundaries is not straightforward (as far as I could understand), especially for the coastal districts. But if you are doing visualization for any urban data, OSM is the place to go! Finally, I found GADM, where one can download the geographical information for administrative regions of many countries. The latitude and longitude coordinates are stored in the standard shapefile format. There are four different shape files available for India: IND_adm0.shp, IND_adm1.shp, IND_adm2.shp, and IND_adm3.shp. IND_adm1.shp and IND_adm2.shp have the boundaries of the states and the districts of India, respectively.

Next, I had to extract the longitude and latitude information for only the districts of Tamil Nadu from the shape files. There are many libraries, written in various programming languages, for processing shape files. I chose to use R, the programming language for statistics, because I liked the excellent spatial data graphics at the spatial analysis blog, most of them done using R. After a bit of googling about the shape file format and R's maptools library, I extracted the district information from IND_adm2.shp as follows:

    library(maptools)  # provides readShapePoly() and writePolyShape()

    # read the polygon shapes
    district_adm_shp = readShapePoly("IND_adm2.shp")
    # get the meta data of the polygons
    district_df = district_adm_shp@data
    # get the meta data of polygons of TN's districts
    tn_dist_df = data.frame(district_df[grep("Tamil",district_df$NAME_1),])
    # get polygon data of TN's districts
    polygon_list = list()
    for (istr in rownames(tn_dist_df)){
      # row names are 0-based record IDs, list positions are 1-based
      i = as.numeric(istr) + 1
      tmp = district_adm_shp@polygons[i]
      polygon_list = c(polygon_list,tmp)
    }
    # construct a new shape file with only TN's districts
    dist_spatial = SpatialPolygons(polygon_list,seq_along(polygon_list))
    dist_spatial_frame = SpatialPolygonsDataFrame(dist_spatial,data=tn_dist_df)
    # save it to disk (presumably with writePolyShape; the file name is
    # taken from the read call that follows)
    writePolyShape(dist_spatial_frame,"tn_dist_state")
    dist_df = readShapePoly("tn_dist_state.shp")
    plot(dist_df) # should print a map of districts in TN
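
The `+ 1` inside the loop is easy to overlook. Here is a small base R sketch of why it seems to be needed, assuming (as the loop implies) that the row names of the attribute table are 0-based record IDs. The data frame and the list below are toy stand-ins with illustrative rows, not the GADM table or real polygon objects.

```r
# Toy stand-in for district_adm_shp@data; the rows are illustrative only.
district_df <- data.frame(
  NAME_1 = c("Kerala", "Tamil Nadu", "Tamil Nadu", "Karnataka"),
  NAME_2 = c("Idukki", "Madurai", "Salem", "Mysore"),
  row.names = c("0", "1", "2", "3"),  # assumed 0-based record IDs
  stringsAsFactors = FALSE
)
# Toy stand-in for district_adm_shp@polygons: one list element per record.
polygons <- list("poly_Idukki", "poly_Madurai", "poly_Salem", "poly_Mysore")

# Keep only the rows whose state name contains "Tamil".
tn_rows <- district_df[grep("Tamil", district_df$NAME_1), ]

# Row names count from 0, list positions count from 1, hence the + 1.
idx <- as.numeric(rownames(tn_rows)) + 1
tn_polygons <- polygons[idx]
print(unlist(tn_polygons))

# grep() already returns 1-based positions, so it can index the list directly.
pos <- grep("Tamil", district_df$NAME_1)
```

In this toy case `pos` and `idx` select the same elements, so indexing with the result of `grep()` would avoid the conversion through row names altogether.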

I then added the sex ratio census data to the meta data of the shape file. I downloaded the sex ratio data for the districts of Tamil Nadu from the India Census website. From the Excel sheet, I made a csv file (available here), because it is easier to import csv data into R. The following commands add the required meta data.

    # read the polygon shapes of TN's districts
    tn_dist = readShapePoly("tn_dist_state.shp")
    sxratio = read.csv("sex_ratio.csv")
    # add sex ratio information of TN's districts
    tn_dist@data$SXR_2011 = sxratio[1:30,"X2011"]
    tn_dist@data$SXR_1981 = sxratio[1:30,"X1981"]
    tn_dist@data$SXR_1951 = sxratio[1:30,"X1951"]
    # write to an output file
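
The assignments above copy rows 1 to 30 of the csv by position, which works only when the csv lists the districts in the same order as the shape file. A more defensive variant is to align the two tables by district name. The sketch below uses toy data frames in base R; `NAME_2` is assumed to be the district-name column, `District` is a hypothetical csv column name, and the figures are made up.

```r
# Toy stand-in for tn_dist@data; NAME_2 is assumed to hold the district name.
shape_data <- data.frame(NAME_2 = c("Salem", "Madurai", "Erode"),
                         stringsAsFactors = FALSE)

# Toy stand-in for sex_ratio.csv; "District" is a hypothetical column name
# and the values are illustrative, not census figures.
sxratio <- data.frame(District = c("Erode", "Madurai", "Salem"),
                      X2011 = c(993, 990, 954),
                      stringsAsFactors = FALSE)

# For each district in the shape file, find its row in the csv.
row_in_csv <- match(shape_data$NAME_2, sxratio$District)

# Stop early if a name has no counterpart (for example, a spelling difference).
stopifnot(!anyNA(row_in_csv))

# Copy the values over in shape-file order.
shape_data$SXR_2011 <- sxratio$X2011[row_in_csv]
print(shape_data)
```

The `stopifnot()` check matters in practice, because district names are often spelled differently across sources.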

All the commands I used for data processing are given in the source file here. The final shape file for the Tamil Nadu districts, along with the sex ratio information collected at every census from 1901 to 2011, is here. Note that the shape file is derived from GADM data, which can be used only for non-commercial purposes.

After extracting the necessary data, I used the ggplot2 library in R to plot the map and fill colours according to the sex ratio data. Since ggplot2 is a powerful plotting system with many features, I haven't explained it step-by-step. You can see the commands for plotting in the source code here.
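
For readers who have not used ggplot2 for maps, here is a minimal sketch of the general idea. It is not the plotting code linked above. It draws two made-up square "districts" and fills them by an illustrative class label; the coordinates, the `sxr_class` column and the palette are all assumptions chosen for the example.

```r
library(ggplot2)

# Two made-up square "districts". Real input would be the district outlines
# flattened into one row per boundary vertex.
outline <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c("A", "B"), each = 4),
  # Illustrative class labels, e.g. the kind produced by cut() on the sex ratio.
  sxr_class = rep(c("below 1000", "1000 and above"), each = 4)
)

map_plot <- ggplot(outline,
                   aes(x = long, y = lat, group = group, fill = sxr_class)) +
  geom_polygon(colour = "white") +   # one filled polygon per group
  scale_fill_brewer(palette = "Blues", name = "Sex ratio") +
  coord_equal() +                    # same scale on both axes
  theme_void()                       # hide axes and grid for a map-like look

print(map_plot)
```

The `group` aesthetic tells ggplot2 which vertices belong to the same outline, so that separate districts are not joined into one shape.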

I hope to do more map-based visualizations in the future.