Wednesday, April 28, 2010

Assignment 3

One of the leading indicators of economic decline is new housing starts. Figure 1 below is a time-series plot of US new housing starts for single-family dwellings from 1960 to 2010. What we can see is a clear correspondence with historical recessions: the oil shocks of the early '70s, 1980, 1990, and then the most recent. Interestingly, for the most recent downturn the series quite clearly began trending downward by 2005, and rather steeply, yet the banking and real-estate 'crash' didn't occur until the fall of 2008. Moreover, this simple time-series graph lets us see the current economic situation in historical perspective--from this viewpoint the downturn is quite significant and severe.

Figures 2, 3 and 4 plot the recession from an international perspective. Figure 2 is a snapshot of the major economic nations in Europe, Asia, Latin America and the United States. It plots the percentage change in a nation's GDP from its previous quarter's value. For instance, in quarter 1 of 2009 Mexico's GDP saw a near 5 percentage point drop. In this we can see that not every region has been affected equally: Germany and Mexico suffered severely, while India maintained positive growth throughout. Figure 3 adds 5 more nations to the plot. We can see the general trend repeated with some clear 'outliers': Mexico and Russia suffered stark declines, while India again shows continued growth. Figure 4 is a boxplot of all 9 nations. This allows us to see the general mean of the group as it declines beginning in 2008. For much of 2007, all of the nations were exhibiting positive growth. By quarter 3 of 2008 nearly all of the nations had dipped into negative growth. Yet we can also see that by quarter 2 of 2009 positive GDP growth had become the overall trend.
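The quarter-over-quarter change plotted in Figures 2-4 is simple to compute. A minimal R sketch (the GDP numbers below are made up purely for illustration, not taken from any of the figures):

```r
## Hypothetical quarterly GDP index for one nation (illustrative numbers only)
gdp <- c(100, 101.2, 102.0, 101.5, 99.8, 95.1)

## Percentage change from the previous quarter's value
pct.change <- 100 * diff(gdp) / head(gdp, -1)
round(pct.change, 2)
```

Applying this to each nation's series and binding the results by quarter is all Figure 4's boxplot requires.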



Figure 1



Figure 2

Figure 3

Figure 4

Friday, April 16, 2010





Figure 5 (replication of the figure from the 1998 article):
M.E. Mann, R.S. Bradley, M.K. Hughes, "Global-scale temperature patterns and climate forcing over the past six centuries", Nature, 1998.

RCODE: 
### Hockey stick replication
### data source http://www.nature.com/nature/journal/v430/n6995/extref/nature02478-s1.htm    "nhmean.txt"

## read in the data (column names assumed to be date, raw, recon, upper, lower)
hockey <- read.table("nhmean.txt", header=TRUE)

## get rid of 0s (treat them as missing)
hockey$raw[hockey$raw==0] <- NA 
hockey$recon[hockey$recon==0] <- NA
### Get the 1961-1990 mean to center the chart (matching the ylab)
x1 <- subset(hockey$raw, hockey$date >= 1961 & hockey$date <= 1990)
meanx1 <- mean(x1, na.rm=TRUE)

#### extend plot margins for labels
par(mar=c(5,5.2,4,2))

#plot data and create new ylab numbers
plot(hockey$date, hockey$raw, type="l",col = "red", ylim = c(-1.1, 0.8), las=1, ylab="Departure in temperature (C)\nfrom the 1961 to 1990 average", xlab="Year\n Chris Miner Geo 299B ", yaxt= "n")
axis(2, at=meanx1, las=1, labels="0.0")
axis(2, at=meanx1-0.5, las=1, labels="-0.5")
axis(2, at=meanx1+0.5, las=1, labels="0.5")
axis(2, at=meanx1-1, las=1, labels="-1.0")

##create "standard error" polygon
polygon(x= c(hockey$date, rev(hockey$date)), y=c(hockey$lower, rev(hockey$upper)), col="grey", border=FALSE)


###### add text
rect(1550, -0.9, 1997, -1.1, border = TRUE, col = "white")
   text(1550, -1.02, "Data from thermometers (red) and from tree rings,\ncorals, ice cores and historical records (blue).", pos = 4, adj = 0)
   text(1700,.7, "NORTHERN HEMISPHERE")

#### add trend line

lines(hockey$date, hockey$recon, col="dodgerblue3") ### go dodgers
abline(h=meanx1) #### mean value line
 

### function to deal with lowess' problem with NAs: drop pairs where either
### value is missing, then smooth what remains
lowess.na <- function(x, y, f = 2/3, ...) { 
  ok <- !is.na(x) & !is.na(y)
  lowess(x[ok], y[ok], f = f, ...)
}
lines(lowess.na(hockey$date, hockey$recon, f=0.04), lwd=2)


In many ways the lines of the climate debate can be drawn within the frame of one graph. In 1998, Mann, Bradley and Hughes published the now famous 'hockey stick' reconstruction of 600 years of temperature patterns across the Northern Hemisphere. The graph itself is innocently buried within a dense scholarly discourse full of eigenvalues and principal component analysis. As a figure it is one of many in a short article. As a representation of the data, among the other figures presented, it could be said that it is the least visually appealing and contains the least amount of information. However, what it does do is summarize the thrust of the entire article and visually provide near-conclusive evidence that there is an historically unique change in climate and that it is caused by human activity.

The questions the article sets out to answer are fundamental to the climate debate: Is the earth getting warmer? If it is, is that change within the normal variability of long-term trends? And finally, is human activity involved in that change? The 'hockey stick' graph, whether the authors intended it to or not, answers all of these questions, and it does so forcefully and emphatically. It depicts a clear monotonic growth in temperature over the 20th century. This growth has gone beyond the visible trends seen in prior centuries, and, most importantly, the beginning of the current trend seems to correlate exactly with the growth of industry in the northern hemisphere. Yet--and this is the reason we may question the authors' intentions--the graph is stunningly clear and conclusive, while the text of the article speaks of the uncertainty and provisional nature of the findings.
The graph itself has given rise to its own controversy: it is either a global fraud, one part of a scientific discourse, or the philosopher's stone of climate change. Critics point out the highly aggregate nature of the data: layers of uncertainty built one upon another. They argue the method smooths and attenuates global trends, which exaggerates the data from modern thermometer readings; that its data is filled with measurement error; and that it is a highly non-random sample of both the proxies and the raw temperature data, which introduces more severe auto-correlation than the authors admit, along with problems of endogeneity in the temperature readings. The defenders claim that the methodology and data were open to inspection, that levels of uncertainty were well explicated in that study and following ones, that further studies have built on this evidence, and finally that whatever reasonable level of uncertainty you put on the data, something worrying is going on.

However, leaving the climate debate aside and focusing on the 'hockey stick' itself, the most telling critique of the authors of the original study might not be that their study is flawed but that they underestimated or ignored the power that a visual representation of data can have. In this debate, the graphic overpowered the words. It gave a strong impression of certainty not echoed in the text, and thus, as was warned in IPCC recommendations, "More consistent estimates of the endpoints of a range for any variable would minimize misunderstandings and reduce the likelihood that interest group could misunderstand or misrepresent the findings". These misunderstandings run across interest groups both for and against the article's findings, for uncertainty does not favor either side of the debate in this case. Just as much as the problem might be overstated, it could be much worse, as scholars have pointed out. This fact has been lost in the debate sparked by the visual representation of Mann et al.'s findings.

Though I only have access to the already aggregated, mean-centered data, below is a brief discussion of some shortcomings in the 'hockey stick' graph and an attempt at improvement. First are the wholly arbitrary elements of the graph: the 0-point line drawn through the authors' chosen point; the coloring; the scale of the axes; the combination of a time-series line plot, a mean-centered trend line, and the backdrop of the 'confidence intervals'. The scale of the axes gives the impression of a much greater magnitude of change. The 0 point of the y axis seems chosen for visual effect rather than representing an aspect of the data (why 1961? why not 1902, or the whole timeline?). The uncertainty in the graph is grayed out and simply a background feature.

Yet in my view, one of the most visually misleading aspects of the graph is the trend line drawn through the mean of the data points. First of all, these are not observations; these are point estimates. In a frequentist approach (which I'm assuming they're taking), if portraying the level of uncertainty is high on our agenda, then a mean trend line is likely in this case to give an overly confident visual impression. For instance, the standard-deviation intervals represented in gray in the graph tell us little about how confident we are of where the true value lies--what they tell us is that, if our assumptions are reasonably accurate, then 95% (or 97.5%, etc.) of the time such a confidence interval will cover the true value. We do not know the probability of where that point lies or how likely it is to be at the center or the extreme of that confidence interval. Further, there is no uncertainty given to the thermometer temperature readings. They are treated as if they were an accurate census of the population. However, the thermometer readings are as much a sample from a population as the proxy data. Thus the thermometer readings are at risk for all of the problems suffered by the proxy data--spatial and serial auto-correlation, measurement error, missing data, etc.
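The coverage interpretation of a confidence interval can be seen in a quick, purely illustrative simulation (none of this uses the actual temperature data):

```r
## Frequentist coverage: a 95% confidence interval computed from repeated
## samples covers the fixed true mean in roughly 95% of those samples
set.seed(7)
true.mean <- 0
covered <- replicate(2000, {
  x  <- rnorm(30, mean = true.mean, sd = 1)  # one hypothetical sample
  ci <- t.test(x)$conf.int                   # its 95% CI
  ci[1] <= true.mean && true.mean <= ci[2]   # did this CI cover the truth?
})
mean(covered)  # should land near 0.95
```

The point is that the 95% is a property of the procedure across repetitions, not a probability statement about any single gray band on the graph.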
 
Below is a WEAKLY ATTEMPTED improvement on the original graph:


I'm trying to capture a little more of the uncertainty. Since the proxy data covers nearly the whole time span, the thermometer readings are left out, as overlaying them on the proxy data obscures the trend told by the proxy data. Second, the 0 line is centered at the mean of the thermometer data, as the graph is claiming to tell us the deviation from post-industrial temperatures. Finally, the upper and lower bounds are highlighted with lowess lines (I still need to play with the smoothing, as some of the points are outside the bounds). 
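Those adjustments can be sketched roughly as follows. This is a hedged sketch, not the exact code behind my figure: the column names follow the replication code above, and the data here are a synthetic stand-in since nhmean.txt isn't bundled with this post.

```r
## Synthetic stand-in for the series (illustrative numbers only)
set.seed(1)
date  <- 1400:1995
recon <- cumsum(rnorm(length(date), sd = 0.02))
hockey <- data.frame(date = date, recon = recon,
                     raw   = ifelse(date >= 1902, recon, NA),  # 'thermometer' era
                     upper = recon + 0.2, lower = recon - 0.2)

## lowess wrapper that drops NA pairs before smoothing
lowess.na <- function(x, y, f = 2/3, ...) {
  ok <- !is.na(x) & !is.na(y)
  lowess(x[ok], y[ok], f = f, ...)
}

## center the 0 line at the thermometer-era mean, plot the proxy series only,
## and highlight the upper and lower bounds with lowess lines
therm.mean <- mean(hockey$raw[hockey$date >= 1961 & hockey$date <= 1990], na.rm = TRUE)
plot(hockey$date, hockey$recon - therm.mean, type = "l", col = "dodgerblue3",
     xlab = "Year", ylab = "Departure (C) from the thermometer-era mean")
abline(h = 0)
lines(lowess.na(hockey$date, hockey$upper - therm.mean, f = 0.1), lty = 2)
lines(lowess.na(hockey$date, hockey$lower - therm.mean, f = 0.1), lty = 2)
```

Swapping the real nhmean.txt columns in for the synthetic data frame reproduces the revised figure.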





Tuesday, April 6, 2010

Week 1 data visualization in Political Science


In political science much of the past research on the outbreak of civil wars has been conducted using aggregate state-level demographic and economic data in cross-national comparisons. Using a handful of explanatory variables such as GDP per capita, ethnic fractionalization, or colonial history, these studies have generally coalesced around explaining the probability of a nation having a civil war with factors such as natural-resource reliance, low per capita GDP, or difficult terrain. Moreover, there seems to be a lack of correlation with religious or ethnic divisions, regime type, or economic inequality.

           
Below is a graph from a highly influential article on civil war onset. Pictured are probabilities of a nation experiencing a civil war, derived from 220+ onsets along a dozen explanatory variables for over 150 nations spanning 40 years. One of the important and somewhat controversial findings is that ethnically divided nations are not significantly more at risk for violence. This graph attempts to summarize that core finding by placing the probability associated with ethnicity in perspective with that associated with different levels of per capita GDP.

How well is this graph presenting the information? Is it getting in the way of the data, or is it helping to identify patterns? First, it is unclear exactly what it is telling us. For instance, what does the probability mean--is it a lot, a little? It is quite difficult to tell how varying ethnicity changes the probability. What is the relationship of varying both variables? What is the variation across nations or regions? Is it so abstract and highly aggregated that we lose any feel for substantive significance? Could this be better represented by a simple table? Moreover, it may even be misleading. Is that really the relationship between the two, or is the level of aggregation obscuring important details? What if we had several highly geographically unequal societies wherein the civil wars were occurring in the wealthy regions? 
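To make concrete what such a graph is displaying, here is a toy version of the underlying quantity: predicted probabilities of onset across per capita GDP from a logistic regression. Everything below is simulated for illustration; the coefficients and data have nothing to do with the actual study.

```r
## Simulated onset data: poorer nations get a higher onset probability
set.seed(42)
gdp   <- runif(500, 0.5, 20)                      # GDP per capita, thousands
onset <- rbinom(500, 1, plogis(0.5 - 0.2 * gdp))  # true model, known here only
fit   <- glm(onset ~ gdp, family = binomial)

## Predicted probability of onset over a grid of GDP values
grid <- data.frame(gdp = seq(0.5, 20, length.out = 100))
p    <- predict(fit, newdata = grid, type = "response")
plot(grid$gdp, p, type = "l",
     xlab = "GDP per capita (thousands)",
     ylab = "Predicted probability of civil war onset")
```

Note that even this toy curve inherits the critique above: it is a point estimate holding everything else constant, with no visual cue for its uncertainty.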
































Armed Conflict Location and Events Dataset (ACLED)

The level of aggregation of much of the civil war literature has bothered many scholars. Below is a figure derived from a new data set attempting to overcome this problem. ACLED is cataloging information on individual civil-war events, along with locations, dates, participants, context and outcomes. The ovals represent the activity of varying rebel groups, while the map shading represents population density.

There seems to be a high correlation between population density and rebel activity. However, is it population density or the border with Rwanda and Uganda that is the important factor? It is not readily apparent given the way the non-DRC countries are 'left out'. Secondly, are the colored circles the best way to represent the second layer of information? Are the circles distracting? How dependent are the areas of the circles on outliers, or do they represent a more or less even dispersion?

























The last figure uses the ACLED data for an analysis of the correlation of violent events with variables such as wealth, location of diamond mines, distance from the capital, ethnic make-up, etc. While the unpublished version is in color, the published version is black and white (what most of the world will see--what would Tukey say?). Unlike the first chart, the disaggregated information allows us to ask questions such as: can we say diamonds are correlated with civil wars when the conflict site is nowhere near the source of diamonds? Yet the figure poses its own potential distortions. For example, it is hard to distinguish between the size of a bubble and the number of war events. To the eye, a few war events take on a disproportionate significance. For instance, the majority of events take place around the capital, Monrovia; however, the figure gives the impression of a greater spread of events across the nation.