The following three plots are from R.
The following two plots are from ArcGIS.
The remaining plots are from GeoDa.
Obama/Kerry multivariate LISA
Spatial lag scatter plots: distance-based vs. queen contiguity weights
By far the easiest package to use was GeoDa. In addition to ease of use, GeoDa provides tools that ArcGIS doesn't, in particular the multivariate features. For data exploration, GeoDa not only saved time by letting me change parameters quickly, it also allowed for a detailed and flexible way to compare changes and techniques side by side. R is certainly the most flexible, in that you can do just about whatever you want provided you take the time to track down the functions, but it falls short on what is often the most important part of exploring data: visualizing in near real time and comparing views. One can do the same in R, but the ease with which it can be done in GeoDa helps greatly in taking the data in and grasping the big picture.
As for prerequisites, aside from the obvious one of knowing how to use the software, having some knowledge of what is going on behind the pixels not only lets one 'know' what they are trying to say with the data but also the possibilities and limitations of what can be said. After having played with spatial weights, however, the most important prerequisite is knowledge of one's subject. If one doesn't come with subject matter expertise and a reasonable justification for the assumptions built into the model, then it's all "gobbledygook and magic".
Tuesday, May 25, 2010
Wednesday, May 19, 2010
With spatial autocorrelation, the value of a variable at one location is in part a function of its value at nearby locations. For example, if we know the value for California, we can predict something about the value for Arizona. The problem, moreover, is not just determining how much one location influences another, but deciding which locations are doing the influencing. As the results below show, the pattern of correlation changes depending on our definition of 'nearness'. The lagged plots are also evidence that not only the existence of a relationship changes but its nature as well, since observations move between quadrants depending on how we define a neighbor.
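As a concrete illustration of what moran.plot puts on the y axis: with row-standardized weights, the spatial lag of a province is just the average value over whatever we have defined as its neighbors. A minimal sketch, reusing objects built in the code below:
#spatial lag by hand, using the k=1 neighbor list constructed below
w <- nb2listw(afghan.knn1, style="W") #row-standardized weights
lag.knn1 <- lag.listw(w, afghan$foodinsecu) #each province's neighbor average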
########
#R code#
########
#load packages
library(spdep)
library(maptools)
library(classInt)
library(RColorBrewer)
#load shape file
afghan <- readShapePoly("afghan.shp",proj4string=CRS("+proj=longlat"))
#extract the polygon centroids; these points stand in for each province in the neighbor calculations
afghan.centers = coordinates(afghan)
#######################################################
#determine the k nearest neighbors for each point in afghan.centers;
k=1
knn1 = knearneigh(afghan.centers,k,longlat=T)
#create a neighbors list from the knn1 object;
afghan.knn1 = knn2nb(knn1)
##K=2 nearest neighbor
knn2 = knearneigh(afghan.centers,2,longlat=T)
#create a neighbors list from the knn2 object;
afghan.knn2 = knn2nb(knn2)
########################################################
#create a distance based neighbors object (afghan.dist.250) with a 250km threshold;
d = 250
afghan.dist.250 = dnearneigh(afghan.centers,0,d,longlat=T)
##########Playing with options: experiment with nbdists,
###########find the smallest distance that gives every point a neighbor, then add 100km
dsts<-unlist(nbdists(afghan.knn1, afghan.centers, longlat=T))
##distances from each centroid to its single nearest neighbor
max_dsts<-max(dsts)## the largest of these is the minimum threshold that leaves no province neighborless
afghan.dist.dsts = dnearneigh(afghan.centers, 0, max_dsts+100, longlat=T)
##creates the distance based neighbor list with the padded threshold
##################################
#####################Moran plots of spatial lags
par(mfrow = c(2, 2))##create 2x2 window for the 4 plots
mp1<-moran.plot(afghan$foodinsecu,nb2listw(afghan.dist.250),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afghan Food Insecurity", main="Distance Band 250km")
mp2<-moran.plot(afghan$foodinsecu,nb2listw(afghan.dist.dsts),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afghan Food Insecurity", main="Distance Band: min+100km", sub="*minimum distance to include all center points + 100km", cex.sub=.8, cex.lab=.8)
mp3<-moran.plot(afghan$foodinsecu,nb2listw(afghan.knn1),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afghan Food Insecurity", main="K=1 Nearest Neighbor")
mp4<-moran.plot(afghan$foodinsecu,nb2listw(afghan.knn2),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afghan Food Insecurity", main="K=2 Nearest Neighbor")
########## moran's I test of autocorr
mt1<-moran.test(afghan$foodinsecu,nb2listw(afghan.dist.250, style="W"))
mt2<-moran.test(afghan$foodinsecu,nb2listw(afghan.dist.dsts, style="W"))
mt3<-moran.test(afghan$foodinsecu,nb2listw(afghan.knn1, style="W"))
mt4<-moran.test(afghan$foodinsecu,nb2listw(afghan.knn2, style="W"))
####################This is just me being fancy pants and having
###################R make the output into a nice latex table
library(xtable)
tests <- list(mt1, mt2, mt3, mt4) ##the four tests above, filled into the table with a loop
res1 <- matrix("", ncol=5, nrow=4)
rownames(res1) <- c("Dist=250", "Dist=100+min", "K=1", "K=2")
colnames(res1) <- c("$I$", "$E(I)$", "$var(I)$", "st. deviate", "$p$-value")
for (i in 1:4) {
res1[i, 1:3] <- format(tests[[i]]$estimate, digits=3)
res1[i, 4] <- format(tests[[i]]$statistic, digits=3)
res1[i, 5] <- format.pval(tests[[i]]$p.value, digits=2, eps=1e-8)
}
print(xtable(res1, align=c("c", rep("r", 5))), floating=TRUE,
sanitize.text.function = function(SANITIZE) SANITIZE)
Sunday, May 9, 2010
For the R code:
x4 = runif(250,0,99)
y4 = rnorm(250,50,15)
mypoints = cbind(x4,y4)
write.csv(mypoints,file="mypoints.csv")
The points are a combination of a uniform and a normal distribution.
Patterns: the 'broad' distribution is captured in each figure, a central density running down the center with a slight skew toward the top. However, at each rescaling from 3x3 to 25x25 the detail of the distribution changes. From 3x3 up to 10x10, depending on what kind of conclusions we are trying to draw from the information, our level of precision increases without necessarily leading us into false inferences.
Rather counter-intuitively, the possibility of error seems to increase at the higher levels of detail.
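For reference, a minimal sketch (not the code behind the original figures, which were made elsewhere) of how such grid counts could be tabulated in R from the mypoints.csv generated above:
pts <- read.csv("mypoints.csv")
quadrat.counts <- function(x, y, n) {
cx <- cut(x, breaks=seq(min(x), max(x), length.out=n+1), include.lowest=TRUE)
cy <- cut(y, breaks=seq(min(y), max(y), length.out=n+1), include.lowest=TRUE)
table(cy, cx) ##an n-by-n grid of point counts
}
quadrat.counts(pts$x4, pts$y4, 3) ##coarse 3x3 grid
quadrat.counts(pts$x4, pts$y4, 25) ##fine 25x25 grid: many more cells, far sparser counts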
. . . . continuing
Wednesday, May 5, 2010
Project 4
Below is a sketch of health insurance coverage by state for 2008. The data come from the Kaiser Family Foundation (http://www.statehealthfacts.org/). Figure 1 plots state spending per capita against the percentage of the non-elderly population without health coverage. Alaska clearly has a strong effect on the regression line, so lines are given both with and without Alaska. There is a lot of variation about the mean, but the plot does suggest a modest association between state spending and health coverage. Figure 2 plots the poverty level, and here we see a rather strong relationship. Certainly this is to be expected, yet some states, such as Texas or Massachusetts, are clearly performing above or below the trend. Figures 3 and 4 look at some of the regional variation in health coverage trends. In Figure 4, rather than the poverty level, I have plotted the unemployment rate by state. In Figures 3 and 4 there seems to be a different dynamic in the South as compared to the rest of the nation. In Figure 3, the downward-sloping line for the West is again caused by Alaska.
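A minimal sketch (file and column names are assumed, not from the original post) of how the with/without-Alaska lines in Figure 1 could be drawn in R:
ins <- read.csv("statehealth.csv") ##assumed columns: state, spendpc, pctuninsured
plot(ins$spendpc, ins$pctuninsured, xlab="State spending per capita", ylab="% of non-elderly uninsured")
abline(lm(pctuninsured ~ spendpc, data=ins)) ##line fit to all states
abline(lm(pctuninsured ~ spendpc, data=subset(ins, state != "Alaska")), lty=2) ##refit with Alaska dropped
legend("topright", legend=c("All states", "Without Alaska"), lty=c(1,2))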
Wednesday, April 28, 2010
Assignment 3
One of the leading indicators of economic decline is new housing starts. Figure 1 below is a time-series plot of US new housing starts for single-family dwellings from 1960 to 2010. What we can see is a clear correspondence to historical recessions: the oil shocks of the early 70s, 1980, 1990, and then the most recent downturn. Interestingly, the most recent downturn quite clearly began trending downward by 2005, and rather steeply, yet the banking and real estate 'crash' did not occur until the fall of 2008. Moreover, this simple time-series graph places the current economic situation in historical perspective; from this viewpoint the downturn is quite significant and severe.
Figures 2, 3 and 4 are plots of the recession from an international perspective. Figure 2 is a snapshot of the major economic nations in Europe, Asia, Latin America and the United States. It plots the percentage change in a nation's GDP from its previous quarter's value. For instance, in the first quarter of 2009 Mexico's GDP saw a nearly 5 percentage point drop. We can see that not every region has been affected equally: Germany and Mexico suffered severely while India has maintained positive growth throughout. Figure 3 adds 5 more nations to the plot. We can see the general trend repeated, with some clear 'outliers': Mexico and Russia suffered stark declines, while India again shows continued growth. Figure 4 is a boxplot of all 9 nations, which lets us see the general mean of the group as it declines beginning in 2008. For much of 2007 all of the nations were exhibiting positive growth; by the third quarter of 2008 nearly all had dipped into negative growth. Yet we can also see that by the second quarter of 2009 positive GDP growth had again become the overall trend.
Figure 1
Figure 2
Figure 3
Figure 4
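A quick sketch (the data layout is assumed; this is not the original plotting code) of the quarter-over-quarter change behind Figures 2 through 4, for a matrix gdp with one row per quarter and one column per nation, rows in time order:
pct.change <- 100 * (gdp[-1, ] - gdp[-nrow(gdp), ]) / gdp[-nrow(gdp), ] ##percent change from the previous quarter
matplot(pct.change, type="l", ylab="% change in GDP from previous quarter") ##Figure 2/3-style lines, one per nation
boxplot(t(pct.change), las=2, ylab="% change in GDP") ##Figure 4-style boxplot, one box per quarter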
Friday, April 16, 2010
Below is a replication of Figure 5 from the 1998 article:
ME Mann, RS Bradley, MK Hughes. "Global-scale temperature patterns and climate forcing over the past six centuries". Nature, 1998
RCODE:
###Hockey stick replication
###data source http://www.nature.com/nature/journal/v430/n6995/extref/nature02478-s1.htm "nhmean.txt"
hockey <- read.table("nhmean.txt", header=TRUE) ##read the data in; column names assumed here: date, raw, recon, lower, upper
##get rid of 0s
hockey$raw[hockey$raw==0] <- NA
hockey$recon[hockey$recon==0] <- NA
### get the 1961 to 1990 thermometer mean to center the chart (matching the axis label)
x1<-subset(hockey$raw, hockey$date >= 1961 & hockey$date <= 1990)
meanx1<-mean(x1)
#### extend plot margins for labels
par(mar=c(5,5.2,4,2))
#set up the plot region first (type="n") so the grey uncertainty band can be drawn underneath the data lines; also create new ylab numbers
plot(hockey$date, hockey$raw, type="n", ylim = c(-1.1, 0.8), las=1, ylab="Departure in temperature (C)\nfrom the 1961 to 1990 average", xlab="Year\n Chris Miner Geo 299B ", yaxt= "n")
axis(2, at=meanx1, las=1, labels="0.0")
axis(2, at=meanx1-0.5, las=1, labels="-0.5")
axis(2, at=meanx1+0.5, las=1, labels="0.5")
axis(2, at=meanx1-1, las=1, labels="-1.0")
##create "standard error" polygon
polygon(x= c(hockey$date, rev(hockey$date)), y=c(hockey$lower, rev(hockey$upper)), col="grey", border=FALSE)
##thermometer record, drawn on top of the band
lines(hockey$date, hockey$raw, col="red")
###### add text
rect(1550, -0.9, 1997, -1.1, bor = TRUE, col = "white")
text(1550, -1.02, "Data from thermometers (red) and from tree rings,\ncorals, ice cores and historical records (blue).", pos = 4, adj = 0)
text(1700,.7, "NORTHERN HEMISPHERE")
#### add trend line
lines(hockey$date, hockey$recon, col="dodgerblue3") ### go dodgers
abline(h=meanx1) #### mean value line
###function to deal with lowess' problem with NAs blahhhh
lowess.na <- function(x, y, f = 2/3, ...) {
ok <- !is.na(x) & !is.na(y) ##keep only complete (x, y) pairs
lowess(x[ok], y[ok], f = f, ...) ##lowess on the complete cases
}
lines(lowess.na(hockey$date, hockey$recon, f=0.04),lwd=2)
In many ways the lines of the climate debate can be drawn within the frame of one graph. In 1998, Mann, Bradley and Hughes published the now famous 'hockey stick' reconstruction of 600 years of temperature patterns across the Northern Hemisphere. The graph itself is innocently buried within a dense scholarly discourse full of eigenvalues and principal component analysis. It is one figure among many in a short article, and as a representation of the data it is arguably the least visually appealing and carries the least information. What it does do, however, is summarize the thrust of the entire article and visually provide near-conclusive evidence that there is a historically unique change in climate and that it is caused by human activity.
The questions the article sets out to answer are fundamental to the climate debate: is the earth getting warmer; if it is, is that change within the normal variability of long-term trends; and finally, is human activity involved in that change? The 'hockey stick' graph, whether the authors intended it to or not, answers all of these questions, and it does so forcefully and emphatically. It depicts a clear monotonic growth in temperature over the 20th century. This growth has gone beyond the visible trends of prior centuries, and most importantly, the beginning of the current trend seems to correlate exactly with the growth of industry in the northern hemisphere. Yet, and this is the reason we may question the authors' intentions, the graph is stunningly clear and conclusive while the text of the article speaks of the uncertainty and provisional nature of the findings.
The graph itself has given rise to its own controversy: it is either a global fraud, one part of a scientific discourse, or the philosopher's stone of climate change. Critics point to the highly aggregated nature of the data, layers of uncertainty built one upon another. They argue that it smooths and attenuates long-term trends, which exaggerates the data from modern thermometer readings by comparison; that its data are filled with measurement error; and that it rests on a highly non-random sample of both the proxies and the raw temperature data, which introduces more severe autocorrelation than the authors admit along with problems of endogeneity in the temperature readings. The defenders reply that the methodology and data were open to inspection, that levels of uncertainty were well explicated in that study and in those that followed, that further studies have built on this evidence, and finally that, whatever reasonable level of uncertainty you put on the data, something worrying is going on.
However, leaving the climate debate aside and focusing on the 'hockey stick' itself, the most telling critique of the authors of the original study may be not that their study is flawed but that they underestimated or ignored the power a visual representation of data can have. In this debate the graphics overpowered the words. The figure gave a strong impression of certainty not echoed in the text, and thus, as the IPCC recommendations warned, "More consistent estimates of the endpoints of a range for any variable would minimize misunderstandings and reduce the likelihood that interest group could misunderstand or misrepresent the findings". These misunderstandings run across interest groups both for and against the article's findings, because uncertainty does not favor either side of the debate in this case. Just as much as the problem might be overstated, it could be much worse, as scholars have pointed out. This fact has been lost in the debate sparked by the visual representation of Mann et al.'s findings.
Though I only have access to the already aggregated, mean-centered data, below is a brief discussion of some shortcomings in the 'hockey stick' graph and an attempt at improvement. First are the wholly arbitrary elements of the graph: the zero line drawn through the authors' chosen point; the coloring; the scale of the axes; the combination of a time-series line plot, a mean-centered trend line, and the backdrop of the 'confidence intervals'. The scale of the axes gives the impression of a much greater magnitude of change. The zero point of the y axis seems chosen for visual effect rather than to represent an aspect of the data (why 1961? why not 1902, or the whole timeline?). The uncertainty in the graph is grayed out and reduced to a background feature.
Yet in my view, one of the most visually misleading aspects of the graph is the trend line drawn through the mean of the data points. First of all, these are not observations; they are point estimates. In a frequentist approach (which I am assuming they are taking), if portraying the level of uncertainty is high on our agenda, then a mean trend line is likely in this case to give an overly confident visual impression. For instance, the standard deviation intervals represented in gray tell us little about how confident we are of where the true value lies. What they tell us is that, if our assumptions are reasonably accurate, then 95% (or 97.5%, etc.) of intervals constructed this way will cover the true value. We do not know the probability of where that value lies, or how likely it is to be at the center or the extreme of a given interval. Further, no uncertainty is attached to the thermometer temperature readings; they are treated as if they were an accurate census of the population. However, the thermometer readings are as much a sample from a population as the proxy data, and thus are at risk for all of the problems suffered by the proxy data: spatial and serial autocorrelation, measurement error, missing data, and so on.
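A toy simulation (illustration only, not tied to the temperature data) of the frequentist point above, showing that it is the interval that varies from sample to sample while the true value stays fixed:
set.seed(1)
true.mean <- 0
covered <- replicate(10000, {
s <- rnorm(30, true.mean, 1) ##draw a fresh sample
ci <- t.test(s)$conf.int ##its 95% confidence interval
ci[1] <= true.mean && true.mean <= ci[2] ##did this particular interval cover the truth?
})
mean(covered) ##close to 0.95: coverage is a property of the procedure, not of any one interval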
Below is a WEAKLY ATTEMPTED improvement on the original graph:
I'm trying to capture a little more of the uncertainty. Since the proxy data covers nearly the whole time span, the thermometer readings are left out, as overlaying them on the proxy data obscures the trend the proxy data is telling. Second, the zero line is centered at the mean of the thermometer data, since the graph claims to show the deviation from post-industrial temperatures. Finally, the upper and lower bounds are highlighted with lowess lines (I still need to play with the smoothing, as some of the points fall outside the bounds).
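A sketch of how that improved figure could be drawn (this is my reading of the description above, not the post's actual plotting code; it reuses the hockey data frame and the lowess.na helper defined earlier):
mean.raw <- mean(hockey$raw, na.rm=TRUE) ##center the zero line on the thermometer-era mean
plot(hockey$date, hockey$recon, type="l", col="dodgerblue3", ylab="Departure in temperature (C)", xlab="Year") ##proxy reconstruction only
abline(h=mean.raw) ##zero line at the post-industrial mean
lines(lowess.na(hockey$date, hockey$upper, f=0.1), lty=2) ##smoothed upper bound (f still needs tuning)
lines(lowess.na(hockey$date, hockey$lower, f=0.1), lty=2) ##smoothed lower bound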
Tuesday, April 6, 2010
Week 1 data visualization in Political Science
In political science, much of the past research on the outbreak of civil wars has been conducted using aggregate state-level demographic and economic data in cross-national comparisons. Using a handful of explanatory variables such as GDP per capita, ethnic fractionalization, or colonial history, these studies have generally coalesced around explaining the probability of a nation having a civil war with factors such as natural resource reliance, low per capita GDP, or difficult terrain. Moreover, there seems to be a lack of correlation with religious or ethnic divisions, regime type, or economic inequality.
Below is a graph from a highly influential article on civil war onset. Pictured are probabilities of a nation experiencing a civil war, derived from 220+ onsets along a dozen explanatory variables for over 150 nations spanning 40 years. One of the important and somewhat controversial findings is that ethnically divided nations are not significantly more at risk for violence. This graph attempts to summarize that core finding by placing the probability associated with ethnicity alongside that associated with different levels of per capita GDP.
How well is this graph presenting the information? Is it getting in the way of the data, or is it helping to identify patterns? First, it is unclear exactly what it is telling us. What does the probability mean; is it a lot, a little? It is quite difficult to tell how varying ethnicity changes the probability. What is the effect of varying both variables? What is the variation across nations or regions? Is it so abstract and highly aggregated that we lose any feel for substantive significance? Could this be better represented by a simple table? It may even be misleading: is that really the relationship between the two, or is the level of aggregation obscuring important details? What if we had several geographically unequal societies in which the civil wars were occurring in the wealthy regions?
Armed Conflict Location and Events Dataset (ACLED)
The level of aggregation of much of the civil war literature has bothered many scholars. Below is a figure derived from a new data set attempting to overcome this problem. ACLED is cataloging information on individual civil war events along with locations, dates, participants, context and outcomes. The ovals represent the activity of varying rebel groups while the map shading represents population density.
There seems to be a high correlation between population density and rebel activity. However, is it population density or the border with Rwanda and Uganda that is the important factor? It is not readily apparent given the way the non-DRC countries are 'left out'. Secondly, are the colored circles the best way to represent the second layer of information? Are the circles distracting? How dependent are the areas of the circles on outliers, or do they represent a more or less even dispersion?
The last figure uses the ACLED data for an analysis of the correlation between violent events and variables such as wealth, the location of diamond mines, distance from the capital, ethnic make-up, and so on. While the unpublished version is in color, the published version is black and white (which is what most of the world will see; what would Tukey say?). Unlike the first chart, the disaggregated information allows us to ask questions such as: can we say diamonds are correlated with civil wars when the conflict site is nowhere near the source of diamonds? Yet the figure poses its own potential distortions. For example, it is hard to distinguish between the size of a bubble and the number of war events, so to the eye a few events take on disproportionate significance. The majority of events take place around the capital, Monrovia; however, the figure gives the impression of a greater spread of events about the nation.