Tuesday, May 25, 2010

Following three plots are from R

Following two plots are from ArcGis

Remaining plots are from GeoDa




Obama Kerry multivariate LiSA


Spatial Lag Scatter plots of distance vs queen contiguity



By far the easiest package to use was the GeoDa software. In addition to ease of use,  GeoDa seems to provide tools that ArcGis doesn't--in particular the multivariate feature. In terms of data exploration, GeoDa not only saved time by quickly being able to change parameters but it allowed for a very detailed and flexible way to compare changes and techniques side by side. While R certainly had the potential of being the  most flexible in that you could do just about whatever you wanted as long as one took the time to track down the functions, it lacked in what can often be the most important aspect of exploring data: visualizing in near real time and comparison. While one can do the same in R, the ease with which it can be done in GeoDa helps greatly in taking it in and grasping the big picture.

Prerequisites, aside from the obvious of knowing how to use the software, having some knowledge of what is going on behind the pixels not only allows one to 'know' what they're trying to say with the data but the possibilities and limitations of what can be said. However, after having played with spatial weights, the most important prerequisite is knowledge of ones subject. If one doesn't come with subject mater expertise and a reasonable justification of the assumptions built into the model, then its all "gobblygoop and magic".

Wednesday, May 19, 2010

With Spatial auto-correlation, the value of one variable is in part a function of another variable. For example, if we know the value of "California" we can predict something of the value of Arizona. With Spatial auto-correlation, moreover, the problem is not just in determining how much one is influencing the other, but upon whom the influence is occurring. As we see in the results below, the pattern of correlation changes depending on what our definition of 'nearness' is. In the lagged plots is also evidence that not only is the existence of a relationship changed but the nature of that relation, as observations change quadrants based on how we measure what is a neighbor.


########
#R code#
########
#load packages
library(spdep)
library(maptools)
library(classInt)
library(RColorBrewer)

#load shape file
afghan <- readShapePoly("afghan.shp",proj4string=CRS("+proj=longlat"))

coordinates(afghan)
#put centroids into a file and make it a data frame;
centers = coordinates(afghan)
centers = data.frame(centers)
afghan.centers = coordinates(afghan)

#######################################################
#determine the k nearest neighbors for each point in afghan.centers;
k=1
knn1 = knearneigh(afghan.centers,k,longlat=T)

#create a neighbors list from the knn1 object;
afghan.knn1 = knn2nb(knn1)

##K=2 nearest neighbor
knn2 = knearneigh(afghan.centers,2,longlat=T)

#create a neighbors list from the knn2 object;
afghan.knn2 = knn2nb(knn2)


########################################################

#create a distance based neighbors object (afghan.dist.250) with a 250km threshold;
d = 250
afghan.dist.250 = dnearneigh(afghan.centers,0,d,longlat=T)

##########Playing with options: experiment with nbdist,
###########minimum distance to include all center points+100km

dsts<-unlist(nbdists(afghan.dist.250, afghan.centers))
##finds minimum sll included as neighbors distance
max_dsts<-max(dsts)## finds the max value of the minumum distances
afghan.dist.dsts = dnearneigh(afghan.centers,0,100*max_dsts, longlat=T)
##creates distance based neighbor list
##################################

#####################Moran plots of spatial lags
par(mfrow = c(2, 2))##create 2x2 window for the 4 plots

mp1<-moran.plot(afghan$foodinsecu,nb2listw(afghan.dist.250),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afgan Food Insecurity", main="Nearest Neighbor 250km")

mp2<-moran.plot(afghan$foodinsecu,nb2listw(afghan.dist.dsts),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afgan Food Insecurity", main="Nearest Neighbor 100km", sub="*minimum distance include all center points+100km", cex.sub=.8, cex.lab=.8)

mp3<-moran.plot(afghan$foodinsecu,nb2listw(afghan.knn1),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afgan Food Insecurity", main="K=1 Nearest Neighbor")

mp4<-moran.plot(afghan$foodinsecu,nb2listw(afghan.knn2),labels=afghan$PRV_NAME, ylab="Spatial Lag", xlab="Afgan Food Insecurity", main="K=2 Nearest Neighbor")


########## moran's I test of autocorr
mt1<-moran.test(afghan$foodinsecu,nb2listw(afghan.dist.250, style="W"))

mt2<-moran.test(afghan$foodinsecu,nb2listw(afghan.dist.dsts, style="W"))

mt3<-moran.test(afghan$foodinsecu,nb2listw(afghan.knn1, style="W"))

mt4<-moran.test(afghan$foodinsecu,nb2listw(afghan.knn2, style="W"))

####################This is just me being fancy pants and having
###################R make the output into a nice latex table

library(xtable)
res1 <- matrix("", ncol=5, nrow=4)
rownames(res1) <- c("Dist=250", "Dist=100+min", "K=1", "K=2")
colnames(res1) <- c("$I$", "$E(I)$", "$var(I)$", "st. deviate", "$p$-value")
res1[1, 1:3] <- format(mt1$estimate, digits=3)
res1[1, 4] <- format(mt1$statistic, digits=3)
res1[1, 5] <- format.pval(mt1$p.value, digits=2, eps=1e-8)
res1[2, 1:3] <- format(mt2$estimate, digits=3)
res1[2, 4] <- format(mt2$statistic, digits=3)
res1[2, 5] <- format.pval(mt2$p.value, digits=2, eps=1e-8)
res1[3, 1:3] <- format(mt3$estimate, digits=3)
res1[3, 4] <- format(mt3$statistic, digits=3)
res1[3, 5] <- format.pval(mt3$p.value, digits=2, eps=1e-8)
res1[4, 1:3] <- format(mt4$estimate, digits=3)
res1[4, 4] <- format(mt4$statistic, digits=3)
res1[4, 5] <- format.pval(mt4$p.value, digits=2, eps=1e-8)
print(xtable(res1, align=c("c", rep("r", 5))), floating=TRUE,
 sanitize.text.function = function(SANITIZE) SANITIZE)

Sunday, May 9, 2010


For the R code:
x4 = runif(250,0,99)

y4 = rnorm(250,50,15)
mypoints = cbind(x4,y4)
write.csv(mypoints,file="mypoints.csv")

The points are a combination of a unifrom and a normal distribution.
Patterns: The 'broad' distribution is captured in each figure, a central density running down the center with slight skew to the top. However, at each rescaling from 3x3 to 25x25 the detail of the distribution changes. At 3x3 to 10x10, depending on what type of conclusions we are trying to draw from the information, our level of precision is increasing, however, we may not necessarily be making false inferences.
 
Rather counter-intuitively the possibility of error seems to increase at the higher detail.
 
. . . . continuing

Wednesday, May 5, 2010

Project 4

Below is a sketch of health insurance coverage by state for 2008. The data is from the Kaiser Family Foundation http://www.statehealthfacts.org/. Figure 1 plots State spending per Capita and the percentage of the non-elderly population without health coverage. Alaska clearly has a strong effect on the regression line, so lines are given with and without Alaska. There is a lot of variation about the mean, however from the plot there does seem to be a middling association between state spending and health coverage. Figure 2 plots poverty level. Here we see a rather strong relationship. Certainly this is to be expected. However there are some states which are clearly performing above or below the trend such as Texas or Massachusetts. Figures 3 and 4 are looking at some of the regional variation in health coverage trends. In figure 4 rather than poverty level I have plotted the unemployment rate by state. In figures 3 and 4, there seems to be a different dynamic in the South as compared to the rest of the nation. In figure 3, the downward sloping line for the west is again being caused by Alaska.