## Monday, October 8, 2012

### Simulating Spatial Data

# Spatial data tags are an increasingly recorded for data that is being generating as a result of widescale implementation of GPS technology.

# In this post I will present a simulation in which the population is distributed around a single town center.

# The important characteristic of the population is that their characteristics are autocorrelated.  This can either be in response values or in error terms.

# Let's first specify an imaginary population.

population = 1000

# Specify the location in which the population is centered

center.X = 0
center.Y = 0

# Dispersion, the average distance of an individual from the center

dispersion.km = 40

# Now let's generate our population position

# First off.  We would like our population to be distributed in a circle around the center.  So we need to pick an angle in radians.

angle <- pi*runif(population)*2

# This is the distance from the center that each member of the population will be.  If this were set to a constant then the following commands would end up drawing a circle.
distance <- rnorm(population)*dispersion.km

mydata <- data.frame(x=center.X+distance*cos(angle), y=center.Y+distance*sin(angle))

smoothScatter(mydata, nrpoints=0, main="Population Density")

# Let's imagine that there is an unobserved variable called "soil quality" which varies in both the x and y.

mydata\$soil.quality <- sin(mydata\$x*pi/50+mydata\$y*pi/100)+1

# Farm size is random. But on average smaller as the plots get closer to the city center.

mydata\$size <- runif(population)/4+abs(distance)/100

# In order to produce some cool graphs we will need to install a new package:
install.packages("ggplot2")
library(ggplot2)

qplot(x, y, data=mydata, size=size, colour = soil.quality, main="Farms tend to be smaller near town")

# Rainfall also is spatially correlated.
mydata\$rainfall <- sin(mydata\$x*pi/60+mydata\$y*pi/160) + sin(mydata\$y*pi/60)+1

qplot(x, y, data=mydata, size=size, colour = rainfall, main="Rainfall is also distributed spatially")

# Now, let's imagine some technology usage, say fertilizer.

mydata\$fert.use = mydata\$rainfall-mydata\$soil.quality+mydata\$size+rnorm(population) + 4

qplot(x, y, data=mydata, size=size, colour = fert.use, main="Fertilizer use as a result should also be spatially distributed")

# Now let's see if we can't test if we can if fertilizer use is spatially correlated.

# The trick is figuring out what that means.

# I will define it as this, spatial correlation is the test to see if the use of fertilizer by one person is correlated with the use of fertilizer by another person.

# So I need to figure out a way of finding out what the closest person fertilizer usage is.

neighbor.fert.use <- neighbor.x <- neighbor.y <- 0

for (i in 1:population) {
# We will look at each person i and find the person which is closest.
# First let us constuct a variable that measures distance.
# This is that standard Euclidean distance formula.
distance.from.i = ((mydata\$x[i]-mydata\$x)^2 + (mydata\$y[i]-mydata\$y)^2)^.5

# The following set of nested statements can be somewhat confusing.  Read from the inside statement first.
# rank() will create a vector of length population that ranks distance from i.
# Outside of that is a logical operator that will create a vector of length population which is all False except from rank == 2 which is true which means that mydata\$fert will be drawn from that single value of rank==2.
# Finally we assign it to the ith place in the neighbor.fert.use vector.
neighbor.fert.use[i]<- mydata\$fert.use[rank(distance.from.i, ties.method="random")==2]
neighbor.x[i]<- mydata\$x[rank(distance.from.i, ties.method="random")==2]
neighbor.y[i]<- mydata\$y[rank(distance.from.i, ties.method="random")==2]
}

plot(mydata\$x,mydata\$y, main="Arrows indicate closest farm")
for (i in 1:population) arrows(x0=mydata\$x[i], x1=neighbor.x[i],y0=mydata\$y[i], y1=neighbor.y[i],length = 0.075)

# From the plot we can see that the farm matching algorithm specified above appears to be working well.  We can see from that plot that every farm has a farm that is closest to it however there is some farms that are not the closest farm to any other farm (which makes sense).

cor(neighbor.fert.use, mydata\$fert.use)
# From the correlation between fertilizer uses between each farm and it's closest neighbor I get a correlation of around .4
# This indicates that fertilizer use is spatially correlated.  However, as a result of how we set up the model this correlation is rather simple to handle.  It is not because a neighbor uses fertilizer that a farmer will use fertilizer but rather the effect of unobservables which is driving the use of fertilizer.  Mainly rainfall variation and soil quality variation.

# Imagine that we observe rainfall but not soil quality.  Let's see how well we can predict fertilizer usage.

# first let's remember how fertilizer use is calculated:
# mydata\$fert.use = mydata\$rainfall-mydata\$soil.quality+mydata\$size+rnorm(population) + 4

summary(lm(fert.use~rainfall+size,data=mydata))
# We can see that the coefficient on rainfall is too small.  This is because within the construction of this sample rainfall is correlated with soil quality.
cor(mydata\$rainfall,mydata\$soil.quality)

# Let's try to include technology choice of closest farm neihbor as a proxy for the spatial correlation of soil quality.
summary(lm(fert.use~rainfall+size+neighbor.fert.use,data=mydata))

# We can see that the coefficient on rainfall is even smaller.  This indicates that controlling for the neighbors choice is not helping.  Why is that?

# Sure, controlling for the neighbors fertilizer use is controlling for some of the soil quality variable.  However, what else is in the neighbor's decision = rainfail+size.  Since size is correlated spatially and rainfall is correlated spatially controlling for the neighbor's choice in effect controls for some of the effects of the explanatory variables.  Thus, both rain and size variables suffer as a result of controlling for the nearest neighbor's technology choice.