Environmental Risk Management
Spatial Epidemiology
Assessed practical:
Point Pattern Analysis
Tutor: Dr Elias Symeonakis (E410a)
e.symeonakis@mmu.ac.uk
26 January 2018
Introduction
In GIS we are able to utilise and analyse a variety of spatially‐referenced datasets. Typically
these spatial or geographical datasets are represented by what we term spatial entity data
models* or entities for short. Entities are essentially graphical components used by the
computer to represent the different phenomena of interest within the chosen study area.
There are several types of entity described below (after Chang (2003) and Heywood et al.
(2006)) including:
Point – a zero dimensional feature represented by a single coordinate XY pair or an
individual pixel
Line – a one dimensional feature which represents length and is encoded either as a
string of coordinate XY pairs or a linear series of contiguous pixels
Polygon – a two dimensional feature which has both an area and a perimeter. It is
represented either using a series of connected coordinate XY pairs with the same
start and end point coordinate or a cluster of contiguous pixels
Surface – this is a special form of entity which represents continuous phenomena
either using a raster grid or a Triangulated Irregular Network (TIN)
Network – this is another specialist entity representation which recognises the
interconnection of line features
*You can find a fuller explanation of the spatial data modelling process (including entity
selection) in Heywood et al. (2006:71-107).
In this practical exercise we will focus entirely on the point entity data model. Points are used
to represent the spatial location of events or activities known to have occurred in a defined
geographical area (Bailey & Gatrell 1995, Boots & Getis 1988). These are typically individual
events, or features, such as the centroid location or address point of a person (or persons)
affected by a particular illness or disease. Point-based analysis is very common in spatial
analysis, particularly in health, crime and ecology: many academic papers are available on
the subject, and several textbooks focus on this aspect of spatial analysis alone.
Point pattern analysis is a common procedure where centroid (or point location) data form
the primary dataset (Birkin et al. 1996). Researchers then employ a series of statistical
methods in an attempt to determine whether any patterns exist in the spatial or geographical
distribution of points (i.e. events) in the study area. Spatial point patterns specifically include
“a set of locations, irregularly distributed within a designated region and presumed to have
been generated by some form of [random or other] mechanism” (Diggle 2003:vii).
Rather than rely upon simple visual interpretation of the point distribution(s) which may
suggest specific patterns where none truly exists, specialist statistical methods are employed
to help identify whether any discernible point patterns exist and to help establish the
possible underlying causes for any evident spatial behaviours and patterns.
Types of point pattern
In undertaking point pattern analysis the user is exploring the dataset for evidence of specific
spatial or geographic properties. From this, the user can then begin to establish whether
there are specific processes which have generated the observed point pattern. Typically this
involves the study of the dispersion of points (location of point patterns with respect to the
geographical study area) or alternatively the arrangement of points with respect to each
other (Boots and Getis 1988). To understand these properties more clearly it is important to
define the possible patterns expected in a point pattern map display (Figure 1).
Figure 1. Point patterns (after Boots & Getis 1988)
The point pattern conditions include clustering (or aggregation), regularity, and randomness,
and are defined in the box below.
Clustering (Aggregation)
A concentration of events or objects (O’Sullivan & Unwin 2003), where the points are more
tightly grouped together than would be expected from a completely random pattern.
Dispersion (Evenly spaced)
The events or objects appear to be uniformly, or evenly, spaced. The observed average
distance between the points is also greater than that found within a completely random
pattern.
Randomness
Diggle (1983) describes the pattern of Complete Spatial Randomness (CSR) where the
points are characterised by uniformity and independence. More simply, the pattern of points
occurs by chance, with no variation in intensity across the study area. Boots & Getis (1988)
note that CSR is doubtful in real world situations where the likelihood is that no single
process (acting upon the points) is dominant, giving the appearance of a CSR pattern.
When studying an area of interest, it is useful to adopt point pattern analysis methods. Very
often data are collected at a number of discrete locations. Usually, we attempt to extrapolate
from the limited data to obtain information about the wider population or region. The analysis
of these points can allow us to identify whether there is any definable spatial component in
their behaviour. Take the following crime-based example: your hometown has recently
suffered a spate of break-ins, and the local police authority want to obtain further information
to help them catch the criminals and reduce the incidence of burglary. It is highly likely that
the police workers will record the burglaries in point form employing the household location as
the unit of observation. Using spatial statistics and GIS the police can begin to piece together
criminal activity in the area. First of all the police will be very interested in determining
whether there is any pattern to these burglaries. For instance, are there any localities that are
more affected than others by the criminal activity, i.e. hotspot areas where a greater number
of burglaries have been recorded? Obviously the distribution of points is likely to be affected
by the type of built environment and population numbers and dynamics. Once this factor has
been accounted for, the researcher or police worker can begin to examine the distribution of
point data to see whether there are any discernible patterns.
In this examination the police worker can start to determine whether there are clusters of
criminal activity. Using this evidence we can begin to hypothesise about the nature of the
burglaries, and potentially establish the reasons for such increased activity. For example, are
any areas affected in particular? And if so, is there any additional evidence that might be able
to help explain this? The following list of bullet points highlights some potential lines of
inquiry:
Higher incidence of burglaries in areas of socio-economic deprivation, potentially
as a result of poorer home security
Higher incidence in student areas, where multiple occupancy (e.g. flats in
renovated houses) is common offering greater opportunity for criminals
The use of such data exploration techniques to help develop hypotheses is a fairly typical aspect
of spatial data analysis. This type of technique can be used to help build spatial process models
and improve our understanding of the phenomenon under observation. Importantly, this is often
an iterative process, with many steps involved in the development of the spatial model.
Furthermore, there are a variety of different point pattern analysis methods available to the GIS
user and some of the key techniques are discussed below with example exercises for you to
complete later.
Distance Measures
One of the most common ways to detect any pattern within a point distribution is to examine the
distances – or spaces – between points (Gatrell et al. 1996), and compare these to another, typically
random, arrangement. Although relatively straightforward to calculate, such measures are particularly
effective in demonstrating what are known as second order effects, described by O’Sullivan and Unwin
(2003: 79) as indicative of some form of “interaction between locations”. In other words, second order
effects demonstrate local patterning or variation which
is distinct from the global pattern or first order effects (Bailey and Gatrell 1995). Two such distance
measures commonly used in point pattern analysis are described below. Nearest neighbour analysis is
explained first followed by Ripley’s K statistic.
Nearest Neighbour Analysis
Nearest neighbour analysis is based upon a solid geographical principle that those objects or
phenomena that are located in close proximity to one another are likely to share similar
properties. This procedure describes the point pattern through calculating the mean distance to
each point’s nearest neighbour (Kitchin and Tate 2000). Then, using relatively simple statistical
analyses that compare the average distance(s) between closest neighbouring point observations
with those of a previously known pattern (typically the analyst would select a random pattern in
this type of analysis) it is possible to establish whether there are clustering or dispersed patterns
within the point distribution. Cluster patterns are defined by the short distances between proximal
neighbouring points, while dispersed point patterns display greater observed average distances
between points when compared to a random distribution. To calculate the expected
average nearest neighbour distance the following equation is used:
rexp = 1 / (2√(n / A))
Where A is the area of the study location and n is the number of points in the particular
distribution.
Lee and Wong (2001) identify another useful statistic based upon the average distance
information ‐ this is the randomness statistic, and is a simple ratio between observed and
expected distance between point locations.
R = robs / rexp
Where robs is the observed average distance between nearest neighbours and rexp is the
expected average distance between nearest neighbours using the basis of the theoretical
pattern.
Using this statistic it is relatively easy to determine whether point distributions follow
clustered, random or dispersed distributions. Where R is less than 1 the data set shows
a tendency towards clustering, values close to 1 suggest a random pattern, and R values
greater than 1 indicate dispersed spatial behaviour (evenly spaced events).
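The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the routine used by any particular GIS package: the function name and the coordinates are hypothetical, distances are straight-line (Euclidean), and no correction is made for edge effects.

```python
import math

def nearest_neighbour_index(points, area):
    """Return (r_obs, r_exp, R) for a list of (x, y) points in a study area."""
    n = len(points)
    # Observed: mean distance from each point to its closest other point.
    nearest = []
    for i, (xi, yi) in enumerate(points):
        d = min(math.hypot(xi - xj, yi - yj)
                for j, (xj, yj) in enumerate(points) if j != i)
        nearest.append(d)
    r_obs = sum(nearest) / n
    # Expected under a random pattern: 1 / (2 * sqrt(n / A)).
    r_exp = 1.0 / (2.0 * math.sqrt(n / area))
    return r_obs, r_exp, r_obs / r_exp

# Hypothetical coordinates (km) in a 100 x 100 km area, invented for illustration.
clustered = [(10, 10), (11, 10), (10, 11), (11, 11)]
regular = [(25, 25), (75, 25), (25, 75), (75, 75)]

for name, pts in (("clustered", clustered), ("regular", regular)):
    r_obs, r_exp, R = nearest_neighbour_index(pts, 100 * 100)
    print(name, round(r_obs, 2), round(r_exp, 2), round(R, 2))
```

The tightly grouped points give an R well below 1, while the evenly spaced points give an R above 1, matching the interpretation described above.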
Nearest Neighbour Analysis: A Worked Example
So taking a hypothetical example of a study of town and city locations within a 100 x 100
kilometre study area, we can begin to establish the mean distances of the different events
and compare this to the expected average distance between nearest neighbours. The region
of interest contains 8 major towns and cities across its 10,000 square kilometre study area
as shown in Figure 2. The location of each settlement is provided in Table 1, as are the
details of distance to closest neighbour. The calculation of the nearest neighbour index is
given after the table.
Figure 2. Point display of settlement distribution in hypothetical study area
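As a partial check on the arithmetic, the expected mean nearest neighbour distance for this study area can be worked out directly from the formula given earlier, using only the n = 8 settlements and A = 10,000 square kilometres stated above. The observed mean distance depends on the distances in Table 1 and is not recomputed here.

```latex
r_{exp} = \frac{1}{2\sqrt{n/A}}
        = \frac{1}{2\sqrt{8/10\,000}}
        \approx \frac{1}{2 \times 0.0283}
        \approx 17.68\ \text{km}
```

An observed mean distance below roughly 17.68 km would therefore give R < 1 (a tendency towards clustering), and a larger observed mean would give R > 1 (dispersion).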
Ripley’s K Function
One of the problems associated with the nearest neighbour statistic is that it only considers the
closest neighbour and does not consider other spatial scale effects (O’Sullivan and Unwin 2003,
Mehrer and Westcott 2006). The K function originally developed by Ripley (1976) provides an
opportunity to explore spatial patterning at different spatial scales within the chosen study area.
To calculate K we must visit every event or point in the study area and then establish the mean
number of other points falling within a set distance of the start point (Bailey and Gatrell 1995).
Typically this distance is defined as a circle of radius d and is repeated for different radius values
(O’Sullivan and Unwin 2003) (Figure 3). The mean counts for each circle are then divided by
what is known as the mean intensity of the process – which is in effect the total number of events
or points divided by the study area (Fotheringham et al. 2000).
Figure 3. Determining the K function (source: Bailey & Gatrell (1995:93))
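The counting procedure described above can be sketched as follows. This is a naive estimator written for illustration only: it applies no edge correction (which dedicated software normally does), and the function name and coordinates are hypothetical. Under complete spatial randomness the expected value of K at distance d is πd², which the sketch uses for comparison.

```python
import math

def ripley_k(points, area, distances):
    """Naive K estimate (no edge correction) for each distance d."""
    n = len(points)
    intensity = n / area  # mean number of events per unit area
    k_values = []
    for d in distances:
        # Count, for every event, the other events within distance d.
        pairs = sum(
            1
            for i, (xi, yi) in enumerate(points)
            for j, (xj, yj) in enumerate(points)
            if i != j and math.hypot(xi - xj, yi - yj) <= d
        )
        mean_count = pairs / n
        k_values.append(mean_count / intensity)
    return k_values

# Hypothetical events: two tight pairs in a 100 x 100 study area.
events = [(10, 10), (12, 10), (80, 80), (80, 83)]
distances = [5, 20]
observed = ripley_k(events, 100 * 100, distances)
expected = [math.pi * d ** 2 for d in distances]  # K under a random pattern

for d, obs, exp in zip(distances, observed, expected):
    print(d, round(obs, 1), round(exp, 1))
```

Repeating the calculation for a range of distances is what allows patterning to be examined at different spatial scales.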
The results of the K function can be presented graphically and help to show at what spatial
scales different pattern behaviours (such as clustering) may occur (Figure 4). When the
observed K value is larger than the expected K value for a particular distance, the distribution is
more clustered than a random distribution at that distance (scale of analysis). When the
observed K value is smaller than the expected K value, the distribution is more dispersed than
a random distribution at that distance. When the observed K value is larger than the Higher
Confidence Envelope value, spatial clustering for that distance is statistically significant. When
the observed K value is smaller than the Lower Confidence Envelope value, spatial dispersion
for that distance is statistically significant.
Figure 4. Point pattern behaviour at different spatial scales (Source: ArcGIS 10.1 help
pages)
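One way to picture where such confidence envelopes come from is a simple simulation: generate a number of random patterns with the same number of points, compute K for each, and keep the lowest and highest values at each distance. The sketch below assumes a rectangular study area, 99 simulated patterns, a fixed random seed and a naive K estimate without edge correction; the names and the test pattern are hypothetical, and this is not presented as the procedure any particular package follows.

```python
import math
import random

def ripley_k(points, area, d):
    """Naive K estimate at distance d (no edge correction)."""
    n = len(points)
    pairs = sum(
        1
        for i, (xi, yi) in enumerate(points)
        for j, (xj, yj) in enumerate(points)
        if i != j and math.hypot(xi - xj, yi - yj) <= d
    )
    return (pairs / n) / (n / area)

def csr_envelope(n, width, height, distances, permutations=99, seed=1):
    """Lowest and highest simulated K per distance under random placement."""
    rng = random.Random(seed)
    area = width * height
    lower = [math.inf] * len(distances)
    upper = [-math.inf] * len(distances)
    for _ in range(permutations):
        sim = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        for k, d in enumerate(distances):
            value = ripley_k(sim, area, d)
            lower[k] = min(lower[k], value)
            upper[k] = max(upper[k], value)
    return lower, upper

# Hypothetical clustered pattern: 20 events on a 4 x 5 grid with 0.5 spacing.
events = [(50 + 0.5 * i, 50 + 0.5 * j) for i in range(4) for j in range(5)]
distances = [3, 10]
observed = [ripley_k(events, 100 * 100, d) for d in distances]
lower, upper = csr_envelope(len(events), 100, 100, distances)

for d, obs, lo, hi in zip(distances, observed, lower, upper):
    verdict = "clustered" if obs > hi else "dispersed" if obs < lo else "not significant"
    print(d, round(obs, 1), round(lo, 1), round(hi, 1), verdict)
```

An observed K above the upper envelope is read as significant clustering at that distance, and an observed K below the lower envelope as significant dispersion.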
Intensity measures
Alternative approaches to measuring point patterns have moved away from basic measures of
distance to the intensity (or density) of points in a given area. One such method is quadrat
analysis, where simply the number of events (points) that occurs within a set of, typically square,
sampling frames is counted. This is used to establish a frequency distribution, which records the
number of events in each individual quadrat. This distribution can then be compared against
another distribution, commonly a random pattern. In a random pattern, the mean number of
points in each quadrat would approximate the variance of the number of points per quadrat. This
can be calculated by the Variance Mean Ratio (VMR), which equals 1 for a random distribution.
Where the VMR is greater than 1 then a cluster pattern is identified. Dispersed patterns are
shown by a VMR of less than 1. This type of method has significant problems, however, most
notably concerning the choice of quadrat size and the fact that it does not consider local density
– only measuring the number of points and not their spatial distribution within a single quadrat.
Thankfully there are a number of other intensity‐based measures, the most significant of which is
the Kernel Density Estimator described next.
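The quadrat count and Variance Mean Ratio described above can be illustrated with a short sketch. It assumes a rectangular study area divided into a regular grid of equal quadrats and uses the sample variance (dividing by the number of quadrats minus one); the function name and point coordinates are hypothetical.

```python
def variance_mean_ratio(points, width, height, nx, ny):
    """Count points per quadrat on an nx-by-ny grid and return the VMR."""
    counts = [0] * (nx * ny)
    for x, y in points:
        # Clamp so points on the far edge fall in the last quadrat.
        col = min(int(x / width * nx), nx - 1)
        row = min(int(y / height * ny), ny - 1)
        counts[row * nx + col] += 1
    mean = sum(counts) / len(counts)
    # Sample variance of the quadrat counts.
    variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return variance / mean

# Hypothetical patterns in a 100 x 100 area split into 2 x 2 quadrats.
clustered = [(10, 10), (12, 15), (20, 5), (5, 30),
             (30, 30), (40, 10), (15, 45), (45, 45)]
regular = [(25, 25), (30, 30), (75, 25), (80, 30),
           (25, 75), (30, 80), (75, 75), (80, 80)]

print(variance_mean_ratio(clustered, 100, 100, 2, 2))
print(variance_mean_ratio(regular, 100, 100, 2, 2))
```

With all eight points in one quadrat the ratio is far above 1, whereas two points in every quadrat gives a ratio of 0. Changing nx and ny shows how sensitive the result is to the choice of quadrat size.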
Kernel Density Estimation
The kernel density estimation technique involves the creation of a continuous (raster) surface
which represents the variation in the density of point events in a given study area (Chainey &
Ratcliffe 2005). Specifically the analysis involves the estimation of the density of points
across geographical space using kernels which have a defined search radius (Figure 5).
Figure 5. The kernel function (Source: Chang (2003:282))
The appearance of the resultant raster‐based output is strongly influenced by the choice of kernel bandwidth
– the radius used to search for other points around each event (O’Sullivan & Unwin 2003).
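The effect of the bandwidth can be seen in a small sketch. It assumes a quartic kernel, which is one common choice and not necessarily the one used by any given package, and it evaluates the density at the centre of each cell of a coarse raster. The function names, cell size, radii and event coordinates are all hypothetical.

```python
import math

def kernel_density(points, x, y, radius):
    """Quartic-kernel density estimate at location (x, y)."""
    total = 0.0
    for px, py in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 <= radius ** 2:
            # Events closer to (x, y) contribute more; beyond the radius, nothing.
            total += 3.0 / (math.pi * radius ** 2) * (1.0 - d2 / radius ** 2) ** 2
    return total

def density_grid(points, width, height, cell, radius):
    """Evaluate the density at the centre of every raster cell."""
    cols, rows = int(width / cell), int(height / cell)
    return [[kernel_density(points, (c + 0.5) * cell, (r + 0.5) * cell, radius)
             for c in range(cols)] for r in range(rows)]

# Hypothetical events in a 100 x 100 area: a small group and one isolated event.
events = [(20, 20), (22, 21), (21, 23), (80, 70)]
narrow = density_grid(events, 100, 100, 10, 10)  # search radius 10
wide = density_grid(events, 100, 100, 10, 40)    # search radius 40

print(narrow[2][2], wide[2][2])  # cell centred on (25, 25), near the group
print(narrow[5][5], wide[5][5])  # cell centred on (55, 55), far from all events
```

A small radius produces sharp, high peaks around the events and zero density elsewhere, while a large radius spreads a lower, smoother surface across more of the study area.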
Software environments for point pattern analysis
There is a great deal of specialist software available for all kinds of spatial analysis including
point pattern detection. Many of the standard desktop GIS packages, such as ArcGIS and
IDRISI, include some (admittedly rather limited) point pattern analysis functionality, although
you will find standalone specialist packages such as CrimeStat and R more capable for the
task with a great range of point pattern analysis options available.
ArcGIS
ESRI’s ArcGIS software environment offers users the ability to undertake nearest neighbour, Ripley’s K and
kernel density estimation. This is primarily carried out through the ArcToolbox
function in ArcGIS desktop. You should see the ArcToolbox as a small icon on the main toolbar.
CrimeStat
CrimeStat is a standalone spatial statistical package for the analysis of point-based crime
data. It was created by Ned Levine for the analysis of US crime data and is freely available
for download for educational and research use. It offers a range of measures from basic
centrographic analysis through to complex spatio-temporal modelling.
http://www.icpsr.umich.edu/CRIMESTAT/
R – Statistical Computing
R is a statistical computing environment created by the academic community and freely
available for non-commercial use. Although it is used for many different statistical tasks it has
a very strong spatial statistical component based around a series of additional packages
which can be downloaded and added to the main R environment.
http://www.r-project.org/
Working in ArcGIS
The practical exercise is to be completed using the desktop GIS package ArcGIS available in
the computer labs. Please note that you may not finish this task within the hour or so available
and therefore may need to work on this outside of the GIS lab class. You are required to submit
the map and table outputs from the different analyses and answer the questions set out below.
You should aim to write this up (including any figures) in 2 or 3 sides of A4.
Point Pattern Analysis with ArcGIS
The dataset provided for this practical exercise is:
Lancashire Lung Cancer data – this is a shapefile with the locations of reported lung cancer
incidences in southern Lancashire (Source: Bailey and Gatrell (1995)).
The data are available on S:\Faculty of Science & Engineering\Environmental &
Geographical Sciences\6F6Z2002\Point Pattern Analysis\lung_cancer_lancs
Copy the folder onto your own drive space or alternatively onto a USB flash drive.
Nearest neighbour analysis
1. Open up ArcGIS and connect to the lung_cancer_lancs data folder in your personal
drive space (or USB stick).
2. Add the lung_cancer_lancashire.shp file to your display.
3. You should now see a display like that shown in Figure 6 – you will see that it contains
point data that represent the incidence of lung cancer among the local population.
You may want to change the symbology properties of the data if the default symbol
and colour is not to your liking.
Figure 6. Lung Cancer data from Lancashire (Source: Bailey and Gatrell (1995))
4. Visually examine the lung cancer point dataset for Lancashire. Can you see any
pattern emerge? How might you describe this set of points – clustered, dispersed,
random?
5. Once you have decided upon how to describe this pattern visually the next step is to
see whether there is any statistical basis to this assumption. Here you will employ
the nearest neighbour index. Select ArcToolbox from the toolbar – this is the
little red tool box icon on the toolbar – you should now see a new menu display on
your screen next to Layers. Within ArcToolbox there are numerous different
modules and operations.
6. Find Spatial Statistics Tools and from its submenu select Analyzing Patterns.
Here you will find the option to perform the Average Nearest Neighbour
technique (Figure 7).
Figure 7. Average Nearest Neighbour statistic
7. Select lung_cancer_lancashire as the Input Feature Class. Check the Generate
Report box and click OK (accept all other defaults).
8. ArcGIS should now start processing your data and calculating its nearest neighbour index.
Don’t worry if this takes the computer a minute or two to complete. When it does finish
you should ask to see the results (from the main menu: Geoprocessing > Results). What is
the value of the Nearest Neighbour Ratio? Remember, you should compare it with unity.
What pattern does the lung cancer events data show? Double-click on the HTML Report
File: Nearest Neighbor_Result.html to open up the graphical output of the results (Figure
7b). Make sure you keep a record of the results.
Figure 7b. Graphical output of NN results using ArcGIS
Kernel Density Estimation (KDE)
9. To estimate a surface that describes the density of the cancer incidences in
Lancashire using the KDE approach, go to the ArcToolbox > Spatial Analyst Tools >
Density menu and double click on Kernel Density. Select lung_cancer_lancashire
as the Input Feature Class and choose an appropriate output name and location for
the raster that will be created via the KDE. Leave the rest of the defaults as they
are and click OK. After the KDE algorithm finishes, the raster will be
automatically displayed. You can modify the colouring scheme as you see fit,
e.g. Figure 7c. To modify the colouring scheme, you need to double click on the
Kernel Density layer, click on the Symbology tab and then click on the Classify button
to modify the number of classes and the method of classification. You can use the
Symbology of the lung cancer locations layer to change their symbol too, from the
default dots to x’s (Figure 7c) so that they do not cover too large an area of the map.
You can also click on the 0 density value class in the table of contents to change its
colour to transparent (Figure 7d).
Figure 7c. Density of lung cancer occurrences in Lancashire estimated using the Kernel
Density Estimation tool (number of classes: 9, method: Quantile)
Figure 7d: Changing the colour of individual classes
Are the locations of lung cancer in Lancashire clustered? To assist in the discussion, click on
the little arrow next to the Add Data button from the top menu in ArcMap and select Add
Basemap (Figure 7e):
Figure 7e: Adding a basemap
Can you find out, using the basemap, which urban areas (if any) are linked to these clusters? Use
the zoom in tool if you need to. What happens if you modify the search radius option in the KD
estimation window? Try a larger and a smaller radius and visualize the resulting rasters to
compare.
K statistical analysis of the lung cancer data
10. Using the lung_cancer_lancashire.shp data select Multi-Distance Spatial Cluster
Analysis (Ripley’s K Function) from the Analyzing Patterns submenu of Spatial
Statistics Tools. You should be prompted with a dialogue box approximately like that
shown in Figure 8.
Figure 8. Ripley’s K Function in ArcGIS
11. Select lung_cancer_lancashire as the Input Feature Class and then choose a suitable
name and location for the Output Table. For Compute Confidence Envelope select 99
Permutations. Check the display output graphically box and click OK.
12. After several minutes of processing you should be presented with a dialogue box
which shows how the data are clustered (or dispersed) and how these patterns
change with spatial scale.
13. Are the locations of lung cancer in Lancashire clustered? And is there any variation
with spatial scale? Keep a copy of the output graph for inclusion in your submission.
TASKS
Write up this practical in report format with (i) an introduction, (ii) a description of methods, (iii)
the visual presentation of any maps/result outputs, and (iv) your answers/written discussions to
any set questions below.
Include a map display for the Lancashire lung cancer data. Your map should be presented
separately and include north arrow, legend and your name (clearly labeled).
Write a brief commentary describing the pattern of lung cancer events (is any clustering or
other pattern present?) and link this discussion with the results of your nearest neighbour
analysis, kernel density estimation and Ripley’s K analysis. You should also include the output
graphical results and KDE map.
Make sure that your write up makes appropriate use of the supporting academic
literature.
References
Bailey, T.C. & Gatrell, A.C. (1995) Interactive Spatial Data Analysis. Harlow: Prentice Hall.
Boots, B. & Getis, A. (1988) Point Pattern Analysis. London: Sage
Birkin, M., Clarke, G.P., Clarke, M. & Wilson, A.G. (1996) Intelligent GIS: Location Decisions
and Strategic Planning. Cambridge: Geoinformation.
Chainey, S. & Ratcliffe, J. (2005) GIS and Crime Mapping. London: Wiley.
Chang, K‐T. (2003) Introduction to Geographic Information Systems. Second Edition. Boston:
McGraw Hill.
Diggle, P.J. (2003) Statistical Analysis of Spatial Point Patterns. Second Edition. London:
Arnold.
Fotheringham, A.S., Brunsdon, C. & Charlton, M. (2000) Quantitative Geography:
Perspectives on Spatial Data Analysis. London: Sage.
Gatrell, A.C., Bailey, T.C., Diggle, P.J. & Rowlingson, B.S. (1996) Spatial point pattern
analysis and its application in geographical epidemiology. Trans Inst Br Geogr 21:256-274.
Heywood, I., Cornelius, S. & Carver, S. (2006) An Introduction to Geographical Information
Systems. Third Edition. Harlow: Prentice Hall.
Kitchin, R. & Tate, N.J. (2000) Conducting Research in Human Geography: Theory,
Methodology and Practice. Harlow: Prentice Hall.
Lee, J. & Wong, DS. (2001) Statistical Analysis with ArcView GIS. New York: Wiley.
Levine, N. (2007) CrimeStat: A Spatial Statistics Program for the Analysis of Crime Incident
Locations (v3.1). Ned Levine & Associates, Houston, TX, and the National Institute of
Justice, Washington, DC.
Mehrer, M. & Westcott, K. (2006) GIS and Archaeological Site Location Modeling. CRC
Press.
O’Sullivan, D. & Unwin, D. (2003) Geographic Information Analysis. New Jersey:
Wiley.
Ripley, B.D. (1976) The second-order analysis of stationary point processes. Journal of
Applied Probability 13: 255-266.
Acknowledgements:
The data for this exercise have been created by other researchers and are included in Bailey
and Gatrell (1995).