close
Fact-checked by Grok 3 months ago

Moran's _I_

Moran's I is a statistical measure of spatial autocorrelation, developed by Patrick A. P. Moran in 1948 to assess the extent to which values of a variable at nearby locations in space are more similar (or dissimilar) than would be expected under spatial randomness.[1] It is widely used in fields such as geography, ecology, economics, and public health to detect patterns of clustering, dispersion, or randomness in spatial data, such as disease incidence, economic indicators, or environmental variables.[2] The statistic is computed using a spatial weights matrix that defines neighborhood relationships among observations, making it applicable to both lattice and point data configurations.[3] The formula for global Moran's I is given by
I=ni=1nj=1nwijzizjS0i=1nzi2, I = \frac{n \sum_{i=1}^n \sum_{j=1}^n w_{ij} z_i z_j}{S_0 \sum_{i=1}^n z_i^2},
where $ n $ is the number of observations, $ z_i = x_i - \bar{x} $ are the deviations from the mean $ \bar{x} $, $ w_{ij} $ are elements of the spatial weights matrix $ W $, and $ S_0 = \sum_{i=1}^n \sum_{j=1}^n w_{ij} $ is the sum of all weights.[2] Values of I typically range from -1 to +1, with I > 0 indicating positive spatial autocorrelation (similar values cluster together), I < 0 indicating negative autocorrelation (dissimilar values are adjacent), and I ≈ 0 suggesting no spatial structure beyond randomness; the expected value under the null hypothesis of no autocorrelation is approximately -1/(n-1).[4] Statistical significance is assessed via a z-score and p-value, often assuming asymptotic normality for large samples.[3] Moran's I forms the basis for extensions such as local indicators of spatial association (LISA), which identify specific locations of clusters or outliers, as developed by Luc Anselin in 1995.[5] Popularized in spatial econometrics and geostatistics through works like Cliff and Ord (1973), it remains a foundational tool for exploratory spatial data analysis (ESDA), often visualized using Moran scatterplots to reveal high-high, low-low, high-low, and low-high associations.[2] Applications span testing for spatial dependence in regression models to mapping socioeconomic disparities, with software implementations available in tools like GeoDa, ArcGIS, and R packages such as spdep.[2]

Introduction

Definition and Purpose

Moran's I is a statistical measure used to quantify spatial autocorrelation, which refers to the correlation of a variable with itself across different spatial locations, indicating that nearby observations tend to exhibit similar values due to underlying spatial processes.[4] This concept is foundational to spatial analysis and aligns with Tobler's First Law of Geography, which posits that everything is related to everything else, but near things are more related than distant things.[6] Spatial autocorrelation arises in various fields, such as geography and ecology, where attribute values—like population density or environmental factors—are not independent but influenced by proximity.[7] The statistic ranges from -1 to +1, where a value of +1 indicates perfect positive spatial autocorrelation (clustering of similar values), -1 signifies perfect negative spatial autocorrelation (dispersion or checkerboard pattern), and 0 suggests no spatial autocorrelation, implying a random distribution.[3] This range allows researchers to assess the strength and direction of spatial dependence in datasets, helping to identify patterns that violate assumptions of spatial independence in traditional statistical models.[8] Moran's I serves to detect overall or localized spatial dependence in variables across geographic, ecological, or socioeconomic contexts, enabling analysts to uncover hidden structures in spatial data that might otherwise be overlooked.[9] It exists in global and local forms: the global version provides an aggregate measure of spatial autocorrelation for an entire dataset, while the local version decomposes this to reveal specific locations of clusters or outliers. Both rely on a spatial weights matrix to define neighborhood relationships among observations.[3]

Historical Development

The concept of spatial autocorrelation, central to Moran's I, emerged in the mid-20th century amid growing interest in analyzing patterns in geographical data. Patrick Alfred Pierce Moran, an Australian statistician, first introduced ideas related to spatial dependence in his 1948 paper on interpreting statistical maps, where he explored methods to detect non-random patterns in spatially distributed variables. This work laid preliminary groundwork for quantifying spatial relationships, motivated by challenges in mapping continuous phenomena like environmental or biological distributions. Moran's seminal contribution came in 1950 with his publication "Notes on Continuous Stochastic Phenomena" in Biometrika, where he formally defined Moran's I as a measure of spatial autocorrelation for lattice data, extending earlier probabilistic approaches to account for interdependence among nearby observations. This statistic addressed limitations in traditional correlation measures by incorporating spatial weights, building on the broader foundations of spatial theory established by geographers in the preceding decades, such as Walter Christaller's 1933 central place theory, which modeled hierarchical settlement patterns and influenced subsequent statistical examinations of spatial organization. The global form of Moran's I remained the primary tool for spatial analysis through the late 20th century, but its scope expanded in 1995 when geographer Luc Anselin introduced local indicators of spatial association (LISA), including the local Moran's I, to identify heterogeneous clusters and outliers within datasets. Anselin's framework decomposed global autocorrelation into location-specific measures, enabling finer-grained detection of spatial heterogeneity. Post-1990s advancements in geographic information systems (GIS) dramatically enhanced the computational accessibility of Moran's I, integrating it into software platforms for routine analysis of large-scale spatial data. This evolution facilitated widespread adoption across disciplines, particularly in epidemiology during the 2000s, where the statistic was applied to map disease clustering and environmental risk factors, as evidenced by the surge in spatial epidemiological studies employing GIS-based autocorrelation tests.

Global Moran's I

Mathematical Formulation

The global Moran's I statistic measures overall spatial autocorrelation in a dataset consisting of attribute values xix_i observed at nn spatial locations. It is formally defined as
I=nS0i=1nj=1nwij(xixˉ)(xjxˉ)i=1n(xixˉ)2, I = \frac{n}{S_0} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2},
where xˉ\bar{x} is the mean of the attribute values, wijw_{ij} are the elements of the spatial weights matrix capturing the spatial structure among observations, and S0=i=1nj=1nwijS_0 = \sum_{i=1}^n \sum_{j=1}^n w_{ij} is the sum of all weights.[10] This formulation derives from the analogy to the sample covariance in classical statistics, adapted to incorporate spatial dependence. In standard covariance, the term (xixˉ)(xjxˉ)\sum (x_i - \bar{x})(x_j - \bar{x}) for iji \neq j captures linear association between paired observations, but without structure, it averages to zero under independence. To account for spatial arrangement, the pairs are weighted by wijw_{ij}, which is zero for non-neighbors and positive for spatially proximate locations, yielding the spatial autocovariance ijwij(xixˉ)(xjxˉ)\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x}). Dividing by the total variance i(xixˉ)2\sum_i (x_i - \bar{x})^2 normalizes this to a correlation-like scale, bounded approximately between -1 and 1. The prefactor n/S0n / S_0 adjusts for the total "spatial connectivity" encoded in the weights, ensuring the measure remains consistent regardless of the absolute scale of wijw_{ij}.[10] The normalization by S0S_0 specifically achieves scale invariance, meaning that multiplying all weights by a constant does not alter II, which facilitates comparisons across datasets with varying spatial resolutions or weight definitions.[11] The derivation assumes a stationary spatial process, where the mean and variance of the attribute are constant across the study area, allowing the global statistic to summarize average dependence without location-specific biases. For interpretability, the weights are commonly row-standardized such that each row sums to 1 (wij=wij/jwijw_{ij}^* = w_{ij} / \sum_j w_{ij}), treating them as conditional probabilities of influence from neighboring locations.[3]

Spatial Weights Matrix

The spatial weights matrix $ W $, denoted as an $ n \times n $ matrix for $ n $ spatial units, captures the structure of spatial interdependence, where each off-diagonal element $ w_{ij} $ quantifies the connection or proximity between units $ i $ and $ j $, typically with $ w_{ii} = 0 $ to exclude self-interaction.[12] This matrix serves as a key input in computing Moran's $ I $, multiplying deviations from the mean to weight neighboring similarities.[13] Common types of spatial weights matrices include binary contiguity, which sets $ w_{ij} = 1 $ if units $ i $ and $ j $ share a boundary and 0 otherwise, with variants such as the rook criterion (sharing an edge) or queen criterion (sharing an edge or vertex).[12] Another approach is the $ k $-nearest neighbors method, where each unit connects to its $ k $ closest units based on a distance metric, ensuring no isolated units regardless of irregular distributions.[12] Distance-based weights, such as inverse distance ($ w_{ij} = 1 / d_{ij} $) or decay functions like $ w_{ij} = 1 / d_{ij}^2 $ or exponential forms $ w_{ij} = e^{-\gamma d_{ij}} $, emphasize connections that diminish with separation $ d_{ij} $.[13] For irregular polygonal units, weights can be proportional to the shared boundary length, reflecting interaction strength based on physical adjacency extent, as in migration or diffusion processes.[14] Standardization of the weights matrix is often applied to facilitate interpretation and computation, with row-standardization transforming $ w_{ij}^* = w_{ij} / \sum_j w_{ij} $ so that each row sums to 1, effectively averaging neighbor influences.[13] This method, common in Moran's $ I $ applications, introduces potential asymmetry but aligns the spatial lag with a mean of neighbors.[12] The choice of spatial weights profoundly influences Moran's $ I $ outcomes due to their role in defining neighborhoods, necessitating sensitivity analysis to assess robustness.[12] For instance, contiguity weights may overestimate autocorrelation in dense urban data with irregular boundaries, while distance-based weights better capture sparse rural interactions over varying terrains, potentially yielding lower $ I $ values under contiguity in low-density settings.[2]

Expected Value and Variance

Under the null hypothesis of no spatial autocorrelation, which corresponds to a random permutation of the observed values across the spatial units, the expected value of the global Moran's I statistic is $ E(I) = -\frac{1}{n-1} $, where $ n $ is the number of observations. For large sample sizes, this expected value approaches zero, reflecting the absence of systematic spatial structure. The variance of I under this null hypothesis is derived from higher-order moments of the spatial weights and accounts for the finite sample properties of the randomization distribution. Define the sums of powers of the spatial weights as $ S_k = \sum_i \sum_j w_{ij}^k $ for $ k = 0, 1, 2, 3, 4 $, where $ w_{ij} $ are the elements of the spatial weights matrix W. The variance is then given by
Var(I)=nS1S2(n2S33nS1+3S02)n2S4(nS123S0S2)+3S02(nS12nS2S0)(n1)3(n2)(n3)S02. Var(I) = \frac{n S_1 S_2 (n^2 S_3 - 3n S_1 + 3 S_0^2) - n^2 S_4 (n S_1^2 - 3 S_0 S_2) + 3 S_0^2 (n S_1^2 - n S_2 S_0)}{(n-1)^3 (n-2) (n-3) S_0^2}.
This formula arises from combinatorial considerations in the permutation distribution, assuming the weights matrix is symmetric and non-stochastic. An alternative variance expression assumes multivariate normality of the data, leading to a simpler form based on the first few moments of the weights matrix, though the randomization variance is generally preferred for exact inference in finite samples. These moments facilitate the standardization of I to a z-score for asymptotic normal testing: $ z = \frac{I - E(I)}{\sqrt{Var(I)}} $, which follows a standard normal distribution under the null for large n.

Local Moran's I

Formulation and Computation

The local Moran's II statistic, denoted IiI_i for a location ii, measures the extent to which the value at ii is similar to or different from the values at its neighboring locations, thereby identifying local spatial clustering or dispersion.[13] The formula is given by
Ii=zim2jiwijzj, I_i = \frac{z_i}{m_2} \sum_{j \neq i} w_{ij} z_j,
where zi=xixˉz_i = x_i - \bar{x} and zj=xjxˉz_j = x_j - \bar{x} are the deviations from the global mean xˉ\bar{x}, wijw_{ij} are the elements of the spatial weights matrix WW (with wii=0w_{ii} = 0), and m2=1nk=1nzk2m_2 = \frac{1}{n} \sum_{k=1}^n z_k^2 is the variance of the attribute values (with nn denoting the total number of locations).[13] This formulation decomposes the global measure into location-specific contributions, allowing for the detection of hotspots, coldspots, and spatial outliers.[13] To compute the local Moran's IiI_i, first calculate the global mean xˉ\bar{x} and variance m2m_2 from the attribute values x1,,xnx_1, \dots, x_n.[13] Next, for each location ii, compute the deviations ziz_i and the weighted sum of deviations for its neighbors, jiwijzj\sum_{j \neq i} w_{ij} z_j, using the predefined spatial weights matrix WW.[13] Finally, apply the formula to obtain IiI_i for every ii, which collectively sum to a multiple of the global Moran's II.[13] The local indicators relate directly to the global Moran's II through their aggregation: for a row-standardized weights matrix (where each row sums to 1), the sum of all IiI_i equals nn times the global II, providing a formal decomposition that ensures iIi=nI\sum_i I_i = n I.[13] Edge effects, arising from locations with fewer neighbors due to study area boundaries, are typically addressed by row-standardizing the spatial weights matrix, which divides each wijw_{ij} by the total number of neighbors for ii, thereby normalizing the influence of varying neighborhood sizes and stabilizing local estimates.[15][16]

Interpretation and Types

The local Moran's I statistic facilitates the identification of spatial regimes by classifying locations based on the similarity or dissimilarity of their values relative to neighboring locations, revealing patterns of positive or negative local spatial autocorrelation. Locations with significantly positive local Moran's I values indicate clusters of similar values, categorized into high-high (HH) clusters, where a high value is surrounded by other high values, often interpreted as hot spots of elevated activity or concentration, and low-low (LL) clusters, where a low value is surrounded by low values, denoting cold spots of diminished activity. Conversely, significantly negative values signify spatial outliers, including high-low (HL) types, where a high value is adjacent to predominantly low values, and low-high (LH) types, where a low value neighbors high values, highlighting atypical or transitional zones in the spatial pattern.[13] These classifications are typically visualized using a Moran scatterplot, which plots each location's standardized value against its spatial lag (the weighted average of neighbors' values), dividing the plot into four quadrants corresponding to the HH (upper-right), LL (lower-left), HL (upper-left), and LH (lower-right) types, with the slope approximating the global Moran's I and points scaled by significance to emphasize notable clusters or outliers. Complementary significance maps overlay these cluster types on the geographic layout, using color coding (e.g., red for HH hot spots, blue for LL cold spots, and contrasting shades for outliers) and filtering by pseudo p-values derived from permutation tests to display only statistically significant locations, aiding in the detection of local spatial structures amid noise.[16][13] Within Anselin's framework of Local Indicators of Spatial Association (LISA), the local Moran's I serves as a primary measure for detecting local clustering, satisfying criteria of local significance and decomposition into a global counterpart, alongside other LISA statistics such as the local Geary and Getis-Ord statistics that offer complementary perspectives on spatial variation but focus on different aspects like distance-based dissimilarity or extreme value concentrations. For multivariate analysis, the bivariate local Moran's I extends this approach to assess cross-variable spatial autocorrelation, evaluating how values of one variable at a location covary with the spatial lag of another variable among its neighbors, such as identifying HH clusters where high income areas neighbor high education levels, which can reveal underlying spatial dependencies or policy-relevant interactions.[13][17]

Statistical Inference

Significance Testing

Significance testing for Moran's I assesses whether observed spatial autocorrelation patterns deviate significantly from expectations under spatial randomness, enabling researchers to distinguish substantive clustering from chance variation. For the global Moran's I, the standard approach assumes approximate normality of the statistic and computes a z-score as $ z = \frac{I - \mathbb{E}[I]}{\sqrt{\mathrm{Var}(I)}} $, where E[I]\mathbb{E}[I] and Var(I)\mathrm{Var}(I) are derived from the spatial weights matrix and data properties.[18] This z-score is compared to the standard normal distribution to obtain a p-value, with significance typically evaluated at α=0.05\alpha = 0.05; positive (negative) z-scores indicate clustering (dispersion).[18] When normality assumptions fail, particularly for small samples or non-Gaussian data, Monte Carlo permutation tests provide a robust alternative by generating a reference distribution through 999 or more random permutations of the attribute values while preserving the spatial structure, yielding an empirical p-value as the proportion of permuted statistics exceeding the observed I.[18] For local Moran's I, significance is evaluated location-by-location via pseudo p-values from conditional permutation tests, which randomize neighbor values while fixing the focal observation to avoid introducing global autocorrelation bias into local inferences.[18] These raw p-values must account for multiple comparisons across n locations to control the family-wise error rate or false discovery rate, as unadjusted testing inflates Type I errors. Common adjustments include the conservative Bonferroni correction, which multiplies each p-value by n (capped at 1), and the less stringent false discovery rate (FDR) procedure of Benjamini and Hochberg, which ranks p-values and adjusts thresholds to control the expected proportion of false positives among significant results; FDR is preferred in spatial contexts for better retaining true local clusters while handling dependence.[16] Power analysis evaluates the sensitivity of Moran's I tests to detect true spatial dependence, revealing that power increases with larger sample sizes and denser spatial weights matrices but decreases with sparsity (e.g., few neighbors per location), where normal approximations often underestimate variability and lead to erroneous inferences; exact permutation-based distributions mitigate this by accurately capturing tail behavior.[19] A key pitfall in local testing is multiple testing inflation, where the probability of at least one false positive approaches 1 without correction, exacerbating over-identification of spurious clusters in LISA maps; the Šidák adjustment addresses this under independence assumptions via $ p_i^* = 1 - (1 - p_i)^n $, though spatial dependence typically requires FDR or permutation-based controls for validity.[20]

Software and Implementation

Moran's I and its local variants can be computed using a variety of open-source and commercial software tools, each offering distinct capabilities for spatial weights construction, statistical testing, and visualization. In open-source environments, the R programming language provides robust implementations through packages such as spdep, which facilitates the creation of spatial weights matrices and computation of both global and local Moran's I statistics via functions like moran.test and localmoran.[11] Complementing this, the ape package extends Moran's I to phylogenetic and comparative analyses, computing the autocorrelation coefficient using distance-based weights as described by Gittleman and Kot (1990).[21] In Python, the PySAL library—originating from the GeoDa project—includes the esda module for exploratory spatial data analysis, where the Moran class calculates global Moran's I and supports visualization through Moran scatterplots, while Moran_Local handles local indicators of spatial association (LISA).[22] Commercial software emphasizes integrated geospatial workflows. ArcGIS Pro, via its Spatial Statistics toolbox, offers the "Spatial Autocorrelation (Global Moran's I)" tool for assessing overall clustering and the "Cluster and Outlier Analysis (Anselin Local Moran's I)" tool for identifying significant hot spots, cold spots, and outliers, with built-in support for Moran scatterplots and significance testing.[9] Additionally, GeoDa serves as a free standalone application (despite its commercial roots in development), specializing in LISA mapping and univariate/bivariate Moran's I computations, with interactive tools for exploring spatial autocorrelation patterns.[2] Effective implementation requires careful data preparation, such as converting inputs to shapefiles or coordinate matrices to define spatial relationships accurately. For large datasets, leveraging sparse matrices in libraries like spdep or PySAL optimizes memory usage and computation speed during weights matrix construction and permutation-based inference. To ensure reproducibility, especially in Monte Carlo simulations for significance testing, practitioners should set random seeds—using set.seed() in R or random_state parameters in Python's esda—to yield consistent p-values across runs. Post-2020 developments have enhanced accessibility in open geospatial platforms, notably through QGIS plugins like the Hotspot Analysis tool, which integrates PySAL for Moran's I-based clustering and LISA visualization, with updates maintaining compatibility through QGIS 3.x versions as of 2025.[23] The Spatial Analysis Toolbox plugin further extends this by providing direct Moran's I and local variants within QGIS, streamlining workflows for non-programmers.[24]

Applications

Fields of Use

Moran's I has found extensive application across multiple disciplines due to its ability to quantify spatial autocorrelation in diverse datasets, enabling researchers to detect patterns of clustering or dispersion in geographic phenomena. In geography and geographic information systems (GIS), it is employed in urban planning to analyze residential segregation indices, where local variants help identify areas of high ethnic or socioeconomic dissimilarity. Similarly, in environmental monitoring, local Moran's I integrated with GIS identifies pollution hotspots, such as lead contamination in urban soils, facilitating targeted remediation efforts. In epidemiology, Moran's I supports disease mapping by revealing spatial clusters of infections, including the post-2020 spread of COVID-19, where global and local measures highlighted regional hotspots and informed public health responses. It also aids in examining health disparities, such as uneven disease burdens across populations, by assessing autocorrelation in incidence rates. Within ecology, the statistic evaluates species distribution clustering, accounting for spatial autocorrelation in biodiversity data to improve modeling accuracy for conservation planning. It further quantifies autocorrelation in biodiversity metrics, such as species richness across landscapes, helping to discern environmental drivers of ecological patterns. In economics and sociology, Moran's I analyzes regional inequality, including income convergence processes, where it detects spatial dependencies in per capita income disparities across European Union regions. For crime pattern analysis, it identifies local clusters of offenses, supporting sociological studies of urban violence and informing policy on neighborhood-level interventions. Beyond these core fields, Moran's I extends to dialectology, where it maps linguistic boundaries by measuring autocorrelation in dialectal features across geographic areas. In geomorphology, it clusters terrain features, such as landform variations, to understand spatial patterns in landscape evolution. While global Moran's I provides an overview of overall spatial structure, local versions pinpoint specific hotspots within these applications.

Case Studies

One prominent application of Moran's I is in the analysis of poverty distribution across U.S. counties, as pioneered by Anselin in his development of local indicators of spatial association (LISA). Using 1990 census data, local Moran's I identified significant clustering patterns, including low-low poverty clusters in the Midwest, where neighboring counties exhibited persistently low poverty rates, visualized through LISA maps and associated z-scores indicating statistical significance at p < 0.05. These clusters highlighted areas of relative affluence amid broader national disparities, aiding in the understanding of regional economic stability. An updated analysis in 2020, applying local Moran's I to contemporary county-level poverty estimates from the CDC's Social Vulnerability Index, reaffirmed low-low clusters in the Midwest (e.g., parts of Iowa, Nebraska, and the Dakotas), with z-scores exceeding 2.58 for many units, underscoring enduring spatial patterns despite economic shifts. These case studies demonstrate Moran's I's utility in revealing actionable spatial patterns, such as poverty clusters informing resource allocation. Overall, results from these applications have influenced policy, enabling targeted interventions like subsidized economic development in U.S. low-low poverty regions. Recent extensions, such as the Moran's I lasso method for models with spatially correlated data, have enhanced its application in econometrics as of 2025.[25]

Table of Contents