Mar 2

Feature Engineering: Geospatial Features

Mindli Team

AI-Generated Content

Moving beyond simple latitude and longitude columns, geospatial feature engineering transforms raw coordinates into powerful predictors for machine learning models. Whether predicting house prices, optimizing delivery routes, or targeting local marketing campaigns, location context is often the decisive factor. This guide will teach you how to systematically extract that context, building location-aware feature sets that give your models a critical edge in spatial domains like real estate, logistics, and marketing.

Representing and Understanding Coordinates

The foundation of all geospatial work is the coordinate pair: latitude and longitude. Latitude measures north-south position (-90° at the South Pole to +90° at the North Pole), while longitude measures east-west position (-180° to +180°). It is crucial to remember that these are angular measurements on a sphere, not linear distances on a flat plane. The simplest approach is to pass the decimal degrees to your model as separate lat and lon features, but this is just the starting point. The real power comes from creating features that describe relationships between points and their surroundings.

A critical first step is data validation. You must check that your latitude values fall between -90 and 90 and your longitude values between -180 and 180. Furthermore, for many distance calculations, you may need to convert these degrees into radians using the formula $\text{radians} = \text{degrees} \times \frac{\pi}{180}$.
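A minimal sketch of the range check and the degree-to-radian conversion, assuming coordinates arrive as plain floats in decimal degrees. The function names and sample points are illustrative, not part of any library.

```python
import math

def is_valid_coordinate(lat, lon):
    """Return True when lat/lon fall inside the valid degree ranges.

    NaN values fail both comparisons, so they are rejected as well.
    """
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

def to_radians(lat_deg, lon_deg):
    """Convert a (lat, lon) pair from decimal degrees to radians."""
    return math.radians(lat_deg), math.radians(lon_deg)

# Hypothetical sample: the second point has an impossible latitude.
points = [(48.8566, 2.3522), (95.0, 10.0), (-33.8688, 151.2093)]
valid_points = [p for p in points if is_valid_coordinate(*p)]
valid_points_rad = [to_radians(lat, lon) for lat, lon in valid_points]
```

Filtering before conversion keeps invalid rows from silently producing plausible-looking radian values downstream.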

Creating Distance-Based Features

One of the most intuitive and valuable geospatial features is distance to landmarks. This measures the spatial relationship between a point of interest and key locations. For a real estate model, you might calculate the distance from each property to the city center, the nearest school, a subway station, or a body of water. In logistics, you would calculate distances to warehouses, major highways, or port facilities.

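To compute these distances accurately over the Earth's surface, you cannot use simple Euclidean distance ($d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$) because it assumes a flat plane. Instead, you must use the Haversine formula, which calculates the great-circle distance between two points on a sphere given their longitudes and latitudes. The Haversine distance is calculated as:

$$d = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where $\phi_1, \phi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points, $R$ is the Earth’s radius (mean radius = 6,371 km), and all latitude and longitude values are in radians. The result is in the same unit as $R$, so kilometers here.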
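The Haversine calculation can be sketched in a few lines of standard-library Python. This version assumes inputs in decimal degrees and a spherical Earth with the mean radius quoted above. The function name and the landmark coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Hypothetical "distance to city center" feature for one property.
city_center = (48.8566, 2.3522)
property_location = (48.8738, 2.2950)
dist_to_center_km = haversine_km(*property_location, *city_center)
```

The same function can be applied once per landmark (nearest school, station, warehouse) to produce one distance column each.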

Beyond single distances, you can create nearest neighbor features. For each point in your dataset, you can find the k-nearest other points (e.g., other houses, retail stores, reported incidents) and calculate statistics like the mean distance to the three nearest neighbors, which can indicate isolation or clustering.
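As a rough illustration of the nearest-neighbor statistic, the sketch below computes the mean Haversine distance from each point to its k nearest other points by brute force. It assumes a small dataset held as a list of (lat, lon) tuples in decimal degrees; the function names are illustrative, and for large datasets a tree-based index would normally replace the inner loop.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def mean_knn_distance_km(points, k=3):
    """For each point, mean distance to its k nearest *other* points.

    Brute force: compares every pair, so cost grows quadratically.
    Returns NaN for a point that has no neighbors at all.
    """
    features = []
    for i, (lat_i, lon_i) in enumerate(points):
        dists = sorted(
            haversine_km(lat_i, lon_i, lat_j, lon_j)
            for j, (lat_j, lon_j) in enumerate(points)
            if j != i
        )
        nearest = dists[:k]
        features.append(sum(nearest) / len(nearest) if nearest else float("nan"))
    return features
```

A large value flags an isolated point; a small value suggests the point sits inside a cluster.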

Density, Aggregation, and Spatial Clustering

Distance tells you about proximity to specific points, but spatial density features describe the broader context of an area. For a given coordinate, you can count how many other points of interest (e.g., coffee shops, crime reports, property listings) exist within a radius of 500 meters, 1 km, or 5 km. This creates count features like num_restaurants_within_1km, and the same neighborhood can be aggregated rather than counted to produce features like avg_ticket_price_within_2km. These features are a useful proxy for urbanization, commercial activity, or risk concentration.
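A radius-count feature can be sketched as follows, assuming points of interest are held as (lat, lon) tuples in decimal degrees and that a brute-force scan is acceptable for the data size. The function names and radii are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def count_within_radius(lat, lon, pois, radius_km):
    """Number of points of interest within radius_km of (lat, lon)."""
    return sum(
        1 for p_lat, p_lon in pois
        if haversine_km(lat, lon, p_lat, p_lon) <= radius_km
    )

def density_features(lat, lon, pois, radii_km=(0.5, 1.0, 5.0)):
    """One count per radius, keyed by a descriptive feature name."""
    return {
        f"num_pois_within_{r}km": count_within_radius(lat, lon, pois, r)
        for r in radii_km
    }
```

Computing several radii at once lets the model pick up both immediate surroundings and wider neighborhood context.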

Taking this a step further, spatial clustering membership identifies if a point belongs to a natural group. Algorithms like DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can automatically find high-density areas and assign each point a cluster label. This label becomes a categorical feature that can capture unobserved neighborhood effects. For example, DBSCAN might identify distinct commercial districts, residential suburbs, and industrial parks without you having to define those boundaries manually.
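To make the clustering idea concrete, here is a compact from-scratch sketch of DBSCAN over Haversine distances. It is a teaching version with brute-force neighborhood search, not a substitute for a library implementation; the parameter names eps_km and min_samples are chosen for this example.

```python
import math

NOISE = -1

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def dbscan(points, eps_km, min_samples):
    """Minimal DBSCAN: one cluster label per point, -1 for noise."""
    n = len(points)
    # Brute-force neighborhoods; each list includes the point itself.
    neighbors = [
        [j for j in range(n) if haversine_km(*points[i], *points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster_id += 1
    return labels

# Two tight groups and one isolated point (hypothetical coordinates).
points = [(0.0, 0.0), (0.0, 0.005), (0.005, 0.0),
          (1.0, 1.0), (1.0, 1.005), (1.005, 1.0),
          (5.0, 5.0)]
labels = dbscan(points, eps_km=1.0, min_samples=3)  # [0, 0, 0, 1, 1, 1, -1]
```

The resulting label can be used as a categorical feature, with the noise label treated as its own category.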

Spatial Indexing and Geohash Encoding

When working with millions of points, performing radius queries for density calculations can be computationally expensive. Spatial indexing structures like Uber's H3 hexagonal indexing system solve this by partitioning the world into a hierarchical grid of hexagons. You convert a (lat, lon) coordinate into an H3 cell address (e.g., a resolution 10 hexagon). The immediate benefit is rapid aggregation: you can now group all points in the same hexagon to compute local statistics (e.g., median price per hexagon). Hexagon membership itself becomes a high-cardinality categorical feature, and the centroids of parent hexagons (at lower resolutions, like level 8) can be used as aggregation anchors.

A simpler, older method is geohash encoding, which converts coordinates into a hierarchical string code. Nearby locations usually share a long prefix in their geohash. For instance, the geohash for one point might be u4pruy. A point very close by might have the geohash u4pruv. The exception is points that sit on opposite sides of a cell boundary, which can be close together yet share only a short prefix. You can use the first 5 or 6 characters of the geohash as an aggregation key or a feature representing a broader area. Both H3 and geohash enable efficient spatial joins and are essential for handling large-scale geospatial data.
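The geohash scheme is simple enough to sketch directly: bits alternate between longitude and latitude, each bit halves the remaining interval, and every five bits become one base-32 character. The encoder below follows that standard scheme; the aggregation helper median_by_cell is a hypothetical name used only for this example.

```python
from collections import defaultdict
from statistics import median

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Encode decimal-degree coordinates as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def median_by_cell(records, prefix_len=5):
    """Group (lat, lon, value) records by geohash prefix; median value per cell."""
    cells = defaultdict(list)
    for lat, lon, value in records:
        cells[geohash_encode(lat, lon, prefix_len)].append(value)
    return {cell: median(values) for cell, values in cells.items()}

code = geohash_encode(57.64911, 10.40744, precision=6)  # "u4pruy"
```

Shortening prefix_len widens the cell, which trades spatial detail for more observations per group.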

Deriving Features from Addresses and Reverse Geocoding

Often, your raw data isn't coordinates but addresses. The process of converting an address to coordinates is called geocoding. The inverse process, reverse geocoding, converts coordinates back into human-readable address components, which are a goldmine for feature engineering. Using a service or library (like the Google Maps API or OpenStreetMap's Nominatim), you can extract:

  • Street name and type
  • Neighborhood
  • City, State, and Postal Code
  • Country

These become powerful categorical features. For example, you could create a feature for the property's street type (Avenue, Boulevard, Lane) or use postal code to join with external demographic datasets (average income, population density). This builds a location-aware feature set that combines precise distances, local densities, administrative boundaries, and socio-economic context.

Common Pitfalls

  1. Using Euclidean Distance for Geographic Coordinates: This is the most fundamental error. Latitude and longitude are not Cartesian coordinates on a flat grid: a degree of longitude spans about 111 km at the equator and shrinks toward zero at the poles. Applying the Pythagorean theorem to raw degrees therefore gives distorted distances, especially over larger spans and at high latitudes. Use the Haversine formula for unprojected latitude and longitude.
  2. Ignoring Coordinate System and Projection: While Haversine is correct for global data, if you are working with a localized, projected coordinate system (like UTM zones), where coordinates are already in meters, Euclidean distance is appropriate. Know your data's coordinate reference system (CRS).
  3. Data Leakage in Spatial Features: When creating density or nearest-neighbor features, you must calculate them in a time-aware manner for time-series or predictive tasks. For example, when predicting house prices, the number of nearby sales should only include sales that occurred before the listing date of the house you're predicting for. Using all sales data leaks future information.
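  4. Overlooking Computational Cost: Performing a radius search (e.g., "find all points within 1km") for every point in a dataset with $n$ points has a naive cost of $O(n^2)$, which is infeasible for large $n$. Use a spatial index (such as H3, geohash, or a tree-based index) to bring this down to roughly $O(n \log n)$.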
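The leakage pitfall in particular is easy to guard against in code. The sketch below counts only sales that closed strictly before the target's listing date, assuming records are plain dicts with lat, lon, and date fields; the field names (listing_date, sale_date) and the function name are hypothetical.

```python
import math
from datetime import date

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def prior_sales_within_radius(target, sales, radius_km):
    """Count nearby sales that closed strictly before the target's listing date."""
    return sum(
        1 for s in sales
        if s["sale_date"] < target["listing_date"]  # time filter prevents leakage
        and haversine_km(target["lat"], target["lon"], s["lat"], s["lon"]) <= radius_km
    )

# Hypothetical records: one valid prior sale, one future sale, one distant sale.
target = {"lat": 0.0, "lon": 0.0, "listing_date": date(2023, 6, 1)}
sales = [
    {"lat": 0.0, "lon": 0.005, "sale_date": date(2023, 1, 10)},
    {"lat": 0.0, "lon": 0.005, "sale_date": date(2023, 9, 1)},
    {"lat": 0.0, "lon": 0.5, "sale_date": date(2022, 3, 15)},
]
n_prior = prior_sales_within_radius(target, sales, radius_km=1.0)  # 1
```

Applying the date filter before the distance check also skips the more expensive Haversine call for records that would be excluded anyway.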

Summary

  • Geospatial feature engineering moves beyond raw latitude and longitude to create features that capture spatial relationships and context, which are critical for models in real estate, logistics, and marketing.
  • Distance to landmarks, calculated using the Haversine formula, provides direct measures of proximity, while nearest neighbor features and spatial density features quantify the local point environment.
  • Spatial clustering membership (e.g., via DBSCAN) can automatically identify and label natural geographic groupings within your data.
  • Spatial indexing systems like H3 hexagonal indexing and geohash encoding are essential for efficient aggregation, analysis, and feature creation at scale.
  • Reverse geocoding converts coordinates into hierarchical address components (street, postal code, etc.), enabling joins with external data and the creation of rich, location-aware feature sets.
  • Always avoid using Euclidean distance for unprojected lat/lon coordinates, guard against temporal data leakage in spatial calculations, and use spatial indexes to manage computational cost.
