GIS Orchestra Places - Comprehensive Geospatial Data Processing Pipeline
Project Overview
GIS Orchestra Places is a comprehensive geospatial data processing application designed to manage, validate, and standardize place location data. The system performs collection, validation, standardization, and similarity analysis between Foursquare datasets and Overturemaps reference data using PostgreSQL with PostGIS extension and Node.js automation scripts. The project orchestrates a sophisticated pipeline that processes place records for major metropolitan areas, executing geographic validation, data normalization, fuzzy matching algorithms, spatial distance calculations, and quality assurance workflows before generating standardized deliverables in multiple geospatial formats.
Architecture and Technology Stack
The system is built on a modern geospatial processing stack combining database-centric operations with automated scripting workflows.
Core Technologies
The foundation relies on a PostgreSQL database server with the PostGIS spatial extension for all geospatial computations and data storage. The server runs on localhost port 5434 with a default workspace database that hosts all processing tables and spatial operations. The automation layer uses the Node.js runtime with the CommonJS module system to orchestrate the multi-step processing pipeline. Key dependencies include the pg library for PostgreSQL connectivity, dotenv for environment configuration management, and Puppeteer for automated report generation. GDAL and QGIS tools provide the geospatial data transformation layer, with ogr2ogr handling export operations from PostgreSQL to standard geospatial formats including GeoPackage, Shapefile, and GeoJSON.
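As a concrete illustration of the connection setup, the sketch below shows a minimal pg client configured against the localhost:5434 server; the helper filename and environment variable fallbacks are assumptions for illustration rather than the project's actual utility module.

```javascript
// db-client.js - minimal connection helper (illustrative; the project's
// actual utility module and environment variable names may differ)
require('dotenv').config();
const { Client } = require('pg');

async function getClient() {
  const client = new Client({
    host: process.env.PGHOST || 'localhost',
    port: Number(process.env.PGPORT) || 5434,     // PostGIS-enabled server described above
    database: process.env.PGDATABASE || 'workspace',
    user: process.env.PGUSER,
    password: process.env.PGPASSWORD,
  });
  await client.connect();
  return client;
}

module.exports = { getClient };
```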
Data Architecture
The system maintains a structured directory hierarchy with clear separation of concerns. The context directory stores JSON-LD semantic files that serve as the source of truth for the entire system, defining workflows, tools, coding principles, and project metadata. The project directory contains city-specific artifacts and deliverables organized by unique project identifiers. The data-schema directory holds JSON-LD schemas defining geodata structures and delivery formats. The scripts_js directory houses all Node.js automation scripts that drive the processing pipeline. External raw geodata sources are stored in a dedicated Google Drive location with per-region subdirectories. Shared delivery outputs for client distribution are maintained in a separate Google Drive folder containing timestamped delivery packages.
Data Processing Workflow
The system implements a sophisticated multi-stage pipeline that transforms raw place data through validation, normalization, matching, and quality control stages before generating final deliverables.
Stage One - Data Initialization
The workflow begins with project initialization where raw place data from two primary sources is imported into PostgreSQL tables. The Foursquare dataset represents the primary collection requiring validation, while the Overturemaps dataset serves as the authoritative reference for comparison and matching operations. Each dataset is loaded into dedicated PostgreSQL tables with PostGIS geometry columns configured for WGS84 coordinate reference system. The system creates backup copies of all source tables before any transformations occur, ensuring data integrity throughout the processing lifecycle. The metadata layer captures essential project information including spatial coverage, data source locations, record counts, creation timestamps, and version identifiers. This metadata drives subsequent pipeline operations and provides traceability for all processing steps.
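A minimal sketch of the backup step is shown below, assuming hypothetical source table names (fsq_places and ovt_places); in the real pipeline the names would be resolved from the project metadata.

```javascript
// createBackups.js - copy source tables before any transformation (sketch;
// table names are hypothetical and would normally come from project metadata)
const { getClient } = require('./db-client');

async function backupSourceTables(tables = ['fsq_places', 'ovt_places']) {
  const client = await getClient();
  try {
    for (const table of tables) {
      // the _bkp suffix follows the backup naming convention described later
      await client.query(`DROP TABLE IF EXISTS ${table}_bkp`);
      await client.query(`CREATE TABLE ${table}_bkp AS TABLE ${table}`);
    }
  } finally {
    await client.end();
  }
}

backupSourceTables().catch((err) => { console.error(err); process.exit(1); });
```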
Stage Two - Street Reference Extraction
The pipeline extracts unique street information from both place datasets to create standardized street reference tables. This aggregation process identifies distinct street type and street name combinations, collecting all place geometries associated with each street into multipoint geometry collections. The street extraction computes aggregated statistics including total place count per street and spatial bounding boxes that encompass all places on each street. These reference tables become the foundation for subsequent matching operations, reducing computational complexity by matching at the street level before processing individual places. Each street record receives a sequential primary key identifier and stores both the original street type and toponym fields along with computed geometric aggregations. The system maintains separate street tables for Foursquare and Overturemaps datasets, enabling parallel processing and comparison workflows.
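The aggregation can be expressed as a single grouped query; the following sketch assumes illustrative table and column names (fsq_places, street_type, toponym, geom) rather than the project's actual schema.

```javascript
// extractStreets.js - build a street reference table from a place table
// (sketch; table and column names are assumptions, not the project's schema)
const { getClient } = require('./db-client');

async function extractStreets(placeTable = 'fsq_places', streetTable = 'fsq_streets') {
  const client = await getClient();
  try {
    await client.query(`DROP TABLE IF EXISTS ${streetTable}`);
    await client.query(`
      CREATE TABLE ${streetTable} AS
      SELECT
        row_number() OVER ()          AS id,          -- sequential primary key
        street_type,
        toponym,
        count(*)                      AS place_count, -- places per street
        ST_Multi(ST_Collect(geom))    AS geom,        -- multipoint of all places
        ST_Envelope(ST_Collect(geom)) AS bbox         -- spatial bounding box
      FROM ${placeTable}
      GROUP BY street_type, toponym
    `);
  } finally {
    await client.end();
  }
}

module.exports = { extractStreets };
```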
Stage Three - Data Normalization
The normalization stage adds standardized output columns to both street reference tables, implementing consistent text processing rules across all data sources. The system generates output fields by trimming whitespace, collapsing multiple spaces into single spaces, and preserving the original character casing for final delivery. Development-specific normalized fields are created using lowercase transformation and non-alphanumeric character replacement strategies. Street type and toponym values are converted to lowercase, all non-alphanumeric characters are replaced with single dash separators, consecutive dashes are collapsed, and trailing dashes are removed to create clean matching keys. This dual-column approach maintains human-readable output fields while creating machine-optimized development fields specifically designed for fuzzy string matching and similarity analysis. The normalization ensures consistent comparison regardless of original data quality variations.
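A sketch of the dual-column normalization is shown below; the out_ and dev_ column names are assumptions that follow the prefix convention described later, not the exact production schema.

```javascript
// normalizeStreets.js - add output and dev_ normalized columns (sketch;
// column names are illustrative)
const { getClient } = require('./db-client');

async function normalizeStreetTable(streetTable = 'fsq_streets') {
  const client = await getClient();
  try {
    await client.query(`
      ALTER TABLE ${streetTable}
        ADD COLUMN IF NOT EXISTS out_toponym text,
        ADD COLUMN IF NOT EXISTS dev_toponym text
    `);
    await client.query(`
      UPDATE ${streetTable} SET
        -- output field: trim and collapse whitespace, keep original casing
        out_toponym = regexp_replace(btrim(toponym), '\\s+', ' ', 'g'),
        -- dev field: lowercase, non-alphanumerics to dashes, collapse runs, strip edges
        dev_toponym = regexp_replace(
                        regexp_replace(
                          regexp_replace(lower(toponym), '[^a-z0-9]+', '-', 'g'),
                        '-{2,}', '-', 'g'),
                      '(^-|-$)', '', 'g')
    `);
  } finally {
    await client.end();
  }
}

module.exports = { normalizeStreetTable };
```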
Stage Four - Street Matching with Similarity Analysis
The matching engine performs sophisticated similarity analysis between Foursquare and Overturemaps street tables using PostgreSQL trigram similarity algorithms and fuzzy string matching extensions. The system enables the unaccent, fuzzystrmatch, and pg_trgm PostgreSQL extensions to support advanced text comparison operations. The matching process executes in three distinct phases. The first phase identifies exact matches on normalized toponym fields, assigning perfect similarity scores of 1.0 and marking the match type as exact. The second phase applies trigram similarity matching to remaining unmatched records, calculating similarity scores between 0.0 and 1.0 for all candidate pairs that meet a minimum threshold of 0.4 similarity. The third phase ranks all similarity matches by score and retains only the best match for each Foursquare street. The system adds comprehensive match metadata columns to the Foursquare street table including boolean existence flags, similarity score values, match type classifications, match ranking integers for multiple candidate scenarios, and the matched street name from the Overturemaps reference dataset. GIN indexes using trigram operators are created on normalized fields to accelerate the similarity search operations.
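The three-phase matching can be sketched as follows, with hypothetical table and dev_ column names; the second query folds phases two and three together by keeping only the best candidate above the 0.4 threshold for each street.

```javascript
// matchStreets.js - trigram similarity matching between street tables
// (sketch; table and column names are illustrative)
const { getClient } = require('./db-client');

async function matchStreets(fsq = 'fsq_streets', ovt = 'ovt_streets') {
  const client = await getClient();
  try {
    await client.query(`CREATE EXTENSION IF NOT EXISTS unaccent`);
    await client.query(`CREATE EXTENSION IF NOT EXISTS fuzzystrmatch`);
    await client.query(`CREATE EXTENSION IF NOT EXISTS pg_trgm`);

    // add match metadata columns and GIN trigram indexes
    await client.query(`
      ALTER TABLE ${fsq}
        ADD COLUMN IF NOT EXISTS dev_exists_in_ovt boolean DEFAULT false,
        ADD COLUMN IF NOT EXISTS dev_similarity    double precision,
        ADD COLUMN IF NOT EXISTS dev_match_type    text,
        ADD COLUMN IF NOT EXISTS dev_ovt_toponym   text
    `);
    await client.query(`CREATE INDEX IF NOT EXISTS ${fsq}_trgm_idx ON ${fsq} USING gin (dev_toponym gin_trgm_ops)`);
    await client.query(`CREATE INDEX IF NOT EXISTS ${ovt}_trgm_idx ON ${ovt} USING gin (dev_toponym gin_trgm_ops)`);

    // Phase 1: exact matches on the normalized toponym
    await client.query(`
      UPDATE ${fsq} f SET
        dev_exists_in_ovt = true, dev_similarity = 1.0,
        dev_match_type = 'exact', dev_ovt_toponym = o.dev_toponym
      FROM ${ovt} o
      WHERE f.dev_toponym = o.dev_toponym
    `);

    // Phases 2 and 3: best trigram candidate at or above the 0.4 threshold per street
    await client.query(`
      UPDATE ${fsq} f SET
        dev_exists_in_ovt = true, dev_similarity = c.score,
        dev_match_type = 'similarity', dev_ovt_toponym = c.dev_toponym
      FROM (
        SELECT DISTINCT ON (f2.id) f2.id, o.dev_toponym,
               similarity(f2.dev_toponym, o.dev_toponym) AS score
        FROM ${fsq} f2
        JOIN ${ovt} o ON similarity(f2.dev_toponym, o.dev_toponym) >= 0.4
        WHERE f2.dev_match_type IS NULL
        ORDER BY f2.id, score DESC
      ) c
      WHERE f.id = c.id
    `);
  } finally {
    await client.end();
  }
}

module.exports = { matchStreets };
```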
Stage Five - Manual Review and Validation
Similarity matches require human validation before acceptance into the final dataset. The system implements a two-phase manual review workflow that exports candidate matches to JSON format with timestamp suffixes for audit trail purposes. The export includes all similarity matches ordered by score descending, presenting the original Foursquare street names, matched Overturemaps street names, similarity scores, and a manual_matched flag defaulted to false. Reviewers examine each match and set the flag to true for approved correspondences, rejecting false positives by leaving the flag at false. After review completion, the system imports the validated matches and applies transformations to the Foursquare street table. Approved matches receive the Overturemaps street names as their normalized values, similarity scores are elevated to 1.0 to indicate certainty, match types are reclassified as manual_match, and existence flags are set to true. Rejected similarity matches are marked with a match type of similarity_discard, similarity scores are nullified, and existence flags remain false. This human-in-the-loop validation ensures data quality while leveraging automated matching to reduce manual effort. The system collects comprehensive statistics including exact match counts, similarity match counts, manual match counts, discard counts, and unmatched record counts.
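The export side of the review workflow might look like the sketch below; file naming and column names are illustrative, and the complementary import step would read the reviewed JSON and apply the approval and discard transformations described above.

```javascript
// exportReviewCandidates.js - dump similarity matches for manual review
// (sketch; file layout and column names are assumptions)
const fs = require('fs');
const { getClient } = require('./db-client');

async function exportCandidates(fsq = 'fsq_streets') {
  const client = await getClient();
  try {
    const { rows } = await client.query(`
      SELECT toponym, dev_ovt_toponym, dev_similarity
      FROM ${fsq}
      WHERE dev_match_type = 'similarity'
      ORDER BY dev_similarity DESC
    `);
    // reviewers flip manual_matched to true for approved correspondences
    const candidates = rows.map((r) => ({ ...r, manual_matched: false }));
    const stamp = new Date().toISOString().replace(/[:.]/g, '-'); // timestamp suffix for the audit trail
    fs.writeFileSync(`review-candidates-${stamp}.json`, JSON.stringify(candidates, null, 2));
  } finally {
    await client.end();
  }
}

module.exports = { exportCandidates };
```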
Stage Six - Place Table Enrichment
The validated street matching results are propagated back to the original place tables through relational joins. The system adds development columns to the Foursquare place table and populates them with normalized street data and match metadata from the street reference table. Each place record receives normalized street type and toponym values, existence flags indicating whether its street appears in the Overturemaps dataset, match type classifications describing how the street was matched, and the corresponding Overturemaps street names for matched records. The join operation matches places to streets using the original street type and toponym combination. The Overturemaps place table undergoes similar enrichment, receiving normalized street values that enable subsequent place-level matching operations. This symmetric processing ensures both datasets use consistent normalization schemes during final place comparison. Statistics are collected documenting total place counts, places with complete street information, places with streets that exist in the reference dataset, and the distribution of match types across exact, similarity, and manual classification categories.
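A sketch of the enrichment join follows, with assumed column names carrying the dev_ prefix convention.

```javascript
// enrichPlaces.js - propagate street match results back to the place table
// (sketch; table and column names are illustrative)
const { getClient } = require('./db-client');

async function enrichPlaces(placeTable = 'fsq_places', streetTable = 'fsq_streets') {
  const client = await getClient();
  try {
    await client.query(`
      ALTER TABLE ${placeTable}
        ADD COLUMN IF NOT EXISTS dev_toponym           text,
        ADD COLUMN IF NOT EXISTS dev_street_exists     boolean,
        ADD COLUMN IF NOT EXISTS dev_street_match_type text,
        ADD COLUMN IF NOT EXISTS dev_ovt_toponym       text
    `);
    // join on the original street type and toponym combination
    await client.query(`
      UPDATE ${placeTable} p SET
        dev_toponym           = s.dev_toponym,
        dev_street_exists     = s.dev_exists_in_ovt,
        dev_street_match_type = s.dev_match_type,
        dev_ovt_toponym       = s.dev_ovt_toponym
      FROM ${streetTable} s
      WHERE p.street_type = s.street_type
        AND p.toponym     = s.toponym
    `);
  } finally {
    await client.end();
  }
}

module.exports = { enrichPlaces };
```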
Stage Seven - Place-Level Matching
With street-level matching complete, the system performs granular place-level matching based on civic numbers. The matching algorithm implements a progressive strategy that attempts exact civic number matches first, then falls back to number-only matches when exact matches fail. The process initializes all Foursquare places to an unmatched state before executing matching phases. The first matching phase performs exact civic number matching, joining Foursquare and Overturemaps places where streets are already matched and complete civic number strings are identical. The second phase matches on numeric components only, accommodating scenarios where suffix characters differ but base numbers align. Each matched Foursquare place receives a boolean existence flag, match type classification of exact or num_only, the matched civic number value from Overturemaps, and the primary key identifier of the matched Overturemaps place record. This identifier enables subsequent spatial distance calculations between matched place pairs. The progressive matching strategy maximizes match rates while maintaining clear traceability of match quality through the match type classification. Statistics track total places, matched counts, unmatched counts, and the distribution between exact and number-only match types.
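The progressive strategy can be sketched as two successive UPDATE statements, the second restricted to rows still unmatched after the first; civic_full, civic_number, and the other identifiers are assumptions for illustration.

```javascript
// matchPlaces.js - progressive civic number matching (sketch; column names
// such as civic_full and civic_number are assumptions)
const { getClient } = require('./db-client');

async function matchPlaces(fsq = 'fsq_places', ovt = 'ovt_places') {
  const client = await getClient();
  try {
    await client.query(`
      ALTER TABLE ${fsq}
        ADD COLUMN IF NOT EXISTS dev_place_exists     boolean,
        ADD COLUMN IF NOT EXISTS dev_place_match_type text,
        ADD COLUMN IF NOT EXISTS dev_ovt_civic        text,
        ADD COLUMN IF NOT EXISTS dev_ovt_place_id     bigint
    `);
    // initialize every place to an unmatched state
    await client.query(`
      UPDATE ${fsq} SET dev_place_exists = false, dev_place_match_type = NULL,
                        dev_ovt_civic = NULL, dev_ovt_place_id = NULL
    `);
    // Phase 1: exact match on the full civic number string
    await client.query(`
      UPDATE ${fsq} f SET dev_place_exists = true, dev_place_match_type = 'exact',
                          dev_ovt_civic = o.civic_full, dev_ovt_place_id = o.id
      FROM ${ovt} o
      WHERE f.dev_ovt_toponym = o.dev_toponym AND f.civic_full = o.civic_full
    `);
    // Phase 2: fall back to the numeric component only
    await client.query(`
      UPDATE ${fsq} f SET dev_place_exists = true, dev_place_match_type = 'num_only',
                          dev_ovt_civic = o.civic_full, dev_ovt_place_id = o.id
      FROM ${ovt} o
      WHERE f.dev_place_match_type IS NULL
        AND f.dev_ovt_toponym = o.dev_toponym AND f.civic_number = o.civic_number
    `);
  } finally {
    await client.end();
  }
}

module.exports = { matchPlaces };
```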
Stage Eight - Spatial Distance Calculation
For all successfully matched place pairs, the system calculates the spatial distance between Foursquare and Overturemaps geometries using PostGIS spatial functions. The calculation transforms point geometries from WGS84 geographic coordinates to a metric projection system to obtain accurate distance measurements in meters. The default projection is EPSG:32633 (UTM zone 33N), which yields accurate metric distances only for study areas within that zone; the coordinate reference system identifier can be supplied as a runtime argument so other regions use their local UTM zone. The ST_Distance function operates on the transformed geometries, computing the shortest distance between matched place points. Distance values are classified into seven discrete clusters to support analysis and quality assessment. The clusters include under 2 meters for high-precision matches, under 5 meters for very close matches, under 10 meters for close matches, under 20 meters for nearby matches, under 50 meters for moderate distance matches, under 100 meters for distant matches, and 100 meters or more for significant displacement scenarios. The distance calculation adds two columns to the Foursquare place table: a double precision field storing the exact distance in meters and a text field containing the cluster classification. Statistics are aggregated including total places with distance measurements, counts per distance cluster, average distance, minimum distance, and maximum distance across the entire dataset. This spatial analysis provides quantitative quality metrics for the matching process, identifying potential geocoding discrepancies and enabling prioritization of records requiring positional refinement.
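A sketch of the distance calculation and clustering is shown below; the cluster labels and column names are placeholders, and the SRID parameter defaults to 32633 as described above.

```javascript
// calcDistances.js - distance in meters between matched place pairs (sketch;
// swap the SRID for the local UTM zone of the study area)
const { getClient } = require('./db-client');

async function calcDistances(fsq = 'fsq_places', ovt = 'ovt_places', srid = 32633) {
  const client = await getClient();
  try {
    await client.query(`
      ALTER TABLE ${fsq}
        ADD COLUMN IF NOT EXISTS dev_distance_m       double precision,
        ADD COLUMN IF NOT EXISTS dev_distance_cluster text
    `);
    await client.query(`
      UPDATE ${fsq} f SET
        dev_distance_m = d.dist,
        dev_distance_cluster = CASE
          WHEN d.dist < 2   THEN 'lt_2m'
          WHEN d.dist < 5   THEN 'lt_5m'
          WHEN d.dist < 10  THEN 'lt_10m'
          WHEN d.dist < 20  THEN 'lt_20m'
          WHEN d.dist < 50  THEN 'lt_50m'
          WHEN d.dist < 100 THEN 'lt_100m'
          ELSE 'gte_100m'
        END
      FROM (
        -- transform both points to the metric projection before measuring
        SELECT f2.id,
               ST_Distance(ST_Transform(f2.geom, ${srid}), ST_Transform(o.geom, ${srid})) AS dist
        FROM ${fsq} f2
        JOIN ${ovt} o ON o.id = f2.dev_ovt_place_id
      ) d
      WHERE f.id = d.id
    `);
  } finally {
    await client.end();
  }
}

module.exports = { calcDistances };
```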
Stage Nine - Duplicate Geometry Detection
The quality control stage identifies places sharing identical geographic coordinates, flagging potential data quality issues where multiple distinct places are assigned the same point location. The detection algorithm uses GeoJSON string comparison at six decimal place precision, which provides approximately 0.1 meter positional accuracy. The system converts all place geometries to GeoJSON format and groups by the resulting strings to identify duplicates. Places whose geometries appear more than once in the dataset receive a boolean duplicate flag set to true. The statistics capture total place counts, duplicate place counts, unique place counts, and the number of distinct duplicate geometry groups. This analysis highlights geocoding quality issues, address range interpolation artifacts, or data collection methodology problems that require attention before final delivery. The duplicate flag enables downstream filtering and prioritization strategies.
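The duplicate detection reduces to a grouped comparison on the six-decimal GeoJSON string, as sketched below with assumed table and column names.

```javascript
// flagDuplicates.js - flag places sharing identical coordinates (sketch;
// table and column names are illustrative)
const { getClient } = require('./db-client');

async function flagDuplicateGeometries(placeTable = 'fsq_places') {
  const client = await getClient();
  try {
    await client.query(`ALTER TABLE ${placeTable} ADD COLUMN IF NOT EXISTS dev_is_duplicate boolean DEFAULT false`);
    await client.query(`
      UPDATE ${placeTable} p SET dev_is_duplicate = true
      FROM (
        -- ST_AsGeoJSON with 6 decimal places is roughly 0.1 m positional resolution
        SELECT ST_AsGeoJSON(geom, 6) AS gj
        FROM ${placeTable}
        GROUP BY ST_AsGeoJSON(geom, 6)
        HAVING count(*) > 1
      ) dup
      WHERE ST_AsGeoJSON(p.geom, 6) = dup.gj
    `);
  } finally {
    await client.end();
  }
}

module.exports = { flagDuplicateGeometries };
```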
Stage Ten - Delivery Table Construction
The system creates a final delivery table from the enriched and validated Foursquare place data, implementing a standardized schema that supports client requirements and maintains development metadata for quality tracking. The delivery table follows a versioned naming convention incorporating the project identifier and version number with dots replaced by underscores. The schema includes standard columns for municipality identification code and name, street type and toponym from output normalized fields, placeholder columns for alternative names and historical data, civic number components separated into base number and suffix values, coordinate pairs extracted from point geometries, and placeholder fields for postal codes, source attribution, unique identifiers, priority rankings, change tracking, and revision timestamps. Development columns are preserved in the delivery table to maintain full traceability. These include street existence flags in the reference dataset, street match type classifications, place existence flags in the reference dataset, place match type classifications, matched place primary key references, spatial distance measurements to matched places, distance cluster classifications, and duplicate geometry flags. The delivery table preserves original point geometries in WGS84 coordinate reference system, ensuring compatibility with standard GIS platforms and web mapping applications. The table is created as a new database object through a SELECT statement with column transformations, maintaining data lineage back to source tables.
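A heavily abridged sketch of the delivery table construction follows; the real column list is much longer and includes the placeholder fields described above, and all names here are illustrative.

```javascript
// buildDeliveryTable.js - materialize the versioned delivery table (sketch;
// column and table names are placeholders)
const { getClient } = require('./db-client');

async function buildDeliveryTable(projectSlug, version, sourceTable = 'fsq_places') {
  // e.g. projectSlug 'cityname', version '1.2.0' -> 'delivery_cityname_1_2_0'
  const deliveryTable = `delivery_${projectSlug}_${version.replace(/\./g, '_')}`;
  const client = await getClient();
  try {
    await client.query(`
      CREATE TABLE ${deliveryTable} AS
      SELECT
        out_street_type AS street_type,
        out_toponym     AS toponym,
        civic_number,
        civic_suffix,
        ST_X(geom)      AS lon,
        ST_Y(geom)      AS lat,
        dev_street_exists, dev_street_match_type,
        dev_place_exists,  dev_place_match_type,
        dev_ovt_place_id,  dev_distance_m, dev_distance_cluster,
        dev_is_duplicate,
        geom                              -- original WGS84 point geometry
      FROM ${sourceTable}
    `);
    return deliveryTable;
  } finally {
    await client.end();
  }
}

module.exports = { buildDeliveryTable };
```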
Stage Eleven - Geospatial Format Export
The delivery table is exported from PostgreSQL to GeoPackage format using the ogr2ogr command-line utility from GDAL. GeoPackage provides an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information, implemented as a SQLite database container. The export process constructs a versioned delivery folder name incorporating the current date in YYMMDD format, the project slug identifier, and the semantic version components. The folder is created within a shared Google Drive location accessible to clients and stakeholders. The ogr2ogr execution includes critical configuration options to ensure offline processing without network-based coordinate reference system definitions and to optimize directory reading performance. The geometry field specification, layer naming, and overwrite flags ensure clean export results. User confirmations are required before export execution and before optional deletion of the source delivery table from PostgreSQL. The confirmation workflow displays the input table name, output folder path, and output filename for verification before proceeding. After successful export, the system updates project metadata with delivery folder path, delivery file path, and export timestamp, creating a complete audit trail. The optional table deletion reduces database storage requirements after delivery completion while maintaining backup copies of source data tables.
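A reduced sketch of the export invocation is shown below; connection parameters and paths are illustrative, and the additional GDAL configuration options mentioned above (offline CRS handling and directory reading optimization) are omitted rather than guessed.

```javascript
// exportDelivery.js - export the delivery table to GeoPackage via ogr2ogr
// (sketch; connection string and paths are illustrative)
const { execFileSync } = require('child_process');
const path = require('path');

function exportToGeoPackage(deliveryTable, outputFolder) {
  const outputFile = path.join(outputFolder, `${deliveryTable}.gpkg`);
  execFileSync('ogr2ogr', [
    '-f', 'GPKG',                                    // output driver: GeoPackage
    outputFile,
    'PG:host=localhost port=5434 dbname=workspace',  // source PostgreSQL connection
    deliveryTable,                                   // source layer (the delivery table)
    '-nln', deliveryTable,                           // output layer name
    '-overwrite',
  ], { stdio: 'inherit' });
  return outputFile;
}

module.exports = { exportToGeoPackage };
```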
Stage Twelve - Automated Report Generation
The final stage produces comprehensive processing reports in both HTML and PDF formats, documenting the complete pipeline execution with statistics, quality metrics, and data lineage information. The report generation reads from a centralized report-data.jsonld file that serves as the single source of truth for all process execution metadata. Reports include project identification and metadata, data source descriptions with record counts, processing stage summaries with execution timestamps, match statistics across all stages, quality metrics including distance distributions and duplicate counts, and delivery artifact details with file locations and export timestamps. The HTML report supports interactive exploration with expandable sections and embedded visualizations. The PDF report provides an archival format suitable for client deliverables and long-term documentation requirements. Both formats are generated through automated templating and stored in the project report directory with version-specific subdirectories.
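The PDF rendering step can be sketched with Puppeteer as follows, assuming the HTML report has already been produced by the templating step; file paths are illustrative.

```javascript
// generateReport.js - render the HTML report to PDF with Puppeteer (sketch)
const puppeteer = require('puppeteer');
const path = require('path');

async function htmlReportToPdf(htmlPath, pdfPath) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // load the already-rendered HTML report from disk
    await page.goto('file://' + path.resolve(htmlPath), { waitUntil: 'networkidle0' });
    // write an archival PDF suitable for client deliverables
    await page.pdf({ path: pdfPath, format: 'A4', printBackground: true });
  } finally {
    await browser.close();
  }
}

module.exports = { htmlReportToPdf };
```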
Project Organization and Metadata
The system implements a comprehensive metadata management strategy using JSON-LD semantic structures throughout the entire processing lifecycle. This approach provides machine-readable project definitions that support automated code generation, pipeline orchestration, and documentation generation.
JSON-LD Context System
The context directory serves as the canonical source of truth for all project semantics. The workflow-jsonld.json file acts as the main index, referencing all subordinate context files including project metadata, coding principles, project principles, maintenance workflows, project tools, address processing tools, current subproject context, data schemas, and git semantics. Each context file implements Schema.org vocabularies with domain-specific extensions, enabling semantic querying and relationship inference across project artifacts. The structured metadata supports automated validation, consistency checking, and cross-reference verification throughout development and processing workflows.
Subproject Structure
Individual city processing projects are organized as subprojects within the larger framework, each maintaining independent metadata, pipeline configuration, results, and deliverables while sharing common tooling and processing scripts. Each subproject directory contains a metadata.jsonld file describing the project identifier, name, description, version, creation and modification timestamps, spatial coverage with geographic bounds, directory locations, data sources with contentUrls and table definitions, pipeline configuration reference, deliverables listing, notes collection, and changelog with versioned snapshots. The pipeline.jsonld file defines the enabled processing steps, deprecated tools, and available tools for the specific subproject. This configuration drives the automated pipeline runner, determining which processing stages execute and in what sequence.
Version Management
The system implements semantic versioning with major, minor, and patch components. Version numbers are embedded in table names, delivery folder names, and file names to ensure traceability and prevent conflicts. The version progression is tracked in metadata changelog entries, capturing data source changes, processing modifications, and deliverable updates. Version bumps trigger delivery folder regeneration with updated naming conventions, ensuring clear separation between different processing iterations. The versioning strategy supports parallel processing of multiple city datasets while maintaining consistent tooling and methodology.
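The naming conventions described above might be centralized in a small helper, sketched here; the exact separators and prefixes are inferred from the descriptions and may differ in the real scripts.

```javascript
// naming.js - versioned naming helpers (sketch; conventions are inferred)
function versionedTableName(projectSlug, version) {
  // dots in the semantic version are replaced by underscores for table names
  return `delivery_${projectSlug}_${version.replace(/\./g, '_')}`;
}

function deliveryFolderName(projectSlug, version, date = new Date()) {
  // YYMMDD prefix followed by the project slug and semantic version
  const yymmdd = date.toISOString().slice(2, 10).replace(/-/g, '');
  return `${yymmdd}_${projectSlug}_v${version}`;
}

module.exports = { versionedTableName, deliveryFolderName };
```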
Process Tracking and Statistics
Each processing stage records execution metadata including start and end timestamps, input and output table names, record counts processed, match statistics, quality metrics, and error conditions. This execution data is aggregated into the report-data.jsonld file, providing a complete audit trail for the entire pipeline run. Statistics collection is standardized across all processing stages, enabling consistent reporting and comparative analysis across different city datasets. The statistics support quality monitoring, performance optimization, and methodology refinement across project iterations.
Quality Assurance and Validation
The system incorporates multiple quality control mechanisms throughout the processing pipeline to ensure data accuracy, completeness, and consistency.
Backup and Recovery
All source data tables are backed up immediately after import and before any transformation operations. Backup table naming follows a consistent convention appending a _bkp suffix to the original table name. This strategy enables rapid recovery from processing errors and supports comparative analysis between original and transformed data states.
Manual Review Checkpoints
Critical matching operations include mandatory manual review checkpoints where automated results are exported for human validation before acceptance. The review workflow preserves audit trails with timestamped export files and version-controlled validation decisions.
Spatial Validation
Geographic coordinates are validated through PostGIS spatial functions including geometry validity checks, coordinate reference system verification, and spatial relationship testing. Invalid geometries are flagged for investigation and correction before delivery generation.
Statistical Monitoring
Comprehensive statistics are collected at each processing stage, enabling detection of anomalous patterns that might indicate data quality issues or processing errors. Statistical thresholds can trigger warnings or halt processing pending manual investigation.
Match Quality Metrics
The distance calculation and clustering analysis provide quantitative quality metrics for the matching process. High match rates combined with low average distances indicate successful processing, while high distances or low match rates trigger quality review workflows.
Pipeline Orchestration and Automation
The system implements a flexible pipeline orchestration framework that supports both automated batch processing and interactive development workflows.
Pipeline Runner
The runPipeline.js script orchestrates sequential execution of all enabled processing stages defined in the subproject pipeline configuration. The runner loads the current subproject context, reads the pipeline configuration, resolves tool identifiers to script file paths, executes each enabled process in sequence, captures execution output and error conditions, maintains execution statistics, and terminates on first error to prevent cascade failures. The runner supports parameterized tool invocation where certain processes accept runtime arguments such as dataset type specifications or coordinate reference system identifiers. The execution output is streamed to the console in real-time, providing visibility into processing progress.
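A simplified sketch of the runner's core loop is shown below; the real runPipeline.js resolves tools from the JSON-LD registries and collects richer statistics, so the registry object and file layout here are illustrative.

```javascript
// runPipeline.js (simplified sketch; identifiers and layout are illustrative)
const fs = require('fs');
const { execFileSync } = require('child_process');

function runPipeline(pipelinePath = 'pipeline.jsonld', registry = {}) {
  const pipeline = JSON.parse(fs.readFileSync(pipelinePath, 'utf8'));
  for (const processId of pipeline.enabled) {
    const scriptPath = registry[processId]; // tool identifier -> script file path
    if (!scriptPath) throw new Error(`No tool registered for process '${processId}'`);
    console.log(`Running ${processId} -> ${scriptPath}`);
    try {
      // stream child output to the console in real time; stop on first failure
      execFileSync('node', [scriptPath], { stdio: 'inherit' });
    } catch (err) {
      console.error(`Pipeline halted: process '${processId}' failed.`);
      throw err; // fail-fast: do not run downstream stages on bad data
    }
  }
}

module.exports = { runPipeline };
```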
Subproject Initialization
The subprojectInit.js script automates creation of new city processing projects, generating the complete directory structure, metadata files, pipeline configuration, and data source placeholders. This initialization ensures consistency across all city projects and reduces manual setup effort.
Current Subproject Context
The current-subproject.jsonld file maintains a reference to the active project identifier, enabling tool scripts to dynamically resolve table names, file paths, and configuration settings without hardcoded values. This abstraction supports parallel development on multiple city projects within a single workspace.
Tool Registry
The pipeline orchestration maintains a dynamic tool registry mapping process identifiers to executable script paths. The registry is constructed by reading tool definitions from the context/tools directory and project-specific tools directories, enabling both shared common tools and project-specific customizations. Tools are defined using JSON-LD Action schemas that specify the tool identifier, description, target entry point with URL template, action application with code repository path, and instrument properties describing parameters, inputs, outputs, and behaviors.
Error Handling
Pipeline execution implements fail-fast error handling where any process failure immediately terminates the entire pipeline run. This approach prevents downstream processes from operating on incomplete or corrupted data. Error messages include the failed process identifier and detailed error descriptions to support troubleshooting.
Data Schema and Standards
The system enforces standardized data schemas throughout the processing pipeline, ensuring consistency and interoperability across different city datasets.
Input Schema
Source data tables conform to defined input schemas specifying required columns including original street type and toponym fields, civic number fields with optional numeric and suffix components, spatial geometry columns in WGS84 coordinate reference system, and optional attributes such as municipality codes, names, and postal codes. The schema definitions are maintained as JSON-LD files in the data-schema directory, providing machine-readable specifications that support automated validation and documentation generation.
Development Schema
Processing stages add development columns with standardized naming conventions using a dev_ prefix. These columns store intermediate processing results including normalized field values, existence flags, match type classifications, similarity scores, matched references, and quality indicators. The development schema segregates transformation artifacts from production output fields, enabling clear separation between final deliverable content and processing metadata useful for quality control and debugging.
Output Schema
The delivery table implements a standardized output schema designed to support client requirements and GIS platform compatibility. Column names follow concise abbreviations, coordinate fields use double precision numeric types, text fields are sized appropriately for content, and geometry columns specify explicit spatial reference systems. The output schema includes placeholder columns for future enhancements, ensuring schema stability as requirements evolve. All delivery tables across different city projects conform to the same schema, enabling aggregation and comparative analysis.
Extension Points and Customization
While the core processing pipeline provides standardized workflows applicable to most city datasets, the system supports project-specific customizations through several extension mechanisms.
Custom Processing Scripts
Projects can define custom processing scripts in the project-specific tools directory, identified by process identifiers incorporating the project slug. These custom processes are registered in the tool registry and can be inserted at specific points in the pipeline configuration. Custom scripts have access to the same database client utilities, context loading functions, and report tracking mechanisms as standard tools, ensuring consistent patterns and error handling.
Pipeline Configuration
The pipeline.jsonld enabled array can be reordered or filtered to skip certain stages, add custom processes, or modify the processing sequence. This flexibility supports experimentation, debugging, and adaptation to unique dataset characteristics without modifying core tooling.
Metadata Extensions
Project metadata files support arbitrary property additions beyond the base schema, enabling capture of project-specific attributes such as special geographic considerations, data source peculiarities, or client-specific requirements.
Conclusion
GIS Orchestra Places implements a sophisticated geospatial data processing pipeline that transforms heterogeneous place datasets into standardized, validated, and quality-controlled deliverables. The system combines automated batch processing with strategic human review checkpoints, achieving high efficiency while maintaining data quality. The metadata-driven architecture using JSON-LD semantics enables machine-readable project definitions that support automation, code generation, and documentation workflows. The modular design with clear separation between framework tooling and project-specific customizations enables scalable processing across multiple city datasets while maintaining consistency. The comprehensive quality assurance mechanisms including backup strategies, spatial validation, statistical monitoring, distance-based match quality metrics, and duplicate detection ensure delivery of reliable geospatial data products suitable for critical applications.