Project Overview
GIS Orchestra Places is a comprehensive geospatial data processing
application designed to manage, validate, and standardize place location
data. The system performs collection, validation, standardization, and
similarity analysis between Foursquare datasets and Overturemaps
reference data using PostgreSQL with the PostGIS extension and Node.js
automation scripts. The project orchestrates a sophisticated pipeline
that processes place records for major metropolitan areas, executing
geographic validation, data normalization, fuzzy matching algorithms,
spatial distance calculations, and quality assurance workflows before
generating standardized deliverables in multiple geospatial formats.
Architecture and Technology Stack
The system is built on a modern geospatial processing stack combining
database-centric operations with automated scripting workflows.
Core Technologies
The foundation relies on a PostgreSQL database server with the PostGIS
spatial extension for all geospatial computations and data storage. The
server runs on localhost port 5434 with a default workspace database
that hosts all processing tables and spatial operations. The automation
layer uses the Node.js runtime with the CommonJS module system to
orchestrate the
multi-step processing pipeline. Key dependencies include the pg library
for PostgreSQL connectivity, dotenv for environment configuration
management, and Puppeteer for automated report generation capabilities.
GDAL and QGIS tools provide the geospatial data transformation layer,
with ogr2ogr handling the export operations from PostgreSQL to standard
geospatial formats including GeoPackage, Shapefile, and GeoJSON.
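A minimal connection helper might look like the following sketch; the environment variable names are conventional assumptions, and the actual .env keys used by the project may differ.

```js
// Minimal sketch of the database connection layer; env var names are assumed.
require('dotenv').config();
const { Client } = require('pg');

async function getClient() {
  const client = new Client({
    host: process.env.PGHOST || 'localhost',
    port: Number(process.env.PGPORT) || 5434, // server runs on port 5434
    database: process.env.PGDATABASE || 'workspace',
    user: process.env.PGUSER,
    password: process.env.PGPASSWORD,
  });
  await client.connect();
  return client;
}

module.exports = { getClient };
```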
Data Architecture
The system maintains a structured directory hierarchy with clear
separation of concerns. The context directory stores JSON-LD semantic
files that serve as the source of truth for the entire system, defining
workflows, tools, coding principles, and project metadata. The project
directory contains city-specific artifacts and deliverables organized by
unique project identifiers. The data-schema directory holds JSON-LD
schemas defining geodata structures and delivery formats. The scripts_js
directory houses all Node.js automation scripts that drive the
processing pipeline. External raw geodata sources are stored in a
dedicated Google Drive location organized by region, supporting multiple
regions with dedicated subdirectories. Shared delivery outputs for
client distribution are maintained in a separate Google Drive folder
with timestamped delivery packages.
Data Processing Workflow
The system implements a multi-stage pipeline that
transforms raw place data through validation, normalization, matching,
and quality control stages before generating final deliverables.
Stage One - Data Initialization
The workflow begins with project initialization where raw place data
from two primary sources is imported into PostgreSQL tables. The
Foursquare dataset represents the primary collection requiring
validation, while the Overturemaps dataset serves as the authoritative
reference for comparison and matching operations. Each dataset is loaded
into dedicated PostgreSQL tables with PostGIS geometry columns
configured for WGS84 coordinate reference system. The system creates
backup copies of all source tables before any transformations occur,
ensuring data integrity throughout the processing lifecycle. The
metadata layer captures essential project information including spatial
coverage, data source locations, record counts, creation timestamps, and
version identifiers. This metadata drives subsequent pipeline operations
and provides traceability for all processing steps.
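A hedged sketch of this initialization step is shown below; the table and column names (fsq_places, geom) are purely illustrative, not the project's actual identifiers.

```js
// Hypothetical initialization sketch; table and column names are illustrative.
async function initProject(client) {
  // Ensure geometries are stored as WGS84 points.
  await client.query(`
    ALTER TABLE fsq_places
      ALTER COLUMN geom TYPE geometry(Point, 4326)
      USING ST_SetSRID(geom, 4326)
  `);

  // Back up the source table before any transformation occurs.
  await client.query('CREATE TABLE fsq_places_bkp AS TABLE fsq_places');
}
```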
Stage Two - Street Reference Extraction
The pipeline extracts unique street information from both place datasets
to create standardized street reference tables. This aggregation process
identifies distinct street type and street name combinations, collecting
all place geometries associated with each street into multipoint
geometry collections. The street extraction computes aggregated
statistics including total place count per street and spatial bounding
boxes that encompass all places on each street. These reference tables
become the foundation for subsequent matching operations, reducing
computational complexity by matching at the street level before
processing individual places. Each street record receives a sequential
primary key identifier and stores both the original street type and
toponym fields along with computed geometric aggregations. The system
maintains separate street tables for Foursquare and Overturemaps
datasets, enabling parallel processing and comparison workflows.
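The aggregation could be expressed roughly as follows; the table and column names here (fsq_places, fsq_streets, street_type, toponym) are assumptions.

```js
// Illustrative street extraction query; names are assumed.
async function extractStreets(client) {
  await client.query(`
    CREATE TABLE fsq_streets AS
    SELECT row_number() OVER ()          AS id,
           street_type,
           toponym,
           count(*)                      AS place_count,    -- places per street
           ST_Multi(ST_Collect(geom))    AS geom,           -- multipoint of all places
           ST_Envelope(ST_Collect(geom)) AS bbox            -- spatial bounding box
    FROM fsq_places
    GROUP BY street_type, toponym
  `);
}
```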
Stage Three - Data Normalization
The normalization stage adds standardized output columns to both street
reference tables, implementing consistent text processing rules across
all data sources. The system generates output fields by trimming
whitespace, collapsing multiple spaces into single spaces, and
preserving the original character casing for final delivery.
Development-specific normalized fields are created using lowercase
transformation and non-alphanumeric character replacement strategies.
Street type and toponym values are converted to lowercase, all
non-alphanumeric characters are replaced with single dash separators,
consecutive dashes are collapsed, and trailing dashes are removed to
create clean matching keys. This dual-column approach maintains
human-readable output fields while creating machine-optimized
development fields specifically designed for fuzzy string matching and
similarity analysis. The normalization ensures consistent comparison
regardless of original data quality variations.
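A sketch of the dual-column normalization, again with assumed column names (out_toponym, dev_toponym), might look like this:

```js
// Sketch of the output/development normalization; column names are assumed.
async function normalizeStreets(client, table) {
  await client.query(`
    ALTER TABLE ${table}
      ADD COLUMN IF NOT EXISTS out_toponym text,
      ADD COLUMN IF NOT EXISTS dev_toponym text
  `);
  await client.query(`
    UPDATE ${table}
    SET
      -- output field: trim and collapse whitespace, keep original casing
      out_toponym = regexp_replace(btrim(toponym), '\\s+', ' ', 'g'),
      -- development field: lowercase, non-alphanumerics to dashes,
      -- collapse consecutive dashes, strip trailing dashes
      dev_toponym = regexp_replace(
                      regexp_replace(
                        regexp_replace(lower(toponym), '[^a-z0-9]', '-', 'g'),
                        '-{2,}', '-', 'g'),
                      '-+$', '')
  `);
}
```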
Stage Four - Street Matching with Similarity Analysis
The matching engine performs sophisticated similarity analysis between
Foursquare and Overturemaps street tables using PostgreSQL trigram
similarity algorithms and fuzzy string matching extensions. The system
enables the unaccent, fuzzystrmatch, and pg_trgm PostgreSQL extensions
to support advanced text comparison operations. The matching process
executes in three distinct phases. The first phase identifies exact
matches on normalized toponym fields, assigning perfect similarity
scores of 1.0 and marking the match type as exact. The second phase
applies trigram similarity matching to remaining unmatched records,
calculating similarity scores between 0.0 and 1.0 for all candidate
pairs that meet a minimum threshold of 0.4 similarity. The third phase
ranks all similarity matches by score and retains only the best match
for each Foursquare street. The system adds comprehensive match metadata
columns to the Foursquare street table including boolean existence
flags, similarity score values, match type classifications, match
ranking integers for multiple candidate scenarios, and the matched
street name from the Overturemaps reference dataset. GIN indexes using
trigram operators are created on normalized fields to accelerate the
similarity search operations.
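A condensed sketch of the three phases, using hypothetical table and column names (fsq_streets, ovt_streets, dev_*), could look like this:

```js
// Condensed sketch of the street matching phases; names are illustrative.
async function matchStreets(client) {
  await client.query('CREATE EXTENSION IF NOT EXISTS unaccent');
  await client.query('CREATE EXTENSION IF NOT EXISTS fuzzystrmatch');
  await client.query('CREATE EXTENSION IF NOT EXISTS pg_trgm');

  // Trigram GIN index to accelerate similarity search.
  await client.query(`
    CREATE INDEX IF NOT EXISTS ovt_streets_dev_toponym_trgm
      ON ovt_streets USING gin (dev_toponym gin_trgm_ops)
  `);

  // Phase 1: exact matches on normalized toponyms.
  await client.query(`
    UPDATE fsq_streets f
    SET dev_exists = true,
        dev_similarity = 1.0,
        dev_match_type = 'exact',
        dev_matched_name = o.dev_toponym
    FROM ovt_streets o
    WHERE f.dev_toponym = o.dev_toponym
  `);

  // Phases 2 and 3: best trigram match above the 0.4 threshold for the rest.
  await client.query(`
    UPDATE fsq_streets f
    SET dev_similarity = c.score,
        dev_match_type = 'similarity',
        dev_matched_name = c.dev_toponym
    FROM (
      SELECT DISTINCT ON (f.id)
             f.id, o.dev_toponym,
             similarity(f.dev_toponym, o.dev_toponym) AS score
      FROM fsq_streets f
      JOIN ovt_streets o
        ON similarity(f.dev_toponym, o.dev_toponym) >= 0.4
      WHERE f.dev_match_type IS NULL
      ORDER BY f.id, score DESC
    ) c
    WHERE f.id = c.id
  `);
}
```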
Stage Five - Manual Review and Validation
Similarity matches require human validation before acceptance into the
final dataset. The system implements a two-phase manual review workflow
that exports candidate matches to JSON format with timestamp suffixes
for audit trail purposes. The export includes all similarity matches
ordered by score descending, presenting the original Foursquare street
names, matched Overturemaps street names, similarity scores, and a
manual_matched flag defaulted to false. Reviewers examine each match and
set the flag to true for approved correspondences, rejecting false
positives by leaving the flag at false. After review completion, the
system imports the validated matches and applies transformations to the
Foursquare street table. Approved matches receive the Overturemaps
street names as their normalized values, similarity scores are elevated
to 1.0 to indicate certainty, match types are reclassified as
manual_match, and existence flags are set to true. Rejected similarity
matches are marked with a match type of similarity_discard, similarity
scores are nullified, and existence flags remain false. This
human-in-the-loop validation ensures data quality while leveraging
automated matching to reduce manual effort. The system collects
comprehensive statistics including exact match counts, similarity match
counts, manual match counts, discard counts, and unmatched record
counts.
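The export half of this workflow might look roughly like the following sketch; the file naming pattern and selected columns are assumptions.

```js
// Sketch of the review export; file naming and columns are assumed.
const fs = require('fs');

async function exportReviewCandidates(client, outDir) {
  const { rows } = await client.query(`
    SELECT toponym, dev_matched_name, dev_similarity,
           false AS manual_matched
    FROM fsq_streets
    WHERE dev_match_type = 'similarity'
    ORDER BY dev_similarity DESC
  `);
  // Timestamp suffix for the audit trail.
  const stamp = new Date().toISOString().replace(/[-:T]/g, '').slice(0, 14);
  const outPath = `${outDir}/street-matches-review_${stamp}.json`;
  fs.writeFileSync(outPath, JSON.stringify(rows, null, 2));
  return outPath;
}
```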
Stage Six - Place Table Enrichment
The validated street matching results are propagated back to the
original place tables through relational joins. The system adds
development columns to the Foursquare place table and populates them
with normalized street data and match metadata from the street reference
table. Each place record receives normalized street type and toponym
values, existence flags indicating whether its street appears in the
Overturemaps dataset, match type classifications describing how the
street was matched, and the corresponding Overturemaps street names for
matched records. The join operation matches places to streets using the
original street type and toponym combination. The Overturemaps place
table undergoes similar enrichment, receiving normalized street values
that enable subsequent place-level matching operations. This symmetric
processing ensures both datasets use consistent normalization schemes
during final place comparison. Statistics are collected documenting
total place counts, places with complete street information, places with
streets that exist in the reference dataset, and the distribution of
match types across exact, similarity, and manual classification
categories.
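The propagation join could be sketched as follows, with assumed column names:

```js
// Illustrative propagation of street match results to the place table.
async function enrichPlaces(client) {
  await client.query(`
    UPDATE fsq_places p
    SET dev_street_type   = s.dev_street_type,
        dev_toponym       = s.dev_toponym,
        dev_street_exists = s.dev_exists,
        dev_street_match  = s.dev_match_type,
        dev_matched_name  = s.dev_matched_name
    FROM fsq_streets s
    WHERE p.street_type = s.street_type   -- join on original street type
      AND p.toponym     = s.toponym       -- and original toponym
  `);
}
```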
Stage Seven - Place-Level Matching
With street-level matching complete, the system performs granular
place-level matching based on civic numbers. The matching algorithm
implements a progressive strategy that attempts exact civic number
matches first, then falls back to number-only matches when exact matches
fail. The process initializes all Foursquare places to an unmatched
state before executing matching phases. The first matching phase
performs exact civic number matching, joining Foursquare and
Overturemaps places where streets are already matched and complete civic
number strings are identical. The second phase matches on numeric
components only, accommodating scenarios where suffix characters differ
but base numbers align. Each matched Foursquare place receives a boolean
existence flag, match type classification of exact or num_only, the
matched civic number value from Overturemaps, and the primary key
identifier of the matched Overturemaps place record. This identifier
enables subsequent spatial distance calculations between matched place
pairs. The progressive matching strategy maximizes match rates while
maintaining clear traceability of match quality through the match type
classification. Statistics track total places, matched counts, unmatched
counts, and the distribution between exact and number-only match types.
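A sketch of the progressive matching follows, with hypothetical civic number columns (civic_number for the full string, civic_num for the numeric base):

```js
// Sketch of progressive civic-number matching; columns are assumed.
async function matchPlaces(client) {
  // Reset every Foursquare place to an unmatched state.
  await client.query(`
    UPDATE fsq_places
    SET dev_place_exists  = false,
        dev_place_match   = NULL,
        dev_matched_civic = NULL,
        dev_matched_id    = NULL
  `);

  // Phase 1: exact civic number on already matched streets.
  await client.query(`
    UPDATE fsq_places f
    SET dev_place_exists  = true,
        dev_place_match   = 'exact',
        dev_matched_civic = o.civic_number,
        dev_matched_id    = o.id
    FROM ovt_places o
    WHERE f.dev_matched_name = o.dev_toponym
      AND f.civic_number = o.civic_number
  `);

  // Phase 2: fall back to the numeric component only.
  await client.query(`
    UPDATE fsq_places f
    SET dev_place_exists  = true,
        dev_place_match   = 'num_only',
        dev_matched_civic = o.civic_number,
        dev_matched_id    = o.id
    FROM ovt_places o
    WHERE f.dev_place_match IS NULL
      AND f.dev_matched_name = o.dev_toponym
      AND f.civic_num = o.civic_num
  `);
}
```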
Stage Eight - Spatial Distance Calculation
For all successfully matched place pairs, the system calculates the
spatial distance between Foursquare and Overturemaps geometries using
PostGIS spatial functions. The calculation transforms point geometries
from WGS84 geographic coordinates to a metric projection system to
obtain accurate distance measurements in meters. The default projection
is EPSG:32633 (WGS 84 / UTM zone 33N), which yields accurate metric
distances within that zone; a different coordinate reference system
identifier can be supplied at runtime for projects outside its coverage.
The ST_Distance function operates on the
transformed geometries, computing the shortest distance between matched
place points. Distance values are classified into seven discrete
clusters to support analysis and quality assessment. The clusters
include under 2 meters for high-precision matches, under 5 meters for
very close matches, under 10 meters for close matches, under 20 meters
for nearby matches, under 50 meters for moderate distance matches, under
100 meters for distant matches, and over 100 meters for significant
displacement scenarios. The distance calculation adds two columns to the
Foursquare place table: a double precision field storing the exact
distance in meters and a text field containing the cluster
classification. Statistics are aggregated including total places with
distance measurements, counts per distance cluster, average distance,
minimum distance, and maximum distance across the entire dataset. This
spatial analysis provides quantitative quality metrics for the matching
process, identifying potential geocoding discrepancies and enabling
prioritization of records requiring positional refinement.
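A sketch of the calculation and clustering, with the metric SRID passed as a parameter and illustrative column names and cluster labels:

```js
// Sketch of distance calculation and clustering; names and labels are assumed.
async function computeDistances(client, srid = 32633) {
  await client.query(`
    UPDATE fsq_places f
    SET dev_distance_m = ST_Distance(
          ST_Transform(f.geom, $1),   -- reproject both points to a metric CRS
          ST_Transform(o.geom, $1)
        )
    FROM ovt_places o
    WHERE f.dev_matched_id = o.id
  `, [srid]);

  await client.query(`
    UPDATE fsq_places
    SET dev_distance_cluster = CASE
      WHEN dev_distance_m < 2   THEN '<2m'
      WHEN dev_distance_m < 5   THEN '<5m'
      WHEN dev_distance_m < 10  THEN '<10m'
      WHEN dev_distance_m < 20  THEN '<20m'
      WHEN dev_distance_m < 50  THEN '<50m'
      WHEN dev_distance_m < 100 THEN '<100m'
      ELSE '>100m'
    END
    WHERE dev_distance_m IS NOT NULL
  `);
}
```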
Stage Nine - Duplicate Geometry Detection
The quality control stage identifies places sharing identical geographic
coordinates, flagging potential data quality issues where multiple
distinct places are assigned the same point location. The detection
algorithm uses GeoJSON string comparison at six decimal place precision,
which provides approximately 0.1 meter positional accuracy. The system
converts all place geometries to GeoJSON format and groups by the
resulting strings to identify duplicates. Places whose geometries appear
more than once in the dataset receive a boolean duplicate flag set to
true. The statistics capture total place counts, duplicate place counts,
unique place counts, and the number of distinct duplicate geometry
groups. This analysis highlights geocoding quality issues, address range
interpolation artifacts, or data collection methodology problems that
require attention before final delivery. The duplicate flag enables
downstream filtering and prioritization strategies.
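The grouping could be expressed roughly as:

```js
// Illustrative duplicate-geometry flagging at six decimal places.
async function flagDuplicates(client) {
  await client.query(`
    UPDATE fsq_places p
    SET dev_duplicate = true
    FROM (
      SELECT ST_AsGeoJSON(geom, 6) AS gj   -- GeoJSON string, 6 decimal places
      FROM fsq_places
      GROUP BY ST_AsGeoJSON(geom, 6)
      HAVING count(*) > 1                  -- geometries shared by 2+ places
    ) d
    WHERE ST_AsGeoJSON(p.geom, 6) = d.gj
  `);
}
```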
Stage Ten - Delivery Table Construction
The system creates a final delivery table from the enriched and
validated Foursquare place data, implementing a standardized schema that
supports client requirements and maintains development metadata for
quality tracking. The delivery table follows a versioned naming
convention incorporating the project identifier and version number with
dots replaced by underscores. The schema includes standard columns for
municipality identification code and name, street type and toponym from
output normalized fields, placeholder columns for alternative names and
historical data, civic number components separated into base number and
suffix values, coordinate pairs extracted from point geometries, and
placeholder fields for postal codes, source attribution, unique
identifiers, priority rankings, change tracking, and revision
timestamps. Development columns are preserved in the delivery table to
maintain full traceability. These include street existence flags in the
reference dataset, street match type classifications, place existence
flags in the reference dataset, place match type classifications,
matched place primary key references, spatial distance measurements to
matched places, distance cluster classifications, and duplicate geometry
flags. The delivery table preserves original point geometries in WGS84
coordinate reference system, ensuring compatibility with standard GIS
platforms and web mapping applications. The table is created as a new
database object through a SELECT statement with column transformations,
maintaining data lineage back to source tables.
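A compact sketch of the versioned table construction, with assumed naming and a reduced column list:

```js
// Sketch of the delivery table build; table naming and columns are assumed.
async function buildDeliveryTable(client, projectId, version) {
  const table = `delivery_${projectId}_${version.replace(/\./g, '_')}`;
  await client.query(`
    CREATE TABLE ${table} AS
    SELECT out_street_type      AS street_type,
           out_toponym          AS toponym,
           civic_num,
           civic_suffix,
           ST_X(geom)           AS lon,      -- coordinates from the point geometry
           ST_Y(geom)           AS lat,
           dev_street_exists,
           dev_place_match,
           dev_distance_m,
           dev_distance_cluster,
           dev_duplicate,
           geom                              -- original WGS84 geometry preserved
    FROM fsq_places
  `);
  return table;
}
```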
Stage Eleven - Geospatial Format Export
The delivery table is exported from PostgreSQL to GeoPackage format
using the ogr2ogr command-line utility from GDAL. GeoPackage provides an
open, standards-based, platform-independent, portable, self-describing,
compact format for transferring geospatial information, implemented as a
SQLite database container. The export process constructs a versioned
delivery folder name incorporating the current date in YYMMDD format,
the project slug identifier, and the semantic version components. The
folder is created within a shared Google Drive location accessible to
clients and stakeholders. The ogr2ogr execution includes critical
configuration options to ensure offline processing without network-based
coordinate reference system definitions and to optimize directory
reading performance. The geometry field specification, layer naming, and
overwrite flags ensure clean export results. User confirmations are
required before export execution and before optional deletion of the
source delivery table from PostgreSQL. The confirmation workflow
displays the input table name, output folder path, and output filename
for verification before proceeding. After successful export, the system
updates project metadata with delivery folder path, delivery file path,
and export timestamp, creating a complete audit trail. The optional
table deletion reduces database storage requirements after delivery
completion while maintaining backup copies of source data tables.
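A sketch of the folder naming and ogr2ogr invocation follows; the project's exact configuration options and naming separators are not reproduced here, so the connection details, folder name format, and flag set are assumptions.

```js
// Sketch of delivery folder naming and GeoPackage export; all names assumed.
const { execFileSync } = require('child_process');
const path = require('path');

function deliveryFolderName(slug, version) {
  // YYMMDD date prefix, project slug, semantic version with dots replaced.
  const yymmdd = new Date().toISOString().slice(2, 10).replace(/-/g, '');
  return `${yymmdd}_${slug}_v${version.replace(/\./g, '_')}`;
}

function exportDelivery(table, outDir, layerName) {
  const outFile = path.join(outDir, `${layerName}.gpkg`);
  execFileSync('ogr2ogr', [
    '-f', 'GPKG', outFile,
    'PG:host=localhost port=5434 dbname=workspace',
    table,                 // source delivery table in PostgreSQL
    '-nln', layerName,     // output layer name
    '-overwrite',
  ], { stdio: 'inherit' });
  return outFile;
}
```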
Stage Twelve - Automated Report Generation
The final stage produces comprehensive processing reports in both HTML
and PDF formats, documenting the complete pipeline execution with
statistics, quality metrics, and data lineage information. The report
generation reads from a centralized report-data.jsonld file that serves
as the single source of truth for all process execution metadata.
Reports include project identification and metadata, data source
descriptions with record counts, processing stage summaries with
execution timestamps, match statistics across all stages, quality
metrics including distance distributions and duplicate counts, and
delivery artifact details with file locations and export timestamps. The
HTML report supports interactive exploration with expandable sections
and embedded visualizations. The PDF report provides an archival format
suitable for client deliverables and long-term documentation
requirements. Both formats are generated through automated templating
and stored in the project report directory with version-specific
subdirectories.
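The PDF rendering step with Puppeteer could be as simple as the following sketch, with assumed file paths:

```js
// Minimal sketch of HTML-to-PDF rendering with Puppeteer; paths are assumed.
const puppeteer = require('puppeteer');

async function renderPdf(htmlPath, pdfPath) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(`file://${htmlPath}`, { waitUntil: 'networkidle0' });
  await page.pdf({ path: pdfPath, format: 'A4', printBackground: true });
  await browser.close();
}
```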
Project Organization and Metadata
The system implements a comprehensive metadata management strategy using
JSON-LD semantic structures throughout the entire processing lifecycle.
This approach provides machine-readable project definitions that support
automated code generation, pipeline orchestration, and documentation
generation.
JSON-LD Context System
The context directory serves as the canonical source of truth for all
project semantics. The workflow-jsonld.json file acts as the main index,
referencing all subordinate context files including project metadata,
coding principles, project principles, maintenance workflows, project
tools, address processing tools, current subproject context, data
schemas, and git semantics. Each context file implements Schema.org
vocabularies with domain-specific extensions, enabling semantic querying
and relationship inference across project artifacts. The structured
metadata supports automated validation, consistency checking, and
cross-reference verification throughout development and processing
workflows.
Subproject Structure
Individual city processing projects are organized as subprojects within
the larger framework, each maintaining independent metadata, pipeline
configuration, results, and deliverables while sharing common tooling
and processing scripts. Each subproject directory contains a
metadata.jsonld file describing the project identifier, name,
description, version, creation and modification timestamps, spatial
coverage with geographic bounds, directory locations, data sources with
contentUrls and table definitions, pipeline configuration reference,
deliverables listing, notes collection, and changelog with versioned
snapshots. The pipeline.jsonld file defines the enabled processing
steps, deprecated tools, and available tools for the specific
subproject. This configuration drives the automated pipeline runner,
determining which processing stages execute and in what sequence.
Version Management
The system implements semantic versioning with major, minor, and patch
components. Version numbers are embedded in table names, delivery folder
names, and file names to ensure traceability and prevent conflicts. The
version progression is tracked in metadata changelog entries, capturing
data source changes, processing modifications, and deliverable updates.
Version bumps trigger delivery folder regeneration with updated naming
conventions, ensuring clear separation between different processing
iterations. The versioning strategy supports parallel processing of
multiple city datasets while maintaining consistent tooling and
methodology.
Process Tracking and Statistics
Each processing stage records execution metadata including start and end
timestamps, input and output table names, record counts processed, match
statistics, quality metrics, and error conditions. This execution data
is aggregated into the report-data.jsonld file, providing a complete
audit trail for the entire pipeline run. Statistics collection is
standardized across all processing stages, enabling consistent reporting
and comparative analysis across different city datasets. The statistics
support quality monitoring, performance optimization, and methodology
refinement across project iterations.
Quality Assurance and Validation
The system incorporates multiple quality control mechanisms throughout
the processing pipeline to ensure data accuracy, completeness, and
consistency.
Backup and Recovery
All source data tables are backed up immediately after import and before
any transformation operations. Backup table naming follows a consistent
convention appending a _bkp suffix to the original table name. This
strategy enables rapid recovery from processing errors and supports
comparative analysis between original and transformed data states.
Manual Review Checkpoints
Critical matching operations include mandatory manual review checkpoints
where automated results are exported for human validation before
acceptance. The review workflow preserves audit trails with timestamped
export files and version-controlled validation decisions.
Spatial Validation
Geographic coordinates are validated through PostGIS spatial functions
including geometry validity checks, coordinate reference system
verification, and spatial relationship testing. Invalid geometries are
flagged for investigation and correction before delivery generation.
Statistical Monitoring
Comprehensive statistics are collected at each processing stage,
enabling detection of anomalous patterns that might indicate data
quality issues or processing errors. Statistical thresholds can trigger
warnings or halt processing pending manual investigation.
Match Quality Metrics
The distance calculation and clustering analysis provide quantitative
quality metrics for the matching process. High match rates combined with
low average distances indicate successful processing, while high
distances or low match rates trigger quality review workflows.
Pipeline Orchestration and Automation
The system implements a flexible pipeline orchestration framework that
supports both automated batch processing and interactive development
workflows.
Pipeline Runner
The runPipeline.js script orchestrates sequential execution of all
enabled processing stages defined in the subproject pipeline
configuration. The runner loads the current subproject context, reads
the pipeline configuration, resolves tool identifiers to script file
paths, executes each enabled process in sequence, captures execution
output and error conditions, maintains execution statistics, and
terminates on first error to prevent cascade failures. The runner
supports parameterized tool invocation where certain processes accept
runtime arguments such as dataset type specifications or coordinate
reference system identifiers. The execution output is streamed to the
console in real-time, providing visibility into processing progress.
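A condensed sketch of the fail-fast loop, assuming a registry object that maps process identifiers to script paths and a pipeline object with an enabled array:

```js
// Condensed sketch of the fail-fast pipeline runner; data shapes are assumed.
const { execFileSync } = require('child_process');

function runPipeline(pipeline, registry) {
  for (const step of pipeline.enabled) {
    const script = registry[step.id];       // resolve tool id to script path
    if (!script) throw new Error(`Unknown process: ${step.id}`);
    console.log(`Running ${step.id} ...`);
    try {
      // Stream output to the console; any non-zero exit aborts the run.
      execFileSync('node', [script, ...(step.args || [])], { stdio: 'inherit' });
    } catch (err) {
      console.error(`Pipeline halted at ${step.id}: ${err.message}`);
      process.exit(1);
    }
  }
}
```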
Subproject Initialization
The subprojectInit.js script automates creation of new city processing
projects, generating the complete directory structure, metadata files,
pipeline configuration, and data source placeholders. This
initialization ensures consistency across all city projects and reduces
manual setup effort.
Current Subproject Context
The current-subproject.jsonld file maintains a reference to the active
project identifier, enabling tool scripts to dynamically resolve table
names, file paths, and configuration settings without hardcoded values.
This abstraction supports parallel development on multiple city projects
within a single workspace.
Tool Registry
The pipeline orchestration maintains a dynamic tool registry mapping
process identifiers to executable script paths. The registry is
constructed by reading tool definitions from the context/tools directory
and project-specific tools directories, enabling both shared common
tools and project-specific customizations. Tools are defined using
JSON-LD Action schemas that specify the tool identifier, description,
target entry point with URL template, action application with code
repository path, and instrument properties describing parameters,
inputs, outputs, and behaviors.
Error Handling
Pipeline execution implements fail-fast error handling where any process
failure immediately terminates the entire pipeline run. This approach
prevents downstream processes from operating on incomplete or corrupted
data. Error messages include the failed process identifier and detailed
error descriptions to support troubleshooting.
Data Schema and Standards
The system enforces standardized data schemas throughout the processing
pipeline, ensuring consistency and interoperability across different
city datasets.
Input Schema
Source data tables conform to defined input schemas specifying required
columns including original street type and toponym fields, civic number
fields with optional numeric and suffix components, spatial geometry
columns in WGS84 coordinate reference system, and optional attributes
such as municipality codes, names, and postal codes. The schema
definitions are maintained as JSON-LD files in the data-schema
directory, providing machine-readable specifications that support
automated validation and documentation generation.
Development Schema
Processing stages add development columns with standardized naming
conventions using a dev_ prefix. These columns store intermediate
processing results including normalized field values, existence flags,
match type classifications, similarity scores, matched references, and
quality indicators. The development schema segregates transformation
artifacts from production output fields, enabling clear separation
between final deliverable content and processing metadata useful for
quality control and debugging.
Output Schema
The delivery table implements a standardized output schema designed to
support client requirements and GIS platform compatibility. Column names
use concise abbreviations, coordinate fields use double precision
numeric types, text fields are sized appropriately for their content, and
geometry columns specify explicit spatial reference systems. The output
schema includes placeholder columns for future enhancements, ensuring
schema stability as requirements evolve. All delivery tables across
different city projects conform to the same schema, enabling aggregation
and comparative analysis.
Extension Points and Customization
While the core processing pipeline provides standardized workflows
applicable to most city datasets, the system supports project-specific
customizations through several extension mechanisms.
Custom Processing Scripts
Projects can define custom processing scripts in the project-specific
tools directory, identified by process identifiers incorporating the
project slug. These custom processes are registered in the tool registry
and can be inserted at specific points in the pipeline configuration.
Custom scripts have access to the same database client utilities,
context loading functions, and report tracking mechanisms as standard
tools, ensuring consistent patterns and error handling.
Pipeline Configuration
The pipeline.jsonld enabled array can be reordered or filtered to skip
certain stages, add custom processes, or modify the processing sequence.
This flexibility supports experimentation, debugging, and adaptation to
unique dataset characteristics without modifying core tooling.
Metadata Extensions
Project metadata files support arbitrary property additions beyond the
base schema, enabling capture of project-specific attributes such as
special geographic considerations, data source peculiarities, or
client-specific requirements.
Conclusion
GIS Orchestra Places implements a sophisticated geospatial data
processing pipeline that transforms heterogeneous place datasets into
standardized, validated, and quality-controlled deliverables. The system
combines automated batch processing with strategic human review
checkpoints, achieving high efficiency while maintaining data quality.
The metadata-driven architecture using JSON-LD semantics enables
machine-readable project definitions that support automation, code
generation, and documentation workflows. The modular design with clear
separation between framework tooling and project-specific customizations
enables scalable processing across multiple city datasets while
maintaining consistency. The comprehensive quality assurance mechanisms
including backup strategies, spatial validation, statistical monitoring,
distance-based match quality metrics, and duplicate detection ensure
delivery of reliable geospatial data products suitable for critical
applications.