========================================
Architecture & Performance Model
========================================

Overview
--------
`eo-processor` accelerates Earth Observation (EO) numeric primitives by implementing
user‑defined functions (UDFs) in Rust and exposing them to Python via `PyO3`.
This page details the architectural choices that yield consistent speedups over
pure Python / NumPy workflows while preserving safety, determinism, and API clarity.

Design Objectives
-----------------
1. Eliminate Python GIL contention for CPU-bound loops.
2. Provide dimensional dispatch (1D–4D) without Python-level branching overhead.
3. Preserve numerical stability (uniform float64 internal type + epsilon guards).
4. Minimize intermediate allocations (single-pass or fused operations where feasible).
5. Ensure compile-time memory safety (no `unsafe` blocks).
6. Fit seamlessly into XArray / Dask pipelines as ufunc-like callables.
7. Keep error messages explicit (shape mismatches, invalid parameter values).
8. Offer predictable scaling thresholds for parallel execution (avoid micro-task overhead).

Rust Extension Boundary
-----------------------
The compiled module (`eo_processor._core`) is a native extension created by `maturin`.
Key steps at build/import time:

1. Rust crates (`ndarray`, `numpy`, `rayon`, `pyo3`) compile into a shared object.
2. Functions annotated with `#[pyfunction]` are registered inside the `#[pymodule] fn _core(...)`.
3. Python entrypoints in `python/eo_processor/__init__.py` provide thin wrappers (and aliases like `evi`).
4. Autodoc and type stub (`__init__.pyi`) expose signatures to editors and static type checkers.

Dimensional Dispatch
--------------------
Each computational family (indices, temporal, spatial, masking) inspects the input array rank
inside Rust:

- Spectral indices: support 1D or 2D (and some 3D/4D variants for batch/time-first layouts).
- Temporal statistics: 1D–4D; interpret leading axis as time.
- Median compositing: specialized 1D–4D implementations to avoid generalized recursion.
- Masking utilities: unified loops over arbitrary rank (explicit 1D–4D branches).

This avoids repeated Python `if arr.ndim == ...` checks and reduces interpreter overhead.

Numerical Stability & Coercion
------------------------------
All incoming numeric dtypes (int, uint, float32, float64) are coerced to `f64` once.
Benefits:

- Single arithmetic path (no branching for dtype).
- Consistent floating precision for downstream calculations.
- Simplifies NaN handling (uses IEEE-754 semantics uniformly).

Small denominators in normalized difference style formulas are guarded with an `EPSILON`
constant to prevent division blow-ups; near-zero denominators default to 0 or neutral outcomes
as documented in docstrings.

Parallel Execution Strategy
---------------------------
Selective Rayon usage:

- Temporal loops: parallel over spatial indices (pixels) after slicing time series.
- Pairwise distances: parallel over outer dimension only when `N * M` exceeds a threshold.
- Median/Std/Mean: parallel chunking for 3D / 4D arrays.

Why thresholds? Aggressive parallelism on small arrays causes overhead that erodes gains.
Threshold tuning (e.g., `10_000` distance pair cutoff) can be adjusted with empirical benchmarking.

Memory Efficiency
-----------------
Common Python/NumPy pattern:
    (nir - red) / (nir + red)

Creates two temporaries unless optimized; Rust versions combine operations in a single loop.
Similarly, `ΔNDVI` and `ΔNBR` compute pre+post indices in fused passes (one set of ephemeral values).
Median computations limit series extraction to exact pixel slices, sorting compact vectors.

NaN Handling
------------
Temporal and median functions optionally `skip_na=True`:

- Filter NaNs before statistical aggregation.
- If all samples are NaN, result is NaN (propagate emptiness semantics).
- Avoids NumPy masked array overhead and branch-heavy Python loops.

Error Handling
--------------
Rust raises Python exceptions with explicit context:

- Shape mismatch (e.g., spectral band arrays differ in shape).
- Invalid parameter (e.g., Minkowski `p < 1.0`).
- Unsupported dimensionality (non 1D–4D arrays trigger `TypeError`).
- SCL masking receives invalid codes gracefully; missing code sets output to NaN.

Performance Comparative Summary
-------------------------------
Category           | Rust UDF Advantage | Fairness Caveat
------------------ | ------------------ | ---------------
Spectral Indices   | Fused arithmetic + GIL release reduce temporaries. | NumPy already vectorized; gains typically modest (≤1.3×) unless very large tiles.
Temporal Stats     | Parallel spatial reduction (Rayon) amortizes time axis iteration. | Small grids may see overhead; benchmark your T,Y,X before adopting.
Median Composite   | Native per‑pixel nth‑selection (planned) + parallel iteration. | Current implementation still sorts entire series; optimization ongoing.
Pairwise Distances | Streaming algorithm avoids allocating N×M×D broadcasts. | A “broadcast” NumPy baseline does extra work; compare both broadcast & streaming baselines.
Masking Utilities  | Single pass with minimal Python overhead. | If mask condition is simple, NumPy boolean indexing can match performance.

Important: A headline speedup >2× usually indicates an algorithmic difference (e.g., avoiding large broadcasts) rather than lower constant factors alone. Always verify with size sweeps.

Integration with XArray / Dask
------------------------------
Use `xarray.apply_ufunc` with `dask="parallelized"`:

.. code-block:: python

    import xarray as xr
    import dask.array as da
    from eo_processor import ndvi

    nir = xr.DataArray(da.random.random((6000, 6000), chunks=(750, 750)), dims=["y", "x"])
    red = xr.DataArray(da.random.random((6000, 6000), chunks=(750, 750)), dims=["y", "x"])

    ndvi_xr = xr.apply_ufunc(
        ndvi,
        nir,
        red,
        dask="parallelized",
        output_dtypes=[float],
    )
    out = ndvi_xr.compute()

The Rust function executes inside each Dask worker on chunk data without holding the GIL.
For large temporal stacks (time-first), chunk on spatial dimensions to maximize per-task parallelism.

Extensibility Guidelines
------------------------
When adding a new index or transform:

1. Implement in a domain-specific module (e.g., `indices.rs`).
2. Provide dimensional dispatch for relevant rank(s).
3. Add input shape assertions early.
4. Add parallelization only if O(N) or O(N*M) complexity is high enough.
5. Update Python exports + stubs + README + Sphinx autosummary list.
6. Add tests (shape mismatch, NaN scenario, typical value range).
7. Benchmark if claiming performance improvement (>20% threshold preferred).

Testing & Validation
--------------------
Rust unit tests validate core arithmetic and shape behavior.
Python tests (in `tests/`) confirm wrapper semantics and dtype coercion.
Coverage badge should be regenerated after logic additions.

Representative Benchmark Template & Empirical Results
-----------------------------------------------------

Template
~~~~~~~~
.. code-block:: python

    import numpy as np, time
    from eo_processor import temporal_mean

    cube = np.random.rand(24, 1024, 1024)  # (time, y, x)
    t0 = time.perf_counter()
    rust_mean = temporal_mean(cube)
    rust_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    numpy_mean = np.nanmean(cube, axis=0)
    numpy_t = time.perf_counter() - t0

    print(f"Rust {rust_t:.3f}s vs NumPy {numpy_t:.3f}s (speedup {numpy_t / rust_t:.2f}x)")
    assert np.allclose(rust_mean, numpy_mean, atol=1e-12)

Methodology
~~~~~~~~~~~
- Platform: macOS (ARM64), CPython 3.10, release build of the Rust extension.
- Timings: single run wall-clock using time.perf_counter(), warm cache.
- Dtypes: float64 inputs (internal coercion matches typical use).
- Validation: np.allclose(...) with tight tolerances (1e-12–1e-9).
- Disclaimer: Results vary with CPU topology, memory bandwidth, and array shapes.
  Always benchmark on your target workload before making pipeline decisions.

Representative Results
~~~~~~~~~~~~~~~~~~~~~~
Large-array benchmarks (single run; no warm repeats):

+---------------------+----------------------------+-----------+-----------+-----------+
| Function            | Input Shape                | Rust (s)  | NumPy (s) | Speedup   |
+=====================+============================+===========+===========+===========+
| ndvi                | (5000, 5000)               | 0.080     | 0.112     | 1.40x     |
| median (temporal)   | (15, 2000, 2000)           | 1.313     | 1.947     | 1.48x     |
| temporal_mean       | (24, 2000, 2000)           | 0.891     | 0.474     | 0.53x     |
| euclidean_distance  | (3000, 64) × (3000, 64)    | 0.063     | 0.041     | 0.65x     |
+---------------------+----------------------------+-----------+-----------+-----------+

Interpretation
~~~~~~~~~~~~~~
- Fused Index Arithmetic: ndvi shows a clear gain from single-pass fused math + GIL release.
- Temporal Median: Sorting per pixel in native code + parallel spatial iteration yields
  >1.4× improvement versus NumPy’s nanmedian for this shape.
- Temporal Mean / Std: For medium-sized grids Rust may be slower if parallel thresholds
  trigger overhead. Larger time dimensions or spatial extents generally improve relative
  performance; tune chunking (e.g., via Dask) for best results.
- Pairwise Distances: For moderately sized matrices NumPy’s optimized BLAS-backed
  matrix multiply strategy can outperform the current Rust implementation. Rust gains
  appear for larger N×M where outer-loop parallelization amortizes overhead.

Guidance
~~~~~~~~
- Use Rust spectral indices (ndvi, nbr, etc.) for production-scale tile processing.
- Prefer the Rust median for large temporal stacks when robustness to outliers matters.
- Benchmark temporal_mean / temporal_std on your exact sizes; consider increasing
  spatial dimensions or enabling chunked execution for better scaling.
- For very large pairwise distance workloads (> millions of pairs) the Rust implementation
  can scale with additional cores; otherwise NumPy's vectorized BLAS approach may suffice.

Performance Claim Template
~~~~~~~~~~~~~~~~~~~~~~~~~~
When adding new optimized functions, follow the required documentation pattern:

.. code-block:: text

    Benchmark:
    Array size: 5000 x 5000
    Old: 1.42s
    New: 1.05s
    Speedup: 1.35x

    Methodology:
    Single run, warm cache discarded
    Timing via time.perf_counter()
    Validation: np.allclose(new, old, atol=1e-12)

Keep empirical claims conservative and reproducible. Attach code snippets or
reference a test harness (noting environment) for transparency.

Key Differences from Pure NumPy
-------------------------------
- Fewer temporaries in multi-step formulas.
- Parallelism not limited by GIL.
- Consistent float64 internal representation.
- Explicit shape checks produce clearer early errors than silent broadcasting misalignments.
- Integrated NaN filtering in statistical functions (no separate boolean masks required).

Future Optimization Opportunities
---------------------------------
(Subject to human review per repository guidelines)

- Adaptive parallel thresholds informed by runtime profiling.
- Optional streaming median (reducing per-pixel vector allocation for long time series).
- SIMD acceleration (requires careful evaluation; staying within safe Rust & avoiding `unsafe`).
- Advanced mask composition DSL (predicate fusion to reduce data passes).

Security & Safety Notes
-----------------------
- No `unsafe` code blocks in kernel implementations.
- No network or file I/O in the performance-critical path.
- All external inputs constrained to NumPy array interfaces; memory safety enforced by borrow semantics.

Reference of Core Modules
-------------------------
Module        | Purpose
------------- | --------------------------------------------
`indices`     | Spectral & change detection indices (NDVI, NDWI, SAVI, EVI, NBR, NDMI, NBR2, GCI, ΔNDVI, ΔNBR)
`temporal`    | Time-axis statistics (mean, std) with NaN skipping
`spatial`     | Median compositing + pairwise distance calculations
`masking`     | Value/range/sentinel/SCL masking utilities
`lib.rs`      | PyO3 module registration (bridge to Python)

Rust Acceleration Note
----------------------
The high-level rationale for Rust UDF acceleration (GIL release, dimensional dispatch, float64 coercion, parallel strategy) is also summarized on the main index page. For convenience:

.. admonition:: Summary
   :class: tip

   - GIL-free native loops enable true multi-core throughput.
   - Float64 coercion yields stable, predictable numerics.
   - Selective Rayon parallelism avoids overhead on small workloads.
   - Fused loop bodies minimize intermediate allocations.
   - Dimension-aware implementations remove Python branching overhead.
   - No `unsafe`; memory safety guaranteed by compiler.

Cross-References
----------------
- Function Index: :doc:`functions`
- README Overview: :doc:`../README`
- Quick Start: :doc:`../QUICKSTART`
- Contribution Guide: External link in navigation.

License
-------
MIT. See repository root for full text.

Benchmark Report (Generated)
----------------------------
The following benchmark summary (generated via ``scripts/benchmark.py``) is included for up-to-date performance measurements across spectral indices and related functions:

.. include:: benchmarks.rst
.. include:: dist_bench.rst

Python Version Benchmarks
-------------------------
The following per-interpreter benchmark reports are generated via version-specific tox
benchmark environments (py38-bench … py313-bench). Each file captures synthetic performance
snapshots for that Python version using identical data shapes and loop counts.

.. include:: benchmark-py38.rst
.. include:: benchmark-py39.rst
.. include:: benchmark-py310.rst
.. include:: benchmark-py311.rst
.. include:: benchmark-py312.rst
.. include:: benchmark-py313.rst

Benchmark Fairness & Methodology
--------------------------------
To ensure transparent comparisons we distinguish two baseline styles:

1. Broadcast Baseline (NumPy):
   - Uses broadcasting to form (N, M, D) intermediary (explicitly or implicitly via algebra).
   - High memory footprint; stresses allocator & cache.
2. Streaming Baseline (NumPy):
   - Pure Python loop over outer dimension; each iteration performs a vectorized reduction over D.
   - Low memory; more Python overhead.

Rust distance functions currently implement a streaming pattern with parallel outer-loop execution (when size threshold exceeded). When reporting speedups:
speedup_vs_numpy = baseline_mean / rust_mean
Values > 1 mean Rust faster. A large speedup versus the broadcast baseline may disappear or shrink versus the streaming baseline.

Reporting Guidelines:
- Always state which baseline was used (broadcast / streaming / both).
- Include array or point set shape and element count.
- Use ≥5 timing loops + ≥1 warmup; prefer median over mean if distribution skewed.
- Disclose algorithmic differences (e.g., fewer temporaries) rather than attributing all gains to “Rust”.
- For spectral indices, expect near bandwidth-limited behavior; huge (>1.5×) gains are unlikely on modern NumPy unless parallelism or algorithmic fusion changes.

Stress Benchmarks
-----------------
Stress benchmarks highlight scalability for Earth Observation “big tile” or large point set workloads (e.g., 4096×4096 spatial grids, time depth ≥48, distance matrices with N,M ≥ 10,000). Enable via:
  scripts/benchmark.py --group all --stress --compare-numpy --distance-baseline both --loops 7 --warmups 2 --size-sweep 2048x2048 4096x4096

Outputs should be included as generated reST fragments:
.. include:: stress-benchmarks.rst

Interpreting Stress Results:
- Spectral indices: near-linear scaling with element count until memory bandwidth saturates.
- Temporal reducers: benefit from more spatial pixels (parallel work units) and longer time axis (amortized slice overhead).
- Distances: Rust streaming avoids allocating (N,M,D) memory; broadcast NumPy baseline may degrade sharply with very large N,M,D.

Planned Improvements (Benchmarking):
- Add nth‑selection median to reduce O(T log T) sorts to O(T).
- Adaptive parallel threshold auto-tuned from calibration run.
- Optional SIMD path (still safe Rust) after empirical validation.

Documentation Update Protocol:
Whenever a new optimization lands:
1. Re-run size sweep (small, medium, large shapes).
2. Update fairness table (above) if algorithmic class changes.
3. Regenerate stress-benchmarks.rst.
4. Provide before/after code snippet and numerical equivalence check (np.allclose with tolerance).
5. Bump minor version if public API extended; patch if internal only.

End of Architecture Overview.