algorithm chain for ram-vram

2026-03-28 08:22:14 +00:00
parent ec2027e980
commit 876b8f4592
8 changed files with 883 additions and 172 deletions
--- a/docs/algorithms/algoritm.md
+++ b/docs/algorithms/algoritm.md
@@ -1,7 +1,338 @@
-# Aggoritm definition
+# Algorithm Infrastructure
+
+## Overview

 An algorithm in the uLib infrastructure is a class for containing a functional that can be dynamically loaded into memory as a plug-in.
-It derives from the base Object class and therefore can contain properties that define the serialization of operating parameters or the implementation of widgets for interactive parameter manipulation.
-The algorithm class is designed to be inserted into a Task, i.e., a class for managing the execution of scheduled operations. A task contains a Run and Stop method to start and stop execution. Furthermore, a task can be configured to work in cyclic or asynchronous mode: in cyclic mode it will be possible to define a cycle time, while in asynchronous mode a task can be hooked to a signal-slot of the Object structure or to a condition variable defined in the monitor pattern (Monitor.h).
+It derives from the base `Object` class (`Core/Object.h`) and therefore can contain properties that define the serialization of operating parameters or the implementation of widgets for interactive parameter manipulation.
+
+An algorithm is designed to be inserted into an `AlgorithmTask`, a class that manages the execution of scheduled operations. A task provides `Run` and `Stop` methods to start and stop execution, and can be configured to run in one of two modes:
+
+- **Cyclic mode**: the algorithm is executed periodically with a configurable cycle time.
+- **Asynchronous mode**: the task waits for a trigger before each execution. Triggers can come from the uLib signal-slot system (`Object::connect`) or from a condition variable as defined in the monitor pattern (`Core/Monitor.h`).
+
+`Algorithm` is a class template parameterized on two types, `T_enc` and `T_dec`. The encoder (`T_enc`) is either the input data type or an upstream algorithm chained to this one whose output is compatible with this algorithm's input. The decoder (`T_dec`) is either the output data type or a downstream algorithm that accepts that output.
+
+## Class Hierarchy
+
+```
+Object (Core/Object.h)
+  |
+  +-- Algorithm<T_enc, T_dec>        (Core/Algorithm.h)
+  |     |
+  |     +-- VoxImageFilter<VoxelT, CrtpImplT>  (Math/VoxImageFilter.h)
+  |           |
+  |           +-- VoxFilterAlgorithmLinear       (Math/VoxImageFilterLinear.hpp)
+  |           +-- VoxFilterAlgorithmMedian       (Math/VoxImageFilterMedian.hpp)
+  |           +-- VoxFilterAlgorithmAbtrim       (Math/VoxImageFilterABTrim.hpp)
+  |           +-- VoxFilterAlgorithmSPR          (Math/VoxImageFilterABTrim.hpp)
+  |           +-- VoxFilterAlgorithmThreshold    (Math/VoxImageFilterThreshold.hpp)
+  |           +-- VoxFilterAlgorithmBilateral    (Math/VoxImageFilterBilateral.hpp)
+  |           +-- VoxFilterAlgorithmBilateralTrim(Math/VoxImageFilterBilateral.hpp)
+  |           +-- VoxFilterAlgorithm2ndStat      (Math/VoxImageFilter2ndStat.hpp)
+  |           +-- VoxFilterAlgorithmCustom        (Math/VoxImageFilterCustom.hpp)
+  |
+  +-- Thread (Core/Threads.h)
+        |
+        +-- AlgorithmTask<T_enc, T_dec>  (Core/Algorithm.h)
+```
+
+## Algorithm (`Core/Algorithm.h`)
+
+### Template Parameters
+
+```cpp
+template <typename T_enc, typename T_dec>
+class Algorithm : public Object;
+```
+
+- **`T_enc`** (Encoder): the input data type. Can be a raw data type or a pointer to a data structure. When chaining algorithms, the upstream algorithm's `T_dec` must be compatible with this algorithm's `T_enc`.
+- **`T_dec`** (Decoder): the output data type. Produced by `Process()` and consumed by the next algorithm in the chain.
+
+### Core Interface
+
+| Method | Description |
+|--------|-------------|
+| `virtual T_dec Process(const T_enc& input) = 0` | Pure virtual. Implement the algorithm logic here. |
+| `T_dec operator()(const T_enc& input)` | Calls `Process()`. Enables functional syntax: `result = alg(data)`. |
+
+### Algorithm Chaining
+
+Algorithms can be linked in processing pipelines via encoder/decoder pointers:
+
+```cpp
+Algorithm* upstream;    // SetEncoder() / GetEncoder()
+Algorithm* downstream;  // SetDecoder() / GetDecoder()
+```
+
+This allows building chains like:
+
+```
+[RawData] --> AlgorithmA --> AlgorithmB --> [Result]
+               encoder         decoder
+```
+
+### Signals
+
+| Signal | Emitted when |
+|--------|-------------|
+| `Started()` | Processing begins. Emitting it is the caller's responsibility. |
+| `Finished()` | Processing completes. Emitting it is the caller's responsibility. |
+
+### Device Preference (CUDA)
+
+Algorithms report their preferred execution device via `GetPreferredDevice()`:
+
+| Method | Description |
+|--------|-------------|
+| `virtual MemoryDevice GetPreferredDevice() const` | Returns `RAM` or `VRAM`. Subclasses override. |
+| `void SetPreferredDevice(MemoryDevice dev)` | Manually set the device preference. |
+| `bool IsGPU() const` | Shorthand for `GetPreferredDevice() == VRAM`. |
+
+GPU-based algorithms are responsible for calling `cudaDeviceSynchronize()` inside their `Process()` implementation before returning, so that results are available to the caller or downstream algorithm.
+
+### Example: Defining a Custom Algorithm
+
+```cpp
+class MyFilter : public Algorithm<VoxImage<Voxel>*, VoxImage<Voxel>*> {
+public:
+    const char* GetClassName() const override { return "MyFilter"; }
+
+    VoxImage<Voxel>* Process(VoxImage<Voxel>* const& image) override {
+        // ... filter the image in-place ...
+        return image;
+    }
+};
+```
+
+## AlgorithmTask (`Core/Algorithm.h`)
+
+`AlgorithmTask` manages the execution of an `Algorithm` within a scheduled, threaded context. It inherits from `Thread` (`Core/Threads.h`) and uses `Mutex` (`Core/Monitor.h`) for synchronization.
+
+### Template Parameters
+
+```cpp
+template <typename T_enc, typename T_dec>
+class AlgorithmTask : public Thread;
+```
+
+The template arguments must match those of the `Algorithm<T_enc, T_dec>` that the task manages.
+
+### Configuration
+
+| Method | Description |
+|--------|-------------|
+| `void SetAlgorithm(AlgorithmType* alg)` | Set the algorithm to execute. |
+| `void SetMode(Mode mode)` | `Cyclic` or `Async`. |
+| `void SetCycleTime(int ms)` | Period for cyclic mode (milliseconds). |
+
+### Execution Modes
+
+#### Cyclic Mode
+
+The algorithm's `Process()` is called periodically. The cycle waits on a `condition_variable_any` with timeout, so `Stop()` can interrupt immediately without waiting for the full cycle.
+
+```cpp
+AlgorithmTask<int, int> task;
+task.SetAlgorithm(&myAlgorithm);
+task.SetMode(AlgorithmTask<int, int>::Cyclic);
+task.SetCycleTime(100);  // every 100ms
+task.Run(inputData);
+// ... later ...
+task.Stop();
+```
+
+#### Asynchronous Mode
+
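+The task thread blocks on a condition variable until `Notify()` is called, then runs `Process()` once and goes back to waiting. The trigger is tracked by a single flag (`m_Triggered`) that is cleared before `Process()` runs, so notifications that arrive before the thread wakes up are likely coalesced into one invocation.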
+
+```cpp
+task.SetMode(AlgorithmTask<int, int>::Async);
+task.Run(inputData);
+
+// Trigger manually:
+task.Notify();
+
+// Or connect to a signal:
+task.ConnectTrigger(sender, &SenderClass::DataReady);
+// Now each emission of DataReady() triggers one Process() call.
+```
+
+### Lifecycle
+
+| Method | Description |
+|--------|-------------|
+| `void Run(const T_enc& input)` | Starts the background thread with the given input. |
+| `void Stop()` | Requests stop and joins the thread. |
+| `bool IsRunning()` | Inherited from `Thread`. |
+
+### Signals
+
+| Signal | Emitted when |
+|--------|-------------|
+| `Stopped()` | The task thread has completed (after last `Process()` and before thread exit). |
+
+### Signal-Slot Triggering
+
+`ConnectTrigger()` connects any uLib `Object` signal to the task's `Notify()` method:
+
+```cpp
+task.ConnectTrigger(detector, &Detector::EventReady);
+```
+
+This uses the uLib signal system (`Core/Signal.h`), not Qt signals. The connection is type-safe and works with the `Object::connect` infrastructure.
+
+## VoxImageFilter (`Math/VoxImageFilter.h`)
+
+`VoxImageFilter` specializes `Algorithm` for kernel-based volumetric image filtering. It uses CRTP (Curiously Recurring Template Pattern) so that concrete filters provide their `Evaluate()` method without virtual dispatch overhead in the inner loop.
+
+### Template Parameters
+
+```cpp
+template <typename VoxelT, typename CrtpImplT>
+class VoxImageFilter : public Abstract::VoxImageFilter,
+                       public Algorithm<VoxImage<VoxelT>*, VoxImage<VoxelT>*>;
+```
+
+- **`VoxelT`**: the voxel data type (must satisfy `Interface::Voxel` — requires `.Value` and `.Count` fields).
+- **`CrtpImplT`**: the concrete filter subclass. Must implement:
+  ```cpp
+  float Evaluate(const VoxImage<VoxelT>& buffer, int index);
+  ```
+
+### How It Works
+
+1. `Process(image)` creates a read-only buffer copy of the input image.
+2. For each voxel in parallel (OpenMP), it calls `CrtpImplT::Evaluate(buffer, index)`.
+3. `Evaluate()` reads from the buffer using the kernel offsets and returns the filtered value, which `Process()` writes to that voxel's `.Value` in the image.
+4. The same image pointer is returned; the input image is modified in place.
+
+```
+Process(image)
+  |
+  +-- buffer = copy of image (read-only snapshot)
+  |
+  +-- #pragma omp parallel for
+  |     for each voxel i:
+  |       image[i].Value = CrtpImplT::Evaluate(buffer, i)
+  |
+  +-- return image
+```
+
+### Kernel System
+
+The `Kernel<VoxelT>` class stores convolution weights and precomputed index offsets:
+
+| Method | Description |
+|--------|-------------|
+| `SetKernelNumericXZY(values)` | Set kernel weights from a flat vector (XZY order). |
+| `SetKernelSpherical(shape)` | Set weights via a radial function `f(distance^2)`. |
+| `SetKernelWeightFunction(shape)` | Set weights via a 3D position function `f(Vector3f)`. |
+
+### CUDA Support
+
+Concrete filters can override `Process()` with a CUDA implementation:
+
+```cpp
+#if defined(USE_CUDA) && defined(__CUDACC__)
+VoxImage<VoxelT>* Process(VoxImage<VoxelT>* const& image) override {
+    if (this->GetPreferredDevice() == MemoryDevice::VRAM) {
+        // Launch the CUDA kernel, then call cudaDeviceSynchronize() before returning
+        return image;  // filtered in place
+    } else {
+        return BaseClass::Process(image);  // CPU fallback
+    }
+}
+#endif
+```
+
+The base class `GetPreferredDevice()` automatically returns `VRAM` when the image or kernel data resides on the GPU, enabling transparent device dispatch.
+
+Filters with CUDA implementations: `VoxFilterAlgorithmLinear`, `VoxFilterAlgorithmAbtrim`, `VoxFilterAlgorithmSPR`.
+
+### Concrete Filters
+
+| Filter | File | Description |
+|--------|------|-------------|
+| `VoxFilterAlgorithmLinear` | `VoxImageFilterLinear.hpp` | Weighted linear convolution (FIR filter). CUDA-enabled. |
+| `VoxFilterAlgorithmMedian` | `VoxImageFilterMedian.hpp` | Median filter with kernel-weighted sorting. |
+| `VoxFilterAlgorithmAbtrim` | `VoxImageFilterABTrim.hpp` | Alpha-beta trimmed mean filter. CUDA-enabled. |
+| `VoxFilterAlgorithmSPR` | `VoxImageFilterABTrim.hpp` | Robespierre filter: trimmed mean applied only to outlier voxels. CUDA-enabled. |
+| `VoxFilterAlgorithmThreshold` | `VoxImageFilterThreshold.hpp` | Binary threshold filter. |
+| `VoxFilterAlgorithmBilateral` | `VoxImageFilterBilateral.hpp` | Edge-preserving bilateral filter (intensity-weighted Gaussian). |
+| `VoxFilterAlgorithmBilateralTrim` | `VoxImageFilterBilateral.hpp` | Bilateral filter with alpha-beta trimming. |
+| `VoxFilterAlgorithm2ndStat` | `VoxImageFilter2ndStat.hpp` | Local variance (second-order statistic). |
+| `VoxFilterAlgorithmCustom` | `VoxImageFilterCustom.hpp` | User-supplied evaluation function via function pointer. |
+
+### Example: Using a Filter with AlgorithmTask
+
+```cpp
+// Create filter and configure kernel
+VoxFilterAlgorithmLinear<Voxel> filter(Vector3i(3, 3, 3));
+std::vector<float> weights(27, 1.0f);  // uniform 3x3x3
+filter.SetKernelNumericXZY(weights);
+
+// Direct use
+filter.SetImage(&image);
+filter.Run();
+
+// Or via Algorithm interface
+VoxImage<Voxel>* result = filter.Process(&image);
+
+// Or scheduled in a task
+AlgorithmTask<VoxImage<Voxel>*, VoxImage<Voxel>*> task;
+task.SetAlgorithm(&filter);
+task.SetMode(AlgorithmTask<VoxImage<Voxel>*, VoxImage<Voxel>*>::Cyclic);
+task.SetCycleTime(500);
+task.Run(&image);
+```
+
+## Structural Benefits
+
+### 1. Uniform Processing Interface
+
+Every algorithm — from a simple threshold to a GPU-accelerated convolution — exposes the same `Process(input) -> output` interface. Client code does not need to know the concrete type:
+
+```cpp
+Algorithm<VoxImage<Voxel>*, VoxImage<Voxel>*>* alg = &anyFilter;
+alg->Process(&image);
+```
+
+### 2. Pipeline Composition
+
+The encoder/decoder chaining allows building data processing pipelines where each stage transforms data and passes it to the next. Type safety is enforced at compile time through template parameters.
+
+### 3. Scheduled and Event-Driven Execution
+
+`AlgorithmTask` decouples the algorithm from its execution schedule. The same algorithm can be:
+- Called directly (`Process()`)
+- Run periodically (Cyclic mode for monitoring/acquisition)
+- Triggered by events (Async mode for reactive processing)
+
+### 4. Transparent CPU/GPU Dispatch
+
+The `MemoryDevice` preference and `GetPreferredDevice()` virtual allow the same algorithm interface to dispatch to CPU or GPU implementations. The `DataAllocator` transparently manages RAM/VRAM transfers, and concrete filters override `Process()` with CUDA kernels when data is on the GPU.
+
+### 5. Integration with the Object System
+
+Since `Algorithm` inherits from `Object`, algorithms gain:
+- **Properties**: serializable parameters via the `Property<T>` system, enabling persistent configuration and GUI widget generation.
+- **Signals**: `Started`/`Finished` notifications for connecting to monitoring or logging.
+- **Serialization**: save/load algorithm configuration via Boost archives.
+- **Instance naming**: `SetInstanceName()` for runtime identification in contexts.
+
+### 6. CRTP Performance for Inner Loops
+
+`VoxImageFilter` uses CRTP to dispatch to `Evaluate()` without virtual function overhead. The per-voxel evaluation runs at full speed inside OpenMP parallel loops, while the outer `Process()` method remains virtual for polymorphic use through the Algorithm interface.
+
+## Dependencies
+
+```
+Core/Object.h          — base class, properties, signals, serialization
+Core/Signal.h          — signal-slot connection infrastructure
+Core/Monitor.h         — Mutex, condition variables, ULIB_MUTEX_LOCK
+Core/Threads.h         — Thread base class for AlgorithmTask
+Core/DataAllocator.h   — MemoryDevice enum, RAM/VRAM data management
+Math/VoxImage.h        — volumetric image container
+Math/VoxImageFilter.h  — kernel-based filter framework
+```
+

-The algorithm in particular is defined as a template class on two types T_enc, T_dec. The encoder is a type for data input or another algorithm that is chained with this one that outputs data in the format compatible with input. The decoder is the type of data output or an algorithm compatible with it.
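The `Process(input) -> output` contract and the rule that an upstream `T_dec` must be compatible with the downstream `T_enc` can be illustrated with a minimal, self-contained sketch. The `Algorithm` template below is a simplified stand-in (no `Object` base, no signals, no encoder/decoder pointers), and `Halve`, `Label`, and `RunChain` are hypothetical names that do not exist in uLib.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for uLib's Algorithm<T_enc, T_dec> (assumption: the
// real class also derives from Object and carries chaining pointers).
template <typename T_enc, typename T_dec>
class Algorithm {
public:
    virtual ~Algorithm() = default;
    virtual T_dec Process(const T_enc& input) = 0;
    // Functional syntax: result = alg(data)
    T_dec operator()(const T_enc& input) { return Process(input); }
};

// Hypothetical stage A: int -> double
class Halve : public Algorithm<int, double> {
public:
    double Process(const int& input) override { return input / 2.0; }
};

// Hypothetical stage B: double -> std::string
class Label : public Algorithm<double, std::string> {
public:
    std::string Process(const double& input) override {
        return input > 1.0 ? "high" : "low";
    }
};

// Runs two stages back to back. The shared `Mid` parameter makes the call
// compile only when the first stage's T_dec equals the second stage's T_enc.
template <typename A_enc, typename Mid, typename B_dec>
B_dec RunChain(Algorithm<A_enc, Mid>& first,
               Algorithm<Mid, B_dec>& second,
               const A_enc& input) {
    return second(first(input));
}
```

With these definitions, `RunChain(halve, label, 5)` evaluates `label(halve(5))`, while swapping the two stages fails to compile because `std::string` is not `int`. This sketch requires an exact type match between stages; the documentation only asks for compatibility, so the real chaining mechanism may be more permissive.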
--- a/src/Core/Algorithm.h
+++ b/src/Core/Algorithm.h
@@ -30,14 +30,9 @@
 #include <chrono>
 #include <condition_variable>

-#ifdef USE_CUDA
-#include <cuda_runtime.h>
-#endif
-
 #include "Core/Object.h"
 #include "Core/Monitor.h"
 #include "Core/Threads.h"
-#include "Core/Property.h"
 #include "Core/DataAllocator.h"

 namespace uLib {
@@ -51,6 +46,10 @@ namespace uLib {
 * dynamically loaded as a plug-in. It derives from Object and supports
 * properties for serialization and interactive parameter widgets.
 *
+ * Algorithms are responsible for their own GPU synchronization: if Process()
+ * launches CUDA kernels, it must call cudaDeviceSynchronize() before returning
+ * so that the result is available to the caller or downstream algorithm.
+ *
 * @tparam T_enc Encoder type: the input data type, or a chained algorithm
 *               whose output is compatible with this algorithm's input.
 * @tparam T_dec Decoder type: the output data type, or a chained algorithm
@@ -72,37 +71,41 @@ public:

    virtual const char* GetClassName() const override { return "Algorithm"; }

+    // Processing ///////////////////////////////////////////////////////////////
+
    /**
     * @brief Process input data and produce output.
     * Override this in subclasses to implement the algorithm logic.
+     * GPU-based implementations must synchronize before returning.
     */
    virtual T_dec Process(const T_enc& input) = 0;

-    /**
-     * @brief Operator form of Process for functional chaining.
-     */
+    /** @brief Operator form of Process for functional chaining. */
    T_dec operator()(const T_enc& input) { return Process(input); }

+    // Chaining /////////////////////////////////////////////////////////////////
+
    void SetEncoder(Algorithm* enc) { m_Encoder = enc; }
    Algorithm* GetEncoder() const { return m_Encoder; }

    void SetDecoder(Algorithm* dec) { m_Decoder = dec; }
    Algorithm* GetDecoder() const { return m_Decoder; }

+    // Device preference ////////////////////////////////////////////////////////
+
    /**
     * @brief Returns the preferred memory device for this algorithm.
     * CUDA-capable algorithms should override to return VRAM when their
-     * data resides on the GPU. AlgorithmTask uses this to synchronize
-     * appropriately.
+     * data resides on the GPU.
     */
    virtual MemoryDevice GetPreferredDevice() const { return m_PreferredDevice; }
    void SetPreferredDevice(MemoryDevice dev) { m_PreferredDevice = dev; }

-    /**
-     * @brief Returns true if this algorithm prefers GPU execution.
-     */
+    /** @brief Returns true if this algorithm prefers GPU execution. */
    bool IsGPU() const { return GetPreferredDevice() == MemoryDevice::VRAM; }

+    // Signals //////////////////////////////////////////////////////////////////
+
 signals:
    virtual void Started() { ULIB_SIGNAL_EMIT(Algorithm::Started); }
    virtual void Finished() { ULIB_SIGNAL_EMIT(Algorithm::Finished); }
@@ -121,9 +124,13 @@ protected:
 /**
 * @brief AlgorithmTask manages the execution of an Algorithm within a
 * scheduled context. Uses uLib::Thread for execution and uLib::Mutex for
- * synchronization. Supports cyclic mode (with configurable period) and
- * asynchronous mode (triggered by Object signal-slot or condition variable
- * from Monitor.h).
+ * synchronization.
+ *
+ * Two execution modes:
+ *  - Cyclic:  executes Process() periodically with configurable cycle time.
+ *  - Async:   waits for Notify() or a connected signal before each execution.
+ *
+ * GPU synchronization is the algorithm's responsibility (see Algorithm::Process).
 */
 template <typename T_enc, typename T_dec>
 class AlgorithmTask : public Thread {
@@ -145,6 +152,8 @@ public:

    virtual const char* GetClassName() const override { return "AlgorithmTask"; }

+    // Configuration ////////////////////////////////////////////////////////////
+
    void SetAlgorithm(AlgorithmType* alg) { m_Algorithm = alg; }
    AlgorithmType* GetAlgorithm() const { return m_Algorithm; }

@@ -154,6 +163,8 @@ public:
    void SetCycleTime(int milliseconds) { m_CycleTime_ms = milliseconds; }
    int GetCycleTime() const { return m_CycleTime_ms; }

+    // Lifecycle ////////////////////////////////////////////////////////////////
+
    /**
     * @brief Start the task execution in a separate thread (via Thread::Start).
     * In Cyclic mode, the algorithm is executed periodically.
@@ -167,19 +178,17 @@ public:
        Start();
    }

-    /**
-     * @brief Stop the task execution and join the thread.
-     */
+    /** @brief Stop the task execution and join the thread. */
    void Stop() {
        m_StopRequested.store(true);
-        {
        ULIB_MUTEX_LOCK(m_WaitMutex, -1) {
            m_Condition.notify_all();
        }
-        }
        if (IsJoinable()) Join();
    }

+    // Async triggering /////////////////////////////////////////////////////////
+
    /**
     * @brief Notify the task to execute one iteration (Async mode).
     * Can be called from a signal-slot connection or externally.
@@ -200,36 +209,25 @@ public:
        return Object::connect(sender, sigf, [this]() { Notify(); });
    }

+    // Signals //////////////////////////////////////////////////////////////////
+
 signals:
    virtual void Stopped() { ULIB_SIGNAL_EMIT(AlgorithmTask::Stopped); }

 protected:
-    /**
-     * @brief Thread entry point — dispatches to cyclic or async loop.
-     */
+    /** @brief Thread entry point — dispatches to cyclic or async loop. */
    void Run() override {
-        if (m_Mode == Cyclic) {
+        if (m_Mode == Cyclic)
            RunCyclic();
-        } else {
+        else
            RunAsync();
-        }
        Stopped();
    }

 private:
-    void ExecuteAlgorithm() {
-        if (!m_Algorithm) return;
-        m_Algorithm->Process(m_Input);
-#ifdef USE_CUDA
-        if (m_Algorithm->IsGPU()) {
-            cudaDeviceSynchronize();
-        }
-#endif
-    }
-
    void RunCyclic() {
        while (!m_StopRequested.load()) {
-            ExecuteAlgorithm();
+            if (m_Algorithm) m_Algorithm->Process(m_Input);
            std::unique_lock<std::timed_mutex> lock(m_WaitMutex.GetNative());
            m_Condition.wait_for(lock,
                std::chrono::milliseconds(m_CycleTime_ms),
@@ -245,7 +243,7 @@ private:
            });
            if (m_StopRequested.load()) break;
            m_Triggered.store(false);
-            ExecuteAlgorithm();
+            if (m_Algorithm) m_Algorithm->Process(m_Input);
        }
    }

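The cyclic loop above waits on a condition variable with a timeout, so `Stop()` does not have to sit out the remainder of a cycle. Below is a minimal sketch of that pattern using only the standard library. It assumes a plain `std::mutex` and `std::condition_variable` in place of uLib's `Mutex` wrapper and `condition_variable_any`, and `CyclicWorker` with all its members is a hypothetical stand-in, not uLib API.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical, simplified stand-in for the cyclic loop of AlgorithmTask.
class CyclicWorker {
public:
    explicit CyclicWorker(int cycle_ms) : m_CycleTime_ms(cycle_ms) {
        assert(cycle_ms >= 0);
    }
    ~CyclicWorker() { Stop(); }

    void Start() {
        if (m_Thread.joinable()) return;  // already started
        m_StopRequested.store(false);
        m_Thread = std::thread([this] { Loop(); });
    }

    // Requests stop, wakes the waiting thread, and joins it.
    void Stop() {
        m_StopRequested.store(true);
        {
            // Taking the mutex before notifying closes the window between
            // the worker's predicate check and the start of its wait.
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Condition.notify_all();
        }
        if (m_Thread.joinable()) m_Thread.join();
    }

    int Iterations() const { return m_Iterations.load(); }

private:
    void Loop() {
        while (!m_StopRequested.load()) {
            m_Iterations.fetch_add(1);  // stands in for Process(input)
            std::unique_lock<std::mutex> lock(m_Mutex);
            // Sleeps for one cycle, but returns as soon as stop is requested.
            m_Condition.wait_for(lock,
                                 std::chrono::milliseconds(m_CycleTime_ms),
                                 [this] { return m_StopRequested.load(); });
        }
    }

    int m_CycleTime_ms;
    std::atomic<bool> m_StopRequested{false};
    std::atomic<int> m_Iterations{0};
    std::mutex m_Mutex;
    std::condition_variable m_Condition;
    std::thread m_Thread;
};
```

The predicate passed to `wait_for` also absorbs spurious wake-ups: the wait ends only on timeout or when the stop flag is set, so a worker with a one-minute cycle that is stopped after a few milliseconds has run at most one iteration.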
--- a/src/Math/VoxImageFilter.h
+++ b/src/Math/VoxImageFilter.h
@@ -34,6 +34,9 @@

 namespace uLib {

+////////////////////////////////////////////////////////////////////////////////
+// Kernel shape interface (static check for operator()(float) and operator()(Vector3f))
+
 namespace Interface {
 struct VoxImageFilterShape {
  template <class Self> void check_structural() {
@@ -43,30 +46,47 @@ struct VoxImageFilterShape {
 };
 } // namespace Interface

+////////////////////////////////////////////////////////////////////////////////
+// Forward declaration
+
 template <typename VoxelT> class Kernel;

+////////////////////////////////////////////////////////////////////////////////
+// Abstract interface (type-erased, used by python bindings)
+
 namespace Abstract {
 class VoxImageFilter {
 public:
  virtual void Run() = 0;
-
  virtual void SetImage(Abstract::VoxImage *image) = 0;
-
 protected:
  virtual ~VoxImageFilter() {}
 };
 } // namespace Abstract

-template <typename VoxelT, typename AlgorithmT>
+////////////////////////////////////////////////////////////////////////////////
+// VoxImageFilter — kernel-based voxel filter using CRTP + Algorithm
+//
+// Template parameters:
+//   VoxelT     — voxel data type (must satisfy Interface::Voxel)
+//   CrtpImplT  — concrete filter subclass (CRTP), must provide:
+//                 float Evaluate(const VoxImage<VoxelT>& buffer, int index)
+//
+// Inherits Algorithm<VoxImage<VoxelT>*, VoxImage<VoxelT>*> so that filters
+// can be used with AlgorithmTask for scheduled/async execution, and chained
+// via encoder/decoder.
+
+template <typename VoxelT, typename CrtpImplT>
 class VoxImageFilter : public Abstract::VoxImageFilter,
                       public Algorithm<VoxImage<VoxelT>*, VoxImage<VoxelT>*> {
-
 public:

-  virtual const char * GetClassName() const { return "VoxImageFilter"; }
+  virtual const char* GetClassName() const { return "VoxImageFilter"; }

  VoxImageFilter(const Vector3i &size);

+  // Algorithm interface ////////////////////////////////////////////////////////
+
  /**
   * @brief Process implements Algorithm::Process.
   * Applies the filter in-place on the input image and returns it.
@@ -79,9 +99,9 @@ public:
   */
  void Run();

-  /**
-   * @brief Returns VRAM if image or kernel data is on GPU, RAM otherwise.
-   */
+  // Device awareness ///////////////////////////////////////////////////////////
+
+  /** @brief Returns VRAM if image or kernel data is on GPU, RAM otherwise. */
  MemoryDevice GetPreferredDevice() const override {
    if (m_Image && m_Image->Data().GetDevice() == MemoryDevice::VRAM)
      return MemoryDevice::VRAM;
@@ -90,38 +110,31 @@ public:
    return MemoryDevice::RAM;
  }

+  // Kernel setup ///////////////////////////////////////////////////////////////
+
  void SetKernelNumericXZY(const std::vector<float> &numeric);
-
  void SetKernelSpherical(float (*shape)(float));
-
  template <class ShapeT> void SetKernelSpherical(ShapeT shape);
-
  void SetKernelWeightFunction(float (*shape)(const Vector3f &));
-
  template <class ShapeT> void SetKernelWeightFunction(ShapeT shape);

-  inline const Kernel<VoxelT> &GetKernelData() const {
-    return this->m_KernelData;
-  }
-  inline Kernel<VoxelT> &GetKernelData() { return this->m_KernelData; }
+  // Accessors //////////////////////////////////////////////////////////////////

-  inline VoxImage<VoxelT> *GetImage() const { return this->m_Image; }
+  const Kernel<VoxelT> &GetKernelData() const { return m_KernelData; }
+  Kernel<VoxelT> &GetKernelData() { return m_KernelData; }

+  VoxImage<VoxelT> *GetImage() const { return m_Image; }
  void SetImage(Abstract::VoxImage *image);

 protected:
-  float Convolve(const VoxImage<VoxelT> &buffer, int index); // remove //
-
  void SetKernelOffset();
-
  float Distance2(const Vector3i &v);

-  // protected members for algorithm access //
  Kernel<VoxelT> m_KernelData;
  VoxImage<VoxelT> *m_Image;

 private:
-  AlgorithmT *t_Algoritm;
+  CrtpImplT *m_CrtpImpl;
 };

 } // namespace uLib
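The CRTP arrangement in this header (the base class keeps a `CrtpImplT*` and calls `Evaluate()` on it) can be sketched in one dimension as follows. This is an illustration under simplifying assumptions: a `std::vector<float>` replaces `VoxImage<VoxelT>`, there is no kernel or OpenMP, and `FilterBase` and `BoxFilter3` are hypothetical names rather than uLib classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D stand-in for VoxImageFilter<VoxelT, CrtpImplT>.
template <typename CrtpImplT>
class FilterBase {
public:
    // Filters `image` in place, reading neighbours from an unmodified copy.
    std::vector<float>& Process(std::vector<float>& image) {
        const std::vector<float> buffer = image;  // read-only snapshot
        CrtpImplT* impl = static_cast<CrtpImplT*>(this);
        for (std::size_t i = 0; i < image.size(); ++i)
            image[i] = impl->Evaluate(buffer, i);  // bound at compile time
        return image;
    }
};

// Hypothetical concrete filter: 3-tap moving average with clamped borders.
class BoxFilter3 : public FilterBase<BoxFilter3> {
public:
    float Evaluate(const std::vector<float>& buffer, std::size_t i) const {
        assert(i < buffer.size());
        const std::size_t last = buffer.size() - 1;
        const float left  = buffer[i == 0 ? 0 : i - 1];
        const float right = buffer[i == last ? last : i + 1];
        return (left + buffer[i] + right) / 3.0f;
    }
};
```

Reading from the snapshot rather than from `image` matters: for the input `{0, 3, 0}` every output is `1`, whereas filtering without the copy would let the already-updated first element leak into the second. Because `Evaluate()` is a non-virtual call on the concrete type, the compiler is free to inline it into the loop.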
--- a/src/Math/VoxImageFilter.hpp
+++ b/src/Math/VoxImageFilter.hpp
@@ -33,7 +33,9 @@

 namespace uLib {

-// KERNEL //////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////////////
+//// KERNEL ////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////////////

 template <typename T> class Kernel : public StructuredData {
  typedef StructuredData BaseClass;
@@ -41,13 +43,12 @@ template <typename T> class Kernel : public StructuredData {
 public:
  Kernel(const Vector3i &size);

-  inline T &operator[](const Vector3i &id) { return m_Data[Map(id)]; }
-  inline T &operator[](const int &id) { return m_Data[id]; }
-  inline int GetCenterData() const;
+  T &operator[](const Vector3i &id) { return m_Data[Map(id)]; }
+  T &operator[](const int &id) { return m_Data[id]; }
+  int GetCenterData() const;

-  inline DataAllocator<T> &Data() { return this->m_Data; }
-
-  inline const DataAllocator<T> &ConstData() const { return this->m_Data; }
+  DataAllocator<T> &Data() { return m_Data; }
+  const DataAllocator<T> &ConstData() const { return m_Data; }

  void PrintSelf(std::ostream &o) const;

@@ -60,12 +61,14 @@ Kernel<T>::Kernel(const Vector3i &size) : BaseClass(size), m_Data(size.prod()) {
  Interface::IsA<T, Interface::Voxel>();
 }

-template <typename T> inline int Kernel<T>::GetCenterData() const {
+template <typename T>
+int Kernel<T>::GetCenterData() const {
  static int center = Map(this->GetDims() / 2);
  return center;
 }

-template <typename T> void Kernel<T>::PrintSelf(std::ostream &o) const {
+template <typename T>
+void Kernel<T>::PrintSelf(std::ostream &o) const {
  o << " Filter Kernel Dump [XZ_Y]: \n";
  Vector3i index;
  o << "\n Value: \n\n"
@@ -96,33 +99,42 @@ template <typename T> void Kernel<T>::PrintSelf(std::ostream &o) const {
  }
 }

+////////////////////////////////////////////////////////////////////////////////
+//// VOXIMAGEFILTER IMPLEMENTATION /////////////////////////////////////////////
 ////////////////////////////////////////////////////////////////////////////////

-#define _TPL_ template <typename VoxelT, typename AlgorithmT>
-#define _TPLT_ VoxelT, AlgorithmT
+template <typename VoxelT, typename CrtpImplT>
+VoxImageFilter<VoxelT, CrtpImplT>::VoxImageFilter(const Vector3i &size)
+    : m_KernelData(size)
+    , m_Image(nullptr)
+    , m_CrtpImpl(static_cast<CrtpImplT *>(this))
+{}

-_TPL_
-VoxImageFilter<_TPLT_>::VoxImageFilter(const Vector3i &size)
-    : m_KernelData(size), t_Algoritm(static_cast<AlgorithmT *>(this)) {}
-
-_TPL_
-VoxImage<VoxelT>* VoxImageFilter<_TPLT_>::Process(VoxImage<VoxelT>* const& image) {
+template <typename VoxelT, typename CrtpImplT>
+VoxImage<VoxelT>* VoxImageFilter<VoxelT, CrtpImplT>::Process(
+    VoxImage<VoxelT>* const& image) {
  if (m_Image != image) SetImage(image);
  VoxImage<VoxelT> buffer = *m_Image;
 #pragma omp parallel for
  for (int i = 0; i < m_Image->Data().size(); ++i)
-    m_Image->operator[](i).Value = this->t_Algoritm->Evaluate(buffer, i);
+    m_Image->operator[](i).Value = m_CrtpImpl->Evaluate(buffer, i);
 #pragma omp barrier
  return m_Image;
 }

-_TPL_
-void VoxImageFilter<_TPLT_>::Run() {
+template <typename VoxelT, typename CrtpImplT>
+void VoxImageFilter<VoxelT, CrtpImplT>::Run() {
  Process(m_Image);
 }

-_TPL_
-void VoxImageFilter<_TPLT_>::SetKernelOffset() {
+template <typename VoxelT, typename CrtpImplT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetImage(Abstract::VoxImage *image) {
+  m_Image = reinterpret_cast<VoxImage<VoxelT> *>(image);
+  SetKernelOffset();
+}
+
+template <typename VoxelT, typename CrtpImplT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetKernelOffset() {
  Vector3i id(0, 0, 0);
  for (int z = 0; z < m_KernelData.GetDims()(2); ++z) {
    for (int x = 0; x < m_KernelData.GetDims()(0); ++x) {
@@ -134,10 +146,10 @@ void VoxImageFilter<_TPLT_>::SetKernelOffset() {
  }
 }

-_TPL_
-float VoxImageFilter<_TPLT_>::Distance2(const Vector3i &v) {
+template <typename VoxelT, typename CrtpImplT>
+float VoxImageFilter<VoxelT, CrtpImplT>::Distance2(const Vector3i &v) {
  Vector3i tmp = v;
-  const Vector3i &dim = this->m_KernelData.GetDims();
+  const Vector3i &dim = m_KernelData.GetDims();
  Vector3i center = dim / 2;
  tmp = tmp - center;
  center = center.cwiseProduct(center);
@@ -147,12 +159,9 @@ float VoxImageFilter<_TPLT_>::Distance2(const Vector3i &v) {
                 0.25 * (3 - (dim(0) % 2) - (dim(1) % 2) - (dim(2) % 2)));
 }

-_TPL_
-void VoxImageFilter<_TPLT_>::SetKernelNumericXZY(
+template <typename VoxelT, typename CrtpImplT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetKernelNumericXZY(
    const std::vector<float> &numeric) {
-  // set data order //
-  StructuredData::Order order = m_KernelData.GetDataOrder();
-  // m_KernelData.SetDataOrder(StructuredData::XZY);
  Vector3i id;
  int index = 0;
  for (int y = 0; y < m_KernelData.GetDims()(1); ++y) {
@@ -163,38 +172,39 @@ void VoxImageFilter<_TPLT_>::SetKernelNumericXZY(
      }
    }
  }
-  // m_KernelData.SetDataOrder(order);
 }

-_TPL_
-void VoxImageFilter<_TPLT_>::SetKernelSpherical(float (*shape)(float)) {
+template <typename VoxelT, typename CrtpImplT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetKernelSpherical(
+    float (*shape)(float)) {
  Vector3i id;
  for (int y = 0; y < m_KernelData.GetDims()(1); ++y) {
    for (int z = 0; z < m_KernelData.GetDims()(2); ++z) {
      for (int x = 0; x < m_KernelData.GetDims()(0); ++x) {
        id << x, y, z;
-        m_KernelData[id].Value = shape(this->Distance2(id));
+        m_KernelData[id].Value = shape(Distance2(id));
      }
    }
  }
 }

-_TPL_ template <class ShapeT>
-void VoxImageFilter<_TPLT_>::SetKernelSpherical(ShapeT shape) {
+template <typename VoxelT, typename CrtpImplT>
+template <class ShapeT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetKernelSpherical(ShapeT shape) {
  Interface::IsA<ShapeT, Interface::VoxImageFilterShape>();
  Vector3i id;
  for (int y = 0; y < m_KernelData.GetDims()(1); ++y) {
    for (int z = 0; z < m_KernelData.GetDims()(2); ++z) {
      for (int x = 0; x < m_KernelData.GetDims()(0); ++x) {
        id << x, y, z;
-        m_KernelData[id].Value = shape(this->Distance2(id));
+        m_KernelData[id].Value = shape(Distance2(id));
      }
    }
  }
 }

-_TPL_
-void VoxImageFilter<_TPLT_>::SetKernelWeightFunction(
+template <typename VoxelT, typename CrtpImplT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetKernelWeightFunction(
    float (*shape)(const Vector3f &)) {
  const Vector3i &dim = m_KernelData.GetDims();
  Vector3i id;
@@ -202,20 +212,19 @@ void VoxImageFilter<_TPLT_>::SetKernelWeightFunction(
  for (int y = 0; y < dim(1); ++y) {
    for (int z = 0; z < dim(2); ++z) {
      for (int x = 0; x < dim(0); ++x) {
-        // get voxels centroid coords from kernel center //
        id << x, y, z;
        pt << id(0) - dim(0) / 2 + 0.5 * !(dim(0) % 2),
            id(1) - dim(1) / 2 + 0.5 * !(dim(1) % 2),
            id(2) - dim(2) / 2 + 0.5 * !(dim(2) % 2);
-        // compute function using given shape //
        m_KernelData[id].Value = shape(pt);
      }
    }
  }
 }

-_TPL_ template <class ShapeT>
-void VoxImageFilter<_TPLT_>::SetKernelWeightFunction(ShapeT shape) {
+template <typename VoxelT, typename CrtpImplT>
+template <class ShapeT>
+void VoxImageFilter<VoxelT, CrtpImplT>::SetKernelWeightFunction(ShapeT shape) {
  Interface::IsA<ShapeT, Interface::VoxImageFilterShape>();
  const Vector3i &dim = m_KernelData.GetDims();
  Vector3i id;
@@ -223,45 +232,16 @@ void VoxImageFilter<_TPLT_>::SetKernelWeightFunction(ShapeT shape) {
  for (int y = 0; y < dim(1); ++y) {
    for (int z = 0; z < dim(2); ++z) {
      for (int x = 0; x < dim(0); ++x) {
-        // get voxels centroid coords from kernel center //
        id << x, y, z;
        pt << id(0) - dim(0) / 2 + 0.5 * !(dim(0) % 2),
            id(1) - dim(1) / 2 + 0.5 * !(dim(1) % 2),
            id(2) - dim(2) / 2 + 0.5 * !(dim(2) % 2);
-        // compute function using given shape //
        m_KernelData[id].Value = shape(pt);
      }
    }
  }
 }

-_TPL_
-void VoxImageFilter<_TPLT_>::SetImage(Abstract::VoxImage *image) {
-  this->m_Image = reinterpret_cast<VoxImage<VoxelT> *>(image);
-  this->SetKernelOffset();
-}
-
-_TPL_
-float VoxImageFilter<_TPLT_>::Convolve(const VoxImage<VoxelT> &buffer,
-                                       int index) {
-  const DataAllocator<VoxelT> &vbuf = buffer.ConstData();
-  const DataAllocator<VoxelT> &vker = m_KernelData.ConstData();
-  int vox_size = vbuf.size();
-  int ker_size = vker.size();
-  int pos;
-  float conv = 0, ksum = 0;
-  for (int ik = 0; ik < ker_size; ++ik) {
-    pos = index + vker[ik].Count - vker[m_KernelData.GetCenterData()].Count;
-    pos = (pos + vox_size) % vox_size;
-    conv += vbuf[pos].Value * vker[ik].Value;
-    ksum += vker[ik].Value;
-  }
-  return conv / ksum;
-}
-
-#undef _TPLT_
-#undef _TPL_
-
 } // namespace uLib

 #endif // VOXIMAGEFILTER_HPP
--- a/src/Math/VoxImageFilterCustom.hpp
+++ b/src/Math/VoxImageFilterCustom.hpp
@@ -30,8 +30,6 @@
 #include "VoxImageFilter.h"
 #include <Math/Dense.h>

-#define likely(expr) __builtin_expect(!!(expr), 1)
-
 ////////////////////////////////////////////////////////////////////////////////
 /////  VOXIMAGE FILTER CUSTOM  /////////////////////////////////////////////////
 ////////////////////////////////////////////////////////////////////////////////
@@ -50,7 +48,7 @@ public:
      : BaseClass(size), m_CustomEvaluate(NULL) {}

  float Evaluate(const VoxImage<VoxelT> &buffer, int index) {
-    if (likely(m_CustomEvaluate)) {
+    if (m_CustomEvaluate) {
      const DataAllocator<VoxelT> &vbuf = buffer.ConstData();
      const DataAllocator<VoxelT> &vker = this->m_KernelData.ConstData();
      int vox_size = vbuf.size();
--- a/src/Math/VoxImageFilterThreshold.hpp
+++ b/src/Math/VoxImageFilterThreshold.hpp
@@ -23,8 +23,6 @@

 //////////////////////////////////////////////////////////////////////////////*/

-
-
 #ifndef VOXIMAGEFILTERTHRESHOLD_HPP
 #define VOXIMAGEFILTERTHRESHOLD_HPP

@@ -39,40 +37,24 @@
 namespace uLib {

 template <typename VoxelT>
-class VoxFilterAlgorithmThreshold :
-        public VoxImageFilter<VoxelT, VoxFilterAlgorithmThreshold<VoxelT> > {
+class VoxFilterAlgorithmThreshold
+    : public VoxImageFilter<VoxelT, VoxFilterAlgorithmThreshold<VoxelT>> {

-    typedef VoxImageFilter<VoxelT, VoxFilterAlgorithmThreshold<VoxelT> > BaseClass;
-    //    ULIB_OBJECT_PARAMETERS(BaseClass) {
-    //        float threshold;
-    //    };
+  typedef VoxImageFilter<VoxelT, VoxFilterAlgorithmThreshold<VoxelT>> BaseClass;

  float m_threshold;

 public:
-    VoxFilterAlgorithmThreshold(const Vector3i &size) : BaseClass(size)
-    {
-        //        init_parameters();
-        m_threshold = 0;
+  VoxFilterAlgorithmThreshold(const Vector3i &size)
+      : BaseClass(size), m_threshold(0) {}
+
+  void SetThreshold(float th) { m_threshold = th; }
+
+  float Evaluate(const VoxImage<VoxelT> &buffer, int index) {
+    return static_cast<float>(buffer.ConstData().at(index).Value >= m_threshold);
  }
-
-    inline void SetThreshold(float th) { m_threshold = th; }
-
-    float Evaluate(const VoxImage<VoxelT> &buffer, int index)
-    {
-        return static_cast<float>(buffer.ConstData().at(index).Value >=
-                                  //                                  parameters().threshold);
-                                  m_threshold );
-    }
-
 };

-//template <typename VoxelT>
-//inline void VoxFilterAlgorithmThreshold<VoxelT>::init_parameters()
-//{
-//    parameters().threshold = 0;
-//}
-
-}
+} // namespace uLib

 #endif // VOXIMAGEFILTERTHRESHOLD_HPP
--- a/src/Math/testing/AlgorithmCudaChainTest.cpp
+++ b/src/Math/testing/AlgorithmCudaChainTest.cpp
@@ -0,0 +1,408 @@
+/*//////////////////////////////////////////////////////////////////////////////
+// CMT Cosmic Muon Tomography project //////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////////////
+
+  Copyright (c) 2014, Universita' degli Studi di Padova, INFN sez. di Padova
+  All rights reserved
+
+  Authors: Andrea Rigoni Garola < andrea.rigoni@pd.infn.it >
+
+  ------------------------------------------------------------------
+  This library is free software;  you  can  redistribute  it  and/or
+  modify it  under the  terms  of  the  GNU  Lesser  General  Public
+  License as published  by  the  Free  Software  Foundation;  either
+  version 3.0 of the License, or (at your option) any later version.
+
+  This library is  distributed in  the hope that it will  be useful,
+  but  WITHOUT ANY WARRANTY;  without  even  the implied warranty of
+  MERCHANTABILITY  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+  Lesser General Public License for more details.
+
+  You should have received a copy of  the GNU Lesser General  Public
+  License along with this library.
+
+//////////////////////////////////////////////////////////////////////////////*/
+
+#include "testing-prototype.h"
+
+#include "Core/Algorithm.h"
+#include "Math/VoxImage.h"
+#include "Math/VoxImageFilter.h"
+
+#include <iostream>
+#include <thread>
+#include <chrono>
+
+using namespace uLib;
+
+struct TestVoxel {
+  Scalarf Value;
+  unsigned int Count;
+};
+
+int main() {
+  BEGIN_TESTING(AlgorithmCudaChain);
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 1: Single filter — GetPreferredDevice reflects data location
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 1: GetPreferredDevice reflects data location ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(10, 10, 10));
+    image[Vector3i(5, 5, 5)].Value = 1;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filter(Vector3i(3, 3, 3));
+    std::vector<float> weights(27, 1.0f);
+    filter.SetImage(&image);
+    filter.SetKernelNumericXZY(weights);
+
+    // Before VRAM move: should prefer RAM
+    TEST1(filter.GetPreferredDevice() == MemoryDevice::RAM);
+    TEST1(!filter.IsGPU());
+    std::cout << "  RAM mode: PreferredDevice=RAM, IsGPU=false  OK\n";
+
+    // Move image data to VRAM
+    image.Data().MoveToVRAM();
+
+    // After VRAM move: should prefer VRAM
+    TEST1(filter.GetPreferredDevice() == MemoryDevice::VRAM);
+    TEST1(filter.IsGPU());
+    std::cout << "  VRAM mode: PreferredDevice=VRAM, IsGPU=true  OK\n";
+
+    // Move back to RAM
+    image.Data().MoveToRAM();
+    TEST1(filter.GetPreferredDevice() == MemoryDevice::RAM);
+    std::cout << "  Back to RAM: PreferredDevice=RAM  OK\n";
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 2: Kernel data on VRAM also triggers GPU preference
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 2: Kernel on VRAM triggers GPU preference ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(8, 8, 8));
+    VoxFilterAlgorithmLinear<TestVoxel> filter(Vector3i(3, 3, 3));
+    std::vector<float> weights(27, 1.0f);
+    filter.SetImage(&image);
+    filter.SetKernelNumericXZY(weights);
+
+    TEST1(filter.GetPreferredDevice() == MemoryDevice::RAM);
+
+    // Only kernel on VRAM
+    filter.GetKernelData().Data().MoveToVRAM();
+    TEST1(filter.GetPreferredDevice() == MemoryDevice::VRAM);
+    std::cout << "  Kernel on VRAM: PreferredDevice=VRAM  OK\n";
+
+    filter.GetKernelData().Data().MoveToRAM();
+    TEST1(filter.GetPreferredDevice() == MemoryDevice::RAM);
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 3: Algorithm interface — Process through base pointer
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 3: Process through Algorithm base pointer ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(10, 10, 10));
+    image[Vector3i(5, 5, 5)].Value = 10;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filter(Vector3i(3, 3, 3));
+    std::vector<float> weights(27, 1.0f);
+    filter.SetImage(&image);
+    filter.SetKernelNumericXZY(weights);
+
+    // Use through Algorithm base class pointer
+    Algorithm<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*>* alg = &filter;
+
+    VoxImage<TestVoxel>* result = alg->Process(&image);
+    TEST1(result == &image);
+    std::cout << "  Process through base pointer returned correct image  OK\n";
+
+    // Verify filter actually ran (center voxel should be averaged)
+    // With uniform 3x3x3 kernel and single non-zero voxel at center,
+    // the center value should be 10/27 ≈ 0.37
+    TEST1(image[Vector3i(5, 5, 5)].Value < 10.0f);
+    std::cout << "  Filter modified voxel values  OK\n";
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 4: Encoder/decoder chain — two filters linked
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 4: Encoder/decoder chain ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(10, 10, 10));
+    image[Vector3i(5, 5, 5)].Value = 100;
+
+    // First filter: linear smoothing
+    VoxFilterAlgorithmLinear<TestVoxel> filter1(Vector3i(3, 3, 3));
+    std::vector<float> weights1(27, 1.0f);
+    filter1.SetImage(&image);
+    filter1.SetKernelNumericXZY(weights1);
+
+    // Second filter: threshold
+    VoxFilterAlgorithmThreshold<TestVoxel> filter2(Vector3i(1, 1, 1));
+    filter2.SetThreshold(0.5f);
+    filter2.SetImage(&image);
+    // 1x1x1 kernel with value 1
+    std::vector<float> weights2(1, 1.0f);
+    filter2.SetKernelNumericXZY(weights2);
+
+    // Chain: filter1 → filter2
+    filter1.SetDecoder(&filter2);
+    filter2.SetEncoder(&filter1);
+
+    TEST1(filter1.GetDecoder() == &filter2);
+    TEST1(filter2.GetEncoder() == &filter1);
+    std::cout << "  Chain linked: filter1 -> filter2  OK\n";
+
+    // Execute chain manually (encoder first, then decoder)
+    filter1.Process(&image);
+    float smoothed_center = image[Vector3i(5, 5, 5)].Value;
+    std::cout << "  After linear: center = " << smoothed_center << "\n";
+
+    filter2.Process(&image);
+    float thresholded_center = image[Vector3i(5, 5, 5)].Value;
+    std::cout << "  After threshold: center = " << thresholded_center << "\n";
+
+    // After threshold, values should be 0 or 1
+    TEST1(thresholded_center == 0.0f || thresholded_center == 1.0f);
+    std::cout << "  Chain execution produced valid results  OK\n";
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 5: CUDA chain — VRAM data through chained filters
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 5: VRAM data through chained filters ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(10, 10, 10));
+    image[Vector3i(5, 5, 5)].Value = 50;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filter1(Vector3i(3, 3, 3));
+    std::vector<float> weights1(27, 1.0f);
+    filter1.SetImage(&image);
+    filter1.SetKernelNumericXZY(weights1);
+
+    VoxFilterAlgorithmAbtrim<TestVoxel> filter2(Vector3i(3, 3, 3));
+    std::vector<float> weights2(27, 1.0f);
+    filter2.SetImage(&image);
+    filter2.SetKernelNumericXZY(weights2);
+    filter2.SetABTrim(1, 1);
+
+    // Chain
+    filter1.SetDecoder(&filter2);
+    filter2.SetEncoder(&filter1);
+
+    // Move data to VRAM
+    image.Data().MoveToVRAM();
+    filter1.GetKernelData().Data().MoveToVRAM();
+    filter2.GetKernelData().Data().MoveToVRAM();
+
+    // Both filters should report VRAM preference
+    TEST1(filter1.GetPreferredDevice() == MemoryDevice::VRAM);
+    TEST1(filter2.GetPreferredDevice() == MemoryDevice::VRAM);
+    TEST1(filter1.IsGPU());
+    TEST1(filter2.IsGPU());
+    std::cout << "  Both filters detect VRAM preference  OK\n";
+
+    // Verify the chain's device consistency
+    auto* encoder = filter2.GetEncoder();
+    TEST1(encoder != nullptr);
+    TEST1(encoder->IsGPU());
+    std::cout << "  Encoder in chain also reports GPU  OK\n";
+
+#ifdef USE_CUDA
+    // With CUDA: filters execute on GPU via Process()
+    image.Data().MoveToRAM(); // reset for clean test
+    image[Vector3i(5, 5, 5)].Value = 50;
+    image.Data().MoveToVRAM();
+
+    filter1.Process(&image);
+    TEST1(image.Data().GetDevice() == MemoryDevice::VRAM);
+    std::cout << "  CUDA: data stays in VRAM after filter1  OK\n";
+
+    filter2.Process(&image);
+    TEST1(image.Data().GetDevice() == MemoryDevice::VRAM);
+    std::cout << "  CUDA: data stays in VRAM after filter2  OK\n";
+#else
+    // Without CUDA: verify Process still works via CPU fallback
+    image.Data().MoveToRAM();
+    image[Vector3i(5, 5, 5)].Value = 50;
+
+    filter1.GetKernelData().Data().MoveToRAM();
+    filter2.GetKernelData().Data().MoveToRAM();
+
+    filter1.Process(&image);
+    filter2.Process(&image);
+    std::cout << "  No CUDA: CPU fallback executed correctly  OK\n";
+#endif
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 6: AlgorithmTask with VRAM-aware filter
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 6: AlgorithmTask with VRAM-aware filter ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(8, 8, 8));
+    image[Vector3i(4, 4, 4)].Value = 20;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filter(Vector3i(3, 3, 3));
+    std::vector<float> weights(27, 1.0f);
+    filter.SetImage(&image);
+    filter.SetKernelNumericXZY(weights);
+
+    // Set up task
+    AlgorithmTask<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*> task;
+    task.SetAlgorithm(&filter);
+    task.SetMode(AlgorithmTask<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*>::Cyclic);
+    task.SetCycleTime(50);
+
+    // Run task for a few cycles
+    task.Run(&image);
+    std::this_thread::sleep_for(std::chrono::milliseconds(200));
+    task.Stop();
+
+    // After cyclic execution, the filter should have smoothed values
+    TEST1(image[Vector3i(4, 4, 4)].Value < 20.0f);
+    std::cout << "  Task cyclic execution modified image  OK\n";
+    std::cout << "  Center value after smoothing: "
+              << image[Vector3i(4, 4, 4)].Value << "\n";
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 7: AlgorithmTask async with a single filter
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 7: AlgorithmTask async with filter ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(8, 8, 8));
+    image[Vector3i(4, 4, 4)].Value = 30;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filter(Vector3i(3, 3, 3));
+    std::vector<float> weights(27, 1.0f);
+    filter.SetImage(&image);
+    filter.SetKernelNumericXZY(weights);
+
+    AlgorithmTask<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*> task;
+    task.SetAlgorithm(&filter);
+    task.SetMode(AlgorithmTask<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*>::Async);
+
+    float before = image[Vector3i(4, 4, 4)].Value;
+
+    task.Run(&image);
+
+    // Trigger one execution
+    task.Notify();
+    std::this_thread::sleep_for(std::chrono::milliseconds(100));
+
+    task.Stop();
+
+    float after = image[Vector3i(4, 4, 4)].Value;
+    TEST1(after < before);
+    std::cout << "  Async trigger: value " << before << " -> " << after << "  OK\n";
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 8: Device preference propagation in chain
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 8: Device preference propagation check ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(8, 8, 8));
+    image[Vector3i(4, 4, 4)].Value = 10;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filterA(Vector3i(3, 3, 3));
+    VoxFilterAlgorithmAbtrim<TestVoxel> filterB(Vector3i(3, 3, 3));
+    VoxFilterAlgorithmThreshold<TestVoxel> filterC(Vector3i(1, 1, 1));
+
+    std::vector<float> w27(27, 1.0f);
+    std::vector<float> w1(1, 1.0f);
+
+    filterA.SetImage(&image);
+    filterA.SetKernelNumericXZY(w27);
+    filterB.SetImage(&image);
+    filterB.SetKernelNumericXZY(w27);
+    filterB.SetABTrim(1, 1);
+    filterC.SetImage(&image);
+    filterC.SetKernelNumericXZY(w1);
+    filterC.SetThreshold(0.1f);
+
+    // Chain: A → B → C
+    filterA.SetDecoder(&filterB);
+    filterB.SetEncoder(&filterA);
+    filterB.SetDecoder(&filterC);
+    filterC.SetEncoder(&filterB);
+
+    // All on RAM
+    TEST1(!filterA.IsGPU());
+    TEST1(!filterB.IsGPU());
+    TEST1(!filterC.IsGPU());
+    std::cout << "  All filters on RAM  OK\n";
+
+    // Move image to VRAM: all three filters should detect it
+    image.Data().MoveToVRAM();
+    TEST1(filterA.IsGPU());
+    TEST1(filterB.IsGPU());
+    // filterC with 1x1x1 kernel doesn't have CUDA override, but still detects VRAM
+    TEST1(filterC.IsGPU());
+    std::cout << "  Image on VRAM: all filters report GPU  OK\n";
+
+    // Walk the chain and check device consistency
+    auto* step = static_cast<Algorithm<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*>*>(&filterA);
+    bool all_gpu = true;
+    while (step) {
+      if (!step->IsGPU()) all_gpu = false;
+      step = static_cast<Algorithm<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*>*>(step->GetDecoder());
+    }
+    TEST1(all_gpu);
+    std::cout << "  Chain walk: all steps report GPU  OK\n";
+
+    image.Data().MoveToRAM();
+  }
+
+  ////////////////////////////////////////////////////////////////////////////
+  // TEST 9: Process through chain with Algorithm interface
+  ////////////////////////////////////////////////////////////////////////////
+  {
+    std::cout << "\n--- Test 9: Sequential chain processing via Algorithm interface ---\n";
+
+    VoxImage<TestVoxel> image(Vector3i(10, 10, 10));
+    // Set a pattern: single bright voxel
+    image[Vector3i(5, 5, 5)].Value = 100;
+
+    VoxFilterAlgorithmLinear<TestVoxel> filterA(Vector3i(3, 3, 3));
+    std::vector<float> w(27, 1.0f);
+    filterA.SetImage(&image);
+    filterA.SetKernelNumericXZY(w);
+
+    VoxFilterAlgorithmLinear<TestVoxel> filterB(Vector3i(3, 3, 3));
+    filterB.SetImage(&image);
+    filterB.SetKernelNumericXZY(w);
+
+    // Chain
+    filterA.SetDecoder(&filterB);
+    filterB.SetEncoder(&filterA);
+
+    // Process chain through base pointer
+    using AlgType = Algorithm<VoxImage<TestVoxel>*, VoxImage<TestVoxel>*>;
+    AlgType* chain = &filterA;
+
+    // Walk and process
+    AlgType* current = chain;
+    while (current) {
+      current->Process(&image);
+      current = static_cast<AlgType*>(current->GetDecoder());
+    }
+
+    // After two rounds of smoothing, the peak should be smaller than the original
+    float final_val = image[Vector3i(5, 5, 5)].Value;
+    TEST1(final_val < 100.0f);
+    std::cout << "  Two-stage smoothing: peak = " << final_val << "  OK\n";
+  }
+
+  END_TESTING;
+}
--- a/src/Math/testing/CMakeLists.txt
+++ b/src/Math/testing/CMakeLists.txt
@@ -16,6 +16,7 @@ set(TESTS
        QuadMeshTest
        BitCodeTest
        UnitsTest
+        AlgorithmCudaChainTest
 )

 set(LIBRARIES
@@ -28,6 +29,6 @@ set(LIBRARIES
 uLib_add_tests(Math)

 if(USE_CUDA)
-    set_source_files_properties(VoxImageTest.cpp VoxImageCopyTest.cpp VoxImageFilterTest.cpp VoxRaytracerTest.cpp VoxRaytracerTestExtended.cpp PROPERTIES LANGUAGE CUDA)
+    set_source_files_properties(VoxImageTest.cpp VoxImageCopyTest.cpp VoxImageFilterTest.cpp VoxRaytracerTest.cpp VoxRaytracerTestExtended.cpp AlgorithmCudaChainTest.cpp PROPERTIES LANGUAGE CUDA)
    set_source_files_properties(VoxRaytracerTest.cpp VoxRaytracerTestExtended.cpp PROPERTIES CXX_STANDARD 17 CUDA_STANDARD 17)
 endif()