1. Department of Information Engineering, University of Padova, Padova (PD), Italy
2. Human Inspired Technology Research Centre, University of Padova, Padova (PD), Italy
Proceedings of the 42nd International Conference on Machine Learning (ICML), Vancouver, Canada. PMLR 267, 2025.
@inproceedings{SinigagliaSartorSusto2025,
author = {Davide Sartor and Alberto Sinigaglia and Gian Antonio Susto},
title = {Advancing Constrained Monotonic Neural Networks: Achieving Universal Approximation Beyond Bounded Activations},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
year = {2025},
}
Conventional techniques for imposing monotonicity in MLPs by construction rely on non-negative weight constraints and bounded activation functions, which pose well-known optimization challenges. In this work, we generalize previous theoretical results, showing that MLPs with non-negative weight constraints and activations that saturate on alternating sides are universal approximators for monotonic functions. Additionally, we show an equivalence between the saturation side of the activations and the sign of the weight constraint. This connection allows us to prove that MLPs with convex monotone activations and non-positive constrained weights also qualify as universal approximators, in contrast to their non-negative constrained counterparts. Our results provide theoretical grounding for the empirical effectiveness observed in previous works, while suggesting possible architectural simplifications. Moreover, to further alleviate the optimization difficulties, we propose an alternative formulation that allows the network to adjust its activations according to the sign of the weights. This eliminates the requirement for weight reparameterization, easing initialization and improving training stability. Experimental evaluation reinforces the validity of the theoretical results, showing that our novel approach compares favourably to traditional monotonic architectures.
Monotone MLPs with weight constraints and bounded activations (pink) and our proposed approach based on \(\text{ReLU}\) (blue). The former can only represent bounded functions and thus cannot extrapolate the data trend, which is important in many domains such as time-series analysis and predictive maintenance.
A function \( \sigma: \mathbb{R} \rightarrow \mathbb{R} \) is right-saturating if \(
\lim_{x \rightarrow +\infty} \sigma(x) \in \mathbb{R} \), and left-saturating if \(
\lim_{x \rightarrow -\infty} \sigma(x) \in \mathbb{R} \).
We denote the sets of right- and left-saturating activations by \( \mathcal{S}^+ \) and \(
\mathcal{S}^- \), respectively.
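For example, \(\text{ReLU} \in \mathcal{S}^-\), since \(\lim_{x \rightarrow -\infty} \text{ReLU}(x) = 0 \in \mathbb{R}\) while \(\lim_{x \rightarrow +\infty} \text{ReLU}(x) = +\infty\); bounded activations such as \(\tanh\) and the logistic sigmoid belong to both \(\mathcal{S}^+\) and \(\mathcal{S}^-\).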
For any MLP with non-negative weights and activation \( \sigma(x) \), and for any \( a \in \mathbb{R}_+, b \in \mathbb{R} \), there exists an equivalent MLP with non-negative weights and activation \( a\sigma(x) + b \): the scale \( a \) and offset \( b \) can be absorbed into the weights and biases of the following layer, and dividing by \( a > 0 \) preserves non-negativity.
Consequently, we assume without loss of generality that all activations saturate to zero.
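A minimal numerical sketch of this rescaling argument (our illustration; the one-hidden-layer network, \(\tanh\), and all values are arbitrary choices):

import torch

torch.manual_seed(0)
d, h = 3, 8
W1, b1 = torch.rand(h, d), torch.randn(h)   # non-negative hidden weights
W2, b2 = torch.rand(1, h), torch.randn(1)   # non-negative output weights
a, b = 2.5, -0.7                            # rescaled activation a*sigma(x) + b, with a > 0
sigma = torch.tanh
x = torch.randn(16, d)

# original network with activation sigma
y_ref = sigma(x @ W1.T + b1) @ W2.T + b2

# equivalent network with activation a*sigma + b: fold 1/a into W2 (still non-negative)
# and subtract the constant (b/a) * W2 @ 1 from the output bias
W2_new, b2_new = W2 / a, b2 - (b / a) * W2.sum(dim=1)
y_new = (a * sigma(x @ W1.T + b1) + b) @ W2_new.T + b2_new

print(torch.allclose(y_ref, y_new, atol=1e-6))  # True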
An MLP \( g_\theta: \mathbb{R}^d \rightarrow \mathbb{R} \) with non-negative weights and three hidden layers can interpolate any monotonic non-decreasing function \( f(x) \) on a finite set of \( n \) points, provided that the activations are monotonic non-decreasing and alternate saturation sides.
Let \( \alpha \in \mathbb{R}_+^d \), \( \beta \in \mathbb{R}^d \). Define: \[ A^+ = \{ x : \alpha^\top (x - \beta) > 0 \}, \quad A^- = \{ x : \alpha^\top (x - \beta) < 0 \} \] Then the \( i \)-th neuron of the first hidden layer of an MLP with non-negative weights can approximate: \[ h^{(1)}_i(x) \approx \begin{cases} \sigma^{(1)}(+\infty), & x \in A^+ \\ \sigma^{(1)}(-\infty), & x \in A^- \\ \sigma^{(1)}(0), & \text{otherwise} \end{cases} \]
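A small numerical sketch of this saturation effect (our illustration; \(\tanh\) is just one admissible saturating activation, and the vectors and gain are arbitrary):

import torch

alpha = torch.tensor([1.0, 2.0])   # alpha in R_+^d
beta = torch.tensor([0.5, -1.0])
scale = 1e4                        # a large gain pushes the activation into saturation

def neuron(x, sigma=torch.tanh):
    # non-negative weights w = scale * alpha, bias = -scale * alpha @ beta
    return sigma(scale * ((x - beta) @ alpha))

print(neuron(torch.tensor([[2.0, 0.0]])))    # alpha^T (x - beta) > 0  ->  sigma(+inf) = +1
print(neuron(torch.tensor([[-2.0, -2.0]])))  # alpha^T (x - beta) < 0  ->  sigma(-inf) = -1
print(neuron(beta.unsqueeze(0)))             # alpha^T (x - beta) = 0  ->  sigma(0)    =  0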
Consider \( A = \bigcap_{i=1}^n A_i \), where each \( A_i \subset \mathbb{R}^d \). Let \( \gamma \) be in the image of \( \sigma^{(k)} \). Then a single neuron \( h^{(k)}_j \) in the \( k \)-th layer can approximate \[ h^{(k)}_j(x) \approx \gamma \cdot \mathbf{1}_A(x), \] provided the preceding layers approximate the indicators \( \mathbf{1}_{A_i} \) and the saturation sides alternate as in the main theorem (see the paper for the precise conditions).
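As a concrete two-set illustration of this intersection mechanism (our sketch, with sigmoid activations and hand-picked half-spaces, not the paper's construction), a second-layer neuron with non-negative weights fires only when both first-layer indicators are active:

import torch

c = 1e3                                                       # large gain pushes both layers into saturation
a1, a2 = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])   # A_1 = {x : x_1 > 0}, A_2 = {x : x_2 > 0}

def indicator_intersection(x):
    z1 = torch.sigmoid(c * (x @ a1))       # approx. 1_{A_1}(x)
    z2 = torch.sigmoid(c * (x @ a2))       # approx. 1_{A_2}(x)
    # second-layer weights (c, c) >= 0 and bias -1.5c: output is approx. 1 only if z1 and z2 are both approx. 1
    return torch.sigmoid(c * z1 + c * z2 - 1.5 * c)

print(indicator_intersection(torch.tensor([[1.0, 1.0]])))    # approx. 1 (inside A_1 ∩ A_2)
print(indicator_intersection(torch.tensor([[1.0, -1.0]])))   # approx. 0 (outside)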
Let \( x_1, \dots, x_n \) be the ordered points with \( f(x_i) \le f(x_j) \) for \( i < j \). The proof proceeds layer-wise: the lemmas above are used to build approximate indicators of the regions separating consecutive points, which are then summed with non-negative coefficients so that the network matches \( f \) at every \( x_i \).
This completes the proof of universal approximation under alternating saturation. For a fully formal proof, please refer to the paper.
An MLP composed of \(\text{ReLU}(x) = \max(0, x)\) activations and non-negative weights is not a universal
approximator for monotonic functions.
This is because the composition of convex non-decreasing functions is itself convex and non-decreasing:
\(\text{ReLU}\) is convex and non-decreasing, and an affine map \( x \mapsto |W|x + b \) with non-negative
weights is convex and non-decreasing in each component. Such an MLP can therefore only represent convex
non-decreasing functions.
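As a concrete counterexample (our illustration): the monotone but concave function \( f(x) = \min(x, 1) \) cannot be uniformly approximated on \([0, 2]\) by convex functions. Any convex \( g \) satisfies \( g(1) \le \tfrac{1}{2}\big(g(0) + g(2)\big) \), so \( \lVert g - f \rVert_\infty \le \varepsilon \) would imply \( 1 - \varepsilon \le \tfrac{1}{2}(\varepsilon + 1 + \varepsilon) \), i.e. \( \varepsilon \ge \tfrac{1}{4} \).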
Considering an MLP \(f\) with non-positive weights, written as
\[f(x) = \dots -|W_2|\,\text{ReLU}(-|W_1| x + b_1) + b_2 \dots \]
we can group the two negative signs surrounding each ReLU with the activation itself, obtaining a new activation
\(\text{ReLU}'(x) = -\text{ReLU}(-x) = \min(0, x)\), which is a concave non-decreasing function that saturates
on the right, whereas \(\text{ReLU}\) is convex and saturates on the left.
In particular, since two linear layers are needed to create each \(\text{ReLU}'\), the flip occurs at alternating
activations, so the effective activations alternate saturation sides.
This means that such an MLP can be reduced to the case proven in the main theorem, showing that it is a
universal approximator for monotonic functions.
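A quick numerical check of this grouping identity (our illustration, not code from the paper):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
W1, W2 = torch.rand(4, 3), torch.rand(2, 4)   # |W1|, |W2| >= 0
b1 = torch.randn(4)
x = torch.randn(8, 3)

relu = torch.relu

def relu_prime(u):
    # ReLU'(u) = -ReLU(-u) = min(u, 0): concave, right-saturating
    return -relu(-u)

# non-positive-weight layers: -|W2| ReLU(-|W1| x + b1)
lhs = F.linear(relu(F.linear(x, -W1, b1)), -W2)
# same map with non-negative weights and the flipped activation ReLU'
rhs = F.linear(relu_prime(F.linear(x, W1, -b1)), W2)
print(torch.allclose(lhs, rhs))               # True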
When employing weight-constrained monotonic MLPs, the choice of activation saturation sides remains a
non-trivial hyperparameter.
However, it is possible to remove this requirement, and to relax the weight constraints entirely, by
reordering the computational steps.
Consider a single-layer transformation \( f(x) = \sigma(|W|x + b) \), where absolute weights enforce
non-negativity. We can decompose this affine transformation as:
\[
|W|x + b = W^+ x - W^- x + b,
\]
where \( W^+ = \max(W, 0) \) and \( W^- = \min(W, 0) \).
Applying the activation function independently to each affine component yields the following
parameterization:
\[
\hat{f}(x) = \sigma(W^+ x + b) - \sigma(W^- x + b).
\]
We refer to this as the pre-activation switch.
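A minimal PyTorch sketch of a layer implementing the pre-activation switch (our illustration; the class name and the \(\tanh\) default are ours, not the paper's code):

import torch
from torch import nn
import torch.nn.functional as F

class PreActivationSwitchLinear(nn.Linear):
    def __init__(self, in_features, out_features, activation=torch.tanh, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.activation = activation

    def forward(self, x):
        # W+ = max(W, 0) and W- = min(W, 0), so that W+ - W- = |W|
        w_pos = self.weight.clamp(min=0.0)
        w_neg = self.weight.clamp(max=0.0)
        # sigma(W+ x + b) - sigma(W- x + b): both terms are non-decreasing in x
        return self.activation(F.linear(x, w_pos, self.bias)) - self.activation(
            F.linear(x, w_neg, self.bias)
        )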
Any function representable by an affine transformation with non-negative weights followed by either \( \sigma \) or its reflection \( \sigma'(x) = -\sigma(-x) \) can also be represented using the pre-activation switch formulation, up to an additive constant.
Proof sketch: if all entries of \( W \) share the same sign, one of the two terms collapses to a constant. For instance, when \( W \ge 0 \) we have \( W^- = 0 \), so \( \sigma(W^- x + b) = \sigma(b) \) and \( \hat{f}(x) = \sigma(Wx + b) - \sigma(b) \).
The residual constant can be absorbed by the bias of the next layer.
This demonstrates that monotonic MLPs composed of at least 4 such layers are universal approximators. Moreover, the formulation is strictly more expressive than the weight-constrained variant, as the latter is a special case.
By reversing this logic from the final layer, we obtain an alternative formulation, referred to as the post-activation switch: \[ \hat{f}(x) = W^+ \sigma(x) + W^- \sigma(-x) + b. \]
The method adds only one extra matrix multiplication per layer and leverages existing GPU-parallelized infrastructure. In practice, no significant computational overhead was observed in the tested architectures.
import torch
from torch import nn
import torch.nn.functional as F

class MonotonicLinear(nn.Linear):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = True,
        device=None,
        dtype=None,
        pre_activation=nn.Identity(),
    ):
        super().__init__(
            in_features, out_features, bias=bias, device=device, dtype=dtype
        )
        self.act = pre_activation

    def forward(self, x):
        # Split the unconstrained weights by sign: W = W+ + W-.
        w_pos = self.weight.clamp(min=0.0)
        w_neg = self.weight.clamp(max=0.0)
        # Post-activation switch: W+ sigma(x) + W- sigma(-x) + b.
        x_pos = F.linear(self.act(x), w_pos, self.bias)
        x_neg = F.linear(self.act(-x), w_neg, None)  # bias is added once, in x_pos
        return x_pos + x_neg

# N is the input dimensionality of the data.
monotonic_mlp = nn.Sequential(
    MonotonicLinear(N, 16, pre_activation=nn.Identity()),
    MonotonicLinear(16, 16, pre_activation=nn.SELU()),
    MonotonicLinear(16, 16, pre_activation=nn.SELU()),
    MonotonicLinear(16, 1, pre_activation=nn.SELU()),
)
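As a quick sanity check of the resulting model (our addition; the batch size, bump size, and feature index are arbitrary), the output should be non-decreasing in every input coordinate:

x = torch.randn(128, N)
delta = torch.zeros_like(x)
delta[:, 0] = 0.1                                   # positive bump on the first feature
gap = monotonic_mlp(x + delta) - monotonic_mlp(x)
print(gap.min())                                    # non-negative, up to floating-point error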