4 min read

Building GBDT Inference on Encrypted Data with CKKS

Building GBDT

The Challenge: Privacy-Preserving Ensembles

Gradient Boosting Decision Trees (GBDT) are a workhorse of modern machine learning, but deploying them in a “Machine Learning as a Service” (MLaaS) model creates a privacy deadlock: clients are unwilling to disclose sensitive features, and providers must protect their proprietary models.

Homomorphic Encryption (HE) offers a way out, allowing a service provider to compute on encrypted data. However, there is a persistent gap between academic HE papers and production-minded implementations. Papers tend to compress the hard parts—mapping branching logic to arithmetic, controlling multiplicative depth, and bridging ML stacks with HE runtimes—into a few sentences.

In this guide, we’ll walk through the architecture of building a GBDT inference engine on encrypted data using the CKKS (Cheon-Kim-Kim-Song) scheme, based on the foundational approach proposed by Xiao et al. (2019).

This is the first in a series of posts on privacy-preserving machine learning. You can find the reference implementation at github.com/chuckchen/ckks-gbdt.

Summary of Results

Before diving into the architecture, here is how the implementation performed compared to the plaintext baseline on held-out test slices:

DatasetMetricPlaintext (Baseline)Encrypted (Reproduction)Gap
CreditMicro-F10.73440.7188-0.0156
MNISTMicro-F11.00000.8750*-0.1250

*Note: MNIST encrypted result is from the HE-style surrogate (polynomialized approximation) used to verify the routing logic without the high computational cost of a full CKKS run on multiclass ensembles.

1. The Core Idea: Polynomialization

The central problem in HE is that it only supports addition and multiplication. In a standard GBDT, a split at an internal node vv is a discrete choice:

Branch(xj,sv)={leftif xjsvright  otherwise\text{Branch}(x_j, s_v) = \begin{cases} \text{left} & \text{if } x_j \le s_v \\ \text{right} &\; \text{otherwise} \end{cases}

To evaluate this under CKKS, we must transform this control flow into a continuous arithmetic circuit.

Approximating the Comparison

We first replace the discrete indicator with a “soft” routing variable ava_v. Ideally, av=sign(svxj)a_v = \text{sign}(s_v - x_j), but since the sign function is non-continuous, we approximate it using a parameterized sigmoid function:

σ(x;t)=11+exp(tx)\sigma(x; t) = \frac{1}{1 + \exp(-t \cdot x)}

where tt is a steepness factor (typically 128). This gives us avσ(svxj;t)a_v \approx \sigma(s_v - x_j; t), where av1a_v \to 1 if the condition is met and av0a_v \to 0 otherwise.

The Decision Tree Polynomial

Theorem 1 in the Xiao et al. paper guarantees that any decision tree TT can be uniquely represented as a polynomial P(T)P(T). For a node vv with left child TLT_L and right child TRT_R, the output is defined recursively:

P(T)=avP(TL)+(1av)P(TR)P(T) = a_v \cdot P(T_L) + (1 - a_v) \cdot P(T_R)

By unrolling this recursion, the final score of a tree is the sum of contributions from every leaf LL:

P(T)=LLeavesValue(L)vPath(L)cvP(T) = \sum_{L \in \text{Leaves}} \text{Value}(L) \cdot \prod_{v \in \text{Path}(L)} \mathbf{c}_v

where the condition term cv\mathbf{c}_v is ava_v if the path goes left at node vv, and (1av)(1 - a_v) if it goes right. This transformation turns the discrete tree into a sum-of-products circuit that CKKS can evaluate.

Why Chebyshev Polynomials?

To evaluate the sigmoid function homomorphically, we use a degree-24 Chebyshev polynomial. Unlike Taylor series, which are local approximations, Chebyshev polynomials minimize the maximum error (LL_\infty norm) over an entire interval (typically [1,1][-1, 1]). This ensures the approximation is stable even for values far from the threshold, preventing incorrect routing in deeper layers.

2. Optimizing for Scalability: O(logn)O(\log n) Depth

A naive recursive evaluation of a tree polynomial leads to a multiplicative depth that grows linearly with the tree’s depth. In FHE, multiplicative depth is our most expensive resource.

Xiao et al. demonstrate that we can reduce this to O(logn)O(\log n) using a Balanced Product Tree strategy. Instead of multiplying path terms sequentially:

(a1a2)a3ad(a_1 \cdot a_2) \cdot a_3 \dots \cdot a_d

We pair them up and perform parallel multiplications:

(a1a2)(a3a4)(a_1 \cdot a_2) \cdot (a_3 \cdot a_4) \dots

This strategy is what makes the circuit depth manageable, allowing us to support deeper ensembles without exhausting the CKKS modulus chain.

3. Systems Optimization: Vertical Packing

Efficiency in CKKS comes from SIMD (Single Instruction, Multiple Data) operations. Most ML implementations are row-oriented (one sample, all features). To build an efficient encrypted runtime, we must think “column-oriented”.

In Vertical Packing, we pack the same feature across many different samples into a single ciphertext.

  • The Benefit: When a node splits on feature jj, we only load one ciphertext (ctjct_j).
  • The Scale: A single homomorphic operation now processes thousands of samples simultaneously (e.g., 8,192 slots).
  • Decoupled Latency: Batch latency becomes almost entirely dependent on model complexity, rather than batch size (up to the slot limit).

4. Implementation Architecture

Building a production-minded HE system requires a clean boundary between the ML stack and the cryptographic circuit.

Layer 1: The Pre-processing Pipeline (Python)

We use lightgbm to train the model, then export it into a flat, library-agnostic contract (model.txt).

  • Folding Shrinkage: We multiply the learning rate (shrinkage) directly into the leaf values at export time. This saves one encrypted multiplication and rescale step per tree—a critical optimization that often determines success in deep circuits.

Layer 2: The Homomorphic Runtime (C++)

Built on the original HEAAN library, the runtime loads the tree bundle and executes the O(logn)O(\log n) circuit using the Paterson-Stockmeyer algorithm to minimize non-scalar multiplications.

5. Performance and Error Bounds

Xiao et al. provided a critical theoretical bound for the absolute error e(G)|e(G)|:

e(G)i=1nGaiδ(ai)|e(G)| \le \sum_{i=1}^n \left| \frac{\partial G}{\partial a_i} \right| \cdot \delta(a_i)

Because GBDT is an ensemble of many trees, the partial derivatives Gai\frac{\partial G}{\partial a_i} are generally small, helping the overall error remain bounded even as the number of features nn grows.

Debugging the Arithmetic Budget

The hardest bugs were not typical C++ bugs (like segfaults) but HE arithmetic failures. For example, a division by zero in _ntl_gdiv failure from NTL is often a symptom of modulus exhaustion. In FHE, “cryptographic budgeting” replaces “FLOP counting”—the fix rarely involves changing the logic; it involves resizing CKKS parameters or reordering the path-product tree to save a level of depth.

Closing Thought

The most satisfying part of building this system was watching the abstractions of the paper survive contact with real-world constraints—file formats, build systems, and modulus budgets. Bridging the gap between theory and implementation is where the most useful engineering insights are found.

In the next post, we will explore the MNIST Challenge: scaling this architecture to multi-class ensembles and handling the exponential increase in multiplicative depth requirements.