Architecture

Hybrid C/C++ Architecture

Combining raw C performance with modern C++17 flexibility for optimal results

Design Philosophy

UltraBalancer uses a hybrid architecture: performance-critical paths are written in C, while the more complex control logic is written in C++17. The C hot path keeps per-request overhead low, and the C++ layer keeps the codebase maintainable and extensible.

Performance First
Critical paths written in C with zero-copy networking and lock-free data structures
Modular Design
Clean separation between frontend, core engine, and backend components
Production Ready
Battle-tested reliability with comprehensive error handling and graceful degradation

Component Architecture

Three-tier architecture for optimal performance and maintainability

Frontend Layer (C)
High-performance client connection handling and protocol detection

Listener Manager

  • Socket creation and binding
  • SO_REUSEPORT for multi-core
  • TCP/UDP protocol support
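
To make the SO_REUSEPORT point concrete, here is a minimal, hedged sketch of how such a listener could be created; the function name and error handling are illustrative, not UltraBalancer's actual API:

    /* Illustrative sketch: per-worker listener setup with SO_REUSEPORT */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Create a non-blocking TCP listener. Each worker opens its own socket
     * on the same port; with SO_REUSEPORT the kernel spreads incoming
     * connections across all of them. Returns the fd, or -1 on error. */
    int listener_create(uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        if (fd < 0)
            return -1;

        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, SOMAXCONN) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }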

Connection Acceptor

  • epoll/kqueue event loop
  • Non-blocking I/O
  • Connection rate limiting
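
The acceptor pattern above can be sketched roughly as follows; this assumes an edge-triggered epoll instance and omits the per-source rate limiting a real acceptor would apply:

    /* Illustrative sketch: draining a non-blocking listener from an epoll loop */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    /* Accept every pending connection on listen_fd and register each new
     * client with the worker's epoll instance (edge-triggered). Returns the
     * number of connections accepted, or -1 on a hard error. */
    int acceptor_drain(int epfd, int listen_fd)
    {
        int accepted = 0;
        for (;;) {
            int client = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
            if (client < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;                      /* backlog drained */
                return -1;
            }
            struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = client };
            epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
            accepted++;
        }
        return accepted;
    }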

Protocol Detector

  • HTTP/WebSocket detection
  • SSL/TLS handshake
  • Protocol upgrade handling
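
One common way to detect the protocol before committing to a handler is to peek at the first bytes of the stream. The sketch below is an illustration of that idea, not the project's actual detector:

    /* Illustrative sketch: peek-based protocol detection */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    enum proto { PROTO_UNKNOWN, PROTO_TLS, PROTO_HTTP };

    /* Peek at the first bytes without consuming them. A TLS ClientHello
     * begins with record type 0x16; a plaintext HTTP request begins with
     * an ASCII method name. */
    enum proto protocol_detect(int fd)
    {
        unsigned char buf[8];
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_PEEK);
        if (n < 1)
            return PROTO_UNKNOWN;
        if (buf[0] == 0x16)
            return PROTO_TLS;
        if (n >= 4 && (memcmp(buf, "GET ", 4) == 0 || memcmp(buf, "POST", 4) == 0 ||
                       memcmp(buf, "HEAD", 4) == 0 || memcmp(buf, "PUT ", 4) == 0))
            return PROTO_HTTP;
        return PROTO_UNKNOWN;
    }
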
Core Engine (C/C++)
Request routing, load balancing algorithms, and connection management

Connection Pool

  • Backend connection reuse
  • Idle connection cleanup
  • Connection limits per backend
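
As a rough illustration of the pooling idea (sizes and field names are hypothetical), a per-backend pool can be little more than a bounded stack of idle file descriptors plus a timestamp for the idle-cleanup pass:

    /* Illustrative sketch: bounded per-backend connection pool */
    #include <time.h>
    #include <unistd.h>

    #define POOL_MAX 64                 /* hypothetical per-backend limit */

    struct pooled_conn {
        int    fd;
        time_t idle_since;              /* consulted by the idle-cleanup pass */
    };

    struct conn_pool {
        struct pooled_conn idle[POOL_MAX];
        int count;
    };

    /* Reuse an idle backend connection if one exists; -1 means the caller
     * must open a fresh connection. */
    int pool_acquire(struct conn_pool *p)
    {
        return p->count > 0 ? p->idle[--p->count].fd : -1;
    }

    /* Hand a connection back for reuse, or close it if the pool is full. */
    void pool_release(struct conn_pool *p, int fd)
    {
        if (p->count < POOL_MAX) {
            p->idle[p->count].fd = fd;
            p->idle[p->count].idle_since = time(NULL);
            p->count++;
        } else {
            close(fd);
        }
    }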

Request Router

  • Load balancing algorithms
  • Session persistence
  • Request retry logic
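
For a concrete sense of the routing step, the sketch below implements smooth weighted round-robin, one classic algorithm a router like this can use; the struct layout is illustrative:

    /* Illustrative sketch: smooth weighted round-robin selection */
    #include <stddef.h>

    struct backend {
        const char *name;
        int weight;                     /* configured weight */
        int current_weight;             /* running score, starts at 0 */
    };

    /* Pick the backend with the highest running score, then subtract the
     * total weight from it so selections spread smoothly over time. */
    struct backend *wrr_pick(struct backend *b, int n)
    {
        struct backend *best = NULL;
        int total = 0;
        for (int i = 0; i < n; i++) {
            b[i].current_weight += b[i].weight;
            total += b[i].weight;
            if (!best || b[i].current_weight > best->current_weight)
                best = &b[i];
        }
        if (best)
            best->current_weight -= total;
        return best;
    }

With weights 5, 1, 1 the picks come out as A A B A C A A, so the heaviest backend gets its share without long consecutive bursts.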

Metrics Aggregator

  • Real-time statistics
  • Prometheus metrics export
  • Performance counters
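
A minimal sketch of the counter-plus-export idea: workers update relaxed atomics on the hot path and a separate thread renders the Prometheus text format. The metric names are hypothetical:

    /* Illustrative sketch: atomic counters exported in Prometheus text format */
    #include <stdatomic.h>
    #include <stdio.h>

    struct lb_metrics {
        atomic_ulong requests_total;
        atomic_ulong active_connections;
    };

    /* Hot path: workers bump counters with relaxed atomics, no locks. */
    void metrics_on_request(struct lb_metrics *m)
    {
        atomic_fetch_add_explicit(&m->requests_total, 1, memory_order_relaxed);
    }

    /* Metrics thread: render the current values in Prometheus exposition
     * format (metric names here are hypothetical). */
    int metrics_render(struct lb_metrics *m, char *buf, size_t len)
    {
        return snprintf(buf, len,
            "# TYPE ultrabalancer_requests_total counter\n"
            "ultrabalancer_requests_total %lu\n"
            "# TYPE ultrabalancer_active_connections gauge\n"
            "ultrabalancer_active_connections %lu\n",
            atomic_load_explicit(&m->requests_total, memory_order_relaxed),
            atomic_load_explicit(&m->active_connections, memory_order_relaxed));
    }
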
Backend Layer (C)
Backend server management, health monitoring, and failover

Server Manager

  • Backend server registry
  • Dynamic server addition/removal
  • Weight and priority management
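
The registry itself can be sketched as a growable array of weighted entries; the types below are illustrative rather than the project's real data structures:

    /* Illustrative sketch: dynamic backend registry with weight and priority */
    #include <stdlib.h>
    #include <string.h>

    struct backend_entry {
        char host[64];
        unsigned short port;
        int weight;                     /* relative share of traffic */
        int priority;                   /* selection tier; semantics up to the router */
        int enabled;
    };

    struct backend_registry {
        struct backend_entry *servers;
        size_t count, cap;
    };

    /* Add a backend at runtime; returns 0 on success, -1 on allocation failure. */
    int registry_add(struct backend_registry *r, const char *host,
                     unsigned short port, int weight, int priority)
    {
        if (r->count == r->cap) {
            size_t cap = r->cap ? r->cap * 2 : 8;
            struct backend_entry *p = realloc(r->servers, cap * sizeof(*p));
            if (!p)
                return -1;
            r->servers = p;
            r->cap = cap;
        }
        struct backend_entry *e = &r->servers[r->count++];
        strncpy(e->host, host, sizeof(e->host) - 1);
        e->host[sizeof(e->host) - 1] = '\0';
        e->port = port;
        e->weight = weight;
        e->priority = priority;
        e->enabled = 1;
        return 0;
    }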

Health Checker

  • Active health probes
  • Passive failure detection
  • Automatic failover
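
An active probe in its simplest form is a bounded-time TCP connect, as sketched below; a production checker would usually add a layer-7 request (for example an HTTP health endpoint) on top of this:

    /* Illustrative sketch: active TCP health probe with a timeout */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Returns 1 if a non-blocking connect() to ip:port completes within
     * timeout_ms, 0 otherwise. */
    int backend_alive(const char *ip, unsigned short port, int timeout_ms)
    {
        int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        if (fd < 0)
            return 0;

        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, ip, &addr.sin_addr);

        int ok = 0;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            ok = 1;                                   /* connected immediately */
        } else if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, timeout_ms) == 1) {
                int err = 0;
                socklen_t len = sizeof(err);
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
                ok = (err == 0);
            }
        }
        close(fd);
        return ok;
    }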

Session Stickiness

  • Cookie-based affinity
  • IP-based persistence
  • Session table management
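
IP-based persistence can be as small as a stable hash of the client address, as in this sketch; cookie affinity and the session table follow the same idea with a different key:

    /* Illustrative sketch: IP-based persistence via a stable hash */
    #include <stdint.h>

    /* FNV-1a over the IPv4 address: the same client always maps to the same
     * backend index while the backend count is unchanged. Real deployments
     * typically layer consistent hashing or a session table on top so that
     * adding a backend does not remap every client. */
    unsigned ip_hash_pick(uint32_t client_ip, unsigned backend_count)
    {
        uint32_t h = 2166136261u;
        for (int i = 0; i < 4; i++) {
            h ^= (client_ip >> (i * 8)) & 0xff;
            h *= 16777619u;
        }
        return h % backend_count;
    }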

Performance Optimizations

Advanced techniques for maximum throughput and minimal latency

Lock-Free Data Structures

Critical data structures use atomic operations and compare-and-swap (CAS) instead of mutexes, eliminating lock contention and enabling true parallelism across CPU cores.

  • Lock-free ring buffers for request queues
  • Atomic counters for statistics
  • RCU for read-heavy data structures
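
As one example of the pattern, the sketch below shows a single-producer/single-consumer ring built only from C11 atomics; a multi-producer variant would replace the plain stores with CAS loops:

    /* Illustrative sketch: lock-free single-producer/single-consumer ring */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                    /* must be a power of two */

    struct spsc_ring {
        void *slots[RING_SIZE];
        _Atomic size_t head;                  /* advanced only by the producer */
        _Atomic size_t tail;                  /* advanced only by the consumer */
    };

    /* Producer: publish an item without taking a lock. */
    bool ring_push(struct spsc_ring *r, void *item)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                     /* full */
        r->slots[head & (RING_SIZE - 1)] = item;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer: take the next item, or NULL if the ring is empty. */
    void *ring_pop(struct spsc_ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                      /* empty */
        void *item = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return item;
    }
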
Zero-Copy Networking

Data is transferred directly between network buffers and application memory without intermediate copies, reducing CPU overhead and memory bandwidth consumption.

  • splice() system call for TCP proxy
  • sendfile() for static content
  • Direct I/O for large transfers
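
The splice() path mentioned above can be sketched as two calls through a per-connection pipe; a complete proxy would loop until the pipe is drained and handle partial writes:

    /* Illustrative sketch: zero-copy TCP forwarding with splice() */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move up to len bytes from one socket to another through a pipe
     * (created once per connection with pipe2(pipefd, O_NONBLOCK)) without
     * copying payload into userspace. Returns bytes forwarded toward to_fd,
     * or 0/-1 on EOF/error. */
    ssize_t proxy_splice(int from_fd, int to_fd, int pipefd[2], size_t len)
    {
        ssize_t in = splice(from_fd, NULL, pipefd[1], NULL, len,
                            SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (in <= 0)
            return in;
        return splice(pipefd[0], NULL, to_fd, NULL, (size_t)in,
                      SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    }
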
NUMA-Aware Memory

Memory is allocated on the same NUMA node as the CPU processing the data, minimizing cross-node memory access latency on multi-socket systems.

  • Per-core memory pools
  • CPU affinity for worker threads
  • NUMA-local connection handling
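
In practice this combines CPU affinity with node-local allocation, roughly as in the sketch below (using libnuma; the helper name is illustrative):

    /* Illustrative sketch: pin a worker and allocate its pool NUMA-locally */
    #define _GNU_SOURCE
    #include <numa.h>                         /* link with -lnuma */
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    /* Pin the calling worker thread to one CPU, then allocate its buffer
     * pool on that CPU's NUMA node so hot-path accesses stay node-local.
     * Free the returned region with numa_free(). */
    void *worker_local_pool(int cpu, size_t pool_bytes)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        int node = numa_node_of_cpu(cpu);
        return numa_alloc_onnode(pool_bytes, node);
    }
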
Kernel Bypass (Optional)

Optional DPDK integration moves packet processing to userspace, bypassing the kernel network stack for extreme performance scenarios.

  • Userspace packet processing
  • Poll mode drivers (PMD)
  • Huge page support for memory
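
The heart of a DPDK data path is a poll-mode receive loop like the sketch below; EAL, mempool, and port initialization are assumed to have been done already and are omitted for brevity:

    /* Illustrative sketch: DPDK poll-mode receive loop.
     * EAL, mempool, and port setup (rte_eal_init, rte_eth_dev_configure,
     * rte_eth_rx_queue_setup, rte_eth_dev_start) are assumed done elsewhere. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    void rx_poll_loop(uint16_t port_id)
    {
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
            /* Poll-mode driver: no interrupts, just pull bursts of packets. */
            uint16_t n = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... hand the packet to the userspace protocol path ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
    }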

Threading Model

Multi-threaded architecture optimized for modern multi-core processors

Worker Thread Architecture
One worker thread per CPU core for optimal parallelism

Design Principles

  • Shared-Nothing: Each worker has its own event loop and connection pool
  • CPU Affinity: Workers pinned to specific cores to maximize cache locality
  • SO_REUSEPORT: Kernel load balances incoming connections across workers
  • Minimal Synchronization: Lock-free communication between threads
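
Putting these principles together, each worker pins itself to a core and owns its epoll instance and SO_REUSEPORT listener. The sketch below is illustrative and reuses the hypothetical listener_create() helper from the Frontend Layer section:

    /* Illustrative sketch: shared-nothing worker startup */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    /* SO_REUSEPORT listener helper as sketched in the Frontend Layer section
     * (hypothetical name). */
    int listener_create(uint16_t port);

    struct worker {
        int cpu;                        /* core this worker is pinned to */
        int epfd;                       /* private event loop, never shared */
        int listen_fd;                  /* private SO_REUSEPORT listener */
    };

    static void *worker_main(void *arg)
    {
        struct worker *w = arg;

        /* CPU affinity: keep this worker's data hot in one core's caches. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(w->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Shared-nothing: each worker owns its epoll instance and listener;
         * the kernel spreads new connections across the SO_REUSEPORT sockets. */
        w->epfd = epoll_create1(0);
        w->listen_fd = listener_create(8080);   /* hypothetical listen port */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = w->listen_fd };
        epoll_ctl(w->epfd, EPOLL_CTL_ADD, w->listen_fd, &ev);

        struct epoll_event events[256];
        for (;;) {
            int n = epoll_wait(w->epfd, events, 256, -1);
            for (int i = 0; i < n; i++) {
                /* ... accept new clients or service existing connections ... */
            }
        }
        return NULL;
    }

    void spawn_workers(int ncpus)
    {
        for (int cpu = 0; cpu < ncpus; cpu++) {
            struct worker *w = calloc(1, sizeof(*w));
            w->cpu = cpu;
            pthread_t tid;
            pthread_create(&tid, NULL, worker_main, w);
        }
    }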

Thread Types

Worker Threads
Handle client connections, request routing, and backend communication
Health Check Thread
Performs active health checks on backend servers
Metrics Thread
Aggregates statistics and exports metrics
Admin Thread
Handles configuration reloads and management API