Architecture

Hybrid C/C++ Architecture

Combining raw C performance with modern C++17 flexibility for optimal results

Design Philosophy

UltraBalancer uses a hybrid architecture: performance-critical paths are written in C, while the more complex control logic is written in C++17. The C hot path keeps per-request overhead low, and the C++ layer keeps the codebase maintainable and extensible.

Performance First
Critical paths written in C with zero-copy networking and lock-free data structures
Modular Design
Clean separation between frontend, core engine, and backend components
Production Ready
Battle-tested reliability with comprehensive error handling and graceful degradation

Component Architecture

Three-tier architecture for optimal performance and maintainability

Frontend Layer (C)
High-performance client connection handling and protocol detection

Listener Manager

  • Socket creation and binding
  • SO_REUSEPORT for multi-core
  • TCP/UDP protocol support
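
To make the SO_REUSEPORT point concrete, here is a minimal, hedged sketch of how such a listener could be created; the function name and error handling are illustrative, not UltraBalancer's actual API:

    /* Illustrative sketch: per-worker listener setup with SO_REUSEPORT */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Create a non-blocking TCP listener. Each worker opens its own socket
     * on the same port; with SO_REUSEPORT the kernel spreads incoming
     * connections across all of them. Returns the fd, or -1 on error. */
    int listener_create(uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        if (fd < 0)
            return -1;

        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, SOMAXCONN) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }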

Connection Acceptor

  • epoll/kqueue event loop
  • Non-blocking I/O
  • Connection rate limiting
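
The acceptor pattern above can be sketched roughly as follows; this assumes an edge-triggered epoll instance and omits the per-source rate limiting a real acceptor would apply:

    /* Illustrative sketch: draining a non-blocking listener from an epoll loop */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    /* Accept every pending connection on listen_fd and register each new
     * client with the worker's epoll instance (edge-triggered). Returns the
     * number of connections accepted, or -1 on a hard error. */
    int acceptor_drain(int epfd, int listen_fd)
    {
        int accepted = 0;
        for (;;) {
            int client = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
            if (client < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;                      /* backlog drained */
                return -1;
            }
            struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = client };
            epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
            accepted++;
        }
        return accepted;
    }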

Protocol Detector

  • HTTP/WebSocket detection
  • SSL/TLS handshake
  • Protocol upgrade handling
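
One common way to detect the protocol before committing to a handler is to peek at the first bytes of the stream. The sketch below is an illustration of that idea, not the project's actual detector:

    /* Illustrative sketch: peek-based protocol detection */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    enum proto { PROTO_UNKNOWN, PROTO_TLS, PROTO_HTTP };

    /* Peek at the first bytes without consuming them. A TLS ClientHello
     * begins with record type 0x16; a plaintext HTTP request begins with
     * an ASCII method name. */
    enum proto protocol_detect(int fd)
    {
        unsigned char buf[8];
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_PEEK);
        if (n < 1)
            return PROTO_UNKNOWN;
        if (buf[0] == 0x16)
            return PROTO_TLS;
        if (n >= 4 && (memcmp(buf, "GET ", 4) == 0 || memcmp(buf, "POST", 4) == 0 ||
                       memcmp(buf, "HEAD", 4) == 0 || memcmp(buf, "PUT ", 4) == 0))
            return PROTO_HTTP;
        return PROTO_UNKNOWN;
    }
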
Core Engine (C/C++)
Request routing, load balancing algorithms, and connection management

Connection Pool

  • Backend connection reuse
  • Idle connection cleanup
  • Connection limits per backend
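
As a rough illustration of the pooling idea (sizes and field names are hypothetical), a per-backend pool can be little more than a bounded stack of idle file descriptors plus a timestamp for the idle-cleanup pass:

    /* Illustrative sketch: bounded per-backend connection pool */
    #include <time.h>
    #include <unistd.h>

    #define POOL_MAX 64                 /* hypothetical per-backend limit */

    struct pooled_conn {
        int    fd;
        time_t idle_since;              /* consulted by the idle-cleanup pass */
    };

    struct conn_pool {
        struct pooled_conn idle[POOL_MAX];
        int count;
    };

    /* Reuse an idle backend connection if one exists; -1 means the caller
     * must open a fresh connection. */
    int pool_acquire(struct conn_pool *p)
    {
        return p->count > 0 ? p->idle[--p->count].fd : -1;
    }

    /* Hand a connection back for reuse, or close it if the pool is full. */
    void pool_release(struct conn_pool *p, int fd)
    {
        if (p->count < POOL_MAX) {
            p->idle[p->count].fd = fd;
            p->idle[p->count].idle_since = time(NULL);
            p->count++;
        } else {
            close(fd);
        }
    }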

Request Router

  • Load balancing algorithms
  • Session persistence
  • Request retry logic
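
For a concrete sense of the routing step, the sketch below implements smooth weighted round-robin, one classic algorithm a router like this can use; the struct layout is illustrative:

    /* Illustrative sketch: smooth weighted round-robin selection */
    #include <stddef.h>

    struct backend {
        const char *name;
        int weight;                     /* configured weight */
        int current_weight;             /* running score, starts at 0 */
    };

    /* Pick the backend with the highest running score, then subtract the
     * total weight from it so selections spread smoothly over time. */
    struct backend *wrr_pick(struct backend *b, int n)
    {
        struct backend *best = NULL;
        int total = 0;
        for (int i = 0; i < n; i++) {
            b[i].current_weight += b[i].weight;
            total += b[i].weight;
            if (!best || b[i].current_weight > best->current_weight)
                best = &b[i];
        }
        if (best)
            best->current_weight -= total;
        return best;
    }

With weights 5, 1, 1 the picks come out as A A B A C A A, so the heaviest backend gets its share without long consecutive bursts.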

Metrics Aggregator

  • Real-time statistics
  • Prometheus metrics export
  • Performance counters
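
A minimal sketch of the counter-plus-export idea: workers update relaxed atomics on the hot path and a separate thread renders the Prometheus text format. The metric names are hypothetical:

    /* Illustrative sketch: atomic counters exported in Prometheus text format */
    #include <stdatomic.h>
    #include <stdio.h>

    struct lb_metrics {
        atomic_ulong requests_total;
        atomic_ulong active_connections;
    };

    /* Hot path: workers bump counters with relaxed atomics, no locks. */
    void metrics_on_request(struct lb_metrics *m)
    {
        atomic_fetch_add_explicit(&m->requests_total, 1, memory_order_relaxed);
    }

    /* Metrics thread: render the current values in Prometheus exposition
     * format (metric names here are hypothetical). */
    int metrics_render(struct lb_metrics *m, char *buf, size_t len)
    {
        return snprintf(buf, len,
            "# TYPE ultrabalancer_requests_total counter\n"
            "ultrabalancer_requests_total %lu\n"
            "# TYPE ultrabalancer_active_connections gauge\n"
            "ultrabalancer_active_connections %lu\n",
            atomic_load_explicit(&m->requests_total, memory_order_relaxed),
            atomic_load_explicit(&m->active_connections, memory_order_relaxed));
    }
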
Backend Layer (C)
Backend server management, health monitoring, and failover

Server Manager

  • Backend server registry
  • Dynamic server addition/removal
  • Weight and priority management
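
The registry itself can be sketched as a growable array of weighted entries; the types below are illustrative rather than the project's real data structures:

    /* Illustrative sketch: dynamic backend registry with weight and priority */
    #include <stdlib.h>
    #include <string.h>

    struct backend_entry {
        char host[64];
        unsigned short port;
        int weight;                     /* relative share of traffic */
        int priority;                   /* selection tier; semantics up to the router */
        int enabled;
    };

    struct backend_registry {
        struct backend_entry *servers;
        size_t count, cap;
    };

    /* Add a backend at runtime; returns 0 on success, -1 on allocation failure. */
    int registry_add(struct backend_registry *r, const char *host,
                     unsigned short port, int weight, int priority)
    {
        if (r->count == r->cap) {
            size_t cap = r->cap ? r->cap * 2 : 8;
            struct backend_entry *p = realloc(r->servers, cap * sizeof(*p));
            if (!p)
                return -1;
            r->servers = p;
            r->cap = cap;
        }
        struct backend_entry *e = &r->servers[r->count++];
        strncpy(e->host, host, sizeof(e->host) - 1);
        e->host[sizeof(e->host) - 1] = '\0';
        e->port = port;
        e->weight = weight;
        e->priority = priority;
        e->enabled = 1;
        return 0;
    }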

Health Checker

  • Active health probes
  • Passive failure detection
  • Automatic failover
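
An active probe in its simplest form is a bounded-time TCP connect, as sketched below; a production checker would usually add a layer-7 request (for example an HTTP health endpoint) on top of this:

    /* Illustrative sketch: active TCP health probe with a timeout */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Returns 1 if a non-blocking connect() to ip:port completes within
     * timeout_ms, 0 otherwise. */
    int backend_alive(const char *ip, unsigned short port, int timeout_ms)
    {
        int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        if (fd < 0)
            return 0;

        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, ip, &addr.sin_addr);

        int ok = 0;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            ok = 1;                                   /* connected immediately */
        } else if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, timeout_ms) == 1) {
                int err = 0;
                socklen_t len = sizeof(err);
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
                ok = (err == 0);
            }
        }
        close(fd);
        return ok;
    }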

Session Stickiness

  • Cookie-based affinity
  • IP-based persistence
  • Session table management
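
IP-based persistence can be as small as a stable hash of the client address, as in this sketch; cookie affinity and the session table follow the same idea with a different key:

    /* Illustrative sketch: IP-based persistence via a stable hash */
    #include <stdint.h>

    /* FNV-1a over the IPv4 address: the same client always maps to the same
     * backend index while the backend count is unchanged. Real deployments
     * typically layer consistent hashing or a session table on top so that
     * adding a backend does not remap every client. */
    unsigned ip_hash_pick(uint32_t client_ip, unsigned backend_count)
    {
        uint32_t h = 2166136261u;
        for (int i = 0; i < 4; i++) {
            h ^= (client_ip >> (i * 8)) & 0xff;
            h *= 16777619u;
        }
        return h % backend_count;
    }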

Performance Optimizations

Advanced techniques for maximum throughput and minimal latency

Lock-Free Data Structures

Critical data structures use atomic operations and compare-and-swap (CAS) instead of mutexes, eliminating lock contention and enabling true parallelism across CPU cores.

  • Lock-free ring buffers for request queues
  • Atomic counters for statistics
  • RCU for read-heavy data structures
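
As one example of the pattern, the sketch below shows a single-producer/single-consumer ring built only from C11 atomics; a multi-producer variant would replace the plain stores with CAS loops:

    /* Illustrative sketch: lock-free single-producer/single-consumer ring */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                    /* must be a power of two */

    struct spsc_ring {
        void *slots[RING_SIZE];
        _Atomic size_t head;                  /* advanced only by the producer */
        _Atomic size_t tail;                  /* advanced only by the consumer */
    };

    /* Producer: publish an item without taking a lock. */
    bool ring_push(struct spsc_ring *r, void *item)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                     /* full */
        r->slots[head & (RING_SIZE - 1)] = item;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer: take the next item, or NULL if the ring is empty. */
    void *ring_pop(struct spsc_ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                      /* empty */
        void *item = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return item;
    }
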
Zero-Copy Networking

Data is transferred directly between network buffers and application memory without intermediate copies, reducing CPU overhead and memory bandwidth consumption.

  • splice() system call for TCP proxy
  • sendfile() for static content
  • Direct I/O for large transfers
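
The splice() path mentioned above can be sketched as two calls through a per-connection pipe; a complete proxy would loop until the pipe is drained and handle partial writes:

    /* Illustrative sketch: zero-copy TCP forwarding with splice() */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move up to len bytes from one socket to another through a pipe
     * (created once per connection with pipe2(pipefd, O_NONBLOCK)) without
     * copying payload into userspace. Returns bytes forwarded toward to_fd,
     * or 0/-1 on EOF/error. */
    ssize_t proxy_splice(int from_fd, int to_fd, int pipefd[2], size_t len)
    {
        ssize_t in = splice(from_fd, NULL, pipefd[1], NULL, len,
                            SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (in <= 0)
            return in;
        return splice(pipefd[0], NULL, to_fd, NULL, (size_t)in,
                      SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    }
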
NUMA-Aware Memory

Memory is allocated on the same NUMA node as the CPU processing the data, minimizing cross-node memory access latency on multi-socket systems.

  • Per-core memory pools
  • CPU affinity for worker threads
  • NUMA-local connection handling
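
In practice this combines CPU affinity with node-local allocation, roughly as in the sketch below (using libnuma; the helper name is illustrative):

    /* Illustrative sketch: pin a worker and allocate its pool NUMA-locally */
    #define _GNU_SOURCE
    #include <numa.h>                         /* link with -lnuma */
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    /* Pin the calling worker thread to one CPU, then allocate its buffer
     * pool on that CPU's NUMA node so hot-path accesses stay node-local.
     * Free the returned region with numa_free(). */
    void *worker_local_pool(int cpu, size_t pool_bytes)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        int node = numa_node_of_cpu(cpu);
        return numa_alloc_onnode(pool_bytes, node);
    }
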
Kernel Bypass (Optional)

Optional DPDK integration moves packet processing to userspace, bypassing the kernel network stack for extreme performance scenarios.

  • Userspace packet processing
  • Poll mode drivers (PMD)
  • Huge page support for memory
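
The heart of a DPDK data path is a poll-mode receive loop like the sketch below; EAL, mempool, and port initialization are assumed to have been done already and are omitted for brevity:

    /* Illustrative sketch: DPDK poll-mode receive loop.
     * EAL, mempool, and port setup (rte_eal_init, rte_eth_dev_configure,
     * rte_eth_rx_queue_setup, rte_eth_dev_start) are assumed done elsewhere. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    void rx_poll_loop(uint16_t port_id)
    {
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
            /* Poll-mode driver: no interrupts, just pull bursts of packets. */
            uint16_t n = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... hand the packet to the userspace protocol path ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
    }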

Threading Model

Multi-threaded architecture optimized for modern multi-core processors

Worker Thread Architecture
One worker thread per CPU core for optimal parallelism

Design Principles

  • Shared-Nothing: Each worker has its own event loop and connection pool
  • CPU Affinity: Workers pinned to specific cores to maximize cache locality
  • SO_REUSEPORT: Kernel load balances incoming connections across workers
  • Minimal Synchronization: Lock-free communication between threads
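
Putting these principles together, each worker pins itself to a core and owns its epoll instance and SO_REUSEPORT listener. The sketch below is illustrative and reuses the hypothetical listener_create() helper from the Frontend Layer section:

    /* Illustrative sketch: shared-nothing worker startup */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    /* SO_REUSEPORT listener helper as sketched in the Frontend Layer section
     * (hypothetical name). */
    int listener_create(uint16_t port);

    struct worker {
        int cpu;                        /* core this worker is pinned to */
        int epfd;                       /* private event loop, never shared */
        int listen_fd;                  /* private SO_REUSEPORT listener */
    };

    static void *worker_main(void *arg)
    {
        struct worker *w = arg;

        /* CPU affinity: keep this worker's data hot in one core's caches. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(w->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Shared-nothing: each worker owns its epoll instance and listener;
         * the kernel spreads new connections across the SO_REUSEPORT sockets. */
        w->epfd = epoll_create1(0);
        w->listen_fd = listener_create(8080);   /* hypothetical listen port */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = w->listen_fd };
        epoll_ctl(w->epfd, EPOLL_CTL_ADD, w->listen_fd, &ev);

        struct epoll_event events[256];
        for (;;) {
            int n = epoll_wait(w->epfd, events, 256, -1);
            for (int i = 0; i < n; i++) {
                /* ... accept new clients or service existing connections ... */
            }
        }
        return NULL;
    }

    void spawn_workers(int ncpus)
    {
        for (int cpu = 0; cpu < ncpus; cpu++) {
            struct worker *w = calloc(1, sizeof(*w));
            w->cpu = cpu;
            pthread_t tid;
            pthread_create(&tid, NULL, worker_main, w);
        }
    }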

Thread Types

Worker Threads
Handle client connections, request routing, and backend communication
Health Check Thread
Performs active health checks on backend servers
Metrics Thread
Aggregates statistics and exports metrics
Admin Thread
Handles configuration reloads and management API