Powering through challenges: DCENT's successful collaboration with CURIO

In the world of high-performance computing, transitions between software systems can be fraught with complexities. At DCENT, we recently embarked on an ambitious journey to migrate from our legacy LOTUS software to the cutting-edge CURIO stack. This article delves into our collaborative efforts with the CURIO development team, highlighting the technical hurdles we overcame and the significant performance enhancements we achieved.

The migration landscape

Our transition to CURIO demanded a complete overhaul of our existing infrastructure. The new PC1/PC2 system requirements were substantial:

  • HPE ProLiant DL385 Gen10 Plus V2 Server
  • Single AMD EPYC 7H12 64-Core CPU
  • 512GB DDR4 3200MHz RAM
  • 100GbE network
  • 2 x NVIDIA A5000 GPU
  • 8 x KIOXIA CD8-V 6.4TB data center NVMe drives

Our goal? To build a system capable of achieving 10-20M IOPS and handling a pipeline of 128 parallel tasks. This setup was crucial for efficiently processing the large datasets integral to DCENT's operations.
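To put that target in per-device terms, a quick back-of-the-envelope calculation helps. The sketch below simply divides the aggregate IOPS goal evenly across the eight NVMe drives listed above. The even split is an assumption made for illustration, not a measured distribution, and the function name is hypothetical.

```python
def per_drive_iops(total_iops: int, drive_count: int) -> float:
    """Return the IOPS each drive must sustain if load is spread evenly.

    Assumption: requests are distributed uniformly across all drives.
    """
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return total_iops / drive_count


DRIVES = 8  # from the hardware list above

# Lower and upper bounds of the stated 10-20M IOPS goal.
for target in (10_000_000, 20_000_000):
    share = per_drive_iops(target, DRIVES)
    print(f"{target:,} IOPS over {DRIVES} drives = {share:,.0f} per drive")
```

Under that assumption, each drive would need to sustain roughly 1.25M to 2.5M IOPS for the system to reach the overall goal.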

Navigating technical turbulence

As we delved deeper into the migration process, we encountered several technical challenges:

  • Hardware Compatibility: Achieving optimal performance while ensuring correct server configurations for CURIO's architecture proved to be a complex task.
  • Software Optimizations: We grappled with memory and I/O bottlenecks, particularly on multi-processor systems, leading to inefficiencies in the hashing process.
  • NUMA Node Allocation: Proper allocation of hugepages on NUMA nodes was critical for system performance, but initially presented significant issues.

Collaborative problem-solving

The CURIO development team was instrumental in helping us navigate these challenges. Their expertise guided us through:

  • Reconfiguring memory slots and CPU threads for maximum hardware performance
  • Debugging NVMe drive handling and ensuring proper management by SPDK
  • Adjusting configurations for correct hugepage allocation on NUMA nodes
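When checking hugepage allocation per NUMA node, it can help to read the counters that Linux exposes under sysfs. The following is a minimal sketch, not the tooling we or the CURIO team actually used: it assumes a Linux host with the standard `/sys/devices/system/node` layout, and the function name `hugepages_per_node` is hypothetical. It only reports what is allocated and makes no claim about which page size or count CURIO expects.

```python
from pathlib import Path


def hugepages_per_node(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node name -> {hugepage size directory: (total, free)}.

    Reads the kernel's per-node counters (nr_hugepages, free_hugepages).
    Returns an empty dict if the sysfs tree is not present.
    """
    result = {}
    for node_dir in sorted(Path(sysfs_root).glob("node[0-9]*")):
        sizes = {}
        for hp_dir in sorted((node_dir / "hugepages").glob("hugepages-*")):
            total = int((hp_dir / "nr_hugepages").read_text())
            free = int((hp_dir / "free_hugepages").read_text())
            sizes[hp_dir.name] = (total, free)
        result[node_dir.name] = sizes
    return result


if __name__ == "__main__":
    # Print one line per node and page size for a quick visual check.
    for node, sizes in hugepages_per_node().items():
        for size, (total, free) in sizes.items():
            print(f"{node} {size}: total={total} free={free}")
```

A node that reports zero pages while another holds the full pool is the kind of imbalance worth investigating when hugepage allocation across NUMA nodes is misbehaving.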

Despite these efforts, we faced intermittent performance issues, including slow sector processing times and difficulties with CURIO sealing operations. These required further fine-tuning of both software and hardware parameters.

Achieving performance breakthroughs

After extensive troubleshooting and optimization, we successfully brought both our Alpha and Beta (Supra-workers) machines online. While some performance concerns persisted, we made significant strides:

  • Simplified Configuration: We reduced system complexity for demonstration purposes, focusing on a single-processor setup to streamline debugging.
  • Optimized Beta System: Configured with kernel patches and optimized for Zen2/3 CPUs, our Beta system showed promising results.

All machines are now operating as expected, marking a major milestone in our transition to CURIO.

Scaling for the future

As we look ahead, we're focusing on two key areas:

  1. Sector Expiry Management: We're addressing anomalies in handling sector expiry terms, which affected our snap pipeline efficiency.
  2. Infrastructure Scaling: With plans to add eight new SPs to our cluster, we're exploring the setup of a high-availability (HA) worker dedicated to the piece store and Boost connections.

These initiatives will ensure smooth operations as we expand our infrastructure and capitalize on CURIO's advanced capabilities.

The road ahead

Our collaboration with the CURIO team has been both challenging and rewarding. While we've made significant progress, optimization remains an ongoing process. We're committed to fully integrating CURIO into DCENT's infrastructure, ensuring a smooth and efficient workflow for our large-scale mining operations.

This partnership has not only enhanced our technical capabilities but also demonstrated the power of collaborative problem-solving in overcoming complex technological challenges.