Advanced Research Computing extending the Hawk network fabric
24 August 2020
Work will be undertaken in September to continue the expansion of the “Hawk” supercomputing system at Cardiff University.
This will be achieved by extending the associated network fabric, an extension made possible by the system’s pluggable infrastructure architecture.
The “pluggable” infrastructure model has characterised Cardiff’s supercomputing systems since 2014 and the early days of Raven. The systems are designed so that Cardiff researchers can supplement the core research infrastructure by using grant funding to procure additional nodes dedicated to their own research projects. Such an approach relies on the system interconnect fabric – the “glue” that integrates the nodes into the overall system – being extensible, with the ability to capitalise on the latest generations of interconnect fabrics in a cost-effective and rapid fashion.
The initial Hawk system was designed in 2018 to accommodate such expansion, based on Mellanox’s high-throughput, low-latency EDR (100 Gb/s) InfiniBand connectivity. That fabric has been at the heart of Hawk’s growth from the 8,040-core system of July 2018 to the current system comprising 19,416 cores.
With the current EDR-based fabric effectively saturated, and with the release of an upgraded HDR fabric from Mellanox (200 Gb/s), the challenge arose of how best to integrate both EDR and HDR networks without isolating the existing EDR network.
Following discussions with our external technology partners (Mellanox, Dell and Atos), the resulting architectural design introduces two so-called Lustre Networking (LNet) bridges. These are servers that straddle both the EDR and HDR networks using multiple ConnectX-6 InfiniBand (IB) controllers, thereby creating a bridge between the two networks. In addition, the Network File System (NFS) storage will be upgraded with a dual-port ConnectX-6 adaptor so that it too can straddle both IB networks.
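The bridging idea described above can be illustrated with a small reachability model. This is a minimal sketch only, not the actual Hawk configuration: the host names, the attachment table and the shortest_path helper are all hypothetical, and the model assumes simply that two hosts can talk directly when they share a fabric and that only the bridge servers relay traffic between fabrics.

```python
from collections import deque

# Hypothetical hosts and the fabric(s) each one is attached to.
# None of these names come from the real Hawk system.
ATTACHMENTS = {
    "edr-compute-01": {"EDR"},
    "hdr-compute-01": {"HDR"},
    "lustre-server": {"EDR"},
    "lnet-bridge-1": {"EDR", "HDR"},
    "lnet-bridge-2": {"EDR", "HDR"},
    "nfs-server": {"EDR", "HDR"},
}

# Assumption: only the bridge servers forward traffic between fabrics.
BRIDGES = {"lnet-bridge-1", "lnet-bridge-2"}


def shortest_path(src, dst, attachments, forwarders):
    """Breadth-first search over hosts.

    Two hosts are adjacent when they share at least one fabric.
    Intermediate hops must be members of `forwarders`.
    Returns the list of hosts on the path, or None if unreachable.
    """
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        here = path[-1]
        if here == dst:
            return path
        # A host that is not a forwarder cannot relay traffic onward.
        if here != src and here not in forwarders:
            continue
        for other, fabrics in attachments.items():
            if other not in seen and attachments[here] & fabrics:
                seen.add(other)
                queue.append(path + [other])
    return None


if __name__ == "__main__":
    # An HDR-only node reaches EDR-only storage through a bridge.
    print(shortest_path("hdr-compute-01", "lustre-server", ATTACHMENTS, BRIDGES))
    # Without bridges the two fabrics are isolated from each other.
    print(shortest_path("hdr-compute-01", "lustre-server", ATTACHMENTS, set()))
    # Dual-attached storage is reachable directly from either fabric.
    print(shortest_path("hdr-compute-01", "nfs-server", ATTACHMENTS, BRIDGES))
```

In this toy model, an HDR-only node reaches EDR-only storage only by passing through a bridge, while the dual-attached storage server is one hop from hosts on either fabric, which is the motivation for giving it an adaptor on both networks.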
The expansion to enable all future Hawk system installations to utilise HDR (200 Gb/s) is to be completed during September and will accommodate a further 60 nodes. This next step in the evolution of the pluggable infrastructure approach will deliver a rapid and successful expansion of Hawk, mirroring the progress accomplished at the start of service in July 2018.