Looking forward to returning to Multicore World — (New Zealand February 2025)

adrian cockcroft
5 min read · Nov 16, 2024
A curious scissortail sergeant fish enjoying Bora Bora with me — Picture by Adrian

There are a few conferences I’ve attended around the world that are more about building communities than about the content (although they often have the best content). They are predominantly single-track, small events, run by one or two opinionated individuals rather than a committee or a corporation. It’s where I like to try out brand-new material, as there’s a lot of opportunity for feedback. I’m thinking of Monitorama, Gluecon (rebranded as SW2con this year), the High Performance Transaction Systems workshop (hpts.ws), and Multicore World, which I attended for the first time in February 2024 in Christchurch, New Zealand. I’m going back again in February 2025.

I was introduced to Nicolás Erdödy at Supercomputing 22 by my OrionX colleague Shahin Khan, who encouraged me to attend Nicolás’s event. I couldn’t schedule it that year, but got it onto my calendar for February 2024. The attendees are a mixture of local community members and technologists from New Zealand, and leaders from the global high performance computing community.

I decided that I wanted to explore an area that I was curious about, but not expert in, and see if I could get some interest and feedback from real experts in the audience. I like to understand the basic workload characteristics of systems, and I’ve seen many workloads over the years, but I haven’t worked on LLM training workloads at large scale. I noticed that the network bandwidth configured on these systems is extremely high by any normal standard, so I decided to try to figure out why, and how it is used.

The biggest LLM training runs are performed by companies like OpenAI, Google, and Meta, and use tens of thousands of GPUs running for months. The result is shared as a service like OpenAI’s ChatGPT. When organizations experiment with building an LLM-based service, they often start with the shared ChatGPT service or the underlying API because it produces the best results. However, it’s relatively expensive to use as a service, can be slow, and may be down for maintenance at inconvenient times. The alternative is to train your own smaller LLM and try to create a more specialized expert in the subjects you need. The result may be of good enough quality, far cheaper to run, and you can manage its performance and availability directly for your own use case. A collection of smaller, more specialized LLMs working together is another useful approach. To create such a model, it’s common to start with a pre-trained model like Meta’s Llama 3, and train using perhaps a few hundred GPUs for a few days, then test and iterate until it’s working well enough.
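To make the shape of that workflow concrete, here’s a minimal fine-tuning sketch using the Hugging Face Transformers, PEFT, and Datasets libraries. The model name, data file, and hyperparameters are illustrative assumptions on my part, not a recipe from any of the teams mentioned in this post.

```python
# A minimal fine-tuning sketch, assuming Transformers, PEFT and Datasets are
# installed and you have access to a Llama 3 checkpoint. The model name,
# data file and hyperparameters are placeholders for illustration.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"              # pre-trained starting point
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token        # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA freezes the base weights and trains small adapter matrices, which is
# part of why a smaller cluster running for a few days can be enough.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Any corpus of domain text; "domain_corpus.txt" is a hypothetical file name.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # then test, evaluate, and iterate from here
```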

Two examples of this that I’ve looked at are ClimateGPT by erasmus.ai and the Mixture of Intelligent Agents architecture used by flip.ai to build their incident analysis tool. The ClimateGPT paper is a very nicely explained walkthrough of how to build a climate expert by feeding it a body of scientific information, with what seems to be a state-of-the-art approach to validation, translation, and operational issues.

Flip.ai are at the bleeding edge, with a very experienced team, pushing the boundaries of what can be built by composing a real-time incident analysis service from what they call a system of intelligent agents. I used to work with their CTO Sunil Mallya at AWS, and learned a lot from him. We visited a robot racing meetup together and dreamed up the AI-driven robot racing series that launched at re:Invent 2017, then Sunil got the AWS RoboRace challenge going in 2018, using a custom-built car and groundbreaking reinforcement learning techniques, and he also created and ran several AWS AI services.

There are quite a lot of people running small training workloads repeatedly, and they are not cheap to run: several thousand dollars an hour for a few days puts each run in the region of $100K. Optimizing that workload so it runs more efficiently and at lower cost seems like a worthwhile area to understand. My approach was to read a lot, ask a few people for ideas, and gather my thoughts on a Miro board as I put together the story for my presentation at Multicore World 2024. I then presented an updated version of the talk at SW2con in May 2024.
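As a rough sanity check on that per-run figure, here’s the back-of-the-envelope arithmetic. The GPU count, price per GPU-hour, and run length are assumed round numbers for illustration, not quotes from any particular provider.

```python
# Back-of-the-envelope cost of one fine-tuning run, with assumed numbers.
gpus = 256                     # "a few hundred GPUs"
price_per_gpu_hour = 8.0       # assumed cloud price per GPU-hour, in USD
hours = 48                     # "a few days" of training

cluster_cost_per_hour = gpus * price_per_gpu_hour     # ~$2,048 per hour
run_cost = cluster_cost_per_hour * hours              # ~$98,304 per run
print(f"~${cluster_cost_per_hour:,.0f}/hour, ~${run_cost:,.0f} per run")
```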

To understand what is happening I went deep into the open source technology stack that NVIDIA uses, figuring out that the networking is driven via the NCCL library, which batches patterns of large transfers over whatever transport is available. It’s not at all like the socket-based request/response mechanisms most developers use. NCCL has added features over the years, and there are excellent videos explaining how it works.
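Here’s what that looks like from the application side, as a sketch assuming PyTorch with its NCCL backend and a launcher like torchrun (the file name and tensor size are made up): each rank contributes a large buffer, and NCCL schedules the batched transfers between GPUs underneath a single collective call.

```python
# Minimal all-reduce sketch over NCCL via PyTorch distributed. Launch with
# something like: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK; NCCL moves the data.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # In real training this would be a bucket of gradients. NCCL moves buffers
    # like this in large batched collectives, not request/response exchanges.
    grads = torch.randn(64 * 1024 * 1024, device="cuda")  # ~256 MB of float32
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)           # sum across all GPUs
    grads /= dist.get_world_size()                         # average the gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```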

The transport could be InfiniBand directly, or for cloud providers there is a Libfabric interface that in turn abstracts different underlying mechanisms. AWS routes Libfabric over its EFA transport, which uses a custom underlying protocol over many Ethernet connections in parallel. This doesn’t have as low a minimum latency as InfiniBand, but it has similar bandwidth and lower variance for large transfers, as it avoids head-of-line blocking effects.
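For a concrete sense of how that layering gets selected at run time, here’s a hedged configuration sketch: with the aws-ofi-nccl plugin installed, NCCL talks to Libfabric, and environment variables point Libfabric at EFA. The values shown are typical settings, not a definitive configuration for any particular instance type.

```python
# Sketch of pointing NCCL at Libfabric/EFA on AWS, assuming the aws-ofi-nccl
# plugin and EFA driver are installed. Values are typical, not canonical.
import os
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")           # Libfabric: use the EFA provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # GPUDirect RDMA where the instance supports it
os.environ.setdefault("NCCL_DEBUG", "INFO")           # log which transport NCCL actually picked

# With NCCL_DEBUG=INFO, the startup log shows whether the OFI (Libfabric/EFA)
# network plugin was selected instead of sockets or InfiniBand verbs.
dist.init_process_group(backend="nccl")
```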

One aspect of computer architecture that appears to be an emerging trend is that the CPU isn’t the “Central Processing Unit” any more. It’s no longer at the center of the architecture. The CPU is an I/O processor handling filesystem and control plane work, hanging out around the edge of the central collection of GPUs that manage memory and networking directly. If one NVIDIA Hopper or Blackwell GPU wants to talk to another GPU, it connects directly over a coherent memory bus within a rack. If one of the Grace CPUs in this configuration wants to communicate directly with another Grace CPU, it has to go through two Hopper or Blackwell GPUs to get there (a small probe of GPU-to-GPU connectivity is sketched below the diagram). This could be written off as an obscure architecture feature if it wasn’t central to the highest-revenue and most successful chip vendor, NVIDIA. This is where most of the money in IT is currently being spent, so optimizing for the Grace Hopper and Blackwell based architectures is going to be where most of the benefits lie over the next year or so.

Grace Hopper GH200 Module
Grace Hopper Architecture — Grace to Grace via Hopper within a cluster
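As mentioned above, a quick way to see the GPU-centric topology from software is to ask the CUDA runtime whether each pair of GPUs can access each other’s memory directly. This is just a small probe sketch using PyTorch, not a vendor tool; on an NVLink-connected Hopper or Blackwell node it reports peer access for every pair.

```python
# Probe GPU-to-GPU peer access on the local node (a sketch, assuming PyTorch
# with CUDA). On an NVLink-connected node every pair reports True, so tensors
# can move GPU-to-GPU without staging through CPU (host) memory.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access = {ok}")
```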

I’m going to explore this topic further in a new talk at Multicore World 2025, and I’m happy to discuss it more as I figure out what I’m going to say. New Zealand is a very cool place to visit, and it’s worth taking some extra time there to explore the sights and Māori culture.


Written by adrian cockcroft

Work: Technology strategy advisor, Partner at OrionX.net (ex Amazon Sustainability, AWS, Battery Ventures, Netflix, eBay, Sun Microsystems, CCL)
