Bringing shape and life to building a supercomputer
Dan Randow
I help people collaborate, when it's hard, so they can create value and have a good experience together – usually in tech.
My nine months as Strategic Projects Manager at NeSI were a great privilege, and a highlight of my career. I will always be grateful for this opportunity to contribute to an inspiring kaupapa (purpose), working with incredible people, in a strategic management role.
My task in the engagement was to help NeSI gain momentum with building a supercomputer, and evolving the organisation. These two were exquisitely intertwined. My contribution was to make both – and the resonance between them – a little more visible, and to foster activity to advance them.
I'll share some context before telling my story of joining NeSI on its journey, and sharing some reflections from that.
High performance computers powering science for Aotearoa
New Zealand eScience Infrastructure (NeSI) provides high performance computing (HPC) services to scientific researchers across Aotearoa (New Zealand). This means that researchers who run complex models with large datasets can get results in a few hours, instead of the months it would take on their laptop. As more and more science involves computation, NeSI is increasingly key in Aotearoa's participation in the global research community, and the prosperity generated by research.
Computational research is not just expanding, it's becoming more diverse. Software development is becoming a core competency for researchers, and there is a new profession of specialists in Research Software Engineering (RSE). RSEs require more flexibility than is offered by traditional HPC. They need to use specialised tools, and build custom workflows and pipelines using DevOps approaches. They also work in new collaborative modes, in groups, communities and partnerships, sometimes at scale. To support these modes of HPC, NeSI has built a new Flexible HPC Platform supporting organisational tenancies and the Research Developer Cloud. FlexiHPC is built on OpenStack cloud infrastructure, optimised for research computing.
NeSI's case studies provide a glimpse of their mahi (work), as do NeSI's partnerships with the tikanga-based Genomics Aotearoa and Rakeiora initiatives, and with AgResearch.
When I joined NeSI in February 2024, I was inspired and excited to be able to contribute to this mahi.
A platform refresh bringing performance, flexibility and extensibility
In response to the growing demand for capacity and flexibility in their service, NeSI was embarking on building a new supercomputer. With a significant investment, the new facility would provide massively increased capacity and performance, including GPUs for AI and ML. It would also offer researchers much greater flexibility in their use of the service, and be easily extensible to support future partnerships.
This meant adopting a new multi-vendor approach, sourcing the best technologies from specialised vendors, and a ground-up build in a data centre, as well as adopting DevOps approaches to building and configuring the system. It also provided the opportunity for a complete rethink of how to architect and integrate those technologies to build a supercomputer for the future for NeSI and eScience in Aotearoa. And pallets of new equipment were already arriving.
Get us collaborating cross-functionally
My job, as Strategic Projects Manager, was to help with this platform refresh, and the organisational transformation needed to achieve it. Working closely with the Director and SLT, the role was to foster agile values and principles, connecting teams, individuals and their work across products and services, with overall purpose and outcomes.
In my interview, Director Nick Jones summarised the job as "get us collaborating cross-functionally". With a fixed term contract, it was like a consultancy, with some rolling up of sleeves to get stuff done.
I had to laugh at the job title. For the previous five years I had been trying to expunge the concept of projects from the software space. Was HPC really so predictable that up-front planning and Gantt charts were applicable? Didn't as-code mean this was software? I resolved to bring a product mindset to this mahi. With three capable Product Managers in NeSI already, that wasn't my role – except perhaps for the platform refresh as a whole.
How would I approach Product Management when it wasn't my role? How about: as I would approach it, if it was! But first I had to get up to speed.
It can take 18 months to get your head around NeSI
Soon after I arrived, someone said that new people often take a year or two to really understand NeSI. Both organisationally and technically, there is a lot going on. I only had nine months to spin up and deliver something!
Clarification by iterative sketching
I leapt into learning and delivering value at the same time. This often meant describing, as best I could, something I barely understood, to prompt others to correct me. Gradually, my resolution improved. Sometimes, with my broad brush, I even made it easier for others to see the wood for the many trees.
I realised that clarification was a key contribution I could make. If I couldn't understand it, or find the documentation, there was a chance that others couldn't either. If the work wasn't clear, how could people organise around it to deliver it?
There were lots and lots of Jira and Mural boards, but people told me they couldn't see the path ahead. As I began facilitating rituals for the overall rebuild, I saw what they meant. Work items were different sizes, variably categorised, often not elaborated, and in many cases missing.
It was an early stage of the design so there were many unknowns. Was the roadmap as good as it could be, given the ambiguity of the work? Or, could we create a map for addressing that ambiguity? Could people use that to focus on one thing at a time to accelerate progress?
I made lots of sketches, with Mural, with paper held to the Zoom camera. Is it something like this? Oh, right, maybe it's more like this? Who is this for? What outcome are they trying to achieve? How might they use this to achieve that?
Making the map as I went, I entered the depths of the system, and organisation.
Testing my "doesn't matter how technical" claim
Though I am not a technical specialist, I am usually confident that I can figure out enough to be useful in any technical domain. Figuring out NeSI meant doing this in ten areas at once. I will list some...
On top of the inherent complexity of HPC, NeSI supports a series of specialised applications and tenancies. Then there are NeSI's training, consultancy and conference-convening activities. With a constant flow of research workloads to support and systems to maintain, BAU is significant. Add to that, NeSI is not one organisation but a collaboration between four, with staff spread across those institutions.
Supercomputer or lots of computers working closely together
With high performance computing (HPC), a job runs fast because it is distributed (parallelised) across tens or hundreds of cores (CPUs). This means the job is broken into small pieces that run across multiple nodes (computers), all constantly talking with each other. This makes HPC a special case of computing at scale, requiring very high speed and low latency communication between the nodes. Because the jobs can involve many terabytes of data, HPC must have extremely fast and high capacity storage. Resources like these are expensive, so the whole system requires significant optimisation, and sophisticated queueing of the work coming in. Traditionally, HPC was only possible with specialised massively parallel computers. Recently, it has become possible to build a supercomputer with top end, but off-the-shelf equipment. Working as a single computer, however, still requires a very high degree of internal cohesion. Almost everything is in some way connected to everything else.
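The split-and-combine pattern described above can be sketched in a few lines. This is only an illustration, not how NeSI runs jobs: the function names are hypothetical, and threads on one machine stand in for nodes. Real HPC workloads typically spread the pieces across many nodes with a scheduler and a message-passing library, and CPU-bound Python code would need processes rather than threads to actually run faster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """The work one worker does on its own piece of the job."""
    return sum(x * x for x in chunk)


def run_job(items, n_workers=4):
    """Distribute the job across workers, then combine the partial results."""
    chunks = split_into_chunks(items, n_workers)
    # Threads are a stand-in for nodes here (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(run_job(list(range(1_000_000))))
```

Even in this toy version, the final answer depends on every worker finishing and reporting back, which hints at why the communication between nodes matters so much at scale.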
Threading together people and pipelines
The dense web of dependencies almost paralyses progress. To decide one thing, you have to decide the dependencies, each bound by their own dependencies.
To address this, I advocated the Steel Thread approach that I learned working with Sentify. The story goes that back in the day, the first step of building a bridge was to throw a steel thread across the gap to be bridged. That allowed you to draw increasingly heavy lines until you had a cable.
In software and organisational change, my view of a steel thread is that it is both technical and social. It's a minimal but working end-to-end solution that is created by a developing cross-functional team. For a start it does no more than say "hi" to a user. All dependencies are mocked up or ignored but it works. Importantly, it is built as-code so you can repeat the build, trying different things, and gradually adding functionality.
It is easy to criticise a steel thread as the initial user value is so low, and it is oblivious to constraints. And it's easy for the nascent team to counter that with "what would you like to see next?" and to demo that a week later.
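For readers who think in code, the technical half of a steel thread might look something like the sketch below. Everything in it is hypothetical: the `SteelThread` class, the three dependencies and the names are mine, chosen for illustration, and are not what the NeSI team built. The point is only that the first iteration runs end to end with every dependency mocked, and that each later iteration swaps one mock for something more real.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass
class SteelThread:
    """A minimal end-to-end path; every dependency is a replaceable stub."""
    authenticate: Callable[[str], bool]
    schedule: Callable[[str], str]
    store: Callable[[str], None]

    def run(self, user: str) -> str:
        # Walk the whole path, however thin each step is.
        if not self.authenticate(user):
            return "access denied"
        job_id = self.schedule(user)
        self.store(job_id)
        return f"hi {user}, your job {job_id} ran"


def build_first_thread() -> SteelThread:
    """Iteration one: all dependencies mocked, but the thread works."""
    return SteelThread(
        authenticate=lambda user: True,   # mocked: everyone is allowed
        schedule=lambda user: "job-0",    # mocked: a fixed job id
        store=lambda job_id: None,        # mocked: nothing is saved
    )


if __name__ == "__main__":
    thread = build_first_thread()
    print(thread.run("ada"))

    # Iteration two: replace one mock with a slightly more real rule.
    known_users = {"ada"}
    stricter = replace(thread, authenticate=lambda user: user in known_users)
    print(stricter.run("bob"))
```

Because the thread is rebuilt from code each time, answering "what would you like to see next?" means changing one dependency and running it again.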
The Leadership Team supported the idea and pioneering Product Manager Jun Huh assembled a small part-time team to start building. The team demoed several iterations of a working system. As more became clear, they pivoted, expanded and used the same approach for a more ambitious initiative, this time taking an idea from conception right through to going live.
Sleeves-up, hands-on in the deep end
As I learned my way around, I was asked in to help with risk management, board reporting, stakeholder analysis, comms and engagement, code-standards and developer practices, vendor acceptance and even some UX testing.
I began diving in to help progress work across various technical areas. It was all hands to the pump, including my novice ones.
I am grateful for the help of, and opportunity to work with, NeSI's talented and passionate people including Nick Jones, Blair Bethwaite, Georgina Rae, Catherine Henderson, José Higino, Matt Chamberlain, Jana Makar, Jun Huh, Claire Rye, Nathalie Giraudon, Thomas Berger, Greg Hall, Michael Karich and many others.
As I learned and researched, I relentlessly documented, trying to reduce the number of times decisions were visited.
But there was something bigger I was there to contribute.
Product management: making or facilitating the decisions?
It seems to me there are two main styles of product management, each taking a different approach to decisions. Those with the more decisive style tend to make decisions themselves. They are visionary and entrepreneurial and use their positional authority to drive things forward. More facilitative product managers broker agreement among the various stakeholders. That way, they believe, decisions will be better, and people will be motivated to implement them.
Being the facilitative type myself, I needed no title or authority to be a product manager. I don't think "now I will do some servant leadership", I just ask "How can I help you?" and "Who needs to be involved in this design or decision?". Being surrounded by people who knew vastly more about NeSI than I did, there was no way I could make better decisions than them anyway.
I didn't care what decisions people made. I cared that they made them, and how they made them. I cared that people knew what to decide next, decided it with a clear rationale, documented that, and got on with the next one.
Could the organisation and its system both be more modular?
I noticed that the organisation was a bit like the system itself: to get things done, almost everyone in NeSI needed to talk with almost everyone else at some time. Communication webbed across Slack channels, Zoom calls and chat, email, Drive, Jira, Confluence and GitLab. Many corners of the system seemed to have their own expert, with deep knowledge gained over time.
The exceptions to this were Director Nick Jones and Solutions Manager Blair Bethwaite. Both have prodigious knowledge right across NeSI's systems, and are seemingly everywhere all at once. Together, they make an incredible contribution to keeping all the parts and people connected.
Now I am no IT architect, but something inside me said surely this system didn't need to be monolithic. And how could teams collaborate on it, if it was?
Experts like Manuel Pais and Matthew Skelton say the shape of the system will follow the shape of the organisation. Even if you just start talking about the system as if it were modular, people will organise around those modules, interact with people organising around other modules and the system will gradually assume the same form. I quietly coined the term WYDIWG: What You Define Is What You Get.
I did not expect that to come easily. It is seldom stated but I believe the reverse of Conway's law is also true. The shape of an organisation is constrained by the shape of its system. If everything in the system is in some way connected to everything else, then to maintain it, the organisation has to be like that too. Any change is going to be chicken and egg.
Even if the organisation and system could not become more modular, we needed some way to break the system into chunks so that we had something to arrange into a roadmap, backlogs and documentation.
But where to start untangling this large bowl of spaghetti? There were many angles from which to observe it, none of which afforded a glimpse of the loosely linked elements I was looking for. I decided to try something, run it past people, learn and repeat. As a guide, I looked for clusters of elements that had a common group of users, all trying to generate a similar outcome.
Gradually, I began to find three main clusters of things that were changing: services that researchers use, back-end infrastructure, and the various services connecting those.
A roadmap for outcome-aligned collaboration
As the chunks of system shimmered into view, I began to map out some paths forward, and seek feedback on them.
The first roadmap was as sketchy as my minimal understanding of the system, and yet somehow even something sketchy could test assumptions. Using Jira and Mural, often linked, the roadmap evolved over my time at NeSI.
By drawing strands from delivery boards to the overall roadmap, the whole journey, and how to contribute, gradually became more visible to everyone in NeSI. The roadmap only partly mapped to the various teams and Jira boards that already existed, but it showed that mapping. People began to see the links between their small piece of work and the wider picture. And the wider roadmap linked to the progress happening with those small pieces.
Towards the end of my engagement, the roadmap even began to provide a means to measure velocity, and thus, with some evidence, predict future delivery. It also provided some ideas about how teams might organise more clearly around the work – not just for the refresh but for its future of continuing evolution.
The map was neither perfect nor done. It was enough, many told me, that they could see where they were heading and how to contribute to getting there.
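The arithmetic behind predicting delivery from velocity is simple, and a small sketch may help. The numbers and function names below are made up for illustration and are not NeSI's figures. It assumes work items are roughly comparable in size, which, as noted earlier, was not always the case.

```python
import math


def average_velocity(completed_per_iteration):
    """Mean number of work items completed per iteration so far."""
    return sum(completed_per_iteration) / len(completed_per_iteration)


def iterations_remaining(remaining_items, completed_per_iteration):
    """Estimate how many more iterations the remaining work will take."""
    velocity = average_velocity(completed_per_iteration)
    if velocity <= 0:
        raise ValueError("no completed work yet, so no evidence to forecast from")
    # Round up: a partly used iteration still has to happen.
    return math.ceil(remaining_items / velocity)


if __name__ == "__main__":
    history = [4, 6, 5]   # hypothetical items completed in three iterations
    print(iterations_remaining(30, history))
```

A forecast like this is only as good as the history behind it, so it gets more useful as more iterations are recorded against the roadmap.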
Agile rituals, baby steps and general positive advocacy
As the roadmap evolved, the refresh-wide rituals that I facilitated took on more structure. We could focus on the work, its progress and dependencies. Though progress seemed slow, we could see where it was occurring, and the priorities to address next.
I often thought of my role as "the blind leading the sighted". Though I learned an incredible amount in a short time, it was others who had the technical expertise. My contribution was opportunities to collaborate. And my naive provocations were often enough to precipitate progress.
I convened meetings, conjured agendas and crafted diagrams, documentation and work items. I monitored and prompted for progress. One person described my role as "General Positive Advocate".
I saw it as my job to ask hard questions, too. Many prompted new ways of thinking and talking about the work. I learned the hard way about the line across which hard questions can hurt.
One thing I successfully established was collaborative work sessions. People were planning and reviewing the work together in rituals, but when would the work actually get done, especially when decisions needed wide input? Despite jam-packed calendars, I started regular work sessions with a variable agenda. People put them in their calendar, knowing that they may or may not end up attending. Soliciting and then signalling the agenda in advance meant they could attend if they were a key contributor, or were otherwise interested – or have the time back.
Thank you, kanban change management principles
In the highly structured SAFe environment at ACC, where I previously worked, I thought Scrum worked pretty well, and never quite saw how kanban would fit. At NeSI, agile approaches were much more organic. Teaming up with Agile Portfolio Manager Matt Chamberlain, I turned to kanban principles such as "Start with what you do now", "pursue improvement through evolutionary change", and "encourage acts of leadership at all levels".
Waving the kanban flag at team level, it was hard not to be opinionated. I saw efforts at Scrum falling well short of enough to get any value from it. What would happen if you just prioritised those work items, and got them out the door until something trumps them?
The work follows the teams follow the work
We all know that organisations change slowly – especially under delivery pressure, while managing BAU. I saw no widespread adoption of my ideas, even the baby steps.
As I wrote a year ago about my experience at ACC:
the organisation, and the teams will rub along more or less the way they will rub along. Their culture and ways of working are way more influenced by the people and the system around them than they ever will be influenced by me
But I did see enough adoption to keep going, to keep trying new things – to keep advocating for the vision, for progress, and for collaboration. And that is the work.
Also undented is my faith in the homomorphism between an organisation's communication patterns and the system it builds. If we want the world to be better, we have to build better things. To build better things, we have to organise in better ways, and that's hard. It takes leadership and facilitation, and it takes mapping the way ahead, aligning delivery with strategy and aligning strategy with outcomes for all.
Senior Digital Workplace Specialist | Facilitator | Capability builder | Optimist
1 month ago
Dan Randow - as you've done for many years, you think and contribute and share. LEGEND. Several things stood out for me "In software and organisational change, my view of a steel thread is that is it both technical and social." is a super perceptive quote. And a thread (pardon the pun) that you wove through the entire project. Thanks so much for sharing.
Freelance Frontend Engineer
2 months ago
Sounds like an amazing experience! I loved the steel thread analogy
CEO at Conflux / Co-author of Team Topologies - Disrupting transformation via Team Topologies, fast flow, and Adapt Together
2 months ago
Fascinating read, Dan - thanks for sharing. It was good to read that the book Team Topologies inspired you and helped navigate the challenges. I'd love to learn more!
Founder of Accelerant.dev. On the planet to build a better planet.
2 months ago
I'm sure that the team deeply appreciated your time. Thank you on behalf of New Zealand's research community.
Research, Innovation and Leadership for Positive Aging
2 months ago
Great read Dan. Go well landing the next mission! And Meri kirihimete!