VanessaSaurus

Treading Water

Tue, 14 Apr 2026 00:11:50 +0000

There were once a man and a woman, and they were in a big, beautiful, and deep sea. It was the color of nautical twilight – a deep purple that was stained with elegance, memory, and the softness of a heartbeat. They were treading water. It is unclear why the man was treading water, but the woman chose to stay with him.

She could not touch him, and could not help. But she chose to tread water beside him to make sure he knew he was not alone. She wanted to bring substance to what would otherwise be empty. What is heavy is not always burdensome if it carries meaning. What is light is not always freedom. The heaviness of the task was the ultimate fulfillment. The lightness of the quiet was only lonely when the lapping of the water declared its presence.

He is treading and waiting for something different. Maybe a boat. She could not be a boat. And so she waits with him, also treading water. At some point, she will become tired and be forced to return to shore. But she will not regret the time spent next to him. And the story is OK, because she does not want to be a boat. She will return to the shore, and walk along it collecting shells. What is important is not an ultimate state of finding a boat. What is meaningful is making the choice to wait, floating not in water but laughter, and knowing that the joy and strength reside within her.

Observation

Fri, 10 Apr 2026 00:22:00 +0000

A good life is a balance of different kinds of thinking. We have our preferences for thinking style and content, and mine tend to vary between technical and behavioral, where often I am experiencing a gradient of both but leaning stronger in one direction. I have been leaning technical for over a month now – starting with the HPSF Conference in Chicago, and returning and becoming completely immersed in not one, but two papers that led to simulation experiments for which (ultimately) I ran over 20K simulations for agentic job negotiation and selection paired with resource provider discovery software, a hierarchical server architecture to orchestrate the work, and algorithms and methods to test it all out. It was an exciting and full set of weeks! However, today I would like to talk about the other side of my dimension. I want to talk about observation, acceptance, and learning. Let’s dive in.

Going off a Cliff

You have a dream where you are in a car, and in a forested and quiet area in the night. It’s a beautiful car, but it does not behave as expected when you push the accelerator. It goes backwards and then forwards at high speed. You find yourself going off the road, through the trees, and plummeting off a cliff. There is a moment of weightlessness and time in a standstill. How do you feel?

A younger (or real-world) version of you would be terrified. This is the point in the nightmare when you wake up, and are a mix of relieved and upset. Today, you are overtaken with calm. You still wake up, but you are not fearful or experiencing a negative emotion. You are curious. Why?

Acceptance Before Experience

The calm results from adopting acceptance before experience. It has resulted from years of meta-cognition to understand yourself and the world. Let me explain. When we are younger, we are reactive. We experience the world, ruminate, and either react or choose to suppress. A child will likely react in the most behaviorally logical way. When you touch a hot stove and it hurts, you cry out. A young adult might learn the social fabric wrapped around their reactions, and learn to suppress. They touch the stove, understand that it is “weak” to show pain, and hide it. Many of us do not progress beyond these two states of reacting or suppressing. If you watch the famous scene from American Chopper (the one that resulted in the meme, yes) you will notice the second, younger man go from a state of expressionless shock into immediate reactive anger. We know well from the meme that it turns into a yelling match and he storms out. He reacted. What are alternative endings for this interaction?

He might have suppressed. He might have decided to internalize his anger, like a packet of negative energy forced into a tiny pocket and shoved deep inside his chest. This is a strategy that works in the short term because it allows you to handle a hard situation. Often quick, emotional reactions can be unpacked later, and then discussed when emotions have cooled down. The problem is when we never unpack. We continue filling up our internal pockets until every crevice is stuffed. And then the (typically) negative and unprocessed emotions explode out. Exploding pockets are collections of so many life experiences they no longer have specific direction, only valence. We direct the energy inwards to ourselves, and onto others. It is the reason that someone might seem to unexpectedly explode, and the magnitude of the reaction often does not make sense.

The Alternative of Observation

The other alternative is to observe. Observation is based on a pattern of experiencing the world, ruminating, understanding thoughts and behavior, and then making a decision. The nuances of the decision are what set people apart. You might decide on a strategy of control. How do I control my environment so this does not happen again? Does it mean avoidance? Does it mean setting a boundary? As a young woman I practiced this a lot. My negative emotion of choice was anxiety, and it was primarily social. A bad social experience could simply be nullified by avoiding people. If I had a bad experience traveling, I could stay at home. I can understand how agoraphobia develops. What happens with this mindset is that we inadvertently create boundaries to protect ourselves. It’s a false sense of protection. It’s a self-inflicted cage. The safety measures we impose prevent us from living fully, from being authentic and vulnerable with others, and connecting meaningfully. Our life experience gets smaller. That is, of course, what happens when you live in a box.

There is no Control

The insight is that it is not possible to have control. Events occur with different valences and probabilities of happening again, many of which are outside of our awareness. Logic can help here too, and we can identify our own inconsistencies. Look up the probability of dying in a plane versus a car crash, and then think about the prevalence of being afraid of driving versus flying. It is not logical. We have a false sense of control while driving because our hand happens to be on the wheel. We cannot control events, nor can we inflict control on the thinking or behavior of others.

Let’s start again with the sequence. We are hit with life experiences, we ruminate, and possibly get stuck in our heads. We do not choose reaction, suppression, or avoidance. For each new experience, we start to recognize patterns. It might be that we are defensive for a particular topic. Is it because we have an insecurity about it? Where does that insecurity stem from? Is the foundational belief that underlies it true? We unwrap our (typically) unconscious thinking to form understanding, and find that we react less. The experiences feel less directed at us, and more like streams flowing by us. Our default state is one of curiosity. We are not suppressing. We are understanding. The world is no longer dangerous, but interesting. We start to notice more, process, and learn about ourselves, our needs, and create hypotheses about the underlying mechanisms of what we observe. We become better at applying similar logic to others. We notice more, to the point we are no longer resorting to impulsive reaction. We stop trying to control. This is meta-cognition – the ability to step outside of ourselves, observe, and critique our thoughts.

The epiphany that many of us might have as young, ruminating adults is that there is no control. However, there is wisdom in the Serenity Prayer. Lack of control does not mean lack of choice. The degree to which something affects us is a choice. We can choose to stop reacting and trying to control our environment, and just observe it. Observing becomes the primary state of being, and we accept things before we have experienced them. Their existence or outcome is separate from a sense of self. The events that happen around us do not define us. The car goes off the cliff, and we observe the experience. We have accepted it before it happened.

Unintentional Impact

Marcus Aurelius talks about the soul being dyed the color of our thoughts in his personal writings. I read this, and I loved the metaphor. The practice of observing our experiences and changing our perspective can color the mind a more beautiful hue. It builds on itself as we face the world with more joy, and it reflects back to us.

There are further benefits. The first is appreciation. We often do not actively see the things that are right in front of us. We forget that we are wearing a watch or bracelet because it is biologically more efficient. Or perhaps we do not appreciate what is on our wrist because we are perceiving a future better model. We do not realize the value until our wrist is bare. We notice absence of a previously fulfilled need more strongly than its presence. And that is the basis of nostalgia – a realization of appreciation that is only possible when the object or person is no longer present to appreciate. This is the benefit of observation. It reveals the details. We move from a mindset of wanting what we don’t have to realizing what we have is exactly what we want.

To go back to American Chopper, the alternate ending to that scene is that he observes, he listens, says OK and stands up and walks out. He understands his own needs and what is happening in the environment, accepts it, and decides it is time to walk away.

Closing

It is an admirable goal to practice being an observer. The choice of how to react (control, suppress, or observe) is the first level of a hierarchy that we choose from. The degree to which we are able to, in real time, observe and change our thoughts is meta-cognition. When our perception is not anchored in fear, we can live more fully.

I Love You, All of Your Pieces

Sun, 22 Mar 2026 00:22:00 +0000

We are hard on ourselves. We can consistently push our bodies and minds, and put others first and strive for compassion and integrity in our lives. But we often forget ourselves. Or we are forced into hyper-independence that makes us long for the care or validation of someone else. We are often told by the very people we love that we are not enough.

What do we do? We take responsibility for ourselves and the power of our words. We love everything a little bit more knowing it will not last. If time is subjective, we decide to perceive the details of a moment. We start to see that some light only reaches us because something got in the way. We realize that others being incapable of meeting us even some of the way, or providing the intimacy we crave, does not mean we are too much, but that we have a higher resonance that requires an equivalent match. We realize that we are the protagonist of an exciting, wild life story. Our silence and space become intentional, and our moments of fragility elegant. We find forgiveness for all that we see wrong in ourselves, and realize that acceptance only comes when we stop fighting reality. We see the essence of life’s duality, that perfection does not exist without imperfection, and light does not exist without darkness. We decide to be a light.

Regardless of how anyone judges you, you get to decide your definition of strength and beauty. There is beauty in our scars, and imperfections. They quietly tell the story of our lives. You get to decide that you have worth, leaning toward people that support you to grow, and walking away with dignity from those that do not respect it. You get to decide to flip every conventional love song on its head, and love all of your pieces. You are the strongest person you know. Your strength does not need to be conventional, nor does it need validation by anyone. You become your decisions. You can find false protection by trying to control others and your environment, or true strength by learning to sit comfortably in the unknown.

Never give up on yourself.

I am not trained to sing (and I am not very good) but I enjoy it as I do dancing, and I recorded this thinking of someone that I love. I hope you appreciate it.


This song was always special to me. It tells a story of finding magic and story in life, in the small things, always. I often wish that I could share my view of the world with people that I love, because it feels like a beautiful place.

The Fifth Decade

Fri, 02 Jan 2026 00:09:00 +0000

Happy 2026! Let’s start the year off with some dancing. 💃 This year I promise:

To continue to live with authenticity and integrity.
To be inspired, diving deeply into ideas and learning.
To adventure, and look for beauty in unexpected places.
To not make myself smaller to make others comfortable.
To push myself physically and do things that are hard.
To age with elegance, and appreciate myself as I am.
To believe a person when they show me who they are.
To dance, laugh, and be curious about the world.
To prioritize self-care and rest.

To little, funny-looking goblin me: you finally grew above 4 feet. You’ll feel out of place for a long time, and life is going to be hard when you enter adulthood. You’ll grow from it, and you’ll be OK. You will learn that beauty and value are within you, and Sovereign. They do not require validation or being chosen. They exist because you are a person who notices the world.

This closing year, 2025, had a lot of adventures with dancing, Flux tutorials, running and biking and beautiful places!

Onward to the adventures of 2026, still as a goblin, but taller! 💪

Gifts

Mon, 22 Dec 2025 00:00:00 +0000

Presents are interesting to think about. It’s an exchange of a physical item as if to say, “I value you.” And don’t get me wrong - they can be fun, and informative about a person or relationship. I’ve historically been someone that puts time into thinking about the right gifts for specific people. That goes back to sewing pajama pants and blankets in high school to custom gifts for friends and family to (as I’ve gotten older) niche food or experiences I know the person will enjoy. I have spent upwards of a year carefully preparing (what I deem to be) highly meaningful gifts. Experiential gifts are also a lot of fun, especially for a group (e.g., a pinata or food item to enjoy together). The exception to that is things that the person really wants or needs, which is something else you can do well when you know someone well. Receiving a gift of something you really like (and will use) that you maybe would not get yourself is the sweet spot (e.g., high end chocolate).

Where it starts to go wrong is expectation. When birthdays or holidays roll around and you are expected to send a gift, what do you do if it isn’t in your heart? What do you do if there isn’t anything the person needs, or if you just don’t feel like it? There is too much expectation that is created by advertisement and a generally consumerist society. It starts to feel bad to feel trapped into having to send something. It feels equally bad to receive something that you feel was forced or actually indicates that the person doesn’t know you at all. And let’s take this a step further. Gift giving, if normally done, when withheld, turns into a pathological means to express a lack of approval. The quality or quantity becomes a point of comparison, and I’m guilty of making this comparison and feeling like I wasn’t valued quite as much.

As I’ve gotten older, I’ve realized something important. The most meaningful gift that someone can give me is their time. Time is limited, and precious. It reflects intellectual curiosity, openness to experience, and choice. Two people choosing to share time is a mutual desire and not an obligation. There is no need to assert value with words, because it is shown through action. There is a mutual shared value of conversation, which typically comes down to emotional and/or intellectual connection. I know I’m valued not because I receive an obligatory present in the mail, but because I just spent many hours with someone engaged with me, laughing, and my internal cup of connection is overflowing.

I will still enjoy (and give) presents, but I no longer feel an obligation. And I’m also investing more in myself, both in terms of self-care and tangible items. My cup has overflowed this year, and I’m grateful to all the people that have been a part of that. That said, when you receive a present from me, know that it comes from my heart. When you spend time with me, you probably already knew that. ❤️

You may also have noticed that I have not written as much this year. The reason is that the year has been too rich to want to spend the time. I am finding that my adventures, whether they be to new cities, intellectual dives into ideas with colleagues, balanced rest, or to scale a mountain with my best friend, are how I want to spend the time. I am branching out, taking care of myself, and expanding my world in ways I never did before, and possibly a lot of the ideas that wound up in writing here are being spoken in a much more engaged context. It was a year of adventure, learning, and growth. On to another in 2026! 🥳

Agentic Orchestration of an HPC Workload in Cloud

Fri, 21 Nov 2025 09:00:00 +0000

One of the most satisfying and learning-rich pieces of work from this year is represented in this white paper, “Agentic Orchestration of HPC Applications: A study using Google Gemini in Cloud.”

First, I’ll provide a little bit of back-story. We were using simple models to convert JSON job specifications or batch jobs between formats. I had a sense that the agents (specifically, Gemini) could do much more, and dove in. At first I was not sure the agent could successfully build a Docker container. It did. And then I was not sure about deployment and optimization in Kubernetes. That worked! Of course, there was a lot of nuanced detail with respect to how the orchestration was done, and how I (the human) interacted with the agents as a team. The learning from this early work is represented in this white paper.

In summary, we used an agentic team (with Google’s Gemini) to build, deploy, optimize, and run scaling studies for HPC applications in Kubernetes. Work is underway (and most of the software done) to do similar experiments using AutoGen, LangChain, and a more formalized state machine design with Model Context Protocol (MCP). This work is immensely exciting because we have more ideas for extending agents to scheduling, topology, and job design. We released this as a white paper since we wanted to extend it before any kind of journal submission, and (for me) I care more about sharing the work than getting it into some high-end venue.

Rootless User-Space Kubernetes with GPU

Sat, 22 Feb 2025 00:00:00 +0000

This is a first prototype to get GPU devices working in User-space Kubernetes, or Usernetes. I am calling it a prototype because it is a “works for the first time” setup and will be improved upon. For our use case, we will be testing and using it on clouds that have NVIDIA GPU devices; however, we will need to support other device types in production, and this will be future work. I want to create this write-up while everything is fresh in my mind, because I just had 2.5 days of working through the complexity of components, and learning a lot.

A bit of background

We want to test the ability of User-space Kubernetes (“Usernetes”) to run a GPU workload, and compare between Kubernetes (as provided by a cloud) and the equivalent user-space setup deployed with the same resources on the VM equivalent. Google Cloud has excellent tooling for deploying GPU and installed drivers for GKE, so I was able to get this vanilla setup working and tested in under an hour. The setup of the same stack, but on user-space Kubernetes on Compute Engine deployed with a custom VM base on Terraform, would prove to be more challenging.

I’ve designed various driver installers for previous work, including InfiniBand on Azure Kubernetes Service and more experimental ones like deploying a Flux instance alongside the Kubelet. NVIDIA GPU drivers are typically installed in a similar fashion, in the simplest case with the nvidia device plugin, but now NVIDIA has exploded their software and Kubernetes tooling so everything but the kitchen sink is installed with the GPU Operator. Getting this working in user-space was uncharted territory, because we had two layers to work through - first the physical node to the rootless docker node (control plane or kubelet) and then from that node to the container in a pod deployed by containerd. Even just for the case of one layer of abstraction, I found many unsolved issues on GitHub and no single source of truth for how to do it. Needless to say, I wasn’t sure about the complexity that would be warranted to get this working, or if I could do it at all.

Resources and Cloud

For this environment, we are working on Google Cloud, and specifically with V100 GPUs, because I can get them in very small numbers (on the order of 1-4 per node, and for a few nodes). Developing with a few GPUs on a node comes at a reasonable cost, about $12.00/hour (for reference, the cost of each GPU is $2.48, and then the corresponding instance is ~$1.75). This is a good example of how tiny bits of resources can go a long way if you are a developer, and (personally speaking) I like clouds best for development environments that I can control over all other use cases. I needed these up for a long time for development, and created easily 50 different setups over a few days. When I had nodes up for most of a day, the total cost was about $150.00. When I realized I would need to do a lot more work, I cut down the number of GPUs per node to 2 (my pytorch workflow has a master and worker).
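
As a rough check on those numbers, here is a minimal sketch of the per-node arithmetic. It assumes the quoted prices are hourly and that the ~$12.00/hour figure corresponds to a single node with four GPUs (neither is stated explicitly above); the function name hourly_cost is only for illustration.

```shell
# hourly_cost GPUS GPU_PRICE INSTANCE_PRICE
# Prints the estimated dollars per hour for one node.
# Assumption: both prices are per hour and simply add up.
hourly_cost() {
  awk -v n="$1" -v gpu="$2" -v vm="$3" 'BEGIN { printf "%.2f\n", n * gpu + vm }'
}

hourly_cost 4 2.48 1.75   # four V100s plus the instance
hourly_cost 2 2.48 1.75   # the reduced two-GPU layout
```

Under those assumptions the four-GPU node comes to 11.67 per hour and the two-GPU node to 6.71, which is in line with the "about $12.00/hour" estimate.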

Virtual Machine

When I first started, I took a strategy of using what Google provided. When you select the V100 and navigate to OS, it gives you an option to select one of their ML optimized images. These images are very old (Debian 11 is the newest, which I think dates to 2021) and they only go up to CUDA 12.3. I thought that would be OK to start, but in retrospect it made the environment more error prone. I had to remove and reinstall docker as rootless, and there wasn’t transparency about how the initial Debian base was customized. A good strategy for building these images is to start from an empty slate to the highest extent possible to maximize transparency of what changes have been made.

What ultimately worked was to start with an ubuntu 24.04 image and install my own drivers and CUDA, and then I could choose versions selectively (CUDA 12.8, and I seem to remember the driver version being used was 560.xxx). I was a bit nervous about this because the recommendation was lower than that for the V100 on the n1-standard family, but their provided ML image wasn’t working for me so I had an open mind. You can see the driver install commands here.

Usernetes

The install of Usernetes was typical. You need to enable several kernel modules, cgroups v2, and install a rootless container technology. I chose rootless docker, although on HPC systems you would be forced to use podman. I also set most limits (e.g., nproc, memlock, stack, nofile, etc.) to unlimited.

One gotcha in this setup that is specific to Google Cloud is how logins to machines are handled. You will typically get OS login, or otherwise log in as your email / username. The problem with this is that you get assigned a really high id, and one that isn’t present for any ranges in /etc/subuid. What happened for me on the first day is that rootlesskit was failing (somewhat silently, or at least I missed looking in places to check for it) so I was running rootful docker. The problem was not only that uidmap wasn’t installed, but that the user didn’t have a range. It was actually the debug output of nerdctl that I tested on a whim that pointed me to the issue with the setup, and shout-out to Akihiro for, again, excellent work. I decided to use the default ubuntu user, with id 1000, akin to what I did on AWS and Azure.
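
A quick way to catch this before rootlesskit fails quietly is to check that the login user actually has a subordinate id range. This is a minimal sketch, not what I ran at the time: the function name has_subid_range is hypothetical, and it only matches entries keyed by user name (the file can also be keyed by numeric uid).

```shell
# has_subid_range USER [FILE]
# Succeeds if USER has at least one "user:start:count" entry in FILE.
# FILE defaults to /etc/subuid; pass /etc/subgid to check group ranges.
has_subid_range() {
  user="$1"
  file="${2:-/etc/subuid}"
  grep -q "^${user}:[0-9][0-9]*:[0-9][0-9]*$" "$file"
}

# Example use on a VM (commented out so that sourcing this has no side effects):
# has_subid_range "$(id -un)" || echo "no subuid range for $(id -un)" >&2
```

If the check fails for an OS login user, switching to a user that already has a range (as I did with ubuntu) is the simpler path.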

Rooty Docker!

Using “ubuntu” poses a bit of an issue for ssh-ing in. The “gcloud” client is not going to easily allow ubuntu. What I needed to do was first ssh in with my os login, add a public key for my machine to /home/ubuntu/.ssh/authorized_keys and then ssh in as ubuntu. For the terraform setup that doesn’t expose an ephemeral IP, I needed to edit the instance with the control plane, to add an ephemeral IP for ssh.

As a side note - it wasn’t hard getting GPU devices to work with rootful User space kubernetes (yeah, that doesn’t make sense, does it?). I couldn’t use this setup, even as a mock, because rootful breaks usernetes on a multi-node setup. I was able to create a GitHub CI test that reproduces the issue, and hopefully it will be fixed soon! I’m thinking it’s probably related to a rootful docker not properly working with slirp4netns, but I am not an expert there and haven’t looked into it.

The last customizations for docker needed on the host VM were to install the nvidia container toolkit “nvidia-ctk” and configure it to use the docker runtime, and with CDI (the container device interface). For this step I allowed it to generate devices for the development machine I was on, and note that these need to be regenerated when that VM base is used for Terraform.

Docker Compose

Most of the issues on GitHub and instructions for rootless docker and NVIDIA GPU had one indirection in mind - getting the devices on the host to show up in a single docker container. We have two indirections, because we need to map the host devices into a node on the VM running the kubelet, and then that node (a rootless docker container) has containerd that needs to further pass those devices to containers running in the User space Kubernetes cluster. This means we need to solve the problem twice, and essentially have every component in the stack (e.g., the nvidia runtime config file and nvidia toolkit install) duplicated. I found a lot of GitHub issues (here is one open since 2023) that would suggest setting no-cgroups = true in the nvidia container runtime config at “/etc/nvidia-container-runtime/config.toml” and I did try that, but found that it failed after the second indirection.
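
For reference, the no-cgroups suggestion from those issues amounts to a one-line change in the runtime config. The sketch below is only an illustration of that edit, and as noted it was not sufficient on its own for the second indirection. It assumes the file already contains a no-cgroups line (commented out or not); the function name set_no_cgroups is hypothetical.

```shell
# set_no_cgroups FILE
# Rewrites any existing (possibly commented-out) no-cgroups line in FILE
# to "no-cgroups = true". Assumption: such a line is already present.
set_no_cgroups() {
  file="$1"
  tmp="$(mktemp)"
  sed 's/^[#[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = true/' "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example use (needs root for the real file, so it is left commented out):
# set_no_cgroups /etc/nvidia-container-runtime/config.toml
```

Writing back with cat rather than mv keeps the original file's ownership and permissions.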

A Gotcha with the Nvidia Runtime

Many instructions say to tweak the nvidia runtime to have “cdi” enabled as a feature, and then point to the nvidia-container-runtime executable for the runtime. In fact, there is a command to easily do that. What I realized was that with rootless docker, it wasn’t picking up the default location of the daemon.json, or where the user space one was expected to be. I looked into the service and found that it was running “/usr/bin/dockerd-rootless.sh” and tweaked the entrypoint of that file to explicitly target the config, like this:


  exec "$dockerd" "--config-file" "/etc/docker/daemon.json" "$@"

That was a manual change I had to make on the VM (that is saved as the base image for Terraform). It’s important to validate that the nvidia runtime is present (detected) along with rootless before moving forward:

  docker info | grep -i runtimes
   Runtimes: io.containerd.runc.v2 nvidia runc

  docker info | grep root
    rootless
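
If you rebuild the base image often, the two checks above can be folded into one small guard. This is a sketch under the assumption that `docker info` prints a "Runtimes:" line and the word "rootless" as in the output shown; the function name check_docker_info is hypothetical.

```shell
# check_docker_info
# Reads `docker info` text on stdin and fails unless the nvidia runtime is
# listed on the "Runtimes:" line and the daemon reports rootless.
check_docker_info() {
  info="$(cat)"
  if ! printf '%s\n' "$info" | grep -i 'runtimes:' | grep -qw 'nvidia'; then
    echo "nvidia runtime not registered" >&2
    return 1
  fi
  if ! printf '%s\n' "$info" | grep -qw 'rootless'; then
    echo "docker is not running rootless" >&2
    return 1
  fi
  echo "ok"
}

# Example use on the VM:
# docker info | check_docker_info
```

Reading from stdin keeps the function easy to try against saved output without a running daemon.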

You should also test the nvidia runtime before moving forward. You should be able to use it with a vanilla ubuntu container and have “nvidia-smi” working and seeing devices!

  docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
  GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-798e9725-623d-ca7f-f15d-b1908ec8bb0d)
  GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-be5719da-cd52-8a40-09bb-0007224e9236)

docker-compose yaml

The tweaks to the default usernetes docker-compose.yaml are minimal. I had tested adding permissions (caps, for example) but ultimately just needed to specify using the nvidia runtime, and then the list of devices. You can see the setup here, and note that I think (but have not tested) that adding “devices” vs. the “deploy” directive do the same thing. Note that if you try to start the control plane (or a worker) with “make up” without using the nvidia runtime and asking for devices, without the “no-cgroups = true” you will get an error, specifically this one about bpf_prog_query with failed permissions. That is another issue that has been open since 2022. 🙃

Usernetes node

A Usernetes node can be the control plane or a worker. The general procedure for the control plane is to bring it up, run kubeadm init, install flannel, make the kubeconfig, and then prepare a join command to send to workers. The worker nodes also need to be brought up with the same setup, but then they just need to have the join-command (it’s a text file that is executed as a command for kubeadm join). The additional step I needed to add to this was a Makefile command to “make nvidia” that would set up CDI to be used inside the node.

 .PHONY: nvidia
 nvidia:
	$(NODE_SHELL) nvidia-ctk system create-dev-char-symlinks --create-all
	$(NODE_SHELL) nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml --device-name-strategy=uuid
	$(NODE_SHELL) nvidia-ctk cdi list
	$(NODE_SHELL) nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
	$(NODE_SHELL) nvidia-ctk runtime configure --runtime=containerd --cdi.enabled --set-as-default
	...	
	$(NODE_SHELL) systemctl restart containerd

In the above, we create a set of symlinks that I found were needed in practice, because I would see errors if they didn’t exist. For the GPU operator, I found that there was an environment variable in the validator that needed to be set to disable trying to make them, which would fail in usernetes with a permissions error. I didn’t wind up pursuing that path further (using the GPU Operator) because it was highly error prone, and made changes that led to a broken state for Usernetes. We are also generating the cdi file “nvidia.yaml” in “/etc/cdi” and setting the nvidia container runtime mode to use it. Finally, we are configuring the nvidia container runtime to work with containerd, and (still) with CDI enabled. The sed commands (not shown) are uncommenting and enabling different settings I found that would (at least at face value) possibly help in rootless mode. Finally, we restart containerd.

It took me a lot of testing and learning (I have no experience with CDI or working with these tools beyond installs of the nvidia device plugin that have just worked on clusters in the past) to get to the above. You can see the full Makefile here. At this point, we have the node configured also with the nvidia container toolkit, and containerd updated (and restarted) to use it.
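
To make the order of operations concrete, here is a sketch of the bring-up sequence as a dry run. The target names other than “nvidia” are the ones I understand the upstream Usernetes Makefile to provide, so check them against your checkout; where “make nvidia” sits in the sequence is an assumption for illustration, and the helper names run, control_plane_up, and worker_up are hypothetical.

```shell
# run CMD...
# Prints the command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Control plane: bring up the node, set up CDI inside it, then initialize.
control_plane_up() {
  run make up
  run make nvidia            # added target described above; placement is an assumption
  run make kubeadm-init
  run make install-flannel
  run make kubeconfig
  run make join-command      # writes the join-command file to send to workers
}

# Worker: same node setup, then join using the copied join-command file.
worker_up() {
  run make up
  run make nvidia
  run make kubeadm-join
}
```

Calling control_plane_up with the default settings only prints the steps; set DRY_RUN=0 inside a Usernetes checkout to actually run them, and copy the join-command file to each worker before calling worker_up there.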

NVIDIA Device Plugin and GPU Operator

It’s typically easy to apply the nvidia device plugin to have devices detected and working. This gave me quite a bit of trouble, because at first (when I wasn’t using CDI) it only detected anything when I specified “tegra.” That would have the labels show up on the nodes, but then when I tried to create pods they would fail not knowing what tegra is (and understandably, that’s the wrong setup). Changing it to use nvml would fail to find the library, and “auto” didn’t work at all (at least at the onset). Before I got the CDI just right I went through half a day of going back and forth between using (and trying to tweak) this yaml file and testing the GPU operator, and found a lot of really weird states.

Several times I could deploy the first, have the GPUs show up, but then fail in the cluster, and then apply the GPU Operator. A few times that seemed to work, and other times (most times) it just led to more errors, and not even getting so far as to get labels for the GPU. I don’t know how this worked once, but when I tried to reproduce it, I would get containerd operation not permitted errors, along with an error about a PID. There were at least 5 times when something would work, and then I would save an image of the VM with the changes, bring up the Terraform setup to reproduce, and reach the “moment of truth” with deploying pytorch and be faced with another new error. That usually felt bad. 😞 My best guess based on this work is that we were having interactions between components and slight differences in GPU operator components coming up that led to inconsistent state.

What I ultimately decided is that the GPU Operator was too complex to understand easily. I tried customizing the values.yaml install with helm to disable un-needed components (for example, I don’t need MIG here to split GPUs, and the V100s don’t support that anyway) to try to simplify (and make it understandable) but my intuition told me that it was too complex. I didn’t like that it seemed to break the setup and give me inconsistent states, and there were so many init containers and dependencies that it wasn’t clear if there were race conditions. All I can say is that in the rare cases something started working, it didn’t reliably reproduce in this rootless setup. In several of those cases it worked for a first run, and then broke on subsequent runs with new errors. This is what led me to not use it, and focus on the details of the CDI and the simple device plugin daemonset deployment. That ultimately worked like a charm, despite not being recommended for production setups.

The Application

The ML app wasn’t without issues. Specifically, the entrypoint for the master or a worker might look something like this:


  python3 /opt/pytorch-mnist/mnist.py --epochs=10 --backend=nccl --batch-size=128

I hit several errors about not finding GPUs, or (in the case of rootful Docker with usernetes) the networking never working. I had a mangly device error that was resolved when I updated the container to the latest version (now 2 years old). Another unexpected issue was with respect to data. I had prepared data to use from the old container, and when I tried to use it with the newer version, it wouldn’t validate and would try to download. Given that the download links weren’t working at the time, I couldn’t run anything. I had to ensure that the data matched the container. There is more on that here. We ultimately built our own container with the data to ensure it is available, so we won’t spend time during our experiments downloading it. Side note - in that exercise I learned that I could convert a Python egg to a zip, unzip to explore, and then make changes and repackage into an egg! YOLO!
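To make the "data must match the container" point concrete, here is a minimal sketch of a pre-flight check that compares files on disk against a table of expected checksums before any training starts. The function names, the file names, and the choice of SHA-256 are assumptions for illustration; the digests that the data loader inside a given container actually expects would need to be looked up in that container.

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(data_dir: Path, expected: dict[str, str]) -> list[str]:
    """List files that are missing or whose checksum differs from expected.

    `expected` maps a file name (hypothetical here) to its hex digest.
    An empty result means the data on disk matches the table.
    """
    problems = []
    for name, checksum in expected.items():
        path = data_dir / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif sha256sum(path) != checksum:
            problems.append(f"{name}: checksum mismatch")
    return problems
```

Running a check like this while building the image (and failing the build on any mismatch) would surface a stale dataset long before the "moment of truth" on the cluster.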

Summary

The final setup uses an ubuntu 24.04 base with CUDA 12.8 and drivers 560.xxx, and a strategy of rootless docker for usernetes with the nvidia runtime exposing devices via CDI. Setting no-cgroups to true or not using the nvidia runtime will not work, either due to needing cgroups later or the bpf permissions error I noted earlier. Once in the container, we need to again prepare CDI to be used with containerd, and ensure that we generate symlinks in advance. The GPU operator results in a broken state, and the nvidia device plugin, on its own, can best expose devices on the nodes to be available to the pods. Here are a few images for posterity that show everything working. First, the nvidia device plugin:

And this is when the devices show up (just one GPU) on our tiny nodes. For context, we can’t get many GPUs on Google Cloud, so we are maximizing the number of GPUs per node, since we primarily need to test Usernetes for network.
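Since the screenshots are not reproducible in text, here is a small sketch of the same check in code: given the JSON that `kubectl get nodes -o json` prints, count the allocatable GPUs on each node. It assumes the device plugin advertises GPUs under the extended resource name `nvidia.com/gpu`; the function name is hypothetical.

```python
import json


def gpus_per_node(nodes_json: str, resource: str = "nvidia.com/gpu") -> dict[str, int]:
    """Map each node name to its allocatable count of an extended resource.

    `nodes_json` is the text output of `kubectl get nodes -o json`.
    Nodes that do not advertise the resource are reported with 0.
    """
    counts = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports quantities as strings, e.g. "1"
        counts[name] = int(allocatable.get(resource, 0))
    return counts
```

A node that reports 0 here has not had its device picked up by the plugin, which is the symptom described above when the discovery setup was wrong.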

And pytorch working - go epoch, go!

Reflecting

This process from zero to working, which lasted 2.5 days (with a bit of sleep between days), was uncharted territory, and I knew it would be challenging from the get-go. It is an experience that I must strangely enjoy? I say this because I had moments of joy and anguish, and several times in that period I “decided” to give up. But somehow (even when there were many other things I should have been doing) I found myself continually returning to the setup. It meant waking up with “just one more idea” or bringing down a setup, finally eating, and then during mind wandering deciding that I wasn’t done yet.

I don’t know if this is mental strength or just stubbornness (I think likely the latter). It’s a brute force approach, when I think about it, because I almost refuse to stop until I physically fall over. This didn’t happen the first night, but did the second night. It’s also a kind of problem that I know I won’t get help with. Not to get too philosophical, but I’ve realized through life experience that if I want something done, I need to do it. If I want change, I need to figure out the steps to take and take them. It’s easy to defer responsibility or blame, and I won’t do it. Often it’s not productive, because it accomplishes nothing. That kind of approach applies in everyday life when it comes to making choices about taking care of oneself, and also for learning and solving hard problems. Knowing this was an important problem to solve for my institution, team, and community, even the start of a solution, I felt that responsibility. I’m grateful for the inspiration and support I get from my small team to have the inner fire that drives me. It makes the work challenging, and when we solve problems together, fulfilling and fun.

All in all, I can’t say what the essential fix was, but I will say this is a complex setup. In retrospect, my advice here is to follow intuition, try to build components that you have the most transparency (and control) over, and choose simplicity over what can be software bloat for a simple use case. And on that note - I am going to leave my dinosaur cave and go outside! And then likely I really do need to come back and work on those slides, which I have been very successfully putting off for 3 days now. 😉

Reflections for 2024

Sun, 22 Dec 2024 00:00:00 +0000

It’s Friday evening, and the holiday is just beginning. I’m relaxing in the quiet of my apartment after the work day, a good nap, and although I could delve into some things I am hungry to work on, I know I have two weeks for that. I want to take a moment to reflect. This will likely take me a few days to write, and I’ll intersperse that writing with work.

A rich life is a good life

I finish this year having confidence in who I am, and what I need and desire for a rich life. I say “rich” as opposed to using the term “happy” because a complete life experience requires an entire gamut of emotional and physical experiences. Pain, sadness, loss, and loneliness are the complements to their opposites, and the most interesting people I’ve met know them intimately. It is a blessing to survive adversity because, although you may come out with scars, you come out with the ability and knowledge that you can heal. This ironically helps in software engineering as well, which has consistent challenge and uncertainty. You can dive into anything not having confidence of understanding, but having confidence in your ability to eventually get there.

The embracing of authenticity and proactivity are also still important to me. We decide who we want to be, are genuine about it, and pursue it without making excuses. When we make mistakes, we ponder them, decide to change, and do it. A victim mentality looks at the world and finds reasons that it has been unfair to it. An empowered mentality takes responsibility for those same observations and becomes the impetus of change. This is the standard that I strive for, and I am committed to continue to work on myself, physically, emotionally, and intellectually, to pursue growth and avoid complacency.

This year, my heart was full. I saw beauty all around me. I embraced reflection, quiet, and imagination. I’d like to share some of these thoughts today.

I learned my limits

I have a superpower of productivity. I can’t explain it, but I can focus and get a lot done, and in a short amount of time. I am mentally strong in that it is hard for my environment, whether that be social or academic pressure, to touch me. I romanticize this story sometimes, imagining myself moving through chaos, time slowed down, and brushing aside the bullets. My heart is inspired by the things that fill it with love. It also can make it hard to relate to other people, because I might understand and see the same stressors, but I don’t feel them as strongly. In a culture where it is trendy and common to be busy, I am not. I am dancing in my head, and I follow my nose and pursue things that are inspiring, or bring me joy. This year, however, I would learn the physical limits of my gift.

My team embarked on a large performance study in August. We had a very short amount of time to do a lot of work, and there are only a handful of us. I was excited and determined to have a successful outcome, so I made an explicit decision to turn on my turbo mode, and not take it off. This is an interesting quality of my productivity – that I can turn it on and off at will, and strategically direct it along with my attention. However, instead of a few hours in a day, it was turbo mode for the entirety of the day, and for weeks. We ultimately completed most of what we set out to do. I felt proud, both of myself and my team.

But after it ended, and I looked down at myself, I realized that the effort physically took something out of me. There was less of me by the end of the month, 12 pounds to be exact. My frame has always been long with lean muscle, and this doesn’t make me a huge person, so that was too much to lose. It felt terrible. It was the same lesson that I had learned when I was a freshman and new to running, and would push myself to collapse, either on the track, or completely blacking out in a cross country race. My mind has always been a lot stronger than my body, which feels tiny in comparison. I learned that my mind, which often wants to conquer mountains, needs to be more considerate of the bag of bones and meat that carry it around. Or flip that around, and the lesson is that I need to make my body stronger to match the demands of my mind.

I learned to say no

The end of the study coincided with the end of a fiscal year and a shifting of work, which led to a lot of rumination. I thought about my efforts the previous year, and while the majority were fulfilling, I couldn’t ignore a creeping feeling of sadness from some of the pursuits where I was working really hard, and trying desperately to be valued. I could complete entire projects on my own, make documentation, presentations, and give talks at high profile venues, but I still felt disconnected. I decided that I needed to introspect on this feeling, and that the experiences of the last year were not OK for me to repeat.

For most of the year I blamed myself for not finding the right way to connect. Many of us have experienced this. We try to fit into group dynamics, and give huge amounts of energy and time. What I’ve come to realize is that often there is no maliciousness or negative intent on anyone’s part. Communication is a two way street. It is simply the case that people vary in the degree to which they can successfully communicate, and the extent to which they try. The latter can make up for the first, but requires time that isn’t always possible to give. I was spreading myself too thin trying for universal connection when more realistically, I needed to prune my graph. Constantly showing up and hoping for connection when the dynamic was not there was not something I was going to do anymore.

I decided to change

Pruning a connection graph means tweaking the scoring algorithm at each node. My previous algorithm was faulty, and included a variable of wanting to be accepted. The trade-off was placing less emphasis on consideration of my own time and value. I decided to try a new algorithm – one that would place focus on the projects that I was passionate about, and listening for connection instead of trying to force it. When the noise of expectations becomes more quiet, the signal from people that value our contributions can be more strongly heard. It was akin to moments this year when I found myself in the middle of nature and heard – felt – the quiet. It was there all along.

I decided to change in many ways, and this was the first: prioritizing relationships not based on expectation or idealistic desire, but tangible evidence. And it needs to go both ways – we need to invest in the other person as well. If you need an algorithm for knowing if someone is important to you, think of how often you think of them. If you find them on your mind, or they are the person you want to brainstorm ideas with? That speaks for itself. I used to associate missing with sadness, because it meant something I treasured was not there. I’ve changed that perspective to a more positive one. Missing someone is a beautiful thing, because it means you were lucky to find connection. In many cases you will see them again, and that is something to look forward to. If you won’t? You can be grateful to have had shared experiences with them, and carry those memories with you.

When I refocused my energy, the sadness not only dissipated, but my metaphorical connection tree is thriving, growing stronger roots and fewer, healthier and more verdant leaves. As I am able to focus on signal that I think is important for our community, I am becoming a better technical leader. I don’t fit any templates for what that is supposed to look like, and I realize that I won’t, and that in and of itself is important for the ecosystem. After this subtle change I’ve found a level of joy and satisfaction in my work that is unparalleled even to the best times I had before.

I decided to adventure

Rumination might start from one thread, but has a quality of trickling easily into other parts of life. As I thought about how I spent my time, I didn’t like the idea that life would be as it is now, forever. Routine provides consistency and safety, but we can also get stuck in them. I have the duality of liking routine, which is safe, contrasted with a heart that craves adventure. That manifests in careful reflection and decision to pursue experiences that are novel and allow me to break free from comfort zones.

As a result, at the end of this year I sought adventure. I found that I needed to be brave, because I had grown up learning a mindset of fear. I embarked on an over 1000 mile road trip, found myself exploring mountains, flying down steep hills on my bike, and unapologetically experiencing all the beauty that the world has to offer. In retrospect, I was able to turn a knob to control my own level of risk aversion to be oriented to risk seeking. I embraced uncertainty. It was not accidental, but done through careful decision, and I can attest I had moments where I would sit in quiet thought and not stand up until I felt a shift in my perspective. I didn’t know that was possible. It’s fascinating that the mind is capable of that. I learned that joy comes from immersion in new experiences and adventures, and it’s relatively easy to decide to pursue them more often. I am excited for a life where I have these adventures to look forward to.

I learned friendships are tiny and powerful

At the end of last year (2023), I had an epiphany that I needed people. This was a surprise for me because during times of my life when things have been hard, I’ve learned to find strength from within myself. What I wasn’t sure about at the end of last year was the details of that. How many people? Under what context? What specifically were my needs? The last is an important question, because I’ve noticed it is common to have a need or request but not think carefully through what you are actually asking of others.

I learned this year that I only need a few, and I value close friendships that are open, vulnerable, and supportive over any volume of people that often forces more superficial interaction. My desire is for direct communication, and interactions of sharing stories, learning, and joy. I want laughing fits, and I want psychological safety. If there is conflict, that means that the other person metaphorically embraces you and shows you through their actions that you are going to figure it out. They will not abandon you or give up on you, and disagreement is tackled with thoughtful discussion and kindness. They consistently show up when you don’t ask them to, in the good times and even more in the bad, and they absolutely don’t have to. Once I experienced (and fully realized) the power behind this kind of communication, in contrast to blaming, shaming, and feeling like you are put on the defensive? I metaphorically fell to my knees. I am tearing up as I write this. I am so grateful.

I learned to set boundaries

We can often tolerate things because we are supposed to. Because if we don’t, we are publicly shamed. Or we compare ourselves to an ideal and feel that we have failed. We might keep trying, and blame ourselves. I had an epiphany this year when I realized that this self-blame was counterproductive toward protecting myself. In the same way that some people feed our souls and we might nourish these relationships, we equivalently need to be aware of those that drain us. I decided this year that I would do a better job to protect myself, and set boundaries. A quick feeling of dread and anxiety in specific situations was a harbinger of something that was not good for me. It is not selfish to not want these negative experiences anymore. We don’t have to engage. We can walk away.

It would be akin to being in a room with a stove, and having to touch it. Sometimes it’s off, and it’s OK. But then it often burns you. It hurts. You know the pain is coming again, but it’s unpredictable. The stove in this metaphor might be a person or experience, and either way, may not even be aware of its influence. The stoves in our lives are there to teach us lessons. The insight I had is that it is my choice for the stove to have that power over me, or not. I decided to learn from the experience, and leave that room. This choice was empowering, and I encourage others to set boundaries for the stoves in their lives, whatever they may be.

I learned to love

There is a common association of love with ownership and possession. While I won’t say that’s wrong, what I realized this year is that truly selfless love has no quality of that. I believe that truly loving someone means deriving joy from their happiness, and not expecting anything in return. In the best of worlds it is sharing experiences and laughter, and feeling a sense of connection that feeds your soul. It is wanting to support them, and give them your time and every facet of your superpower. You would run mountains for them. You would wait a thousand years to show up for them. Love is often portrayed to be primarily romantic, but it can be found in friendships, and in friends that turn into your family. Importantly, we must be able to apply that love to ourselves. When things fall apart, it is a hug of internal strength and determination that ultimately gets us through. This was a thought I had this year that gave me a feeling of immense safety and peace.

Thank you, 2024

It was a beautiful year. I’ve learned how to embrace what I have instead of focusing on what I do not, approach life with curiosity and desire for growth, and how daily fulfillment and taking care of the self is more of a decision than anything else. I danced and laughed so much this year, and that is a strong testament to my joy 🥰. I think if this were it, I could look back on my life and find that I have had it all. I am convinced that it’s only going to get better. Happy holidays, folks, and onward to new adventures and learning in the New Year.

For Coach

Mon, 25 Nov 2024 00:00:00 +0000

I went on a bike ride this weekend, and in passing a parking lot of a trail head I saw an older man standing next to a car, laughing with a friend, and he looked just like my old high school running coach, “Coach.” Yes, we called him exactly that. He was (at least in front of us, a rowdy group of high school students, 13-18 years of age) stoic and commanding, easily earning our respect so we would abide by his orders to run many laps around the track, or loops around a course. He led us through the cross country, winter track, and spring track seasons.

This weekend after I saw this reflection of him in someone else, I couldn’t see through my goggles. The tears flowed, and not quietly – it was the kind of aching sobbing that comes from your soul unexpectedly when something you’ve shoved down deep forces itself out into the open. I found myself flying under a bridge, not able to breathe, and having opened up a well of sorrow that I hadn’t allowed myself to experience.

Coach died in 2013 while I was in graduate school. I received notice a few days before it happened, and even if it had been enough time to think about coming to say goodbye, I couldn’t go. I faced a lot of health issues, and combined with the stress of graduate school, it was too much. I would have been ashamed at the time for him to see me – I was thin and sickly, and a shadow of my previous self.

As I rode and allowed myself to experience the suppressed emotion, I remembered many things. I remembered how he would pull me aside before and after races, when I was anxious and stuck in my head, and give me a combination of tough love and logic to bring me back into the present. And side note - that particular class I championship I was stacked for events (800, 1600, and two relays) and although I didn’t do great for all of them, the 4x400 at the end was one of the fastest times of my (very short) 400 career - close to about one minute. I wasn’t a 400 meter runner or sprinter, but I did love the relay. I remembered practices. Regardless of what happened during the day, at 3pm sharp I’d report to the track and find the comfort of routine. Coach helped to create the environment at practices that made running, albeit challenging, such a joy. I carry that joy with me to this day, which is almost 25 years later.

I realized this weekend how much I took him for granted, and the fact that he was a source of stability during my entire transition from adolescence into young adulthood – freshman to senior year. I saw him every day after school, and for many hours, for the Fall, Spring, and Winter, and often in the summer for training. He was the one that saw potential in me when I was undeniably the slowest on the team as a freshman. He was the one that pulled me aside when I grew 5-6 inches before my sophomore year, and told me how to add dried fruit and nuts to my diet to put on some weight, because I needed to be strong. I saw him early on Saturdays for meets. He designed workouts that catered to my running style – middle distance. He was a source of support if I ever came to him with complaint. He told me he was proud of me.

I cried today because I never properly mourned his loss, the loss of someone that had been so important to me, and cared for us so deeply. I cried because I never got to say goodbye, and tell him that he gave me a backbone of stability during high school that I wouldn’t have had without him. To Coach, wherever you are, there are few people that touch our lives deeply, and you are one of them for me. I am so grateful for the time that I got to spend with you. It’s common to think that people value us based on financial support or gifts, or some requirement based on biological relation. But at the end of the day, the most valuable asset that someone can give us is their time. I will keep your memory with me always, and for the people that I love, I will also give them my support and time. And I will do a better job to tell them more frequently how important they are to me.

A Future for HPC and Cloud: Collaboration Across Boundaries

Sun, 17 Nov 2024 08:30:00 +0000

The Developer Stories Podcast recently released an episode with Dan Reed (“HPC Dan”) that talked about the future of High Performance Computing. While there was ample conversation on resources and some policy, we only touched on some of the ideas about what to do about it, or more specifically, how we should be working together. In this post, I want to talk about some of the problems I see with our current academic culture that prevent us from more successfully collaborating across the space. I like to think about these possible futures that don’t exist yet. Let’s jump in.

Traditional practices do not scale to cross-community collaboration

In academia, we are accustomed to writing papers. We are told (and expect) that publishing a paper in a highly respected venue is what will get the most attention, and thus have the most impact. And perhaps it still is true that sharing information this way will reach the academic community, and is a sound strategy to earn us “career credits” or a metric of value for career advancement. But it is problematic. And you can’t even point this out because you’ll get in trouble for it.

Here is one problem with the above. While it works if you live in the isolation of an academic community, it doesn’t scale well beyond that. The issue is that today we need not just be talking to the academic community, but to the larger cloud community. We are in a present day and entering into a future where cloud is the leader, from an economic standpoint, and we are in somewhat of a competition for resources and talent. The two communities have been presented as a dichotomy, and at worst in an adversarial light. I hated this perspective – when someone on a panel would raise their voice and say “We cannot afford you!” or point fingers. It wasn’t productive – pointing fingers and blaming someone does not make progress. That same energy can go toward proactive action to try ideas and do something about solving the problem.

What it takes to be influential

Now we can talk about what it takes to be influential. Indeed, when you are the “little guy” in the face of an economic powerhouse, it’s easy to feel powerless. And if you are a pessimist, maybe this is how you see it. But if you are an optimist, you might recognize that while it’s out of your direct control, it is within your indirect control. You can have influence, even if you are just one person. You have a voice.

How to have influence? We first need to define a line between two kinds of work – the conceptual piece that might include algorithms, design and architecture, and then an implementation, which ideally is of more production quality. In the academic system that promotes publication first, we lean heavily on the conceptual. Ideas are often presented without an implementation, or if there is an implementation, it is a weak point. It’s a mistake to design something elegant but never turn it into a product that catches attention. It’s also a mistake to implement something that looks flashy but has a poor design. The strongest work will be a balance of those two – an elegant and well-thought-out algorithm and a production-level implementation that further adds evidence that the idea works. This brings up another problem about skill. Often the researcher coming up with the algorithm isn’t a programmer. The paper might entirely be math. If the researcher knows how to program, they most likely have never written production-quality software. It’s a wide gap to span.

If there are scattered people in the academic community capable of spanning this divide, they often don’t have or cannot make the time. Given a reward system based on publication, and that a publication is sufficient with the algorithm, the implementation is not a priority. To add challenge to that, in order to have influence, you often need to generate many of these paired ideas and implementations. From the standpoint of work, it’s a lot, and most of it won’t lead to an outcome of change. I think that’s likely why we don’t see many of these winning combinations from our community. It’s hard to do, has a high failure rate, and there is no direct reward in place for the work. It’s much easier to fold oneself into the quicker-turnaround reward system of writing papers that show incremental conceptual improvement and are accepted and published at frequent venues.

Engagement is not well defined, so nobody does it

Influence often has to start with establishing a voice. Establishing a voice often means speaking up, and being persistent. It’s easy to think that the voices of the few cannot be heard and have impact, but I’ve found this to not be true – one or few people can inspire change if they send out a consistent signal. This means showing up to working groups and (often) being the only one from the HPC community, posting on group lists to ask questions or engage, and taking time to listen to podcasts and watch talks from venues that are not traditionally in the HPC space to learn new ideas. It often means finding connections between what you know in your community and these “other” spaces, and then being forward to reach out to individuals in the other space to ask to talk about something. Many times, these conversations might not come to anything. But when they do? That’s where you have influence. It means leaving our silos. We are most comfortable in silos. But those that leave silos (and zones of comfort) to share ideas across boundaries will have the most impact (influence).

Examples from the HPC community

I can give direct examples from our community of individuals that I think have bridged this gap and had great success. The first is Ricardo Rocha of CERN, who is (obviously) firmly rooted in HPC – CERN, the European Organization for Nuclear Research, is the largest particle physics laboratory in the world. Ricardo has been a leader in voice and work that has spanned the cloud-native and HPC communities for years, most recently giving a Keynote at Kubecon North America about multi-cluster scheduling with Kueue. Another example is Torsten Hoefler, head of the Scalable Parallel Computing Laboratory (SPCL), whom I stumbled on recently while learning about Ultra Ethernet. If you look at his lab’s YouTube channel (yes, that is notable in and of itself, how many labs do that?) Torsten very notably is not presenting recordings from venues, he is taking talks from venues (and beyond) and recording them to share intentionally. He is adding the branding for his laboratory. They also have an active presence on social media, which is also notable. I’ve noticed that some academics tend to be very active on social media, and others either pretend it doesn’t exist, or turn their nose up at it. I’m not saying that social media venues are healthy for society or a good use of time (they can really steal attention in a terrible way) but they are a means to reach a wide audience of people. Just making a post when you have something important to say, which is what I try to do (often linking to my full thoughts here), is strategic to getting a message across, regardless of how you feel about the services.

I believe that this is something we need to do more of – putting out information (and advocating for it) without having it be of direct benefit to us (publication, conference proceedings, etc), and putting out information when we have something to say. I’ve been experimenting with this idea recently with a few talks on container pulling in Kubernetes and scheduling to containers in Kubernetes. I got tired of the “wait for a venue and ask for permission” approach to sharing ideas. Ironically, the second talk (now over 4 months old) would not have been presented until this weekend if I had submitted it to the Canopie HPC venue. I also would have been limited to a tiny bit of time, and it’s unlikely it would have been shared beyond a single room of predominantly one demographic, one community. Is that really the best outcome?

Openness and transparency are a hallmark of collaboration

Another feature that must come from the venues themselves is transparency. It almost doesn’t matter if a community has annual, flashy events if they are venues of privilege – you must pay to enter, and to access information, and beyond that, it’s closed. From the example above with Torsten, the first talk I watched was his recording of a talk he gave at Salishan. This (to me) comes across as one of these high-end, invite only HPC events that I (and most) would never be privy to attend. When presentations at these venues are not shared publicly, and yes, on places like YouTube, this is knowledge that will be forgotten. It doesn’t matter how impressive your work is if you present it to a room of 30 people and that’s the end of it. It makes me sad to read blogs of prominent people in our community that reference talks from these closed events, and know that I’ll never be able to see or learn from them. If we are championing reproducibility, transparency, and openness, we are not practicing what we preach. The argument about needing to attract attendees and keep a conference profitable doesn’t cut it. Look at Kubecon – it’s an absolute beast in terms of attendance. They have their talks up before the conference is even over!

Speaking of Kubecon, one of my favorite things to do is watch talks from it for weeks (and more) after they come out. I find interesting projects and reach out to people, and this is an opportunity to grow my network and thinking space. If the talks weren’t on YouTube, my portal to that world would be closed. We are missing out on that opportunity for others to reach us by not sharing. I feel that I get to experience some of the learning of the event despite not being there. The organizers of Kubecon I suspect recognize that not everyone can attend, for reasons that vary between people, and they don’t want to close off knowledge. I respect and champion this perspective, and hope that the HPC community can eventually catch up.

What the HPC community needs to get better at is the open sharing of knowledge. There are specific projects that do this well, but our conferences (generally) do not. The researchers and labs that are going to have impact and be successful not only do great and impressive work, but they are actively sharing it. I know about Torsten’s work and lab because I listened to him talk on a podcast about Ultra Ethernet, and then I found his YouTube channel and Twitter feeds. My network and space for learning have grown because he has put his work out there.

Routine for engagement is missing

It’s problematic that the HPC community has no established routine to know how to engage. This is often why solutions cannot be offered up for the problems at hand – it’s not clear how to act when there is an absence of instructions for the thought and engagement process to begin with, let alone solutions themselves. Maybe that is where creativity comes in – which broadly speaking is generating something from nothing. But that takes time and freedom to think (I’ll talk about this later). For the first problem – “how to engage” – learning and engaging in ways that don’t fit a traditional routine for an academic are hard to do. The academic mindset is one of permission. Do others think this is a good idea? Can I get permission from my boss to work on it? At best, we submit proposals (with creative thought) but they still need to be approved. When they are not, we abandon them for the time being in favor of whatever we are given permission to do.

Influence is deciding to bake fruit-cake

But much of what has to be done will never be granted permission because it’s either too risky or unknown and questionable. Much of what needs to be done just needs someone that decides to do it, and then shows people after that. You don’t ask permission to bake a new fruit cake you think will actually taste good, you bake it, and then offer others a taste. They might realize that it tastes good, but if you asked them in advance they would say “No way, fruit cake is terrible. Don’t do it.” In the second case, you’d never have made the cake. And my thinking of fruit cake comes directly from this post on Dan Reed’s blog. He has strong feelings about fruit cake! 🍰

A lot of good ideas are also accidental – you start doing one thing, and maybe it’s even just for fun and learning. You start building something, and stumble on an insight or something even cooler along the way. That goes against the academic desire to write down a plan a priori, get it stamped and approved, and then start working on it. You also need to have time to explore and play like that. So, at a high level:

The models of thinking and working that are often needed for innovation and ideas that are different and useful to influence a larger power don’t fit with what we are expected or trained to do.

They don’t fit into the time or schedule we are afforded based on our established routines.

Our reward systems don’t encourage relaxed, creative thinking

It seems like a lot of academics are on a treadmill to meet deadlines. There is some promise that the treadmill will slow down, but in practice, I never see that it does. This makes time hard to come by, and so the things that get prioritized are those that fall into a comfortable, established routine. If there is something that falls outside of what we deem the highest bang for the academic credit buck, it’s not invested in. You don’t make the time.

Collaboration is leaving the comfort of your local market

Let’s pretend that we are all bakers in a town. Our highest reward comes from baking our recipes, possibly with slight deviation so they are known to be tasty, and taking them to the local market to sell for profit. It would be hugely (temporally) costly to walk to neighboring towns looking for bakers working on similar recipes, and then to spend time testing new, often very different combinations of ingredients. We might come back tired, broke, and not having found a great recipe. On the other hand, maybe we don’t have an immediate success, but we are invited to other markets. We taste test a much broader range of goods. We grow in so many more ways than if we stayed in our little town.

And maybe before communication afforded it, that would be the likely outcome. But unbeknownst to us, the network of bakers in other towns have discovered Twitter, YouTube, and a use for other (sometimes terrible) social media services that allow them to quickly iterate on ideas and work together. Not only have they caught up to the tastiness of our recipes, they have surpassed us, and are designing robots to make the recipes for them. And we are still here, fudging around with the amount of cinnamon in our oatmeal raisin cookies. We still haven’t figured out we could join their communication channels, and bring the story of cinnamon to the larger community to iterate on much faster.

If you don’t get the metaphor, it’s about the time of payoff, and the initial cost of communication. Taking the time to engage outside of your comfort zone doesn’t have an immediate payoff but a longer term one.

The other issue with this paradigm is that people want established paths of behavior. There are no established paths for interaction with cloud communities. People don’t know what to do, so the default is to do nothing.

The future is large, collaborative projects

I believe in our HPC community to innovate and come up with amazing ideas. I also believe in the power of numbers, and that you can start with even a mediocre idea or project, and with enough motivated contributors, turn it into something equally innovative. That is how I see the innovation space in Kubernetes. Often a feature or component comes out, and it is first a little rough around the edges. But like clay, with many contributors and common need, it transforms over time into an elegantly designed solution that solves a lot of problems. I am biased here (and recognize my bias) that I have more faith in large, collaborative efforts to solve some of the most challenging problems than say, a small group that is isolated in academia. Sitting in these small groups, I think we will have the most success through engagement – bringing our expertise to the table and conversation for these larger projects. Is it often uncomfortable? Yes. Does it often go against traditional academic norms and incentives? Yes. I think with this strategy we can solve larger problems, and in a more collaborative fashion that leads to things we champion (but often don’t practice) like reproducibility and transparency.

I can give a quick example with respect to multi-cluster scheduling. There are huge internal projects working on the problem. And they will likely come up with interesting papers. But I believe it would be a better strategy to first collaborate with the SIG multi-cluster group, ensure there are handles for customization (for specific use cases like HPC) and then to optimize for that. I believe that a viable future for most models that are converged (general problems of compute that can sit between cloud and HPC communities) is that the powerhouse global community is going to put together some kind of skeleton, and the initial version won’t fit exactly what we need. But it will very likely be customizable, and we will customize-away for our use cases. Maybe our use cases will emerge in the larger community, and they will be solved before we’ve had a chance to write papers on what we are doing. Cloud companies get a competitive advantage for standardizing things. This means they themselves need that ability to customize, and that need directly helps us.

This goes back to the talk I shared from Ricardo – I can guarantee you he has something like that in mind. We prioritize working together, and we figure out the details for what we specifically need. This is a different strategy than what I normally see – working in silos and coming up with disparate solutions that then further separate the two communities. Ironically, because the underlying use cases are so similar, we usually have a loophole that the Kubernetes (and cloud-native) community eventually innovates what we need anyway. Examples include (but are not limited to) batch workflows, topology-aware scheduling, and custom scheduler policies. The scheduling space is still a bit rough, I’ll admit, but it’s getting a lot better, and really quickly. I suspect the next item to add to that list will be multi-cluster and multi-tenancy. We will see.

The gopher has no clothes

I sometimes feel like I’m pointing out that the emperor has no clothes. But it’s strange to watch these cycles repeat, year after year. The insight is that there is not a real divide in the actual technology space – the current divide results from us not working together. We have similar workloads and similar needs, and the only reason we have entirely different projects is because HPC has largely existed in a silo. A lot of the innovations that we need are ironically coming to be, not because of our input, but because they are foundational to workloads we share in common and cloud needs them too.

I am just one person, but I will continue to express my views, and to have my voice, even if I am a bit against the grain or considered non-conformist for it. I know that my opinions are often threatening to people, and that is outside of my control. If you find my ideas threatening, it might make sense to think about why. And after that, let’s have a conversation about it. Let’s grow and learn from one another, because we very likely have similar goals in this beautiful space of work.

On that note, I’m off to a running adventure! And this week is Supercomputing. I’ll be watching Kubecon talks, engaging remotely however I can (without having purchased a ticket) and probably enjoying a quiet week of focus on programming projects and learning. I do hope to go in the future for some of the fun social aspects. To all my friends in attendance, have an amazing week!