January - April, 2025
Stripe is a payments processor that handles online and in-store transactions by managing the flow of information between customers, merchants, and banks. I first became interested in Stripe years ago after hearing their story on the "How I Built This with Guy Raz" podcast. What really captivated me was how Patrick and John Collison founded the company when they were around my age (19-22) with a simple yet revolutionary idea: allowing online stores to integrate payment processing with just seven lines of code. They took a process that once took weeks and made it something that could be completed in seconds.
While Stripe may seem straightforward from the outside, orientation revealed the complexity behind its services. The company has expanded far beyond payment processing to offer a full platform, including fraud prevention, business infrastructure, and more. Seeing the depth of its offerings made it clear that working at Stripe would give me the chance to tackle real, complicated problems, and its reputation led me to believe I’d be working with some of the best in the software industry.
I spent 16 weeks interning at Stripe from January to April 2025, starting with an orientation week at their San Francisco HQ. After that, I moved to Seattle for the remaining 15 weeks, working on the Service Orchestration team in the Core Compute department. Our team handled Kubernetes clusters, which are the environments where Stripe's services run.
During my time at Stripe, I completed two significant backend projects, both implemented in Golang. My primary project involved developing Kubernetes informers to track resources from Karpenter (an open-source AWS node autoscaler the team had recently adopted) and creating APIs to expose this data. I built a system that persisted resource states to PostgreSQL, implemented efficient queries, and extended Stripe's internal tooling to display the data. This gave engineers better visibility into cluster autoscaling and node lifecycle management. I completed this infrastructure work ahead of schedule, by the halfway point of my internship.
For my second project, I productionized an LLM-based diagnostic agent for Kubernetes troubleshooting. This system analyzed Splunk logs and Kubernetes events to provide automated diagnostics for common issues, reducing the support burden on the team. The agent helped engineers quickly troubleshoot deployment failures, container crashes, and resource allocation issues without requiring manual log investigation.
Outside of work, there were a bunch of fun activities for interns, like go-kart racing, mini golf, a Seattle Kraken hockey game, and even a chopstick forging experience. These events, along with my day-to-day work, gave me a great look at both the technical side of things and the collaborative, social culture at Stripe.
Stripe's headquarters is in Oyster Point, on the western shore of San Francisco Bay in South San Francisco, with a great view across the water to the East Bay. The area is mostly quiet, filled with modern office buildings and warehouses, many of them occupied by biotech companies. It's a surreal location with prime real estate—on the water, palm trees, and a bike and running trail along the bay—but there's not much else nearby: no homes, stores, or restaurants. It felt odd being in such a prime spot with so little traffic and so few crowds.
For the first week, the onboarding was pretty intense. I, along with about 300 other new hires, stayed in hotels around a 30-minute walk from the office. Stripe runs these onboarding sessions for new employees every two weeks. Onboarding started at 7:00 am on the first day and at 8:00 am for the rest of the week. We were given a book, The Dream Machine (published by Stripe Press, the company's own publishing arm!), and spent the week in back-to-back presentations and activities. These covered everything from Stripe's products to how the company fights fraud. For software engineers like me, there were also technical presentations on Stripe's tech stack and architecture. We were supposed to meet Patrick, the CEO, but he was sick that day, which was a bit disappointing.
Social events were organized for interns, like go-kart racing, which turned out to be a lot more intense than I expected. The karts were fast, the track was small with lots of turns, and I ended up getting rear-ended a few times, leaving me sore. The other interns had pretty impressive backgrounds. I thought that having done two prior internships was a lot, but the Canadian students from schools like Waterloo were coming in with four or five previous internships. After Stripe, they had plans for summer internships at prestigious companies like Citadel, Meta, Bridgewater, and Apple. But as one of my Stripe teammates, Javin, likes to say (quoting a line usually attributed to Theodore Roosevelt, I later learned), 'Comparison is the thief of joy.' So I tried to ignore the comparison game and focus more on work and on conversations about activities and hobbies outside of it.
During onboarding, I got a solid understanding of Stripe’s core business and operations. Presentations covered the full range of Stripe’s products, how they fight fraud, and the company's plans for the future. As an intern and software engineer, I was especially impressed by Stripe’s tech stack and architecture. One thing that stood out to me was Stripe’s commitment to reliability. They aim for over five and a half nines (99.9995%) uptime, meaning a maximum of just 13 seconds of downtime per month. This level of reliability is key for Stripe, especially during high-traffic periods like Black Friday and Cyber Monday. The company’s infrastructure is built to handle massive transaction volumes without sacrificing uptime.
After a busy week, I drove back across the Bay Bridge to Oakland to spend the weekend with my parents before heading to Seattle for the rest of the internship.
My flight back to Seattle got delayed a couple of hours, which caused some stress since I had a TaskRabbit appointment with someone to help me move from my old apartment to my new place. My old apartment had a termite infestation, so I needed to get out fast. The person I hired, Lance, was super accommodating and told me he’d just hang around Seattle until I arrived, since he lived out of town. I didn’t get to the apartment until around 8 PM, but Lance was a pro and helped me load everything into his truck and move to my new place. I gave him a good tip for going above and beyond.
My new place was the back half of someone’s home they’d closed off. My bed was in a loft that could only be accessed by a ladder, which made nighttime bathroom trips a bit tricky. But I had my own private bathroom and shower, a full-sized fridge, an oven, a microwave, a couch and TV (which I never used), and access to the family’s BBQ and washer/dryer. The house was in a quiet residential neighborhood, about a 45-minute walk to the office, or 30 minutes if I jogged down the hills. It had a nice backyard I overlooked, and there was a dog. Trader Joe’s was a 15-minute walk away, and QFC, the best grocery store in Seattle, was a 25-minute walk. The living situation was a huge upgrade from the cheap apartment I had while interning at Amazon the fall before—no more dealing with people fighting and yelling at 3 AM, though it did cost an additional $575 per month.
Aside from my first week in Seattle, when I went in earlier, my usual routine was pretty straightforward. I'd wake up around 8 AM, do an hour of exercise, and then dive into my morning meetings. These included weekly on-call meetings, team-wide check-ins, and smaller team syncs from about 10:00 AM to 11:30 AM. Once those wrapped up, I'd shower, then walk (jogging the flats and downhills) to get to the office by 12 PM for lunch with the team. After lunch, I'd have my daily check-in with my manager, Garvin, and then work in the office until around 4 or 5 PM. I'd walk home, usually taking a slightly longer route to listen to podcasts and have some thinking time. In the evenings, I'd work another 3-4 hours, from 6 PM to 9 or 10 PM.
On the weekends, I definitely took advantage of sleeping in—especially with how cold Seattle was in January and February. I’d usually wake up around noon and then spend the rest of the afternoon, or until it got dark, outside. I’d walk around the city, hit up parks and the waterfront, go for runs, or grab groceries.
I also got a chance to join a few fun social activities. The first was mini-golfing at Flatstick Pub. It was indoor mini-golf, and while the course wasn’t super unique, it was a good time. The next activity was definitely my favorite—going to a Seattle Kraken hockey game. I’d never seen a hockey game live before, and the energy in the arena was contagious. The Kraken are a new team (joined the NHL in 2021), and the Seattle fans are really passionate about them. You'll see their logo all across town. I was surprised by the fights that happen during the game—apparently, fist fights and wrestling are a normal part of the sport. The Kraken lost to the Toronto Maple Leafs, but the whole experience was definitely memorable.
The last activity was a chopstick forging event. The other interns and I went to this place where we forged our own chopsticks from iron rods. We stuck the rods in furnaces, hammered them on anvils, and kept repeating the process until we shaped them into pointy chopsticks. It was a pretty unique and fun experience. Unfortunately, when I flew home at the end of my internship, TSA took them away because they were too sharp. At least I still have the photos to remember it!
When I joined, my team had 10 people and one manager; by the time I left, it had grown to 14 people, with two new managers reporting to the original one. Despite the expansion, my work and projects weren't significantly affected. The team was called "Service Orchestration," and we sat under the "Core Compute" department. Our main responsibility was managing Kubernetes clusters, the environments where the company's services run.
Kubernetes is an open-source system (launched by Google in 2014) that automates the deployment, scaling, and management of application containers across clusters of machines. Containers are isolated units that package an application’s code, libraries, and settings to ensure consistent performance across different environments. While that’s a simplified version, there is a lot more to Kubernetes that I still want to learn.
My team handled the Kubernetes clusters, ensuring the smooth operation of all the services that ran on them. This involved working with a variety of tools and writing code to manage, monitor, and optimize the clusters, ensuring the systems were stable, scalable, and cost-efficient.
In terms of team structure, my internship manager and mentor, Garvin, was my key point of contact, and we synced daily. I bounced ideas off him, he gave me feedback, reviewed all my PRs, and helped me get up to speed on anything I needed to understand. Then there was Justin, my team's manager, who came in a few weeks after I joined. He was an approachable, down-to-earth person who took on the more traditional management responsibilities but worked hard to maintain a hands-on, accessible presence. Above him was Uttara, who initially managed the team and played a vital role, alongside Justin, in translating the company's larger goals into tangible projects for our team.
One of the key projects my team was working on during my time there was bin packing in Kubernetes. This is a classic computer science problem, where the goal is to fit objects of varying sizes into a limited number of containers as efficiently as possible. In Kubernetes, this translates to fitting as many pods (groups of containers) onto nodes (machines that run the containers) as efficiently as possible, optimizing resource usage like CPU and memory and minimizing wasted space.
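To make the packing idea concrete, here's a minimal first-fit-decreasing sketch in Go. It's purely illustrative (real Kubernetes scheduling and Karpenter consolidation weigh CPU and memory together, along with affinities and disruption rules), but the core intuition is the same.

```go
package main

import (
	"fmt"
	"sort"
)

// firstFitDecreasing packs pod CPU requests (in millicores) onto nodes of a
// fixed capacity: sort requests largest-first, then place each one on the
// first node with room, opening a new node only when nothing fits.
func firstFitDecreasing(requests []int, nodeCapacity int) [][]int {
	sorted := append([]int(nil), requests...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	var nodes [][]int // pods placed on each node
	var free []int    // remaining capacity per node

	for _, req := range sorted {
		placed := false
		for i := range nodes {
			if free[i] >= req {
				nodes[i] = append(nodes[i], req)
				free[i] -= req
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, []int{req})
			free = append(free, nodeCapacity-req)
		}
	}
	return nodes
}

func main() {
	// Pod CPU requests in millicores, packed onto 4000m nodes.
	pods := []int{2500, 1200, 800, 700, 3000, 400, 1500}
	for i, node := range firstFitDecreasing(pods, 4000) {
		fmt.Printf("node %d: %v\n", i, node)
	}
}
```

With one pod per node, those seven pods would need seven machines; packed, they fit on three or four, which is where the savings come from.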
When I joined, Garvin explained that Stripe had been running one pod per node for security reasons. However, this approach was quite costly. By relaxing the security policies and allowing multiple pods to run on a single node, the company was able to save millions. The potential savings were immediately clear to me, and it was eye-opening to see the significant impact that engineering decisions could have on both the technical and financial sides of the company. It was my first time witnessing how software engineers could provide value far beyond their paychecks.
Garvin also explained that the team was adopting Karpenter, an open-source project launched by AWS. Karpenter is a node autoscaler that dynamically creates and deletes nodes based on workload demand. It can consolidate underutilized nodes (horizontal scaling) or replace nodes with differently sized ones (approximating vertical scaling) by automatically evicting and recreating pods.
At the time, the team had limited visibility into Karpenter's behavior. To inspect Karpenter-related Kubernetes objects (CRDs like NodePools, EC2NodeClasses, and NodeClaims), engineers had to manually run kubectl commands across multiple terminals. This was tedious and disrupted day-to-day work. Stripe already had internal tooling to view nodes, pods, and containers, so integrating Karpenter into this system was the goal of my first project.
The main tasks included:
Creating Kubernetes informers to watch create, update, and delete events on Karpenter CRDs.
Saving Karpenter CRD states to a PostgreSQL database.
Building a gRPC API endpoint to fetch these resources efficiently.
Updating the frontend UI to display Karpenter objects, pulling from the live control plane or database depending on resource status.
Informers are hooks into Kubernetes that notify custom code when specified events occur. I implemented new informers for Karpenter CRDs, built the associated database tables (storing CRD objects as JSON blobs), and indexed by node name (among other fields) for fast querying.
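For a rough picture of what an informer looks like in code (this is a generic client-go sketch, not Stripe's internal implementation), here's a dynamic informer watching Karpenter NodeClaims; the group/version and the kubeconfig handling are assumptions that vary by Karpenter release and environment.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig (assumes a local ~/.kube/config for illustration).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Karpenter's NodeClaim CRD (group/version may differ by release).
	nodeClaimGVR := schema.GroupVersionResource{
		Group:    "karpenter.sh",
		Version:  "v1",
		Resource: "nodeclaims",
	}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Second)
	informer := factory.ForResource(nodeClaimGVR).Informer()

	// The event handlers are where custom code gets notified; the real
	// project persisted these objects to PostgreSQL instead of printing.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			u := obj.(*unstructured.Unstructured)
			fmt.Println("NodeClaim added:", u.GetName())
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			u := newObj.(*unstructured.Unstructured)
			fmt.Println("NodeClaim updated:", u.GetName())
		},
		DeleteFunc: func(obj interface{}) {
			u := obj.(*unstructured.Unstructured)
			fmt.Println("NodeClaim deleted:", u.GetName())
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Watch events for a minute, then shut down (illustrative only).
	time.Sleep(time.Minute)
}
```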
The backend API was created using gRPC and protocol buffers, mainly focusing on fetching resources associated with a given node.
The informer and basic API work took about two weeks. Updating and extending the UI took a bit longer because I scoped additional functionality based on teammate feedback.
Some of the major extensions I built included:
Fetching terminated nodes and pods: Previously, the UI could only show active nodes and pods by querying the control plane. I extended the backend to query the PostgreSQL database for historical resources. Because the database saved every object state, I had to build efficient SQL queries over large datasets. To keep them fast, I added multi-column indexes on pod name, namespace, and cluster, and optimized the queries to find the latest version of each resource quickly (see the sketch after this list). In one instance, I brought query latency down from over five minutes (unusable) to under one second.
Real-time resource usage: I integrated Prometheus metrics to calculate CPU and memory usage for pods, nodes, and clusters. This let users view actual vs. available resource utilization live.
One interesting thing I learned during this project was that CPU allocation in Kubernetes is more about scheduled runtime than dedicated hardware (millicores as time slices, not exclusive core access). My manager showed me a great YouTube video that helped me think of Kubernetes CPU measurements as time units.
Other UI improvements: I added better filtering, sorting, event viewing, and integrations with Karpenter-emitted control plane events.
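Here's the kind of latest-version lookup I mean, as a rough sketch with a hypothetical pod_states table and column names rather than the real schema: a Postgres DISTINCT ON query backed by a matching multi-column index stays fast even when every historical state is retained.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // Postgres driver, assumed for this sketch
)

// podRow is a hypothetical row shape; the real tables stored full CRD
// objects as JSON blobs alongside the indexed columns.
type podRow struct {
	Name      string
	Namespace string
	Cluster   string
	State     []byte // JSON blob of the last observed object
}

// latestPodStates returns the most recent stored snapshot of every pod on a
// node. DISTINCT ON keeps one row per (namespace, name), and a matching
// multi-column index keeps the scan fast despite full history being kept.
func latestPodStates(ctx context.Context, db *sql.DB, cluster, node string) ([]podRow, error) {
	// Supporting index, created once:
	//   CREATE INDEX pod_states_latest
	//     ON pod_states (cluster, node_name, namespace, name, updated_at DESC);
	const q = `
		SELECT DISTINCT ON (namespace, name)
		       name, namespace, cluster, state
		FROM pod_states
		WHERE cluster = $1 AND node_name = $2
		ORDER BY namespace, name, updated_at DESC`

	rows, err := db.QueryContext(ctx, q, cluster, node)
	if err != nil {
		return nil, fmt.Errorf("query pod states: %w", err)
	}
	defer rows.Close()

	var out []podRow
	for rows.Next() {
		var r podRow
		if err := rows.Scan(&r.Name, &r.Namespace, &r.Cluster, &r.State); err != nil {
			return nil, err
		}
		out = append(out, r)
	}
	return out, rows.Err()
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/clusters?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	pods, err := latestPodStates(context.Background(), db, "prod-us-west", "ip-10-0-1-23")
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d pods with stored state\n", len(pods))
}
```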
Another problem I tackled was a bug where the system dropped informer events. When a delete event and an update event happened at around the same time for the same resource, the informer would remove the resource's state from its cache before the system processed the update event. Finding the issue was one challenge; fixing it was another.
I created a custom cache instead of relying on the informer's cache. On every create, update, or delete event, I made a deep copy of the object and its state and saved it to my own cache. Then, when processing the events, I pulled from my cache and deleted the entry afterward.
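A minimal sketch of the idea, simplified from what I actually shipped: snapshot each object with a deep copy at event time into a cache we own, and remove the entry once it's processed so stale state can't accumulate.

```go
package watcher

import (
	"sync"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// eventCache keeps our own deep-copied snapshot of each object at event time,
// so a delete landing in the informer's cache can't erase state we still need
// when the matching update event is processed later.
type eventCache struct {
	mu    sync.Mutex
	items map[string]*unstructured.Unstructured
}

func newEventCache() *eventCache {
	return &eventCache{items: map[string]*unstructured.Unstructured{}}
}

// put stores a deep copy keyed by namespace/name the moment the event fires.
func (c *eventCache) put(key string, obj *unstructured.Unstructured) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = obj.DeepCopy()
}

// take returns the snapshot and removes it; forgetting this removal is one
// easy way to end up with the kind of memory leak described next.
func (c *eventCache) take(key string) (*unstructured.Unstructured, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	obj, ok := c.items[key]
	if ok {
		delete(c.items, key)
	}
	return obj, ok
}
```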
At first, this fix caused a memory leak. I used Prometheus to monitor memory usage and saw it increasing without bounds. I found several subtle causes of the leak, eventually fixed all of them, and fully resolved both the dropped events bug and the memory leak.
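For reference, exposing a Go service's runtime memory metrics to Prometheus can be as small as this (a generic client_golang sketch, not our internal setup); plotting go_memstats_heap_inuse_bytes over time in Grafana makes an unbounded climb obvious.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// client_golang's default registry typically includes the Go runtime
	// collectors, so go_memstats_heap_inuse_bytes and friends appear on
	// /metrics without extra wiring; Prometheus scrapes this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```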
I used stacked pull requests for the first time. Instead of branching each change off main, I branched each PR off the previous one, which let me keep making progress without waiting for every individual review and merge.
Over the first 8 weeks of my internship, I averaged multiple PRs per day. I shipped the full integration by the midpoint of the internship, received positive feedback from leadership, and made the Karpenter adoption experience much smoother for my team.
I started thinking about my second project around week three or four of my internship. During one of our weekly team meetings, my coworkers Javin and James presented a side project they had built in their own time: an LLM-based diagnostic agent that used Splunk logs — mainly container logs and Kubernetes events — to automatically identify common issues with Kubernetes deployments. These issues included crashlooping containers, node and pod terminations from Karpenter, cronjob failures, and networking misconfigurations.
Before this tool, users who ran services on our Kubernetes infrastructure had two options when something broke: debug it themselves or ask for help in our team’s Slack channel. When our engineers responded, they had to manually dig through logs to figure out the issue. Sometimes the root cause was simple, like missing container permissions. Other times it was more complex, like pods terminating due to Karpenter-driven resource consolidation. Either way, it cost a lot of engineering time to diagnose issues manually.
We believed the troubleshooting process could be automated with an LLM, as long as the system had a few key pieces: access to relevant logs, background knowledge of how our services worked, templates for diagnosing common issues, and the ability to recommend specific solutions after analyzing logs. Javin and James built the prototype, and I took it the rest of the way to make it production-ready.
The project goals were:
Reduce time-to-resolution for common issues:
Provide immediate automated diagnostics for frequently encountered Kubernetes service issues
Eliminate wait times for initial troubleshooting support
Decrease engineering support burden:
Reduce the volume of routine Slack messages directed to the engineering team
Free up engineering time for more complex issues and platform improvements
Improve user self-service capabilities:
Enable users to diagnose and resolve common issues without engineer intervention
Provide educational context to help users better understand the platform
Promote best practices through solution recommendations
Build a knowledge repository of solutions:
Capture and codify institutional knowledge about issue resolutions
Create a growing database of diagnostic patterns and solutions
Enable continuous improvement of the diagnostic engine
Further develop observability and metrics:
Track common failure patterns across the platform
Identify recurring issues that may require systemic fixes
Provide data-driven insights for platform improvements
I used Aurora PostgreSQL to save diagnostic outputs, which allowed the system to return cached responses and present diagnostic reports on the frontend. I pulled container logs and Kubernetes event logs through a Splunk API to provide the raw inputs for the diagnostic agent.
I integrated with Braintrust, a tool that records LLM prompt inputs and outputs, making it easier to review and evaluate model responses. I also used Stripe’s internal LLM API, which supported features like function calling. Function calling let me define parameterized functions the LLM could invoke during a prompt, such as querying Stripe’s internal home search API for background information. I also used it to force the LLM to return structured responses, like calling a "done" function with a TL;DR summary and a list of sources.
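As an illustration of forcing structure through function calling (using an OpenAI-style tool definition as a stand-in, since Stripe's internal LLM API isn't public), the "done" function can be described by a JSON schema the model has to satisfy when it finishes:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolSpec mirrors the general shape of an OpenAI-style function-calling
// definition; the internal API differed in details, so treat this as
// illustrative only.
type toolSpec struct {
	Name        string          `json:"name"`
	Description string          `json:"description"`
	Parameters  json.RawMessage `json:"parameters"`
}

func main() {
	// A "done" function the model must call to finish, forcing a structured
	// answer (TL;DR plus sources) instead of free-form text.
	done := toolSpec{
		Name:        "done",
		Description: "Finish the diagnosis with a structured summary.",
		Parameters: json.RawMessage(`{
			"type": "object",
			"properties": {
				"tldr":    {"type": "string", "description": "One-paragraph summary of the root cause and fix"},
				"sources": {"type": "array", "items": {"type": "string"}, "description": "Log lines or docs the answer relies on"}
			},
			"required": ["tldr", "sources"]
		}`),
	}
	out, _ := json.MarshalIndent(done, "", "  ")
	fmt.Println(string(out))
}
```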
Lastly, I used Temporal’s workflow engine to run diagnostics in a series of stages. If something failed during the diagnostic process — like the Splunk API momentarily going down — Temporal provided a structured way to retry operations and handle failures cleanly.
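Here's a rough sketch of how stages can be chained as Temporal activities with automatic retries; the activity names and request fields are placeholders, not the real workflow:

```go
package diagnostics

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// StartDiagnosisRequest mirrors the kind of input described below: which pod,
// namespace, node, and container to diagnose, and how far back to pull logs.
type StartDiagnosisRequest struct {
	PodName   string
	Namespace string
	NodeName  string
	Container string
	Lookback  time.Duration
}

// Placeholder activities standing in for the real stages (Splunk log pulls,
// LLM analysis, summarization).
func CollectLogs(ctx context.Context, req StartDiagnosisRequest) (string, error) { return "", nil }
func AnalyzeWithLLM(ctx context.Context, logs string) (string, error)            { return "", nil }
func Summarize(ctx context.Context, findings string) (string, error)             { return "", nil }

// DiagnosisWorkflow chains the stages; transient failures (like a flaky
// Splunk API) are retried automatically under the activity retry policy.
func DiagnosisWorkflow(ctx workflow.Context, req StartDiagnosisRequest) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 2 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: 5 * time.Second,
			MaximumAttempts: 3,
		},
	})

	var logs string
	if err := workflow.ExecuteActivity(ctx, CollectLogs, req).Get(ctx, &logs); err != nil {
		return "", err
	}
	var findings string
	if err := workflow.ExecuteActivity(ctx, AnalyzeWithLLM, logs).Get(ctx, &findings); err != nil {
		return "", err
	}
	var summary string
	if err := workflow.ExecuteActivity(ctx, Summarize, findings).Get(ctx, &summary); err != nil {
		return "", err
	}
	return summary, nil
}
```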
Each diagnostic started with a StartDiagnosis request, which included basic information like pod name, namespace, node name, container name, and options like the lookback period for logs.
First, I checked if the system had already diagnosed the exact issue. If it had, I returned the cached result from the database unless a fresh diagnosis was explicitly requested. If it was a new issue, I queued it for processing. Once a worker became available, the workflow moved through these main stages:
Deployment Hydration: Find all failing pods related to the deployment ID and set up contexts for each pod-container-node group with issues.
Context Enrichment: Pull additional details from lifecycle tracker logs in Splunk, such as service name, availability tier, ownership, and configuration info.
Log Collection: Gather relevant logs from Splunk, including Kubernetes events, controller logs, OOM events, and the application’s container logs.
Deployment Analysis: Examine the deployment in Stripe’s internal system to determine which part of the deployment had failed.
Infrastructure Insights:
Generate manual insights about common problems (like OOM events or node terminations) by matching patterns in the logs with regexes (see the sketch after these stages).
Create a timeline of events from the logs.
Feed this timeline to an LLM, primed with background information about the team, terminology, and a list of common problems and solutions.
The LLM analyzed what happened and suggested fixes.
Service Log Analysis: Focus on the application’s container logs. The LLM looked for crash reports, panics, and error patterns, and then suggested causes and resolutions.
Overall Summary: Consolidate everything. The LLM merged related findings across stages and created a prioritized list of actionable recommendations with concrete next steps.
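And here's the kind of regex-based insight check mentioned in the Infrastructure Insights stage, with a made-up pattern rather than the ones we actually tuned against our Splunk log formats:

```go
package insights

import "regexp"

// oomKilled matches the kind of kubelet or event log line that reports a
// container killed for exceeding its memory limit. The exact wording here is
// an assumption; real patterns were tuned against our log formats.
var oomKilled = regexp.MustCompile(`(?i)\b(OOMKilled|Out of memory: Killed process)\b`)

// hasOOM reports whether any log line looks like an OOM kill, which the
// diagnostic then surfaces as a manual insight alongside the LLM's analysis.
func hasOOM(lines []string) bool {
	for _, line := range lines {
		if oomKilled.MatchString(line) {
			return true
		}
	}
	return false
}
```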
For the frontend, I focused on creating the best possible user experience. I structured the diagnostic workflow to save progress updates to our database as it moved through each stage. This let me implement a polling mechanism on the frontend that checked for updates every 5 seconds, showing users real-time progress during the typical 2-5 minute diagnostic run.
I added "debug" buttons directly on deployment pages where users would see their failed deployments. One click would immediately launch a diagnostic and redirect them to the findings page, which updated in real-time as results came in.
Since I wanted to test thoroughly before a full release, I implemented feature flags for the first time. These are basically boolean variables you can toggle through a separate system without changing code. This let us control who could see and use the diagnostic agent while I was developing it, initially restricting access to just my teammates.
I also added analytics tracking to measure engagement - how many people saw the debug buttons, how many clicked them, and how users interacted with features like the requery button that would refresh an already cached diagnostic.
For feedback collection, I placed prominent positive/neutral/negative buttons at the top of each diagnostic report, along with an option to leave comments. I quickly set up a Postgres table to store this feedback, created an API endpoint to receive it, and wrote code to send our team Slack notifications whenever new feedback arrived.
The results were promising. The agent excelled at identifying issues we'd explicitly defined in our prompts, like crashlooping containers or pods terminated due to Karpenter disruptions. For undefined scenarios, it was hit-or-miss - sometimes it searched our internal docs well and found surprisingly accurate solutions, other times it produced vague, unhelpful AI-slop.
I got to demo the project to a senior leader one afternoon, and he was genuinely excited about it. He actually suggested adding the feedback collection mechanism, which I implemented. In my final week, I shipped the agent and started a beta rollout, giving access to several engineering teams to test the system with their real-world Kubernetes issues.
During this internship, I grew both technically and professionally.
On the technical side, I got more familiar with Kubernetes. As my teammates often said, the more you learn about Kubernetes, the more you realize how much more you need to learn. I learned Kubernetes basics like nodes, pods, containers, and resources like CPU and memory. I worked with Kubernetes informers and, in my final week, helped my manager, Garvin, test new Kubernetes scheduler settings. I also gained experience with Karpenter and the ideas behind autoscaling and bin packing.
I used Splunk to view application logs and debug issues, and I used Prometheus and Grafana to monitor metrics like memory usage to catch memory leaks. I improved my skills working with PostgreSQL, including writing more efficient queries and developing API endpoints, although I had already built a good foundation for this during my previous internship at Amazon.
The LLM diagnostic agent project was new territory for me. I learned how to architect an LLM system using a multi-step approach: first gathering all the context and inputs (like container logs and Kubernetes events), then designing multiple prompts to process the inputs step-by-step — filtering logs, listing issues, consolidating them into top issues, and finally summarizing the findings. I got hands-on experience with LLM concepts like function calls and with tooling like Braintrust.
Lastly, I built more frontend features and thought more about user interactions, trying to reduce friction as much as possible — similar to Amazon’s one-click philosophy.
On the professional side, I learned how to manage my own projects. I kept a project milestone tracker up to date with weekly summaries of what I accomplished, what milestones I hit, and what goals I set for the next week. I got more experience with sprint planning and estimating what I could realistically complete each week. I improved my communication during weekly team meetings and daily syncs with my manager and mentor, Garvin.
I also learned the importance of being social with teammates. Spending time together at lunch or getting to know each other outside of work makes a big difference in creating a positive work environment.
Stripe is a solid company. Good culture, good people, good work. Every Friday, Patrick, John, or another senior leader hosted a Fireside chat with a Stripe customer to talk about how they used Stripe and ways Stripe could grow or improve. I liked that the founders stayed so closely involved.
Everyone I spoke with believed the company still had a lot of room to grow. During orientation, I learned that Stripe already processes over 1% of the world's GDP — $1.4 trillion in 2024 — meaning about 1 in every 100 dollars moves through Stripe. Its technology powers many of the payments you make without realizing it, underneath Shopify, Uber, Lyft, Amazon, Instacart, and many more.
The people I worked with had left companies like Google, Amazon, Microsoft, and Oracle because they saw more upside at Stripe. Company morale was high. During my internship, I tried to take advantage of the opportunity as much as possible, working hard and long days to finish my first project early so I could start a second one. I gave my best effort and got strong support from my teammates.
To celebrate the end of my internship, my team took me out for sushi. I hadn’t had sushi in ages, so I really enjoyed it.