My 12-week Experience Interning at Pinterest
August, 2024
Intro
Social media often gets a bad rap for fostering addiction and mindless scrolling. But during my 12-week software development internship at Pinterest, I saw a different side of the industry. Pinterest's mission is to bring everyone the inspiration to create a life they love. Their main metrics aren't how many hours users spend on the app, but how often they check in (weekly active users, or WAU, and monthly active users, or MAU). They also don't allow political ads or posts, as those go against their mission, and this has cost them at times, especially in the 2024 election cycle as other social media behemoths rake in money from political ad spending. I worked as a software development intern, focusing on backend data processing infrastructure.
The Application Process
As a CS student looking for my first internship, I spent the summer and fall applying to hundreds of companies. My resume showcased projects and a teaching assistant role from Berkeley. About 10% of applications led to online assessments, including coding challenges, behavioral questions, and occasionally IQ tests. Only 3% resulted in phone interviews.
I don't recall exactly when I applied to Pinterest, but they sent a coding assessment. Fortunately, I had completed a similar one for another company and could reuse my results. Shortly after submitting, around 11 PM on a Friday, I received an invitation for a phone screening.
During the call with recruiter Shelley Hernandez, we discussed my interest in computer science and Pinterest. When asked about my preference for fullstack, backend, or frontend development, I chose backend. Shelley matched me with a potential data team in Palo Alto, conveniently close to my home in Oakland.
The following week, I prepared using resources Shelley provided, including a helpful mock interview recording by Pinterest employees. During the technical interview, I aimed to replicate the mock interviewee's approach:
Presenting the brute force solution
Offering an optimized solution
Asking clarifying questions about the problem
The interview went well, and Shelley called the next week with an offer. I had another option on the table, but Pinterest's offer was the stronger one, and I accepted around October or November 2023.
Pre-Internship and Onboarding
In February, Pinterest sent a box of company swag and began including us in internal emails about company developments, successes, and strategic focus. They also organized about ten video calls featuring Pinterest employees presenting on diverse topics, from Coachella marketing efforts to Pinclusion groups and technical deep dives into machine learning. These events helped familiarize us with the company culture and operations before our start date.
In late April, I met my team for the first time. The team builds and supports cloud-based data processing infrastructure (on Kubernetes) that enables data scientists and machine learning engineers to work with terabytes of data (via Apache Spark applications). When my manager and mentor asked about my experience with Kubernetes and Apache Spark, I admitted I had none. They then asked about my preferred work split: frontend, a mix, or all backend. I said I wanted to focus on full backend development, and they were willing to accommodate that preference even though I didn't have direct experience with Kubernetes or Spark.
The internship began in May with a two-week onboarding period. This was my least favorite part of the experience. Days were filled with meetings, many of which were general sessions for all new employees and not directly relevant to my role. While I appreciated gaining a broader understanding of the organization, there was some repetition in the presentations.
Also, having just completed an intense semester with four technical classes and 10-hour study days, I found myself restless and twiddling my thumbs during the downtime between onboarding sessions. I was hungry to start actual work and found the idle periods frustrating. As someone who likes to work, this initial phase tested my patience.
Projects and Responsibilities
My first major task was upgrading open-source software used by my team. We had forked the software in 2022 and needed to update it to the latest version. After that, I moved on to my two larger projects: optimizing cloud code execution times and developing a tool to track resource usage (CPU, memory, etc.) in our cloud environment more accurately.
Learning and Growth
Overall, I learned to take ownership: full responsibility for my work, encompassing code quality, testing, and design decisions.
Code Quality: Updating the team's software was not straightforward. I had to reconcile conflicts between our customized version and the latest open-source release while maintaining a clear code history. My mentor's feedback taught me to continually refine and improve code quality. I learned that when working with open-source code, it's important to make minimal changes to ensure future upgrades are seamless. For instance, although there were pieces of code that our team wasn't using, my mentor advised against deleting them. Additionally, I had to disable my code editor's automatic formatting feature to keep our forked version as close to the upstream open-source version as possible.
Holistic Thinking: Beyond code changes, I learned to consider the broader implications of version upgrades. I analyzed new features, tested potential benefits for our team, and presented findings.
Deployment Process: I developed a comprehensive rollout strategy:
Implementing changes in the development environment
Conducting thorough testing
Establishing reversion protocols
Deploying to a mock-production environment
Final production rollout
Project Design and Stakeholder Management: For the resource tracking project, I was given autonomy in design and development. I met with stakeholders to discuss designs and gather feedback. When users expressed a preference for maintaining current tracking methods, I adapted my design accordingly. Later, when my manager suggested a different approach, I learned to be flexible and pivot as needed.
Company Culture
As this was my first internship, I lack a basis for comparison. However, Pinterest's culture appeared positive and supportive. I never felt pressured to overwork, and my colleagues were consistently helpful and approachable. The general atmosphere suggested that employees were content with their work environment and fulfilled by what they were doing.
A Day in the Life of a Pintern
Pinterest offered a flexible work environment, with offices worldwide and the option to work remotely. Given my team's mostly remote setup and the long commute to offices (2 hours to Palo Alto, 1.5 hours to San Francisco), I chose to work from home.
My typical day started later than I initially expected. I'd use the morning for a quick bike ride, then begin work around 10 or 11 AM. Slack was our primary communication tool. For the first couple of weeks, I had daily video calls with my mentor, but we eventually transitioned to Slack messages. Each day, I'd update my team on my progress and plans for the next day, which kept everyone informed and helped me stay on track.
Pinterest made efforts to maintain social connections despite the remote setting. They organized various events, including a fun DIY resin coaster activity where they mailed us kits to make at home. It was a nice creative break from coding.
The highlight of my internship was a day at the Palo Alto office. We had lunch with Jeremy King, the CTO, which was an incredible opportunity. There were only about 10 interns present, creating an intimate setting for discussion. I asked him about developments in sales and marketing, and he explained Pinterest's shift from CPM to CPC advertising models. Interestingly, I learned that the CTO is also an early-morning cyclist like me, which added a personal touch to our interaction.
After lunch, we participated in a virtual escape room. While I had hoped for an in-person activity, it still provided a good team-building experience. We also had the chance to meet other executives like Sabrina Ellis (Chief Product Officer) and David Chaiken (Chief Architect). It was nice to see the human side of these executives and hear their takes on where Pinterest is heading.
Looking back, even though a lot was virtual, it was a pretty unique experience. Not many internships let you pick the brain of C-suite execs while making resin coasters in your pajamas.
Makeathon
Towards the end of July, Pinterest held this big internal hackathon they call a Makeathon. It's a 3-day event where everyone in the company can work on whatever they want, as long as it somehow benefits Pinterest.
I teamed up with two other interns to build a tool that made it easier for people to share code with each other. We didn't win the whole thing, but we did make it to grand finalist status. Considering only about a quarter of the hundreds of submissions got that far, we were pretty proud.
It was a great way to wrap up the internship. We had the opportunity to work with different technologies than those used in our projects and see how our ideas compared to those from the rest of the company. Plus, it was enjoyable to flex our creativity muscles and build something with other interns.
Concluding Thoughts and Acknowledgements
I'm really grateful for the experience. Given how competitive the Pinterest internship program was, I feel lucky to have been selected. I appreciate that my manager let me dive into backend development and learn Kubernetes and Apache Spark on the job. It showed me that Pinterest invests in its employees, even if it meant my contribution rate might have been lower while I was getting up to speed.
The company culture was fantastic—everyone was friendly and supportive, which made the whole experience enjoyable.
Here’s what I’m taking away:
Navigating a Big Tech Company: I learned how to build relationships with managers and colleagues and figured out how to find the right experts when I needed help.
Taking Ownership: I gained a lot from learning to take ownership of my projects rather than just following orders. My manager encouraged me to lead my own work, which taught me valuable lessons about responsibility and initiative: a form of leadership applied to my own tasks rather than to a team.
The technical skills I picked up—like debugging Kubernetes issues, working with Kafka, and managing sidecar containers—were not covered in my coursework, making them especially valuable.
I want to thank my recruiter, Shelley Hernandez, my manager, Ang Zhang, and my mentors, Rainie Li and Hengzhe Guo. I also appreciate the help from Ashim Shrestha, our site reliability engineer, and William Tom, a fellow Cal grad and team member.
In-Depth Technical Project
I purposely kept this note light on the technical aspects of my work until now. Let's delve deeper. First, an overview of the landscape:
Pinterest's big data and ML teams process petabytes of data. This scale demands distributed computing. Engineers write code to split massive datasets, process chunks across multiple computers, and aggregate results. This follows Apache Spark's driver-executor model, where the driver distributes tasks and data to executors.
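To make the driver-executor split concrete, here is a minimal PySpark sketch of a distributed aggregation. It is purely illustrative (the tiny in-memory dataset and app name are made up), not code from Pinterest:

    from pyspark.sql import SparkSession

    # The driver (this script) describes the computation; Spark splits the data into
    # partitions and ships the per-partition work to executor processes.
    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["pin boards and ideas", "ideas for boards"], numSlices=2)
    counts = (lines.flatMap(str.split)                 # runs on executors, per partition
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))   # shuffle, then aggregate

    print(counts.collect())                            # results come back to the driver
    spark.stop()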
All this runs "in the cloud" - specifically, in Kubernetes clusters on AWS EKS. Kubernetes doesn't natively support Apache Spark, but it's extendable via "operators." The Apache Spark Operator enables Spark applications to run in Kubernetes, with drivers and executors in separate pods.
The Spark Operator watches for custom resources, instances of a Custom Resource Definition (CRD), that describe Spark applications submitted by Pinterest software. These resources carry the Spark configuration, which the operator uses to run the spark-submit command and create the driver pod; the driver then spawns the executor pods.
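For a sense of what such a submission looks like, here is a rough sketch using the open-source operator's SparkApplication resource and the official Kubernetes Python client. The namespace, image, and resource values are placeholders, and Pinterest's actual submission tooling is its own internal system:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in the cluster

    spark_app = {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": "example-etl", "namespace": "spark-jobs"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": "example.registry/spark:3.2.0",        # hypothetical image
            "mainApplicationFile": "local:///opt/app/etl.py",
            "sparkVersion": "3.2.0",
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"cores": 2, "instances": 10, "memory": "4g"},
        },
    }

    # The operator watches this resource, runs spark-submit, and creates the driver pod.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace="spark-jobs",
        plural="sparkapplications",
        body=spark_app,
    )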
We use Yunikorn instead of the default Kubernetes scheduler, as it's better suited for Spark applications and batch data workflows. Yunikorn excels at gang scheduling, ensuring resources are available for both driver and executor pods before scheduling. The default scheduler, by contrast, might schedule driver pods without considering the resources needed for their executor pods. This can lead to a situation where the cluster runs out of resources before all executor pods are created, causing a slowdown or deadlock until running jobs complete. Yunikorn, understanding how Spark applications work, won't schedule a driver pod until there are enough resources for both the driver and its executors to complete their job.
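Here is a toy illustration of the gang-scheduling idea, boiled down to an admission check over the whole application (Yunikorn's real scheduler is far more sophisticated, and the numbers are made up):

    # Toy illustration of gang scheduling (not Yunikorn's actual algorithm).
    # A Spark app is only admitted if the cluster can fit its driver AND all executors.
    def can_admit(app, free_cpu, free_mem):
        need_cpu = app["driver_cpu"] + app["executors"] * app["executor_cpu"]
        need_mem = app["driver_mem"] + app["executors"] * app["executor_mem"]
        return need_cpu <= free_cpu and need_mem <= free_mem

    app = {"driver_cpu": 1, "driver_mem": 2, "executors": 10,
           "executor_cpu": 2, "executor_mem": 4}

    # A default scheduler might happily place the 1-CPU driver even though the 20 CPUs
    # of executors it will spawn don't fit, wedging the cluster; gang scheduling waits.
    print(can_admit(app, free_cpu=16, free_mem=64))   # False: hold the whole app back
    print(can_admit(app, free_cpu=32, free_mem=64))   # True: admit driver + executors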
Now, let's dive into the projects I worked on:
Upgrading Spark Operator
Removing Mutating Admission Webhooks
Improving Resource Tracking
Upgrading Spark Operator:
I synced Pinterest's fork, which had last been updated in 2022, with the upstream open-source repo. This wasn't a simple upgrade. I had to get the Git rebase just right, updating Go versions within Pinterest's commits while otherwise keeping them intact. Resolving conflicts between Pinterest's commits and the open-source version was a crash course in the Git CLI.
Once I cleaned up the Git history, I conducted extensive testing and planned a thorough rollout. I noted all new Spark Operator features added since the last sync, presented them to my manager and mentor, and we agreed on which ones to test and adopt.
We focused on two key features: a leader election mechanism for seamless operator transitions during failures, and the ability to watch multiple namespaces. I tested these extensively in the dev environment before rolling out to our adhoc environment, which runs near-production level Spark jobs daily.
I monitored the operator's performance, ensuring no jobs failed due to the upgrade. After a few weeks, we rolled it out to production. I also spent considerable time figuring out how to safely revert between Spark Operator versions if needed, despite some breaking changes. This was tricky, so I reached out to the open-source community for advice. The commit author of the breaking change responded with a plan, which, while manual and tedious, provided a viable fallback option. I tested this reversion process and wrote a script to automate it, sharing it with my team. This experience of collaborating with the open-source community was new and beneficial for me.
Removing Mutating Admission Webhooks:
Next, I tackled the mutating admission webhooks. We had a Kubernetes sidecar container attached to the Spark Operator that modified driver and executor pods via webhooks, injecting necessary configuration: access to the filesystem in their containers and permission to read and write there (volumes and volume mounts), parent pod information (owner references), pod scheduling priority (priority class), and node selection criteria (node selector). While these configurations were essential, the round trip to the webhook sidecar and the writes it made to the pods introduced unnecessary delays.
To streamline this, I needed to apply these configurations when running the spark-submit command. While Apache Spark doesn't accept these configuration fields, Pinterest has its own Spark code. I customized it to accept these new configurations and apply them to the driver and executor pods.
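To give a flavor of what applying these settings at spark-submit time looks like, here is a hedged sketch that assembles the equivalent --conf flags in Python. The node-selector and volume properties are standard upstream Spark-on-Kubernetes settings; the owner-reference and priority-class keys are hypothetical stand-ins for the custom options Pinterest's internal Spark code accepts:

    # Sketch: expressing the webhook's pod mutations as spark-submit configuration.
    # The first two property families exist in upstream Spark on Kubernetes; the last
    # two keys are hypothetical placeholders for custom options in the internal fork.
    confs = {
        # Node selection (upstream: spark.kubernetes.node.selector.[labelKey])
        "spark.kubernetes.node.selector.node-pool": "batch-workers",
        # Volumes and volume mounts (upstream volume configuration family)
        "spark.kubernetes.driver.volumes.hostPath.logs.mount.path": "/var/log/spark",
        "spark.kubernetes.driver.volumes.hostPath.logs.options.path": "/var/log/spark",
        # Hypothetical custom keys handled by the internal Spark code:
        "spark.kubernetes.driver.ownerReference.name": "example-parent-resource",
        "spark.kubernetes.driver.priorityClassName": "batch-high",
    }

    args = ["spark-submit", "--master", "k8s://https://example-cluster:443"]
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    args += ["local:///opt/app/etl.py"]

    print(" ".join(args))  # inspect the command the operator would run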
The code wasn't too complex, but it required digging through various codebases and studying similar implementations. The real challenge was testing. For each configuration previously handled by webhooks, I had to run Spark applications, examine the driver and executor YAML files, and verify that the configuration info appeared correctly.
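One such verification step might look roughly like this, using the Kubernetes Python client; the pod name, namespace, and expected values are placeholders:

    # Rough sketch of one verification check: read a driver pod back from the API
    # and confirm the settings the webhook used to inject are now present.
    from kubernetes import client, config

    config.load_kube_config()
    pod = client.CoreV1Api().read_namespaced_pod(
        name="example-etl-driver", namespace="spark-jobs")  # placeholder names

    assert (pod.spec.node_selector or {}).get("node-pool") == "batch-workers"
    assert pod.spec.priority_class_name == "batch-high"
    mount_paths = [m.mount_path
                   for c in pod.spec.containers for m in (c.volume_mounts or [])]
    assert "/var/log/spark" in mount_paths
    print("driver pod carries the expected configuration")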
There were bugs, and I had to fix my code multiple times. Once things were smooth in the dev environment, I moved to adhoc testing. This was trickier because we had jobs running with custom Spark code. I had to replicate these jobs using my Spark code as the base image. After confirming all adhoc jobs ran well, we rolled out to production.
Improving Resource Tracking:
The final project involved improving resource tracking for our internal billing team. Previously, we relied on Yunikorn for this, but it had limitations: it stored the data in memory only, so crashes led to inaccurate estimates, and we accessed the data through Yunikorn's logs, an approach that was being deprecated.
I needed to write code that used Yunikorn's event streaming API for resource tracking. This presented two major design decisions. First, where would the code live? If it ran outside the cluster, I would have to expose Yunikorn's API through something like a NodePort (easy to connect to, but lacking security), a LoadBalancer (a single IP address with some security rules, but a more complex setup), or an Ingress (a smart router with advanced features). Each had drawbacks in terms of security, complexity, or scalability.
After careful consideration, I decided to create a sidecar container that would run alongside the Yunikorn scheduler. This allowed native access to the localhost API endpoint without security complications and scaled well with multiple Yunikorn scheduler instances. It avoided the need to synchronize across multiple external instances and simplified the overall architecture.
The next major design decision involved how the data would be stored. My initial design streamed events into a Kafka topic, tracked resources inside the sidecar container, and uploaded snapshots to S3 along with the latest Kafka offset, so that resource usage could be recovered and brought back up to date after a crash. However, my manager opted for a simpler approach: save all Yunikorn events to a Kafka topic persisted to S3, then write a processing script to calculate resource usage from the events.
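Here is a stripped-down sketch of the first half of that design, the sidecar loop that forwards Yunikorn events to Kafka, assuming a Yunikorn event-stream endpoint on localhost and the kafka-python client. The endpoint path, port, topic name, and broker address are placeholders, not Pinterest's actual values; the processing script then reads these events back:

    # Stripped-down sidecar loop: read the scheduler's event stream over localhost
    # (it runs in the same pod) and forward each event to a Kafka topic.
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="kafka.example.internal:9092")

    with requests.get("http://localhost:9080/ws/v1/events/stream", stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # skip keep-alive blank lines
                producer.send("yunikorn-scheduler-events", value=line)

    producer.flush()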
While this approach meant the script had to loop through all events (as they weren't stored in S3 in an easily indexable way), it performed remarkably well. In instances where Yunikorn didn't go down, my script matched Yunikorn's calculations exactly. When Yunikorn did go down, my script calculated usage that was often significantly higher than Yunikorn's estimates - in some cases up to 1,100% greater. This massive discrepancy isn't an error, but actually represents more accurate tracking. When pods complete while Yunikorn is down, it never knows they completed and doesn't emit an event for them, completely omitting their resource usage from its summary. My script, however, could account for these "ghost" pods. While we couldn't know exactly when these pods finished, we knew they started (from Yunikorn's initial event) and that they finished sometime between Yunikorn going down and restarting. This allowed me to calculate lower and upper bounds for their resource consumption, providing a much more accurate picture of total resource usage.
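The bound calculation itself is straightforward; here is a simplified sketch with an illustrative CPU-seconds cost model (the function, field names, and numbers are made up for illustration):

    # Simplified sketch of the ghost-pod bound calculation.
    # For a pod whose completion event was lost, we know when it started and that it
    # finished somewhere between Yunikorn going down and Yunikorn restarting.
    def cpu_seconds_bounds(start, scheduler_down, scheduler_up, cpu_cores):
        lower = (scheduler_down - start) * cpu_cores   # finished right as Yunikorn died
        upper = (scheduler_up - start) * cpu_cores     # finished just before the restart
        return lower, upper

    # Example: a 4-core pod started at t=0; Yunikorn was down from t=600s to t=1500s.
    low, high = cpu_seconds_bounds(start=0, scheduler_down=600,
                                   scheduler_up=1500, cpu_cores=4)
    print(f"between {low} and {high} CPU-seconds")     # between 2400 and 6000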
In conclusion, the technical side of my internship was incredibly rewarding. I came in knowing nothing about Apache Spark, Kubernetes, or Yunikorn, and I'm proud of how much I learned in such a short time. I owe my success to my mentors and team members who patiently answered my questions, met with me over Google Meet, and walked me through tough situations. I learned an immense amount from them and from these projects, picking up things that were never taught to me in school.