Humans of Computer Systems: Goku

Programming

How did you learn to program?

In college


Tell us about the most interesting/significant piece of code you wrote.

Contributed to ZFS and Oracle file systems (writeback, journal, snapshot compaction)


Who did you learn most from about computer systems?

books, internet, architects


Who is the greatest programmer you met, and what is impressive about them?

Brian Behlendorf (ZFS on Linux maintainer), Matt Ahrens


What is the best code you have seen?

ZFS, Tintri file systems, YugabyteDB, Linux kernel


What do you believe are the most important skills to be successful in your field?

understanding the fundamentals, not giving up, digging deep


What quality or ability do you value most in a computer systems person?

  1. Ability to debug code that others have written, to the point that the RCA (root cause analysis) is correct and not superficial
  2. Ability to write clean, well-designed code at speed
  3. Simplicity in the complete process of systems software development
  4. Adding meaningful debugging aids to help RCA field issues


Personal

Which of your work/code/accomplishments are you most proud of?

  1. ZFS snapshot automounting
  2. Tintri file system (fixing customer issues)
  3. RCAing and fixing the most complex issues, most of which were more than a decade old


What comes to you easy that others find hard? What are your superpowers?

  1. Debugging complex systems code
  2. Generalising and simplifying systems concepts
  3. Clarity of what needs to be done
  4. Coming up with new ideas
  5. Not giving up


What was a blessing in disguise for you? What seemed like a failure at the time but led to something better later for you?

Initially I struggled a lot with debugging complex issues. I never used to RCA issues completely correctly. It was more like "yes, this might be the issue", but I was never really sure about it.

But by never giving up and always being curious to know what happened, it all fell into place. These days I am able to RCA many issues very quickly, sometimes even complex ones in a day, which take others months to RCA.

So, now I feel all that initial trouble was just worth it and all those failures have improved me a lot.


What do you feel most grateful for?

Being in the line (systems) which I am very passionate about. Basically earning my bread from what I like to do.


What does your perfect day look like?

Work on the feature (design discussions, code reviews, writing code, RCAing issues, meetings)

Work on customer issues (sometimes)


What made you most happy in the last year?

RCA of critical issues (fsck, Linux inode corruption)

Working on a file system feature (can't mention much)

Joining a distributed systems reading group

Being on Twitter, reading blogs and tweets of folks in my line


Work

What was your biggest mess up? What was the aftermath?

Very early in my career, I was assigned some work in ZFS, and for it I took an extra hold, and the code was checked in. Later we started to see commands hanging and unmounts hanging. It took 2 months to RCA, and it turned out to be a regression from my change; my company's CTO was the one who RCAed it.

That day I learned you cannot write code based on assumptions. Having said that, production-grade systems code is so complex that even if we want to make no assumptions, we still end up making some, only to figure out later that we have introduced a regression. The effort in this direction should be to think things through thoroughly, revalidate the code again so that most issues are caught while writing and self-reviewing it, and test the code using all the available testing infra.


What was your most interesting/surprising or disappointing interaction at work?

Surprising : Most of the time I pick up tough work and I always wonder whether I will be able to pull it off. But I always do. It is always a rough ride, but the satisfaction of getting it done is immense.

Interesting : I was working for a startup and writing server storage code, which needed to integrate with a high availability (HA) library. I wrote the integration in a day and it was not working. So I started root-causing the issues, only to notice that all of them were in halib, and I asked the halib developer what to do. He said I was welcome to fix them. So I fixed all the issues that same day, and the integration was successful and worked smoothly. What I learnt from this was that the halib developer had a big heart in acknowledging the issues and an even bigger heart in allowing me to fix them. So essentially software development is all about teamwork. If you have great team members, doing any work is always a pleasure.

Disappointing : Management said "focus on quality" in every all-hands, yet when I, despite being a developer, came up with the idea of incorporating novel testing approaches and found many corruption issues, it was completely ignored.


What do you like most about your job/profession?

Systems software development.

Reading research papers.

Interacting with other technical folks.

Attending conferences, meetups


What do you dislike most about your job/profession?

Politics affecting growth of good engineers.

Management saying something is important when it was only lip service; they never actually meant it.


What would be the single change that would improve your work environment most?

Management being more objective in growth of engineers.


Technical

What do you think are the hardest questions in your field?

Production-grade stable code. All the other stuff is also very challenging, but stability is something which, even after years of development, is very hard to achieve.


What are you most disappointed about the state-of-the-art in your field?

Novelty of ideas (most, though called something different, are internally the same stuff)


What is your favorite computer systems paper? Why?

Not one but many in the areas of,

corruption handling, recovery

consensus (though not read many)

file systems


Story

I had been working on a file system for a customer for 3 months and had just got the knowledge transfer. There was a customer issue and I was told to look into it. All the file system engineers had left the company (since the company had gone bankrupt and another company had bought it).

When I spoke with the customer, he was very angry: support had not been able to bring the box up, there had been 1 month of downtime, and the issue had escalated a lot.

Being new to this file system, I was internally not sure what was going to happen.

But I assured him that, given 2 days, I would come up with a solution. His reaction was that we had wasted enough time already. Somehow I pacified him and he was OK with 2 days.

I started going through the issue bundle to figure out what it was, and to my surprise it was an issue hit in the clone promotion code path, which I was not even familiar with. Digging more into it, I realized there was no way I was going to RCA and fix this issue in 2 days.

Then I started looking at whether, even without fixing the issue, we could bring the customer's machine back online. Luckily there was a tunable to disable the feature.

That's when I spoke with the support team to arrange a call with the customer. I accessed the box, changed the tunable to disable the feature, and voila, the box was up :)

I thought the customer would be happy now, but he was not :) He said this is all OK, but it is not the fix; what is the fix, and when can I get it?

At that point I had to tell him that the issue was being actively worked on, and that in 6-7 years of production deployments we had never seen it, so it was as new to us as it was to him. We would RCA and fix the issue as soon as possible in an upcoming release, and we would not just drop it. This reasoning resonated well with him.

So the learnings were:

- Be confident that you will solve the problem, even if you know it's your first time with this product. If we ourselves are not confident, customers tend to panic.

- If solving the problem takes time, try to give the customer what they need, in this case no downtime.

- After going through all the hard work, tell customers the truth. They do understand that you are human too, and will appreciate your work if you have really worked hard to solve their issue.

I got appreciation throughout the company for handling this issue.


If you enjoy reading this series, consider taking 10 minutes and submitting a response. All questions are optional. You can skip most, and tell a lot more on other questions you choose.
