Meet CTW’s new VP of Engineering, Cao Zhe (Part 1)
Path to CTW
This is our series on the team working at CTW.
Today, we’re interviewing CTW’s new Vice President of Engineering. In a team full of pioneers, this man stands out! He was one of the first people to introduce cutting-edge concepts like DevOps and Site Reliability Engineering to the Japan market. With a career spanning Ricoh, Rakuten, and Indeed, he’s built some of the highest performing teams in the business!
Tell us a bit about yourself
Well, I'm from China, from the northwest, a region called Ningxia. It’s about a two- to three-hour flight from Beijing. I came to Tokyo after graduating from university, which means I’ve been in Japan for around 18 years!
My first job here was at Ricoh. They’re a Japanese company doing printers, digital cameras… things like that. I started my career as a Java developer there, doing embedded development. I was there for around four years, mainly doing application development. After a while, I began looking at middleware as a career option. But I quickly saw that my knowledge and skill set were not good enough to do the job. So I thought, “Okay, time to get some training,” and I resigned and went back to university.
I entered the University of Tokyo and did my master’s degree in programming languages. I was looking to work with compilers, interpreters, this kind of thing, and I wanted to learn the principles behind the various programming languages.
What I really wanted to do there was to study Java and learn about how Java works in depth. But, you know, the university didn't have a Java professor then. They had a lab focusing on the Ruby programming language instead and my professor was actually a contributor to Ruby. He was one of its pioneers. I joined his laboratory and studied the Ruby interpreter, which was a different language but the underlying principles were the same. It meant that I was getting all the knowledge and writing skills that I'd need for other languages.
Where did you take these new skills?
After graduating, I moved to Rakuten where I worked in DevOps for the next 4 years. Basically, I was managing infrastructure.
Rakuten had a huge number of machines to manage then. In those days, we didn't have Kubernetes. We didn't have containers, right? So most of them were physical machines with maybe a few VMs. I remember there were around 20,000 servers! So we needed to control them, check if they were alive, provision accounts across all the servers, and all these kinds of things.
This was around 2011, and most companies were still handling their ops in traditional ways. The DevOps concept had been coined in Europe around 2009, I think. I started at Rakuten already knowing a lot about server management and quickly saw that the way we managed things was not very efficient. Right then, I also discovered the concept of DevOps. It was still quite new in the Japan market. We picked it up and launched the first DevOps team at Rakuten. We started automating our operations, introducing some frameworks and tools to handle all the servers, which is when I really started working on distributed systems.
After about a year of this, we successfully launched the system and the team, and I thought to myself, "Okay, it looks like the job is done. It's not perfect but it's working." We had the platform. We had the base to build on. The team was there and they could continue working on top of that.
By then, I was interested in doing something different, looking for some new challenges.
How did you transition to Site Reliability Engineering?
That’s when I moved to Indeed, when they were still pretty small in Japan. I remember I was the third member in the infrastructure department. So I thought, "Okay, wow, this is a new company, a small team, and they have very different challenges here."
Rakuten had a lot of servers but, mainly, they were in two places — two data centers.
But Indeed was different.
Indeed had a much smaller number of servers but everything was very well distributed globally. They had seven data centers around that time, and around 2000 servers. That was interesting to me because it was still distribution but it was kind of the next level of challenge. Right? So I moved there and started doing something quite similar to before but with a different set of problems.
What I did at Indeed was to start with their DevOps, doing infrastructure-related things. I quickly found that the way we did things wasn’t good enough for the market. I mean, what we were doing was kind of DevOps but we had more responsibilities beyond just infrastructure. We also needed to work on the application side.
That was a new thing for me. At Rakuten, I didn't need to worry about applications, I just had to catalog and distribute the infrastructure, the server side.
Another "first challenge."
At Indeed, the developers didn't do on-call responses then. The leadership asked Ops to do the on-call, so we had to understand the application code. I started working on this because we needed to respond to incidents and we had to understand how the applications were built. That’s when we discovered that there was something systemically wrong but we needed a better understanding of the code to fix things.
We were just kind of fighting fires every day because the early version of the app had fundamental problems.
The company was still small at that time, so it didn’t matter as much. But Indeed quickly began growing, scaling up in size significantly. So when they started having more and more users, more and more applications, we said, "Okay, this isn't working. So let's think about what we can do differently."
Again, I was quite lucky in that this was around 2012. Google had just published a book on a new concept called Site Reliability Engineering (SRE). And when I read the book, I thought "Wow, that's exactly what we should do! Let's do that. Let's adopt that." So I sent it to my boss, saying, "Hey, you see how Google are going through something quite similar? Have a look at what they did." And my boss agreed immediately. We bought the book for every member of the team — and it's a pretty thick book — but we ate it up.
Then in 2012 or 2013, we started trying out SRE-related things. We launched a test team, called the SRE Team, and at the same time still had traditional Ops. One group was keeping the broken train going, so to speak, while the other was making a new one.
The SRE team worked at it for a while until we were confident that it was working. Then, we began transitioning the rest of the team, which took around one year to fully transform into the SRE unit. We brought in new ways of thinking, new mechanisms, new processes, and a lot of new concepts into their skill set. We also hired new people and changed the talent requirements for the job.
And, again — working together — we successfully adapted to fit the company’s new needs. Right now, I think Indeed has an excellent SRE infrastructure. Before I left, the total team size was around 80 SRE professionals. They manage the entire service for all users including job seekers and employers.
At this point, I was the head of SRE for Job Seeker which was the biggest business unit at the company. The people were in place. I’d built a senior team of around 30 reports, distributed them across four offices — Indeed JP, US, India, and Europe — and they were all doing good things.
So I thought, "Okay! My mission here is done. Let’s look for some new challenges!" That’s when a headhunter pinged me to say, "Hey, I have a new job at Woven Planet Holdings." The project was initially called TRI-AD (Toyota Research Institute - Advanced Development). He said, "So they're doing car automation, self-driving cars, building cool products. They have plans to launch in a few years and they’re thinking about how to make their systems reliable. So they need a head of SRE. Do you want to have a try?"
I jumped on it.
At Indeed, we had an existing department which we’d kind of transformed into an SRE function. For Woven, they were providing me with the opportunity to build everything from the ground up. And I thought that sounded awesome. Because, you know, even though we successfully transformed the team at Indeed to fulfill the SRE function, I still felt some things were missing. It wasn't the ideal set of skills that I would have wanted now that I was armed with more knowledge.
So what was the next engineering challenge?
The Woven job, it was a blank canvas. My responsibility was to launch the SRE team, right?
But at the same time, I was also asked to take care of the company’s IT operations. Usually that’s two separate positions. One person heads IT Operations and the other heads SRE. Some companies think the two are the same thing. So after joining, I had to explain the differences to the leadership.
Operations is more defense. You protect the system. You provide reliability and services to the employees internally.
But SRE is more focused on the product side, things that have an impact on the market. We provide services to our end users.
One is more focused on the product, the other more on the internal IT infrastructure. So I suggested that we separate them and the leadership agreed.
The catch was that I’d head both.
Again, here the IT operations side was completely new to me. I’d done a lot on the SRE side, but I'd never done anything IT related. And that's a lot of very different things tied together. To be honest, I kind of underestimated the challenge involved. Indeed is an internet company, so it's native to the digital space.
That made me think that the IT department had a fairly easy job. You don't need to configure the mailbox. You use Google. There are so many SaaS products to rely on that you don't even mention them. What else do you need to do? It's very simple, right? That was my bad assumption going into Woven, and then I kind of hit a wall. I realized once again, "Woah, this is totally different from what I thought!"
The structures I was expecting just weren’t there. And Woven had their own way of doing things because it wasn’t just the communication software at issue. They had financial tools, Oracle-related apps, HR requirements, a bunch of vendors, suppliers, purchase platforms...
I had to relearn a lot of the basics.
So under the IT department, we had the engineering part, the operations part, SRE, architecture, and other teams. We all worked together but I mainly focused on the operations.
Now, while technically they already had an operations team, they were mainly contractors. The company was new and didn't have any experts in place. So they thought, "Let's use contractors to fill the gaps, right?" When I joined, they had 30 contractors serving around 2000 employees.
We needed to support our employees more productively and more efficiently. This was my first challenge.
Woven basically wanted to take the IT back from the contractors and the vendors. So I said, "Okay, we need to hire. We need to put full-time employees in place to handle these things."
But another challenge was that Woven had a limited understanding of what Ops should be doing. There were just repetitive tasks in place that people needed to do themselves. That wasn’t a fun environment for top devs. We couldn’t attract the talent we needed with such limited scope.
So the next thing I did was to expand the scope of Ops, from service management to operation management. We would begin delivering services to the employees while also moving away from manual systems to automated processes.
With that change, we were able to staff up with a very diverse and talented team.
Again, that went on for about a year. It wasn’t quite finished yet but the direction was set and we hired someone to head the IT Operations side. Before I left, I also launched the US and UK teams and put the leadership in place there to take care of things.
I handed over the operations side of things to my successor, launched the SRE team, started working on one product, and got it on the path to being production ready.
That’s when our CHRO at CTW contacted me and said, "Hey, we have an interesting new challenge here…"
Part 2 of the interview coming soon!
_________
For more information about our available positions, have a look below.
We’re actively looking for talent in the following roles.