I'm a former Googler who managed the deployment of their content distribution network. Outside of the codebase challenges that nostrademons articulated very well, Google has to deal with various big-company concerns including shipping and logistics, human resources, legal issues in the many countries it operates in, hardware development, datacenter management, etc. To do this they have many talented people across disciplines. I was constantly looking for people to join our team while still maintaining the high level of ability and innate Googliness that Google looks for in an employee. By many measures, Google is still understaffed. Check out Royal Pingdom's profit-per-employee numbers: http://royal.pingdom.com/2011/05/17/apple-staff-profit-per-h.... The difference between 2008 and 2010 is incredible, and that factors in the growth in headcount. Google still needs to hire... not for world domination but to maintain and improve the services that we love.
Each one of those 50-odd services could be a full-fledged company on its own with a few hundred or even a few thousand employees (in fact, many of them were). I would even bet that as separate companies, they would add up to a lot more than the estimated 20,000 Google employees.
That is actually a really good example. Look at the competition, Conduit (http://techcrunch.com/2008/01/16/toolbar-company-conduit-rai...), a company that literally has a few hundred employees (150-499 according to Glassdoor). I figure that Google has about the same number as the competition does for each service, so the Chrome team would roughly match Mozilla (~300 employees), the Docs team would match Zoho (~1500 employees), Maps would match MapQuest (~150 employees), etc.
Not to nitpick, but Zoho also has a portfolio of several other products, including Accounting, CRM, Helpdesk, and Network Manager. And their total employee count, as you said, is ~1500.
That's what I thought before I joined. Once I got into the swing of things and started doing actual work, I was like "Goddamnit. We need more employees. There's too much work to do!"
The simple, non-confidential answer is that running a global service used by billions of people entails facing a lot of problems, on a daily basis, that most developers will never face. And that makes everything move slower.
Have you ever internationalized an app? It's not just a matter of translating every string in the UI, although that certainly is a pain in the ass. You also need to handle right-to-left text and mirrored UIs for Arabic/Hebrew/Farsi, and English text mixed in with that mirrored text (this is common in Israel), which requires a directionality reset. Did you know that you're not supposed to bold Chinese characters? The typographic convention is to make them red for emphasis - but only the Chinese characters; English text embedded in Chinese text (e.g., "President Obama" in a Chinese news article) should still be bolded as necessary, so you can't just use a global CSS rule. You know that several European countries format a million bucks as $1.000.000,00 instead of $1,000,000.00, right? And dates go DD/MM/YYYY instead of MM/DD/YYYY, and your date parsing routines have to understand that as well as your date display routines. Ever dealt with Russian plurals? In English, you pick the plural form based on whether the count is one or many; in Russian, it depends on whether the number ends in 1, 2-4, or 5-0, and sometimes on its position in the sentence. Can you do case-insensitive comparisons in multiple languages?
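For a rough flavor of a few of those, here's a sketch using JavaScript's Intl APIs. This is illustrative only - not what we actually used internally, and the locales and numbers are just examples:

    // Number formatting: same amount, very different output per locale.
    new Intl.NumberFormat('en-US').format(1000000);  // "1,000,000"
    new Intl.NumberFormat('de-DE').format(1000000);  // "1.000.000"

    // Date formatting: en-US is MM/DD/YYYY, en-GB is DD/MM/YYYY.
    var d = new Date(2011, 6, 4);                // July 4, 2011 (months are 0-based)
    new Intl.DateTimeFormat('en-US').format(d);  // "7/4/2011"
    new Intl.DateTimeFormat('en-GB').format(d);  // "04/07/2011"

    // Russian plural categories depend on how the number ends.
    var ru = new Intl.PluralRules('ru');
    ru.select(21);  // "one"  (ends in 1)
    ru.select(3);   // "few"  (ends in 2-4)
    ru.select(15);  // "many" (ends in 5-0, plus all the teens)

    // Case-insensitive comparison is locale-sensitive too.
    new Intl.Collator('en', { sensitivity: 'base' }).compare('resume', 'RESUME');  // 0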
And every single UI engineer needs to be familiar with these pitfalls. A lot are hidden behind frameworks and utility functions, but then every single UI engineer needs to learn the frameworks and utility functions, and understand the purpose behind the libraries so they know when they're appropriate to use.
Do you worry about efficiency when you program? Most programmers these days grab something like Python or Ruby whenever they need to do something, but doing a complex parsing routine in Python can easily run 100x slower than doing it in C++, and at Google scale that means 100x more machines. There are a bunch of open-source libraries for the type of internationalization problems above - but the ICU libraries can easily pull 50 MB or so of data files into RAM. If you're running on 20K machines (number pulled out of a hat), that's a terabyte of additional RAM consumed because you wanted to use an open-source library. Do you care about end-user latency? Most webdevs these days take for granted that they'll have JQuery or Prototype available to deal with cross-browser differences. One of the first things I did when I joined Google was try to convince them to allow JQuery for development, and as part of that I ran an experiment to figure out what the actual user impact would be. It turned out it would've doubled end-user latency at the time, and so we had to keep writing everything from scratch based on browser primitives.
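Spelling out that back-of-the-envelope RAM math, using the same invented numbers as above:

    // ~50 MB of i18n data loaded per machine, times a made-up fleet of 20,000:
    var totalMB = 50 * 20000;               // 1,000,000 MB
    var totalTB = totalMB / (1024 * 1024);  // ~0.95 TB -- call it a terabyte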
Then there're all the issues surrounding running a massively distributed system. Machines fail, packets get dropped, and you have to handle those failures gracefully. It's impossible to upgrade a distributed system all at once without downtime, so every time you make a change that crosses a component boundary, you need to make the new component backwards-compatible with the old one and run both versions of the code in prod until the whole fleet has been upgraded. Many familiar algorithms that work on one machine don't parallelize effectively, so you're limited in what you can compute. Even if they do, the standard libraries for them usually assume a single flat address space, so you have to really understand the algorithm and how to convert it to a distributed network of computers, and then reimplement it yourself.
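As a hypothetical illustration of that backwards-compatibility dance during a rolling upgrade (the field names and scenario are made up, not any real Google system):

    // v1 servers send {user_name: "bob"}; v2 servers send {user: {name: "bob"}}.
    // While the fleet is mid-upgrade, both shapes arrive, so the reader
    // has to handle both.
    function getUserName(msg) {
      if (msg.user && msg.user.name !== undefined) {
        return msg.user.name;   // new format
      }
      return msg.user_name;     // old format; delete once the whole fleet is on v2
    }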
If this sounds exciting, we're still hiring. :-) If it doesn't, well, if your startup gets big you'll have to face these problems anyway. Hopefully you can hire someone who finds them exciting.
That's all true, but I fear it might be painting a slightly unrealistically positive picture :-) There are also other kinds of reasons for why everything moves slower than one would hope, and they aren't quite as interesting.
Some system your project depends on got deprecated again, and you need to decide whether to spend the time to migrate to the replacement now, or hope it keeps working until the replacement has been deprecated too. The development configuration files for other teams' systems, which you need to inherit from to bring up a full test cluster for your own system, changed again, causing mysterious errors. Hopefully early on, rather than in the output of a data pipeline that takes 20 hours to run.
Or the system you're working on is somehow relevant to the outside world. There's a legal/political/PR problem, and actual engineering effort is needed to implement the minimal hack to get rid of the problem as soon as possible. Or your project has inexplicably, and apparently without anyone's knowledge, been marked as falling under Sarbanes-Oxley, and now some change needs to be explained in great detail to a SOX auditor. Which would of course be much more important than any work you were actually hoping to get done :-)
The code base is so massive and there are still so many unexpected dependencies that even with distributed and cached compilation, just the processes responsible for the distribution might be enough to bring your machine to its knees. Of course there's a process in place to try to keep the dependencies from growing out of control, but that's going to make your life much more difficult if you're working on the inconvenient parts of the code nearer the center of the dependency graph, which are more strictly regulated.
Also, there have historically been a lot of misguided projects that in practice failed but somehow didn't die. Usually they will get killed eventually, but they do tend to grow larger and live longer than they should. Maybe that's changed lately.
This is, of course, an unrealistically negative view. But at least in my experience these kinds of issues were more significant drags on productivity than the fun kind of issues you listed. Though that could be just because the boring kind of drag is much more noticeable. Google's a great place to work at in any case.
Yeah, I've run into all of those at times. There are some problems that are just inherent to working at a big company.
Most of the really frustrating ones are fairly avoidable if you (or your tech lead, if you're not leading the project) are reasonably adept at playing the big-company game. My usual reaction to deprecation of critical infrastructure is to complain loudly and persistently, and I can usually get a stay of execution for 6 months to a year, enough for someone else to be the early adopter and work out the kinks of the replacement, then move my own project over to it once it's stable. I've saved many hours for my team by continuing to use deprecated software until the replacement has been deprecated, and then switching straight to the replacement's replacement once it's stable.
The massive codebase is reasonably easy to navigate once you become adept with CodeSearch. Build performance problems are persistent, but I solve that by getting a new machine every time the opportunity presents itself and using them headless for builds & demos. I also open them up to my team so that other frontend engineers I work with can use them...I've got a mini server farm under my desk.
The legal/political/PR stuff can't really be helped, but I find that exhilarating in its own way, since you get to see how those systems work. I've learned to hate the Chinese government since working at Google, though.
BTW, all of these are potentially problems at a startup using open-source software and doing interesting things. It's just that there:
1.) Instead of worrying about your open-source libraries being deprecated, you pull the version you want into your own source tree and never touch it, at least for several years. Any attempt to upgrade to a new version is met with intense pain, and so you push the problem off and hope that either your company will go bankrupt or you'll get bought by some big company and the programmers there will rewrite your software entirely.
2.) Instead of complying with the legal requirements, you ignore them. If somebody pays attention to you, you're sued out of existence and you all look for new jobs. If they don't, you go about your merry way and hope nobody looks too closely at you.
3.) Instead of your dependencies growing out of control....aww hell, every company I've worked at that's more than 6 months old has had dependencies that were out of control.
4.) Instead of being on a misguided project that in practice failed but didn't die, you're on a misguided project that will fail and die spectacularly. And then you get to do it again for the next startup.
Startups can be a lot of fun too, but understand, you're screwed regardless of what you do. :-) Engineering is hard. Let's go shopping!
Funny, but I was thinking about some of the things you're addressing here a few months back, more in the form of an API that does real-time conversions for visitors by geo-targeting traffic for all types of sites.
So say I run a blog and I write a new blog post that happens to mention $1,000,000.00. It would be nice if a reader from Europe who ends up on the blog would naturally see a conversion appear next to that number in euros, or something to that effect - and not just a currency conversion, but one in the correct format (and similarly for other units, since the US seems to be the outlier when it comes to measuring things in Fahrenheit, inches and feet, miles, etc.).
Interesting you mention JQuery vs rolling your own from JS primitives. Not sure if you mean execution time or loading time? Any idea why G+ performs so badly on the iPhone compared with other sites that are using JQuery or similar?
The limiting factor in end-user browser latency is almost always download time. The cost of JQuery is primarily the cost of downloading JQuery. The advantage of doing everything sans libraries is that you ship only the code you actually need to use to the browser.
(It's the same reason why Google uses browser-sniffing instead of feature detection in its JS. It lets us avoid shipping IE-only hacks to W3C browsers, or W3C code to IE.)
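A toy contrast between the two approaches, just to show the idea (the server-side part is pseudocode, not any real serving setup):

    // Feature detection: every browser downloads both branches and decides
    // at runtime which one to use.
    function addListener(el, type, fn) {
      if (el.addEventListener) {
        el.addEventListener(type, fn, false);  // W3C browsers
      } else {
        el.attachEvent('on' + type, fn);       // old IE
      }
    }

    // Server-side sniffing (pseudocode): each browser downloads only its own
    // branch, so W3C browsers never pay for the IE-only code.
    //   if (userAgentIsIE) { serve(ieBundle); } else { serve(w3cBundle); }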
There're secondary issues with browser CPU time, and you eventually learn to avoid CSS or JS constructs that tend to peg the browser CPU. Anything that causes a reflow is expensive, and certain selector engine constructs require JQuery to traverse the whole page. But this is typically secondary to download time unless you're doing a complex animation or mousemove handler.
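For example, interleaving layout reads with style writes forces the browser to recompute layout inside the loop; batching the reads avoids that. This is the classic pattern, not Google-specific code, and "items" is just assumed to be an array of DOM elements:

    // Bad: reading offsetHeight right after writing a style forces the browser
    // to recompute layout on each pass through the loop.
    for (var i = 0; i < items.length; i++) {
      items[i].style.height = (items[i].offsetHeight + 10) + 'px';
    }

    // Better: do all the reads first, then all the writes, so the reads never
    // see a dirty layout and the browser isn't forced to reflow mid-loop.
    var heights = [];
    for (var i = 0; i < items.length; i++) {
      heights.push(items[i].offsetHeight);
    }
    for (var i = 0; i < items.length; i++) {
      items[i].style.height = (heights[i] + 10) + 'px';
    }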
G+ performs badly on the phone because it does a lot. G+ has so far prioritized features over latency, so they optimize for development time. The result often isn't the tightest code. They operate under very different constraints from websearch, though - typically, people open up G+ and then leave it open for a while, but they run a search and then hopefully immediately bounce off the SRP (search results page).
It does, and it works as well as the compiler can prove that the code is unused. For some stuff (like functions that are never syntactically referenced), that's trivial. In the general case, it's impossible, because of the halting problem.
G+ is a heavy user of Closure Compiler (websearch is too, actually, but we use it differently), and Google's JS coding conventions actually are designed with the compiler in mind. The thing is, most of this bloat isn't functions that are never used, it's functions that are rarely used. G+ is big because you can do a lot with it.
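A made-up illustration of the "provably unused" point (function names invented for the example): the first case is trivial for a compiler like Closure to strip; the second isn't, because whether the code ever runs depends on data the compiler can't see.

    // Trivially removable: nothing ever references this, so the compiler can
    // prove it's dead and drop it from the output.
    function neverCalled() {
      return 42;
    }

    // Not removable: renderAdminPanel is referenced, and whether it ever runs
    // depends on runtime data -- so it ships to every user even if almost
    // nobody is ever an admin in practice.
    function render(user) {
      if (user.isAdmin) {
        renderAdminPanel(user);
      }
    }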