This review was also published, in abridged form, at Slashdot (2006.07.26). Your feedback is welcome: i-at-ihearth4x0ring.info

The author, Brian Donovan, is a software engineer and writer who currently lives in Hong Kong with his wife and two cats.

Last modified: $Date: 2007-02-28 06:45:51 +0800 (Wed, 28 Feb 2007) $

Building Scalable Web Sites by Cal Henderson

full title: Building Scalable Web Sites: Building, Scaling, and Optimizing the Next Generation of Web Applications
author: Cal Henderson
pages: 330 (8 page index)
publisher: O'Reilly Media, Inc.
rating: 9/10
reviewer: Brian Donovan
ISBN: 0596102356
summary: If you've been kicking around the idea of doing a Web startup, then you should definitely give this book a read.
This book is primarily about web application design: the design of software and hardware systems for web applications.

Preface, What This Book is About (page xi)

Cal Henderson describes himself as "Flickr architect, PHP programmer, author and chronic complainer". He's also originally from London, England, which explains the missing third comma in that list. Thankfully, though, you won't find any British spellings (e.g. "colour", "pyjamas") or punctuation in his first book, Building Scalable Web Sites, published by O'Reilly. Henderson has been on the Web development scene for several years now (his personal site, iamcal.com, is just over six years old) and, while he's probably best known for his work on the Flickr team, you may have heard his name before in connection with sites like B3TA (he's one of the cofounders) and Barbelith.

Building Scalable Web Sites (BSWS) is a great book and it probably couldn't have been published at a better time. Though profitability remains elusive for most new ventures, launching a Web startup has never been easier. Hardware is cheap and both you and your prospective users are likely to have much more powerful machines than were available just a few years ago. Hosting's cheaper than ever too. In terms of labor, the latest crop of companies always seems to include at least one or two actual programmers among the founders. If you need more help, inexpensive offshore labor is always available, though the wisdom of going down that route is ... highly questionable. I remember reading notes from a talk given by Mark Fletcher, the founder of Bloglines, during which he allegedly exhorted founders to outsource (to eLance) as much as possible ("the first Bloglines notifier that ran on a Windows desktop cost $500, and was written by a couple of guys in Russia in 3 days").

Back to the book. It's not a step-by-step guide (and doesn't claim to be one), but Building Scalable Web Sites is the closest thing available to a nuts and bolts look at managing the technical aspects of doing a Web-based startup. There's lots of code inside, but the book isn't built around a single, extremely contrived, case study like an online wine store. Instead, the chapters follow a general pattern: a topic (like bottlenecks in your application and platform, scaling, or monitoring) is addressed, and the author sets forth and explains some rules of thumb for the way he feels things should be done, with lots of very specific hints and factoids mixed in along the way. Tools for other languages (in most cases, Perl) are mentioned in passing, but nearly all of the code snippets are in PHP. MySQL 4.1 is the basis for most of the database-centered material.

My comments are interspersed with a few (of the many) quotes that I found interesting or insightful.

Inside the book

The book opens with the author reminiscing about his entry into the field. Terrania, his first web app, was written in C++ with a C++ CGI interface. Details are sketchy but it appears to have been a virtual online world inhabited by player-owned creatures. No dates are given in the book but terrania.iamcal.com first shows up in the Internet Archive's Wayback Machine in Feb 2001. His UBB modding experiences, presumably his introduction to the world of scripting languages, are touched upon next. With another UBB hacker, he started ubbhackers.com (first shows up on web.archive.org's radar in late 2000) and moderated several forums there. In other words, Henderson progressed from writing C++ CGIs and modding messageboard software to being the server-side muscle behind Flickr in just a few years. Not too shabby!

Henderson's resume indicates that he joined Ludicorp about a year before they shut down GNE (Wikipedia entry), their Web-based roleplaying game, to focus on Flickr. It's his role as web development lead at Ludicorp that led to the inclusion of the "The Flickr Way" sub-subtitle that runs diagonally across the upper right corner of the book's front cover.

A web app is built like a trifle

Slide from Cal Henderson's "How We Built Flickr" presentation; see also Jeffrey Veen's thoughts on it (26 June 2005).

Chapter 2, "Web Application Architecture", begins with Henderson drawing an analogy between a web app and a type of dessert known as a trifle - the sponge cake at the bottom of the dish is the database, the next layer up (Jell-O) is the business logic, and so on. The black and white image in the text is identical to the color image included in a slide from an eight-hour workshop that the author gave in San Francisco titled "How We Built Flickr" (20 June 2005 review of the same workshop from Henderson's Barbelith collaborator Tom Coates). Having read the book and some reviews of his workshops and looked at the list of talks on Henderson's site (some with PowerPoint decks for download), it seems as though a lot of the ideas expressed in the book were developed over an extended period and through repeated presentations. Higher Order Perl author Mark Jason Dominus (blog) is using a similar strategy, doing a "World Tour" of Perl Mongers groups, to hone material for an upcoming book of his own on "red flags" in Perl code.

Henderson's sense of humor is evident throughout the book, but not in the annoying, overly cutesy way that made me want to toss "Extreme Programming Installed" into the circular filing drawer. In the section on software interface design (where he means the interfaces between the layers of the trifle), for example, there's a "Web Application Scale of Stupidity" that places "sanity" in the center and OGF (one giant function) and OOP at the extremes. The process of separating web app logic from presentation is broken down into three steps: separating logic code from markup, splitting the markup into per-page files, and moving to a templating system. He closes out the chapter with a breakdown of the hosting, hardware, and networking issues involved in serving up web apps.
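To make the three-step separation a little more concrete, here is a minimal sketch of my own (it is not taken from the book, whose snippets are in PHP). I'm using Python's standard `string.Template` as a stand-in for a real templating system, and `load_photo`, `PAGE_TEMPLATES`, and `render` are hypothetical names invented for illustration.

```python
import html
from string import Template

# Step 1: logic code lives in plain functions that return data, not markup.
def load_photo(photo_id):
    # Hypothetical stand-in for a database lookup.
    return {"id": photo_id, "title": "Trifle", "owner": "cal"}

# Steps 2 and 3: markup lives in per-page templates (inlined here as strings;
# in practice each would be its own file) rendered by a templating system.
PAGE_TEMPLATES = {
    "photo": Template("<h1>$title</h1>\n<p>Photo $id by $owner</p>"),
}

def render(page, data):
    # Escape every value so data cannot inject markup into the page.
    safe = {key: html.escape(str(value)) for key, value in data.items()}
    return PAGE_TEMPLATES[page].substitute(safe)

print(render("photo", load_photo(42)))
```

The point of the split is that `load_photo` can change where its data comes from without touching the markup, and the template can be restyled without touching the logic.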

While the in-crowd might make a big thing of XHTML and standards-compliance, it's worth remembering that you can be standards-compliant while using HTML 4. It's just a different standard.

Chapter 2: Web Application Architecture, Layered Technologies (page 9)

Next up are the considerations around development environments, beginning with a 3-point list of guidelines for building small-scale web apps up into big ones: use source control, have a one-step build process (literally, if possible, a single button), and track bugs (as well as non-bug items like features and support requests). Readers get to feast their eyes on a cropped screenshot of Flickr's build control panel (two buttons, "perform staging" and "perform deployment", to match the last two steps in the release sequence in an HTML form). For small teams, the author is in favor of allowing multiple developers to trigger releases and he suggests several ways of trying to keep that workable. In version control, Subversion gets the nod and, though no bugtracking tool is singled out as the best, FogBugz garners the highest praise ("extremely effective") and has the shortest list of "cons". The author never comes out and says what the Flickr / Flickr-Yahoo team uses in either area, however.
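As a rough illustration of what a "one-step" release can look like, here is a small sketch of my own. It is an assumption about the general shape of such a process, not a description of Flickr's tooling: the step names echo the two buttons in the screenshot, the first step is my guess, and the actions are placeholders that a real setup would replace with calls to version control and file copying.

```python
def run_release(steps):
    """Run each (name, action) pair in order; stop at the first failure."""
    completed = []
    for name, action in steps:
        if not action():
            return completed, name  # name of the step that failed
        completed.append(name)
    return completed, None

# Hypothetical steps; each action returns True on success.
steps = [
    ("export from source control", lambda: True),
    ("perform staging", lambda: True),
    ("perform deployment", lambda: True),
]

completed, failed = run_release(steps)
print(completed, failed)
```

Because the whole sequence hangs off one entry point, anyone on a small team can trigger a release the same way every time, and a failed step stops the run before anything later is attempted.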

It's more important for people on a team to agree to a single coding style than it is to find the perfect style.

Chapter 3: Development Environments, Coding Standards (page 65)

I enjoyed the entire book, but two chapters really shone. Henderson tackles internationalization, localization, and Unicode in chapter four and it's the most readable introduction to the issues that I've seen yet. MySQL's currently incomplete implementation of UTF-8, sarcastically referred to by some as "UTF-7½", is mentioned in enough detail that a reader can decide whether or not it's likely to be an issue for their app. The book as a whole is packed with little nuggets of information like that - things you might not have otherwise been even peripherally aware of until they bit you.
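If you want a quick way to gauge your own exposure, the sketch below may help. It rests on an assumption of mine about which gap matters: MySQL's `utf8` character set of that era stores at most three bytes per character, so characters whose UTF-8 encoding needs four bytes (those outside the Basic Multilingual Plane) cannot be stored. The function name is hypothetical and the check is in Python rather than the book's PHP.

```python
def needs_four_byte_utf8(text):
    """Return the characters in text whose UTF-8 encoding is 4 bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]

# Accented Latin letters fit in two bytes each, so nothing is flagged.
print(needs_four_byte_utf8("naïve café"))

# U+1D11E (musical symbol G clef) lies outside the BMP and needs four bytes.
print(needs_four_byte_utf8("clef: \U0001D11E"))
```

Running something like this over a sample of real user input gives a rough sense of whether the limitation is likely to bite your app.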

The solutions that were reached [with regard to character encoding and processing] cut out a lot of the hard work for developers - it's now almost trivially easy to create a multilanguage application, with only a few simple bits of knowledge.

Chapter 4: i18n, L10n, and Unicode (page 69)

The coverage of handling emails programmatically is also quite good. Henderson does the basics and then delves into a number of possible pitfalls in considerable detail. The salient aspects of the TNEF (media type application/ms-tnef) format, used by MS Outlook for attachments and metadata, for instance, are explained and pointers are given to open source TNEF parser implementations. I also got a lot out of the section on dealing with email from wireless devices like mobile phones, titled "Wireless Carriers Hate You" (there's that dry British wit again).
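For a sense of what spotting a TNEF attachment involves, here is a short sketch of my own using Python's standard `email` package (the book works in PHP, so this is not its code). The sample message, addresses, and payload bytes are made up, and `find_tnef_parts` is a hypothetical helper; actually decoding the attachment would still require a separate TNEF parser.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def find_tnef_parts(raw_bytes):
    """Return filenames of parts whose media type is application/ms-tnef."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    return [
        part.get_filename()
        for part in msg.walk()
        if part.get_content_type() == "application/ms-tnef"
    ]

# Build a small sample message (hypothetical addresses, placeholder payload).
sample = EmailMessage()
sample["From"] = "sender@example.com"
sample["To"] = "upload@example.com"
sample["Subject"] = "Photos"
sample.set_content("See attached.")
sample.add_attachment(b"placeholder bytes", maintype="application",
                      subtype="ms-tnef", filename="winmail.dat")

print(find_tnef_parts(sample.as_bytes()))
```

Checking the media type rather than the filename is the safer test, since the attachment name is under the sender's control.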

The second half of the book focuses more on scalability. It's also where you'll find the most material on using MySQL, including but not limited to query profiling and optimization, a discussion of the merits of denormalizing once you begin to reach a certain scale, and a comparison of the different MySQL backends. There's an entire chapter devoted to finding and dealing with bottlenecks - how to determine whether your app is CPU-bound, I/O-bound, or context-switching-bound and what to do about it. The chapter on scaling begins by debunking the "scaling myth" (but he actually tackles several misconceptions at once - namely that scalability is synonymous with speed, that scalability is a byproduct of having written your app in Java, etc.) before getting into vertical vs. horizontal scaling (buying more powerful and expensive servers vs. adding more cheap servers), load balancing, and more.
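The denormalization trade-off is easier to see with a toy example. The sketch below is mine, not the book's: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the `photos` and `comments` tables and the function names are hypothetical. The idea is to store a comment count on the photo row so that reads skip the aggregate query, at the price of extra work on every write.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT,
                         comment_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, photo_id INTEGER, body TEXT);
""")
db.execute("INSERT INTO photos (id, title) VALUES (1, 'Trifle')")

def add_comment(photo_id, body):
    # The write path does extra work: insert the row and bump the stored
    # count in one transaction so the two cannot drift apart.
    with db:
        db.execute("INSERT INTO comments (photo_id, body) VALUES (?, ?)",
                   (photo_id, body))
        db.execute("UPDATE photos SET comment_count = comment_count + 1 "
                   "WHERE id = ?", (photo_id,))

def normalized_count(photo_id):
    # Normalized read: count the matching comment rows every time.
    return db.execute("SELECT COUNT(*) FROM comments WHERE photo_id = ?",
                      (photo_id,)).fetchone()[0]

def denormalized_count(photo_id):
    # Denormalized read: a single-row lookup by primary key.
    return db.execute("SELECT comment_count FROM photos WHERE id = ?",
                      (photo_id,)).fetchone()[0]

add_comment(1, "Nice sponge")
add_comment(1, "More custard")
print(normalized_count(1), denormalized_count(1))
```

Both reads return the same number here; the difference only starts to matter once the comments table is large and the count is requested far more often than comments are added.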

Why not 10/10?

Technically, I think that Building Scalable Web Sites is 100% ... but I came across a few niggling flaws.

Two dates given (both on page 155), 1990 for the creation of libxml and 1995 for the design of XML-RPC, are incorrect. I also spotted a handful of grammatical mistakes (probably proportionately fewer than in this review). I've already submitted these, along with the date mistakes, as errata through the form linked from the O'Reilly catalog page for the book.

Additionally, though the cover does say "The Flickr Way", you won't find many sentences that begin "At Flickr, we [...]". Aside from the "Rolling Your Own" section in chapter seven describing some custom middleware and a protocol that they whipped up for moving files around within their system, there aren't a lot of explicit details about the way that Flickr operates in the book. You'll actually get more insider info from Tim O'Reilly's "Database War Stories" entry regarding Flickr, which is based on Henderson's answers to questions posed by O'Reilly, than from this book. On the other hand, since Flickr is likely still evolving, most of that sort of information would probably be out of date fairly soon anyway.

If you'd like to get a feel for Henderson's style, chapter five ("Data Integrity and Security") is available as a PDF on the O'Reilly catalog page for the book, and he has also put some articles online (all PDFs, not much overlap with the material in BSWS).
