Why Google Stores Billions of Lines of Code in a Single Repository is an excellent paper by Rachel Potvin and Josh Levenberg. They both work at Google, Rachel as an engineering manager and Josh as a software engineer. Their article, published in Communications of the ACM in July 2016, may be found here. They provide a fascinating deep dive into the way Google handles source code in a single monolithic repository: trunk-based development, the Google workflow, the Google-built tooling, the pros and cons, and an analysis of using a single repo at ultra-scale.
I really hope you read the paper; it’s wicked fascinating software engineering. I liked the piece so much that I decided to read it again and write some notes/summaries of the different topics it touches. Scroll through them below and read the paper (which is vastly more informative than what you can find here!).
Super-shortly:
Pros: unified versioning, extensive code sharing, simplified dependency management, atomic changes, large-scale refactoring, collaboration across teams, flexible code ownership, code visibility
Cons: having to create *and* scale tools for development and execution, and to maintain code health (plus potential codebase complexity)
The repo:
~1 billion files
~35 million commits
~86 TB of data
~2 billion lines of code
~9 million source files
2014: ~15 million lines of code changed in ~250,000 files per week. 25,000 users and avg 500,000 queries per second.
note: most of the traffic comes from Google’s automated build and test systems
Compare to Linux kernel: ~15 million lines of code in ~40 000 files
Google Piper design
– stores a single large repository
– implemented on top of standard Google infra, namely Spanner
– distributed on 10 datacenters
– Paxos for replica consistency
– Google infra and private networks cut the latency and deliver needed speed
– Google originally used a massive Perforce instance with custom-built caching and other infra for over 10 years
Piper security
– supports file-level access control lists
– most of the stuff seen by everyone, anything may be hidden if needed
– read/write access is logged; administrators can see who read a file and when
– purging of accidentally committed sensitive data
– for instance business critical secrets like algorithms might not be available for everyone (but: over 99 % of all version-controlled stuff is seen by all full-time Googlers)
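To make the access-control idea concrete, here is a minimal sketch of file-level ACLs with read logging. It is not how Piper works internally (the paper does not go into that); `AclStore`, its methods, and the example paths are all hypothetical names invented for illustration. The only ideas taken from the notes above are "visible to everyone unless restricted" and "reads are logged".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AclStore:
    """Hypothetical file-level ACLs: a path is readable by all unless restricted."""

    restricted: dict[str, set[str]] = field(default_factory=dict)
    read_log: list[tuple[str, str]] = field(default_factory=list)

    def restrict(self, path: str, allowed_users: set[str]) -> None:
        # Limit a single file to an explicit set of users.
        self.restricted[path] = set(allowed_users)

    def can_read(self, user: str, path: str) -> bool:
        allowed = self.restricted.get(path)
        return allowed is None or user in allowed

    def read(self, user: str, path: str) -> bool:
        # Successful reads are logged so they can be audited later.
        ok = self.can_read(user, path)
        if ok:
            self.read_log.append((user, path))
        return ok

    def readers_of(self, path: str) -> list[str]:
        # Who accessed this file? Useful before purging a leaked secret.
        return [user for user, logged_path in self.read_log if logged_path == path]
```

With this sketch, restricting `search/ranking.cc` to one user leaves every other path readable by anyone, and `readers_of` answers the "did anyone see it before it was removed" question.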
Piper workflow
– create a local copy, store files in the developer’s workspace
– – this is like a working copy in Subversion, a local clone in Git, or a client in Perforce
– pull updates from Piper
– share the workspace as a snapshot for other devs to review
– commit *only* after code-review
Clients in the Cloud, or CitC
– cloud-based storage backend + Linux-only FUSE fs
– Piper workspaces seen as directories in the fs
– support the usual Unix tools
– local changes laid on top
– browsing, searching, editing any files in the Piper repo
– only edited files stored locally
– avg workspace has <10 files while still showing everything in the Piper repo
– *all writes* stored automatically, can be tagged, named, and rolled back
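The "local changes laid on top" idea can be sketched as an overlay: reads fall through to the shared repo unless the file has been edited in the workspace. This is a toy model under my own assumptions, not the CitC implementation; `OverlayWorkspace` and its methods are hypothetical, and a plain dict stands in for the repo.

```python
class OverlayWorkspace:
    """Hypothetical workspace: local edits layered over a read-only repo view."""

    def __init__(self, repo_snapshot):
        self._repo = repo_snapshot   # shared view of the repo, never mutated here
        self._local = {}             # only edited files are stored
        self._snapshots = []         # state after every write, for rollback

    def read(self, path):
        # Local edits win; everything else comes straight from the repo.
        if path in self._local:
            return self._local[path]
        return self._repo[path]

    def write(self, path, content):
        self._local[path] = content
        self._snapshots.append(dict(self._local))

    def rollback(self, snapshot_index):
        # Restore the workspace to the state right after an earlier write.
        self._local = dict(self._snapshots[snapshot_index])

    def local_files(self):
        return sorted(self._local)
```

The workspace stays tiny (only the edited files) while `read` still resolves any path in the repo, which matches the "fewer than 10 files on average" observation.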
Trunk-based development
– vast majority of Piper users work on “head”, “trunk”, or “mainline”, that is the most recent version of everything
– all commits in there
– all changes seen by everyone using Piper after every commit (remember: commits only after code-review)
– branches are very rare except for releases
– releases usually a snapshot of the trunk + cherry-picks from it
– no dev branches, no feature branches, no nothing
– feature-development through the use of feature-flags in code
– feature-flags controlled by config files, no need for new binaries
– feature-flags typically used in project-specific code, not libraries
– easy to experiment with a small number of users
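A feature flag driven by a config file could look roughly like the sketch below. The JSON format, the flag names, the `rollout_percent` field, and the hashing scheme are all my assumptions for illustration; the paper only says that flags live in configuration and can be flipped without releasing a new binary.

```python
import json
import zlib

# Hypothetical config contents; in practice this would be a file read at runtime.
EXAMPLE_CONFIG = """
{
  "new_search_ui": {"enabled": true, "rollout_percent": 5},
  "old_export": {"enabled": false}
}
"""


def load_flags(config_text):
    """Parse flag definitions from config text (JSON is an assumption here)."""
    return json.loads(config_text)


def is_enabled(flags, name, user_id):
    """Return True if the flag is on for this user.

    Users are placed in a stable bucket 0..99 so the same user always
    gets the same answer for the same flag.
    """
    flag = flags.get(name)
    if not flag or not flag.get("enabled", False):
        return False
    percent = flag.get("rollout_percent", 100)
    bucket = zlib.crc32(f"{name}:{user_id}".encode("utf-8")) % 100
    return bucket < percent
```

New code paths sit behind `is_enabled(...)` on trunk, and widening the experiment is a config change from 5 to 50 to 100 rather than a branch merge.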
Code review
– nothing is committed without a code review
– the committer can enable a flag for auto-commit if the review passes
– the reviewers have tools for viewing and adjusting the code easily anywhere in the Piper repo (tools are named Critique and CodeSearch)
– commits have to be accepted by directory owners
– remember: the whole Piper repo is available for anyone -> anyone can propose changes in any piece of code anywhere, but the owners of directories have to accept them
– directory owners are the people most familiar with the code/project/library in question
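The owner-approval rule can be sketched as a small check: every changed file needs an approval from someone who owns its directory. The lookup that walks up to the nearest parent directory with owners is my assumption, and `owners_for`, `can_commit`, and the data layout are hypothetical; the paper only states that directory owners decide whether a change is accepted.

```python
from pathlib import PurePosixPath


def owners_for(path, owners_by_dir):
    """Return the owners of the nearest enclosing directory that has any.

    owners_by_dir maps a directory path to a set of user names; the repo
    root is keyed as ".".
    """
    for directory in PurePosixPath(path).parents:
        key = str(directory)
        if key in owners_by_dir:
            return owners_by_dir[key]
    return set()


def can_commit(changed_files, approvals, owners_by_dir):
    """A change may land only if each file has an approving owner."""
    return all(owners_for(f, owners_by_dir) & approvals for f in changed_files)
```

Anyone can propose the change; `can_commit` only turns true once the set of approvers covers every touched directory.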
Commit-infra & refactoring
– automatic rebuild of all dependencies, testing
– automatic rollback in case of widespread breakage
– vast and customizable pre-submit testing and analysis, runs before anything is committed
– static analysis system called Tricorder
– – provides data on code quality, test coverage, test results
– – provides automatic suggestions for fixes with one-click applying
– – triggered after all changes and periodically
– – used to ensure codebase health
– set of devs periodically dig through Piper directories to refactor code in order to keep it healthy
– large backwards-compatible changes first, removing unused paths second
– a tool called Rosie supports that by splitting the large patches made by the devs into smaller patches that are individually reviewed by the directory owners
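The patch-splitting step can be pictured as grouping the changed files of one big patch by whoever has to review them. This is only a sketch of the idea, not Rosie itself; `split_patch` and the top-level-directory grouping rule are hypothetical stand-ins for whatever ownership boundaries the real tool uses.

```python
from collections import defaultdict


def top_level_dir(path):
    # Hypothetical grouping key: the first path component.
    return path.split("/", 1)[0]


def split_patch(changed_files, group_key=top_level_dir):
    """Split one large patch into smaller per-group patches.

    Returns a dict mapping each group (e.g. an owning directory) to the
    sorted list of files that belong in that group's patch.
    """
    shards = defaultdict(list)
    for path in changed_files:
        shards[group_key(path)].append(path)
    return {group: sorted(paths) for group, paths in shards.items()}
```

Each resulting shard can then be reviewed and committed on its own, which is why the large change has to be backwards-compatible first.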
Analysis
Advantages
– unified versioning, one source of truth
– extensive code-sharing and reuse
– simplified dependency management
– atomic changes
– large-scale refactoring
– collaboration across teams
– flexible team boundaries, code ownership and visibility, implicit namespacing
– all code depends on other code directly
– the diamond-dependency problem is gone
– atomic changes enable refactorings of variables or api calls for hundreds of thousands of files without test/build breakage (in a single commit)
– engineers don’t depend on specific versions -> no need to update them
– all files uniquely identified
– a good example:
– – the Google compiler team can run regression etc. tests nightly on all affected code and validate new versions
– – code can be refactored to support new versions of compilers before shipping them
– – ~20 compiler releases a year
– – compilers can be tuned to use best possible default settings
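For readers who have not run into it, the diamond-dependency problem mentioned above shows up when two of your dependencies require different versions of the same third library. A minimal sketch that detects it in a versioned dependency graph follows; the graph format and `find_version_conflicts` are made up for illustration.

```python
def find_version_conflicts(deps, root):
    """Find libraries required at more than one version below root.

    deps maps a package name to a list of (dependency, required_version)
    pairs. Returns {dependency: set_of_versions} for every conflict.
    """
    required = {}
    seen = set()
    stack = [root]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        for dep, version in deps.get(pkg, []):
            required.setdefault(dep, set()).add(version)
            stack.append(dep)
    return {dep: versions for dep, versions in required.items() if len(versions) > 1}


# Multi-repo world: B and C pin different versions of D, so A is stuck.
versioned = {
    "A": [("B", "1.0"), ("C", "1.0")],
    "B": [("D", "1.0")],
    "C": [("D", "2.0")],
}

# Monorepo world: everything builds against the single version at head.
at_head = {
    "A": [("B", "head"), ("C", "head")],
    "B": [("D", "head")],
    "C": [("D", "head")],
}
```

Running the function on `versioned` reports `D` as a conflict, while `at_head` yields no conflicts, because there is only ever one version of anything to depend on.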
Drawbacks, trade-offs etc.
– tooling investment is HUGE
– couldn’t be used without all the special support-systems
– codebase complexity
– unnecessary dependencies
– discoverability difficulties
– effort in code health
– sometimes hard to explore code
– the usual suspects like grep unusable from time to time
– too easy to add dependencies -> unused dependencies
– lack of will to write documentation if everyone can look up the apis themselves
– depending on more than just the api because you see how the code works
Alternatives
– the popularity of DVCSs has grown -> moving to one has been investigated
– moving to a DVCS (e.g. Git) would require a split into thousands of repos
– – Android is hosted in Git and split into north of 800 repos
– currently available DVCSs don’t provide the needed security controls
– investigating whether Mercurial could be made to support Google scale
Check out the wonderful paper here.