Extending the life of CVS with Python

Author: Anthon van der Neut
Date: 2006-02-21

Introduction

The same weekend I got confirmation that my talk was accepted, the Python mailing list had a message that Python development had moved on to use Subversion. I briefly considered whether I should move the company I work for over to Subversion as well and change the title of this presentation, but as you will see, that is not a short-term option, nor necessary.

Before I introduce myself, let's make sure nobody is disappointed at the end of the talk: this talk is not about the Python language or specific features, nor about open source software. Although, if requested, I can probably get permission to publish some of these components.

I will tell you why we are still using CVS and how Python helped to keep it a successful, low-maintenance enterprise solution despite some of its deficiencies.

Structure

First I will tell you something about my background and about Pinnacle, the company I work for, so you will have an idea of the problems Pinnacle had (apart from having me as an employee) developing their products.

Then I will present, with some historical development notes, how we solved those problems and what role Python played in that. If I am not as long-winded as usual we will have time for questions; otherwise feel free to talk to me afterwards and/or email me.

My background

"When making software make sure you automate going from source to a CD."

My name is Anthon van der Neut. I am originally from Holland, and as you all know from experience, that not only gives me the advantage of always being right, it also entitles me to tell everyone how things should be done.

I have been professionally involved in software development for 23 years now, though not really as a programmer, since I have been primarily involved with management for 22 of those years. But I like to stay in touch, and as part of my management task I have always tried to help my developers by automating tasks around software development so they could get their work done. Originally I wrote most of those tools in C, then for about 10 years in C++, and since 1997 mostly in Python.

Apart from a two year premature stint at setting up a nanotechnology company, I have always been working in the 3D and 2D computer graphics area.

My experience with releasing software: make sure, as an engineering manager, that you know how to make changes to the code and get them into the released product without outside help. This means automating the build process of compiling and making the installers.

That means that everyone, including the people providing revision control support in your company, should provide you with tools to do your job. They should not do part of that job for you and make you dependent on them all the time. Just like your engineers should be able to compile their software themselves (instead of handing a stack of punchcards to an operator).

The company I work for

Since 2001 I have been working for Pinnacle Systems, which last year was acquired by Avid. Pinnacle (now part of Avid) provides video-related software and hardware, from consumer products for editing home videos and TV viewing to high-end systems for broadcast facilities. Pinnacle grew mostly through acquisitions and at one point had 12 engineering locations, 8 in the US and 4 in Germany. All in all there are over 300 engineers with a CVS account.

I was hired in the Sausalito office to get Commotion, a Mac and Windows based compositing and rotoscoping package, out of the door. Because of the multiplatform development, the team there was already using CVS.

In other parts of the company mostly Visual SourceSafe was used, but there was also a considerable source code base in HMS, and people had been using Perforce and ClearCase, but moved away from those.

Believe it or not there were also engineers not using revision control at all.

SourceSafe was a major problem, as remote access was more than slow, even using SourceOffSite. Because of timezone differences, getting access often involved days of waiting for the right passwords and clearances.

So a tendency existed to just get the source code for something, put it in your own (SourceSafe) repository, and update that on an irregular basis.

That way of working leads to problems, especially in the consumer product area where people can (and do) install several of your products, with DLLs that not only do the same things but have the same name and are compiled from source code that is almost, but not exactly, the same. Consumers are not happy when they install a second product and see the reintroduction of a bug that was solved in the last update of the first product they had.

CVS

In September 2001, Pinnacle acquired Fast in Munich (Germany; I am not sure if there is a Munich in the USA as well, but it is probably not that far from the capital of Kentucky). I was part of the technical audit team, and although the technical capabilities of the product were superior, the way it was developed was lacking a bit. Developers had their private SourceSafe databases and delivered compiled DLLs that would flow into installers.

So we set up a second CVS server in Munich and committed the current source code (without history) to that machine.

My team was going to provide an additional rendering engine for the editing product Fast was making. So we needed to get our stuff to Munich, preferably tested. What we did was set up one-way mirroring using cvsup: first to get our compiled and tested DLLs to Munich, then to get their source code the other way so we could compile and test their software. Compilation took 2 hours, and that was quicker than trying to download the 450 MB installer that was the result of it.

After some misfortunes with VSS and day-long recoveries of databases, the Studio group, which is one of the larger ones, got moved over, and afterwards most of the company followed.

Maximized Setup

The setup allows us to build everything from material checked out locally, even the stuff contributed by a remote location. This was good, for example, when the power failed in Sausalito. For three days in a row.

cvsup is a wonderful program (and not even written in Python; it is written in Modula-3). It knows about the structure of the files in the repository and can do very efficient one-way mirroring. It uses a pull model, appropriate for the BSD community where it evolved. Back in 2001 it was the only free mirroring software I could find, and that made the decision for continued use of CVS easy. However, since it is pull based, something is needed to tell the client to start pulling on a commit to the host.
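For reference, the configuration for such a one-way mirror is only a handful of lines. A sketch of a supfile, with made-up host and collection names (the real collections are defined on the server side):

    *default host=cvs.sausalito.example.com
    *default base=/var/db/sup
    *default prefix=/cvs/repository
    *default release=cvs
    *default delete use-rel-suffix compress
    pinnacle-all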

viewcvs (based on Python) was running on the server under Apache, so you could actually browse the tree without needing to check everything out.

And there was a little webpage on that Apache server where I and some other admins could create accounts and send people temporary passwords via email, which they could then change themselves.

Some stuff that was added

So here are some of the problems we solved over time:

I wanted, however, that when our build machines in Sausalito checked in the result of a build, that triggered a mirror to Munich. There are many ways to solve this, but one that could be done with technologies I had used before is to have a postcommit step trigger a remote webpage, which writes a file that gets interpreted by a cronjob.
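A minimal sketch of what that postcommit trigger can look like; the URL is made up, and the exact arguments depend on how the postcommit hook is configured:

    # postcommit_trigger.py -- called by CVS after a commit.
    # Sketch: hits a CGI page on the mirror host, which writes a file
    # into a spool directory that a root cronjob interprets.
    import sys
    import urllib.parse
    import urllib.request

    TRIGGER_URL = "http://cvs.munich.example.com/cgi-bin/trigger"  # hypothetical

    def main():
        # CVS passes the repository directory as the first argument
        repo_dir = sys.argv[1] if len(sys.argv) > 1 else ""
        query = urllib.parse.urlencode({"dir": repo_dir})
        try:
            urllib.request.urlopen("%s?%s" % (TRIGGER_URL, query), timeout=10)
        except OSError:
            pass  # never make the commit fail because the mirror host is down

    if __name__ == "__main__":
        main()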

Cronjobs are good because they are asynchronous and can be run as root, which is necessary e.g. to make new accounts.

Triggering is not trivial, as on a single commit the postcommit program is called for each directory that has changes, so you have to gather some info or else you get a few too many triggers.

There is also the problem that CVS calls the postcommit script via a shell, without quoting, so if you have spaces in your directory names you get a problem. We did not solve that with Python; we patched the CVS source.

So, based on a commit and some info on where things had to go, mirrors were started.

We wanted the builds in Munich to start a while after something new was committed. If you do an rlog on the 18,000 or so files in the repository for one product version, it takes 15 minutes to determine that nothing has changed. We already had a postcommit program, and that was extended to record who changed what in which files. We did not process that with a cronjob and find, as that has a somewhat coarse granularity (2 minutes). Instead, a Python program runs and checks every 5 seconds for files not touched for 15 seconds.
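The watcher itself is only a few lines. A sketch of the idea, with hypothetical paths (the real program records who changed what as well):

    # commit_watcher.py -- fire a build once the repository has been
    # quiet for a while. Sketch; paths and the build hook are made up.
    import os
    import time

    SPOOL_DIR = "/var/spool/cvs-commits"
    POLL_INTERVAL = 5   # seconds between checks
    QUIET_PERIOD = 15   # a record untouched this long means the commit is done

    def start_build(path):
        print("would trigger a build for", path)  # placeholder

    while True:
        now = time.time()
        for name in os.listdir(SPOOL_DIR):
            path = os.path.join(SPOOL_DIR, name)
            if now - os.stat(path).st_mtime > QUIET_PERIOD:
                start_build(path)
                os.remove(path)
        time.sleep(POLL_INTERVAL)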

The original idea was to make sharing easy. So although CVS supports multiple repositories per server, we put everything in one, so you only had to log in once (per server). However, some projects needed restricted access (not to the binary output in lib, but to real sources). We are using groups for that, and by that time had centralized the user management on one machine. It has an LDAP database that gets replicated to the other machines automatically, and the user passwords for CVS and for Apache get updated from that on an HTTP request to each of the servers (much easier than figuring out how to authenticate against LDAP). Viewcvs was in addition adapted to take the user info into account and not display directories that developers could not check out.
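The viewcvs change boils down to a membership test before a directory is listed. A sketch of that check; the mapping from directory to Unix group is hypothetical, in reality it comes from a configuration file:

    # Hide repository directories the logged-in user may not check out.
    import grp

    def may_see(username, directory, group_for_dir):
        """Return True if username is in the group guarding directory."""
        groupname = group_for_dir.get(directory)
        if groupname is None:
            return True   # unrestricted directory
        try:
            return username in grp.getgrnam(groupname).gr_mem
        except KeyError:
            return False  # unknown group: play it safe and hide it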

Last year all of that forced us to upgrade to Linux with a 2.6 kernel on all machines, because the 2.4 kernel allowed only 32 groups per user.

Other changes that we made include extending viewcvs to allow developers with certain rights to delete versions and/or change commit comments. For this you normally need cvs admin, and we blocked that usage in general for developers.

Why Linux on the hosts

What was nice is that after 500 days of uptime, the machine in Munich rebooted, and nobody remembered that this program, which had been running all that time, needed restarting, so no builds got triggered.

Some stats

Offices have closed, sources moved

By now the combined repository (excluding mirrored duplicates) is over 190 GB; checked out, that is 56 GB of material in 795,000 files.

CVS shortcomings

Apart from the slowness, not a real problem for us

However, CVS has some shortcomings; the best way to investigate those is by looking at the websites of the alternatives to CVS: no atomic commits, no renaming of files or directories. To be honest, none of those things were a real problem. Some of them can be solved with proper education (first update your sandbox, then compile and test, and then commit); others cannot be allowed in our environment anyway (you want to rename your DLL that is used by 30 developers in 5 products and hope they all update their configurations?).

Some users might find these features lacking, but for us they are not.

Everybody happy? No!

One deficiency really affected us, and that is that metadata (most notably tagging) is not revisioned.

Some other systems do a better job of this, but they were not stable enough, had no mirroring, or had other problems (e.g. a price tag).

CVS has a shortcoming that affected us: it does not revision its metadata. That means that if someone moves a tag for Studio_10 on a bunch of files, there is no easy way to look up where it was before. It is probably easiest to recover from the backup; because CVS is file based in the repository, that is an option (if you had to bring back a whole database, the other projects sharing the repository would be affected as well). That is kind of nasty, because then, for a bug fix release, we do not know where to branch.

Some revision control systems do a better job of this, but changing is not easy. We had at one point converted one server to CVSNT. CVSNT has special support for Unicode files and claims to be backwards compatible with CVS repositories. Well, it is not; the problem showed up at least with files that changed from being marked ASCII to being BINARY during their lifetime.

Other software lacked mirroring, or used a single database, or was expensive (either to run or to maintain).

So as we needed revisioned metadata, we needed somewhere to store it and keep track of revisions. What is easier than to store it in CVS itself, and have the routines that check out and commit (before and after the compilation) take care of handling it? And actually we do not need that much information. Each component is built under a build number that carries datetime stamps for checkout and check-in plus branch information, so you can retrieve a particular build (and check out the associated sources). In addition to that we store a lookup from a label to a build (and who set that label and when), and we store this together with the component that is being used. That gives us the possibility to determine that certain versions of the binaries never made it into a release, so after half a year of non-use they can be purged.
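In outline, that is nothing more than two small mappings kept next to the component; a sketch with illustrative field names (the real records hold more detail):

    # Sketch of the revisioned metadata for one component.
    labels = {
        "Studio_10_RC1": {"build": 1234, "set_by": "anthon",
                          "set_on": "2005-11-02 14:31"},
    }

    builds = {
        1234: {"checkout": "2005-11-02 09:05",  # datetime stamp for cvs -D
               "checkin":  "2005-11-02 11:40",
               "branch":   "Studio_10_branch"},
    }

    def sources_for_label(label):
        """Map a label to the checkout information for its build."""
        return builds[labels[label]["build"]]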

CVS MetaData

__meta directory with the sources and with the binaries

YAML format, you can actually read and edit that data

In order to support and drive the generic build system (called the Sandbox system), we have a __meta directory on the source side which holds a component specific Python script that is dynamically loaded. It holds some information on source and binary directories and on other components it needs in order to compile, and a list of commands to execute. That script is also used to do special things where needed (e.g. partial optimised compilations based on which files have changed). A BuildData.txt file holds build info, including branch builds and which labels to use when checking out (some components are built in several flavors this way).
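Loading such a component script dynamically is straightforward; a sketch, under the assumption that each __meta directory holds a file called component.py:

    # Load the component-specific build script from its __meta directory.
    import importlib.util
    import os

    def load_component(source_dir):
        path = os.path.join(source_dir, "__meta", "component.py")
        spec = importlib.util.spec_from_file_location("component", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module  # exposes directories, dependencies, commands, ...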

On the binary side there is a CVSMetaData.txt file with build information as described, and also who to notify on correct builds, what to do if you want to use this component in an installer, etc. There is also a small file telling us where the component originates from; with all the mirroring going on, that is sometimes not obvious.

The metadata and builddata files are YAML files, as we wanted something with structure that was human readable and editable. Many elements get updated via programs or web pages, but not everything.
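Reading and updating these files is then a few lines with a YAML library (PyYAML here; the field name is illustrative, not the real schema):

    # Bump the build number in BuildData.txt and write it back.
    import yaml

    with open("BuildData.txt") as f:
        data = yaml.safe_load(f)

    data["buildnumber"] = data.get("buildnumber", 0) + 1

    with open("BuildData.txt", "w") as f:
        yaml.safe_dump(data, f, default_flow_style=False)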

Other useful things

We parse the compiler output, as not all the compilers (we currently support 8 or so) have proper exit values. Visual Studio, called from the command line, has some very nice behaviour, including writing NUL bytes to stdout and ignoring errors in postbuild steps if they are not the last command. By capturing stdout and parsing that in Python we can keep track of what is going wrong. (And we insert ^^^Error, as you would be amazed how many files in the repository have "error" in their name, making it difficult to find the actual error message in the email that you get.)
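A sketch of that capture-and-scan loop; the compiler command line and the pattern are simplified (the real pattern list is much longer and more careful):

    # Run the compiler, strip the NUL bytes Visual Studio likes to emit,
    # and scan stdout for errors since the exit value is unreliable.
    import re
    import subprocess

    ERROR_RE = re.compile(r"(error [A-Z]+\d+|fatal error)", re.IGNORECASE)

    proc = subprocess.Popen(["devenv", "product.sln", "/build", "Release"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    failed = False
    for raw in proc.stdout:
        line = raw.replace(b"\x00", b"").decode("ascii", "replace")
        if ERROR_RE.search(line):
            failed = True
            line = "^^^Error " + line  # real errors become easy to find in mail
        print(line, end="")
    proc.wait()
    raise SystemExit(1 if failed else 0)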

Our build servers run a small C program that starts a Python based XML-RPC server and restarts it whenever it exits without a magic exit value. That XML-RPC server can update its own files (from CVS, of course) and then be commanded to restart itself. That server starts the actual builds via a remote command, from a developer machine or from the CVS servers.
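In outline the server side looks like this; the method names, port, and magic value are illustrative:

    # Self-updating build service: an XML-RPC server that a tiny C
    # wrapper restarts unless it exits with the magic value.
    import os
    from xmlrpc.server import SimpleXMLRPCServer

    MAGIC_EXIT = 42            # the wrapper stops restarting on this value
    exit_code = [None]

    def self_update():
        os.system("cvs update")     # fetch the latest server code
        exit_code[0] = 0            # plain exit: the wrapper restarts us
        return "restarting"

    def shutdown():
        exit_code[0] = MAGIC_EXIT   # tell the wrapper to really stop
        return "bye"

    def start_build(component):
        os.system("python build.py %s" % component)  # hypothetical entry point
        return "started %s" % component

    server = SimpleXMLRPCServer(("", 8000))
    for fn in (self_update, shutdown, start_build):
        server.register_function(fn)
    while exit_code[0] is None:
        server.handle_request()
    raise SystemExit(exit_code[0])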

Apart from compilation we use the system for making installers as well.

The webpages on the servers are run as CGI scripts, not under mod_python, so a checkout of the latest version of these files immediately affects all the pages. And of course this is mirrored to all the servers.

Is this cost effective?

Some indication of the size of the Python system itself: 150 modules, with 22,000 lines of Python. It is difficult to determine exactly where development stops and support starts, but the whole system was probably developed in something like 3 man-months. It has (not enough) test routines using test.py, which I like much better than unittest.

Support is still a part-time task, and has varied depending on new projects getting into CVS, etc. It definitely helped that I worked both in the USA and am now located in Germany, and was able to help people or just stop by.

Would I do it the same way? I would definitely consider doing things this way again. It all has to do with the cost of the alternatives. Avid, who acquired us, is using ClearCase and has 4 people full-time to manage the builds of 200 engineers. We might have to move to ClearCase, but I am not so sure we will give up the Python based build system around it. The nice thing is that our metadata does not have to be in a proprietary ClearCase format, but can easily be replicated there for ClearCase users.

Questions?