I needed to get some historical data out of GitHub that was no longer available and stumbled upon a really neat project called GHTorrent that records and augments the GitHub public event stream. I thought it would be fun to use this data to see who’s using GitHub.
Note: If you only need to do a query or two, there’s a web-based SQL tool here.
We first need to download and then restore the database to get access to the latest dump of users. For this you have two choices:
- A huge (14.1+GB) MySQL dump containing everything that GHTorrent has ever seen. By far the most up-to-date option.
- Chunked MongoDB backups for specific collections (users in this case), which are much, much smaller. However, they are very out of date.
We’re going to go with the large MySQL dump for this, as it has over a million more users than the MongoDB backups. So go ahead and start downloading the latest dump from here.
Now we’re left with a single, utterly massive gzipped SQL dump. Expanding the whole thing would take over 100GB, and since all we care about is the users table, there’s no reason to restore everything anyway.
So let’s decompress it on the fly, pipe it through sed, and cut out just the users table. Because the dump is streamed, the full file never has to sit uncompressed in RAM or on disk.
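A minimal sketch of that kind of pipeline, assuming the dump follows the standard mysqldump layout, where each table's section starts with a DROP TABLE IF EXISTS line and ends with UNLOCK TABLES. The function name extract_table and the file names are hypothetical.

```shell
# Stream one table out of a gzipped mysqldump without unpacking the file.
# Assumption: every table section begins with "DROP TABLE IF EXISTS `name`"
# and ends with "UNLOCK TABLES;" (the usual mysqldump layout).
#   $1 = path to the gzipped dump
#   $2 = table name
extract_table() {
    gzip -dc "$1" | sed -n "/^DROP TABLE.*\`$2\`/,/^UNLOCK TABLES/p"
}

# Example (hypothetical file names):
#   extract_table mysql-dump.sql.gz users > users.sql
```

sed only prints lines between the two patterns, so everything outside the users section is discarded as it streams past.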
At this point you might run into an error:
sed: RE error: illegal byte sequence
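This is caused by the eternal pain of Unicode: sed tries to decode the dump in your locale’s encoding and chokes on bytes that aren’t valid in it. So let’s cheat, tell sed to treat everything as a simple byte string (for example by running it under the C locale), and run it again.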
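One common way to do this is to set LC_ALL=C for the sed process only, so it matches raw bytes instead of decoding characters. A sketch, again assuming the standard mysqldump layout (DROP TABLE ... through UNLOCK TABLES) and using a hypothetical function name:

```shell
# Same table extraction, but with sed forced into the C locale so that
# invalid multibyte sequences in the dump are treated as plain bytes.
#   $1 = path to the gzipped dump
#   $2 = table name
extract_table_bytes() {
    gzip -dc "$1" | LC_ALL=C sed -n "/^DROP TABLE.*\`$2\`/,/^UNLOCK TABLES/p"
}

# Example (hypothetical file names):
#   extract_table_bytes mysql-dump.sql.gz users > users.sql
```

Putting the variable assignment directly in front of sed limits the locale change to that one command.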
Now we have a file, users.sql, which contains a dump of only the users table at a measly ~524MB.
Assuming you’ve already created a MySQL/MariaDB database called “github”, let’s go ahead and restore it now.
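The restore itself can be as small as feeding the file to the mysql client. A sketch, assuming the client can authenticate without prompting (for example via credentials in ~/.my.cnf); restore_table is a hypothetical helper name.

```shell
# Load an extracted table file into an existing database.
# Assumption: the mysql client can log in without a password prompt.
#   $1 = target database (e.g. github)
#   $2 = SQL file to load (e.g. users.sql)
restore_table() {
    mysql "$1" < "$2"
}

# Example:
#   restore_table github users.sql
```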
Woohoo! We’ve got a local copy of over 4.7 million users now. Let’s slice the domain from each email address and see who our winners are:
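A rough sketch of one way to do that tally from the shell, rather than necessarily the query used for the results here. It assumes the users table has a column named email, and top_domains is a hypothetical helper that reads one address per line from stdin.

```shell
# Count email domains from a stream of addresses (one per line) and print
# the most frequent ones as "count domain".
#   $1 = how many domains to show (defaults to 15)
top_domains() {
    # Keep only lines with exactly one "@" and a non-empty domain part,
    # lower-case the domain, then count and rank.
    awk -F'@' 'NF == 2 && $2 != "" { print tolower($2) }' \
        | sort | uniq -c | sort -rn | head -n "${1:-15}"
}

# Hypothetical usage, assuming an `email` column in the users table:
#   mysql -N -B -e 'SELECT email FROM users' github | top_domains 15
```

Rows without a usable address (NULLs, empty strings, anything without an "@") are dropped before counting.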
Huh, well I guess that was predictable! The top 15 are dominated by the major email providers. #4 is a fake placeholder, and #14 is a default host on Linux and some BSDs.
Let’s try something more interesting and see which US schools are into GitHub.
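One way to approximate this is to keep only addresses under the .edu top-level domain and fold subdomains into the school's main domain. This is a sketch under that assumption; .edu is only a proxy for US schools, and top_edu_domains is a hypothetical helper reading one address per line from stdin.

```shell
# Rank .edu domains by number of addresses, folding subdomains such as
# cs.example.edu into example.edu.
#   $1 = how many schools to show (defaults to 15)
top_edu_domains() {
    awk -F'@' 'NF == 2 {
        d = tolower($2)
        if (d ~ /[^.]\.edu$/) {
            n = split(d, part, ".")
            # Keep only the last two labels: "<school>.edu"
            print part[n-1] "." part[n]
        }
    }' | sort | uniq -c | sort -rn | head -n "${1:-15}"
}

# Hypothetical usage, assuming an `email` column in the users table:
#   mysql -N -B -e 'SELECT email FROM users' github | top_edu_domains 15
```

Folding subdomains keeps a school with many departmental mail hosts from being split across several rows.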
Neat! It would be interesting to combine this with the size of each school’s CS/CE departments and see what the adoption rate is. Of course, it’s only an approximation, since someone might be using their personal email address.
Something else that might be interesting: let’s try the 10 largest software companies (as listed by Forbes).