$USER@bit-of-a-byte ~ $ cat $(find /var/log -name '*.log' -print | head -n 3)

Collegi Pixelmon - Developer Log

The following post is an extremely rough draft. In fact, it isn't even actually a post. These are my development notes from my refactoring of the Collegi data infrastructure. As such, they're arranged in no real sensible order besides having been written chronologically. Additionally, these have not been proofread, grammar checked, copyedited, or spell checked, as I write them in an IDE and not an actual text editor. As such, please don't judge my writing ability off of them. More importantly, these do not have the standardized links that I provide to new concepts or commands in my blog posts, as embedding links to things I already know or have access to in a developer log that on average no one else sees just seems silly.

So, if you have questions, use Google, and expect these to be updated over time.

The logs as of this posting run from 10/13/2016 to 10/16/2016, so over three days of work. There is a -LOT- more to be done.

They are broken down into the following format: each task list is a set of specific actions I took (sometimes notes end up in the list itself because, again, no one generally sees these), and below each task list is the space reserved for notes on it. Then a new task list is declared, then notes, then tasks, and so on and so forth. Generally, each new Tasks heading would signify a new blog post covering those tasks and notes, so keep that in mind.

These were requested by Kan, a player on our server. Enjoy!

Tasks

  • Made a backup of the repository as it stood on 2016-10-13 in the event anything breaks too badly during this.
  • Removed all existing submodules from the git repository. Committed the removal.
  • Ran the previous backup script to make sure that 10/13 was backed up. This included new additions to git annex.
  • Forced git annex to drop the old SHA256E key-value backend files that were made obsolete by the conversion to SHA512E key-value backend.
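      
For reference, the drop boiled down to something like the following; the exact invocation wasn't recorded here, so treat it as a sketch:

git annex unused                   # list annexed objects no longer referenced anywhere
git annex dropunused --force all   # drop them, including the old SHA256E-keyed objects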

Notes 1: During this time, while watching the way the version 1.0 backup script ran, I noticed there is a significant performance penalty for moving the location of the local mirror. Borg uses the entire path as the file name, so any deviation in the path spec causes it to treat the files as brand new. Note that this does not cause any issues with de-duplication, but the process of re-adding these files causes a massive performance hit. This made me start thinking about including the local mirror in the git annex, so that as long as the annex was kept intact with regard to metadata, the paths would remain the same, since all additions to Borg would take place from the same root directory.

The problem with this would be the fact that annex keeps everything as symlinks. As such, I am looking into the unlock feature of version six repositories.
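
If I do go that route, the moving parts would look roughly like this; the file name is purely illustrative:

git annex upgrade                          # move the repository to the version 6 layout
git annex unlock collegi.mirror/level.dat  # check a file out as a regular file instead of a symlink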

Notes 2: Dropping unused from a local area goes -much- faster than dropping from remote. Who knew, right? :tongue:

Tasks

  • git-annex drop completed, but Finder isn't showing a reduction in used drive space. I think this is more an error on the side of Finder than something with git annex, as du -h showed the directory was down to the size it should have been. Once I manage to get this Finder thing figured out, I'll move on to the next part.
  • Finder is taking too bloody long to figure its shit out, so I moved on to the next step in cleaning up the repository. I’m rewriting the commit history to completely remove files I don’t need from the actual git repo. In theory this shouldn’t touch git-annex at all, but that remains to be seen.
  • Ran BFG Repo Cleaner (commands sketched after this list) on the following directories and files:
    • collegi.web
    • collegi.pack
    • collegi.git
    • .DS_Store
    • .gitmodules
    • collegi.logs (Just for a moment, and we made backups.)
    • collegi.configs
  • Ran filter-branch to purge any empty commits left after the above.
  • Expired the original reflogs, repacked the archive.
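
The cleanup followed the usual BFG recipe, roughly like this; the jar name and repository path are from memory and only illustrative:

java -jar bfg.jar --delete-folders "{collegi.web,collegi.pack,collegi.git,collegi.logs,collegi.configs}" collegi-backup.git
java -jar bfg.jar --delete-files "{.DS_Store,.gitmodules}" collegi-backup.git
git filter-branch --prune-empty -f -- --all   # drop commits left empty by the purge
git reflog expire --expire=now --all          # expire the original reflogs
git gc --prune=now --aggressive               # repack and actually reclaim the space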

Notes 3: At this point we had gone from 230 commits to 102 commits. We were also left with the original envisioning of what this repo would be, which was a simple git annex to push files to Backblaze b2 from the Borg repository. Now to verify that all of our data is still 100% ok.

Tasks

  • Ran git fsck
  • Ran git annex fsck

Notes 4: Wow this is going to take a long fucking time. Who woulda thunk it.

Notes 5: So apparently the current version of git-annex is using the old mixed hashing method, which is a format that “we would like to stop using” according to the wiki. Might need to migrate. Need to figure out how.

Notes 6: From the wiki: “Initial benchmarks suggest that going from xX/yY/KEY/OBJ to xX/yY/OBJ directories would improve speed 3x.” It’s worth migrating.

Tasks

  • Run git annex uninit
  • Reading through the git-annex-init man page to see what else we should change now, since we're already migrating. Post-uninit, we're going to have to run a full borg data consistency check.

Notes 7: Ugh. The document I found was actually a theoretical one, and while it is true that git-annex does use the new hashing format in bare repositories, there is no actual way to move to the new one in a regular repo. So I am running an uninit for basically no reason. The only good thing about this that I can think of is that I will be able to reform the final git-annex repo in a much saner fashion. The bad news is that I have lost the log files, unless git-annex is going to bring those back for me. I am annoyed.

Notes 8: Good news! I just remembered that I had made an rsynced backup of the repository before I started fucking with it. So I didn't actually lose the log files; I just went ahead and pulled them out of the git-annex backup.

Tasks

  • After the git annex had uninitialized, I decided that if I was going to do this whole damn thing over again I was going to do it right.
  • Started a new borg repository in new-collegi. Pulled the contents out of the original borg repository, using backups to restore any files that got hit in the above clusterfuck, then recompressed with maximum LZMA compression.
  • During this period I also standardized how the borg create paths would work. The server would exist within a collegi.mirror directory, and the entire directory would be added to borg upon each run of the backup script. This effectively means we never have to worry about the LZMA penalty discussed below again after the first re-add, unless we do major server restructuring, because paths will remain stable between commits.
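
The rebuild, in rough strokes, looked like this; the repository path and archive name are illustrative rather than copied from my shell history, and both commands are run from the stable root directory described above:

borg init --encryption=repokey new-collegi
borg create --compression lzma,9 --stats --progress new-collegi::1.8.9-10142016 collegi.mirror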

Notes 9: The initial speed penalty for using LZMA is absolutely jaw dropping. One borg create took eight hours to complete. Eight. However, I quickly noticed that due to Borg’s de-duplication mechanism, the add times got faster the more data I added, and gzip-9 to lzma-9 did actually yield some improvement. It also reduces the incentive for me to do this fucking disaster again, because of how much it absolutely fucking sucks.

Notes 10: As an example of what I mean by the above, the initial adding of 1.8.9 took six hours with LZMA-9. When the map was changed from NewSeed over to Collegi, it took another four hours just to update the paths and whatnot, even though the data hadn't changed, just the paths had. (This is indicated by the fact that the total repository size barely increased; all the size that changed could be explained by new metadata.) However, when the paths are kept the same, adding 100GB of data takes 13 to 15 minutes. So, the benefit of LZMA-9 is worth the initial startup, imho.

Notes 11: Borg extracting from the GZIP-9 archives takes about 40 minutes, and that's from highly de-duplicated GZIP-9 archives. What this means is that pulling from an LZMA-9 archive is probably going to take about an hour, depending on just how de-duplicated the archive is (as in, how many different chunk files contain parts needed to reassemble the original content).

Notes 12: Have hit the series of backups where things have moved into the Users path, and I'm restructuring them. It made me think about how I will handle the mirror directory in the future. I think I am going to do a few new things with respect to the new setup. The mirror directory will be a part of the git-annex repository, so there will be a new folder inside it called collegi.mirror or something similar, and then I can move the new backup script to be run from the root directory, which will be beneficial. That way everything is neatly packaged. The issue becomes mirroring this, because uploading that much constantly changing data to Backblaze would be literally stupid, and not at all within our budget. What I will likely do is initialize a bare repository on my Time Machine drive, and mirror the entirety of the git-annex repository to that.

Mandatory Break Notes

  • You need to run borg info to make sure the most recent archive is the proper size, and a borg check might not be a bad idea either, as you fell asleep and closed the Mac during work on the repo.
  • Cleaned the time machine volume of the repeated backups of the new repository because it doesn’t make any sense to have 20 versions of it.
  • Moved the repo to the time machine drive as temporary storage using rsync.
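
The sanity checks from that first bullet are just the stock borg commands; the repository path and archive name here are placeholders:

borg info new-collegi::1.8.9-10142016   # confirm the most recent archive is the expected size
borg check new-collegi                  # verify repository and archive consistency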

Tasks

  • Restarted the transfer process starting on the 8th of October

Notes 13: Not a huge shock but running some of these commands across USB 2.0 can add anywhere from 10 to 30 minutes. Doing them cross device gets even worse, with some transactions taking almost an hour.

Notes 14: I've been going back and forth on what filesystem I would like to deploy, since I am redoing the Collegi drive as a whole. Now, the interesting thing to note here is that by the time I get this thing fully ready to deploy, the drive I have here may not be the drive it ends up on, but this is as good of a testbed as any. I'm really thinking I will go with APFS. Most of the gripes I have with it are easily resolved through borg and git annex.

Notes 15: In a highly amusing turn of events, it is bigger in LZMA-9 than it was with GZIP-9. Weird.

Notes 16: While it would likely be prudent to go back to the previous compression method, the improvements I made to the directory structure while redoing the borg repository are worth the few extra gigabytes of overhead, especially since with Backblaze B2 it barely costs a penny.

Tasks

  • Use JHFSX for the new drive. I would have really liked to use APFS, but I am still worried about data loss considering there is almost a year until it ships. JHFSX is reasonable enough for right now, while still being safe to unplug.
  • I went round and round on using encryption on the new drive. Did it.
  • Using rsync to bring the data to its final resting location.
  • Started setting things up (a rough sketch of the commands follows this list).
  • Defined GitLab as the metadata backup again.
  • Created a bare repository on skaia.
  • Set up preferred content so skaia requires everything in the main repo.
  • Set the main repo to require a --force to drop content via preferred content.
  • Set the backend to SHA512E.
  • Began the long process of adding the data to the git-annex.
  • Set up the bin directory to not be tracked by git-annex but instead by git.
  • Added the Backblaze remote, not encrypted, with a proper prefix.
  • Started the sync to Backblaze.
  • Noticed an issue with how the sync was going to GitLab, will correct.
    • Corrected the issue.
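
In rough strokes, the setup above amounted to the following. The remote names match what I actually used, but the paths, bucket, and the exact parameters the B2 plugin accepts are from memory and should be read as illustrative:

git init && git annex init 'collegi primary'
git clone --bare . /Volumes/Skaia/collegi.git           # bare repository on skaia (path illustrative)
git remote add skaia /Volumes/Skaia/collegi.git
git annex wanted skaia anything                         # skaia wants a copy of everything
git annex required here anything                        # dropping content locally now needs --force
echo '* annex.backend=SHA512E' >> .gitattributes        # new content uses the SHA512E backend
echo 'bin/* annex.largefiles=nothing' >> .gitattributes # bin/ goes into git itself, not the annex
git annex add . && git commit -m 'initial import'
git annex initremote b2 type=external externaltype=b2 encryption=none bucket=collegi-backup prefix=collegi.repo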

Collegi Pixelmon - Backup System Part 2

What was originally intended to be a one-off blog post may become my new source of material for the coming weeks. After utilizing BorgBackup and git-annex to back up what has now grown to almost 2.5 terabytes of data, I began to wonder what other ways I could put git-annex to use for us here at Collegi. We already use various GitLab repositories to manage different facets of the project, and I began to wonder if there wouldn't be some way to use git-annex to completely unify those repositories and distribute their information as needed.

This started as a brief foray into git submodules which, while allowing me to consolidate data locally, did nothing to help me properly redistribute that data to various locations. The only way to do such a thing would be to take all the various git repositories that Collegi utilizes, currently six in total, including the git-annex metadata repository (which isn't publicly visible), and merge them into one master repository through the use of git subtrees. This would allow me to still have multiple repositories for ease of project management, but all of those repositories would be pulled down, daily, to a local "master" git-annex repository and merged into it.

Once this was done, the use of git annex’s preferred content system would allow me to decide what data needed to be sent to which remote. This would let me back up some information to one remote, and other information to another. As an added bonus, the use of git subtrees would even allow me to push changes back upstream, and all of it would be centralized.
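
Mechanically, what I have in mind looks roughly like this; the remote names, URL placeholder, and preferred-content expression are all illustrative:

git remote add web <url-of-collegi.web>                   # one of the six existing repositories
git subtree add --prefix=collegi.web web master --squash  # fold it into the master repository
git subtree push --prefix=collegi.web web master          # later, push local changes back upstream
git annex wanted b2 'include=collegi.repo/*'              # only the borg data goes to Backblaze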

In the future, this would allow us to push very specific data to specific team members, who would then modify the data. Those changes would be pulled back down on the next git-annex sync, at which point we would see that changes needing to be pushed upstream had been made, unlock those files, then use git subtree to push them back to their remotes. That's the theory at least. As far as I am aware, either no one has done this before, no one who has done this before has lived to tell the tale, or no one who has done this before has blogged about their experiences in doing so.

That's where this blog comes in. I'm currently in the process of making a complete copy of the current root repository, which is still using git submodules, and from there I can begin experimenting. Whether or not this works remains to be seen, but it coincides neatly with a rewrite of the backup script to follow the Google Shell Style Guidelines, which means I can build the script around the new repository layout, and while doing so I should be able to head off any unforeseen issues.

It’s very likely that I am going to finish writing 2.0 of the script before doing any of this crazy shit, but this post helps me to organize my thoughts. Besides, it just means 3.0 will be that much more exciting when it drops.

Stay tuned for more of my antics and adventures with making this absurd system take shape, and turn into the omnipresent repository of every single facet of a Minecraft community.

The Collegi Pixelmon Server Backup System

Wow, time flies. It has been almost a year since I last updated this blog, including fixing some of the issues that Jekyll 3.0 introduced in my formatting. Luckily, that could be fixed by just adding a few spaces. In the past year, quite a bit has happened, but nothing quite so exciting as becoming a co-owner and the head developer of a new Minecraft community called Collegi. Collegi is a Pixelmon server, which means we have Pokemon right inside Minecraft. However, we strive to make the server Minecraft with Pokemon, instead of Pokemon in Minecraft. It’s a small difference, but one that we happen to find very important. We want the survival aspect of the game to be front and centre.

The server has become absolutely massive, with each downloaded snapshot running about 100GB in size. (Note that throughout this article I will be using the SI-standard GB, which is 10^9 bytes, versus the gibibyte, which is 2^30 bytes; how hard drive manufacturers were allowed to change the value of a gigabyte is something I will never understand.)

Now, with a 500GB flash drive on my MBP, I don’t really have the room to save all of those snapshots, especially considering we have snapshots going back six months, across three different major versions of Minecraft. In fact, completely expanded, the current backup amount at the time of writing is 1.11TB.

So, I began to search for a method of performing backups. I had some rather strict requirements for these backups, which led to the formulation of the system I am going to discuss in this article.

Requirements

  • Incremental FTP
  • Deduplication
  • Compression, and the ability to modify compression levels on the fly.
  • Checksumming to detect silent corruption.
  • Encryption
  • Tools need to be actively maintained and ubiquitous.
  • Able to sync repository with a remote source.
  • Cheap
  • Open source wherever possible.
  • Easy to access archived versions.
  • Must be able to be automated.
    • If not in setup, then in how it runs later.

Step One - Getting the Data off the Server

We use a lovely company called BisectHosting to run our server. They provide an extremely barebones budget package that gives us a large amount of our most important specification: RAM. We can live without fancy support tickets or SSD access if they offer us cheap RAM, which they do. Beyond that, however, they also offer unlimited disk space, as long as that disk space goes towards the server itself, so no keeping huge numbers of backups on the server.

Now, they did offer a built-in backup solution, but it only keeps the past seven days available in a rolling fashion, and I really, really like to keep backups.

The only real gripe I have about BisectHosting is that they only allow the use of FTP for accessing data on the Budget Server tier. Worse, they don’t even use FTP over TLS, so the authentication is in plain text. However, I just change my password weekly and it seems to work alright.

The most important part of getting the data off the server is only getting the new data, or the data that has changed. This requires using an FTP Client that is able to sanely detect new data. Checksums aren’t available, but modification date and file size work just as well.

There were a large number of clients that I tried out over time. Filezilla was the first of those. It seemed to work alright for a time, except that when you have a large number of identical files (we have 15,824 files at the time of this writing), it hangs. Now, it does come back eventually, but a client that hangs is still not a great feature.

The next one I tried was a Mac favourite known as Cyberduck. I really liked the interface for Cyberduck, but the first nail in its coffin was the inability to perform a modification time comparison and a file size comparison during the same remote to host sync. That meant it took two syncs to grab everything up to date, and even then it didn’t always seem to take. During the time that I was using Cyberduck, we had to restore from backup for some reason that is currently eluding me, but when we did so we noticed that some recent changes on the map hadn’t synced properly. Combine all of the above with the fact that from time to time it would hang on downloads (I’m assuming from the absurd number of files) and that wasn’t going to work.

The final GUI client that I tried was called Transmit. I really, really enjoyed using Transmit. It has a very polished interface, but it isn't free or open source, which invalidated two of the requirements. However, if it worked well enough, I was willing to overlook that. The problem was, it didn't work well. I forget exactly what happened, but I know that it experienced hanging similar to Filezilla's.

Regardless, Transmit was the last GUI based client that I tried. It took me a bit to realize, but if I used a GUI client there was a very minimal chance that I would be able to automate the download.

That left command line tools, which, after I found LFTP, I kicked myself for not looking into first. In addition to being an open source tool, LFTP has the ability to perform multithreaded downloads, which isn't common in command line clients. Furthermore, it was able to compare both modification time and file size simultaneously, reducing the sync operations needed back to one. It is actively maintained, available in Homebrew (though, at the time of writing, it has been moved into the boneyard), written in C, and very easily scriptable. You can call commands that would normally have to be run from inside the FTP client directly from the command line invocation of LFTP. It handled our data quantity flawlessly, and easily worked through the large number of files, though it can take quite a while to parse our biggest directories. At the time of writing, that directory is the map data repository for our main world, which has 12,567 items clocking in at 88.15GB. It takes between two and five minutes for LFTP to parse the directory, which, considering all the other benefits, is fine by me.

Our remote to local command utilizes the LFTP mirror function, and from within the client, looks like this:

mirror -nvpe -P 5 / ~/Development/Collegi/
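
For the eventual automation, the same mirror can be run non-interactively with LFTP's -e flag; the host and credential variables here are placeholders, not our actual details:

lftp -u "$FTP_USER","$FTP_PASS" \
    -e "mirror -nvpe -P 5 / ~/Development/Collegi/; quit" \
    ftp://server.example.com
# -n: only download newer files   -v: verbose   -p: skip permissions
# -e: delete local files no longer on the server   -P 5: five parallel transfers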

Step Two - Convert the Data to an Archive Repository

When you are talking about a server where a full backup runs 100GB, and you want to perform daily backups at minimum, it becomes absurd to think that you could run a full backup every day. However, the notion of purely incremental backups is far too fragile. If a single incremental backup is corrupted, every backup after it is invalid. More than that, accessing the data that was on the server at the time an incremental was taken would require replaying every incremental up to that point.

The first solution I tried for this problem was to use ZFS. ZFS solves almost every problem that we have by turning on deduplication and compression, running it on top of Apple’s FileVault, and utilizing snapshots. The snapshots are complete moments in time and can be mounted, and they only take up as much space as the unique data for that snapshot. Using ZFS Snapshots, the 1.10TB of data we had at that time was reduced to 127GB on disk. Perfect. The problem becomes, however, offsite replication.
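
For context, the ZFS experiment amounted to little more than this, presumably through one of the ZFS-on-macOS ports; the pool name and device are placeholders:

zpool create collegi /dev/disk2        # pool name and device are illustrative
zfs set compression=lz4 collegi        # transparent compression
zfs set dedup=on collegi               # block-level deduplication
zfs snapshot collegi@1.10.2-09292016   # a mountable, space-efficient point-in-time snapshot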

Now, it is true that by having a copy of the data on the server, one on my MacBook, and one on an external drive here at the house, the 3-2-1 backup rule is satisfied. However, three copies of the data is not sufficient for a server that contains over six months of work. It's reasonable that something cataclysmic could happen and we'd be shit out of luck. We needed another offsite location. The only such location that offers ZFS snapshot support is rsync.net, which 100% violates the "Cheap" requirement mentioned above. That's not a knock on their service; rsync.net provides an incredible one, but for our particular use case it just wasn't appropriate.

So the hunt began for a deduplicating, compressing, encrypted backup solution that stored its repository as standard files on a standard filesystem. The final contenders were BUP, BorgBackup, and plain Git.

I was leaning very, very heavily toward BUP until I discovered BorgBackup. My primary concerns with BUP were that it did not seem to be under active development, and that after over five years it still had not reached a stable 1.0. Git would have been useful, but just like ZFS it would inevitably require a "Smart Server" versus the presentation of just a dumb filesystem.

BorgBackup sold me almost immediately. It allows you to mount snapshots and view the filesystem as it was at that time, it offers multiple levels of compression ranging from fast and decent to slow and incredible, and it has checksumming on top of HMAC-authenticated encryption. It's worth noting that nothing on the server is really so urgent as to require encryption, as most of the authentication is handled by Mojang, but I still prefer to encrypt things wherever possible.
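
The mounting piece in particular is dead simple; the mount point below is just an example:

borg mount /Volumes/Collegi/collegi.repo::1.10.2-09292016 /tmp/collegi-snapshot
ls /tmp/collegi-snapshot        # browse the archive as a read-only filesystem
umount /tmp/collegi-snapshot    # detach when done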

It was under active development, its developers were active in the community (I ended up speaking with the lead developer on Twitter), and it was progressing in a sane and stable fashion. As an added bonus, the 1.1 release was to provide the ability to repack already-stored data, allowing us to potentially add a heavier compression algorithm in the future and convert already-stored data over to it.

The only downside to Borg was that at first glance it seemed to require a Smart server, just like git would.

Regardless, the system would work for now. If worst came to worst, I could utilize something like rclone to handle uploading to an offsite location.

When everything was said and done, we had reduced the size of our 1.11TB backup into a sane, usable 127GB.

The current command that is used looks like this:

borg create --chunker-params=10,23,16,4095 --compression zlib,9 --stats \
    --progress /Volumes/Collegi/collegi.repo::1.10.2-09292016 .

Step Three - Offsite Replication

I could easily spend a very long time here discussing how I chose the cloud provider I would inevitably use for this setup, but it really comes down to the fact that I quite like the company, their cloud offering has a very complete API specification, and it is dirt cheap. We went with Backblaze B2. I could, and probably will, write a whole separate post on how enthralled I am with Backblaze as a company, but more than that, their $0.005/GB/month price is practically unbeatable. Even Amazon Glacier runs $0.007/GB/month, and they don't offer live restoration; it's cold storage as opposed to Backblaze's live storage.

The problem became this: how do I get the Borg repository to fully sync to B2, but do so in such a way that if the local repository ever became damaged, I could pull back only the data that had been lost? This is what the Borg documentation means when it says you should really think about whether mirroring best meets your needs, and for us it didn't.

Again though, B2 is just a storage provider, not a smart server. So how do I set things up this way? The answer was to use another tool that had almost been used for the backups in the first place: git-annex. The only reason git-annex wasn't used for the backups to begin with is that it doesn't retain versioning information; it just manages large files through git, which wouldn't work on its own. What it would do, and do quite well, is act as a layer between our BorgBackup repository and the cloud.

So, I stored the entire borg repository in git annex. Once this was done, I used a plugin for git-annex to add support for a B2 content backend. Then the metadata for the git repository gets synced to GitLab, and the content is uploaded to B2.
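
Day to day, that last step is just two commands; the remote names here ("gitlab" and "b2") are simply what I called them when setting things up:

git annex sync gitlab        # push the git metadata (no file content) to GitLab
git annex copy --to=b2 .     # upload the annexed borg repository files to Backblaze B2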

Conclusion

The end result of this is that our 100GB server, as it stands on any given day, is mirrored in four separate locations: one on the host itself, one on the MBP hard drive, one in the Borg repository, and one in the Backblaze B2 cloud. More than that, though, we have a system that is easily automated via a simple shell script, which, after completing the initial setup (sending 20,000+ files to Backblaze B2 can take a while), I will demonstrate here.

Thank you so much for reading. I look forward to sharing more about the inner workings of the Collegi infrastructure as time permits.

Video

I just recently completed an asciinema of the process. See below. Also note that you can copy and paste commands from inside the video itself. Go ahead, try it!