Fountain Blocks: A Simple and Robust Decentralized File Storage System
The ability to store and retrieve computer files in a reliable and cost-effective way is one of the most essential aspects of modern computing — one that most of us take for granted at this point. The introduction and refinement of services such as Dropbox over the past few years has made file storage feel like a “solved problem” for the majority of common use cases. But this is true only in the traditional world of client-server technology, where we can safely assume the continued existence and availability of cloud services. What about in the new world of decentralized systems, such as the ones used in crypto-currency projects? Once you get rid of the concept of an authoritative server that can always be found at the same URL (say, at dropbox.com), the picture looks very different, and it’s not at all clear how this should work.
A question that quickly comes up in any discussion of decentralized file storage is “Why bother with decentralized file storage anyway if the existing cloud storage services work so well and are reasonably priced?” This is a valid question, since a central tenet of engineering is to try to keep things simple. Introducing decentralization into file storage complicates things tremendously. So what do we get in return for this extra complexity? The answer, of course, is that you get a system that no one controls, which gives it censorship resistance and staying power (i.e., you don’t have to worry about a particular cloud storage company going out of business and taking all your files along with it).
The crypto-currency project I am working on — Pastel— is attempting to solve the problem of rare digital art on the blockchain in a truly decentralized way. As part of that, we need a way to safely store the art image files. If rare art on the blockchain has any hope of lasting relevance to the art market, it must address the issue of permanence. After all, we still have access to ancient Greek sculptures 2,000 years later, and even less permanent forms of art such as the drawings of Leonardo have reliably withstood the test of time. Yet anyone who has used computers over the last decades has a horror story about a dead hard drive that robbed them of their precious family photos or college term papers. Indeed, the only reason why we seem to have progressed past this stage of digital impermanence is the rise of cloud storage services, which are of little use in a truly decentralized system.
That is no coincidence. The only way to create reliable systems out of unreliable components such as spinning hard disks and solid-state disks that inevitably wear down and lose their storage reliability is to use lots of them in different locations so that, as long as the components don’t all fail at once (something so unlikely that we can essentially ignore it), we will be able to reliably retrieve the files we store. This is precisely what Dropbox, Amazon, Apple, and Microsoft do with the data saved in their cloud services, and it obviously works well. But this sort of redundancy, when done correctly, requires huge financial and technical resources.
Now consider how much harder everything becomes when you make it decentralized! Dropbox has the huge benefit of controlling its own infrastructure. If one of their data centers needs to go down temporarily for some reason, their engineers can plan for this situation and thus avoid downtime. But what happens when you don’t control the servers because no one is in control? This is of course the essential point of a decentralized system. In the Pastel system, file storage and retrieval functionality (along with several other services, such as robust near-duplicate image detection and NSFW content filtering) are provided by dedicated machines known as Masternodes.
A Masternode, or “MN” for short, is simply a computer running software that supports the operation of a crypto-currency network. They are similar to full-nodes in the Bitcoin network, which validate transactions by ensuring that they comply with the Bitcoin protocol that all the users agree to (for example, that you can’t send 5 bitcoins from an address that only contains 2 bitcoins). Note that this is different from Bitcoin mining, which is generally performed by completely different kinds of machines. The difference between a Bitcoin full-node and a MN is that the MN-operator gets paid by the network for running the software by receiving some percentage of the coins mined (in contrast, the Bitcoin network gives 100% of the mined coins to the miners).
The other big difference is that you must own a certain number of coins in order to run a MN. In the case of Dash, which is the masternode crypto-currency with the largest market capitalization, it currently costs about $250,000 to purchase the 1,000 DASH required to “collateralize” a MN. That is, if you want to run a Dash MN, you need to buy these coins and essentially tie them up by not selling them. As soon as you sell enough of them to dip below 1,000 DASH coins, you stop receiving your portion of the DASH mining block reward. On the surface, this doesn’t sound like the utopian ideal of a completely decentralized system open to anyone, since $250k is a lot of money, and will generally be available only to an elite group of wealthy users of the network. Even in smaller masternode crypto projects, where MNs can cost far less to set up (a LINDA masternode, for example, costs under $6,000 to acquire), only a relatively small number of users can be expected to run a masternode.
The reason why such a system can be considered decentralized really boils down to one word: permission-less. That is, anyone who wants to can set up a Masternode and help run the network — you don’t need to ask for anyone’s permission or approval; you just need to be able to come up with the money to purchase enough coins and to have the basic technical knowledge required to set up the machine (this is trivially easy to do now with freely available step-by-step guides that even a novice could follow).
To see how different this is from the centralized scenario, imagine that a very enthusiastic Dropbox user liked the service so much that he asked the company if he could run the Dropbox server software on his own personal machine. It sounds absurd, right? Of course Dropbox wouldn’t allow this — all the servers need to be in their direct corporate control for a host of reasons. Not the least of which is, what happens if this enthusiastic user needs to use the computer for something else? Or what if he stops paying his power bill and the computer turns off?
Yet, this is the exact situation that a decentralized MN crypto-project finds itself in. Just as a user does not require permission to set up a Masternode, the user doesn’t need permission to disconnect it! Masternode operators are under no obligation to maintain the machines, and one has to assume that any given masternode machine could instantly shut down — with zero notice to the rest of the users of the network, so that preparations can be made, such as copying the data to another machine. It seems almost a hopeless task to marshal a motley crew of computers run by random people around the internet into a data storage system that has the kind of permanence and reliability of a world-class art museum like the Louvre. Nevertheless, this is the problem we are faced with if we want to make a project like Pastel Network a reality.
Like any big and important problem, many smart people are attempting to solve decentralized storage in a variety of ways. These competing solutions can be roughly broken down into two types: storage-specific crypto projects, such as SiaCoin and Filecoin, and more generic distributed storage projects, the most prominent of which are IPFS and Swarm. While a detailed review of these projects is beyond the scope of this article, there are a few good comparisons of IPFS and Swarm that explain the basics. Taking into account the specific requirements and priorities of the Pastel project, none of these solutions seemed at all attractive.
For one thing, Pastel is not about file storage — it’s about rare digital art. File storage is just something that we need to have in order to achieve the main goals of the project. At the same time, it’s absolutely critical that it work reliably and cost effectively. From our perspective, all of the mentioned projects were almost comically over-complicated in terms of technology. But even more importantly, they seemed to get the crypto-economics wrong, at least for our use case. In order to see why, you have to understand what we are trying to accomplish, so we will now take a brief digression to discuss this.
In the new Pastel Network project, artists will be able to register their original digital artworks on the Pastel blockchain, setting a specific number of copies in a way similar to how a traditional artist might make a limited edition set of prints.
What makes “meatspace” prints like this valuable is their rareness. Despite the fact that anyone can freely download a digital image of the art prints linked above (although the downloaded image generally won’t be high-resolution), many of them sell for thousands of dollars each. This is because you can’t make any new ones — the artist has agreed that only these specific prints will exist, so if that artist later becomes famous and important and tens of thousands of collectors are interested in purchasing the work, it’s not difficult to see how the cost per print could skyrocket if there are only, say, 30 such prints in existence.
Similarly, art on the Pastel blockchain will enjoy the same benefits from being rare. But unlike real-world art, digital blockchain art — when implemented correctly — does not suffer from the same risk of counterfeiting, just as one can’t make counterfeit bitcoins out of thin air. In fact, Pastel goes one step further and makes it so that even the artist can’t betray their promise of rareness. In the real world, an artist could decide to make additional copies of prints, or they could change a tiny detail and make a print of that instead (this would hurt the artist’s reputation, but could still happen). But the Pastel system uses an ensemble of three different deep-learning models to create a robust image fingerprint for each work of art registered on the system. If anyone attempts to register the same art — even if they modify it in various ways — it will be rejected by the system for being a duplicate, as shown in the example below:
The computations to determine whether a newly submitted digital artwork is original are done on a randomly selected MN in the Pastel network, with 2 other randomly selected Masternodes independently confirming that the checks were done correctly. In addition, the submitted images are also checked for NSFW content (we don’t want the network devolving into something like Freenet). Assuming that a newly submitted art image passes these tests, and the artist submits the required registration fee (paid in Pastelcoin (PSL) and set as to cover the estimated storage costs in perpetuity), the system must now store the art image file.
While the metadata ticket describing the artwork, such as the title, the quantity created, the artist’s digital signature, etc. are stored in the Pastel blockchain itself (using the Pay-to-Fake-Multisig-Address method described here), the art image files themselves are much too large for this and would quickly lead to insurmountable scaling issues if we attempted to write them directly in the blockchain. Thus, we need a way to reliably store these files off-chain in a distributed, decentralized way.
At this point, it’s important to point out that the file storage requirements for Pastel Network are somewhat constrained in that:
- We never need to modify or delete existing files, since that is not allowed in the system.
- We don’t need image retrieval to be particularly fast or high-bandwidth, since only the owners of the rare art will be permitted to access the files.
- Furthermore, when a user purchases the rare art, their client software will download and store the full-resolution originals on their local machine; thus, users will generally only need to re-download the files if they are installing the wallet on a new machine.
- The thing we really, really care about is that the system should be extremely reliable in terms of data retention. The promise we make to artists who use the network is that, once they pay the one-time registration fee, they never have to worry about their art files being lost — even in a bad scenario where large parts of the network go down permanently.
Obviously, if another crypto project had very different requirements, such as needing lots of small temporary files to store the state of a system, the developers would end up making very different design decisions. With that being said, the requirements for the Pastel Network storage layer are likely relevant for many interesting blockchain applications. With that out of the way, let’s dive into the key ideas of the Pastel storage layer.
Pretty much any robust file storage/transfer system begins by splitting up the file into pieces. In the case of BitTorrent, the best known system of this kind, files are broken up into disjoint (i.e., non-overlapping) chunks. While BitTorrent generally works quite well for popular files, any frequent user of BitTorrent can tell you that torrents can get “stuck” at 95%+ completion because some of the chunks are unavailable on the network. Unless a user joins the torrent swarm with the complete file and “seeds” it, downloaders may never be able to get the entire file — a frustrating experience, especially for users on slow connections!
The problem with BitTorrent is that you need to download every chunk to get the full file, and since these chunks are all disjoint, there will always be a “rarest” chunk that everyone needs but only a few users have. At the same time, other chunks will be plentiful, so that all the users can download them quickly. This is obviously a sub-optimal outcome. So what can we do to address this problem? One approach is to use what are called erasure-codes.
In an erasure-code scheme, there is some redundancy built into the chunks, which allows you to reconstruct some missing file chunks using the chunks that you do have. If this idea seems familiar to you, it might be because it’s an old idea from the days of binary usenet groups. Back then, users would post a large file broken into chunks along with several “PAR” (short for parity) files, which would allow for the reconstruction of missing chunks using special software.
This is a good idea, and erasure-codes are a primary feature of such file sharing crypto projects as SiaCoin. But given the specific design requirements for Pastel Network, I believe there is a better way using what are generally known as fountain codes. The first practical implementation of fountain codes was pioneered by Michael Luby, who ended up selling his software company to Qualcomm. His ideas have since made it into pretty much every smartphone since 3G came out. When I first learned about the existence of fountain codes, I was both amazed and incredulous that such a system could even be possible, let alone work well. It almost seems like magic. Perhaps when you are done reading this, you will walk away with a similar sense of amazement.
So what’s so special about fountain codes? To be concrete, I will be focusing on a particular class of fountain codes called LT-codes (short for Luby Transform). The LT code works by breaking up the data that you want to transfer — which could be a stream, such as a cell phone call on a digital cell network, or a generic data transfer in which the order of the chunks doesn’t matter — into a series of chunks. So far, sounds like BitTorrent, right? The difference is that the chunks in an LT-coding scheme are fungible. That means that any chunk is as good as any other chunk, as long is it is different from the chunks you already have. There is no such thing as a “rare” chunk because any new chunk you download will be useful in reconstructing the file.
This property is in fact what gives Fountain Codes their name: the analogy is that of a water fountain, with one or more “users” attempting to fill their cups or water bottles at the same time. The users don’t care which particular droplets of water they get from the fountain; they just place their vessel in the stream and hold it there until it fills up. Furthermore, the users don’t need to synchronize their actions — one person could start filling a bottle, and then a few seconds later another person could come along and start filling their cup — they don’t need to wait for the next time the fountain starts up to get the stream “from the beginning”.
The way this works at a very basic level is that every chunk in an LT coding scheme contains certain random fragments of the combined file. It’s almost as if someone took the entire file, put it in a bag, and then smacked it with a hammer, turning it into a bunch of oddly shaped fragments. Some of these fragments would be relatively large compared to the full size of the file, and others would be tiny bits of the file. Continuing this analogy, we could then reach into the bag without looking and grab a fistful of these fragments, which would constitute a chunk in the LT coding scheme.
Two things about this process were shocking to me. The first is that we can always figure out how to piece these fragments together. But perhaps this isn’t so hard to believe, since if a porcelain vase fell and broke into a bunch of pieces, it might might take a while, but as long as you managed to not lose any of the fragments, these would only fit together properly in one way — the right way. We would start with the biggest fragments first, and then find the next biggest fragments and try gluing them into place, and eventually piece together the vase.
At this points it’s important to realize the limitations of our analogy. In a real LT coding scheme, the bag of fragments would be bottomless, so that we could always reach in and grab a new fistful of fragments, each different from the previous ones, and each containing new pieces of the vase that would be useful in reconstructing it. At this point you might be thinking “that sounds pretty wasteful, since you would end up getting lots of duplicate fragments.” Indeed, it’s hard to see how this wouldn’t lead to a lot of overhead in the protocol.
Obviously there needs to be some degree of redundancy/overhead in any robust system like this, but we should distinguish between redundant data and wasteful overhead. By wasteful overhead, we mean that some of the bits we are sending are not the actual data itself, but rather the protocol that describes the fragments and how to assemble them. While such information is needed to make the system work, we want to keep it to an absolute minimum. After all, if we had to download 30 megabytes worth of 3 megabyte chunks in order to reliably assemble a 5 megabyte file, that would be a pretty bad outcome, since the system would only be 5/30 = 16.6% efficient.
This leads to the second thing that shocked me about LT codes, and which I still have trouble believing despite seeing it with my own eyes. LT codes can in practice achieve efficiency rates of upwards of 85%, as shown in the chart below:
This means that the vast majority of our bandwidth is being put towards useful work: transferring the bits of data we want, not useless overhead that uses up our bandwidth without getting us closer to what we want: downloading the file. And yet, we still get these almost miraculous benefits: we can get file chunks from multiple sources, in any order, and every chunk is just as good/useful as any other chunk. No more stuck downloads, no more “rare” file chunks that we hope someone will make available!
LT Codes were originally designed for a broadcaster-subscriber model, in which the broadcaster (say, the NFL streaming a live game, or Steam serving a video game demo to users) has a full copy of the file/stream, and generates new LT chunks on the fly in an “endless stream” of chunks, and the subscribers can tune in to the stream or begin downloading the file at any time.
If you think about it, this would be impossible using a system like BitTorrent since each chunk only contains a specific piece of the file; thus, all the subscribers would need to begin downloading at the same time in order for everyone to get the full file. One workaround might be for the broadcaster to have several streams going at once at different starting times, so that a subscriber could just wait for the next stream to start their download. But this obviously uses a lot more resources and is much less flexible and convenient for the subscriber.
While exploring possible ways to store the art image files for Pastel, I realized that it would be fairly straightforward to adapt this model to a more generic file transfer/retrieval context, where several users on the network have varying numbers of chunks downloaded and a few users have the entire file so that they can generate fresh chunks. To illustrate how this might work, we can take another Pastel art image as an example. The file shown below is a 20.1 megabyte PNG image file:
Along with the image file, we will also encode the html metadata ticket, which looks similar to this example. We combine the image file (there can be more than one image file in a given artwork, but for this example we will restrict ourselves to a single image) and the metadata ticket file into what is called a TAR file, and then compress the TAR file using a new compression library developed at Facebook over the past few years, Z-Standard. Since the PNG image file is already compressed, we do this step primarily to get a single file to work with that doesn’t contain any redundant data. The code for these preliminary steps is fairly simple and straightforward, and can be seen here:
Now, our goal is to transform this single compressed file into a collection of LT chunks. Then we will show that, even if we randomly delete or corrupt the majority of these LT chunk files, we can nevertheless reliably reconstruct the original files. If you would like to try out the Python source code yourself, you can find a self-contained demo here that is under 400 lines of code.
You can replace the image and metadata files that I used with any set of files and the demo should work in the same way. Anyway, once we have completed this initial step of creating a single compressed file from the data we want to store, we must must then specify 2 parameters that control how the data is encoded into LT chunks.
The first of these is the size of each chunk file; for this, we want to strike the right balance between the inconvenience of dealing with lots of small files, and the desire to be efficient in terms of not wasting bandwidth. That is, if the chunks are too large relative to the size of the data we are encoding (in this case, around 20 megabytes of data), we could easily end up needing 30+ megabytes worth of chunks to reconstruct the file. Since most digital image files that would be registered on the Pastel network are probably on the order of 10 to 30 megabytes, I am using a 2 megabyte chunk size.
The other key parameter is the desired redundancy factor. What this controls is the total size of all the chunk files that we initially create. So if the original data in this case is ~20 megabytes, and we use a desired redundancy factor of 12, we will create 20 x 12 = 240 megabytes worth of chunks; at a 2 megabyte chunk size, this means that we will start out with around 120 chunk files. The code excerpt below shows the various parameters, including the ones that specify the location of the files we want to encode and where to store the resulting chunk files:
The fact that we have many more chunks than are required to reconstruct the files is the first layer of protection for data stored in the Pastel network — it is what allows us to lose a large portion of these chunks and still be able to recover the files. The second layer of protection, which is equally important, is that these chunk files will be uploaded to the network of Pastel Network Masternodes so that each distinct chunk file will be stored by at least 10 randomly selected Masternodes. This two-tier data protection scheme is what gives the network its robustness and protects the art files from “Black Swan” type events, where huge portions of the infrastructure could vanish without warning. Such a risk is not just theoretical: for example, suppose that at some point in the future, the the price of Pastelcoin drops significantly, leading Pastel Network Masternodes to liquidate their holdings and shut down their servers en masse. We need to be prepared for this situation.
The final step to truly protect the art files is to make the system dynamic. That is, even if a newly registered artwork begins with the desired 12x redundancy factor, and each chunk is spread across several independently controlled Masternodes, this redundancy might erode away over time as machines shut down or chunk files are deleted or corrupted. In order to combat this, Masternodes will be randomly selected in the Pastel Network system to maintain the redundancy level of each registered artwork to ensure that it meets the minimum desired redundancy level.
The way this works is that the Masternode chosen to verify a given artwork will check the file-servers for all of the active Masternodes (these are hosted using the high-performance NGINX web-server, where access is limited to the IP addresses of valid Masternodes), looking for LT chunks that correspond to the file hash of the artwork the Masternode is checking. The Masternode then computes the overall redundancy level by adding up the size in megabytes of the distinct LT chunks it is able to locate (after verifying that the chunks are valid ) and then compares the aggregate size of these chunks to the original size of the encoded files.
If the resulting empirically verified redundancy level is below the target redundancy level, the chosen Masternode will reconstruct the original file and then proceed to generate fresh LT chunks, advertising the new chunks to the other Masternodes so they will be mirrored on multiple machines. In this way, the storage network is dynamic and self-healing: it can take a beating without dying, and will gradually recover to its initial strength, at which point it can handle another beating without ever losing the precious art file data.
How will we know that we have accurately reconstructed the original files and not somehow introduced subtle errors? At the time of encoding the original files, when we create the single compressed archive containing the image/metadata files, we will compute the SHA3–256 hash of the compressed archive. This file hash, along with the file hash of each chunk file we generate, will be encoded in the “header” of each chunk file. This means that we will always be able to detect if any chunks are corrupted, and we will know for sure at the end that we have the original files because the hash of the reconstructed file will match the file hash for that art asset that is recorded in the Pastel blockchain.
Of course, transforming a 20 megabyte file into 120 different chunk files totaling ~240 megabytes, and then back into the original 20 megabyte file, is not particularly exciting or impressive. To make this demo interesting, after we generate the LT chunks, we will randomly select 75% of the chunk files and delete them. In addition, we will choose 10% of the remaining chunk files and overwrite a portion of the data in these chunks with random bits, corrupting them. We will then attempt to reconstruct the original file, and verify that the reconstructed file is correct by comparing its file hash to that of the original file. Here is what it looks like when you run the demo code:
While this demo is obviously not a complete, working file-storage system, it contains the skeleton we need to build the complete system. Indeed, the rest of the work is pretty basic and straightforward: we just need to write some code that will run on each of the Masternodes that will check each of the other Masternodes looking for file chunks that should be mirrored. And when a new art image is registered by a randomly selected Masternode, the newly created chunk files will automatically be mirrored by other Masternodes. We also need to write the “self-healing” code that checks whether each art file has a sufficient amount of redundancy in the Pastel network to ensure the safety and permanence of the data.
One thing we have overlooked thus far is the question of security. Since we don’t control individual Masternodes (no one does, since they can be created without anyone’s permission or authorization!), what is to prevent a hostile/malicious Masternode operator from pretending to host file chunks, without actually hosting them? After all, storage space isn’t free, and the profit from running a Masternode would be higher if the amount of storage could be reduced. There are various ways to address this issue. The most basic and effective is to rely on random inspections from randomly selected peers.
To see how this works, suppose that the Pastel Masternode hosted at the IP address 184.108.40.206 claims to be storing the LT chunk file named “FileHash__186fddcceb5e5a8074fd5880b8965900bf50c18c4bbe3efbcfda030856435909__Block__000000001__BlockHash_a7853dad91a44e7907bdee1a5ecde1fd24483f8530e3db17431d92b43f194933.block.” Rather than trust this Masternode, we can have another randomly selected peer masternode (you may have noticed that we rely heavily on this concept of randomly choosing a Masternode — this is to ensure that collusion among Masternodes is difficult unless the attacker controls a majority of the Masternodes, which would make the attacker heavily incentivized to not attack and injure the network. To read more about how we can randomly select peers in a fair, objective way, see this documentation from the Dash project) issue a storage challenge.
What do we mean by a storage challenge? The most straightforward way to perform a storage challenge is for the challenger to independently retrieve a known reliable copy of a given LT chunk file, and then demand that the challenged Masternode send this file on very short notice — say, within a few seconds. If the challenged Masternode were actually storing the file, this challenge should be quick and easy to respond to accurately. But if the Masternode is lying and doesn’t actually have the file in question, it will take too long for them to respond to the challenge, and they will fail. If a Masternode fails enough storage challenges, it can be punished by the network by withholding the Masternode block rewards for a set period of time (say, 100 blocks).
The problem with this approach is that requiring Masternodes to constantly transfer entire LT chunks is inefficient of bandwidth. There is a nice way to get the same result while transferring much less data. The way this works is that, rather than the challenger Masternode demanding the entire LT chunk file, the challenger instead specifies a random segment of bytes in the selected chunk file (e.g,. out of ~2 million bytes in an LT chunk, they could select the random interval beginning at byte number 523,232 and ending at byte number 1,235,212) and asks the challenged Masternode to respond with the SHA3–256 hash of this random segment. Since the segment could be comprise any part of the LT chunk, there is no way for the challenged Masternode to respond to this without actually storing the file. But now the challenger/challenged Masternodes only need to transfer a few bytes instead of the whole file.
In summary, there are several competing decentralized file storage systems out there. From the standpoint of our project, most of these seemed too complex, and didn’t offer the same protections in terms of robustness/reliability that we wanted. In addition, systems such as IPFS and SiaCoin are external to Pastel Network. We can’t control what happens in these networks, and the developers who do can decide to change how the system works. Having an external dependency — especially one so critical to the functionality and security of the system we are trying to build — seemed to be an unacceptable risk. Furthermore, it was essential to use that we be able to control the economic incentives that govern the “crypto-economics” of the network: that is, the registration fees (the “carrot”) and the penalties for failed storage challenges (the “stick”).
At the same time, we didn’t want to build a big complex system like SiaCoin or IPFS to solve what seemed to us to be a relatively straightforward problem with very limited requirements. These requirements and constraints collectively led us to “roll our own” distributed file storage system. Our sincere hope is that this design will prove useful to other decentralized projects and that others will contribute additional code and features.