On BitTorrent and Peer To Peer Distribution

DeWitt Clinton
May 2004

The challenge with distributing a high volume of large files, such as music, video, or applications, from a central server is obvious -- there is a fixed limit to the size of the pipes leading to the servers, and there is only so much one individual server can attend to at one time.

Traditionally, these problems have been solved by either adding more servers and bigger pipes or adding mirrors for the files. For example, if you download a new version of Firefox (the excellent, and free, Internet Explorer replacement), you will be given a choice between downloading it from the original site (at Mozilla.org) or from one of its mirror sites.

The problem with the mirror approach is that the mirrors need to be established and maintained just like the original server, they need to be publicized in advance so that people know to check them, and they need to be kept current with all the files from the original source. Most importantly, they suffer from the same fundamental issue as a single server does -- there is a fixed amount of bandwidth to all of the mirrors, and even a large number of mirrors can be overloaded. Moreover, their capacity to serve files is static -- it does not change based on demand or utilization, and cannot grow or shrink depending on the file's popularity.

Caching systems are one way around the static server issue. Some ISPs (such as AOL) employ caching technologies that insert themselves in between the end-user and the server, such that certain files (often only smaller static files, such as images) are temporarily saved on and served from the ISP's servers.

One of the most popular third-party caching services is Akamai, a very successful company that has built out a massive network of caching servers all across the planet. Companies that need to distribute large amounts of content can pay Akamai to keep temporary copies of their files, and the Akamai network will automatically distribute them around the Internet, shortening the distance that data needs to travel to get to the end user. However, while the dynamic caching approach is popular, it requires the upstream support (and hosting) of a third-party service, and it requires the hosting company to pay the bills. So while this works extremely well for commercial services, such as Major League Baseball's high-quality live internet broadcasts, it doesn't exactly serve the needs of the private individual who wishes to distribute content efficiently.

The last five years have seen an explosion of Peer To Peer technologies that attempt to get around these limitations. In a P2P architecture, files are not stored on a central server, but rather migrate from one end user computer to another, gradually spreading out across the Internet. The advantages to P2P are profound -- no longer is a massive central server required to host large data files, nor is expensive third-party caching necessary. Rather, the end users themselves become the caching and distribution system, and the file network grows and shrinks according to the popularity of the data.

The first generation of P2P file distribution, such as the original Napster client, worked by leveraging a central server to store a description of the content that is available at each peer -- this description is called metadata, and is much smaller than the data itself. This made searching for content trivial, as all you had to do was ask the central server for a particular file, and it would immediately return a list of peers in your vicinity that had that file. And once the file was retrieved, the central server would be informed, and your individual client machine would be added to the list of potential servers.
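The central-index idea described above can be illustrated with a minimal sketch. This is not Napster's actual protocol; the class and method names (CentralIndex, register, search, report_download) and the peer and file names are hypothetical, chosen only to show how a small metadata table can point clients at peers while the files themselves never touch the server.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical central metadata server.

    It stores only metadata (which peer claims to hold which file),
    never the file contents themselves.
    """

    def __init__(self):
        # filename -> set of peer identifiers that hold the file
        self._peers_by_file = defaultdict(set)

    def register(self, peer, filename):
        """Record that `peer` can serve `filename`."""
        self._peers_by_file[filename].add(peer)

    def search(self, filename):
        """Return a sorted list of peers known to hold `filename`."""
        return sorted(self._peers_by_file.get(filename, set()))

    def report_download(self, peer, filename):
        """After a successful download, the downloader becomes a source too."""
        self.register(peer, filename)


# Illustrative usage with made-up peer and file names.
index = CentralIndex()
index.register("peer-a", "song.mp3")
sources_before = index.search("song.mp3")   # only the original holder
index.report_download("peer-b", "song.mp3")
sources_after = index.search("song.mp3")    # the downloader is now listed too
print(sources_before, sources_after)
```

Every search and every completed download passes through the one index object, which is what makes lookups trivial in this design and also what makes the index a single point of load.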

Unfortunately, the technique of using a central metadata server was subject to its own limitations. First, as the number of files and clients grew, the central server needed to scale to massive proportions -- for even though the files themselves were distributed across the network, even the metadata constituted a high volume of information. Second, as file-traders quickly moved to using these types of services to distribute material protected by copyright laws, the copyright holders had one convenient location they could approach to remove files and/or get a list of the people serving the content. (In a famous case, the Recording Industry Association of America successfully sued Napster, ultimately shutting down the service and bankrupting the young company.)

A second generation of peer to peer applications sprang up in Napster's wake. Popular protocols include Gnutella and Gnutella2, Kazaa, Overnet, Soulseek, and more. All of the second generation protocols share a similar advantage over first generation protocols in that not only is the content itself distributed, but the metadata is as well. Searches are not made through a central server, but fan out from peer to peer until files are found. Taken to the extreme, this model generates search traffic that grows exponentially with each hop and becomes practically unusable, so all protocols have adopted a model by which certain clients (often those sitting on the biggest pipes) are turned into super nodes which act as local central servers to nearby clients.
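To make the fan-out search concrete, here is a rough sketch of a flooded query with a hop limit. It does not follow the wire format of Gnutella or any other real protocol; the function name flood_search, the hop-limit parameter ttl, and the five-peer example network are all assumptions made for illustration. The point is only that a query passed from neighbor to neighbor costs more messages the further it travels.

```python
def flood_search(neighbors, holders, start, ttl):
    """Flood a query outward from `start` for at most `ttl` hops.

    neighbors: dict mapping each peer to the list of peers it is connected to
    holders:   set of peers that have the wanted file
    Returns (peers_found, messages_sent).
    """
    visited = {start}
    frontier = [start]
    found = set()
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for peer in frontier:
            for other in neighbors[peer]:
                messages += 1          # every forwarded query costs a message
                if other in visited:
                    continue           # duplicate query, dropped by receiver
                visited.add(other)
                if other in holders:
                    found.add(other)
                next_frontier.append(other)
        frontier = next_frontier
    return found, messages


# A made-up five-peer network; only peer "e" has the file.
network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
near = flood_search(network, {"e"}, "a", ttl=2)   # too few hops to reach "e"
far = flood_search(network, {"e"}, "a", ttl=3)    # one more hop finds it
print(near, far)
```

Even in this tiny network some messages are wasted on peers that have already seen the query. A super node cuts that cost by answering on behalf of its nearby clients, much like a small local copy of the central index.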

Perhaps the most important of the second generation peer to peer clients is an open-source project called BitTorrent. BitTorrent's brilliance is that it does not even attempt to provide explicit searching capabilities. Instead, a small file, called a ".torrent" file, is created that describes the size, cryptographic hash, and the location of a tracker for the content. The .torrent file can be served like any other file off of an HTTP server and then opened in a BitTorrent client. The BitTorrent client will then read the location of the tracker, which will inform it of other clients that potentially have the file in question. The file is broken up into segments and those segments are distributed piece by piece across the network and efficiently reassembled on the client end. For popular files, download speeds frequently reach the maximum allowed by the size of the pipe leading to the end user. And even unpopular files are no harder to get than they would be directly from the original source.
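The segment-and-hash scheme described above can be sketched in a few lines. This is a simplified illustration, not the BitTorrent client's code: it assumes one SHA-1 hash per piece, uses an unrealistically small piece size so the example is easy to follow, and the function names (split_pieces, piece_hashes, verify_piece, reassemble) are hypothetical. It omits the tracker and the network entirely and shows only how pieces that arrive out of order from different peers can be checked and put back together.

```python
import hashlib

PIECE_SIZE = 4  # assumption for illustration; real pieces are far larger


def split_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file contents into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(data, piece_size=PIECE_SIZE):
    """Hash each piece; a list like this is what the small metadata file carries."""
    return [hashlib.sha1(p).digest() for p in split_pieces(data, piece_size)]


def verify_piece(index, piece, hashes):
    """Check a received piece against the published hash for that position."""
    return hashlib.sha1(piece).digest() == hashes[index]


def reassemble(received, hashes):
    """Rebuild the file from a dict of {piece index: bytes}.

    Raises ValueError if a piece is missing or fails its hash check.
    """
    if set(received) != set(range(len(hashes))):
        raise ValueError("missing or unexpected pieces")
    for index, piece in received.items():
        if not verify_piece(index, piece, hashes):
            raise ValueError(f"piece {index} failed its hash check")
    return b"".join(received[i] for i in range(len(hashes)))


# Illustrative usage: pieces arrive in arbitrary order from different peers.
original = b"hello, swarm!"
hashes = piece_hashes(original)
pieces = split_pieces(original)
received = {2: pieces[2], 0: pieces[0], 3: pieces[3], 1: pieces[1]}
rebuilt = reassemble(received, hashes)
print(rebuilt == original)
```

Because each piece is checked on its own, a client can accept pieces from any peer that offers them and discard a corrupted piece without restarting the whole download.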

Because of the ease in setting up .torrent files and trackers, and the near-optimal utilization of network resources, BitTorrent has rapidly become the distribution mechanism of choice for legitimate file sharing. Many sites that need to distribute large files now use it, including some of the major Linux distributions, game demo sites, and legal music trading communities. Additionally, a community of meta-sites that list .torrent files has been steadily growing, and a list of those sites can be found here. Some of the more interesting files available on BitTorrent include some old school Nintendo game replays (check out the insane Super Mario Bros. runs), the Jay-Z Grey Album and Double Black Album Remixes over at Banned Music, or the Clash remixes at London Booted.

Over time, expect distributed file sharing protocols to be built right into the web browser (or, in Microsoft's case, the operating system). Peer to peer distribution in the era of broadband connectivity, particularly with protocols as elegant as BitTorrent, is simply too useful not to become an integral part of the internet. However, the risk of trojaned applications and viruses will remain high until digital signatures and high-grade encryption become commonplace in peer to peer applications. BitTorrent begins to address this with cryptographic hashes, but at the end of the day it is still vulnerable to simple DNS server weaknesses and gaping holes in the TCP/IP v4 protocol. Still, if used judiciously, applications such as BitTorrent already offer a very viable alternative to waiting hours on an overloaded server for a popular file.