At the first Decentralized Web Summit Tim Berners-Lee asked if a content-addressable peer-to-peer server system scales to the demands of the World Wide Web. This is meant to be a partial answer to a piece of the puzzle. For background, this might help.
Decentralized web pages will be served by users, peer-to-peer, but there can also be high-performance super-nodes which would serve as caches and archives. These super-nodes could be run by archives, like the Internet Archive, and ISPs who want to deliver pages quickly to their users. I will call such a super-node a “Decentralized Web Server” or “D-Web Server” and work through a thought experiment on how much it would cost to have one that would store many webpages and serve them up fast.
Web objects, such as text and images, in the Decentralized Web are generally retrieved based on a computed hash of the content. This is called “content addressing.” Therefore, a request for a webpage is made to the network by hash rather than by contacting a specific server. The object can be served from any D-Web server without worrying that it will be faked, because the recipient can rehash the contents and check that the result matches the hash that was requested.
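As a minimal sketch of that check, assuming SHA-256 as the hash (the function names `content_address` and `verify` are made up for illustration and do not come from any particular D-Web implementation):

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the SHA-256 hex digest, used here as the object's address."""
    return hashlib.sha256(data).hexdigest()


def verify(requested_hash: str, data: bytes) -> bool:
    """Rehash the received bytes and compare with the hash that was requested."""
    return content_address(data) == requested_hash


page = b"<html><body>Hello, decentralized web</body></html>"
address = content_address(page)

print(verify(address, page))                # True: content matches its address
print(verify(address, page + b"tampered"))  # False: altered content is rejected
```

Because the check depends only on the bytes received, it does not matter which server supplied them.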
For the purposes of this post, we will use the basic machines that the Internet Archive currently uses as a data point. These are 24-core, 250TByte disk storage (on 36 drives), 192GB RAM, 2Gbit/sec network, 4u height machines that cost about $14k. Therefore:
- $14k for 1 D-Web server
Let’s estimate that the average compressed decentralized web object is 50KBytes (an object is a page, JavaScript file, image, or movie: the things that make up a webpage). This is larger than the Internet Archive web crawl average, but it’s in the ballpark.
Therefore, if we use all the storage for web objects, then that would be 5 billion web objects (250TB/50KB). This would be maybe 1 million basic websites (each website would have 5 thousand web pieces which I would guess is much more than the average WordPress website, though there are of course notable websites with much more). Therefore, this is enough for a large growth in the decentralized web and it could keep all versions. Therefore:
- Store 5 billion web objects, or 1 million websites
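The storage arithmetic can be checked in a few lines. This is only a sketch using the figures above, with decimal units (1 TB = 10^12 bytes) as an assumption:

```python
STORAGE_BYTES = 250 * 10**12   # 250 TB of disk per server
AVG_OBJECT_BYTES = 50 * 10**3  # 50 KB estimated average object
OBJECTS_PER_SITE = 5_000       # generous guess for a basic website

objects = STORAGE_BYTES // AVG_OBJECT_BYTES
sites = objects // OBJECTS_PER_SITE

print(f"{objects:,} objects, {sites:,} websites")
# 5,000,000,000 objects, 1,000,000 websites
```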
How many requests could it answer? Answering a decentralized website request means asking “do I have the requested object?” and, if yes, serving it. If this D-Web server is one of many, then it may not hold every webpage, even though it seems one server could store all pages for a long stretch of the Decentralized Web’s growth.
Let’s break it into two types: “Do we have it?” and “Here is the web object”. “Do we have it?” can be answered efficiently with a Bloom filter. This works by taking the request, hashing it eight times, and looking up the corresponding bits in RAM to see if they are all set. I will not explain it further than to say an entry can take about 3 bytes of RAM and lookups are very, very fast. Therefore, the lookup array for 5 billion objects would take 15GB, which is a small percentage of our RAM.
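For readers who want to see the idea, here is a toy Bloom filter, assuming 24 bits (3 bytes) per entry and eight probes. Rather than running eight separate hash functions, it derives the eight bit positions from one SHA-256 digest by double hashing, which is a swapped-in shortcut and not necessarily what a production D-Web server would do. The class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, capacity: int, bits_per_entry: int = 24, num_hashes: int = 8):
        self.num_bits = capacity * bits_per_entry
        self.num_hashes = num_hashes
        self.bits = bytearray((self.num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key: bytes):
        # Split one 32-byte digest into two 128-bit integers, then combine
        # them as h1 + i*h2 to get num_hashes bit positions (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely not stored"; True means "probably stored".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(capacity=1_000)
bf.add(b"hash-of-some-object")
print(bf.might_contain(b"hash-of-some-object"))  # True: no false negatives

# RAM needed at full scale, without actually allocating it:
print(5_000_000_000 * 24 // 8 / 10**9, "GB")     # 15.0 GB
```

A "no" from the filter is always correct, so the server can reject most requests for objects it lacks without touching a disk. A "yes" can occasionally be wrong and still has to be confirmed by the actual retrieval.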
I don’t know how fast this can run, but it is probably in excess of 100k requests per second. (This paper seemed to put the number over 1 million per second.) A request is a sha256 hash, which, if recorded in binary, is 32 bytes. So at 100k requests per second the incoming bandwidth would be 3.2MBytes/sec, which is not a problem. Therefore:
* 100k “Do We Have It?” requests processed per second (guess).
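The inbound bandwidth figure follows directly from the digest size. A quick check, where the 100k rate is the guess from above and not a measurement:

```python
import hashlib

REQUESTS_PER_SEC = 100_000                 # guessed lookup rate
HASH_BYTES = hashlib.sha256().digest_size  # 32 bytes in binary form

inbound_bytes_per_sec = REQUESTS_PER_SEC * HASH_BYTES
print(f"{inbound_bytes_per_sec / 10**6} MB/s inbound")  # 3.2 MB/s inbound
```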
The number of requests that can be served could depend on the bandwidth of the machine, and it could depend on the file system. If a web object is 50KB compressed, and served compressed, then with 2Gbits/second we could serve a maximum of 5,000 per second based on bandwidth. If each hard drive can do about 200 seeks per second, and a retrieval takes four seeks on average (this is an estimate), then 36 hard drives would give 1,800 retrievals per second. Popular pages would stay in RAM or on an SSD, so it could be considerably faster. But assuming 1,800 per second, this would be about 700Mbits/sec, which is not stretching the proposed machines. Therefore:
* 1,800 “Here is the web object” requests processed per second maximum.
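Here is the same bottleneck comparison as a small calculation. It is a sketch using the estimates above; the seek figures in particular are guesses:

```python
LINK_BITS_PER_SEC = 2 * 10**9    # 2 Gbit/s network
OBJECT_BYTES = 50 * 10**3        # 50 KB per object, served compressed
DRIVES = 36
SEEKS_PER_DRIVE_PER_SEC = 200    # estimate
SEEKS_PER_RETRIEVAL = 4          # estimate

bandwidth_limit = LINK_BITS_PER_SEC // (OBJECT_BYTES * 8)
disk_limit = DRIVES * SEEKS_PER_DRIVE_PER_SEC // SEEKS_PER_RETRIEVAL

served_per_sec = min(bandwidth_limit, disk_limit)   # the tighter constraint wins
outbound_mbits = served_per_sec * OBJECT_BYTES * 8 / 10**6

print(bandwidth_limit, disk_limit, served_per_sec, outbound_mbits)
# 5000 1800 1800 720.0
```

The disks, not the network, are the limit here, and 720 Mbit/s is the "about 700" used in the rest of the estimate.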
How many users would this serve? To make a guess, maybe we could look at how mobile devices use web servers. At least in my family, web use is a small percentage of total traffic, and even the sites that are used (like YouTube) are unlikely to be decentralized websites. So if a user consumes 1GByte per month of web traffic, and 5% of that is decentralized websites, then 50MB/month per user of decentralized websites could serve as an estimate. If the server can serve at 700Mbits/sec, then that is 226Terabytes/month. At the 50MB usage that would be over 4 million users. Therefore:
* Over 4 million users can be served from that single server (again, a guess.)
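The user estimate, worked through as a sketch. It assumes a 30-day month and the server running flat out at 700 Mbit/s the whole time, which real traffic would not do:

```python
SERVE_BITS_PER_SEC = 700 * 10**6     # sustained outbound rate
SECONDS_PER_MONTH = 30 * 24 * 3600   # 30-day month (assumption)
WEB_BYTES_PER_USER = 10**9           # 1 GB/month of web traffic per user
DWEB_PERCENT = 5                     # share that is decentralized web

bytes_per_month = SERVE_BITS_PER_SEC // 8 * SECONDS_PER_MONTH
dweb_bytes_per_user = WEB_BYTES_PER_USER * DWEB_PERCENT // 100
users = bytes_per_month // dweb_bytes_per_user

print(f"{bytes_per_month / 10**12:.1f} TB/month, {users:,} users")
# 226.8 TB/month, 4,536,000 users
```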
So, by this argument, a single Decentralized Web Server can serve a million websites to 4 million users and cost $14,000. Even if it does not perform this well, this could work well for quite a while.
Obviously, we do not want just one Decentralized Web Server, but it is interesting to know that one computer could serve the whole system during early stages, and then more can be added at any time. With more servers, the system would be more robust, could scale to larger amounts of data, and could serve users faster because the content could be brought closer to them.
Performance and cost do not seem to be a problem—in fact, there may be an advantage to the decentralized web over current web server technology.
That is quite an impressive looking server setup. What type of server setup did the Internet Archive start out with?
Not sure if I understand the concept of decentralized web. Does a decentralized web get around the problem of a monopoly ISP? In my remote area, it’s either satellite (which I don’t have and don’t want) or AT&T DSL (I do want basic DSL, I don’t want AT&T). I think what you’re saying is once you connect to the internet/web (and I basically don’t know the difference), then with a decentralized web you would use different server structure. But that first step, connecting… ?
btw, as regards your comment form, I don’t have required field e-mail because the e-mail provided through AT&T (Yahoo/sbcglobal.net) changed its privacy policy in June 2013 (coincident with Edward Snowden?) to making you click a button that says you agree to no right to privacy basically, a button I could never click. Without clicking that button, AT&T/Yahoo/sbcglobal.net will not allow you access to your e-mail, even your years of stored old e-mail sent under previous policy. I think this is the same as google’s policy where they told a court that g-mail users have “no reasonable expectation of privacy.” My expectation of privacy is the same for a letter I mail and I wonder that the USPS doesn’t offer e-mail accounts for all US citizens, as if they took their constitutional function seriously. The solution of more and more encryption restricts right of privacy to tech savvy, which I’m not.
http://www.alternet.org/civil-liberties/gmailers-beware-google-says-you-have-no-reasonable-expectation-privacy
https://en.wikipedia.org/wiki/United_States_Postal_Service
Would love to hear you specifically address that problem (actually problems plural in that participation online usually requires an e-mail address, as here) and if a decentralized web would help.
Thank you!