(Cited in Forbes)
Because of recent news reports, I wanted to cross-check the cost feasibility of the NSA recording and processing all US phonecalls.
These estimates show only $27M in capital cost and $2M a year in electricity, and less than 5,000 square feet of space, to store and process all US phonecalls made in a year. The NSA seems to be spending $1.7 billion on a 100k square foot datacenter that could easily handle this and much, much more. Therefore, money and technology would not hold back such a project; it would be held back only by a lack of opportunity or will.
Another study concluded about 4x my data estimates; others have suggested the data could be compressed 10:1 and that the power bill would be lower in Utah. A Google Doc version of the spreadsheet is available, and a cut-and-paste version is below.
This was just boingboing’ed.
number of call-minutes per person per month | 300 | minutes (estimate from my family’s usage) |
sides in a phonecall (caller+receiver) | 2 | since most calls are domestic, only need to record a call once for each receiver/caller pair |
number of people in the US | 315,000,000 | https://www.census.gov/ |
number of bytes/sec in a phonecall | 8,000 | this is the uncompressed number, could be compressed to 1/2 to 1/4 easily |
cost of a Petabyte (PB) of “cloud” storage | $100,000 | this is basically what the Internet Archive pays. Petabyte = 1,000 terabytes |
Square feet of datacenter space per petabyte | 16 | 2 feet wide by about 8 feet including corridor between racks |
Power to run a PB | 5 | kilowatts |
Cost per kWh | $0.15 | California costs (higher than much of the country, could be 1/2 in other places) |
number of bytes/min in a phonecall | 480,000 | calculated from above |
number of bytes/month for a person | 144,000,000 | calculated from above |
number of bytes/month for the US | 22,680,000,000,000,000 | calculated (divided by 2 because there is a caller and receiver, don’t need to double count) |
number of PB/month for the US | 23 | calculated |
number of PB/year for the US | 272 | calculated |
Cost to store all phonecalls made in a year in the “cloud” | $27,216,000 | |
Square feet to store all phonecalls | 4,355 | |
Cost of datacenter power for all phonecalls for a year | $1,788,091 | |
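The spreadsheet rows above can be re-derived in a few lines. This is a minimal sketch that only reuses the inputs listed in the table; the variable names, the decimal petabyte (10^15 bytes) and the 8,760-hour year are assumptions made for the illustration.

```python
# Inputs copied from the spreadsheet rows above.
MINUTES_PER_PERSON_PER_MONTH = 300
SIDES_PER_CALL = 2
US_POPULATION = 315_000_000
BYTES_PER_SECOND = 8_000
COST_PER_PB_USD = 100_000
SQFT_PER_PB = 16
KW_PER_PB = 5
COST_PER_KWH_USD = 0.15

BYTES_PER_PB = 10**15        # assumption: decimal petabyte (1,000 terabytes)
HOURS_PER_YEAR = 24 * 365    # assumption: non-leap year


def yearly_estimate():
    """Recompute the derived rows of the table from its inputs."""
    bytes_per_minute = BYTES_PER_SECOND * 60
    bytes_per_person_month = bytes_per_minute * MINUTES_PER_PERSON_PER_MONTH
    # Each call has two sides, so it only needs to be stored once.
    bytes_us_month = bytes_per_person_month * US_POPULATION // SIDES_PER_CALL
    pb_per_month = bytes_us_month / BYTES_PER_PB
    pb_per_year = pb_per_month * 12
    return {
        "bytes_per_minute": bytes_per_minute,
        "bytes_per_person_month": bytes_per_person_month,
        "bytes_us_month": bytes_us_month,
        "pb_per_month": pb_per_month,
        "pb_per_year": pb_per_year,
        "storage_cost_usd": pb_per_year * COST_PER_PB_USD,
        "square_feet": pb_per_year * SQFT_PER_PB,
        "power_cost_usd": pb_per_year * KW_PER_PB * HOURS_PER_YEAR * COST_PER_KWH_USD,
    }


if __name__ == "__main__":
    for name, value in yearly_estimate().items():
        print(f"{name}: {value:,.0f}")
```

Changing a single input (for example the bytes per second, to model compression) shows how the totals scale linearly with it.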
What if the NSA is using speech to text? What if they convert all telephone calls from audio to text using the latest speech-to-text software?
Then they would only have text files to save, which are much smaller than audio files and can be compressed even further than binary audio files. So it would not cost much to store them.
I think they convert it to text, because only text can be searched and analysed easily at large scale using existing database technology. Only once audio is converted to text can it be scanned for “suspicious” words and expressions. Text can also be loaded into AI systems like IBM Watson to interpret it and automatically find people the NSA might be interested in.
So if they don’t work with audio, they would have lower storage costs and much better capabilities to analyse the data at massive scale.
Well, of course they translate the speech to text; that way they can grep for whatever keywords they are looking for.
But of course the sound file is also saved, so they can listen to the calls the text searches flag as interesting.
Also, speech to text is flawed, so they keep the originals.
The real savings in compression is in converting the language into phonemes. There are 40 phonemes in English, so it would take 6 bits to encode a phoneme. The average rate of speech is 9 phonemes per second. There are 3,600 seconds in an hour, and 8 bits per byte. It would take about 24,300 bytes to record an hour’s worth of phone call between two people. That would come out to 2.12 gigabytes of storage for ten years of phone calls without stopping for anything, including sleeping or eating, or about 1 gigabyte each.
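The phoneme arithmetic in this comment can be checked with a short sketch. It uses only the commenter's figures (40 phonemes, 9 phonemes per second); the function names and the choice to ignore leap days are assumptions for the example.

```python
import math

PHONEMES_IN_ENGLISH = 40     # figure from the comment
PHONEMES_PER_SECOND = 9      # figure from the comment


def bits_per_phoneme(alphabet_size=PHONEMES_IN_ENGLISH):
    """Smallest whole number of bits that can index every phoneme."""
    return math.ceil(math.log2(alphabet_size))


def bytes_per_hour(rate=PHONEMES_PER_SECOND):
    """Bytes needed to store one hour of continuous speech as phonemes."""
    return bits_per_phoneme() * rate * 3600 / 8


def bytes_for_years(years, hours_per_day=24):
    """Storage for nonstop talking; assumption: 365-day years."""
    return bytes_per_hour() * hours_per_day * 365 * years


if __name__ == "__main__":
    print(bits_per_phoneme())              # bits per phoneme
    print(bytes_per_hour())                # bytes per hour of speech
    print(bytes_for_years(10) / 10**9)     # decimal gigabytes for ten years
```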
Phoneme recognition is not much better than speech to text.
Why not both? The storage space necessary is well within the scope of the new datacenter facility. The datacenter will also likely have plenty of processing power for converting and analyzing files.
I guess even if you convert it all to text to facilitate data mining, you’d still want to be able to listen to any particular call once your text-based crawler has identified it as suspicious. So you’d store the original audio streams as well. It’s a bit like compulsive hoarding. There’s a psychological barrier to throwing away any piece of data once you have it.
Olaf you are “Right as Rain”.
The sad part is, all of this information will go to the U.N.
Voice to text:
Converting phone calls from voice to text may seem neat and easy, since it makes them simpler to search, but that assumes the callers speak English and are not speaking in code. Far too many ordinary phone calls would come out of voice to text as garbled nonsense.
Many are nonsense in the original voice recordings anyway. I have awful trouble hearing and understanding cell-to-landline phone calls, as most Americans are unwilling to keep their batteries charged.
Even under the Patriot Act, the NSA does not have the authority to record domestic telephone calls. The existing “Carnivore” project works with e-mail, text messages and the “pen register” (call list) of telephone calls. It also works with lists of downloaders and streamers. The proposal is to use speech to text for domestic telephone calls.
The NSA has been using speech to text for telephone calls between the US and foreign nations since the 1980s (or earlier).
Data mining is much more complex than grep. It involves analysis of contact groupings, key phrases across multiple messages/files/calls, and (for internet usage) creating or accessing web pages or files that are not easily available.
And suppose they’re only collecting what they publicly claim to be collecting: who you called and when. Because understanding the social graph is a crucial element of control. Given your numbers and one added assumption it’s easy to figure what that would cost.
Calculating the cost to store the caller, receiver, date, and duration of every phone call made in the US in the course of a year:
Let’s (very generously) assume that the average call is 3 minutes long. This means an average person is on 100 calls each month. Again avoiding double counting, since there are two people on each call, we need to store 50 calls per person per month.
That is: 15,750,000,000 calls per month
Since there are ~300M users, we’ll need 4 bytes for user id. There are two users per call. So 8 bytes to identify the endpoints. Date and duration of the call add another 8 bytes. For a total of 16 bytes per call.
This comes out to:
16 * 15,750,000,000 = 252,000,000,000
bytes per month.
Per year it’s:
12 * 252,000,000,000 = 3,024,000,000,000
bytes per year.
Throw in 25% overhead for whatever reason and the total storage is still under 4 TB.
An external 4 TB disk is about $175 retail.
For power, the disk is probably under 10 watts. So over a year it consumes under 87.65 kWh, and given the $0.15 cost of a kWh, that comes out to a little over $13 total energy cost.
So: $188 to store all the envelope data for every phone call made in the US for a full year.
(It is late and my math is probably off somewhere. Even if it is off by a factor of one thousand, the cost is still, essentially, free.)
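The envelope-data arithmetic above can be re-run as a sketch. The 16-byte record layout below (two 4-byte user ids, a 4-byte timestamp and a 4-byte duration) is a hypothetical layout chosen to match the comment's byte count, and the 8,760-hour year is likewise an assumption; the other figures come from the comment and the post.

```python
import struct

US_POPULATION = 315_000_000
MINUTES_PER_PERSON_PER_MONTH = 300
AVG_CALL_MINUTES = 3

# Hypothetical record: caller id, receiver id, start time, duration (4 bytes each).
RECORD = struct.Struct("<IIII")


def envelope_estimate(overhead=0.25, disk_price_usd=175.0,
                      disk_watts=10.0, usd_per_kwh=0.15):
    """Yearly storage and cost for call metadata only."""
    calls_per_person = MINUTES_PER_PERSON_PER_MONTH // AVG_CALL_MINUTES
    # Two people share each call, so halve to avoid double counting.
    calls_per_month = US_POPULATION * calls_per_person // 2
    bytes_per_year = RECORD.size * calls_per_month * 12
    kwh_per_year = disk_watts / 1000 * 24 * 365
    energy_cost_usd = kwh_per_year * usd_per_kwh
    return {
        "calls_per_month": calls_per_month,
        "bytes_per_year": bytes_per_year,
        "bytes_with_overhead": bytes_per_year * (1 + overhead),
        "energy_cost_usd": energy_cost_usd,
        "total_cost_usd": disk_price_usd + energy_cost_usd,
    }


if __name__ == "__main__":
    # One example record with made-up values, to show it really is 16 bytes.
    example = RECORD.pack(123_456_789, 987_654_321, 1_371_254_400, 180)
    print(len(example), envelope_estimate())
```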
And for zealous prosecution, the envelope data is even better than the entire phone call. I’m pretty sure you couldn’t fully account for a particular phone call from two weeks ago, but if the person you called was suddenly the target of an investigation… well, pack a toothbrush, if they’ll let you.
The real cost is not in storing this data, but in collecting & analysing this data.
Telcos do not routinely record or store calls. This is why you have to be specifically redirected to the voicemail site to leave a message. Calls are only routed through the exchanges and cells they need to pass through to reach their destination, so, for instance, if you call a number in the same cell site or the same wireline exchange, the call does not go outside that exchange.
Thus, to record all calls, additional new equipment needs to be installed in every exchange and cell site in the country, to both record the call and forward it to the NSA. That equipment would need to be maintained and powered, and the lines to send the data to the NSA data centre would have to be paid for also.
In some areas there wouldn’t be enough bandwidth for this, so additional cables or microwave links would have to be installed and maintained purely for this purpose.
This would all cost far more than it does to merely store data in a single data centre, and would require agreements with a multitude of phone companies.
This is a _huge_ job, that not only would require amazing organizational skills, and an army of technicians, but also probably require an army of lawyers and accountants just to prepare, sign, and pay for, the multitude of contracts and costs involved.
Then there’s actually analyzing the data, rather than just storing it, which, in order to do it in real time, so that there wasn’t a huge backlog, would require at least as many processors as there are calls per second in the USA.
Telcos are currently installing exchanges and cabling that can handle 150 million calls per second. Presumably they are doing this because such capability is actually required, but even if we assume something more modest like a mere 40 million calls per second, that means your analysis centre will need the equivalent of 40 million DSP chips just to keep up with processing the incoming load. Figure out the cost and ongoing maintenance of that, with redundancy and 24/7 up-time, and add that to your total.
And that’s just to get the data collected and carry out a first pass over the digital data, assuming that this can be done in real time. If it takes longer to analyse a call than it does to make it, then the number of processing units goes up linearly with the required processing time per call.
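The linear scaling described here can be sketched as follows. This is a rough illustration, not a sizing of any real system: it takes the average number of simultaneous calls from the post's spreadsheet inputs (rather than the exchange-capacity figures quoted in this comment), and the 30-day month, the one-call-per-processor model and the `peak_factor` parameter are all assumptions.

```python
import math

US_POPULATION = 315_000_000
MINUTES_PER_PERSON_PER_MONTH = 300
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 30-day month


def average_concurrent_calls():
    """Average number of calls in progress at any instant."""
    # Halve the per-person minutes because each call has two participants.
    call_minutes = US_POPULATION * MINUTES_PER_PERSON_PER_MONTH / 2
    return call_minutes / MINUTES_PER_MONTH


def processing_units(concurrent_calls, processing_time_ratio, peak_factor=1.0):
    """Units needed to keep up without a backlog.

    processing_time_ratio: processing time divided by call duration
    (1.0 means analysis runs exactly in real time).
    peak_factor: hypothetical multiplier for busy-hour load over the average.
    """
    return math.ceil(concurrent_calls * processing_time_ratio * peak_factor)


if __name__ == "__main__":
    calls = average_concurrent_calls()
    for ratio in (0.5, 1.0, 2.0):
        print(ratio, processing_units(calls, ratio))
```

Doubling the processing time per call doubles the unit count, which is the linear relationship the comment describes.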
In fact, if I were the architect for this project I would suggest it makes no sense to forward all the call data to a data centre. Have the box that does the recording at the cell site or wireline exchange do the initial analysis and only forward stuff that is “of interest”, thus reducing the amount of bandwidth needed and the size of the processor farm required at the hub. But the boxes at each location would need to be sized (or stackable) to have enough processors to handle peak load for each site.
Note I’m not suggesting any of this is infeasibly costly (though it would be very costly for a government department to do it, more so than if a commercial entity did it), merely that:
a) Data storage is the smallest part of the cost.
b) An awful lot of people would have to know it was happening, from network management types right down to the guys installing the equipment at the exchanges, to the people paying the money to all the thousands of little telcos there are in the US, to the managers responsible for approving those connections… etc.
c) Which leads me to: if this is actually happening, how the hell did they manage to keep it secret this long?
Forwarding call metadata (caller phone number, receiver number, length of call, etc.), on the other hand (which is all that has been seriously claimed, BTW), merely takes data already centralized for billing purposes and sends it to the NSA in a few large text files. That only requires someone in the accounts department of each telco to know it’s happening, and to send an email every day or week, not an entire battery of people in every part of the company and additional hardware and network connections.
> Thus, to record all calls, additional new equipment needs to be installed in every exchange and cell site in the country, to both record the call and forward it to the NSA. That equipment would need to be maintained and powered, and the lines to send the data to the NSA data centre would have to be paid for also.
This has been in place since a few months after 9/11, publicly known about since 2006-7, and it’s really not difficult. You need to tap fewer than a few dozen places (exchanges and trunk lines only). No “exceptional skills”, management, or other ridiculousness required, just insert a router into the trunk and copy the packets to your private fiber lines. Offline processing can handle the rest.
Furthermore, there’s no realtime processing overhead required. The bulk of American calls are kids calling parents, workers calling coworkers, etc. Unless the name and number is flagged as interesting, you just log it and set it in queue to be Speech-to-Text’d for database queries later. If you absolutely required realtime (which you don’t), you could still do it for a handful of phone lines at a time. Otherwise you can afford to wait a whole minute or two for the call to go through, get flagged by the system as an interesting event and then pushed up the queue for immediate processing.
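The "log it, queue it, and push flagged calls up the queue" idea can be sketched with a priority queue. This is purely illustrative: the class name, the watchlist, the recording ids and the phone numbers are all hypothetical, and nothing here reflects how any real system is built.

```python
import heapq
import itertools


class TranscriptionQueue:
    """Offline speech-to-text backlog where flagged calls jump the line."""

    def __init__(self, watchlist):
        self._watchlist = set(watchlist)
        self._heap = []
        # Monotonic counter keeps first-in, first-out order within a priority.
        self._order = itertools.count()

    def log_call(self, caller, receiver, recording_id):
        """Record a finished call; flagged numbers get priority 0."""
        flagged = caller in self._watchlist or receiver in self._watchlist
        priority = 0 if flagged else 1
        heapq.heappush(self._heap, (priority, next(self._order), recording_id))

    def next_recording(self):
        """Return the next recording to transcribe, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = TranscriptionQueue(watchlist={"555-0100"})
    queue.log_call("555-0001", "555-0002", "rec-1")
    queue.log_call("555-0100", "555-0003", "rec-2")   # flagged number
    queue.log_call("555-0004", "555-0005", "rec-3")
    print(queue.next_recording())
```

The design choice is that nothing is processed in real time; the flagged call simply sorts ahead of the ordinary backlog.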
The only technically complicated part of this whole venture is the filtering to pick what’s interesting and what’s not, but Bayesian statistics aren’t exactly rocket science anymore.
Meanwhile, the NSA builds a $2 billion facility capable of storing zettabytes of information and nobody thinks to question what exactly is generating trillions of bytes of information and why they’d need such a place to store it all. Everyone’s so surprised to know the NSA’s been listening in on them for a decade, even though they’ve barely even attempted to make it a secret.
Just a note, but they didn’t keep it a secret. There was an article in Wired about two years ago, and there was data in other locations I do not recall at this time. I’m not sure why everyone is so surprised, seeing as it was a bit of an open secret.
I did a similar back-of-the-envelope calculation, but instead of dollar cost I examined what could be stored based on estimates of the capacity of the Utah datacenter. An impressive number of calls could be stored with even pre-Utah resources. I’d say with the extra capacity of this multi-billion dollar facility that years of data on several billion people could be stored with some basic selection and pruning. Or data on fewer people for much longer, of course.
Calculations:
http://sdrv.ms/17PuMRS
Where I originally posted:
http://www.reddit.com/r/PoliticalDiscussion/comments/1gajsc/anyone_who_believes_the_technology_doesnt_exist/
Do not underestimate the scope of the NSA. Their budget is not published but estimates have it at over $10 Billion a year. For all intelligence agencies combined, the estimate rises to $75 Billion a year.
The prohibition against our spies spying on U.S. citizens is a nicety that went out the window on 9/11/2001. $10B buys a lot of computers, bandwidth and analysts. It is also likely that the NSA benefits from expenditures in other, black intelligence budgets, not to mention money in public budgets that we might classify as waste or pork 🙂
Assume that no one communicates electronically anywhere that is not recorded for future use.
In addition to capturing the text, they must also be capturing whatever cues, verbal or non-verbal, that they believe provide the most crucial content for their purposes — such as pauses, emphasized vowels, voice stress signals, and environmental responses. That would take a little bit of processing, but it would make analysis a much more approachable task.
So, I see you saying, for the needs of national security, we all need 10gigE to our homes ASAP, yes? WOOHOO!!
You forgot the cost of backups and the number of years they intend to keep the data (currently 5).
Hi, Brewster – The estimates of how much data storing a phone call takes were off by a factor of two or so. You’d started with “30 kbps” for Skype and then said “compress by a factor of 10 like you can with text”, but that’s not how voice compression works. Telco voice signals start out at 64 kbps (per direction, but you’re not usually both talking at once), and the most popular compression algorithms get to about 8 kbps for G.729 business PBX voice or 5.3-6.5 kbps for different GSM flavors. You can’t re-compress that into a smaller signal (though there are lower-quality voice codecs at lower bit rates, if you’re starting with uncompressed voice). If you’re doing VOIP, you’re going to chop this into packets of 20 or 30 milliseconds each (because you need to keep latency low) and add UDP, IP, and Ethernet headers to them, getting back to about 30 kbps, but the NSA storage only needs the raw voice signal, not all the transmission headers. So they’re probably storing about 6.5 kbps.
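To see what these bit rates do to the yearly total, here is a hedged sketch that plugs them into the post's call volume (315 million people, 300 minutes a month, each call stored once). The 58 bytes of per-packet headers (12 RTP + 8 UDP + 20 IPv4 + 18 Ethernet) and the 20 ms packet interval are assumptions for the example, and the function names are made up.

```python
# Total seconds of calls per year, from the post's spreadsheet inputs.
CALL_SECONDS_PER_YEAR = 315_000_000 * 300 * 60 * 12 // 2


def petabytes_per_year(codec_kbps):
    """Yearly storage in decimal petabytes at a given voice bit rate."""
    bytes_per_second = codec_kbps * 1000 / 8
    return CALL_SECONDS_PER_YEAR * bytes_per_second / 10**15


def voip_wire_kbps(codec_kbps, packet_ms=20, header_bytes=58):
    """Bit rate on the wire once per-packet headers are added.

    Assumption: 58 header bytes = 12 RTP + 8 UDP + 20 IPv4 + 18 Ethernet.
    """
    packets_per_second = 1000 / packet_ms
    payload_bytes = codec_kbps * 1000 / 8 / packets_per_second
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


if __name__ == "__main__":
    for kbps in (64, 8, 6.5):
        print(f"{kbps} kbps -> {petabytes_per_year(kbps):.2f} PB/year")
    print(f"8 kbps voice on the wire: {voip_wire_kbps(8):.1f} kbps")
```

Under these assumptions the header overhead, not the voice payload, accounts for most of the roughly 30 kbps seen on the wire.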
Great analysis and so timely! I didn’t see estimates for the number of non-US citizen immigrants (legal and illegal) and tourists — which I think is relevant to the original point.
Also, storage costs are only one slice of the pie; staffing costs must be considered too (this will add up, since the US is *outsourcing* most of these costs).
Additionally, I think the storage cost breakdown covers back-end costs but not the additional cost of designing appropriate rule-based, “permission and role restricted” front-end search tools, like the tools hospitals use for HIPAA security requirements, with roles from superusers on down to guest users.
By the way, I was listening to a member of Congress who seems to think there is much more to this story, as well as the potential for more leaks.
Makes me wonder if anonymous hacktivists have visited their servers and found a backdoor and made some backups?
Quote:
“Makes me wonder if anonymous hacktivists have visited their servers and found a backdoor and made some backups?”
I would guess that the ONE thing that is in NO way connected to the internet is the storage/analysis site in Utah.
My remedy for this entire situation is for everyone to start using all the terrorist keywords they can think of in at least 1 or 2 phone conversations a day. This would quickly jam the works, I suspect… Kind of like the college kids that flush all the toilets in the dorm at once…
So you’re assuming the Internet is the only information network with hacktivists? Interesting, or rather funny. Don’t you know that if you are near the Pentagon with a laptop and say you’re going to execute commands you’ll probably not survive? The networks they use are usually full of “little security problems solved by weapons”. I am not a hacktivist, not even a hacker, but it is relatively easy to hack into a wired or wireless network with enough patience, even with improvised tools.
If I wasn’t clear enough, I am not saying it is trivial to crack into something like that, but greater security needs greater patience+knowledge combo pack 😉
A correction: audio is usually captured uncompressed at 16 bits rather than 8 bits per sample, i.e. 16,000 bytes per second, which means that all these totals should be doubled.
As we’re finding out from the Zimmerman Frye hearings, apparently if you call the cops or 911, they A/D convert at 8 bit res.
But you can add 8 zeros on the left hand side to make up the difference. : – )
Telephone audio is only 8-bit and, as someone pointed out, it has a much lower sample rate than most audio you’re used to.
Also, you’re confusing audio resolution (8-bit, 16-bit or 24-bit) with sample rate. For example, 16-bit resolution at a 44,100 Hz sample rate produces a bitrate of 705.6 kbps (uncompressed, single channel). Standard telephone audio is 8-bit at an 8,000 Hz sample rate, resulting in a 64.0 kbps uncompressed bitrate. Notice the math is resolution times sample rate. Compression is then used to get the bitrate even lower.
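The "resolution times sample rate" rule from this comment, as a minimal sketch. The function name and the `channels` parameter are assumptions added for the example.

```python
def uncompressed_kbps(bits_per_sample, sample_rate_hz, channels=1):
    """Uncompressed PCM bit rate in kilobits per second."""
    return bits_per_sample * sample_rate_hz * channels / 1000


if __name__ == "__main__":
    print(uncompressed_kbps(16, 44_100))   # CD-style resolution, one channel
    print(uncompressed_kbps(8, 8_000))     # standard telephone audio
    # The telephone figure in bytes per second, as used in the post's table.
    print(uncompressed_kbps(8, 8_000) * 1000 / 8)
```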
Outrage is not for geeks.
The water use for cooling the Utah data center is definitely an issue for me. If it only takes so much space to store all phone calls, then why is the center so big? Because they’re going to do a lot more with it than that. As in realtime surveillance/suppression of communications they don’t like, as in when my e-mails to groups they don’t like are blocked in realtime. And sometimes phone calls too, though I have found that one or the other will work when the other is blocked; mostly e-mail blocks, not smartphone. Lyle Courtsal http://www.3mpub.com
PS CIA jackals did gas attacks in syria. . . not gov’t.
Yes, the Utah data center sucks scarce groundwater. . .
The equipment being discussed has been operational at all digital switches in the US for many years. It is outlined and specified by the CALEA law! I have installed these units at MTSO and landline offices, and ISP NOCS at many locations during the last decade. Having been retired for several years I am sure the hardware is later generation and much more capable now. This is nothing new to an old telco Engineer!!!!