We’re a touch slow off the mark here, but thought it would still be worth mentioning the latest offering from Amazon Web Services (AWS). For those of you who don’t already know, Amazon is more than an online bookstore; it’s one of the most innovative players in the nascent cloud computing sector. Over the last few years they’ve launched a suite of web services covering hosting, processing and distribution, challenging some of the traditional businesses in this sector.
4ip is a big fan of digital services making innovative and appropriate use of cloud computing, and we’re particularly interested in the launch of Public Data Sets on AWS, where select public data sets will be hosted for free as Amazon EBS snapshots. According to Deepak Singh, business development manager at AWS: “Public Data Sets on AWS provides a convenient way to share, access, and consume publicly available data within your Amazon EC2 environment”.
In theory this significantly lowers the barrier for researchers and data analysts to access and use some of the most commonly used data sets in their communities without the need to manage data within their own AWS accounts. From a 4ip perspective we hope that growing the number of people with access to important and useful data, and making it easy to compute on that data with cost-efficient services, will fuel innovation and further accelerate the pace of new discoveries. AWS plus hosted data sets should allow individuals to do the types of computing once reserved for large businesses and educational institutions. The potential to cause trouble here is HUGE and it would be great to see what could be done with shared instances of the Royal Mail’s Postcode Address File (PAF), Neighbourhood Statistics from the ONS, health care information from NHS Choices, a list of all schools in England and Wales from the DCSF or the Official Notices from the London Gazette.
A by-product of increased access should be improved participation. Dr. Peter Tonellato from the Harvard Medical School commented that: “Public Data Sets on AWS will enable me and many of my colleagues to collaborate with each other by sharing our commonly used data sets, research environments and tools”.
Unsurprisingly, most of the data sets currently available have a US flavour, but if you have a public domain or non-proprietary data set that you think is useful and interesting to the AWS community you can submit it to Amazon for inclusion. Typically the data sets in the repository are between 1 GB and 1 TB in size (based on the Amazon EBS volume limit), but Amazon can work with you to host larger data sets as well. However, you must have the right to make the data freely available.
4iP Blog
Cloudy Data
Posted by Dan Heaf on Tue, December 09, 2008 at 11:43


Darryl Collins on Fri, December 12, 2008 at 5:36 said:
Just read your blog post. Very interesting. It mirrors what we have been discussing and planning in Banjax - gathering, tidying and making available public data sets in a web format for others to use. We need it for Local NI. Indeed we have needed it many times in the past 12 months.
Our NI Crime Map (licensed to Belfast Telegraph http://www.belfasttelegraph.co.uk/nicrimemap/) was an eye-opener for us. On a bureaucrat’s tick list of openness, they have done everything already.
* Is it available in digital format? Yes, Excel, tick.
* Is it available to the public? Yes, one obscure website, tick. etc.
The reality is in the detail - making something available in an Excel spreadsheet in an obscure corner of the web SHOULDN’T count! All public data should be in an open format, in a prominent place online, authoritative, accurate and maintained.
And we’d like to build something like that.
ArkAngel on Sat, December 13, 2008 at 1:10 said:
Are these kinds of services better provided by private (predominantly US) corporations like Amazon or publicly owned enterprises?
4IP Blogger on Sat, December 13, 2008 at 11:11 said:
Thanks for the comments. Interesting and useful. Darryl, I completely agree that the long-term viability of these types of data services relies on their being prominent and well maintained. Hopefully, services that look to house multiple data sets will benefit from users going there to use one data set they know about and in the process discovering additional useful data they can use.
ArkAngel, I thought about this point while I was writing the post and think about it constantly in relation to 4ip. On the one hand it would certainly be preferable to have a publicly accountable, public-service-orientated body. However, few public bodies I’m aware of have the technical know-how and vision to deliver this.
While private enterprises tend to be commercially motivated, they have a number of other advantages. Firstly, many of the commercial players in this space are multinational. I’m sure data shared across borders will yield unexpected innovations and analysis. Sure, nationally held public bodies will want to do this too, but can you imagine the red tape?
Secondly, experience tells us that services improve the fastest where there is competition. Right now all the big technology players are falling over themselves to get into the cloud. You could argue that Amazon’s public data sets are a response to that competition. Public bodies rarely compete with each other and they’re often prevented from doing so (unless you work in broadcasting). Services like these are embryonic, and competition could drive innovation and ultimately maturity.
Dan
Darryl Collins on Sun, December 14, 2008 at 10:12 said:
@ArkAngel It depends. Private companies are often more agile and resilient with things like this - they don’t have endless meetings to prevaricate about ‘solutions’ before writing up tenders and then micromanaging a project to death! A good example is Google’s Patent site (http://www.google.com/patents). Another good example is Wikipedia - I can’t imagine that being delivered by a government department!
It all depends how they are set up and operated and to what purpose - private gain vs public good.
Mark Rock on Wed, January 07, 2009 at 12:58 said:
There’s an interesting article on this area at http://www.readwriteweb.com/archives/cloud_computing_is_more_than_a_computer_in_the_cloud.php
Interestingly he says:
“Just as ‘we’ used to duplicate and under-utilize computational resources, so we do something very similar with our data. We expensively enter and re-enter the same facts, over and over again. We over-engineer data capture forms and schemas, making collection exorbitantly expensive, whilst often appearing to do all we can to limit opportunities for re-use. Under the all-too-easy banners of ‘security’ and ‘privacy’ we secure individual data stores and fail to exploit connections with other sources, whether inside or outside the enterprise.”