Amazon announced the release of a new web service today that aims to make open, public data sets easier to access. Public Data Sets on Amazon Web Services will attempt to make a wide range of public data available for free use by anyone. Users can interact with data sets via an Amazon EC2 machine image and pay only for their compute time; they won’t have to worry about storing, downloading, or cleaning the actual data.

According to Amazon business development manager Deepak Singh, the new program “significantly lowers the barrier for researchers and data analysts to access and use some of the most commonly used data sets in their communities.”

Previously, using the kind of large data sets that Amazon plans to host for research purposes was a tedious, multi-step affair. Researchers needed to locate the data, download it, and then often convert, clean, or customize it into a format usable for their needs. Sometimes just downloading the data is a huge barrier. One of the data sets on Amazon, for example, is a MySQL database from the life sciences project Ensembl, which maintains an “automated annotation on a number of eukaryotic genomes.” That data set weighs in at a mammoth 650 gigabytes and contains 31,000 files. The technical logistics of wrangling a database that large would be an insurmountable hurdle for many researchers with limited resources.

Now, the data will be available for use across the entire ecosystem of Amazon web services with almost no work on the part of researchers to get up and running. Amazon hopes that developers will create public tools to analyze the data and mash it up with other sources, and that by making data more easily available to a wider range of people, the project will help to foster innovation.

Amazon has a wide range of public data sets available now and plans to add more in the future.


At launch, or shortly after, Amazon’s service offers human genome and DNA sequencing data from Ensembl and the National Center for Biotechnology Information; chemistry data from Indiana University; and economic data from the US Census Bureau, the Bureau of Labor Statistics, the Bureau of Transportation Statistics, and the Bureau of Economic Analysis.

How will you use the data Amazon is making available? What types of mashups would you like to see created? And what sort of data would you like to see added? Let us know in the comments.



