National Science Foundation (NSF)


In 2008, the ACCI partnered with the National Science Foundation to provide grant funding to academic researchers interested in exploring large-data applications that could take advantage of this infrastructure. This resulted in the creation of the Cluster Exploratory (CLuE) program, led by Dr. Jim French, which currently funds 14 university projects.

The first round of CLuE grants was awarded to universities using the IBM/Google cluster.

Research Projects

A Comparative Study of Approaches to Cluster-Based Large Scale Data Analysis
This is a collaborative study being conducted by MIT, the University of Wisconsin, and Yale University. The three universities are using a National Science Foundation CLuE grant for a comparative study of approaches to cluster-based, large-scale data analysis. Both MapReduce and parallel database systems provide scalable data processing over hundreds to thousands of nodes, yet researchers need to understand how the two approaches differ in performance and scalability in order to choose the more suitable one when designing new data-intensive computing applications. This project is engaged in systems research, much of which requires the ability to change the operating environment. Since this is not possible on the IBM/Google cluster, the project is also hosted on the Cloud Computi ....
A Hadoop Toolkit for Distributed Text Retrieval
Text search is a technology that is vital for modern information-based societies. Today's systems face the daunting challenge of handling quantities of text previously unimaginable. Cluster computing is the only practical solution for addressing the issue of scale. This project leverages the MapReduce framework (via the open-source Hadoop implementation) to tackle issues of robustness and scalability in processing large amounts of data for information retrieval applications.
A Unified Reinforcement Learning Approach for Autonomic Cloud Management
Cloud computing, unlocked by virtualization, is emerging as an increasingly important service-oriented computing paradigm. The goal of this project is to develop a unified reinforcement learning approach, namely URL, that automates the configuration of virtual machines and the applications running on them, and adapts the system configuration to the dynamics of the cloud.
Commodity Computing in Genomic Research
This NSF CLuE project focuses on developing parallel algorithms for analyzing the next generation of sequencing data. Scientists can now generate the rough equivalent of an entire human genome in just a few days with a single sequencing instrument. The analysis of this data is complicated by its size: a single run of a sequencing instrument yields terabytes of information, often requiring a significant scale-up of the existing computational infrastructure needed for analysis.
Data-Intensive Text Processing
The NSF CLuE initiative is funding a machine translation project that promises to bridge the language divide in today's multi-cultural and multi-faceted society. Systems capable of converting text from one language into another have the potential to transform how diverse individuals and organizations communicate.
Feedback-Controlled Management of Virtualized Resources for Predictable eScience
This project pursues a novel unified framework to ensure predictable eScience based on two dominant emerging uses of virtualized resources. The foundation of the approach is to wrap an eScience application in a performance container framework and dynamically regulate the application's performance through the application of formal feedback control theory.
Hierarchically-Redundant, Decoupled Storage Project (HaRD)
The Wisconsin Hierarchically-Redundant, Decoupled storage project (HaRD) investigates the next generation of storage software for hybrid Flash/disk storage clusters. The main objective of the project is to improve the performance of storage in a variety of diverse scenarios, including new application environments such as photo storage as found in Facebook and Flickr, high-end scientific processing as found in government labs, and large-scale data processing such as that found in Google and Microsoft.
Hybrid Opportunistic Computing for Green Clouds
Abstract: On-demand, service-oriented cloud computing infrastructures continue to increase in popularity with organizations. Three observations motivate us to investigate running high-throughput, data-intensive tasks as background workloads on these cloud infrastructures. First, the rapid growth in hardware parallelism leaves more residue resources to be exploited. Second, the "incremental power usage" of piggybacking a secondary background workload onto the foreground workload to utilize those residue resources is relatively low. Third, the advances in GPGPU (General-Purpose GPU) processing enable a novel coupling of concurrent workloads. This project will explore a new computing model of offering cloud services on active nodes that are serving on-demand utility computing users. We pla ....
Image Super-Resolution Using Trillions of Examples
Imagine continuously zooming into an image from your personal photo collection. Unlike in modern image-processing software, however, this zoom operation would reveal details missing from the original image. For example, zooming into someone's shirt would eventually show a high-resolution image of the threads that compose it. A research team in the Department of Computer Science at the University of Virginia plans to develop techniques for intelligently enlarging a digital image, using a database of millions of online images to find examples of what its components look like at a higher spatial resolution.
Learning Word Relationship Using TupleFlow
This project focuses on how researchers at the Center for Intelligent Information Retrieval (CIIR) are using the CLuE infrastructure to learn more about word relationships. These relationships are not labeled explicitly in text and are quite varied; by exploiting them, this project will help lead to more effective ranking of web-retrieval results.
One Thousand Points of Light
A large class of distributed data-rich applications, including distributed data mining, distributed workflows, and Web 2.0 Mashups, are increasingly relying on cloud services to meet their data storage and computing demands. This project proposes a cloud proxy network that allows optimized and reliable data-centric operations to be performed at strategic network locations.
Scaling the Sky with MapReduce/Hadoop
Astrophysics is addressing many fundamental questions about the nature of the universe through a series of ambitious wide-field optical and infrared imaging surveys. New methodologies for analyzing and understanding petascale data sets are required to answer these questions. This research project is focused on developing new algorithms for indexing, accessing, and analyzing astronomical images. This work is expected to have a broad range of applications in other data-intensive fields.
Trustworthy Virtual Cloud Computing
Abstract: Virtual cloud computing is emerging as a promising solution for IT management, both easing the provisioning and administration of complex hardware and software systems and reducing operational costs. With the industry’s continuous investment (e.g., Amazon Elastic Compute Cloud, IBM Blue Cloud), virtual cloud computing is likely to be a major component of future IT solutions, which will have a significant impact on almost all sectors of society. The trustworthiness of virtual cloud computing is thus critical to the well-being of all organizations and individuals that will rely on it for their IT solutions. This project envisions trustworthy virtual cloud computing and investigates fundamental research issues leading to this vision. Central to this visi ....
Where the Ocean Meets the Cloud
This project is building a new infrastructure for computational oceanography that uses the CLuE platform to allow ad hoc, longitudinal query and visualization of massive ocean simulation results at interactive speeds. This infrastructure leverages and extends two existing systems: GridFields, a library for general and efficient manipulation of simulation results, and VisTrails, a comprehensive platform for scientific workflow, collaboration, virtualization, and provenance.