Abstract: Researchers at Carnegie-Mellon University are using cloud computing to characterize the topicality of web content to more effectively process web searches. Routing searches topically requires less effort than traditional searches, enabling significant computational and financial savings. The project is using the Google/IBM cluster to "crawl" the web and perform the data cleansing and pre-processing necessary to develop a web dataset of 1 billion documents to support the research. The web dataset is also being made available to the larger information retrieval community to multiply the impact of the project on that discipline.
Presentation: Topic-Partioned Search Engine Indexes - October 2009 
Abstract: The second research project funded by the NSF CluE program is focused on developing the Integrated Cluster Computing Architecture (INCA) for machine translation (using computers to translate from one language to another). Open-source toolkits make it easier for new research groups to tackle the problem at lower costs, broadening participation. Unfortunately, existing toolkits have not kept up with the computing infrastructure required for modern "big data" approaches to machine translations; INCA will fill this void.
Presentation: Cluster Computing for Statistical Machine Translation - October 2009
Resources
- Paper: In Search of an API for Scalable File Systems: Under the Table or Above It?

By Swapnil Patil, Garth A Gibson, Gregory R Ganger, Julio Lopez, Milo Polte, Wittawat Tantisiroj, and Lin Xiao
Abstract: “Big Data” is everywhere – both the IT industry and the scientific computing community are routinely handling terabytes to petabytes of data. This preponderance of data has fueled the development of data-intensive scalable computing (DISC) systems that manage, process and store massive data-sets in a distributed manner. For example, Google and Yahoo have built their respective Internet services stack to distribute processing (MapReduce and Hadoop), to program computation (Sawzall and Pig) and to store the structured output data (Bigtable and HBase). Both these stacks are layered on their respective distributed file systems, GoogleFS and Hadoop distributed FS, that are designed “from scratch” to deliver high performance primarily for their anticipated DISC workloads. However, cluster file systems have been used by the high performance computing (HPC) community at even larger scales for more than a decade. These cluster file systems, including IBM GPFS, Panasas PanFS, PVFS and Lustre, are required to meet the scalability demands of highly parallel I/O access patterns generated by scientific applications that execute simultaneously on tens to hundreds of thousands of nodes.
Thus, given the importance of scalable storage to both the DISC and the HPC world, we take a step back and ask ourselves if we are at a point where we can distill the key commonalities of these scalable file systems. This is not a paper about engineering yet another “right” file system or database, but rather about how do we evolve the most dominant data storage API – the file system interface – to provide the right abstraction for both DISC and HPC applications. What structures should be added to the file system to enable highly scalable and highly concurrent storage? Our goal is not to define the API calls per se, but to identify the file system abstractions that should be exposed to programmers to make their applications more powerful and portable. This paper highlights two such abstractions. First, we show how commodity large-scale file systems can support distributed data processing enabled by the Hadoop/MapReduce style of parallel programming frameworks. And second, we argue for an abstraction that supports indexing and searching based on extensible attributes, by interpreting BigTable as a file system with a filtered directory scan interface.
Download the Paper 
- Paper: Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop

By Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan
Abstract: Mochi, a new visual, log-analysis based debugging tool correlates Hadoop’s behavior in space, time and volume, and extracts a causal, unified control- and data- flow model of Hadoop across the nodes of a cluster. Mochi’s analysis produces visualizations of Hadoop’s behavior using which users can reason about and debug performance issues. We provide examples of Mochi’s value in revealing a Hadoop job’s structure, in optimizing real-world workloads, and in identifying anomalous Hadoop behavior, on the Yahoo! M45 Hadoop cluster.
Paper - Slides 
|
|
|