Carnegie Mellon University


Carnegie Mellon University is actively involved in several cloud computing research programs and is one of the test sites for the Open Cirrus program. Its research includes studies on Multi-Tier Indexing for Web Search Engines, the Integrated Cluster Computing Architecture, and others.


Research Projects

Integrated Cluster Computing Architecture (INCA)
This research project, funded by the NSF CluE program, focuses on developing the Integrated Cluster Computing Architecture (INCA) for machine translation (using computers to translate from one language to another). Open-source toolkits make it easier for new research groups to tackle the problem at lower cost, broadening participation. Unfortunately, existing toolkits have not kept up with the computing infrastructure required for modern big data approaches to machine translation. INCA will fill this void.
Multi-Tier Indexing for Web Search Engines
Researchers at Carnegie Mellon University are using cloud computing to characterize the topicality of web content in order to process web searches more effectively.

Resources

Presentation: Topic-Partitioned Search Engine Indexes PDF


Presentation: Cluster Computing for Statistical Machine Translation PDF


Paper: In Search of an API for Scalable File Systems: Under the Table or Above It? PDF
By Swapnil Patil, Garth A. Gibson, Gregory R. Ganger, Julio Lopez, Milo Polte, Wittawat Tantisiriroj, and Lin Xiao.

Abstract: “Big Data” is everywhere – both the IT industry and the scientific computing community are routinely handling terabytes to petabytes of data. This preponderance of data has fueled the development of data-intensive scalable computing (DISC) systems that manage, process and store massive data-sets in a distributed manner. For example, Google and Yahoo have built their respective Internet services stack to distribute processing (MapReduce and Hadoop), to program computation (Sawzall and Pig) and to store the structured output data (Bigtable and HBase). Both these stacks are layered on their respective distributed file systems, GoogleFS and Hadoop distributed FS, that are designed “from scratch” to deliver high performance primarily for their anticipated DISC workloads. However, cluster file systems have been used by the high performance computing (HPC) community at even larger scales for more than a decade. These cluster file systems, including IBM GPFS, Panasas PanFS, PVFS and Lustre, are required to meet the scalability demands of highly parallel I/O access patterns generated by scientific applications that execute simultaneously on tens to hundreds of thousands of nodes.

Thus, given the importance of scalable storage to both the DISC and the HPC world, we take a step back and ask ourselves if we are at a point where we can distill the key commonalities of these scalable file systems. This is not a paper about engineering yet another “right” file system or database, but rather about how do we evolve the most dominant data storage API – the file system interface – to provide the right abstraction for both DISC and HPC applications. What structures should be added to the file system to enable highly scalable and highly concurrent storage? Our goal is not to define the API calls per se, but to identify the file system abstractions that should be exposed to programmers to make their applications more powerful and portable. This paper highlights two such abstractions. First, we show how commodity large-scale file systems can support distributed data processing enabled by the Hadoop/MapReduce style of parallel programming frameworks. And second, we argue for an abstraction that supports indexing and searching based on extensible attributes, by interpreting BigTable as a file system with a filtered directory scan interface.
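The second abstraction, a filtered directory scan over extensible attributes, can be pictured with a small sketch. This is only an illustration under stated assumptions and not the interface proposed in the paper: the `Entry` and `Directory` classes, the `scan` method, and the attribute names (`lang`, `size`) are all hypothetical, and the "directory" is a plain in-memory list rather than a real file system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Entry:
    """A directory entry carrying user-defined (extensible) attributes."""
    name: str
    attrs: Dict[str, object] = field(default_factory=dict)


class Directory:
    """Hypothetical in-memory directory; not a real file system API."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def add(self, name: str, **attrs: object) -> None:
        # Attach arbitrary key/value attributes to the new entry.
        self._entries.append(Entry(name, dict(attrs)))

    def scan(self, predicate: Callable[[Entry], bool]) -> Iterator[Entry]:
        """Yield only the entries whose attributes satisfy the predicate."""
        for entry in self._entries:
            if predicate(entry):
                yield entry


d = Directory()
d.add("page-001", lang="en", size=2048)
d.add("page-002", lang="de", size=512)
d.add("page-003", lang="en", size=128)

# Filtered scan: English entries larger than 1 KiB.
large_english = [
    e.name
    for e in d.scan(lambda e: e.attrs.get("lang") == "en"
                    and e.attrs.get("size", 0) > 1024)
]
print(large_english)
```

The point of the sketch is that the filter is handed to the directory scan itself, so a scalable implementation could in principle evaluate it where the entries are stored instead of returning every entry to the caller.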


Paper: Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop PDF
By Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan.
Abstract: Mochi, a new visual, log-analysis based debugging tool, correlates Hadoop’s behavior in space, time and volume, and extracts a causal, unified control- and data-flow model of Hadoop across the nodes of a cluster. Mochi’s analysis produces visualizations of Hadoop’s behavior that users can use to reason about and debug performance issues. We provide examples of Mochi’s value in revealing a Hadoop job’s structure, in optimizing real-world workloads, and in identifying anomalous Hadoop behavior, on the Yahoo! M45 Hadoop cluster.