Draft Proposal for SHARCNET


Parallelizing the Computation of Relationships Extracted from Textual Information

Michael Bauer and Susan Brown

Background The Orlando textbase constitutes the single most extensive and detailed resource in the area of literary history. Though Orlando resembles a reference work, its electronic structure embeds an entire critical and theoretical framework to support advanced literary historical enquiry, and it has been recognized by reviewers as a trailblazer in the production of electronic resources for the humanities. The Orlando textbase constitutes a rare testbed for investigating the mining and visualization of structured text. Composed of more than 1,200 critical biographies, mostly of women writers from the British Isles, and of contextual materials, it is extensively encoded using an interpretive XML tagset with more than 250 tags covering everything from cultural influences to relations with publishers or the use of nonstandard English in literary texts. The primary users of Orlando are literary scholars at both the faculty and graduate levels, and undergraduate students of English literature. Related references to Orlando include:

[1] Booth, Alison. Biography 31:4 (Fall 2008), 725-34.

[2] Brown, Susan. Orlando: Women's Writing in the British Isles from the Beginnings to the Present. Ed. with Patricia Clements and Isobel Grundy. Cambridge: Cambridge University Press.

[3] Brown, Susan, Patricia Clements, and Isobel Grundy. Scholarly Introduction. Orlando: Women's Writing in the British Isles from the Beginnings to the Present. Cambridge: Cambridge UP.

[4] Fraiman, Susan. "In Search of Our Mothers' Gardens—With Help from a New Digital Resource for Literary Scholars." Modern Philology, August 2008, 142-48.

[5] Harner, James L. Literary Research Guide: An Annotated Listing of Reference Sources in English Literary Studies, 5th ed. New York: MLA, 2008.

[6] Hickman, Miranda. Tulsa Studies in Women's Literature 27:1 (Spring 2008), 180-86.

[7] Reisz, Matthew. "In Search of a Good Companion: Matthew Reisz Weighs Up the Role of Weighty Tomes of Literary Reference in the Digital Age." Times Higher Education, 928:1 (December-January 2009).

Overview of the Project A visualization tool, Orviz0, has been developed to enable literary scholars to explore portions of the Orlando textbase based on their research interests. The tool creates a graph of all the names tagged in an XML file; the XML file is a user-specified subset of entries extracted from Orlando, which could be all 1,200+ entries. These names are grouped, as nodes, around the name of the entry in which they were found and connected, by edges, to that entry. Each edge is color coded according to the tag that enclosed the name in that entry, and the user is able to choose the tags they wish to examine. With the current implementation of the tool, there is a challenge in dealing with the large number of nodes and edges that can result from a large data set. The tool is currently used on one of the SHARCNET visualization workstations, and when a user changes the chosen set of relationships (tags), reprocessing the relationship data and rendering the screens can take 15-20 minutes for a large set of entries. This makes the tool ineffective for interactive use by literary scholars. There is a need to adapt the code to make use of parallel computation in order to enhance the performance of the tool and to make it more accessible to the literary scholars who wish to explore the information within the Orlando textbase. This summer the code is being cleaned, bugs fixed, and some performance enhancements incorporated. This proposal requests programming support to enhance the overall performance of Orviz, especially on large Orlando data sets, through parallelization of some of its computational components.
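The node-and-edge structure described above can be sketched in a few lines. The following is an illustrative sketch only (the actual tool is written in C++): the element names and the sample XML are invented for the example and are not the real Orlando tagset.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative snippet: entries containing tagged names. The element
# names here are hypothetical stand-ins, not Orlando's actual schema.
SAMPLE = """
<entries>
  <entry name="Virginia Woolf">
    <friendsAssociates><name>Vita Sackville-West</name></friendsAssociates>
    <publisher><name>Hogarth Press</name></publisher>
  </entry>
  <entry name="Vita Sackville-West">
    <friendsAssociates><name>Virginia Woolf</name></friendsAssociates>
  </entry>
</entries>
"""

def build_graph(xml_text):
    """Group tagged names as nodes around their entry, recording for
    each (entry, name) edge the tag that enclosed the name; the tag
    is what the tool uses to color code the edge."""
    edges = defaultdict(list)  # (entry, name) -> [enclosing tag, ...]
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        hub = entry.get("name")
        for relation in entry:              # e.g. friendsAssociates
            for name in relation.iter("name"):
                edges[(hub, name.text)].append(relation.tag)
    return edges

graph = build_graph(SAMPLE)
print(graph[("Virginia Woolf", "Hogarth Press")])  # ['publisher']
```

Filtering this structure by a user-selected set of tags is then a matter of keeping only the edges whose tag list intersects the selection, which is the recomputation step that currently dominates the 15-20 minute delay on large data sets.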

Project Objectives Orlando is but one example of the growing number of large digital corpora that humanists have, or will have, access to. Enabling humanists to visualize and explore these data sets remains a challenging problem. Part of the Orlando visualization effort is to begin to identify approaches and their computational challenges, and to evaluate them in practice. This work is seen as a foundation for the development of subsequent novel approaches to visualization, exploration and information extraction. The specific objectives of this project are:

• To improve an existing tool to work with larger data sets. Orlando is already a large textbase with many relationships of interest to humanists. As indicated, there is a need to improve the interactive experience of exploring large data sets.

• To establish a basis for future work. Literary scholars and humanists are increasingly faced with growing volumes of data, such as Orlando. There is a need for a platform for future research into approaches for visualization and exploration.

• To enhance the computation of graph-based metrics and information. The relationships in Orlando are represented as graphs of labeled nodes and labeled edges. Graph algorithms for identifying paths and related structures are already included, and others are needed. These algorithms can be computationally expensive on large data sets. Parallel versions of these algorithms can significantly speed up the computations.

• To evaluate the tool and interface. The Orlando textbase is large, and literary scholars often need to explore relationships in unpredictable ways. Ultimately, the aim is to evaluate the interface and tool with humanities scholars.
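The path-finding objective above can be made concrete with a small example. The sketch below uses breadth-first search to find a shortest path between two names in a relationship graph; the adjacency data is invented for illustration, and the production algorithms would run in C++ (possibly via Titan's graph support) rather than Python.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search for a shortest path between two names.
    Returns the list of names along the path, or None if no path exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in adjacency.get(path[-1], ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Toy relationship graph (invented names for illustration).
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
}
print(shortest_path(adjacency, "C", "D"))  # ['C', 'A', 'B', 'D']
```

On a graph built from all 1,200+ entries, many such queries (paths, connected components, centrality) become expensive, which is why parallel versions are of interest.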

Methodology The current visualization tool is being re-engineered to improve performance, to fix known bugs and to address some limitations of the interface. This version, Orviz0, will be used by a few select researchers during the fall of 2010 as part of experiments to evaluate the general form of the interface. This work is being done by a summer student, Jonathan Cable.

The visualization tool can be broadly divided into three computational parts: a file input/output part, primarily involved in parsing the XML data extracted from the Orlando textbase; a computational part, where the graph structures are formed and relationships computed; and a visualization part, where the information is displayed and through which the user interacts with the tool. During this re-engineering, the modules/functions of Orviz0 involved in each of these parts will be identified. If possible, the code will be segmented into two components: a user interface component that handles interactions with the user, and a computational component that recomputes the relationships and the data to be displayed. Orviz0 and this information will form the starting point for the project.
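The proposed interface/computation split can be sketched as two cooperating components. This is a structural illustration only, with invented class and tag names; the real components would be C++ modules identified during the re-engineering.

```python
class ComputeComponent:
    """Recomputes relationship data when the selected tags change.
    This is the part that would be parallelized."""

    def __init__(self, edges):
        self.edges = edges  # (entry, name) -> [enclosing tag, ...]

    def filter_by_tags(self, selected):
        """Keep only edges whose tag list intersects the selection."""
        return {key: tags for key, tags in self.edges.items()
                if any(t in selected for t in tags)}

class InterfaceComponent:
    """Handles user interaction and delegates recomputation."""

    def __init__(self, compute):
        self.compute = compute

    def on_tag_selection(self, selected):
        # In the real tool this would trigger recomputation and
        # re-rendering; here it just returns the filtered edges.
        return self.compute.filter_by_tags(set(selected))

# Invented edge data for illustration.
edges = {("E1", "N1"): ["publisher"], ("E1", "N2"): ["friends"]}
ui = InterfaceComponent(ComputeComponent(edges))
print(ui.on_tag_selection(["publisher"]))  # {('E1', 'N1'): ['publisher']}
```

Keeping the interface thin in this way means the expensive recomputation path can later be moved onto a cluster without changing what the user sees.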

Since the LOI was accepted, there have been a number of exchanges among Bauer, Cable, Patrick Emond (SHARCNET specialist) and Tyson Whitehead (SHARCNET specialist). This included making the code available to both SHARCNET staff members. More recently, there have been further exchanges with Susan Brown, Denilson Barbosa (Computer Science, University of Alberta; also involved in aspects of the Orlando project), Weiguang Guang, Jeff Antoniuk and others involved in the Orlando project. These discussions have been useful in ensuring that developments and directions on this project are consistent with other work within the Orlando project and related humanities projects.

Patrick’s efforts have determined that two toolkits have the potential to be very useful in revamping the existing tool:

• Libxml++ ( ): Libxml++ is a C++ wrapper for the libxml2 library. It provides a library of functions for efficient parsing of XML data.

• Titan ( ): The Titan Informatics Toolkit is a toolkit available from Sandia National Laboratories and provides a flexible, component-based pipeline architecture for ingestion, processing, and display of informatics data. It makes use of VTK (scientific visualization), which is also used by the current prototype. It also provides support for a number of graph algorithms. Titan supports a distributed memory model of computation using MPI.

Re-engineering the existing tool using these two toolkits will comprise the core of the programming effort:

• Replace the existing XML handling with Libxml++;

• Modify the interface, currently implemented in VTK, to use the Titan libraries;

• Incorporate selected graph algorithms available in Titan into the prototype;

• Assess other computations done within the tool to determine whether they could fruitfully be parallelized; this may not be significant given the capabilities of Titan, but will need to be assessed during the course of the project.

Timelines and Deliverables • October 1-15: Project Start: set up code repositories, download and install toolkits and libraries (current work on prototype should be done by then).

Deliverable: Toolkits and libraries installed

• October 15-November 15: Retrofit prototype, Orviz0, to make use of Libxml++.

Deliverable: New single server version, Orviz1.

• November 15-30: Testing of new version; comparison of results/performance to initial (old) version.

Deliverable (Major Milestone): New version of tool using Libxml++, Orviz1; can be handed over to humanities scholars for use.

• December 1-January 15: Conversion of interface to Titan; porting of tool to mako (MPI use).

Deliverable: New version using Titan interface components.

• December 1-January 15: Identification of core set of graph algorithms.

Deliverable: Summary report (Brown, Bauer, etc.)

• January 15-February 15: New version using selected graph algorithms.

Deliverable: New version, Orviz2.

• February 15-February 28: Testing of Orviz2.

Deliverable: New version, Orviz2, running on mako – available for use by humanities researchers.

• February 28-March 15: Review of performance, function – identify further modifications to computations.

• March 15-March 31: Complete additional modifications, if any.

• April 1- April 15: Final testing.

Deliverable: Final version released.

Related Supporting Information Dr. Brown and Dr. Bauer have collaborated on the development of an initial tool for visualization of relationships from the Orlando textbase. A paper reporting on the potential for this kind of tool in scholarly research was presented at the international conference DH2010, the leading conference in digital humanities, in July 2010. The paper is: "How Do You Visualize a Million Links?", by S. Brown, J. Antoniuk, M. Bauer, J. Berberich, M. Radzikowska, S. Ruecker, T. Yung. The abstract is available online.

Dr. Brown is currently the Director of the Orlando Project and leader of the CFI-funded Canadian Writing Research Collaboratory (CWRC), which will develop a platform for scholarly collaboration incorporating a range of tools, including the kind of experimental visualization tools that this project aims to extend, and test with users in the scholarly community.

Dr. Bauer received support from SHARCNET in 2007 to support a visitor (Dr. Mario Dantas) from Brazil. That visit resulted in a number of papers, including:

• R. Viegas Diogo, M. Dantas, and M. Bauer, A Case Study of Transport Protocols to Improve the Execution of Applications in Virtual Organisations Utilising Multicluster Network Configurations. International Journal of Networking and Virtual Organizations, (to appear).

• D. Ferreira, M. Dantas, J. Qin, M. Bauer, Dynamic Resource Matching for Multi-Clusters Based on an Ontology-Fuzzy Approach. The 23rd Annual Symposium on High Performance Computing Systems and Applications (HPCS 2009), Lecture Notes in Computer Science LNCS 5976, Springer, pp. 230-240.

• Diogo Viegas, R.P. Mendonça, Mario Dantas, Michael Bauer, SCTP, XTP and TCP as Transport Protocols for High Performance Computing on Multi-Cluster Grid Environments. The 23rd Annual Symposium on High Performance Computing Systems and Applications (HPCS 2009), Lecture Notes in Computer Science LNCS 5976, Springer, pp. 241-250.

• D. J. Ferreira, A. Silva, M. Dantas, J. Qin and M. Bauer. Toward Resource Management in Multi-Cluster Grid Configurations Through an Ontology-Fuzzy Approach. The 2009 International Conference on Grid Computing and Applications (GCA'09), July, 2009. pp. 10-16.