General:Project Description

From CWRC

Current revision as of 09:52, 3 August 2010

Summary

This project will use the Orlando textbase as a testbed for studying complexities in the analysis and visualization of patterns of association. It will investigate how emergent methods in text mining and visualization can leverage the textbase's embedded structure to enable new discovery paths in literary history, and how literary historical analysis can be radically extended by these methods. Building on preliminary work on text mining and the use of high-performance computing for literary scholarship, we will test and further develop existing techniques to create new computational approaches to literary history. Our research goal is to assist literary scholars working with Orlando materials by designing systems that support interactive speculative inquiry through text mining and visualization.


Full Description

Orlando: Women’s Writing in the British Isles from the Beginnings to the Present is a literary-historical textbase comprising more than 1,228 core entries on the lives and writing careers of British women writers, male writers, and international women writers; 13,000+ free-standing chronology entries providing context; 22,000+ bibliographical listings; and more than 2 million tags embedded in 6.8 million words of born-digital text (Brown et al. 2005, 2006a, 2006b, forthcoming).

These extensive linkages make Orlando a unique resource for experimenting with data mining, machine learning, and visualization techniques, and for investigating the impact that interpretive markup has on data mining and on the visualization of its results. Building on preliminary work on text mining and the use of high-performance computing for literary scholarship, we will test and further develop existing techniques to create new computational approaches to literary history.


This research project will ask:

  • 1) What methods and algorithms are most appropriate to mining collections of scholarly texts for literary historical data?
  • 2) How does interpretive XML encoding influence the outcomes of text mining and machine learning when compared with input from text with no encoding or only structural encoding?
  • 3) What forms of visualization will be most useful to literary scholars using text mining tools?


We hypothesize that:

  • 1) Data mining can help identify patterns, sequences, and connections of interest to literary historians;
  • 2) Semantic markup significantly enhances the results of such mining;
  • 3) Such methods require new kinds of interfaces and visualization.

This work requires a broadly interdisciplinary, team-based approach. It takes a vital step towards providing literary scholars with next-generation networked tools that combine sophisticated procedures with interfaces that support more varied tasks than current information-retrieval-oriented tools.


The Orlando encoding system, devised for digital rather than print textuality, permits collaboratively authored research structured according to consistent principles. The encoding creates a degree of cross-referencing and textual inter-relation impossible with print scholarship: not simply hyperlinking, but relating separate sections of scholarly text in ways unforeseen even by the authors of the sections. It represents a new approach to the integration of scholarly discourse, one which allows the integrating components to operate in conjunction with, rather than in opposition to, historical specificity and detail (Brown et al. 2006c). However, the search-and-retrieval model of the current interface for Orlando, while user-friendly in that it resembles first-generation online research tools, cannot exploit this encoding to the fullest. Search interfaces only find what the user asks for, whereas mining and visualization enable exploration and discovery of patterns and relationships that one might not be able to search for. We need to assess the potential of second-generation text-exploration methods for literary inquiry.

Preliminary investigations confirm Moretti’s argument that visual representations enable kinds of literary historical inquiry which are not supported by conventional search interfaces. Orlando has the added advantage of making it possible to dive back into the source material to see the specifics from which the representation is produced. The combination of richly encoded humanities material with data mining and visualization has the potential to provide new ways of doing literary history.
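The idea of a visual representation that still lets the reader dive back into the source material can be sketched as a small data structure. The following is a minimal illustration only, not the project's design: the passage identifiers, the names, and the function build_association_graph are all hypothetical. Each edge of the association graph records the passages in which two tagged names co-occur, so a visualization built on it can always lead back to the specifics behind a link.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: (passage id, names tagged in that passage).
# Neither the ids nor the names come from the Orlando textbase.
PASSAGES = [
    ("entry1#p3", ["Writer A", "Writer B"]),
    ("entry2#p1", ["Writer A", "Writer B", "Writer C"]),
    ("entry3#p7", ["Writer C"]),
]


def build_association_graph(passages):
    """Map each unordered pair of names to the passages where both occur."""
    edges = defaultdict(list)
    for passage_id, names in passages:
        # sorted(set(...)) removes repeats and gives each pair a stable order.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)].append(passage_id)
    return dict(edges)


graph = build_association_graph(PASSAGES)

# Edge weight for drawing = number of supporting passages;
# the list itself is the route back into the source text.
for pair, sources in sorted(graph.items()):
    print(pair, len(sources), sources)
```

Keeping the list of supporting passages on each edge, rather than only a count, is what distinguishes this from a plain co-occurrence matrix: the weight can drive the drawing while the list supports inspection of the underlying text.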

Various approaches are available for studying the usability of online systems. The most straightforward are heuristic methods and cognitive walkthroughs, in which experts study the details of an interface with reference to guidelines (Nielsen 2000). In addition to such analysis, we will record participants’ use of the systems and encourage them to discuss what they are doing and why, while we track screen events and make an audio recording (Guha & Saraf 2005; Morrison 1999). This combination of session data with semi-structured interviews provides detailed insight into how participants approach the interface. Some theorists have questioned usability study as unable to guide the design of new opportunities for action (e.g. Dillon 2001), or as inadequately attuned to the goals of scholarly, as opposed to commercial, uses (Brown et al. 2006b). We therefore also include questions directed at the idea of affordance strength (Ruecker 2006b), where the goal is to assess how important a particular tool could be for users. Our approach to design also falls within the terrain of usability, since participatory design posits that interactions between designers, programmers, and end users produce superior results.

Interactive visualization tools require high-resolution display devices, processors, and software. Large data sets also require substantial processing capabilities and storage, and visualization of centrally held data by geographically remote researchers requires a remote visualization capability. Given the size of the Orlando corpus and the processing needed for text mining, we will use the SHARCNET facilities in Ontario, part of the Compute Canada High Performance Computing consortium, which have the processing power to allow us to experiment with both interactive text mining processes and visualizations. This project will pioneer the adaptation of HPC facilities, typically optimized for scientific batch-processing, to address literary problems. SHARCNET has established several initiatives to help support digital humanities scholarship using HPC, and has provided partial funding for a postdoctoral fellow under the supervision of Sinclair.


Methodology

The project understands literary history as a study of the changes, through time, of complex interrelationships of writers, their texts, and the complex conditions under which they write and within which their texts are circulated and received (Brown et al. 2009). The SRG program permits a more sustained interdisciplinary research trajectory than the ITST program that has supported much of the preliminary work.


Phase 1 – Inquiry in Text Mining and Visualization (Year 1)

Review of literature on virtual representation of the past; review and testing of existing text-mining algorithms and visualization techniques; comparison of the results of using machine classification to mine the Orlando data and the same data stripped of the interpretive tags; experimentation with machine classification to leverage Orlando metadata to apply encoding to unstructured text as a basis for more accurate mining; interviews, screen captures, and audio recordings of commentary during use of existing literary text mining and visualization systems, including the existing graph tool (15-20 study participants); refinement of the existing tool based on feedback.
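The comparison between the tagged data and the same data stripped of interpretive tags could be prepared along the following lines. This is a minimal sketch, not the project's pipeline: the tag names (ENTRY, P, NAME, TITLE, PLACE), the split between structural and interpretive tags, the sample sentence, and the function names are all hypothetical and do not reproduce the Orlando tagset. The sketch derives three parallel inputs from one passage: plain text, text with structural encoding only, and the list of interpretive features that a classifier would lose when the markup is stripped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented sample passage; tag names are illustrative assumptions.
SAMPLE = (
    "<ENTRY><P><NAME>A. Writer</NAME> published "
    "<TITLE>An Example Novel</TITLE> in <PLACE>London</PLACE>.</P></ENTRY>"
)

# Assumed structural tags; everything else is treated as interpretive.
STRUCTURAL = {"ENTRY", "P"}


def plain_text(xml_string):
    """Return the passage with every tag removed."""
    return "".join(ET.fromstring(xml_string).itertext())


def render(elem, keep):
    """Serialize elem, keeping only tags named in `keep` (attributes are dropped)."""
    inner = escape(elem.text or "")
    for child in elem:
        inner += render(child, keep)
        inner += escape(child.tail or "")
    if elem.tag in keep:
        return f"<{elem.tag}>{inner}</{elem.tag}>"
    return inner  # unwrap the tag but keep its content


def structural_only(xml_string):
    """Return the passage with interpretive tags unwrapped."""
    return render(ET.fromstring(xml_string), STRUCTURAL)


def interpretive_features(xml_string):
    """List (tag, text) pairs for every interpretive element, in document order."""
    root = ET.fromstring(xml_string)
    return [
        (e.tag, "".join(e.itertext()))
        for e in root.iter()
        if e.tag not in STRUCTURAL
    ]


print(plain_text(SAMPLE))
print(structural_only(SAMPLE))
print(interpretive_features(SAMPLE))
```

Feeding each of the three variants to the same classifier, with everything else held fixed, is one way to isolate the contribution of the interpretive markup.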


Phase 2 – Design of Data Mining Visualizations and Interfaces (Year 2)

Refinement of data mining applications; continued testing of impact of markup; iterative design and testing of interface concepts and data visualizations, beginning with 3-5 frequent Orlando users responding to static sketches, kinetic sketches, and working prototypes; initial design of interface and visualization prototypes in conjunction with feedback from the study participants.


Phase 3 – Interface Implementation, Testing and Dissemination (Year 3)

Development of at least 2 online prototypes that will use insights of Phases 1 and 2; user testing using a study protocol involving questionnaires, semi-structured interviews, and screen captures connected to a think-aloud audio recording (15-20 users). Results will be used to refine the prototypes further.

We will assess existing text mining algorithms on the basis of researchers’ experience, the scholarly literature, and experimentation with the most promising methods. We will assess the utility of markup by, for instance, running parallel tests on our data with and without the semantic markup (testing when less markup is preferable or sufficient, and when more markup affords additional opportunities for analytic operations, visual representations, and user manipulations). We will test existing interfaces with qualitative interviews and think-aloud procedures (Guha & Saraf 2005; Morrison 1999), using purposive, maximum variation sampling to recruit a range of participants, analyzing interview results using a grounded theory approach (Glaser 1992; Glaser & Strauss 1967), and incorporating the results into interface design, implementation, and testing. We will go beyond user-centred design (e.g., Cockrell & Jayne 2002) to adopt a participatory design method that involves researchers directly in an iterative process.