General:Corpus Creation Instructions - Voyeur

From CWRC

1. Create a new folder in which to put your collection, and give it a meaningful name. Although Voyeur does not require it, make sure there are no spaces in the name if you want this collection to work with Mandala as well.
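As a minimal sketch of step 1, the folder can also be created from a script that replaces any spaces in the name, so that the collection stays usable with Mandala. The function name `make_corpus_folder` and the choice of underscores as the replacement are assumptions made for illustration, not something Voyeur or Mandala prescribes.

```python
import re
from pathlib import Path


def make_corpus_folder(name: str, parent: str = ".") -> Path:
    """Create a folder for the collection, replacing whitespace so the name has no spaces."""
    # Assumption: underscores are an acceptable stand-in for spaces.
    safe_name = re.sub(r"\s+", "_", name.strip())
    folder = Path(parent) / safe_name
    # exist_ok=True makes the call safe to repeat on an existing folder.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

For example, `make_corpus_folder("Austen early novels")` would create a folder named `Austen_early_novels` in the current directory.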

2. Assemble a collection of between one and six related works, e.g. texts by the same author, or texts that are thematically, generically, or chronologically related. Files can be in plain text (.txt), HTML (.html), XML (.xml), MS Word (.doc, .docx), RTF (.rtf), or PDF (.pdf). If a work is available in several file formats and you are unsure which one to pick, choose plain text (.txt). Good sources for files are Project Gutenberg (http://www.gutenberg.org/) and the Internet Archive (http://www.archive.org). To see all file formats on the Internet Archive, click “All Files: HTTP”. To download a plain text (.txt) file, click on its link and then select one of the following menu options, depending on the browser you are using:

  • Internet Explorer: “File > Save As” or “Page > Save As”
  • Firefox or Chrome: “File > Save Page As”
  • Safari: “File > Save”
  • Opera: “Menu > Page > Save as” or “File > Save As”

XML files can usually be downloaded in the same way as plain text files. Alternatively, clicking on an XML file sometimes opens a dialog box asking whether you want to save the file to your computer; if so, save the file.

3. Put all the files you collected into the folder you created.
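Once the files are in the folder, a quick check can confirm that the collection matches the limits given in step 2: between one and six works, each in one of the listed formats. The sketch below is illustrative only; `check_collection` is a hypothetical helper, and it counts every file in the folder as one work.

```python
from pathlib import Path

# File extensions listed in step 2 of these instructions.
SUPPORTED = {".txt", ".html", ".xml", ".doc", ".docx", ".rtf", ".pdf"}


def check_collection(folder):
    """Return a list of problems found; an empty list means the collection looks fine."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    problems = []
    if not 1 <= len(files) <= 6:
        problems.append(f"expected 1 to 6 works, found {len(files)}")
    for path in files:
        # Compare extensions case-insensitively, so ".XML" counts as ".xml".
        if path.suffix.lower() not in SUPPORTED:
            problems.append(f"unsupported format: {path.name}")
    return problems
```

Running `check_collection("Austen_early_novels")` on a folder holding, say, three .txt files would return an empty list.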

4. Open each file and delete any headers and footers that contain metadata about the file. This supplementary material is something you will likely not want to analyze. Headers appear before the beginning of the actual work and footers after its end; both are typically well marked with asterisks and/or a statement to the effect that the work is beginning or ending.
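For plain text files, the header and footer removal in step 4 can be sketched in a few lines. The marker patterns below are an assumption: they match lines that begin with three asterisks followed by START or END, which is how Project Gutenberg files commonly mark the boundaries of a work. Files from other sources may use different markers, so check the result by eye.

```python
import re

# Assumed marker style: a line starting with "***" followed by START or END.
# Adjust these patterns to whatever markers your files actually use.
START_MARKER = re.compile(r"^\*\*\*\s*START\b.*$", re.MULTILINE | re.IGNORECASE)
END_MARKER = re.compile(r"^\*\*\*\s*END\b.*$", re.MULTILINE | re.IGNORECASE)


def strip_header_footer(text: str) -> str:
    """Return the text between the markers, or the text unchanged if either marker is missing."""
    start = START_MARKER.search(text)
    if not start:
        return text
    # Look for the end marker only after the start marker.
    end = END_MARKER.search(text, start.end())
    if not end:
        return text
    return text[start.end():end.start()].strip()
```

Returning the text unchanged when a marker is missing is a deliberately cautious choice: it avoids deleting part of a work whose boundaries the script could not identify.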

5. Rename each file using the pattern Date-Title-Author.txt. This will allow you to view your texts chronologically in Voyeur. Make sure there are no spaces in the names if you want this collection to work with Mandala as well.
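The naming pattern in step 5 can be generated consistently with a small helper. This is a sketch under stated assumptions: `corpus_filename` is a hypothetical function, the date is taken to be a four-digit year, and words within a field are joined with underscores so that the hyphens remain reserved for separating the date, title, and author.

```python
import re


def corpus_filename(date: str, title: str, author: str, ext: str = ".txt") -> str:
    """Build a Date-Title-Author file name that contains no spaces."""

    def clean(part: str) -> str:
        # Replace punctuation (including hyphens) with spaces, then join
        # the remaining words with underscores.
        words = re.sub(r"[^\w\s]", " ", part).split()
        return "_".join(words)

    return f"{clean(date)}-{clean(title)}-{clean(author)}{ext}"
```

For example, `corpus_filename("1813", "Pride and Prejudice", "Jane Austen")` produces `1813-Pride_and_Prejudice-Jane_Austen.txt`.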