Literary Computing

[The original headnote that appeared with the essay is included here.]

Dr. Whalley has been interested in the use of computers in the humanities since 1963, and in the application of computers to the variety of problems presented by the editing and composition of literary texts.  We are indebted to him for this excellent state-of-the-art résumé of the field.  But Dr. Whalley goes farther than this.  He asks computer scientists to respond to the plea of the humanist.  A great deal of the research effort during the past 10 years has been directed into the computer-oriented development of formal languages.  We draw attention to Chomsky's pioneering efforts in the analysis of syntactic structures and of the phrase structure of grammars, as well as his more recent development of the algebraic theory of context-free languages.  Now Dr. Whalley is asking us to consider the language-oriented development of computers.  He points out the direction in which both hardware and software developments might be directed to satisfy the needs of the humanist.  Rather than remaking our own image to suit the lines of the machines which have been developed with such haste to satisfy the needs of the scientist, engineer and business-man, let us begin to remake machines in our own image - a far more difficult task.

Dr. Whalley graduated in Classics from Bishop's University in 1935, and was awarded degrees in Theology by Oriel College, Oxford, in 1939 and 1946.  He returned to Bishop's University to serve as lecturer and later as Assistant Professor and there received an M.A. in English in 1948.  He was awarded a Ph.D. in English by King's College, London, in 1950.  Since then he has served on the staff of Queen's University at Kingston, where he is now Professor and Head of the Department of English.  He is a Fellow of the Royal Society of Canada and of the Royal Society of Literature.  He has published four books: "Poetic Process", "Coleridge and Sara Hutchinson", "The Legend of John Hornby" and "A Place of Liberty."

This paper is based on the presentation made by Dr. Whalley in San Francisco at SHARE XXVIII on February 16th, 1967, and is reproduced by kind permission of the SHARE Organization.

 

SCOPE

I have given as my title "Literary Computing" rather than "The Use of Computers in Literary Research" because I want to consider the possibility that computers can be used in literature for purposes other than those normally conveyed by the word "research".  The phrase "literary computing" may call up some really glamorous procedure: of making (perhaps) a machine translation from Urdu into Chinese with an interlinear gloss in Basic English or Hungarian; or the composition of poems that, for their profound obscurity and unashamed salacity, would be declared by some renowned critic as the work of an accomplished human poet.  My theme, to begin with, is editorial rather than "creative", scholarly rather than productive, in the field of applied computing rather than in some area of pure speculation upon the nature of possible computers.  But in the end I wish to suggest that the future of literary computing could have a good deal to do with the future of computing altogether.

The suggestion for a machine translation from Urdu to Chinese with an interlinear gloss in Basic English or Hungarian is theoretically conceivable, and almost possible, since there is already a substantial corpus of approximately successful work in machine translation.  Nevertheless, even though the output might well be in some sense useful we should not expect it to be stylish.  The original military requirement for machine translation which gave impetus to this work was presumably for a means of rapid mechanical translation of technical treatises and intelligence reports.  Stylistic elegance, or even ready intelligibility, may not have been a strict requirement.  The difficulties of machine translation - probably insuperable - have not interfered much with the flow of supporting research funds; nor have the men working in the area grown despondent.  Whether or not they were interested in the usefulness of a possible final solution, they were people who liked to ask difficult questions in the field of language and insisted upon answering them in a way that could be programmed in a computer.  "What is a sentence?  What is a verb, or a noun?  How does the structure of a sentence in English differ from the structure of a sentence in (for example) Russian?  What is meaning?  How far is meaning controlled by context?  and if it is much controlled, how can the verbal context be detected, defined, compared and modified, and be translated by machine?  What is the logical function of a verb?  What is the dynamic and semantic function of a verb?  of a participle?  of a noun?"  Without at least tentative answers to such questions, the feat of writing sentences by machine even in one's own language would not be possible; and it would be far less likely that a machine could convert statements from one language to another.
Most of the machine translation makes pretty murky reading; but all of it is extremely impressive because of the resolute inquiry into fundamentals that has made it possible at all.  The results are to be seen in the development of structural and descriptive linguistics, and in an increasingly vigorous interaction between pure linguistic theory and the linguistic theory that is guided by the structure of the machines by which the inquiry is being prosecuted.

All literary analysis and criticism turns in the end upon judgment, or what philosophers call "value judgments".  As far as I know, machines cannot make value judgments; but human beings can, and do.  Literary research and scholarship provide means of localising, clarifying, analysing, correlating items of detail within selected literary fields so that judgments may be formed; and these may be of detailed or comprehensive scope.  Scholarship and criticism provide means of bringing the fruits of these judgments together into sustained structures often of great intricacy and originality.  But literary criticism is not always identical with literary research; and its ends are not answers but sharply defined centres for reflection and judgment.  Even the most obvious product of literary research, an accurate literary text, is the outcome of a sequence of value judgments.  Although machines cannot make the judgments necessary to literary criticism, they can clearly be used to accumulate, correlate, arrange, select, and (under certain conditions) analyse detail, with great rapidity and thoroughness, and present it in a form conducive to intermediate or terminal judgment.  Most of these operations are variations on listing and merging, selecting and arranging.  (It is curious to consider that one of the strongest and earthiest descriptions of art is that it is simply "selection and arrangement.")         

Main Fields of Computerized Literary Research

Apart from the general field of "indexing", there are four main fields of literary research to which machines have been applied: (a) stylistic analysis; (b) the writing of concordances; (c) the collation of texts; (d) the manipulation of bibliographical detail.

(a) Stylistic Analysis.  This includes the associated operations of word-counting, the formation of word inventories (a branch of lexicography), the study of word-clusters, studies of spelling, the identification of printers and typesetters, the study of literary influence, the identification of pseudonymous and putative authors.  It is in this general field that some of the earliest literary work by computer was performed.  With this field is associated the whole area of structural linguistics.  The earliest large-scale use of a computer for literary purposes that I know of was an analysis of the style of the Pauline Epistles (in Greek) to determine which were Pauline and which were not.  A project has recently been started by Alastair McKinnon at McGill University to analyse the variations in style among the various pseudonymous personae through whom Soren Kierkegaard wrote his copious theological and philosophical works, the originals being in Danish.  Three or four years ago a stylistic analysis was completed by machine by Alvar Ellegard identifying the author of the Letters of Junius - a literary secret that has hitherto baffled all the gullible and the self-deceived.  Joe Raben of the University of New York has already reported very convincing and original work in tracing Milton echoes in Shelley's poems, and is continuing his inquiry with increasing minuteness, sophistication, and success.  Some of Vincent Dearing's work on the text of Dryden is stylistic, though much of it deals with some of the more formidable problems of collating literary texts.  The most searching question that lies behind all these procedures, including machine translation, is "What is style?".  Without an answer to this question, none of this kind of literary work can proceed at all, least of all by machine, because it inquires into the basis of all literary art and all use of language.
There are different interpretations of "style": each stylistic project tends to be, as much as anything, a test of its own particular theory of style.  When machine work on style can be extended to an inquiry into the recurrence and valency of metaphor, and into the dynamic aspects of rhythm, literary criticism can be expected greatly to extend its precision and validity.
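
The elementary operations named above, word-counting and the formation of a word inventory, can be illustrated by a minimal sketch in Python (a present-day language, not one used in any of the projects described).  The function names, the crude definition of a "word" as a run of letters and apostrophes, and the sample sentence are all assumptions made for illustration; they do not answer the question "What is style?", they only show the kind of tally on which a stylistic comparison might rest.

```python
import re
from collections import Counter


def word_inventory(text):
    """Return a Counter mapping each lower-cased word to its frequency.

    A "word" is assumed here to be a run of letters or apostrophes;
    a real project would need a far more careful definition.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def rates_per_thousand(text, markers):
    """Return the rate per 1000 words of each chosen marker word."""
    inventory = word_inventory(text)
    total = sum(inventory.values())
    if total == 0:
        return {marker: 0.0 for marker in markers}
    return {marker: 1000 * inventory[marker] / total for marker in markers}


# Hypothetical sample text, invented for the illustration.
sample = "The cat sat on the mat, and the dog sat by the door."
inventory = word_inventory(sample)
rates = rates_per_thousand(sample, ["the", "of"])
```

Comparing such rates for a chosen set of marker words across several texts is one simple basis for an attribution study; which words to choose, and what the differences signify, remain matters of judgment.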

(b) Concordances.  A concordance is a methodical list, with locations, of all the words and/or phrases that occur in a given text.  The compilation of concordances by hand is a laborious process; since the early 18th century (at latest) a great many concordances of the Bible, and of literary and classical texts have been prepared by hand; but few of them can escape the question of completeness.  The computer brings superior speed and the prospect of completeness in concordance-making.  A group at Cornell University has already written two concordances by machine: Matthew Arnold's Poetical Works and W. B. Yeats's Poems, and others are projected.  Professor Bessinger and Dr. Philip Smith are preparing a concordance of the corpus of Anglo-Saxon literature; and other literary concordances are being prepared by machine.  I understand that a concordance of the dialogues of Plato - about the same volume as the Bible - is being prepared at MIT; the text being in Greek, words can change at both ends and this raises tricky problems of canonizing that are apparently not insoluble.  Practically all concordances, except of quite small texts, are in some respects selective in order to save bulk or to save trouble.  But decisions about what to omit from a concordance may systematically preclude certain important recognitions and connexions from being made; and many existing concordances ideally now need rewriting for this reason.  The computer is an ideal instrument for concording all the elements of a literary text, down to punctuation, italics, and capitals.  It is hoped that arguments for selective concordances will no longer prevail.
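
The definition of a concordance given above (every word, with its locations) can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not a description of any of the projects mentioned: the function name is hypothetical, locations are taken to be line numbers, and the sketch folds capitals and discards punctuation, which is exactly the kind of selective simplification a complete concordance would avoid.

```python
import re
from collections import defaultdict


def build_concordance(lines):
    """Map every word to a list of (line_number, line_text) locations.

    Every occurrence is recorded; no "common" words are omitted.
    Capitals are folded and punctuation dropped (a simplification).
    """
    concordance = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z']+", line.lower()):
            concordance[word].append((number, line))
    return dict(concordance)


# Opening lines of Coleridge's "Kubla Khan", used as a small test text.
text = [
    "In Xanadu did Kubla Khan",
    "A stately pleasure-dome decree:",
    "Where Alph, the sacred river, ran",
    "Through caverns measureless to man",
    "Down to a sunless sea.",
]
concordance = build_concordance(text)

# Print the entries alphabetically, each with its locations in context.
for word in sorted(concordance):
    for number, line in concordance[word]:
        print(f"{word:12} {number:3}  {line}")
```

Because the whole index is held as data rather than as a printed book, a single short-run question (where does "to" occur?) can be answered without producing the full concordance.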

(c) Collation.  The word has two standard meanings in literary scholarship.  One is the description of a book or manuscript by its basic bibliographical elements - signatures, leaves, sheets, etc.; this meaning is not being considered here.  The other is the systematic and detailed comparison of the various drafts, versions, and printings of a literary text with the presentation of all the variants from a chosen standard text.  This is an operation suitable for the computer.  Although certain difficulties present themselves, there is some doctrine and experience already established, particularly with texts written in verse; but interesting work is also in progress on Dryden and Henry James.  Collation is so important a part of the scholarly activity of laying down a complete and accurate text with a systematic record of all variants that only the difficulty of the procedure explains the comparatively slow development in this area.  It would be an essential part of any proposal for a comprehensive system for literary editing by computer.
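
Collation in the second sense, the listing of variants against a chosen standard text, can be sketched with the sequence-matching routine in Python's standard `difflib` module.  The function name `collate`, the word-by-word granularity, and the sample readings are assumptions for illustration; the general-purpose matcher used here is a stand-in and not the method of any project named above, and real collation must also cope with transpositions, lineation, and more than two witnesses.

```python
import difflib


def collate(base, witness):
    """List the variants of `witness` against the standard text `base`.

    Each variant is a tuple (position_in_base, base_reading,
    witness_reading); an empty reading marks an addition or omission.
    """
    base_words = base.split()
    witness_words = witness.split()
    matcher = difflib.SequenceMatcher(
        a=base_words, b=witness_words, autojunk=False
    )
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append(
                (i1, " ".join(base_words[i1:i2]), " ".join(witness_words[j1:j2]))
            )
    return variants


# Hypothetical readings, invented for the illustration.
standard = "the quick brown fox jumps over the lazy dog"
printing = "the quick red fox leaps over the lazy dog"
variants = collate(standard, printing)
```

The output is a record of variants only; deciding which reading belongs in the text remains an editorial judgment.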

(d) Bibliography.  There are two general kinds of bibliography, in both of which problems arise from the need for flexibility and from the copiousness of the data.  One kind of bibliography is the drawing up of lists of titles of books and articles according to a scheme of authors, or of subject matter, with or without summary or evaluative comment.  (A better title for this is Handlists, or Booklists, or Reading lists.)  This operation can be closely related to library applications of computers (in both ordering and cataloguing) and in general with various excursions into "information retrieval".  The second kind, often called "descriptive bibliography", is the detailed description of printed books in whatever physical and typographical detail can give evidence of the history and integrity of the text, the history of the printing of the book, its ideal state, and the variant forms it passes through in successive printings and editions.  Descriptive bibliography involves intricate, highly detailed data, and is afflicted with many subtleties of formula and presentation some of which, though canonised by long use, are illogical; and the output is always extremely difficult to print accurately.  Author-bibliographies and genre-bibliographies often combine both kinds of bibliographical procedure.  As a literary-critical discipline, bibliography is not of great antiquity, yet has already generated much rigid dogma.  Since typographical lay-out and design are of paramount importance for descriptive bibliography, the increasing sophistication of graphic and visual devices for computers will probably be of increasing importance for bibliographical applications.
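
The first kind of bibliography, the handlist arranged by a scheme of authors, reduces to sorting and grouping entries whose comment field may or may not be present.  The following Python sketch is illustrative only: the `Entry` fields, the `handlist` function, and the bracketed comment are assumptions, and the layout bears no relation to the form required by any published bibliography.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Entry:
    author: str
    title: str
    year: int
    note: str = ""  # optional summary or evaluative comment


def handlist(entries):
    """Return printable lines: authors in order, titles by date under each."""
    ordered = sorted(entries, key=lambda e: (e.author, e.year, e.title))
    lines = []
    for author, group in groupby(ordered, key=lambda e: e.author):
        lines.append(author)
        for entry in group:
            line = f"  {entry.year}  {entry.title}"
            if entry.note:
                line += f" [{entry.note}]"
            lines.append(line)
    return lines


# A few well-known titles; the bracketed comment is hypothetical.
entries = [
    Entry("Coleridge, S. T.", "Biographia Literaria", 1817),
    Entry("Arnold, Matthew", "Culture and Anarchy", 1869),
    Entry("Coleridge, S. T.", "Aids to Reflection", 1825, "prose"),
]
listing = handlist(entries)
```

The same entries could be regrouped by subject or by date simply by changing the sort key, which is the kind of flexibility a single stored body of material allows.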

My own work so far has consisted of preparing for the IBM 1620 a program for writing author handlists of some length for the New Cambridge Bibliography of English Literature.  The form in which the material has to be presented for NCBEL is perverse: the materials upon which I have recently been conducting most of my literary research - the writings of S. T. Coleridge - are usually complex, detailed, and intractable.  I therefore decided to evolve a program that was capable of a number of different applications for the same material, and to test the capacity and flexibility of the system with materials that would resist superficial treatment.  The intention, which is now to be transferred to the IBM System/360, is to extend the same procedures into the area of descriptive bibliography by a more minute exploration of further Coleridge material.  Aware of the work already successfully carried through in some of the areas of literary research, I wanted to see developed a comprehensive and versatile system for literary-critical and editorial purposes and to associate such a system with the systematic accumulation of literary texts in machine-readable form.

Possible Extensions and Generalizations

Most of the standard ways of manipulating the materials of literary research have been successfully explored by computer; but the progress across the whole field has been uneven, and the methods are difficult to reconcile because of the variety of machines and computer languages used.  The most serious objection is that each project or program tends to be enclosed within its own conceptual aim and its own form of presentation; and most literary projects so far have had as their end the production of a hand-tool of research - a printed index, handlist, concordance, bibliography, or literary text - without any attempt to add a segment to the correlated functions of a machine system.  What we know can be done by computer in literary editing - and is now being done - is not difficult to define and not very difficult to execute; but it has not yet done much to extend the boundaries of computing or of literary study.  Concordances, wordlists, handlists of publications, minute comparative records of textual variants, indexes in various degrees of minuteness and convenience - all these are staple tools of literary scholarship and have long been prepared by hand.  By producing these tools by computer, we do not have a fundamentally different product, even though the product may be more crushingly exact and more monumentally complete than any such tool made by hand; and we do not extend the function of computing in the literary field - or even in the field of literary research.  The completeness and accuracy of a computer process may in some cases give us a product so superior to any predecessor that it is qualitatively different.  But I should share with many of my literary colleagues a sinking of the heart if I thought computers could do no more for editorial and critical procedures than provide us with larger concordances, indexes, and bibliographical accumulations.
Roberto Busa, with his team of Italian enthusiasts, had - after some years of work, beginning with Hollerith cards unassisted by computers - assembled some 15 million lines of text in eight languages and three alphabets in order to form wordlists.  At the end of the first phase of his project his output, if printed up, would comprise 500 volumes of 500 pages each.  One of the worst embarrassments in literary computing is the quantity of output.  Greatly to extend the size and minuteness of our literary hand-tools could paralyze scholarly activity if it depended upon the use of those tools.  Selective procedures will be increasingly important.

The aim I had in mind is not simply to put together a family of literary-editorial programs that will run on a single machine, but to evolve an elegantly coordinated system that would allow quick access to a number of large data banks so that any of the literary-critical operations I have outlined could be used to produce single short-run answers to single questions, and intermediate and interim correlations, lists, or substantive index, concordance, collation of bibliographical entries as starting points for further specific critical inquiry.  In such an arrangement, the computer would become, not simply a manufacturer of hand-tools for literary research, but an instrument of direct literary inquiry capable of guiding and supporting a sequence of literary judgments towards some wider or deeper recognition.  The computer would become heuristic rather than merely productive; and its activities might well support activity in literary composition as well as in literary research.  Ideally, I thought, the system would be applied from remote terminals to a time-sharing large-capacity high-speed machine with extensive ancillary support.  The programming would be done in a single easily accessible macro-language (such as PL/I promises to be) so that, as data banks of literary texts and other data accumulated, an increasingly large group of scholars and critics - and perhaps also writers and poets - would find their work extended, enriched, and guided by a system which was capable of supporting heuristic activities as well as providing "information".

In this way, the machine, no longer being used as a producer of hand-tools simply, would become an element in that working relation which in the fashionable current jargon is inappropriately called "dialogue".  But beyond that, such a coordinated system would not only involve an extension of known methods, languages, and machines but might also lead to a radical inquiry into current methods, languages, and machines, and a revision of them.

Peculiar Difficulties

What are the peculiar difficulties encountered in literary computing, and how can they be so formidable as to affect the development of languages and machines?  At first examination they do not seem unmanageably beyond the expected resources of (say) PL/I and computers in a class with advanced models of System/360.

  1. The quantity of material to be processed tends to be very large, and is literal rather than numerical.  Statistical, numerical, and coded data, however, also need to be included.
  2. The data are in variable length records, often arranged in hierarchical series.
  3. Judgmental intervention is needed in varying degrees.  In view of the characteristically bulky output in literary work, the print-out must be very legible if intervention is to be prompt, direct, and free of error.
  4. Ideally - and this has been a serious concern in my own work - workers should be able to contribute to a large literary project at various levels of skill and judgment.  (A project can become genuinely educational by accepting integral contributions from individual workers.)  Individual contributions, correspondingly, have to be isolatable for assessment and, if necessary, revision.  These difficulties would not seem insuperable if only, for machines and procedures, Paradise and the Fortunate Islands were not always just below the horizon.  Certainly our project has run into other difficulties at an elementary level.
  5. The intractability of programmers.  Literary computing is so unlike anything that most programmers are trained for or accustomed to, that we have had to train programmers and operators of our own; and when seriously stuck have had to seduce, suborn, or beguile non-literary programmers into helping us to unravel our procedural knots.
  6. The inappropriateness of programming languages.  Digital computers have so far been designed primarily for mathematical and numerical manipulations; computing languages tend to be mathematical in structure and therefore not very suitable for literary computing; the habits and assumptions of programmers, being mathematical, may be unfriendly to literary work because they think of language as exclusively structured by logic, and seem to think of language exclusively in the oversimplified (though unhappily fashionable) figure of "communication" - the telegraphic transfer of "information".
  7. The mathematically structured computer, by forcing an inappropriate analogy of both "language" and "thinking" tends to become logically circular.  Instead of finding and extending new functions for exploring language and literary constructions, it comes back always to its own image, is narcissistic, and may induce paralysis.

Philosophical Considerations

If computers are so designed and managed that they reject or ignore the nature of language as recognized and demonstrated by literary use, the machines will be limited either to certain technical functions and operations of a manipulative but non-literary sort; or they will, by uncritical insistence, produce a grotesque parody of our central activities - thinking, feeling, and the distillation of these in the use of language.  I suggest that if computers are to be conceived on more flexible, organic, and self-constructing principles than has so far occurred, they will be conceived in terms of language: that is, not according to a system which is alternative to or exclusive of the mathematical and logical, but according to a comprehensive analogy that - like the mature use of language - includes logic but is not limited by it.  This proposal raises some philosophical questions that we are not likely to settle at once.

The greatest difficulty of all for literary computing, at any level except the obviously factual and statistical, is to define the functions that obtain in literature, and to frame the questions about language and literature which, being unanswerable, are worth asking.  Literature - that is, things made in language under the conditions of imagination, or simply language used excellently - can be structured on principles which are not logical (though they include logic) and yet are not less strict and consistent than logic.  The "meaning" of a word cannot be exactly discriminated except in its verbal context and in the context of the mind that is using the word: it is people, not words, that mean.  Contrary to vulgar expectation, imagination is a realising activity and the literary or imaginative use of language is concerned almost entirely with precision, a precision which must be adequate to unique and single instances rather than to generalities or general situations.  Metaphor, the fundamental structural principle of imaginative language, cannot be distinguished by formal criteria: it must be distinguished functionally and qualitatively and therefore eludes generalized description.  The same applies to symbolic elements and functions; and a poetic symbol is absolutely different from a mathematical symbol.

The questions that we know can be answered now by computer in the field of language submit to well-known procedures.  It is the questions that the machine cannot yet begin to answer that are the questions scholars, critics, and poets need to insist upon asking.  Any technically definable method may become self-enclosed.  At present in the field of language we are inclined to limit ourselves to computer-structured answers to computer-structured questions.  What literary computing needs is a computer structured by literary questions, by the functions and dynamic of language itself.

The Future of Computers for Literary Computing, and Beyond

In literary computing there is a strong need for legibility, for a large range of characters in both storing and printing complex literary data, and for appropriate linkage with the methods of printing whether hot-metal or photographic.  Literature must be readable, and it cannot be readable if printed out in a 40-character alphabet of capitals, particularly when the capitals (intrinsically less legible than lower case letters) defy most of the traditional principles of legibility known to affect the design of typefaces for printing.  There are admittedly narrow restrictions on the shape of metal letters that have to run at high speed in a printer and make reliable marks on the paper, but those limitations would have to be surmounted if there were serious need for a high degree of legibility at intermediate stages of "proof" and if there were need to outflank cumulative textual error by direct book production from machine output.  In literary computing these needs become very clamorous.  There is need for continuous personal inspection - reading, that is - of text; this cannot be done rapidly and accurately unless the print-out is much more readable than at present.  The Monotype matrix frame of 256 characters is a good target, even beyond the 240-character universal print chain promised for the System/360; but the shapes of the letters need improving, and fortunately there are several centuries of technique and observation to guide designers in that respect.

Again, a serious objection to the computer in literary research is that a scholar may not gain through the computer the intimate - almost tactile - acquaintance he needs to establish with his materials by handling, sorting, and notetaking.

Some sort of subliminal learning may arise even from the actual rate at which these manual and routine tasks are carried out.  Whether a scholar can dispense with this aspect of his learning, I do not yet know: probably he can with effort.  But I suspect that since there will always have to be systematic and frequent human intervention in any literary program that is at all subtle or well-sustained a scholar may learn in those interventions how to establish the tactile relations with his material he has always had in the past.  That could happen, however, only with extremely legible print-out.

Much of the special language that computing people use is affectionately metaphorical and unashamedly anthropomorphic.  The manuals all hasten to assure us now, as a matter of public policy apparently, that machines do not "think"; but the emphasis, as in cases of party discipline, is suspiciously insistent.  It is commonly supposed that poets and artists should be allowed a margin for sloppy imprecision called "poet's license".  Actually poets are, and have to be, extremely precise in their use of words.  In the present climate of opinion, it is only those people who are supposed to be "objective", impersonal, absolutely logical - scientists and mathematicians, let us say - who could get away with many of the amiable approximations of computer terms.  Much of the special vocabulary of computing is precise, happily conceived, and indispensable.  Nevertheless I am concerned that computer people, in their jealously guarded exclusiveness, should not slip into an uncritical habit of taking some of their affectionately facetious approximations for structural principles.  "Memory" in a machine may indeed be retentive, and accessible under certain conditions; but the function of computer memory bears very little relation to the functions of human memory as far as we understand them.  Again, the trick of homing a manned satellite on to another satellite already in orbit depends largely upon the computer's ability to do the navigational sums quickly enough to make them useful, and so to absorb continuously the relevant variables that the updated calculations provide a dynamic movement towards exact solution.  This really is not thinking, even though it may be much more successful for certain purposes than any conceivable human thinking would be.  Literary people will have to be relied upon to tell computer people - and psychologists - what their "thinking" is like, and how it works, and how it affects language, and is affected by language.  
Machine language is very little like language in the literary sense of the word.  Consequently, the "dialogue" that will go on in criticism or composition will not be a dialogue with the machine, but a dialogue with oneself through the machine.  Literary criticism is not a technique, even though some methods persist and the critical procedures are capable of precision.  Criticism is judgment; it is also a sort of overheard thinking-to-oneself, "controlled woolgathering" (as Cecil Day Lewis has called it).  If, considering the possible structure of a non-numeric computer of the future, we ask "After Boole, what?", the answer would seem to be: "A machine for wool-gathering with."

My requirement for literary computing is: an elegantly coordinated computer system that would allow rapid and continuous conduct of a variety of procedures upon massive data banks of different kinds of "literary" material.  It should allow for the variable length records and the copious working data incidental to literary work, and must therefore show unusual qualities of speed and versatility.  All the various functions and operations of the system should be so related that they can be used in any order at short notice, allowing for manual - or even spoken - interrogation from remote terminals to a large capacity time-sharing machine.  A versatile macro-language should make the system usable with a minimum of preliminary training, while at the same time not placing restrictions on the development of new or modified programs through that language.  To facilitate the accumulation, correction, and progressive updating of the data, very large data banks would perhaps be collected in a few central or strategic positions.  (These data banks would not supersede libraries as we know them, but might well be maintained by libraries.)  The computer itself would be devised on the analogy of language and would embrace certain non-logical associative functions.  Such a machine would no doubt preserve the characteristics of the computer's logical ancestry, but would also be capable of dealing with the non-mathematical symbolism in which, under the conditions of imagination, it is possible to be at once rational and illogical, or irrational and logical; in which contradictions are not mutual exclusions, and in which implications may be (as it were) vertically structured over symbolic centres (? words) in a hierarchical sequence of exact and simultaneous definition and reference.

Jan van Krimpen was the first modern type-designer to make a fount of type that included in one series the three Roman alphabets - Roman, italic, and sloped Roman - and Greek, so that the whole formed one visually unified series.  The development of a comprehensive literary computing system would be a little like that, with the same rational elegance; but much more difficult.  So far, computing has thrived on ad hoccery, on providing packages for clamorous clients for limited purposes.  Machines, languages, terminology, assumptions, analogies, procedures are all contaminated by limited foresight and by haste: so productive and spectacular has the haste been that one hesitates to call it indecent.  Men make gods of their own image.  Unfortunately we have not yet started making machines in our own image.  Perhaps we know less about our own inner workings than we know about our desires.  Our machines are at present limited to the analogy of our techniques; in spite of the remarkable and rapid developments that have occurred, and the success with which the first three generations of machines are being applied, we are in danger, in our uncritical resolve to make do with what we have, of making our own image to the pattern of our machines.  The application of computer design to the realities of highly sophisticated literary use of language might well bring us to a new generation of computers.  Meanwhile the literary users of computers must strive to locate and define the precise areas in which the mathematically-based machine is appropriate.