+ Page 1 +
-----------------------------------------------------------------
The Public-Access Computer Systems Review
Volume 5, Number 3 (1994) ISSN 1048-6542
-----------------------------------------------------------------
To retrieve an article file as an e-mail message, send the GET
command given after the article information to
listserv@uhupvm1.uh.edu. (Files are also available from the
University of Houston Libraries' Gopher server: info.lib.uh.edu,
port 70.)
CONTENTS
COMMUNICATIONS
Using the World-Wide Web to Deliver Complex Electronic Documents:
Implications for Libraries
By John Price-Wilkin (pp. 5-21)
To retrieve this file: GET PRICEWIL PRV5N3 F=MAIL
The World-Wide Web (also called the Web) is a very promising tool
for libraries to use to explore the delivery of rich and complex
documents. Nevertheless, there are many limitations in the Web's
HTML markup language and the ability of Web servers to deliver
structured information. This paper explores the benefits and
limitations of the Web in the context of several projects taking
place at the University of Virginia, both in the Library and in
the University's Institute for Advanced Technology in the
Humanities. A gateway between the Web and the SGML-based PAT
system that helps to overcome the Web's inherent limitations is
also described.
+ Page 2 +
-----------------------------------------------------------------
The Public-Access Computer Systems Review
-----------------------------------------------------------------
Editor-in-Chief
Charles W. Bailey, Jr.
University Libraries
University of Houston
Houston, TX 77204-2091
(713) 743-9804
Internet: lib3@uhupvm1.uh.edu
Associate Editors
Columns: Leslie Pearse, OCLC
Communications: Dana Rooks, University of Houston
Editorial Board
Ralph Alberico, University of Texas, Austin
George H. Brett II, Clearinghouse for Networked Information
Discovery and Retrieval
Priscilla Caplan, University of Chicago
Steve Cisler, Apple Computer, Inc.
Walt Crawford, Research Libraries Group
Lorcan Dempsey, University of Bath
Pat Ensor, University of Houston
Nancy Evans, Pennsylvania State University, Ogontz
Charles Hildreth, READ, Ltd.
Ronald Larsen, University of Maryland
Clifford Lynch, Division of Library Automation,
University of California
David R. McDonald, Tufts University
R. Bruce Miller, University of California, San Diego
Paul Evan Peters, Coalition for Networked Information
Mike Ridley, University of Waterloo
Peggy Seiden, Skidmore College
Peter Stone, University of Sussex
John E. Ulmschneider, North Carolina State University
+ Page 3 +
Technical Support
Tahereh Jafari, University of Houston
Publication Information
Published on an irregular basis by the University Libraries,
University of Houston. Technical support is provided by the
Information Technology Division, University of Houston.
Circulation: 8,202 subscribers in 65 countries (PACS-L) and 2,562
subscribers in 52 countries (PACS-P).
Back issues are available from listserv@uhupvm1.uh.edu. To
retrieve a cumulative index to the journal, send the following e-
mail message to the list server: GET INDEX PR F=MAIL.
Back issues are also available from the University of Houston
Libraries' Gopher server. Point your Gopher client at
info.lib.uh.edu, port 70, and follow this menu path:
Looking for Articles
Electronic Journals
University of Houston Libraries E-Journals
The Public-Access Computer Systems Review
The journal's URL is
gopher://info.lib.uh.edu:70/11/articles/e-journals/uhlibrary/pacsreview.
The first three volumes of The Public-Access Computer Systems
Review are also available in book form from the American Library
Association's Library and Information Technology Association
(LITA). The price of each volume is $17 for LITA members and $20
for non-LITA members. All three volumes can be ordered as a set
for $45 (indicate that you want the PACS Review set, order number
7712-X). To order, contact: ALA Publishing Services, Order
Department, 50 East Huron Street, Chicago, IL 60611-2729, (800)
545-2433.
+ Page 4 +
-----------------------------------------------------------------
The Public-Access Computer Systems Review is an electronic
journal that is distributed on the Internet and on other computer
networks. There is no subscription fee.
To subscribe, send an e-mail message to
listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name
Last Name.
The Public-Access Computer Systems Review is Copyright (C)
1994 by the University Libraries, University of Houston. All
Rights Reserved.
Copying is permitted for noncommercial use by academic
computer centers, computer conferences, individual scholars, and
libraries. Libraries are authorized to add the journal to their
collection, in electronic or printed form, at no charge. This
message must appear on all copied material. All commercial use
requires permission.
-----------------------------------------------------------------
+ Page 5 +
-----------------------------------------------------------------
Price-Wilkin, John. "Using the World-Wide Web to Deliver Complex
Electronic Documents: Implications for Libraries." The Public-
Access Computer Systems Review 5, no. 3 (1994): 5-21. To
retrieve this file, send the following e-mail message to
listserv@uhupvm1.uh.edu: GET PRICEWIL PRV5N3 F=MAIL. (The file
is also available from the University of Houston Libraries'
Gopher server: info.lib.uh.edu, port 70.)
-----------------------------------------------------------------
1.0 Introduction
The World-Wide Web (also called the Web) is a very promising tool
for libraries to use to explore the delivery of rich and complex
documents. [1] Nevertheless, there are many limitations in the
Web's HTML markup language and the ability of Web servers to
deliver structured information. This paper explores the benefits
and limitations of the Web in the context of several projects
taking place at the University of Virginia, both in the Library
and in the University's Institute for Advanced Technology in the
Humanities. A gateway between the Web and the SGML-based PAT
system that helps to overcome the Web's inherent limitations is
also described.
2.0 SGML and TEI
The most worthwhile products that libraries can buy are ones that
conform to standards and are not tied to a specific software
package or operating system. These are the only products with
enduring value. Certainly, there are exciting electronic
resources being produced for specific software packages and
operating systems, but the extent to which libraries can build
collections of hypertext resources that are usable in the future
will depend entirely on the conformance of their resources to
true national and international standards. The most important
standard for this discussion is SGML, a standard designed to
express the organization of documents and to accommodate even the
most complex multimedia materials.
+ Page 6 +
A brief (and admittedly superficial) discussion of SGML and
the Text Encoding Initiative may be helpful. SGML (Standard
Generalized Markup Language, ISO 8879) is a standard approved by
the ISO for the descriptive markup of documents. The language of
SGML is sufficiently flexible that the sense of "document" has
been expanded to include coordinated time-based elements of
hypermedia (e.g., animated dance, music, and character-based
score and choreography moving in synchrony at a pace controllable
by the user). SGML is not a tag set: there are no pre-set tags.
Instead, SGML is a set of rules (or a grammar) for articulating
a vocabulary of tags. These rules are sufficiently rigorous that
specialized software can check the validity or conformance of a
document. The specification of that grammar is a DTD (Document
Type Definition); the DTD can also function to document many
decisions about the organization of a text. Without that
validity--i.e., without being parsed against a DTD--the document
is not SGML encoded, although it may share many of the
characteristics of SGML.
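The role a DTD plays in validation can be illustrated with a
minimal sketch in Python. It is not an SGML parser: well-formed
XML syntax stands in for SGML, and the names TOY_DTD and
validation_errors, the tag names, and the content model are all
assumptions made for illustration. A real DTD also constrains
order, repetition, and attributes, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, drastically simplified "DTD": each element name maps
# to the set of child elements it may contain.
TOY_DTD = {
    "poem": {"title", "stanza"},
    "title": set(),
    "stanza": {"line"},
    "line": set(),
}

def validation_errors(element, dtd=TOY_DTD):
    """Return a message for every containment rule the tree breaks."""
    if element.tag not in dtd:
        # An undeclared element cannot be checked any further.
        return [f"undeclared element <{element.tag}>"]
    errors = []
    for child in element:
        if child.tag not in dtd[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        errors.extend(validation_errors(child, dtd))
    return errors

good = ET.fromstring(
    "<poem><title>T</title><stanza><line>a</line></stanza></poem>"
)
bad = ET.fromstring("<poem><line>a</line></poem>")
print(validation_errors(good))
print(validation_errors(bad))
```

The second document shares the tag vocabulary of the first, but
it fails the check because a line appears outside a stanza; in
the terms used above, it resembles SGML without being valid.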
For our work at Virginia, the most notable of these
characteristics has been the descriptive nature of the tagging.
Rather than saying that an element of the text appears in bold,
17 point Helvetica, centered at the top of a new page, we use the
tags to define the function of a textual element (e.g., a title).
The tag set used must necessarily elaborate the elements of the
texts we see in an academic environment: a tag set designed for
articles or documentation, for example, will omit important
elements needed for encoding poetry. To serve those needs, the
Text Encoding Initiative (or TEI) has published a set of
guidelines for the application of SGML to texts in the
humanities. Functions of the text or hypertext, expressed
descriptively and with a standard language, are freed from the
constraints of a specific software package or application. SGML-
encoded works can serve a variety of functions, depending on the
user's needs and available software.
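The practical effect of descriptive tagging can be sketched in a
few lines of Python. The sample text, the tag names, and the
functions render_for_screen and list_titles are hypothetical, and
XML syntax is used as a stand-in for SGML; the point is only that
one descriptively tagged source can serve more than one purpose.

```python
import xml.etree.ElementTree as ET

# Illustrative text and tag names; not taken from any project file.
SOURCE = (
    "<poem><title>A Sample Title</title>"
    "<line>first verse line</line>"
    "<line>second verse line</line></poem>"
)

def render_for_screen(poem):
    """One possible display: upper-case title, indented verse lines."""
    rendered = []
    for element in poem:
        if element.tag == "title":
            rendered.append(element.text.upper())
        elif element.tag == "line":
            rendered.append("    " + element.text)
    return "\n".join(rendered)

def list_titles(poem):
    """A different use of the same markup: collect titles for an index."""
    return [element.text for element in poem.iter("title")]

poem = ET.fromstring(SOURCE)
print(render_for_screen(poem))
print(list_titles(poem))
```

Because the source says only that an element is a title, the
decision to show it in capitals belongs to the display function,
and the indexing function can ignore appearance altogether.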
+ Page 7 +
3.0 The Potential of the Web
The Web uses a client/server architecture. Sophisticated Web
clients, such as Mosaic, offer an exciting sense of the
possibilities of electronic publishing on the network. Several
long-awaited, revolutionary concepts are beginning to emerge in
every aspect of the relationship between client, server, and
publication. These concepts are:
o Open systems--the ability to make resources available
to a variety of operating systems and a variety of
applications is evident throughout the Web. Computers
running X Windows, Microsoft Windows, and the Macintosh
System 7 all participate equally. In addition to
Mosaic, other clients, such as Cello and OmniWeb, are
available. Multimedia tools, such as image viewers,
are a matter of personal choice.
o Standards--given the Web's use of HTML, the importance
of standards is heightened, and HTML is inexorably
moving toward greater expressiveness and greater
conformance to the SGML standard.
o Distributed information--the notion of a universe of
distributed information, scattered throughout the
Internet while being conceptually linked to other
information, is becoming a reality through the use
of the Web.
4.0 Representative Web Projects
Over the past two years at the University of Virginia, faculty
and staff involved in several projects began to develop a variety
of electronic materials using the SGML standard. This was partly
to serve needs that were already apparent, but it was also to take
advantage of the potential of electronic publishing. While the
Library's Electronic Text Center and, later, its Digital Image
Center began to develop skills in creating electronic materials
in standard formats for networked access, scholars at the
Institute for Advanced Technology in the Humanities undertook the
daunting task of composing advanced, standards-based electronic
research materials without having the tools with which to publish
these materials. With the introduction of Mosaic, the Web was
quickly seen as a way to deliver these materials, and, with
relative ease, large bodies of SGML-encoded material were
converted to HTML for Web access. In order to focus on
particular aspects of those projects, the following example
projects are divided into sections on editions, history, image
archives, and instruction.
+ Page 8 +
4.1 Editions
In general, the Web offers creators of editions of literary or
other works the ability to represent a vast, interconnected web
of scholarly resources in a variety of different ways. The user
might view the resources simply, as in an edition of a work
without the introduction of a critical apparatus. A more complex
approach is also possible, with the user following the critical
apparatus at every turn. And finally, a rich and scholarly
approach is possible, allowing the user to view manuscript (or
printing) evidence or to examine the editor's assessment of the
evidence by comparing high-quality scans of original pages to the
marked-up transcriptions. With proper markup, an edition can be
viewed in as many ways as the reader desires. It can be a
variorum, a study edition, a critical edition, or historical
evidence. The form the edition takes is defined by the user's
needs or preferences.
4.1.1 British Poetry
The British Poetry Archive documents are perhaps the simplest of
those discussed here. (The project's URL is
http://www.lib.virginia.edu/etext/britpo/britpo.html.)
The two texts now available were transcribed by students in
Jerome McGann's graduate courses. In addition to the SGML-
encoded text itself, each work includes material such as
introductions, notes, and glosses as well as high-quality digital
facsimiles of pages from the original editions. The materials
are freely available on the Internet, and Mr. McGann hopes that
others will contribute to the archive. These texts represent the
simplest of the hypertext editions available on the University of
Virginia's Web, with supporting materials providing potential
deviations from an otherwise linear progression. The texts were
encoded in TEI-conformant SGML with the assistance of the
Library's Electronic Text Center, and they were then converted to
HTML for the purpose of making them available on the Web.
+ Page 9 +
4.1.2 Dante Gabriel Rossetti
To date, the most fully developed project is Jerome McGann's
ongoing edition--or archive--of the works of Dante Gabriel
Rossetti. (The project's URL is
http://jefferson.village.virginia.edu/rossetti/rossetti.html.)
According to McGann, the Rossetti archive is:
a hypermedia environment for studying the works of the
Pre-Raphaelite poet and painter D. G. Rossetti (1828-1882).
The archive is a structured database holding digitized
images of Rossetti's works in their original documentary
forms. Rossetti's poetical manuscripts, early printed texts
--including proofs and first editions--as well as his
drawings and paintings are stored in the archive, in full
color as needed. The materials are marked up for electronic
search and analysis, and they are supplied with full
scholarly annotations and notes. [2]
The organization of the archive is designed to capitalize on the
uniquely intertwined nature of Rossetti's artistic process,
linking image to text and text to image. When Rossetti
accompanied a painting by sonnets, the poems are included in the
archive along with an image of the painting. When Rossetti
illustrated a poem with a painting, an image of the painting is
included. Since Rossetti frequently designed his own editions,
electronic versions of his print works, with linked text and
images, are also available. McGann describes the difficulty of
studying Rossetti's works in a traditional print environment, and
then sets about trying to overcome those difficulties by melding
the resources in a way that allows the reader to follow the
threads of art, poetry, or translations without losing access to
the other materials.
4.1.3 Piers Plowman
The third project was begun in the 1994-95 academic year by one
of the most recent Institute fellows, Hoyt Duggan. (The
project's URL is
http://jefferson.village.virginia.edu/piers/archive.goals.html.)
+ Page 10 +
Mr. Duggan, an accomplished editor of Middle English texts,
created an edition of the Piers Plowman B text using the Web.
More in the model of the traditional scholarly edition, Mr.
Duggan's project brings together transcription and facsimile to
resolve vexing editorial problems. When the scribe uses an
abbreviation to represent a letter combination (e.g., a barred
"p" for "pre"), the reader typically wants the editor's best
judgement in rendering what was intended (i.e., "pre"). Many of
those decisions deal with unambiguous evidence, and some with
less certain evidence. Through SGML, both the suspension or
abbreviation and the editor's reading of the character are
registered.
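A minimal Python sketch shows how one encoded line can yield both
the scribe's form and the editor's reading. The tag abbr, the
attribute expan, the sample words, and the function name reading
are assumptions chosen for illustration (a plain "p" stands in
for the barred "p"), and XML syntax is used in place of SGML; the
sketch does not reproduce the project's actual encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: the element content is what the scribe wrote,
# and the "expan" attribute holds the editor's expansion.
LINE = '<line>I <abbr expan="pre">p</abbr>ye yow</line>'

def reading(line_xml, expand):
    """Return the line with abbreviations expanded or left as written."""
    line = ET.fromstring(line_xml)
    parts = [line.text or ""]
    for element in line:
        if element.tag == "abbr" and expand:
            # Use the editor's reading; fall back to the scribal form.
            parts.append(element.get("expan", element.text or ""))
        else:
            parts.append(element.text or "")
        parts.append(element.tail or "")
    return "".join(parts)

print(reading(LINE, expand=False))
print(reading(LINE, expand=True))
```

Since both forms are kept in the source, a reader who doubts the
expansion can always ask for the unexpanded view.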
To the greatest extent possible, digital facsimiles of all
seventeen surviving manuscripts will be included. With facsimile
evidence, it is always possible to return to something resembling
the original document to evaluate the editor's decision. Duggan
has also found that it is possible to create extremely
high-resolution images that, with enlargement and other digital
treatments, can reveal important new information about the
original composition.
4.2 History
With new technological tools, historians are offered both
challenges and opportunities. Electronic resources allow them to
blend evidence and interpretation in ways that help both student
and researcher. A simple approach in using the materials is
possible, where the reader follows the argument without examining
evidence. It is also possible for the reader to examine the
methodology of the researcher, either to scrutinize the research
or to be instructed in the methodology of research. The process
of bringing evidence and interpretation together brings
challenges of immense proportions. For example, the role
geography plays in defining an event can be brought to bear on
the problem, but it may involve the use of sophisticated systems
of geographic analysis. Two projects at the Institute have used
many diverse resources to explore their topics, incorporating
nineteenth-century Census data, geographic models, and animated
sequences.
4.2.1 Ayers (Valley of the Shadow)
Edward Ayers, a historian of the Civil War and the
Reconstruction, was one of the Institute's first two fellows.
(The project's URL is
http://jefferson.village.virginia.edu/vshadow/vshadow.html.)
According to Ayers, the project:
+ Page 11 +
interweaves the histories of places on both sides of the
Mason-Dixon line. It is the story of two communities
relatively close to one another, sharing considerable prewar
characteristics and similar experiences in the war itself.
There was one area in the United States for which that was
most clearly the case: the Great Valley that stretched from
Pennsylvania, through Maryland and Virginia, into Tennessee.
[3]
Ayers focuses on two towns--Staunton, Virginia and Chambersburg,
Pennsylvania--as representative communities from that Valley that
served as such an important economic, cultural, and military
locus of the War. The Web serves the historical ends by
balancing narrative--a filtering or interpretation of evidence--
with the presentation of that evidence. Ayers has described one
dilemma of the historian as a tight-rope act between providing
access to evidence and creating an organizing argument that does
not also obscure that evidence. His approach, providing the
deepening layers of evidence as "rhizomes" beneath the surface of
narrative, has been well-supported by the Web.
4.2.2 Dobbins (The Forum at Pompeii)
Dobbins, a classical archaeologist, reconstructs Pompeii from
archaeological evidence in a virtual space to advance his
argument. (The project's URL is
http://jefferson.village.virginia.edu/pompeii/page-1.html.)
He uses computer-aided design (CAD) tools to bring precision
to his reconstruction. Animation is being added to the CAD
representations to provide a three-dimensional perspective of
buildings and space. Structures that are normally seen in
isolation from each other are assembled in a total vision of
Pompeii that may suggest a degree of planning and coordination.
4.3 Image Archives
The Digital Image Center's image collections can be seen as
passive collections of standards-based images. (The project's
URL is http://www.lib.virginia.edu/dic/class/arh102.)
The image collections are organized to reflect the focus of
an individual class or an art exhibit. All of the images are
TIFF files subjected to JPEG compression. As such, they can be
examined with a variety of image tools, ranging from simple
viewers to software with analytical capabilities. Most
importantly, the tool used is largely the choice of the user. As
a result of planning and philosophy, all images are durable
enough to stand close scrutiny: they were scanned in 24-bit color
at a sufficiently high resolution to be enlarged several times
without significant degradation.
+ Page 12 +
The most developed collection is representative of this
archival philosophy. William Westphal's graduate architectural
history course on urban form includes hundreds of architectural
images, primarily from the Italian Renaissance, organized around
his lectures. Students can access these resources at all times
over the network as well as in a closed classroom environment
designed to efficiently access the images. Since they were
scanned at high resolutions, the images compare favorably with
the original slides, and they can be examined closely on screen.
The original slides have frequently degraded or had imperfections
that were corrected in the scanning process.
4.4 Instruction
The final project demonstrates the instructional capabilities of
the Web. (The project's URL is
http://www.lib.virginia.edu/etext/scanner.html.)
Using the Web to provide access to training materials has
many strengths. It gives variation to what would otherwise be a
flat, linear document. The document is dynamic and can easily
accommodate other elements as they are created by staff.
Scanning text is one of the most repetitive training
operations provided in the Electronic Text Center. Unlike
searching electronic texts, where every research need may entail
a different approach and different training needs, many of the
scanning decisions are generalizable and can be represented in a
training document. The project's instructional Web pages on
scanning were designed to reduce the amount of staff intervention
and give a greater degree of freedom to users.
4.5 Evaluation of the Projects
While the majority of the projects discussed here could be
supported by numerous stand-alone, operating-system specific
hypertext products, the Web has several advantages.
The projects' electronic resources are widely available on
the Internet, and users can access them on a variety of computer
platforms, regardless of the fact that the Web server is running
on a UNIX computer. (Attractive graphical Web clients, such as
Mosaic and OmniWeb, are available for Macintoshes, IBM-compatible
computers using Microsoft Windows, UNIX computers with X Windows,
and NeXTs.)
+ Page 13 +
Another key advantage is that the source material for the
editions either conforms to or is in the process of being
composed using international standards; it is marked up to
suggest the functional characteristics of the collections, rather
than their representational characteristics. Elements, such as
titles, quotations, and headings, are marked to suggest their
functional role in the document, rather than any presumed display
value. Displays depend instead on the capabilities of the user's
software, which utilizes the functional characteristics of the
elements to determine how to present the information.
This reliance on functional--not representational--
characteristics means that the same materials can be used in a
variety of different ways, supporting the creation of editions
with other software packages (e.g., Electronic Book Technology's
DynaText), use with different analytical tools (e.g.,
morphological parsers), and access through different database
schemes (e.g., text-specific systems or relational database
managers designed for images). A high degree of flexibility,
viability, and multi-platform access can be maintained.
Each of the mentioned editions and historical analyses was
first composed in a very rich SGML format that was designed to
discriminate between the functional characteristics of low-level
elements. They were subsequently converted (as automatically as
possible) to static HTML versions for use with the Web.
Elements, such as discrete descriptive bibliographic
characteristics, become simple list items, and most complex prose
and verse elements are reduced to paragraphs and line breaks.
After this conversion, it was discouraging to see that richness
disappear, but the original document remained unchanged.
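The loss of richness in such a conversion can be seen in a small
Python sketch of a tag-mapping filter. The source tag names, the
TAG_MAP table, and the function to_html are hypothetical and do
not describe the conversion programs actually used; attributes
are ignored to keep the sketch short.

```python
import re

# Hypothetical mapping from rich source tags to the few HTML tags
# available. Several distinct source elements share one HTML target.
TAG_MAP = {
    "stanza": "p",
    "para": "p",
    "speech": "p",
    "bibl-author": "li",
    "bibl-title": "li",
}

def to_html(source):
    """Rename each start and end tag; unmapped tags pass through."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        return f"<{slash}{TAG_MAP.get(name, name)}>"
    return re.sub(r"<(/?)([A-Za-z][\w-]*)>", swap, source)

rich = "<stanza>x</stanza><para>y</para><bibl-title>z</bibl-title>"
print(to_html(rich))
print(sorted(set(TAG_MAP)), "->", sorted(set(TAG_MAP.values())))
```

Five source elements collapse into two HTML elements, so the
HTML alone cannot tell a stanza from a prose paragraph. The
string held in rich is untouched by the conversion, which mirrors
the point that the original document remained unchanged.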
There is a continued expectation by the scholars who created
these resources that better tools will be developed to tap the
inherent complexity of these materials. The standards-based
format of the materials ensures that these scholars will be able
to take advantage of these new tools when they become available.
5.0 The Web as an Authoring and Document Delivery Environment
The authoring and document delivery capabilities of the Web are
significantly limited for documents of even moderate complexity.
Authoring for the Web is usually done in HTML. HTML has many
virtues, not least of which is its striving for expressiveness
and SGML validity. It is, however, an impoverished tag set with
little ability to reflect the complexities of most of the
documents discussed earlier, despite their being offered through
the Web. It is important to note that the Web is a limited
document delivery environment. Its inability to recognize or use
structural features of documents forces unpleasant administrative
decisions that will likely restrict the later use of these
documents.
+ Page 14 +
5.1 HTML's Lack of Expressiveness
The range of HTML tags available to users is limited. In
contrast to the hundreds of tags made available by the TEI
guidelines, roughly two dozen tags are made available in HTML.
While HTML will be expanded with HTML+ to give greater precision
in areas such as tabular data, HTML+ cannot be expected to
provide the breadth needed to support literary and historical
documents, or even to support standard journal literature.
This lack of expressiveness and insufficient breadth of tags
also leads to the author's inability to differentiate important
elements with HTML. In HTML, the same small set of tags is
necessarily used for diverse sets of elements. For example, the
<br> code (line break) is used for verse lines, table elements,
stanza divisions, dramatis personae, and many other features. Authors
are also left with little ability to represent the structural
organization of a document. Where the author wishes to define a
bounded segment of text, such as a stanza or chapter, no tag is
available for this purpose. Instead, authors rely extensively on
dividing documents into files representing major structural
divisions. Elements that are normally defined as structural tags
in SGML, such as the paragraph (or <p>) tag, are not defined by
HTML in a way that reliably defines the contents of a paragraph.
This paucity of tags in HTML results in the author of any
document of moderate complexity using many tags to effect a
desired appearance, rather than to characterize the content.
This type of tagging confuses function and appearance.
The inability of HTML to represent complexity is often closely
linked to the inability of Web servers to provide access to
complex representations of documents. This inability is
fundamentally linked to the notion of structure. Even where
structural distinctions exist in the markup language, there is no
inherent ability in the Web to deliver that individual element.
So, for example, HTML defines glossaries and glossary entries,
but, in order to provide access to an individual glossary entry
from a hypertext link, the server must send the entire file
(i.e., the file containing the glossary) to the user. Smaller
glossaries cause few problems, but this makes providing access to
individual "glossary" entries in a document such as the Oxford
English Dictionary, where all 500 MB would be transferred across
the network, effectively impossible. While Web browsers are
intelligent enough to move automatically within the file to the
chosen glossary entry, the file transfer paradigm is impractical
for large-scale information delivery. Given this, it must also
be pointed out that there are very few HTML tags that define
structural relationships. Structures such as chapters, sections,
or poems are not represented.
+ Page 15 +
The Web's deficiency with regard to structural features leads
to decisions with serious negative administrative consequences.
Because the Web does not include structure awareness in its
protocol and because HTML markup provides so little support for
structural representation of features, the author and the
administrator are forced to fragment documents into sets of
reasonably sized components.
In converting the ARL book University Libraries and Scholarly
Communication (URL:
http://www.lib.virginia.edu/mellon/mellon.html) to HTML, I found
that, using the Web and HTML alone, it was necessary to divide
the dozen chapters into separate files. While this may not sound
onerous, extending this practice to a large collection of
documents--or even a small collection of large documents--would
be very difficult. An HTML version of the OED would become a set
of 300,000 files. Chadwyck-Healey's English Poetry Database
would become either 2,500 files (if the administrator wished to
provide access at the volume level) or 65,000 files (if access to
individual poems were supported). Even this severe approach does
not solve needs that might arise for substructures, such as
quotations and definitions within the OED or specific stanzas
within a poem.
5.2 Overall Limitations of HTML
For documents of limited complexity, HTML is an effective
authoring environment; however, it seriously limits the ways in
which a more complex document or a set of documents can be used.
No differentiation of important elements (e.g., stanzas and
subdivisions of prose) can take place, and it will be necessary
to upgrade the coding of HTML documents within the year.
The Web also lacks inherent document management or document
access capabilities. In part because of the limitations of the
markup language and in part because of the design of the
protocol, there is a paucity of structure represented and no
structure recognized. I emphasize "inherent," however, because
the Web also provides a gateway capability that can more than
compensate for this deficiency.
6.0 Exploring Alternatives
I have been developing a gateway from the Web to an indexed
collection of texts in an SGML-aware system to take advantage of
the complexity of the documents and yet make them available
through the Web. The texts are nearly all in fully validated
SGML tag sets, each with significant expressiveness.
In contrast to an HTML collection, potentially consisting of many
files representing the many component parts of the collection,
each text is a single file with as many as hundreds of thousands
of structural components.
+ Page 16 +
6.1 Collections
Three diverse examples are provided to help the reader understand
the nature of the collections used in the gateway.
6.1.1 University of Virginia Middle English Collection
The Middle English collection assembled by the University of
Virginia's Electronic Text Center is approximately thirty texts
in a single file. (The collection's URL is
http://etext.virginia.edu/Mideng.query.html.)
Texts vary in size from several dozen pages to several hundred
pages. One of the Library's smaller collections, it is
approximately 11 MB of raw text, but it grows as new materials
become available. The markup language used is SGML complying
with the Oxford Text Archive's DTD, a tag set that will
eventually represent a valid subset of the TEI DTD. The tags
differentiate major structural elements, such as tales in the
Canterbury Tales, bibliographic elements, and elements of
composition (e.g., verse lines, stanzas, and paragraphs). Markup
is rich enough to support a wide range of analytical
requirements, and the texts have been made available for the
purpose of analysis to the University of Virginia community for
much of the past two years. With the permission of Open Text,
the Oxford Text Archive, and creators of individual texts, access
to this collection is unrestricted. It can be accessed in a
variety of ways, including the Web.
6.1.2 Chadwyck-Healey English Poetry Database
The Chadwyck-Healey English Poetry Database is purchased on tape
from the publisher and made available indexed by PAT. Access to
this collection is restricted to a consortium of five
universities in Virginia. As yet incomplete, the collection
currently consists of nearly 1,600 works with more than 64,000
poems and 233,000 pages.
The raw text is relatively large (340 MB), but, indexed with PAT, searches usually yield results in less than one second. The SGML used with the English Poetry Database is a very rich set of tags designed in consultation with a TEI representative. It is more than adequately expressive about the poems, including structural markup for poems, poem divisions such as stanzas, lineation, and attributes such as whether rhyme is used.

+ Page 17 +

6.1.3 Oxford English Dictionary

The Oxford English Dictionary is the largest and arguably the most complex resource made available through this service. The 570 MB document contains approximately 300,000 entries, many with more than fifty subelements. Strictly speaking, it is not in SGML form because it has not been validated against a DTD. The electronic version was, however, designed to take advantage of SGML's characteristics, and it significantly benefits from the file's structural and descriptive markup.

6.2 Web to PAT Gateway

I have constructed a gateway between the Web and the more sophisticated SGML texts using the Web's CGI (Common Gateway Interface) and PAT, an SGML-aware text retrieval program. Text is returned from PAT to the Web in the richer SGML, and it is converted on the fly to HTML, primarily using HTML to control the appearance of the text on the screen. This gateway is being documented elsewhere (URL: http://sansfoy.lib.virginia.edu/pub/www-to-pat/), but several facets are relevant to this discussion.

6.2.1 Expressive Representation of Text is Retained

The original unmodified texts are accessed through the gateway without compromising the expressiveness of the original markup. Although the sophisticated SGML markup is dynamically rendered as HTML as the user retrieves results, the text remains in the original rich SGML form behind the Web representation.
Decisions about the way that the fuller tag set maps to HTML are registered in filters, and, as HTML becomes more expressive, a better match between the original tags and the HTML can be made.

6.2.2 Simple Queries and Simple Access

Users need not be familiar with PAT's query language to search texts and take advantage of the structural characteristics of the more expressive markup. A word or phrase search returns keywords-in-context (KWIC) views to the user, from which a view of larger context is possible. Eventually, this process may lead the user to retrieval of entire sections (e.g., chapters or acts). All expanded views are made from hypertext links that initiate structural retrievals such as "the chapter that includes this search result."

+ Page 18 +

6.2.3 Menu-Driven Structural Queries

It is possible to facilitate complex queries through menus. For example, in the OED, the word lookup function facilitated by the Web includes queries such as: "give me entries that include my word within the Lookup field of the Headword Group field," or "give me entries that include my word in the Variant Form field." The user is not aware of the complexity of the query taking place, but can modify the type of query by selecting different variations on the search menus.

Boolean queries that ask for the intersection of document structures have been challenging to users employing command-line and analytically oriented interfaces. However, through simple fill-out forms and menu selections, queries such as "(stanzas including [word/phrase]) INTERSECT (stanzas including [word/phrase])" are executed without the user needing to understand the system's command syntax. While we also offer access through several complex, analytical interfaces (PatMotif and PowerSearch from Open Text as well as a locally developed VT 100 interface), most users can avoid these more complicated interfaces.
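
The kind of structural intersection described above can be illustrated with a minimal sketch. It is not PAT's query syntax or its indexing method, and it is not the gateway's code. It assumes a hypothetical tag named "stanza", non-nested elements, and a naive scan of the raw text in place of a real index. The function names (regions, regions_including, intersect_query) and the sample text are invented for illustration.

```python
import re

def regions(text: str, tag: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every <tag>...</tag> element.

    Assumes elements of this tag are not nested inside one another.
    """
    pattern = re.compile(rf"<{tag}>.*?</{tag}>", re.DOTALL)
    return [m.span() for m in pattern.finditer(text)]

def regions_including(text: str, tag: str, phrase: str) -> set[tuple[int, int]]:
    """Return the regions of `tag` that fully contain an occurrence of `phrase`."""
    hits = [m.span() for m in re.finditer(re.escape(phrase), text, re.IGNORECASE)]
    return {
        (start, end)
        for (start, end) in regions(text, tag)
        if any(start <= h_start and h_end <= end for (h_start, h_end) in hits)
    }

def intersect_query(text: str, tag: str, first: str, second: str) -> list[str]:
    """(regions including `first`) INTERSECT (regions including `second`)."""
    both = regions_including(text, tag, first) & regions_including(text, tag, second)
    return [text[start:end] for (start, end) in sorted(both)]

# Hypothetical sample text; the tag names are illustrative only.
SAMPLE = (
    "<poem>"
    "<stanza>The rose is red, the night is long.</stanza>"
    "<stanza>The rose has faded with the day.</stanza>"
    "<stanza>No flower blooms at night.</stanza>"
    "</poem>"
)

# The two phrases would come from the fields of a fill-out form.
result = intersect_query(SAMPLE, "stanza", "rose", "night")
for stanza in result:
    print(stanza)
```

In this sketch the user supplies only the two phrases and a menu choice of structure. The set intersection is done on the user's behalf, which is the point of the menu-driven approach.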
6.2.4 Access to Structure

Finally, the administrator of a collection need not resort to fragmenting files to make it possible to provide access to the component parts of a collection. As mentioned earlier, an HTML approach to the OED would require us to divide it into 300,000 files. I was recently able to represent the dozens of parts, chapters, sections, and subsections of a voluminous SGML technical document through this strategy, making hypertext links and each component accessible by utilizing the fairly rich markup; however, the document remained a single file. Resource management is made more reasonable through a system cognizant of a file's structure.

+ Page 19 +

6.2.5 Future Approaches

This strategy has many possibilities. Journal literature coded in SGML may be successfully accessed through this sort of strategy. For example, a journal run marked up according to the more elaborate Association of American Publishers DTD could return articles to the user through PAT queries. Another approach would facilitate browsing by recognizing the structural relationship of author and abstract to article, article to issue, issue to volume, and volume to collection. Throughout, the collection would exist as a single file, searchable across all articles by a single query. The collection would not need to be compromised by converting the articles to HTML, but would instead continue to remain in the more expressive AAP SGML format, filtered for display in the process of retrieving information. Through this strategy, the Web can be an effective means of accessing the original files in a fuller SGML, without resorting to fragmenting the material into files corresponding to the individual articles or even parts of articles. Similar strategies for books and documentation are possible.

7.0 What Does the Web Offer Libraries?

The Web is a complex system with great potential and serious limitations.
We should use caution as we consider composing in HTML: it is a short-term coding strategy. Documents composed in HTML will have limited expressiveness, and, because HTML is not yet stable, they are likely to need continuing enhancement to be used in the Web.

There is much to be excited about with the Web: it is a viable system that suggests what electronic publishing on the Internet can be. We have lacked credible, demonstrable examples of standards-based, networked hypertext in the past, and the Web has changed that. There is a great deal of untapped potential in the Web. By exploiting the Web's ability to talk to other more sophisticated programs, we can begin to take advantage of that potential and make tomorrow's promise real today.

+ Page 20 +

A subtext of this article has been the importance of standards--both employing them in creating hypertexts and extending the Web to take greater advantage of them. Standards have been attractive to libraries because they help ensure long-term viability. However, as Jefferson remarked in 1790, standards are also an important key to information being generally useful, regardless of context:

     Measures, weights and coins, thus referred to standards unchangeable in their nature . . . will themselves be unchangeable. These standards, too, are such as to be accessible to all persons, in all times and places. The measures and weights derived from them . . . are within the calculation of every one who possesses the first elements of arithmetic, and of easy comparison, both for foreigners and citizens, with the measures, weights, and coins of other countries. [4]

Notes

1. A version of this article was presented as a paper at the Yale Hypertext Conference, May 1994. An HTML version of the original speech, with active links to the resources discussed, is available via the World-Wide Web; URL: http://sansfoy.lib.virginia.edu/pub/yale.html.

2. Jerome McGann, The Complete Writings and Pictures of Dante Gabriel Rossetti: A Hypermedia Research Archive (Charlottesville, VA: Institute for Advanced Technology in the Humanities, University of Virginia, 1994). (Electronic document available via the World-Wide Web; URL: http://jefferson.village.virginia.edu/rossetti/rossetti.html.)

3. Edward Ayers, The Valley of the Shadow: Living the Civil War in Pennsylvania and Virginia (Charlottesville, VA: Institute for Advanced Technology in the Humanities, University of Virginia, 1994). (Electronic document available via the World-Wide Web; URL: http://jefferson.village.virginia.edu/vshadow/vshadow.html.)

4. Thomas Jefferson, "Public Papers," in Writings (New York: Literary Classics of the U.S., 1984), 410.

About the Author

John Price-Wilkin, Systems Librarian for Information Services, Alderman Library, University of Virginia, Charlottesville, VA 22903. Internet: jpw@virginia.edu.

+ Page 21 +

-----------------------------------------------------------------
The Public-Access Computer Systems Review is an electronic journal that is distributed on the Internet and on other computer networks. There is no subscription fee.

To subscribe, send an e-mail message to listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name Last Name.

This article is Copyright (C) 1994 by John Price-Wilkin. All Rights Reserved.

The Public-Access Computer Systems Review is Copyright (C) 1994 by the University Libraries, University of Houston. All Rights Reserved.

Copying is permitted for noncommercial use by academic computer centers, computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission.
-----------------------------------------------------------------