The Relationship Between General and Specific DTDs: Criticizing TEI Critical Editions

David J. Birnbaum
University of Pittsburgh
Department of Slavic Languages and Literatures
Email: djbpitt+@pitt.edu
URL: http://clover.slavic.pitt.edu/~djb/


Copyright © 2000 by David J. Birnbaum. All rights reserved.
Do not reproduce or cite without permission. Comments welcome.

Last revised: 2000-07-05 12:28:05.


Keywords: architectural forms, attributes, critical editions, document type definitions, DTD, elements, extensible markup language, XML, OmniMark, standard generalized markup language, SGML, text encoding initiative, TEI

Abstract: The present study discusses the advantages and disadvantages of general vs specific DTDs at different stages in the life of an SGML document based on the example of support for textual critical editions in the TEI. These issues are related to the question of when to use elements, attribute, or data content to represent information in SGML and XML documents, and the article identifies several ways in which these decisions control both the degree of structural control and validation during authoring and the generality of the DTDs. It then offers three strategies for reconciling the need for general DTDs for some purposes and specific DTDs for others. All three strategies require no non-SGML structural validation and ultimately produce fully TEI-conformant output. The issues under consideration are relevant not only for the preparation of textual critical editions, but also for other element-vs-attribute decisions and general design issues pertaining to broad and flexible DTDs, such as those employed by the TEI.


1. Introduction: Elements, Attributes, and Data Content

SGML (Standard Generalized Markup Language) provides at least three ways for representing textual information: 1) GIs (generic identifiers) (the names of elements), 2) attributes, and 3) pcdata element content. For example, if a witness entitled "witnessname" includes a reading "here is some text", these two pieces of information (witness name and textual content) might reasonably be represented without redundancy in SGML in any of the following five ways:

  1. <witnessname>here is some text</witnessname>
    
  2. <reading name="witnessname">here is some text</reading>
    
  3. <reading>
      <content>here is some text</content>
      <witness>witnessname</witness>
    </reading>
    
  4. <witnessname content="here is some text">
    
  5. <reading name="witnessname" content="here is some text">
    

Thus, an editor must make the following decisions:

  1. How to record the name of the witness: in the GI (1 and 4), as an attribute value (2 and 5), or in the content of a separate enclosed element (3).
  2. How to record the text provided by the witness: as pcdata content of the main element (1 and 2), in the content of a subordinate element (3), or as the value of an attribute (4 and 5).

Versions 4 and 5 can be dismissed immediately, and I have mentioned the possibility of encoding the text provided by a witness as an attribute value of an empty element only because this type of approach to representing textual content has recently achieved a certain popularity in an XML (Extensible Markup Language) e-commerce context as an alternative to traditional DBMS (database management system) representations. Under this approach, a DBMS record is encoded as an XML empty element and DBMS fields are each encoded as attribute values of that element. This approach may be effective when the field content is plain text, but it raises complications when the text in question may itself contain markup that reflects an internal hierarchical element structure, especially if this markup needs to be parsed for other purposes. This complication means that whatever its merits in certain specific contexts, encoding significant (and possibly hierarchically-structured) natural-language text as an attribute value of an empty element is unappealing as a general solution to the problem of preparing textual critical editions. For this reason, I will assume that the text from witnesses should be represented as pcdata content, which means that I will omit further discussion of strategies 4 and 5, above. Instead, I will concentrate on examining the different consequences of representing the names of witnesses as GIs, as attribute values, or as pcdata content, as exemplified by strategies 1-3, above.

In some respects the informational differences among strategies 1-3, above may appear slight. All three representations record unambiguously the name of the witness and the content, and these are the most important pieces of information for philological research. Each representation can be converted automatically to either of the others. There are, however, important differences among these three strategies involving both what they are capable of representing and how they interact with SGML syntax.

Before these issues can be addressed, it is necessary to examine the three methods employed by the TEI (Text Encoding Initiative) for encoding textual critical editions, all of which are based on strategy 2, above. Section 2, below (Flexible DTDs) outlines the general TEI strategy for ensuring the flexibility of a set of DTDs designed for multiple purposes. Section 3 (The TEI Approach to Critical Editions) narrows the focus from general TEI flexibility issues to those that pertain specifically to the encoding of textual critical editions. Section 4 (Problems with the TEI Approach to Critical Editions) identifies limitations imposed by the TEI DTDs on both what may be represented in critical editions and how that representation may be processed with SGML tools. Section 5 (Three Solutions) outlines three strategies for preparing TEI-conformant textual critical editions that overcome the liabilities identified in Section 4. Section 6 (Conclusions) summarizes the principal conclusions that emerge from this report.

2. Flexible DTDs

One virtue of DTDs (document type definitions) is that they can help ensure structural uniformity (or, at least, coherence) across a body of document instances. This uniformity enables a set of similar or related documents to be processed, whether automatically by an application or intuitively in the mind of a user, in a consistent way.

The TEI DTDs are intended to serve a very broad community of users, whose interests and prejudices may vary considerably. This ambitious purpose creates an inevitable tension between the uniformity inherent in the notion of shared, communal DTDs, on the one hand, and the variation inherent in a heterogeneous user community, on the other. The TEI attempts to resolve this tension in three ways: modules (discussed in section 2.1, below), alternatives (discussed in section 2.2, below), and extensions (discussed in section 2.3, below).

2.1. Modules

One of the most ingenious features of the TEI architecture is its modular design. One popular metaphor for this design is that of a Chicago pizza:

All pizzas have some ingredients in common (cheese and tomato sauce); in Chicago, at least, they may have entirely different forms of pastry base, with which (universally) the consumer is expected to make his or her own selection of toppings. Using SGML syntax this might be summarized as follows:

<!ENTITY % base "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(sausage | mushroom | pepper | anchovy ...)">
<!ELEMENT pizza - - (%base, cheese [amp   ] tomato, (%topping;)*)>

In the same way, the user of the TEI scheme constructs a view of the TEI DTD by combining the core tag sets (which are always present), exactly one "base" tag set and his or her own selection of "additional" tag sets or toppings. ([Burnard])

For example, the TEI DTD for critical editions used in the present report selects the prose base and the textual criticism topping. This modular approach to DTD design represents an attempt to compromise between the competing and equally unrealistic goals of providing a single, uniform DTD that will be suitable for all users, on the one hand, and enabling each user to design a custom DTD that is tailored to the specific needs of his or her documents and purposes, on the other. One practical implication of this approach is that there should be nothing completely surprising in a document designed according to the TEI pizza philosophy; while not all TEI-conformant documents will encode the same information the same way (or even at all), and not all will have been authored with the same specific TEI-conformant DTD, all such documents will be have been constructed from common elements used in a flexible but not unrestricted way.[1]

2.2. Alternatives

In addition to permitting users to select the DTD modules they wish to include in their individual DTDs, in some cases the TEI DTDs also provide support for multiple ways of encoding particular structures. One striking example of this approach is the three different mechanisms proposed for linking an apparatus to a text in a critical edition: 1) location referencing (using line numbers or some other canonical reference scheme), 2) double-end-point attachment (indicating precise locations of variant readings), and 3) parallel segmentation (providing variant readings in parallel within the text) ([TEI P3], section 19.2).

These three mechanisms are described and discussed in greater detail below, but the differences among them are essentially of two types: the technological and the personal. Concerning technological differences, for example, the location reference and double-end-point attachment methods may be used either in line or in an external apparatus, while the parallel segmentation method may be used only for an in-line apparatus. Similarly, the double-end-point attachment method can identify the exact end points of variants, while the precision of the location reference method depends on the precision of the canonical reference system underlying it (for example, the use of biblical chapters and verses does not directly permit references to units smaller than a verse). As for personal differences, scholars may be used to visualizing a critical edition in a particular way (concerning both the reference method and the choice between an in-line and an external apparatus), and may find it either intellectually difficult or psychologically unacceptable to adopt a markup strategy that does not reflect their conceptualizations of a text directly, even when there is no informational difference at stake. The result of the compromise adopted by the TEI is that most editors will be able to prepare TEI-compatible critical editions that model their analytical perspectives on the texts fairly closely, but at the expense of reducing the extent to which a user can anticipate how an arbitrary TEI-conformant critical edition will be structured. The three methods mentioned above are discussed individually in section 3, below.

2.3. Extensions

The TEI developers understood that it would not be possible to anticipate all the ways in which users would wish to encode documents, and that no degree of flexibility in choice of encoding methods could prove sufficient for all members of a broad community with varying goals and established practices. Accordingly, the TEI DTDs contain a standardized extension mechanism, which permits the introduction of markup not anticipated by the TEI editors, including the deletion of elements, the renaming of elements, the extension of classes, and the modification of content models or attribute lists. The mechanism is, in fact, so powerful that it:

... if used in an extreme way, permits deletion of the entire set of TEI definitions and their replacement by an entirely different DTD! Such revisions would result in documents that are not TEI conformant in even the broadest sense, and it is not intended that encoders use the mechanism in this way. ([TEI P3], section 29)[2]

Where elements are renamed or classes extended, the TEI guidelines provide for the inclusion of a TEIform attribute for each element, which can be used to associate a new GI with the original TEI element on which it is based. This strategy enables an application to refer to the TEIform attribute value when deciding how to process the new element. For example, a processing application might regard as TEI paragraphs not just elements where the GI is equal to <p> (the default TEI element for encoding paragraphs), but also all other (new) elements where the value of the TEIform attribute is also p. This type of system is illustrated in Section 5.1, below.

3. The TEI Approach to Critical Editions

Texts undergo alteration during copying, either accidentally (when a scribe inadvertently miscopies a source text) or deliberately (when a scribe consciously changes a source text, often in an attempt to improve it or to correct what he perceives to be an error). A textual critical edition attempts to provide evidence from multiple copies of a work (called witnesses), which scholars may then use for a variety of purposes (e.g, to determine the filiation of the witnesses, to reconstruct lost early or intermediary copies, to trace the history of the transmission of the text, etc.). Editors sometimes treat all witnesses as equivalent; in other cases they will identify a principal witness (called a copy text), which is transcribed in its entirety, and cite selected variants from other witnesses (called control texts) only when those witnesses provide important information that the editor considers important that is not available from the copy text. A reading from a privileged text is sometimes called a lemma and the collection of variants from control texts is called a critical apparatus. Witnesses in a critical apparatus are traditionally identified by short unique identifying strings of letters, numbers, and symbols called sigla (a contraction of plural sigilla, singular sigillum).

As was noted above, the TEI provides three methods for linking a critical apparatus to a text: 1) location referencing (using chapter and verse numbers or some other canonical reference scheme), 2) double-end-point attachment (indicating precise locations of variant readings), and 3) parallel segmentation (providing variant readings in parallel within the text) ([TEI P3], section 19.2).

What all three methods have in common is that the apparatus, whether in-line or external, is contained in <app> (apparatus) elements. An <app> element normally consists of <rdg> (reading) elements, plus an optional <lem> (lemma) element, which may be used to represent the reading from a privileged witness. Individual reading elements may be included immediately within an apparatus element or they may be combined within intermediary <rdgGrp> (reading group) elements to represent any grouping considered desirable by the editor. In either case, the names of the witnesses to each reading will normally be encoded as the value of a wit attribute of the <rdg> element. Alternatively or additionally, the names of the witnesses may be specified in a <wit> element inside the <rdg> element.

A skeletal apparatus element that uses reading groups might look like the following:

<app>
  <rdgGrp>
    <rdg wit="A">Text from witness A</rdg>
    <rdg wit="B">Text from witness B</rdg>
  </rdgGrp>
  <rdgGrp>
    <rdg wit="C">Text from witness C</rdg>
    <rdg wit="D E">Text that is identical in witnesses D and E</rdg>
  </rdgGrp>
</app>

As was noted above, the witnesses attesting a particular reading may be recorded as the value of the wit attribute of the <rdg> element (as above) or they may be enclosed in a separate <wit> element within the <rdg> element. In either case, the witnesses that are included in an edition are supposed to be documented in <witness> elements inside a <witList> element found elsewhere in the document.[3] The <witness> element has an attribute sigil, representing the sigillum associated with that witness, and it is intended that the values found in these sigil attributes will correspond to the witness identifiers associated with <rdg> elements (whether these are given as values of the attribute wit or as data content of the element <wit>).

Unfortunately, because the attribute wit of the <rdg> element and the attribute sigil of the <witness> element inside a <witList> element are both of type cdata, an SGML parser is unable to validate any aspect of the desired correspondences. But [TEI P3] tantalizingly suggests that:

The advantage of holding witness information in the wit attribute of <lem> or <rdg> is that this may make it more convenient for an SGML application to check that every sigil identifier has been declared elsewhere in the document. By giving the wit attribute a declared value of idrefs, for example, one could more easily ensure that readings are assigned only to witness sigla given as id values for witnesses in a <witList> element ... . (Section 19.1.4.2)

Because the standard TEI DTDs do not, in fact, declare the attributes in question as id and idrefs, as described in the preceding paragraph, and declare them instead as cdata, the advantages described above are not available. A user could, however, modify the TEI DTDs to change the attribute types, and because all attributes of type id and idrefs also meet the requirements for cdata, documents created in this way would be fully TEI conformant.[4]

The following subsections describe briefly each of the three methods for associating an apparatus with texts, identify the strengths and weaknesses of each, and compare their features.

3.1. Location Reference

The location reference method gives the reading from the base text in line and inserts the apparatus wherever the editor wishes. The apparatus may be associated with the text either by physical location (e.g., it may be included within the element to which it refers) or by explicit location reference. The precision of the reference depends on the precision of the reference method employed, which is related to the granularity of the markup. For example, as was noted above, a location reference system that relies on biblical chapters and verses is not able to associate an apparatus element with any portion of text smaller than a verse. ([TEI P3], section 19.2.1)

Location referencing is convenient where a canonical reference system is well-known and where precise alignment of the apparatus with the base text is not required. Location referencing also requires that there be a base text, since this method requires that the reading from exactly one witness appear in line.

3.2. Double-End-Point Attachment

Double-end-point attachment indicates the exact endpoints of a span of text either by referring to attributes of type id located within the main text or through indirect pointing location methods. If the apparatus is in line, the <app> element itself can mark one end point of the span to which it refers, with the other end point indicated by reference to an attribute of type id. ([TEI P3], section 19.2.2)

The primary advantage of double-end-point attachment is that it is the only one of the three methods that is designed to handle overlap (discussed in greater detail in section 3.3, below). On the other hand, one consequence of this power is that double-end-point attachment is the most complex and least legible method, making it the most difficult to implement and process without specialized tools.

3.3. Parallel Segmentation

Parallel segmentation is the only method that does not require a privileged base text. Under the parallel segmentation method, all readings are grouped together in parallel, and although one may be designated as a lemma, this is not required, and except for assigning it a special name, the markup does not necessarily treat the lemma any differently from any other witness. ([TEI P3] section 19.2.3)

The only disadvantage to the parallel segmentation method is that it does not support overlap. For example, suppose three witnesses attest the following:

Witnesses A and B vary only with respect to the first word, witnesses B and C vary only with respect to the last word, and witnesses A and C vary with respect to both the first and last words. Under parallel segmentation, one must either treat the entire line as a single <app> element with three separate readings (which fails to create any formal record of the agreement that does occur) or divide it into three portions: an <app> element for the first word (where A != B = C), data content for the central section (where there is no variation), and another <app> element for the last word (where A = B != C). This last strategy creates a formal record of all agreement and disagreement, but at what may be an inappropriate granularity; for example, with respect to A and B there are only two logical segments, the first word (where they disagree) and the rest of the sentence (where they agree). Because parallel segmentation, unlike double-end-point attachment, must be applied to all witnesses at once, the only alternative to one long segment is three short segments, as if the agreement between A and B in the middle of the sentence was a separate phenomenon from their agreement at the end.

4. Problems with the TEI Approach to Critical Editions

The present section begins by describing a different type of critical edition (section 4.1). It then identifies both limitations of this type of edition (section 4.2) and ways in which this type of edition is capable of resolving problems inherent in the standard TEI methods (section 4.3). Section 5, below provides three solutions to the problems identified in Section 4.

4.1. Introduction: An Alternative Type of Edition

The present section outlines the rationale for employing an alternative type of edition (section 4.1.1) and describes its structure (section 4.1.2).

4.1.1. Rationale

Traditional printed critical editions most commonly provide a base text (either transcribed from a privileged principal witness or constructed by the editor) and record variants in a separate apparatus, usually printed in the margins of the page. This type of presentation simplifies reading the base text and it economizes on paper, but it achieves these goals at the expense of complicating both reading any text other than the base (since that reading must be reconstructed on the fly by mentally replacing selected base readings with variants plucked from the apparatus) and studying variation in general (since readers must move their eyes constantly between the in-line reading of the base text and the apparatus that is found elsewhere). Furthermore, the compromised legibility of this type of apparatus encourages editors to be selective about citing variants, and this type of selectivity means that readers will be unable to distinguish text where there is no variation at all from text where the editor has determined that although variation exists, it is not significant. These two problems are identified in [Birnbaum] as "compromised legibility" and "incomplete presentation of the evidence," respectively.

4.1.2. Structural Description

The problems of compromised legibility and incomplete presentation of the evidence, noted above, can be avoided by transcribing all witness in full in parallel, along the lines of:

A: Line 1 from Witness A 
B: Line 1 from Witness B 
C: Line 1 from Witness C

A: Line 2 from Witness A
B: Line 2 from Witness B
C: Line 2 from Witness C

etc.

The presentation of full transcriptions of all text from all witnesses in parallel resembles a conductor's musical score, and is sometimes called a "score-like" edition. A score-like edition enables a user to read any witness easily, since there is no need to move one's eyes between a base text in one location and variants in another, and it also allows a user to see at a glance which witnesses agree with which others at a particular location.

The score-like structure may be considered a special case of TEI parallel segmentation, and it can be implemented using standard TEI parallel segmentation methods. It is similar to this method in that it incorporates an in-line apparatus in a way that does not privilege a single base text. But it differs from the TEI model in two respects: 1) all text is included in <app> elements (that is, even text that is identical in all witnesses will nonetheless be recorded separately for each witness) and 2) each <rdg> element is associated with exactly one witness.

4.2. Limitations of a Score-Like Edition

The two differences noted above result in important limitations in the power of a score-like edition that are not present in a true TEI parallel segmentation edition. The first difference, the inclusion of all text in <app> elements, even when there is no variation, means that there is no formal distinction between locations where some witnesses differ and locations where all witnesses agree. In the TEI model, <app> elements would be introduced only where there is variation, which enables text that varies to be identified automatically (although, as was noted above, overlap in variants may require segmentation that is either narrower or broader than the agreement patterns among specific witnesses would justify). The second difference, the association of each <rdg> element with exactly one witness, means that there is no formal encoding of agreement among selected witnesses. In the TEI model, a <rdg> element has a wit attribute of type cdata that contains a list of sigla for all witnesses attesting the reading in question, a strategy that enables an application to locate patterns of agreement by parsing the document and postprocessing the content of the wit attributes. Under the score-like structure, each reading will be associated with exactly one witness, which means that the only way to identify patterns of agreement will be to postprocess the data content of the <rdg> elements, a much more complicated operation than comparing simple and standardized sigla (and, furthermore, one that makes it impossible to use SGML tools to locate patterns of agreement ).

These two types of differences make it impossible to determine from the markup of a score-like edition where there is variation and where there is not (the first difference, but one that is also present to a lesser extent in the TEI parallel segmentation method) and where particular manuscripts agree and where they do not (the second difference). These are important limitations, and editors will need to decide whether supporting the formal encoding of this type of information is more valuable than overcoming the legibility and completeness problems noted above.

It is, of course, the case that a true TEI parallel segmentation encoding, if it includes all variation, can be converted to a score-like edition automatically, and one might argue that the more informative TEI parallel segmentation structure should be used for encoding a source file that can then be transformed into the more legible score-like format for rendering. This is a sensible compromise, although it cannot address fully two problems: 1) as was noted above, even the pure TEI parallel segmentation method is not able to formalize all patterns of agreement at the appropriate level of granularity because it is inherently unable to represent overlap, and 2) it is easier for an editor to avoid the incompleteness issue by encoding all witnesses in full.

From a different perspective, the intellectual process of identifying variation involves comparing all readings from all witnesses, which is precisely the perspective afforded by a score-like edition. With this consideration in mind, one might envision the relationship between the score-like and strict TEI parallel segmentation representations the other way around: the score-like representation is the input to identifying the patterns of variation that may then be encoded explicitly in a strict TEI parallel segementation representation. From a production perspective, one might start by transcribing all witnesses in full, use these to collate the witness in a score-like encoding, run the collated composite text through an application that identifies variation, and then use the output of the variant analysis to generate a critical apparatus using one of the standard TEI methods. Not only does an approach that takes full transcriptions of individual manuscripts as a starting point mirror the intellectual process of textual criticism, but it is also easily rerun should the editor need to change the list of witnesses, whether because of new discoveries or because certain witnesses must later be eliminated.

4.3. Problems with the Standard TEI Critical Edition Methods and Their Solution in Score-Like Editions

A score-like edition provides an opportunity to overcome certain limitations in structural control that are inherent in the three standard TEI methods of encoding a critical apparatus. Section 4.3.1 describes the technical weaknesses of those methods and section 4.3.2 describes how a score-like edition provides an opportunity to overcome them.

4.3.1. Problems Inherent in the Standard TEI Critical Edition Methods

One striking difference between the parallel segmentation and score-like editions is that parallel segmentation supports the association of a single <rdg> element with multiple witnesses, and this is in many respects a strong argument for the superiority of the parallel segmentation method. But this feature of the parallel segmentation method also imposes a certain cost, since there are times when the association of a single <rdg> element with multiple witnesses would be an error, and an SGML parser is unable to distinguish these situations from those where this association is appropriate. As the TEI guidelines note: "The hand and resp attributes [of <rdg> elements] are intelligible only on an element recording a reading from a single witness ... If more than one witness is given for a reading, they are undefined." ([TEI P3], 19.1.2) This must be given as a prose admonition, rather than encoded in the DTD, because SGML tools are incapable of distinguishing when attributes are defined based on the content of another attribute. This means that the considerable advantages in the parallel segmentation method of being able to associate a single reading with multiple witness is partially offset by the creation of an opportunity to introduce undefined markup that cannot be discovered through normal SGML validation.

The representation of all readings from all witnesses as <rdg> (reading) or <lem> (lemma) elements in the TEI DTDs introduces an important control problem. An editor who is creating a score-like edition using the TEI parallel segmentation apparatus method might mark up a section of text as follows:

<p id="p1">
  <app>
    <rdg wit="WitnessA">text from witness A</rdg>
    <rdg wit="WitnessB">text from witness B</rdg>
    <rdg wit="WitnessC">text from witness C</rdg>
  </app>
</p>

Because SGML controls linear and hierarchical document structure through elements, but not through attributes, there is no way that an SGML parser can ensure that all witnesses are represented, that no witness appears more than once, or that the witnesses occur in a particular order. If reading groups are used, an SGML parser is unable to validate whether <rdg> elements associated with specific witnesses occur in the correct <rdgGrp> element. This type of validation can be performed externally, but because what SGML does best is validate structure, it seems perverse to create an SGML document that depends on structural features that are not representable in the DTD. However, this type of validation could be performed internally using standard SGML tools if individual witnesses were distinguished not by attributes, but by elements. That is, the what is at issue is not an inherent limitation in the expressive power of DTDs, but a difference between the syntactic properties of elements and attributes within a DTD framework.

4.3.2. How a Score-Like Edition Provides an Opportunity to Address Problems Inherent in the Standard TEI Critical Edition Methods

It is clear that a score-like critical edition offers both advantages and disadvantages with respect to traditional critical editions. It is also clear that the TEI parallel segmentation method is able to represent many--but not all--of the features of a score-like edition. Finally, it is clear that some features of a score-like edition could be represented with greater structural control using the TEI parallel-segmentation method if the TEI DTDs could be changed as follows:

  1. The TEI DTDs cannot restrict the occurrence of any reading from any witness. That is, the TEI DTDs are unable to verify whether a required witness may have been omitted through error (although the editor might wish to treat some witnesses as required and others--such as fragments--as optional), and they are similarly unable to verify whether a particular witness may have been included more than once in a single apparatus element (which would always be an error). If individual witnesses were identified by GIs, rather than attribute values, the content models could control occurrences.
  2. The TEI DTDs cannot restrict the order of readings. From an informational perspective readings may be considered inherently unordered, which is to say that they might be envisioned not as constituent beads on a matrix string, but as constituent pendants on a matrix mobile, which enforces hierarchical but not linear structure. Readings are normally ordered for presentation, but that order is not part of their meaning. On the other hand, if the editor wishes to enforce a consistent order when the document is eventually prepared for presentation, and if this order is not going to be subject to change, it would be convenient to make the order an obligatory feature of the underlying SGML document, which would free the editor from having to process the document to ensure a particular consistent output order. If individual witnesses were identified by GIs, rather than attribute values, the content models could enforce order.
  3. As was mentioned above, the TEI DTDs permit the grouping of readings (<rdg> elements) into readings groups (<rdgGrp> elements). However, the DTDs do not provide a mechanism for ensuring that readings are grouped correctly. For example, an editor might wish to group manuscript witness in one reading group and printed witnesses in another, but although the TEI DTDs support the creation of these groups, they do not support the use of an SGML parser to verify which witnesses are listed in which group. If individual witnesses were identified by GIs, rather than attribute values, the content models could control grouping.

5. Three Solutions

This section examines three strategies for implementing the desiderata listed above. The initial requirements for solutions to this problem were that 1) all validation had to be performed using SGML tools and 2) the final result had to be fully TEI-conformant.

The three solutions involve 1) modifying the TEI DTDs according to the recommendations in the guidelines ([TEI P3], Section 29) and subsequently processing the document by referring to the TEIform attribute (Section 5.1); 2) encoding the document using a custom DTD and then transforming it to a standard TEI DTD using an arbitrary transformation tool (Section 5.2), and 3) encoding the document using a custom DTD that incorporates the TEI DTD as a base architecture, and then using SGML architectural processing to transform the document to a standard TEI DTD (Section 5.3).

The test document used in this report is the following small hypothetical critical edition:

Witness A: First line from witness A
Witness B: First line from witness B
Witness C: First line from witness C

Witness A: Second line from witness A
Witness B: Second line from witness B
Witness C: Second line from witness C

The standard TEI markup for this document in a parallel segmentation edition would be:

<!-- tei-standard.sgml -->
<!doctype tei.2 public "-//TEI P3//DTD Main Document Type 1996-05//EN" [
<!entity % TEI.prose 'INCLUDE'>
<!entity % TEI.textcrit 'INCLUDE'>
]>
<tei.2>
  <teiheader>
    <filedesc>
      <titlestmt>
        <title>TEI Critical Edition Test Document, Standard TEI Version</title>
      </titlestmt>
      <publicationstmt>
        <p>Unpublished.</p>
      </publicationstmt>
      <sourcedesc>
        <p>Original test document created 2000-03-10 by djb.</p>
      </sourcedesc>
    </filedesc>
  </teiheader>
  <text>
    <body>
      <p id="p1">
        <app>
          <rdg wit="A">First line from witness A</rdg>
          <rdg wit="B">First line from witness B</rdg>
          <rdg wit="C">First line from witness C</rdg>
        </app>
      </p>
      <p id="p2">
        <app>
          <rdg wit="A">Second line from witness A</rdg>
          <rdg wit="B">Second line from witness B</rdg>
          <rdg wit="C">Second line from witness C</rdg>
        </app>
      </p>
    </body>
  </text>
</tei.2>

As was noted earlier, this representation is unable to use SGML tools to validate the occurrence, order, or grouping of the witnesses. All three solutions proposed below will achieve greater control over these features in two ways: by representing each witness as its own element with its own GI and by revising and constraining certain content models to a subset of the content permitted by the TEI DTDs.

Creating new GIs for each witness makes it possible to develop a DTD that ensures that each <rdg>-type element refers to exactly one witness, that no witness is omitted inadvertently (although it is possible to declare certain witnesses as omissable, should that be desired), that no witness occurs more than once, and that all witnesses occur in a consistent order. This test case does not use the TEI <rdgGrp> element, but the technique described here can also be applied to ensure that specific witnesses appear only in the appropriate reading groups, and [Birnbaum] illustrates a custom DTD approach that incorporates reading groups.

These issues of content control are not unique to critical editions. For example, Simons notes that it may be convenient to design custom GIs for frequent combinations of standard TEI GIs with particular attribute values, and he gives the example of replacing the markup <foreign lang="SIK"> with a custom <sik> element in a dictionary of Sikaina. ([Simons], Section 2.2)

Convenience is certainly important, but an even more compelling reason to design custom GIs is that elements can provide types of structural control that are unavailable with attributes. As Simons notes, one advantage to creating a new <idiom> element as a replacement for the standard TEI <eg type="idiom"> (where <eg> represents examples of any type) and a new <lit> element for literal translations of idioms as a replacement for the standard TEI <tr type="lit"> element (where <tr> represents translations of any type) is that the <lit> element then can be constrained to occur only in the <idiom> element (that is, translations may occur freely in examples of all types, but literal translations may occur in examples only when those examples are idioms).

Constraining the content models of elements addresses a problem that Simons describes as the SGML and XML counterpart to "fatware". Much as software may be encumbered with features that not only are not needed by many users, but also may get in the way, so a large and general DTD, such as the TEI DTDs, may support more elements, broader content models, and more and broader attributes than are required for a specific project. ([Simons], Section 2.3) This is to be expected, since general DTDs need to support a variety of projects, but the availability of unneeded markup is both inconvenient (for example, unwanted markup in a menu is clutter, and may overwhelm visually the list of elements a user might actually need) and pernicious (since it enables the author to use markup that is legal in the general DTD but not desirable in the particular project).

One crucial feature of these issues is that the greater control over frequency, ordering, and grouping that is provided by the new GIs and content models is required at certain stages in the life of the document, but not at others. Most commonly, one might require strict control during authoring with a validating editor; alternatively, one might use a non-validating authoring tool and then ensure the validity of the document through an iterative process of external validation and revision. But whether validation is part of the authoring process from the beginning or introduced only at the end, once the validity of a completed document has been confirmed, subsequent processing does not require additional validation. This means that although the new structural control features discussed above may need to be present in the DTD used for validation during or after authoring, once the document has been completed and validated, transformation engines, rendering engines, and other post-authoring processes may have no direct need for witness-specific GIs or constrained content models. In this respect, the strategy in question extends a feature that was first observed as a general principle when XML was developed: much useful processing can be performed independently of a DTD.

From a slightly different perspective, the standard TEI DTDs may be unable to constrain document structure during authoring and validation as well as the alternatives discussed below, but as long as those contraints can be ensured in some other way, this limitation of the standard TEI DTDs may be unimportant during subsequent processing. This distinction reflects the different roles of the DTD at different stages in the life of the document. Specifically, during authoring and subsequent validation, the DTD defines the set of possible valid documents that can be created. But once the document has reached the stage where it is valid and will not be edited further, the document instance itself defines its own unique structure, and any other possible document structures that may also be licensed by the DTD become irrelevent as far as that particular document is concerned.[5]

5.1. The TEIform Attribute Approach

The TEIform-attribute approach is the only one of the three strategies that does not require the explicit use of a non-TEI DTD at any stage. Instead, this approach involves modifying the TEI DTDs as prescribed in the published guidelines ([TEI P3], Section 29), in this case by:

  1. Creating new elements for each witness that have the same content model as <rdg> and that declare "rdg" as the value of the TEIform attribute, which will enable a processing system to determine that the new elements should be processed identically to standard TEI <rdg> elements. The value of the wit attribute of each new element will be declared as a specific fixed value, which will prevent mismatches, omission, or duplication. Because no other standard attributes of <rdg> are used in the test file, these are not declared for the new elements, thus avoiding an opportunity for error.
  2. Changing the content model for <app> to admit these new elements and prohibit the use of the original <rdg> element or any other original content (to avoid an opportunity for error).

Section 5.1.1, below, illustrates these modifications of the TEI DTDs. Section 5.1.2, below, evaluates these modifications according to the clean/unclean dichotomy established by the TEI Guidelines ([TEI P3], Section 29.1). Section 5.1.3, below, demonstrates how a document created with a modified TEI DTD can be processed by a generic TEI-aware tool without requiring any special knowledge about the modifications.

5.1.1. How To Modify the TEI DTDs

The approach described above is implemented by creating the following TEI.extensions.ent file (called teiform-test.ent) and TEI.extensions.dtd file (called teiform-test.dtd). As was noted above, unused parts of the original content models and unused original attributes are removed, since their presence only creates an opportunity for error. The consequences of these modifications and of others that achieve a similar effect are discussed below.

<!-- teiform-test.ent -->
<!-- The following element is revised -->
<!entity % app 'IGNORE'>

<!-- teiform-test.dtd                                -->
<!-- The following declaration defines a new content -->
<!--   model for the revised app element             -->
<!--                                                 -->
<!element app      - - (witnessa, witnessb, witnessc)  >
<!attlist app          %a.global
          teiform      cdata           #fixed "app"    >
<!--                                                 -->
<!-- The following three declarations define new     -->
<!--   elements, which occur in the revised content  -->
<!--   model of the app element                      -->
<!--                                                 -->
<!element witnessa - o (%paraContent) +(%m.fragmentary)>
<!attlist witnessa     %a.global
          wit          cdata           #fixed "A"
          teiform      cdata           #fixed "rdg"    >
<!--                                                 -->
<!element witnessb - o (%paraContent) +(%m.fragmentary)>
<!attlist witnessb     %a.global
          wit          cdata           #fixed "B"
          teiform      cdata           #fixed "rdg"    >
<!--                                                 -->
<!element witnessc - o (%paraContent) +(%m.fragmentary)>
<!attlist witnessc     %a.global
          wit          cdata           #fixed "C"
          teiform      cdata           #fixed "rdg"    >

With modifications of this type in place, the following valid TEI-conformant document can be created:

<!-- tei-teiform.sgml -->
<!doctype tei.2 public "-//TEI P3//DTD Main Document Type 1996-05//EN" [ 
<!entity % TEI.extensions.ent system "teiform-test.ent" >
<!entity % TEI.extensions.dtd system "teiform-test.dtd">
<!entity % TEI.prose 'INCLUDE'>
<!entity % TEI.textcrit 'INCLUDE'>
]>
<tei.2>
  <teiheader>
    <filedesc>
      <titlestmt>
        <title>TEI Critical Edition Test Document, TEIform Version</title>
      </titlestmt>
      <publicationstmt>
        <p>Unpublished.</p>
      </publicationstmt>
      <sourcedesc>
        <p>Original test document created 2000-03-10 by djb.</p>
      </sourcedesc>
    </filedesc>
  </teiheader>
  <text>
    <body>
      <p id="p1">
        <app>
          <witnessa>First line from witness A</witnessa>
          <witnessb>First line from witness B</witnessb>
          <witnessc>First line from witness C</witnessc>
        </app>
      </p>
      <p id="p2">
        <app>
          <witnessa>Second line from witness A</witnessa>
          <witnessb>Second line from witness B</witnessb>
          <witnessc>Second line from witness C</witnessc>
        </app>
      </p>
    </body>
  </text>
</tei.2>

As was noted above, the TEI DTDs allow the association of multiple witnesses with a reading by declaring the type of the wit attribute of <rdg> elements as cdata. A score-like edition, on the other hand, presents the full text of each witness on a separate line, which can best be represented in SGML by requiring that each reading element be associated with exactly one witness. The declaration of the wit and TEIform attributes as fixed and the use of shorttag in the default TEI SGML declaration associates the correct attribute value with each new element, while ensuring both that these attributes do not need to be included explicitly in the markup (which is a convenience) and that the inadvertent inclusion of any value other than the one specified in the DTD will raise a parser error (which is a safeguard). ([DeRose], Section 5.15.)

This modification both permits (actually, requires) the use of the newly-defined elements inside <app> elements and prohibits the use of other elements usually permitted in that context. It does not, however, restrict the content of the <text>, <body>, or <p> elements that surround the <app> element, which creates an opportunity for a user to input element or data content that is legal in standard TEI documents but not wanted in this particular modified document. To provide greater protection against such errors, the content of these outer elements could be redefined similarly to the redefinition of the <app> element documented above. Furthermore, if the content of the new witness elements will always be pcdata, the content models can be narrowed, providing additional protection against the inadvertent inclusion of unwanted markup.

5.1.2. Clean and Unclean Modification

The TEI Guidelines divide modifications of the TEI DTDs into two classes, called "clean" and "unclean." Clean modifications are of two types: "The set of documents parsed by the original DTD may be properly contained in the set of documents parsed by a modified DTD, or vice versa." ([TEI P3], Section 29.1) In the present study, the first type of modified DTD will be called a "new superset DTD" and the second type will be called a "new subset DTD." The TEI Guidelines draw no further distinction between the two types of clean modification, and they also do not state explicitly that clean modifications are preferable to unclean ones, although this might be inferred from the vernacular meaning of the terms themselves.

In fact, there are striking practical differences in the utility of new subset DTDs and new superset DTDs. Document instances prepared with new subset DTDs may be parsed by any user who has a standard TEI configuration. This means that such document instances conform fully to the standard TEI distribution and may be exchanged without regard for the modifications (which in this case are relevant only during document preparation). Document instances prepared with new superset DTDs, on the other hand, cannot be parsed in arbitrary environments configured for the standard TEI distribution, which means that the document instances themselves do not conform to the standard (unmodified) TEI model. While the interchange and processing advantages of new subset DTDs are clear, the only processing advantage to new superset DTDs accrues to the developer, whose modified environment will be able to process standard TEI documents alongside his or her superset documents. On the other hand, such a use of modified DTDs to parse unmodified TEI document instances would make it impossible to verify whether the instances are valid against the unmodified TEI DTDs.

In light of these considerations, the different practical consequences of these two models suggest that the most important distinction may lie not between clean and unclean modifications, but between document instances prepared with clean subset DTDs, which can be exchanged freely without special preparation, and those prepared with clean superset DTDs, which cannot.

The modifications of the text file documented in the preceding section are necessarily unclean because they are of two types that have opposite consequences: the creation of entirely new elements means that the new DTD cannot be a subset of the original TEI DTDs, while the imposition of new restrictions on the content model of some standard elements means that the new DTD also cannot be a superset of the original TEI DTDs. As is demonstrated below, however, it is nonetheless possible to process documents created with the modified DTD using tools that do not need to refer explicitly to the newly-created elements. This is the strategy that motivated the creation of the TEIform attribute ([TEI P3], Section 3.5), and the use of this feature during processing means that the unclean documents in question can enjoy the same interchange and processing benefits of documents that reflect clean subset modifications.

From a more general perspective, there is no way to achieve a clean subset DTD while creating new elements. Once one is committed to creating new elements, it is possible to create a clean superset DTD by permitting the optional use of the new elements alongside all other elements that are already legal in the context in question. For example, instead of redefining the <app> element to admit only the new elements, one could extend the content model to admit either the new elements or the original content licensed by the standard TEI DTDs. This approach was not adopted in the present case for two reasons: 1) it creates the opportunity for error should the user inadvertently populate the <app> element with the original TEI content, rather than the new elements, and 2) the advantages of clean superset modifications are much less than the advantages of clean subset modifications, and they were not considered sufficient to justify the compromise in content control that would result.

5.1.3. How To Process a Modified TEI Document

The advantage of the modified TEI DTD approach is that the resulting document is fully TEI-conformant from start to finish, since although it extends the TEI DTDs, it does so in a standardized and well-documented way. The disadvantage of this approach is that although this type of extension may be standardized and well-documented, tools and applications that have been configured to process TEI-conformant documents based on standard TEI GIs will not know how to process the new elements created above. One solution to this problem involves modifying the processors explicitly as needed to recognize new elements, but this approach has no general value and is suitable only for small and infrequent use.

A more robust and scalable approach is to reconfigure TEI processors to use the TEIform attribute value in lieu of or in addition to the GI. For example, while most Omnimark scripts designed to process TEI documents are designed to respond to the GIs of the standard TEI elements, an alternative approach that ignores the GIs and acts instead on TEIform attribute values can deal with the modified DTD above without having to know anything more about the new GIs. Omnimark permits the definition of a catch-all "implied" element, which fires whenever an element is encountered for which specific actions are not declared. The use of this feature means that a generic Omnimark TEI transformation script could read the TEIform attribute of an unfamiliar element and act on it according to that attribute value. More generally, an Omnimark script for processing both unmodified and modified TEI documents could be written with only one element rule (for the default "implied" element), with different actions depending on the value of the TEIform attribute. A script designed in this way for standard TEI documents will be able to process any extended document for which the new elements were created only to provide control during authoring, and do not require custom processing.

The following brief Omnimark script converts the test document in section 5.1.1, above, to HTML 4.0 without calling any explicit rules for the newly-defined elements. Similar strategies can be implemented with any scripting language that is capable of processing an element with an unfamiliar GI according to its TEIform attribute value.

; tei-teiform.xom 
; run with omnimark -s tei-teiform.xom 
;                   teisgml.dec tei-teiform.sgml 
;                   -d socat "e:\lib\sgml\dtd\teip3\catalog" 
;                   -i "e:\program files\omnimark\xin\" 
;                   -of tei-teiform.html 
down-translate
include "socatete.xin"
global counter rdgcount initial {0}
element #implied
  do when attribute teiform = "TEI.2"
    output '<!doctype html public "-//W3C//DTD HTML 4.0//EN">'
    output "%n<html>"
    output "%c"
    output "%n</html>"
  else when attribute teiform = "teiHeader"
    output "%n<head>%c%n</head>"
  else when attribute teiform = "fileDesc"
    output "%c"
  else when attribute teiform = "titleStmt"
    output "%c"
  else when attribute teiform = "title"
    output "%n<title>%c</title>"
  else when attribute teiform = "publicationStmt"
    output '%n<meta name="publicationStmt" content="%c">'
  else when attribute teiform = "sourceDesc"
    output '%n<meta name="sourceDesc" content="%c">'
  else when attribute teiform = "text"
    output "%c"
  else when attribute teiform = "body"
    output "%n<body>%c%n</body>"
  else when attribute teiform = "p"
    output "%n<p>" unless ancestor is teiheader
    output "%c"
    output "</p>" unless ancestor is teiheader
  else when attribute teiform = "app"
    set rdgcount to 0
    output "%c"
  else when attribute teiform = "rdg"
    increment rdgcount
    output "%n"
    output "<br>" when rdgcount > 1
    output "<strong>Witness %uv(wit):</strong> %c"
  else
    output "%nUndefined element: %q"
    suppress
  done

Although the preceding script can be considered only a proof of concept (in its current form it includes only the TEIform attribute values that occur in the test document), it is nonetheless clear that:

  1. The script can be extended into a general-purpose TEI transformation engine by adding declarations for all TEI elements (or, rather, all standard TEI TEIform attribute values).
  2. Once so extended, the script can process any unextended TEI document as well as many TEI documents that are extended through the use of the TEIform attribute, as documented in [TEI P3], (Section 29).

Should the modification strategy discussed here become widespread (and this was clearly the intent underlying the creation of the TEIform attribute), designers of general TEI processing tools might wish to build in the appropriate support for processing the TEIform attribute when new GIs are encountered. This strategy would enable a generic TEI transformation or other processing tool to process an arbitrary document that includes a DTD that has been extended in this way without compromising the ability to process standard (unextended) TEI documents. A generic processor obviously cannot anticipate newly-created GIs, but if those GIs are used only to increase structural control during authoring and do not otherwise require special handling, subsequent processing can be controlled through the TEIform attributes, which are not extended during this type of modification.

One might wish, on this basis, to distinguish two types of unclean modifications: those where the uncleanliness is important during authoring but can then be ignored during processing and those where the uncleanliness must be maintained at all stages of the life of the document. The score-like edition project described here is of the first type.

5.2. The Custom DTD Approach

As was demonstrated above, modifying the TEI DTDs is not as difficult as it may appear to those who have never tried, thanks to the developers' creative use of parameter entities and marked sections and the excellent documentation available in [TEI P3]. But the full TEI DTDs, even when only the necessary modules have been selected, will be overkill for many projects, and using a DTD that licenses an element one doesn't need creates an opportunity for error. The preceding section dealt with this problem by redefining TEI content models and attribute declarations to exclude unneeded markup, but a comparable result might also be achieved by creating a TEI-independent minimal custom DTD for use in authoring and then converting the document to the standard TEI DTDs later. This approach does not require any modification to the TEI DTDs because the custom DTD provides the necessary control during authoring, and once authoring is finished, that control is no longer needed.

The custom DTD approach is described in greater detail in [Birnbaum], where it was used in a large troff-to-SGML conversion that was part of a critical edition project. Briefly, the prolog including a custom DTD for the test document used here might look like:

<!-- tei-custom.dtd -->
<!doctype document [
<!element document - - (p)+>
<!element p        - - (witnessa, witnessb, witnessc)>
<!attlist p        id  id #required>
<!element (witnessa | witnessb | witnessc) - - (#pcdata)>
<!attlist witnessa wit cdata #fixed "A">
<!attlist witnessb wit cdata #fixed "B">
<!attlist witnessc wit cdata #fixed "C">
]>

The principal advantage of this approach is the extreme simplicity of the DTD. The TEI header is not included because there is no advantage to authoring it outside the real TEI DTDs. That is, one would encode the TEI header (using the real TEI DTDs) and the body of the textual critical edition (using a custom DTD) separately, transform the body so that it will be TEI-conformant, and then combine the two parts for publication. The custom DTD illustrated here contains no markup that is not used in the test document, but the content models could be expanded to correspond to those found in the standard TEI DTDs, if desired.

Note that the attribute wit is declared as cdata with a fixed value. The cdata declaration corresponds to the value in the TEI DTDs, although an alternative declaration requiring a name (such as id) would also be TEI-compatible, since names are a subset of cdata strings (although they are subject to case folding and character restrictions). As was noted in section 5.1.1, above, the fixed declaration ensures that only the declared value will be permitted, and the use of shorttag in the standard TEI SGML declaration means that the default value does not need to be specified explicitly in the document instance. This strategy allows the user to declare any value that would be acceptable in the TEI DTDs, but it also avoids the opportunity for error by ensuring that the user does not need to specify the value explicitly, and that incorrect values specified explicitly will be caught by the parser.

The test document instance marked up according to this DTD would look like:[6]

<!-- tei-custom.sgml -->
<document>
  <p id="p1">
    <witnessa>First line from witness A</witnessa>
    <witnessb>First line from witness B</witnessb>
    <witnessc>First line from witness C</witnessc>
  </p>
  <p id="p2">
    <witnessa>Second line from witness A</witnessa>
    <witnessb>Second line from witness B</witnessb>
    <witnessc>Second line from witness C</witnessc>
  </p>
</document>

The following Omnimark script will convert the custom version of the test document into the standard TEI version, suitable for combination with a <TEIheader> (with a <TEI.2> root element):

; tei-custom.xom 
; run with omnimark -s tei-custom.xom 
;                   teisgml.dec tei-custom.dtd tei-custom.sgml 
;                   -of tei-custom-standard.sgml
down-translate 
element document 
  output "%n<text>" 
  output "%n   <body>" 
  output "%c" 
  output "%n   <body>" 
  output "%n</text>"
element p 
  output '%n     <p id="%lv(id)">' 
  output "%n       <app>" 
  output "%c" 
  output "%n       </app>" 
  output "%n     </p>"
element (witnessa | witnessb | witnessc )
  output '%n <rdg wit="%uv(wit)">%c</rdg>'

As with the modified TEI DTD approach, above, any SGML-aware tool can be used for the transformation.

5.3. The Architectural Approach

This section discusses the use of SGML architectures to implement the mapping and conversion between a custom DTD and the TEI DTDs. Section 5.3.1 discusses the advantages of architectural processing, section 5.3.2 describes how architectural processing works, and section 5.3.3 illustrates how architectural processing can implemented to support the current project.

5.3.1. The Advantages of Architectural Processing

The custom DTD approach described above allows the user to construct a project-specific DTD that enforces much greater structural control than is available in the standard TEI DTDs. The transformation of the document from markup according to the custom DTD to markup according to the standard TEI DTDs enables the author to take advantage of this structural control when it is needed, viz. during authoring, and then to get it out of the way when custom markup becomes an impediment, viz. during publication and interchange.

The transformation process described above uses a custom Omnimark script to convert the custom document to a standard TEI document, and the advantages of this approach are the simplicity of the custom DTD and the fact that users may employ any scripting language with sufficient power to accomplish the desired transformation. One disadvantage of this approach, however, is that the transformation script must deal with all elements, including those that have the same GIs and attributes in both the custom DTD and the TEI DTDs, although this limitation could be overcome by building into the transformation script a default identity transformation.

A more significant limitation to the transformation process described above is that the relationship between the custom DTD and the TEI DTDs is completely external to the custom DTD itself. Since the mapping between the custom DTD and the TEI DTDs is logically part of the informational value of the custom DTD (that is, the custom DTD is designed with remapping to the TEI DTDs in mind), it is desirable to build that mapping into the DTD itself. This relationship could be expressed through comments in the custom DTD, but 1) this approach is not obligatorily formalized, 2) the completeness of the mapping cannot be validated automatically, and 3) the TEI version of the document that results from the implementation of the mapping cannot be validated without performing the conversion and then validating the output separately. That is, the validity of the remapped document against the TEI DTDs is not inherent in the document itself.

SGML architectural forms provide a mechanism for formalizing the mapping between a custom DTD (called a "document DTD") and a TEI DTD (called an "architectural DTD"). The principal informational advantage of architectural processing over the custom DTD strategy described above is that architectural processing integrates into the document DTD itself the identity of the architectural DTD, information about how an architectural processor can access the architectural DTD, and information about how markup in the document DTD is associated with markup in the architectural DTD. The principal practical advantage of architectural processing is that an architectural processor can validate a document against both the document DTD and the architectural DTD simultaneously, and some architectural engines can even generate an output document that implements the associations between the two DTDs as transformations. In other words, architectural engines of this type are capable of converting a document marked up with a document DTD into a new document with the same basic content, but with the original markup replaced by markup taken from the architectural DTD. Architectural forms are not an all-purpose transformation mechanism, and their transformational power is considerably less than that of Omnimark and other scripting languages that are common in SGML environments, but, as is shown below, they are fully capable of supporting the associations required by the critical edition project described here.

Simons observes that architectural processing provides an alternative to the TEI notion of clean and unclean modifications, described in section 5.1.2, above. According to this new model, a custom document that employs a TEI DTD as an architectural DTD may be considered architecturally cleanly conformant if the document in question is valid with respect to both the document DTD and the architectural DTD. As Simons notes, unlike with the TEI definition of clean and unclean modifications, architecturally clean conformance can be validated in one step with an architectural parser. ([Simons], Section 6) Furthermore, this interpretation avoids the imbalance between clean subset and clean superset modifications, since a document can be architecturally clean only in one way. The architectural implementation illustrated below is architecturally clean, which is to say that the test document is valid against both the document DTD and the TEI architectural DTD.

5.3.2. How Architectural Processing Works

As is shown below, architectural processing, like modifying the TEI DTDs, is not as complicated a procedure as it may appear to those who have never tried it. Excellent introductions include [Kimber1], [Kimber2], [Clark1], and [Clark2]. The architectural form standard is defined in [ISO10744] and a convenient set of links to additional information is available at [Cover].

Briefly, architectural processing requires the following steps (illustrated in full in section 5.3.3, below):

  1. Create a document DTD incorporating the necessary custom markup.
  2. Create an architectural DTD (in this case, a standard TEI DTD) that does not require additional declarations in the internal DTD subset. For the present TEI purposes, the most convenient approach is to use the TEI Pizza Chef ([Pizza]) to create a single DTD that declares the prose base and the text-critical apparatus. For the test project described here, this file was saved under the name tei-textcrit-pizza.dtd.
  3. Insert the following markup into the DTD subset of your document:
    <?IS10744 ArcBase tei                             >
    <!entity % teidtd system "tei-textcrit-pizza.dtd" >
    <!notation tei system                             >
    <!attlist #notation tei
              arcDocF  name  #fixed TEI.2
              arcFormA name  #fixed tei
              arcDTD   cdata #fixed "%teidtd"         >
    
    The system entity refers to the standard TEI DTD created in the preceding step. The first attribute (arcDocF) identifies the root element of the standard TEI DTD as <TEI.2>. The second attribute (arcFormA) assigns the name tei to the architectural DTD attribute (that is, it says that a global attribute with the name tei will be used to identify the element in the standard TEI DTD that corresponds to each new element in the document DTD). The third attribute (arcDTD) identifies the architectural DTD as the standard TEI DTD declared earlier.
  4. Declare an attribute called tei for each element that must be mapped to a different element in the TEI architectural DTD as follows:
    <!element witnessa - - (#pcdata)            >
    <!attlist witnessa wit cdata   #fixed "A"
                       tei nmtoken #fixed "rdg" >
    
    This example says that the element <witnessa> in the document DTD will be mapped to the element <rdg> in the architectural DTD.
  5. To parse the document against both the document and architectural DTDs using SP ([SP]), type "nsgmls -s -A tei filename". "-A tei" tells the parser to use the architecture named tei when parsing the document and "filename" represents the name of the document file.
  6. To convert the document from the custom markup declared in the document DTD to the markup declared in the TEI architectural DTD using SP, type "sgmlnorm -A tei filename". The command line arguments have the same values as above, but sgmlnorm outputs an SGML document, and the "A" switch tells it that the output should incorporate markup from the architectural document, rather than the original input document.

This procedure is illustrated below.

5.3.3. The Use of Architectural Processing with TEI Critical Editions

One useful feature of architectural forms is that GIs that correspond in the document and architectural DTDs are mapped automatically. This means that it is not necessary to specify architectural attributes for such elements, which can greatly simplify the creation of the document DTD. To take advantage of this feature, one needs to assign the same GIs to corresponding elements wherever possible. For this reason, all GIs in the following document DTD are borrowed from the architectural TEI DTD except where they correspond to new elements. In the custom DTD strategy discussed in section 5.2, there was no necessary advantage to implementing this type of correspondence, although one could create a script that performs the same type of default mapping that occurs in architectural processing (which would be done in Omnimark, for example, by using the default [implied] element rule).

The custom DTD strategy described in Section 5.2 created only the eventual content of the <text> element of the TEI version of the document, under the assumption that the <teiHeader> element could be authored separately and then combined with the <text> element after the latter was generated from the custom document. This strategy is inconvenient under the architectural approach because this approach validates the document against both DTDs simultaneously, and a TEI document without a <teiHeader> element is invalid. For this reason, the most convenient authoring strategy within the architectural approach involves authoring the entire document (including the <teiHeader>) using the document DTD.

For the present project, it was important to restrict the content of the eventual TEI <text> element as much as possible as a way of preventing the inadvertent use of unwanted markup. On the other hand, there was no desire to restrict the content of the <TEIHeader> element. This means that the document DTD, which would be used for authoring, needed to include all features of the <TEIHeader>, but very little of the original content of the <text> element.

As a shortcut to including the <teiHeader> markup in the document DTD, a monolithic DTD was generated for a TEI independent header by creating the following empty TEI document:

<!-- tei-header.sgml -->
<!doctype ihs public "-//TEI P3//DTD Auxiliary Document Type:
        Independent TEI Header//EN" [
]>

and then using the spam tool from the SP suite to run "spam -pp tei-header.sgml > tei-header.dtd". The output of this procedure was then edited by hand to remove the doctype declaration line at the top of the file and the "]>" at the bottom. As was noted above, because the <teiHeader> element and all its content elements in the document DTD will have the same GIs as in the architectural DTD, it is not necessary to edit the <teiHeader> DTD to declare tei architectural attributes for them explicitly.

The resulting DTD fragment supports all elements in the standard <teiHeader> but none of the content that is specific to the <text> element. It also does not define a root <TEI.2> element. This means that the user must define the root element and non-header markup, which can be designed to support only what is needed for a particular project. Because some elements may be used in both the <teiHeader> and the <text>, it is safest to create entirely new elements where any special constraints are required inside the <text>, rather than to redefine existing elements. For example, it might be useful for this project to define paragraphs in the body of the document as consisting entirely of <app> elements, but paragraphs also occur in the <teiHeader>, where it is important that they be able to support their usual content. Conflicts are avoided by a type of name-spacing, where the string "djb-" is prefixed to the original GI of corresponding original TEI elements. For example, <djb-p> is a replacement for the standard <p> element in the <text> (actually, <djb-text>), while the standard TEI <p> element remains available within the <teiHeader> element. Because no standard TEI element begins with the string "djb-", this strategy ensures that new GIs will not conflict with existing ones.

The following is a possible SGML prolog (doctype declaration and internal DTD subset) for the architectural version of the test file:

<!-- tei-architecture.dtd                                   -->
<!doctype tei.2 [
<!-- magic incantation to support architectural processing  -->
<!-- uses "tei" as architecture name                        -->
<!-- incorporates TEI DTD with prose and text crit modules  -->
<?IS10744 ArcBase tei                                         >
<!entity % teidtd system "tei-textcrit-pizza.dtd"             >
<!notation tei system                                         >
<!attlist #notation tei
          arcDocF       name        #fixed TEI.2
          arcFormA      name        #fixed tei
          arcDTD        cdata       #fixed "%teidtd"          >
<!-- header markup is all declared in a separate file       -->
<!entity % tei-header system "tei-header.dtd"                 >
%tei-header;
<!-- root and all non-header markup declared here           -->
<!-- all new elements are prefixed "djb-" and bear "tei"    -->
<!-- attributes to define architectural mapping             -->
<!element tei.2         - - (teiheader,djb-text)              >
<!element djb-text      - - (djb-body)                        >
<!attlist djb-text
          tei           nmtoken     #fixed "text"             >
<!element djb-body      - - (djb-p)+                          >
<!attlist djb-body 
          tei           nmtoken     #fixed "body"             >
<!element djb-p         - - (djb-app)                         >
<!attlist djb-p
          id            id          #required
          tei           nmtoken     #fixed "p"                >
<!element djb-app       - - (djb-wita, djb-witb, djb-witc)    >
<!attlist djb-app
          tei           nmtoken #fixed "app"                  >
<!element (djb-wita | djb-witb | djb-witc)
                        - - (#pcdata)                         >
<!attlist djb-wita
          wit           cdata   #fixed "A"
          tei           nmtoken #fixed "rdg"                  >
<!attlist djb-witb
          wit           cdata   #fixed "B"
          tei           nmtoken #fixed "rdg"                  >
<!attlist djb-witc 
          wit           cdata   #fixed "C"
          tei           nmtoken #fixed "rdg"                  >
]>

The test document instance marked up according to the preceding DTD looks as follows:

<!-- tei-architecture.sgml -->
<tei.2>
  <teiheader>
    <filedesc>
      <titlestmt>
        <title>TEI Critical Edition Test Document, Architectural Version</title>
      </titlestmt>
      <publicationstmt>
        <p>Unpublished.</p>
      </publicationstmt>
      <sourcedesc>
        <p>Original test document created 2000-03-10 by djb.</p>
      </sourcedesc>
    </filedesc>
  </teiheader>
  <djb-text>
    <djb-body>
      <djb-p id="p1">
        <djb-app>
          <djb-wita>First line from witness A</djb-wita>
          <djb-witb>First line from witness B</djb-witb>
          <djb-witc>First line from witness C</djb-witc>
        </djb-app>
      </djb-p>
      <djb-p id="p2">
        <djb-app>
          <djb-wita>Second line from witness A</djb-wita>
          <djb-witb>Second line from witness B</djb-witb>
          <djb-witc>Second line from witness C</djb-witc>
        </djb-app>
      </djb-p>
    </djb-body>
  </djb-text>
</tei.2>

The document DTD ensures that only newly-defined elements may occur outside the <teiHeader>, while the architectural attributes ensure that these new elements are associated with standard TEI elements.

Using the SP toolkit, the document may be parsed simultaneously against both the document and architectural DTDs using nsgmls by typing "nsgmls -A tei -s tei-architecture.dtd tei-architecture.sgml" (where "tei-architecture.dtd" represents the prolog file illustrated above and "tei-architecture.sgml" represents the instance file). The instance may be converted to standard TEI markup by typing "sgmlnorm -A tei tei-architecture.dtd tei-architecture.sgml".

6. Conclusions

This section summarizes the conclusions that emerge concerning several related problems posed at the beginning of and during the course of this report.

6.1. Criteria for Deciding Whether to Encode Information as GIs, Attributes, or Data Content

If one marks up witness variants as <rdg> elements distinguished by the value of the wit attribute, the principal strategy available in the standard TEI DTDs, it is impossible to use an SGML parser to ensure that 1) each witness appears exactly once in each <app> element, 2) the witnesses occur in a consistent and specific order, and 3) where reading groups are employed, the witnesses fall inside the desired <rdgGrp> element within the <app> element. This limitation can be overcome by representing the names of witnesses not as attribute values (as in the TEI wit attribute) or data content (as in the TEI <wit> element), but as GIs.

As was noted above, there are advantages in structural control to changing the declarations in the standard TEI DTDs for the sigil attribute of <witness> elements (from cdata to id) and the wit attribute of <rdg> elements (from cdata to idrefs). This control can be combined with the strategy of creating new GIs for each witness, should that be desired, although it becomes less necessary once new GIs have been declared, since the new content models of the DTD already ensure better control over witness identification than would be available from the id/idrefs mechanism.

If we now return to the first three strategies for representing witness names listed in section 1, above (GIs, attributes, and data content, respectively), we can conclude that the first (GIs) provides the most structural control, the second (attributes) can enforce some coordination between readings and witnesses (especially if the attribute type is changed from cdata to idref or idrefs), and the third (data content) provides no significant structural control (it can ensure that an element exists to hold a witness identifier, but it cannot validate the specific data content of that element at all).

6.2. Different Requirements for DTDs during Authoring and Subsequent Processing

Very tight structural control may be desirable during authoring, but much looser control is often completely satisfactory for subsequent processing. This suggests that instead of assuming that a single DTD will be used for all purposes, it might be profitable to employ a strict DTD for authoring and a flexible one for interchange and subsequent processing. This observation extends the XML philosophy that one can do many useful things with a structured document without accessing a formal DTD. Another way to look at this issue is that the DTD is important during authoring because it constrains the types of documents that may be created. Once a document has been created, it can have only one type, which is the type implemented in the specific document instance itself. A processor that has to deal with such a document will have no need to know about all the other structures it might have had.

6.3. Strategies for Addressing the Different Requirements of Authoring and Subsequent Processing

As was noted above, there are inherent contradictions between the need for DTDs that provide appropriate structural control during authoring for specific projects, on the one hand, and the need for DTDs that are flexible enough to enable a community of users to exchange files without requiring special accommodations. This contradication can be resolved by modifying a communal DTD (such as the TEI DTDs) in a way that enforces authoring control but still permits the resulting document to be processed without needing to revise the tools to accommodate the modifications directly. Alternatively, the contradication can be resolved by using one DTD for authoring and then transforming the document instance so that it conforms to a different DTD before publication or other processing. This transformation can be performed with an arbitrary scripting language or with SGML architectural processing.

6.4. Strengths and Weaknesses of Score-Like Critical Editions

From a publishing perspective, score-like critical editions address two problems that are widespread in traditional critical editions: incomplete presentation of the evidence and compromised legibility. From an SGML engineering perspective, score-like critical editions make it possible to develop project-specific DTDs that provide much better structural control than is available through the standard TEI approach. On the other hand, encoding for score-like editions does not distinguish formally between situations where there is textual variation and situations where there is not; similarly, where there is variation, encoding for score-like editions does not provide a formal record of which witnesses agree.

A compromise approach might view a score-like edition as a presentational view that can be generated from one of the standard TEI models. This is true, but it turns out that two of the three TEI methods are also not capable of formalizing all details of variation.

6.5. Structural Control Required during the Preparation of Score-Like Textual Critical Editions

A robust authoring environment for score-like critical editions requires that all witnesses be represented in all sections (unless they are designed to be omissable), that no witness occur more than once, that all witnesses occur in a particular order, and that where reading groups (<rdgGrp>) are used, witnesses occur only in the appropriate reading groups. These issues cannot be controlled purely through SGML with standard TEI parallel segmentation editions because only GIs (and not attributes) can enforce these requirements, and the ability to associate a reading with multiple witnesses makes it impossible (or, at least, grossly impractical) to try to replace attribute value witness identifiers with GIs. A score-like edition, on the other hand, because it does not permit a single reading entry to be associated formally with multiple witnesses, provides an opportunity to implement additional structural control features by designing an appropriate DTD.

6.6. Mechanisms for Extending the TEI DTDs and Overcoming the Limitations Inherent in Unclean Modification

Clean subset modified TEI DTDs differ from clean superset TEI DTDs in that only the former produce document instances that show no trace of having been created with modified DTDs. Although the declaration of new elements combined with restrictions on the content models of standard elements creates what the TEI Guidelines ([TEI P3]) call an unclean modification, the resulting document instance can nonetheless be processed with unmodified TEI-aware scripts. This observation can be extended to all unclean modifications where the superset aspects of the new DTD are required only during authoring, and not during subsequent processing.

6.7. Mechanisms for Performing Transformations from Non-TEI DTDs to TEI DTDs

Some projects may be authored with very small specific DTDs, after which the document instances may be converted so that the custom markup is replaced with standard TEI markup. Designing a custom DTD is relatively easy, especially if it is used only for the material that will be included in the TEI <text> element, with the <teiHeader> authored separately. The principal disadvantages of this method are that the mapping from the custom DTD to the TEI DTDs is external to the document and that the eventual TEI document can be validated only after the transformation has been performed.

6.8. Mechanisms for using Architectural Processing to Create Formal Associations between Non-TEI DTDs and TEI DTDs, and to Transform the Former to the Latter

Although SGML architectures are not as powerful as Omnimark and other scripting languages commonly used for SGML processing, they are fully capable of implementing the types of associations required by current project. Architectural tools can validate a document simultaneously against both the document DTD and the architectural DTD, and can also output a new version of a document with the original markup replaced by the corresponding markup from the architectural DTD.

6.9. General Conclusions

Any of the three strategies discussed in section 5, above (processing a modified TEI DTD with respect to TEIform attribute values, [section 5.1], transformation of a custom DTDs to a TEI structure [section 5.2], and architectural forms [section 5.3]) provides a solution to the issues posed by a score-like edition. Specifically, these strategies all permit much greater structural control than is available in the standard TEI DTDs, rely entirely on SGML for all validation, and produce a final document that is fully TEI-conformant.


Notes

[1] The TEI DTDs actually also permit the simultaneous use of multiple bases (a "mixed" base), which means that it is technically possible to create a universal TEI DTD that includes all markup available in all TEI DTD modules. This, in turn, means that any TEI document created without modifying existing TEI components (see below) should be able to be parsed against this universal TEI DTD.

[2] This warning is not formalized, which is to say that it is not possible to determine unambiguously when deletion, renaming, extension, and modification have become so extensive that one can no longer claim TEI compatibility. While the text states explicitly that deleting all TEI definitions would not produce a TEI-conformant document, it is surely the case that deleting all but one such definition, or all but two, etc. would also not be considered conformant practice.

[3] For reasons explained immediately below, the TEI DTDs do not strictly require the inclusion of <witness> elements for each witness; in fact, they do not require the inclusion of a <witList> element at all. The <witList> element has been omitted from the examples in this paper to save space, but this would not normally happen with real critical editions.

[4] The principal impediment to revising the official TEI DTDs to change the declarations of the sigil and wit attributes to id and idrefs, respectively, is that this change could render some existing documents invalid. In particular, some editors may wish to employ sigla that begin with digits (such as years), and because attributes of type id are names, which must begin with name start characters, and digits are not name start characters according to the standard TEI SGML declaration, a value for name attributes that began with a digit would not be valid. Other issues that arose during TEI development discussions of this question included sigillum references of the type "c-e" (to indicate witnesses c, d, and e) and the convenience of including annotations (such as question marks to indicate uncertainty) within the wit attribute (C. M. Sperberg-McQueen, personal communication).

In retrospect, since the <wit> element is already available as an alternative to the wit attribute and can answer the needs described above, it seems particularly unfortunate that the utility of the sigil and wit attributes was compromised through the cdata declarations. An alternative solution might have involved providing an opportunity for editors who wish to employ sigla that are not valid SGML names to use id-type sigil and idrefs-type wit attributes purely for internal control, which could involve, for example, tagging a witness that one would like to call "1643" as <witness sigil="witness1643" n="1643"> (using the global cdata attribute n). An application could then render the vernacular name by accessing the cdata value of the n attribute, while the system could validate the relationship between the witnesses in the <witList> element and those cited in readings through the SGML id/idrefs mechanism.

Because names (including id and idrefs attribute values) are case-insensitive (using the standard TEI SGML declaration), while cdata is case-sensitive, authors who change the default type of the sigil and wit attributes will need to monitor the consistency of their case usage. Authors do not normally need to be consistent in case usage when authoring using attribute values that are names, but if the document is then distributed with a standard TEI DTD, where the values that were authored as names come to be published as cdata, they will all be valid, but the ESIS (element structure information set) output of cdata attributes that differ in case will also differ in case. A user can choose to ignore this mismatch during subsequent processing (which will have to be handled separately from SGML validation in any case, since SGML tools cannot validate correspondences between cdata attributes) or by normalizing the case usage of the name-type attributes before changing DTDs (for example, with a tool such as sgmlnorm from the SP toolkit).

[5] This is not intended to suggest that a DTD is always needed only during editing. For example, during processing one might wish to identify not only the attribute value that has been associated with a particular element, but also the universe of possible values from which a particular one was chosen.

[6] The assignment of id attributes to <p> elements is left to the editor, but the DTD and script could easily be revised to prohibit the editor from specifying an id value explicitly and require the script to assign consecutive numerical values automatically.


Acknowledgements: I am grateful to David Mundie, Casey Palowitch, and Elizabeth Shaw for comments on an earlier version of this paper.


Works Cited

[Birnbaum]
David J. Birnbaum. In press. "A TEI-Compatible Edition of the Rus' Primary Chronicle." To be published in Medieval Slavic Manuscripts and SGML: Problems and Perspectives, (Anisava Miltenova and David J. Birnbaum, ed). Sofia: Institute of Literature, Bulgarian Academy of Sciences, Marin Drinov Publishing House. Preprint available at <http://clover.slavic.pitt.edu/~djb/sgml/pvl/article.html>.
[Burnard]
Lou Burnard. July 1995. "Organization of the TEI scheme." <http://users.ox.ac.uk/~archive/teij31/ORGN.htm>. Part of Text Encoding for Information Interchange. An Introduction to the Text Encoding Initiative. TEI Document no TEI J31. <http://users.ox.ac.uk/~archive/teij31/index.htm>.
[Clark1]
James Clark. n.d. "Architectural Form Processing." <http://www.jclark.com/sp/archform.htm>.
[Clark2]
James Clark. 1996. "Architecture engine examples." <http://www.oasis-open.org/cover/clarkArchSamp.html>.
[Cover]
Robin Cover. n.d. "Architectural Forms and SGML/XML Architectures." <http://www.oasis-open.org/cover/topics.html#archForms>.
[DeRose]
Steven J. DeRose. 1997. The SGML FAQ Book. Boston: Kluwer.
[ISO10744]
International Organization for Standardization. 1997. "Architectural Form Definition Requirements (AFDR)," Annex A.3 of ISO/IEC N1920, Information Processing--Hypermedia/Time-based Structuring Language (HyTime), Second edition 1997-08-01. <http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html>.
[Kimber1]
W. Eliot Kimber. n.d. "A Tutorial Introduction to SGML Architectures. <http://www.isogen.com/papers/archintro.html>.
[Kimber2]
W. Eliot Kimber. 1996. "Architectural Processing with Spam." <http://www.oasis-open.org/cover/kimberArchSPAM.html>.
[Pizza]
The Pizza Chef: a TEI Tag Set Selector. <http://www.hcu.ox.ac.uk/TEI/newpizza.html>
[Simons]
Gary F. Simons. December 1998. "Using Architectural Processing to Derive Small, Problem-Specific XML Applications from Large, Widely-Used SGML Applications." SIL Electronic Working Papers 1998-006. Originally presented at Markup Technologies '98, Chicago, 19-20 Nov 1998. <http://www.sil.org/silewp/1998/006/SILEWP1998-006.html>.
[SP]
James Clark. n.d. "SP An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language" <http://www.jclark.com/sp/>
[TEI P3]
Guidelines for Electronic Text Encoding and Interchange. C. M. Sperberg-McQueen and Lou Burnard, ed. Chicago and Oxford: Text Encoding Initiative. 16 May 1994. Revised reprint May 1999 (including corrections affecting markup for textual critical editions) available on line at <http://www.hcu.ox.ac.uk/TEI/Guidelines/>.