Online Cultural Heritage Research Environment

OCHRE

ArchaeoML Schema

The following assumes familiarity with XML, XML Schema, and database concepts. You can use OCHRE very effectively without knowing these details. This page is provided for those who wish to know more about the underlying structure of the OCHRE database.

Twenty XML document types, defined via the World Wide Web Consortium’s “XML Schema” specification, provide the formal specification of OCHRE’s database structure. These interrelated document types, each with its hierarchy of elements and attributes, comprise the “Archaeological Markup Language” (ArchaeoML) created by David Schloen.

From the point of view of database structure, the ArchaeoML document types are analogous to relational tables; thus an XML document (data object) that is an instance of a particular ArchaeoML document type is comparable to a record (a row or tuple) of a relational table. The “key” field of a relational record is analogous to the UUID (Universally Unique Identifier), a globally unique value that is stored as an attribute of each ArchaeoML document’s root element. These unique document identifiers serve as keys for creating database “joins” between ArchaeoML documents.
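The analogy between UUIDs and relational keys can be illustrated with a minimal Python sketch. It is not taken from the ArchaeoML schema: the attribute name uuid, the link element, its targetUuid attribute, and the sample identifiers are all assumptions made for illustration. The sketch indexes documents by the identifier on their root element and then follows a link from one document to another, which is the essence of a join.

```python
import xml.etree.ElementTree as ET

# Two hypothetical ArchaeoML-style documents. The names "uuid", "link" and
# "targetUuid" are illustrative assumptions, not the actual schema.
SPATIAL_UNIT_XML = """
<spatialUnit uuid="3f2c1a9e-0000-4000-8000-000000000001">
  <name>Room 12</name>
  <link targetUuid="3f2c1a9e-0000-4000-8000-000000000002"/>
</spatialUnit>
"""

RESOURCE_XML = """
<resource uuid="3f2c1a9e-0000-4000-8000-000000000002">
  <name>Photograph of Room 12</name>
</resource>
"""


def build_index(xml_sources):
    """Map the UUID on each document's root element to the parsed root."""
    index = {}
    for source in xml_sources:
        root = ET.fromstring(source)
        index[root.get("uuid")] = root
    return index


def join(index, uuid):
    """Return the documents that the document with this UUID links to."""
    root = index[uuid]
    return [index[link.get("targetUuid")] for link in root.findall("link")]


index = build_index([SPATIAL_UNIT_XML, RESOURCE_XML])
linked = join(index, "3f2c1a9e-0000-4000-8000-000000000001")
print([doc.findtext("name") for doc in linked])
```

The dictionary lookup plays the role that a primary-key index plays in a relational database.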

ArchaeoML employs a hierarchical, item-based approach (discussed in the “Understanding OCHRE” page of this website) that is completely generalizable and extensible, facilitating the integration of data from diverse projects. The main design principles underlying ArchaeoML are that the number of element types should be kept to a minimum and recursive nesting of the same element type within itself should be used wherever possible. This allows recursive programming techniques to be used in software that manipulates the XML data, and it allows the software to be as modular as possible.

Recursive Hierarchies

Many of the hierarchies in OCHRE are recursive; in other words, any item in the hierarchy can be the “parent” of other items of the same kind, and there is no limit on the depth of the hierarchy. For example, the spatial hierarchies in OCHRE’s “Locations & Objects” category are recursive. Each database item in this category represents a spatially situated unit of observation, such as a standing monument or an excavated artifact, but the spatial scale of the unit can be as large or small as needed. A recursive spatial hierarchy might contain geographical regions at the highest level, then archaeological sites within each region at the next level, then stratigraphic and architectural units within each site, and then artifacts and other excavated finds at the lowest level. From a structural perspective, these are recursively organized spatial units of the same type. Each project is free to set up its hierarchies with as many levels as needed to represent the particular scales of observation and semantic distinctions used by that project.
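As a rough illustration of why same-type nesting suits recursive programming, the following Python sketch walks a nested hierarchy with a single function that calls itself at every level. The XML layout, the name attribute, and the sample unit names are invented for this example and should not be read as the actual ArchaeoML structure.

```python
import xml.etree.ElementTree as ET

# A made-up hierarchy: region > site > stratum > find. Every level uses the
# same element type, so one recursive function can handle any depth.
TREE_XML = """
<spatialUnit name="Region">
  <spatialUnit name="Site A">
    <spatialUnit name="Stratum 3">
      <spatialUnit name="Pottery sherd 41"/>
    </spatialUnit>
    <spatialUnit name="Stratum 4"/>
  </spatialUnit>
</spatialUnit>
"""


def walk(unit, depth=0):
    """Yield (depth, name) for a unit and every unit nested inside it."""
    yield depth, unit.get("name")
    for child in unit.findall("spatialUnit"):
        yield from walk(child, depth + 1)


def max_depth(unit):
    """Count the levels in the hierarchy rooted at this unit."""
    children = unit.findall("spatialUnit")
    return 1 + max((max_depth(child) for child in children), default=0)


root = ET.fromstring(TREE_XML)
for depth, name in walk(root):
    print("  " * depth + name)
print("levels:", max_depth(root))
```

Because the function makes no assumption about how many levels exist, a project could add or remove levels without any change to the code.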

Texts that use complex writing systems (e.g., Egyptian hieroglyphs) are represented in OCHRE by the same kind of recursive hierarchies. The techniques used to construct spatial hierarchies are also used for linguistic hierarchies. A database item in OCHRE’s “Texts & Dialogues” category can contain an epigraphic hierarchy or a discourse hierarchy, or both. The epigraphic hierarchy represents the physical structure of a text in terms of its division into “epigraphic units” at various levels (e.g., sections, columns, lines, and individual signs or graphemes). The discourse hierarchy represents the meaningful structure of the text in terms of its division into “discourse units” at various levels (e.g., paragraphs, sentences, clauses, phrases, words, and morphemes). Here is a case where cross-cutting links that connect the items in one hierarchy to those in another are also important. A discourse unit (e.g., a word) must be linked to the epigraphic units (physical signs) to which it refers.
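The relationship between the two hierarchies can be sketched in Python as two independent trees plus cross-cutting links stored on the discourse side. The Unit class, its field names, and the placeholder labels are hypothetical and chosen only to show the idea; the real documents are XML, not Python objects.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Unit:
    """A node in either hierarchy (hypothetical structure for illustration)."""
    uid: str
    label: str
    children: List["Unit"] = field(default_factory=list)
    # Used by discourse units only: ids of the epigraphic units they refer to.
    sign_ids: List[str] = field(default_factory=list)


def flatten(unit: Unit) -> Iterator[Unit]:
    """Yield a unit and all of its descendants, depth first."""
    yield unit
    for child in unit.children:
        yield from flatten(child)


def signs_for(discourse_unit: Unit, epigraphic_root: Unit) -> List[str]:
    """Resolve a discourse unit's links to labels in the epigraphic tree."""
    by_id = {unit.uid: unit for unit in flatten(epigraphic_root)}
    return [by_id[sign_id].label for sign_id in discourse_unit.sign_ids]


# Epigraphic hierarchy: one line containing three signs.
line = Unit("e1", "line 1", children=[
    Unit("e2", "sign 1"), Unit("e3", "sign 2"), Unit("e4", "sign 3"),
])

# Discourse hierarchy: one sentence containing two words, each linked to signs.
sentence = Unit("d1", "sentence 1", children=[
    Unit("d2", "word 1", sign_ids=["e2", "e3"]),
    Unit("d3", "word 2", sign_ids=["e4"]),
])

for word in sentence.children:
    print(word.label, "->", signs_for(word, line))
```

Keeping the links separate from the containment structure lets each hierarchy be subdivided on its own terms while still tying words to the signs that write them.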

Recursive hierarchies are used in many other places in OCHRE. Most users will find them to be intuitively understandable and will have little difficulty in using them to represent their own data. The OCHRE pilot projects have found these hierarchies to be an effective means of managing information about various kinds of cultural heritage entities. This reflects the widespread use of hierarchical structures to organize information, which may be related ultimately to the central role of recursion in the human faculty of language.

General-Purpose versus Purpose-Specific Schemas

XML is often used to accomplish purpose-specific exchanges of data from one computer to another, but that is not how it is used here. The twenty XML document types that make up ArchaeoML constitute a “global schema” for cultural heritage information. XML documents are also often used to represent real-world documents (e.g., articles, invoices, and books). In ArchaeoML, however, they are used in a more abstract fashion to represent fundamental data entities within a normalized database structure (i.e., a database in which a given piece of information exists in only one location and is linked to related information in an optimal way, eliminating inefficient and error-prone redundancies).

In this regard, it is worth contrasting ArchaeoML with the XML schema developed by the Text Encoding Initiative (TEI). The TEI schema defines a purpose-specific data-exchange format, not a general-purpose database structure. But the latter is what is needed to create a truly integrative and efficiently searchable digital resource. Less abstract XML documents that conform to purpose-specific schemas (e.g., for displaying a dictionary article or a table of archaeological observations) can easily be exported from a general-purpose database for further processing using other software. Such documents, including TEI-conformant documents, are not built into ArchaeoML’s permanent structure, but can be dynamically generated as needed.
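The idea of generating a purpose-specific document on demand from a normalized store can be sketched as follows. This Python example is only an analogy: it uses in-memory dictionaries rather than ArchaeoML documents, produces a CSV table rather than TEI, and all identifiers and values are invented.

```python
import csv
import io

# Normalized storage: each name is stored once and referenced by identifier.
# All identifiers and values below are made up for illustration.
UNITS = {"u1": "Locus 101", "u2": "Locus 102"}
VARIABLES = {"v1": "Material", "v2": "Count"}
OBSERVATIONS = [
    ("u1", "v1", "Ceramic"),
    ("u1", "v2", "12"),
    ("u2", "v1", "Bone"),
]


def export_table(units, variables, observations):
    """Generate a flat CSV table on demand from the normalized data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["unit", "variable", "value"])
    for unit_id, variable_id, value in observations:
        # Resolve each identifier at export time; nothing is duplicated
        # in the stored data.
        writer.writerow([units[unit_id], variables[variable_id], value])
    return buffer.getvalue()


table = export_table(UNITS, VARIABLES, OBSERVATIONS)
print(table)
```

The exported table repeats names such as "Locus 101" on several rows, but that redundancy exists only in the generated output, not in the underlying store.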

Types of Information Represented

The XML element hierarchies defined in ArchaeoML represent the following types of cultural heritage information and the many interrelationships among them:

1. Archaeological descriptions, consisting of observations about ancient landscapes (roads, canals, fields), settlement sites (architecture, stratigraphy, botanical and faunal remains), and artifacts (including the physical properties and contexts of inscribed artifacts).

2. Geographical descriptions, consisting of observations about geographical regions and ancient environments (topography, climate, hydrology, vegetation). Archaeological and geographical descriptions include not just alphanumeric data but also visual resources such as photographs, video clips, drawings, maps, and 3-D models.

3. Language descriptions, especially lexicons of ancient languages, but potentially also phonological, morphological, and syntactic descriptions.

4. Script descriptions, consisting of information about writing systems and the graphic signs used in them.

5. Text descriptions, consisting of the epigraphic and linguistic characteristics of ancient texts and scripts, including sign-by-sign transliterations, normalized transcriptions, grammatical analyses, and modern-language translations.

6. Research results, consisting of secondary literature (e.g., technical reports, interpretive discussions, and bibliographies) organized by author and by modern conceptual categories. The thematic organization of this secondary literature provides a framework within which archaeological, geographical, textual, and linguistic descriptions can be located.

Main Document Types

In terms of specific ArchaeoML document types, spatially situated units of observation (“Locations & Objects”) are represented by recursive trees of SpatialUnit documents; i.e., each spatial unit can contain any number of other spatial units. These tree structures, which are implemented using Tree documents, represent the spatial containment relationships among units of observation. A SpatialUnit document itself can contain any number of observation elements, representing multiple observations of the same unit. An observation element can contain any number of property elements describing the properties of a unit of observation.

A Resource document represents a digital resource of some kind. A digital resource can be an “internal document” consisting of formatted hypertext stored internally within the database, or it can be an “external resource” such as an image, map, video clip, or written document in HTML or PDF format. Interpretive discussions, technical reports, and other secondary literature that exists in digital form are all represented by Resource documents. A Bibliography document stores a bibliographic entry. In contrast to the Resource document type, which represents digital resources, a Bibliography document refers to a non-digital printed resource. Resources of various kinds are associated with units of observation by links between SpatialUnit and Resource documents. Photographs and two- or three-dimensional maps are represented by Resource documents linked to SpatialUnit documents.

If a unit of observation is an inscribed artifact, then it is also represented by a Text document, cross-referenced with the relevant SpatialUnit document. A Text document itself points to two Tree documents that represent: (1) a recursive tree of EpigraphicUnit documents, where each epigraphic unit is a physical component of the text corresponding to some level of analysis within a hierarchy of physical subdivisions (e.g., section, column, line, or sign); and (2) a recursive tree of DiscourseUnit documents, where each discourse unit is a meaningful component of the text corresponding to some level of analysis within a hierarchy of linguistic subdivisions (e.g., paragraph, sentence, clause, phrase, word, or morpheme).

A DictionaryUnit document represents one or more dictionary entries; an entire dictionary is represented by a Tree document that organizes many DictionaryUnit documents. By using Set documents to represent named sets of ArchaeoML documents of any type (e.g., Resource, SpatialUnit, Text), including query result sets, one can organize primary archaeological and philological data and secondary literature under various topical headings. Alternatively, ArchaeoML documents can be organized into conceptual hierarchies via Tree documents. The result is a way of organizing information similar to that found in traditional printed catalogues and encyclopedias.
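The containment of observation elements within a SpatialUnit document, and of property elements within each observation, can be illustrated with a short Python sketch. The attribute names (uuid, date, variable, value) and the sample data are assumptions for illustration only; the actual element content is defined in SpatialUnit.xsd.

```python
import xml.etree.ElementTree as ET

# A hypothetical SpatialUnit document with two observations of the same unit.
# Attribute names and values are invented for this example.
SPATIAL_UNIT_XML = """
<spatialUnit uuid="unit-001">
  <observation date="2005-07-01">
    <property variable="Material" value="Ceramic"/>
    <property variable="Length (cm)" value="4.2"/>
  </observation>
  <observation date="2005-07-15">
    <property variable="Condition" value="Fragmentary"/>
  </observation>
</spatialUnit>
"""


def read_observations(xml_text):
    """Return one {variable: value} dictionary per observation element."""
    root = ET.fromstring(xml_text)
    return [
        {prop.get("variable"): prop.get("value")
         for prop in observation.findall("property")}
        for observation in root.findall("observation")
    ]


observations = read_observations(SPATIAL_UNIT_XML)
for number, properties in enumerate(observations, start=1):
    print(f"observation {number}: {properties}")
```

Each observation keeps its own set of variable-value pairs, so repeated observations of one unit can record different properties without overwriting one another.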

Main Elements

The following is an alphabetical list of the main element types in ArchaeoML, with a brief description and a link to the document type in which the element is defined.

| Element Name | Description | Document Type |
|---|---|---|
| booleanExpression | a boolean expression (evaluating to true or false) that forms part of the query criteria for retrieving a set of ArchaeoML documents | Query |
| criteria | query criteria for retrieving a set of ArchaeoML documents | Query |
| dictionaryUnit | one or more dictionary entries | DictionaryUnit |
| discourseUnit | a discourse unit that represents a meaningful component of a text, as understood by a modern editor (e.g., the whole text, a sentence, a phrase, a word, a morpheme, etc.) | DiscourseUnit |
| epigraphicUnit | an epigraphic unit that represents a physical component of a text (e.g., the whole text, a section, a column, a line, a single sign, etc.) | EpigraphicUnit |
| observation | data recorded from one observation of a spatial unit | SpatialUnit |
| period | a temporal period of any duration | Period |
| person | a person or organization (including living researchers and historical or fictional persons and organizations) | Person |
| predefinition | a predefined group of properties (variable-value pairs) that can be used to describe an item | Predefinition |
| project | a research or publication project that involves specific sets of spatial units, texts, resources, variables, etc. | Project |
| property | a property of an item, consisting of a variable-value pair | SpatialUnit etc. |
| relationship | a project-defined relationship between two entities (represented by two ArchaeoML documents) | Relationship |
| resource | a digital resource, either internal hypertext or an external file (image, video clip, document, etc.), usually linked to SpatialUnit documents and other types of documents | Resource |
| spatialUnit | a spatially situated unit of observation (i.e., a location or object) | SpatialUnit |
| text | a text written in a complex script that has been transliterated, transcribed, and/or translated | Text |
| tree | a hierarchy of ArchaeoML documents of a given type (e.g., SpatialUnit documents, for which the hierarchy represents spatial containment) | Tree |
| variable | a variable used to describe an item | Variable |
| value | a non-numeric value of a nominal or ordinal variable | Value |

XML Schema Documentation and Source Files

The following is an alphabetical list of the ArchaeoML document types with links to full documentation and source files. These document types are still under development and are subject to change until the official release of ArchaeoML version 1.0.

| Schema Documentation | Schema Source |
|---|---|
| Bibliography.html | Bibliography.xsd |
| DictionaryUnit.html | DictionaryUnit.xsd |
| DiscourseUnit.html | DiscourseUnit.xsd |
| EpigraphicUnit.html | EpigraphicUnit.xsd |
| Map.html | Map.xsd |
| Period.html | Period.xsd |
| Person.html | Person.xsd |
| Predefinition.html | Predefinition.xsd |
| Project.html | Project.xsd |
| Query.html | Query.xsd |
| Relationship.html | Relationship.xsd |
| Resource.html | Resource.xsd |
| ScriptUnit.html | ScriptUnit.xsd |
| Set.html | Set.xsd |
| SpatialUnit.html | SpatialUnit.xsd |
| Style.html | Style.xsd |
| Text.html | Text.xsd |
| Tree.html | Tree.xsd |
| Value.html | Value.xsd |
| Variable.html | Variable.xsd |

[Last revised on February 28, 2006.]

University of Chicago