

Online Cultural Heritage Research Environment

OCHRE

Understanding OCHRE
Control of Data by Participating Projects

Centralized data storage does not imply centralized control of the data. Each participating project manages its own data and controls the way that it is entered, organized, and viewed. OCHRE does not force projects to use a predetermined nomenclature and it does not impose a rigid mode of organizing information. Moreover, OCHRE does not present itself as a single, anonymous authority. Individual projects and researchers are identified by name and are given credit for the data and interpretations they have entered.

Normalized Databases versus Text Documents

OCHRE is a normalized database system. This means that each piece of information is stored only once and the individual pieces are retrieved and combined as needed in whatever way is suitable for the task at hand. This approach requires that a project’s data be subdivided into small units with identifying “keys” that allow the individual units to be located accurately and efficiently. In OCHRE, each database item belongs to one of twenty basic types that collectively represent cultural heritage information of all kinds, including archaeological evidence and ancient texts that use complex writing systems.

The great advantage of a normalized database is that information is represented consistently, without duplication or ambiguity. This is in contrast to the storage of information in collections of text files (digital formats that mimic ordinary printed text, with all of the repetitions and ambiguities characteristic of such documents). Much of the information available on the World Wide Web is in the form of text documents, which are easy to create and are readily displayed in a Web browser. Search engines like Google can extract useful results from text documents, but more powerful and efficient queries depend on a normalized database.
Moreover, a normalized database system can itself construct text documents as needed for specific purposes, as is often done behind the scenes in sophisticated websites.

Data on the Web: XML versus HTML

The “Extensible Markup Language” (XML) brings together the world of text documents and the world of normalized databases. The World Wide Web is based on the “Hypertext Markup Language” (HTML), which provides a standardized mechanism for indicating how text documents should be displayed on users’ computers. HTML does this by inserting special “tags” surrounded by angle brackets. HTML tags are used to mark up the text with instructions about headings, paragraphs, lists, styles, and other aspects of how the document is to be formatted. HTML tags can also indicate “hyperlinks” that connect pieces of text in the current document to other locations in the same document or in documents stored elsewhere on the Internet. Following such links from one document to another has become a very common experience for people who use the World Wide Web.

Some HTML tags were originally intended to represent the semantic structure of a document rather than its appearance, but HTML is not sufficiently consistent in this regard for software that reads it to interpret the semantic structure reliably. This is one reason why XML was subsequently developed. XML bears a superficial resemblance to HTML, insofar as it defines a standardized document format for tagged text. But XML is “extensible” in a way that HTML is not, because XML tags are not predetermined. HTML has a limited set of predefined tags that are useful for specifying how a document should be presented to the user and how it should be linked to other documents. XML provides a mechanism to define new tags that describe the semantic content of a document without regard to its presentation style.
This feature of XML is very powerful because it allows the XML tagged-text format to be used as a general-purpose means of describing data structures of all kinds, from relatively unstructured text documents intended to be read by human beings to highly structured databases. XML is used in OCHRE to represent and integrate a wide range of data types in a very flexible manner.

Here is an example of a simple XML document that describes a set of books. The tags in a document such as this indicate semantic distinctions but say nothing about how the information is to be presented. The information can be formatted in many different ways depending on the need at hand. HTML documents can be easily generated from XML documents in order to display the information in Web browsers, but this is only one of the possible ways the XML data might be presented.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <books>
      <book id="Book_1">
        <title>XML: The Annotated Specification</title>
        <author>Bob DuCharme</author>
        <publisher>Prentice Hall</publisher>
        <year>1999</year>
      </book>
      <book id="Book_2">
        <title>XML Bible, 2nd Edition</title>
        <author>Elliotte Rusty Harold</author>
        <publisher>Hungry Minds, Inc.</publisher>
        <year>2001</year>
      </book>
      <book id="Book_3">
        <title>XQuery: The XML Query Language</title>
        <author>Michael Brundage</author>
        <publisher>Addison-Wesley</publisher>
        <year>2004</year>
      </book>
    </books>

Java User Interface versus HTML Browser Interface

As a plain text format, XML has the same virtues as HTML as a vehicle for transmitting information on the Internet. XML documents can be easily sent across the Internet from one computer to another using the “Hypertext Transfer Protocol” (HTTP), just like HTML documents. (“Plain text” here means that XML files are encoded at the level of binary digits using the Unicode standard for representing text characters or, more specifically, the UTF-8 or UTF-16 encoding forms.)
But while HTML is usually displayed on the receiving end by means of Web browser software, there is no need for an XML-based system to be restricted by the limited user interface features a browser provides. Unlike many online databases, OCHRE is not a browser-based system. The OCHRE Java software runs independently of the user’s Web browser. In fact, OCHRE has an HTML browser feature of its own so that websites can be displayed within its user interface along with other kinds of data. OCHRE bypasses the user’s Web browser and employs HTTP to transmit XML documents directly to and from a central database server.

In the past, a browser-based solution was a practical necessity because HTML browser software such as Microsoft’s Internet Explorer was the common vehicle that all users could be expected to possess. In recent years, however, Java has matured to the point where it has become a universal platform for Internet-based systems. The Java Runtime Environment is preinstalled on almost all computers sold today and can be easily upgraded at no charge. Java enables a more powerful user interface than an HTML browser can provide; indeed, many of the features of OCHRE would not be possible without using Java.

Semistructured versus Relational Databases

For twenty-five years, the relational data model has dominated database design. In this approach, information is represented in terms of relations between items (“entities”) and their properties (“attributes”). A relation is usually shown in tabular form. Each table column represents a different property and each row represents an item; thus each cell of the table at the intersection of a row and a column contains the value of a property for a given item, as shown in this schematic example:

[Schematic table: one row per item, one column per property, one value per cell.]
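As a minimal sketch of the tabular layout just described, the following Python snippet models a relation as rows of items and columns of properties. The item names, property names, and values are invented for illustration and are not taken from OCHRE or from the original schematic; the helper `cell` is likewise hypothetical.

```python
# Hypothetical relation: each row is an item, each column is a property.
# All names and values below are illustrative assumptions.
columns = ("item", "type", "color", "length_cm")
rows = [
    ("Item_1", "bowl", "red", 12.5),
    ("Item_2", "jar", "brown", 30.0),
    ("Item_3", "bead", "blue", 0.8),
]


def cell(rows, columns, item, prop):
    """Return the value at the intersection of an item's row and a property's column."""
    col = columns.index(prop)
    for row in rows:
        if row[0] == item:
            return row[col]
    raise KeyError(item)


print(cell(rows, columns, "Item_2", "color"))  # brown
```

Every row must supply a value for every column, which is the rigidity the following paragraphs discuss.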
This highly structured way of organizing information has many advantages, but it is not well suited to representing complex hierarchies or relatively unstructured textual information. As it happens, the study of cultural heritage is rife with complicated spatial and logical hierarchies and loosely structured texts. Archaeologists and textual scholars need a data model that conforms to their research, rather than being forced to squeeze their data into an inappropriate relational mold.

For this reason, OCHRE uses the semistructured data model rather than the relational data model. (The concept of “semistructured data” is discussed in Data on the Web: From Relations to Semistructured Data and XML, by S. Abiteboul et al. [San Francisco: Morgan Kaufmann, 2000].) Semistructured data is characterized by hierarchical tree structures as opposed to relational table structures and by flexible links that cut across the tree structures to connect database items in nonhierarchical ways. XML is particularly good at representing semistructured data, while it is also capable of representing highly structured relational tables. That is why database specialists have concluded that a primary function of XML is “information integration.” XML can be used to integrate data that is organized in different ways, encompassing relational databases of various kinds as well as loosely structured text documents.

The growing importance of the semistructured data model and of XML as a means to implement it is apparent in the recent inclusion of XML within major database products such as Oracle. XML has moved from specialized database software to the mainstream. Large numbers of XML documents (semistructured data objects) can now be stored, indexed, and queried efficiently within powerful database management systems that support the nonproprietary XML standards published by the World Wide Web Consortium (e.g., “XML Schema” and “XML Query”).
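The two features of semistructured data named above, hierarchical trees and links that cut across them, can be sketched with Python's standard `xml.etree.ElementTree` module. The element names (`project`, `site`, `building`, `artifact`, `text`, `link`) and the `ref` attribute are assumptions made up for this example; they are not OCHRE's actual XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: nesting expresses the spatial hierarchy, and a
# <link> element points across the tree to an item in another branch.
doc = """
<project>
  <site id="Site_1" name="Example Site">
    <building id="Bldg_1" name="Building A">
      <artifact id="Art_1" name="Inscribed tablet">
        <link ref="Text_1"/>
      </artifact>
    </building>
  </site>
  <texts>
    <text id="Text_1" name="Tablet inscription"/>
  </texts>
</project>
"""

root = ET.fromstring(doc)

# Index every identified element by its key, and record each element's parent.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_to(el):
    """Walk up the tree and return the named ancestors as a readable path."""
    names = []
    while el is not None and el.get("name"):
        names.append(el.get("name"))
        el = parent_of.get(el)
    return " > ".join(reversed(names))


def linked_items(el):
    """Follow cross-tree links from an item to the items they reference."""
    return [by_id[link.get("ref")] for link in el.findall("link")]


artifact = by_id["Art_1"]
print(path_to(artifact))  # Example Site > Building A > Inscribed tablet
print([t.get("name") for t in linked_items(artifact)])  # ['Tablet inscription']
```

The hierarchy answers "where does this item sit?", while the link answers "what else is it related to?" without forcing both items into the same branch.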
An Item-Based Approach

OCHRE makes full use of XML’s ability to represent semistructured data. The basic building blocks of OCHRE’s database structure are individual items organized into lists and hierarchies. Each database item represents an entity pertaining to the study of cultural heritage, such as an artifact, a site, a written document, a bibliographic reference, a researcher, and so on. The many diverse entities to be represented are organized by grouping database items into a limited number of general categories, such as “Locations & Objects,” “Persons & Organizations,” and “Texts & Dialogues.” These categories and the use of hierarchies within each category provide the structure needed for the efficient storage and retrieval of cultural heritage information. Hierarchies of items represent in a rigorous and consistent way the spatial, temporal, and logical relationships among items. Furthermore, the ability to cut across item hierarchies (and across categories) by creating links between individual items (e.g., between an artifact and a text, or a researcher and an archaeological site) ensures that OCHRE can faithfully represent the entities and relationships studied by textual scholars, archaeologists, and others interested in the world’s cultural heritage.

This hierarchical, item-based approach differs from the tabular, class-based approach commonly used in archaeological databases. The usual approach has been to define a set of item classes (e.g., debris layers, architectural features, ceramic artifacts, metal artifacts, stone artifacts, faunal remains, botanical remains, etc.) and to create a data table for each class. An individual item is represented as a row in a particular table. The table columns represent properties that describe the items, such as “type,” “color,” “length,” etc. These kinds of data tables are easy to work with using commercial relational database software, which helps to explain the popularity of this approach.
(It is worth noting that relational databases do not require a class-based organization of information, although they are well suited to it. The hierarchical, item-based approach can be implemented using relational database software, as was done in an earlier Windows application that serves as a prototype for OCHRE.)

The class-based approach is simple and straightforward, but it requires that each observed entity be placed within a predefined class in the database; moreover, the properties that describe the entity are limited to the columns predefined in that class’s table. It is not easy to add or delete a property for a particular item because doing so affects an entire table column and hence an entire class of items. This makes it difficult to capture the variability and complexity of the entities to be described, especially in archaeology, in which there are few common standards for how items should be described. As a result, observations are forced into a rigid tabular mold, relegating information about idiosyncrasies or unusual properties to unstructured notes. More fundamentally, in the class-based approach the classification of the data is determined in advance by the computer recording system instead of allowing for multiple overlapping classifications that might emerge from later analysis and comparison of the units of observation.

In the item-based approach, on the other hand, the basic structural unit of the database is not the class of items but the individual item as a unit of observation, on whatever spatial scale that unit is defined (e.g., a geographical region, a site, a building, an artifact, or a part of an artifact). Classes are not built into the database structure ahead of time but are generated by queries on the properties of individual items. Any number of properties can be used to describe a particular item, and new properties can be added as needed.
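To make the contrast concrete, here is a small Python sketch of the item-based idea: each item carries its own open-ended set of properties, and a "class" is simply the result of a query. The `Item` type, the `select` helper, and all property names and values are hypothetical illustrations, not part of OCHRE.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A unit of observation with an open-ended set of properties (hypothetical)."""
    id: str
    properties: dict = field(default_factory=dict)


items = [
    Item("Item_1", {"material": "ceramic", "color": "red", "length_cm": 12.5}),
    Item("Item_2", {"material": "metal", "weight_g": 40}),
    Item("Item_3", {"material": "ceramic", "glaze": "green"}),
]

# A new property can be added to one item without touching any other item.
items[1].properties["corrosion"] = "heavy"


def select(items, **criteria):
    """Generate a class on demand: all items whose properties match the criteria."""
    return [
        item for item in items
        if all(item.properties.get(key) == value for key, value in criteria.items())
    ]


ceramics = select(items, material="ceramic")
print([item.id for item in ceramics])  # ['Item_1', 'Item_3']
```

Because the grouping is computed rather than stored, the same items can fall into several overlapping classes depending on which properties a later query asks about.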
In addition to properties, each item can be linked to any number of other database items representing observed entities or external resources such as photographs, maps, video clips, and so on.

Note: The hierarchical, item-based approach employed in OCHRE is described in David Schloen’s article “Archaeological Data Models and Web Publication Using XML,” Computers and the Humanities 35 (2001): 123–52. This article is out of date in some respects. It was written before the OCHRE project was begun, but it explains the basic design principle that has informed the development of OCHRE.

[Last revised on March 9, 2006.]

University of Chicago

For full documentation of the design and operation of the system, see the OCHRE manual, from which the following material is excerpted.

XML Database Structure and Java User Interface

OCHRE consists of both a database structure and a user interface for entering data and performing queries. The database structure is specified using the “Extensible Markup Language” (XML), which has become the standard format for disseminating data on the Internet (for more details, see the “ArchaeoML Schema” page of this website). The user interface software is written in the Java programming language, which enables it to run under a wide variety of computer operating systems, including Windows, Macintosh, Linux, and Solaris. OCHRE makes use of the “Java Web Start” mechanism to launch the user interface from an ordinary Web link; for example, by clicking the “Start OCHRE” link on the “Getting Started” page of this website.

The OCHRE user interface communicates via the Internet with an XML database server at the University of Chicago, permitting users to enter data from their own projects and to view and query data from other projects to which they have been given access. The information stored in the central database is organized in such a way that it can be retrieved with a high degree of consistency, efficiency, and flexibility.