Online Cultural Heritage Research Environment

OCHRE

Integrating Data

OCHRE is a vehicle for integrating data from multiple projects while preserving the terminology and conceptual distinctions employed in each project. It does not impose a standardized nomenclature or recording system. Instead, it provides a common, integrative structure at the level of fundamental spatial, temporal, and logical relationships that apply to all cultural heritage projects.

This is possible because OCHRE takes a hierarchical, item-based approach to organizing cultural heritage data as opposed to a tabular, class-based approach (see the “Understanding OCHRE” page of this website). The item-based approach is a powerful means of integrating data from disparate sources. The class-based approach, in which a project’s database consists of a set of tables, each of which represents a predefined class of information, is widely used by cultural heritage researchers. But it is difficult to merge data from one class-based database into another because the number and types of classes (tables) usually vary from one database to the next; and even when there are similar classes in two different databases, the predefined properties (columns) of the classes often do not match. In OCHRE this problem is avoided because table rows are decomposed into individual items with their own properties, and the grouping and recombination of items is done by matching their individual properties. In addition, the spatial and logical hierarchies used by one project can be integrated with the hierarchies of another project, enabling the construction of a larger database that contains the information from both projects while remaining coherently organized in a predictable fashion.
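The decomposition of table rows into items can be illustrated with a minimal Python sketch. It is not OCHRE's implementation: the two project tables, their column names, and the helper functions decompose and find_items are hypothetical, chosen only to show how items from tables with mismatched columns can be regrouped by an individual property.

```python
# Two hypothetical project tables with different columns (class-based layout).
project_a_pottery = [
    {"id": "A-1", "ware": "red slip", "locus": "L12"},
    {"id": "A-2", "ware": "plain", "locus": "L14"},
]
project_b_finds = [
    {"find_no": "B-7", "material": "ceramic", "ware": "red slip", "period": "Iron II"},
]


def decompose(project, rows, key_column):
    """Turn each table row into an item that carries its own properties."""
    items = []
    for row in rows:
        properties = {k: v for k, v in row.items() if k != key_column}
        items.append(
            {"project": project, "item_id": row[key_column], "properties": properties}
        )
    return items


def find_items(items, prop, value):
    """Select items by one property, regardless of the table they came from."""
    return [item for item in items if item["properties"].get(prop) == value]


# Items from both projects live in one collection; no shared table layout is needed.
items = decompose("A", project_a_pottery, "id") + decompose("B", project_b_finds, "find_no")
red_slip = find_items(items, "ware", "red slip")
print([item["item_id"] for item in red_slip])  # prints ['A-1', 'B-7']
```

Because each item keeps only the properties it actually has, the "period" column of the second table does not have to exist in the first one.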

Local Database Schemas and OCHRE’s Global Schema

In OCHRE the information from multiple projects is managed by an XML database server that can store and query large quantities of semistructured hierarchical data with great efficiency. OCHRE serves as a central data repository that can receive information from multiple heterogeneous data sources, each of which conforms to a “local schema” (e.g., a set of relational tables), and can integrate the local databases via a more general “global schema” expressed in XML. OCHRE’s global schema is the “Archaeological Markup Language” (ArchaeoML), which is described in detail on the “ArchaeoML Schema” page of this website.

By converting data from its local database schema into the global ArchaeoML schema, a project can share its information and place it within a larger framework without erasing its distinctive features. A project can continue to use its own local database and periodically import it into OCHRE, or it can use OCHRE as its primary database, entering new information directly into OCHRE’s global format while adhering to its own terminology and recording system.
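As a rough illustration of converting a row from a local relational schema into an XML item, the following Python sketch uses the standard library's xml.etree.ElementTree. The element and attribute names (item, properties, property, sourceTable) are placeholders invented for this example; they are not the actual ArchaeoML vocabulary, which is documented on the “ArchaeoML Schema” page.

```python
import xml.etree.ElementTree as ET


def row_to_item(table_name, row, key_column):
    """Convert one relational row into a small XML item element.

    The element names used here are hypothetical, not real ArchaeoML.
    """
    item = ET.Element("item", {"id": str(row[key_column]), "sourceTable": table_name})
    props = ET.SubElement(item, "properties")
    for column, value in row.items():
        if column == key_column:
            continue  # the key becomes the item id, not a property
        prop = ET.SubElement(props, "property", {"name": column})
        prop.text = str(value)
    return item


# A hypothetical row from a project's local "finds" table.
row = {"find_no": 7, "material": "ceramic", "period": "Iron II"}
item = row_to_item("finds", row, "find_no")
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The project's own column names survive as property names, which reflects the point that conversion to the global format need not erase local terminology.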

Data Publication and Data Security

A project’s data does not become publicly available until it is explicitly released by the project for viewing and querying by OCHRE users who are not members of the project. A project designates its members by assigning specific privileges to named OCHRE users for each of its data categories, determining who can view or update data in a given category.

A project’s data may be isolated from other projects, being visible only to project members, or it may be made available for public use. After a project makes some or all of its data public, other projects may establish links to it, integrating those database items directly into their own item categories. For example, a project’s taxonomy of properties or its thesaurus of synonymous terms might be used by other projects that want to avoid reinventing the wheel. Data from a group of related projects can be queried together, enhancing the possibilities for research. This capability is needed by archaeologists, for example, who must search for parallels across multiple excavation sites.
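The access rules described in this section can be sketched in Python as follows. This is only a model of the stated behavior, under the assumption that privileges are limited to "view" and "update" and that public release grants viewing but never updating; the Project class and its method names are hypothetical and do not describe OCHRE's internal design.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    # category -> {user name -> set of privileges ("view", "update")}
    privileges: dict = field(default_factory=dict)
    # categories the project has explicitly released to non-members
    public_categories: set = field(default_factory=set)

    def grant(self, user, category, *rights):
        """Assign privileges to a named user for one data category."""
        self.privileges.setdefault(category, {}).setdefault(user, set()).update(rights)

    def release(self, category):
        """Explicitly make one category publicly viewable."""
        self.public_categories.add(category)

    def can_view(self, user, category):
        if category in self.public_categories:
            return True
        return "view" in self.privileges.get(category, {}).get(user, set())

    def can_update(self, user, category):
        # Assumption: public release never implies update rights.
        return "update" in self.privileges.get(category, {}).get(user, set())


dig = Project("Hypothetical Dig")
dig.grant("alice", "pottery", "view", "update")
dig.grant("bob", "pottery", "view")
dig.release("taxonomy")

print(dig.can_view("guest", "taxonomy"))  # released category is visible to anyone
print(dig.can_view("guest", "pottery"))   # unreleased category stays private
```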

Semantic Integration via Thesauri

There are three stages involved in the integration of heterogeneous databases:

Syntactic integration is accomplished by using XML text as the standardized data transport medium.

Schematic integration involves mapping the structure of each source database onto a global schema, which is provided by OCHRE’s generalized XML database structure. This can be done in a largely automated fashion using OCHRE’s data import utility, which converts data tables from the source database into hierarchies of ArchaeoML documents and elements. The data import utility is not yet available.

Semantic integration is the third and most difficult stage. It requires thesaurus relationships to be established between related terms in each source database. Different cultural heritage databases may use different human languages, as well as different terms for the same thing within one language. The semantic ranges of these terms often do not coincide exactly but overlap in complex ways. The matching of terms from one project to another must therefore be done by an expert human being because the nuances of meaning in a given context are often very subtle.

Indeed, thesaurus construction is itself a work of scholarship; there is no single canonical thesaurus that can serve all purposes for all time. For this reason, OCHRE permits the creation of multiple overlapping thesauri and allows each thesaurus to be credited to a particular author or project. Users will choose for themselves whose thesaurus to employ when doing automated queries that span multiple projects, and they will have the option of creating their own.
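The effect of choosing one thesaurus rather than another can be shown with a small Python sketch. The thesauri, terms, and items below are invented for illustration, and the model is deliberately simplified: it treats linked terms as plain equivalents, whereas real semantic ranges overlap in the more complex ways described above.

```python
# Each hypothetical thesaurus is credited to an author and groups terms
# that this author treats as equivalent for query purposes.
thesauri = {
    "Author A": [{"jar", "storage jar", "pithos"}],
    "Author B": [{"jar", "storage jar"}, {"pithos", "dolium"}],
}

# Items from three hypothetical projects, each using its own terminology.
items = [
    {"project": "P1", "id": "P1-3", "type": "pithos"},
    {"project": "P2", "id": "P2-9", "type": "storage jar"},
    {"project": "P3", "id": "P3-1", "type": "dolium"},
]


def expand(term, thesaurus):
    """Return the term plus every term the chosen thesaurus links to it."""
    expanded = {term}
    for group in thesaurus:
        if term in group:
            expanded |= group
    return expanded


def query(items, term, thesaurus):
    """Find items across projects whose type matches the expanded term set."""
    terms = expand(term, thesaurus)
    return sorted(item["id"] for item in items if item["type"] in terms)


print(query(items, "pithos", thesauri["Author A"]))  # prints ['P1-3', 'P2-9']
print(query(items, "pithos", thesauri["Author B"]))  # prints ['P1-3', 'P3-1']
```

The same query term returns different items depending on whose thesaurus is used, which is why the choice is left to the user.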

OCHRE and the CIDOC CRM

The International Committee for Documentation (Comité international pour la documentation, or CIDOC) of the International Council of Museums has produced a detailed “Conceptual Reference Model” (CRM). The CIDOC CRM is “a formal ontology intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information.” It has the potential to do this because “it describes in a formal language the explicit and implicit concepts and relations relevant to the documentation of cultural heritage,” and thus provides “a common and extensible semantic framework that any cultural heritage information can be mapped to.”

The CIDOC CRM is an ontology; it is not itself a database system or even a database schema. In contrast, OCHRE is a database system whose schema is expressed in XML as the “Archaeological Markup Language” (ArchaeoML), which is described in detail on the “ArchaeoML Schema” page of this website. Insofar as ArchaeoML provides an ontology, this ontology is much simpler and more abstract than the CIDOC CRM, which consists of hundreds of concepts and relationships. ArchaeoML (and hence OCHRE) provides only a small number of standardized structures in which relatively few concepts and relationships are predefined. OCHRE projects use these standardized structures to construct lists and hierarchies of database items that represent the entities of interest to them, and to construct taxonomies of item properties and thesaurus relationships between property names. If the spatial, temporal, and logical relationships represented by the built-in OCHRE structures are not sufficient to capture a project’s information, that project can define other kinds of relationships among its database items and can use these project-defined relationships to construct its own conceptual models.
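The difference between a built-in hierarchy and a project-defined relationship can be sketched in Python. The Item class, the "abuts" relation, and the sample items are hypothetical; the sketch only assumes that items form a containment hierarchy and may additionally carry named links that a project defines for itself.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    # built-in structure: hierarchical containment
    children: list = field(default_factory=list)
    # project-defined links: relation name -> list of target items
    links: dict = field(default_factory=dict)

    def add_child(self, child):
        self.children.append(child)
        return child

    def relate(self, relation, target):
        """Record a relationship whose name is chosen by the project."""
        self.links.setdefault(relation, []).append(target)

    def paths(self, prefix=""):
        """Yield the hierarchical path of this item and all its descendants."""
        path = f"{prefix}/{self.name}"
        yield path
        for child in self.children:
            yield from child.paths(path)


site = Item("Site")
area = site.add_child(Item("Area A"))
wall = area.add_child(Item("Wall 1"))
floor = area.add_child(Item("Floor 2"))
wall.relate("abuts", floor)  # a relation outside the containment hierarchy

for path in site.paths():
    print(path)
```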

OCHRE projects can therefore implement the CIDOC CRM, if they wish, using some or all of its predefined concepts and relationships as the basis for their project-defined items, item properties, and inter-item relationships. From OCHRE’s point of view, the CIDOC CRM is a “local schema” like any database schema in which the names of cultural heritage entities and relationships are predefined. In general, any cultural heritage ontology or controlled vocabulary (taxonomy or thesaurus) can be implemented in OCHRE.

A Central Data Repository versus Metadata Harvesting

OCHRE serves as a central repository (or “data warehouse”) that receives information from multiple sources and delivers that information back to researchers in various combinations. At present, this function is implemented by means of a single database server that is hosted by the University of Chicago. In the future, OCHRE could be distributed across multiple XML servers, if the number of projects and the volume of data warrant it. To OCHRE users, however, the physical location of the data will not be relevant. OCHRE will always appear as a single data repository that integrates information from participating projects.

This aspect of OCHRE’s design reflects the research purpose it is intended to serve. Other approaches to “federating” or “mediating” between disparate databases in a looser fashion have been proposed. But if researchers want a powerful query capability, as many do, then their data must be fully integrated within a comprehensive database structure that makes this possible. Furthermore, academic users, unlike business users, do not need on-the-fly federation or mediation of local data sources. Genuine integration with a powerful query capability is more important than live updates from source databases.

An example of a mechanism for loosely federating, as opposed to tightly integrating, heterogeneous databases is the XML-based protocol developed by the Open Archives Initiative for sharing “metadata” (data about data) that describes the contents of diverse databases available on the Internet. This protocol enables automated “metadata harvesting” from local sources, which is valuable for finding out what kinds of information are stored in these databases, but it is not intended to accomplish the powerful queries of complexly organized information across multiple projects for which OCHRE was developed.
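To make the contrast concrete, a harvesting request in the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is an ordinary HTTP request whose parameters name a verb and a metadata format. The Python sketch below only builds such a request URL; the repository address is a placeholder, and the helper name list_records_url is invented for this example. The ListRecords verb and the metadataPrefix, set, and from parameters belong to the protocol, as does the oai_dc (Dublin Core) format.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # optional: restrict to one set
    if from_date:
        params["from"] = from_date      # optional: only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder repository address, not a real endpoint.
url = list_records_url("https://repository.example.org/oai", from_date="2006-01-01")
print(url)
```

The response to such a request is a list of descriptive metadata records, which tells a harvester what a source holds but does not integrate the underlying items into one queryable structure.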

[Last revised on February 27, 2006.]

University of Chicago