Springer, 2005. — 275 p.
This book focuses on the technologies and methodologies that enable better use of metadata within spatial data infrastructures. In particular, it centers on three main problems that hinder the correct use of metadata:
The high volume of geographic resources and the difficulty of cataloguing them correctly. Although many geographic resources have been created in recent decades in a rather anarchic way (and usually with no associated documentation), it is commonly possible to identify groups of related resources among them. There are collections or aggregations of geographic resources (or datasets) that can be considered a single entity from a general point of view. Most of these collections arise from the fragmentation of geographic resources into datasets of manageable size and similar scale. Creating metadata at this upper level of collections compensates, in no small degree, for the lack of documentation of their components. In addition, the hierarchical identification of collections and sub-collections (which can be organized in nested structures) facilitates organization within a data repository. Mirroring this physical organization, catalogs should provide mechanisms to catalog collections of related resources, thus facilitating both navigation and the creation of metadata.
The diversity and heterogeneity of metadata standards. Over the last decade, in response to the uncontrolled spread of geographic resources (and, in general, of all types of multimedia objects) encoded in disparate formats, many organizations (standardization bodies, software vendors, etc.) started initiatives to define metadata standards that would enable common understanding within a community of users. However, despite this initial intention, the diversity of initiatives also produced an undesired heterogeneity. Nowadays, most of these initiatives have converged on a well-defined international standard for each application domain. Despite this convergence, there is still a need to facilitate interoperability between different metadata standards. On the one hand, legacy metadata developed over many years cannot simply be thrown away. On the other hand, visibility across different application domains is necessary to facilitate the reuse of resources. Spatial data infrastructures and geolibraries (digital libraries specialized in geographic resources) are usually asked to provide a summary view (e.g., Dublin Core metadata) of their specific geographic metadata (e.g., ISO 19115) that is understandable by the general public or by discovery agents.
The heterogeneity of metadata content. Content heterogeneity is the problem of recognizing that the values given to a metadata element in two different metadata records denote the same concept even though they use different terms. When a metadata element is constrained to a predefined list of values, there is no room for heterogeneity. But if the domain (datatype) of a metadata element is free text, misunderstandings may appear. In fact, this problem is independent of the metadata schema used: it may be hard to recognize that two metadata records describe the same resource even when they use the same schema. This implies that catalog discovery services cannot be implemented solely as simple word matching between user queries and the metadata records stored in the catalog. Discovery services should move from basic data retrieval strategies towards information retrieval strategies. Data retrieval consists mainly of determining which records in the catalog contain the words specified in the user query, which very frequently is not enough to satisfy the user's information need (Baeza-Yates and Ribeiro-Neto, 1999). In contrast, information retrieval is concerned more with retrieving information about a subject than with retrieving data that exactly satisfies a given query. Information retrieval systems usually deal with natural language text, which is not always well structured and may be semantically ambiguous. Thus, integrating selected information retrieval techniques into metadata catalogs would help to interpret the sense of the users' vocabulary and to link these meanings to the underlying concepts expressed by metadata records.
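The difference between word matching and concept-based matching can be illustrated with a minimal Python sketch. It is not the method developed in the book: the thesaurus, the concept identifiers, the sample records, and all function names are hypothetical and serve only to show why a query term may miss a record that uses a synonym.

```python
# Hypothetical mini-thesaurus: each term points to a concept identifier.
# Real thesauri are far richer; this is an assumption for illustration.
THESAURUS = {
    "river": "c:watercourse",
    "stream": "c:watercourse",
    "watercourse": "c:watercourse",
    "road": "c:road",
    "highway": "c:road",
}

# Hypothetical free-text metadata values, keyed by record identifier.
RECORDS = {
    "rec1": "stream network of the Ebro basin",
    "rec2": "highway map of Aragon",
}


def tokens(text):
    """Split a free-text value into lowercase words."""
    return text.lower().split()


def word_match(query, records):
    """Data retrieval: keep records that contain every query word literally."""
    wanted = set(tokens(query))
    return sorted(rid for rid, text in records.items()
                  if wanted <= set(tokens(text)))


def concepts(text, thesaurus):
    """Map the words of a text to the concepts known to the thesaurus."""
    return {thesaurus[t] for t in tokens(text) if t in thesaurus}


def concept_match(query, records, thesaurus):
    """Concept-based retrieval: keep records sharing a concept with the query."""
    wanted = concepts(query, thesaurus)
    if not wanted:
        return []
    return sorted(rid for rid, text in records.items()
                  if wanted & concepts(text, thesaurus))
```

With these sample records, a query for "river" finds nothing by word matching, whereas concept matching returns the record that mentions "stream", because both terms point to the same hypothetical concept.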
Therefore, the objective of this work is to offer proposals for extending the capabilities of a metadata catalog infrastructure in three main aspects: the support of collections, the interoperability among different metadata standards, and the incorporation of information retrieval techniques. As depicted in figure 0.1, under a catalog interface layer we propose:
A solution for the management of nested collections. A Metadata Knowledge Base will be used as the basis of the catalog system infrastructure. The main features of this knowledge base are that it will support different metadata standards and, above all, that it will facilitate the management of collections of related resources. The metadata records describing the items of a collection are very similar. This work will investigate how to model and exploit the aggregation relations that may be established between the metadata records describing the items and the record describing the entire collection. The hypothesis is that an appropriate modeling of these aggregation relations will enable the inference of meta-information, avoid redundant information, and reveal new ways of browsing and monitoring collections of resources.
A process for the construction of crosswalks between metadata standards. Crosswalks can be defined as the mechanisms or systems that transform metadata conforming to a source standard into the corresponding metadata conforming to a target standard. Thanks to crosswalks, it will be possible to develop discovery services that search effectively across heterogeneous metadata holdings, i.e., crosswalks enable metadata interoperability.
The use of selected vocabularies (disambiguated thesauri) and information retrieval techniques to improve the performance of catalog discovery services. This work will present a heuristic method for the semantic disambiguation of thesauri that are later used to populate some metadata elements. These disambiguated thesauri will be used for the sense-based indexing of metadata records, thus enabling the application of classic information retrieval methods in the implementation of discovery services.
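The idea of a crosswalk, as defined in the second proposal above, can be sketched as a declarative mapping applied to a source record. The following Python fragment is only a minimal illustration under stated assumptions: the source element paths are simplified, hypothetical stand-ins rather than actual ISO 19115 element names, the target names merely resemble Dublin Core elements, and the functions are not taken from the book.

```python
# Hypothetical crosswalk table: simplified source element path -> target element.
# The paths are illustrative assumptions, not real ISO 19115 element names.
CROSSWALK = {
    "identification.title": "title",
    "identification.abstract": "description",
    "identification.keywords": "subject",
    "contact.organisation": "publisher",
}

# Hypothetical source record in a nested, geographic-style schema.
SOURCE_RECORD = {
    "identification": {
        "title": "Ebro basin hydrography",
        "abstract": "River network of the Ebro basin",
        "keywords": ["hydrography", "rivers"],
    },
    "contact": {},  # no organisation recorded
}


def get_path(record, path):
    """Follow a dotted path through nested dicts; return None if it is absent."""
    node = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def apply_crosswalk(record, crosswalk):
    """Build a flat target record from the source elements that are present."""
    target = {}
    for source_path, target_element in crosswalk.items():
        value = get_path(record, source_path)
        if value is not None:
            target[target_element] = value
    return target


summary = apply_crosswalk(SOURCE_RECORD, CROSSWALK)
```

Keeping the mapping in a table rather than in code is one possible design choice: the same function can then serve several pairs of standards by swapping the table. A real crosswalk would also have to deal with cardinality, value conversion, and elements that have no counterpart in the target standard, none of which this sketch attempts.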
Spatial Data Infrastructures and related concepts
A metadata infrastructure for the management of nested collections
Interoperability between metadata standards
The use of disambiguated thesauri to improve information retrieval
Integrating the concepts within the components of a Spatial Data Infrastructure
Conclusions and future work
A Collections
B Crosswalks
C Applications