GED User's Guide
Dataset Selection

TASK DESCRIPTIONS

Ten project tasks are divided into three major groups:
  • Database Integration and Design
  • Technical Development
  • Scientific Support and Methodology

    Task Description

    DATABASE INTEGRATION AND DESIGN

    TASK 1: Data Availability and Needs Assessment

    To assess data and information needs, there must be an in-depth evaluation of data availability, including as much information on data design and usability as possible. The GED project has maintained and updated metadata for global change datasets to improve documentation and support needs assessment. Acquisition and processing priorities have been established in response to the needs defined by EPA and NOAA requirements for research; however, it is clear that many of these requirements are shared with the global change community in general.

    Established priorities may range from acquisition of known datasets that require little processing, to recommendations on derivation methods or specific algorithms to produce the desired data from existing sources. The needs assessment also distinguishes between data requirements that can be generalized for wide application, versus unique requirements of the supporting agencies and their immediate research targets.

    TASK 2: Data Acquisition, Integration, and Archive

    At its inception, the GED project sought to extend normal data management operations of NGDC to include innovations. These innovations focused on GIS and related methods for reprocessing and integrating individual datasets received in diverse forms, into the common GIS structure established for the database. The project also aimed to develop useful variables required for research support. This involved as many as four levels of work:

    (a) Data Management:

    Data are acquired through various channels and made available in integrated form online as well as on CD-ROM. Data acquisition efforts seek source datasets that are as close to original investigations as possible, or are especially appropriate for specific requirements. Data integration efforts seek to represent these datasets in as close to their source condition as possible, given the requirements of integration. Since error estimates are typically lacking for most data, it is especially important to preserve full representation of quality information within or accompanying each dataset. Similarly, independently derived, multiple datasets are included to increase intercomparison opportunities between similar data within the database, for quality and error estimation.
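
    The intercomparison idea above can be illustrated with a minimal sketch. It assumes two independently derived grids of the same variable that are already registered to the same cells, with None marking missing values. The function name compare_grids and the sample values are hypothetical and are not part of the GED software.

```python
import math

def compare_grids(grid_a, grid_b, missing=None):
    """Return (n, mean difference, RMS difference) over cells valid in both grids."""
    diffs = []
    for row_a, row_b in zip(grid_a, grid_b):
        for a, b in zip(row_a, row_b):
            # Skip any cell where either dataset has no value.
            if a == missing or b == missing:
                continue
            diffs.append(a - b)
    if not diffs:
        raise ValueError("no overlapping valid cells")
    n = len(diffs)
    mean_diff = sum(diffs) / n
    rms_diff = math.sqrt(sum(d * d for d in diffs) / n)
    return n, mean_diff, rms_diff

# Hypothetical sample grids (2 x 2 cells).
grid_a = [[10.0, 12.0], [None, 20.0]]
grid_b = [[11.0, 12.0], [15.0, 18.0]]
n, mean_diff, rms_diff = compare_grids(grid_a, grid_b)
```

    A mean difference far from zero suggests a systematic offset between the two sources, while a large RMS difference with a small mean suggests cell-by-cell disagreement. Neither statistic identifies which dataset is closer to the truth.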

    As data are ingested into the NGDC system, all versions of the datasets are archived, including the original source and various modifications produced for the purpose of integration. Documentation is archived with the data, and is compiled into the database User's Guide.

    The existing database is augmented by new datasets acquired from external sources or derived at NGDC, with publication clearances for both data and documentation. All portions of the database, with the exception of re-printed journal articles and custom software, are placed into the public domain. Some individual source datasets are not distributable by NGDC, due to restrictions placed on them by the investigators or source institutions; or more often, in deference to other distribution agreements. Derivations from such source datasets, however, as part of an integrated and operational database, are developed and distributed as different products.

    (b) Data Processing:

    Once data are in the system, they are processed for data management purposes, to inspect quality and content, to verify documentation, and to determine the optimal data structure for integration into the main database. Depending on the nature of the dataset, integration may involve anything from simple media transfers and format conversions, to re-structuring operations such as geographic registration, grid re-sampling, gridding from point data, re-projecting, vectorizing, rasterizing, tabularizing, and/or other forms of interpolation. This work incorporates variously compatible datasets into a common data structure based on geographical objects and community-wide GIS conventions.
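
    As an illustration of one re-structuring operation named above, grid re-sampling, the following sketch aggregates a fine grid to a coarser one by block averaging. This is only one of many possible re-sampling methods and is not necessarily the one used at NGDC. The function name block_mean and the sample grid are hypothetical.

```python
def block_mean(grid, factor, missing=None):
    """Aggregate a fine grid to a coarser one by averaging factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for r0 in range(0, rows, factor):
        out_row = []
        for c0 in range(0, cols, factor):
            # Collect the valid fine cells that fall inside this coarse cell.
            vals = [grid[r][c]
                    for r in range(r0, r0 + factor)
                    for c in range(c0, c0 + factor)
                    if grid[r][c] != missing]
            # A coarse cell with no valid inputs stays missing.
            out_row.append(sum(vals) / len(vals) if vals else missing)
        out.append(out_row)
    return out

# Hypothetical 4 x 4 grid with a missing block in the lower right.
fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, None, None],
        [13, 14, None, None]]
coarse = block_mean(fine, 2)
```

    Averaging is appropriate only for continuous variables. Class data (for example, vegetation types) would need a different rule, such as taking the most frequent class in each block.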

    The most common changes made to datasets are conversions between numerical types and encodings (e.g., real, integer, ASCII, binary). Actual re-classifications are avoided unless absolutely necessary due to extremely unusual or cumbersome classification schemes in the source data, in which case every effort is made to involve the principal investigator(s).
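
    A conversion of this kind can be sketched as follows, assuming real values are stored as scaled 16-bit integers in a little-endian binary stream and later written back out as ASCII. The scale factor, byte order, and function names are assumptions chosen for illustration and do not describe the actual GED file formats.

```python
import struct

SCALE = 100  # hypothetical: keep two decimal places when storing as integers

def reals_to_int16_bytes(values, scale=SCALE):
    """Scale real values, round to integers, and pack as little-endian int16."""
    ints = [round(v * scale) for v in values]
    for i in ints:
        if not -32768 <= i <= 32767:
            raise OverflowError("scaled value %d does not fit in 16 bits" % i)
    return struct.pack("<%dh" % len(ints), *ints)

def int16_bytes_to_reals(data, scale=SCALE):
    """Unpack little-endian int16 values and restore the original scale."""
    count = len(data) // 2
    ints = struct.unpack("<%dh" % count, data)
    return [i / scale for i in ints]

values = [0.25, -1.5, 12.34]
packed = reals_to_int16_bytes(values)            # binary form, 2 bytes per value
restored = int16_bytes_to_reals(packed)          # back to real numbers
ascii_text = " ".join("%.2f" % v for v in restored)  # ASCII form
```

    Conversions like this are lossy beyond the chosen precision, so the scale factor and numeric type belong in the documentation that accompanies the dataset.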

    (c) Data Derivation:

    A number of datasets represent "derivations" from raw observational data, such as NDVI derivations from the Advanced Very High Resolution Radiometer (AVHRR) satellite sensor data, or monthly composites using a variety of methods. "Derivation" in this context is defined as a modification of raw data from a single thematic source, using accepted methods of calibration and/or correction. Although such work will defer, where possible, to other established sources (that contribute such derivations into the public domain), some specific derivations may be produced at NGDC. Priorities for derivation are jointly established by the supporting programs, according to needs and resources.
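
    For reference, the widely used definition of NDVI is (NIR - RED) / (NIR + RED), where RED and NIR are the visible (red) and near-infrared channel values. The sketch below applies that definition to a single pair of values. It assumes the inputs are already calibrated, since the text above does not specify which calibration or correction steps are applied. The function name ndvi and the sample values are hypothetical.

```python
def ndvi(red, nir):
    """Normalized Difference Vegetation Index for one pair of channel values."""
    total = nir + red
    if total == 0:
        # Undefined when both channels are zero; report as missing.
        return None
    return (nir - red) / total

# Hypothetical calibrated reflectances for one cell.
value = ndvi(red=0.1, nir=0.5)
```

    The result lies between -1 and 1 for non-negative inputs, with higher values generally associated with denser green vegetation.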

    (d) Data Synthesis:

    Simulation models and other applications may require spatial variables that are missing from the observational database or cannot be derived directly from it. In such cases, it may be possible to substitute for the missing variable through a complex derivation using established relationships and transformations between multiple datasets. This is a form of characterization that can be performed at NGDC and elsewhere using GIS methods, given suitable guidance on transformation equations and derivation procedures.

    TASK 3: Database Structure and Functionality

    A common database structure is possible using currently evolving GIS conventions. This has many benefits in supporting quality control, peer-review, and the sharing of research applications of the database. It is important to note the co-evolutionary nature of the database structure and GIS functionality, and the need for further research and experimentation on both.

    Descriptive analysis functions that are required for characterization are listed in the table below (Kineman, 1993):
    FUNCTION                                         PURPOSE

    Data Integration                                 Multi-thematic representation of variables, within a
                                                     common analytical structure, with metadata for
                                                     verification and error analysis.

    Exploratory Data Analysis                        Visualization; exploration; hypothesis formation

    Inter-comparison and statistical analysis        Quality and error assessment; empirical testing
                                                     (hypotheses, models, and characterizations);
                                                     statistical modeling

    Re-sampling and error analysis                   Experimental Design / representation for different
                                                     applications

    Environmental Characterization & Data Synthesis  Multi-disciplinary representation of phenomena;
                                                     monitoring and assessment; description of patterns
                                                     and trends; model inputs and tests

    These functions relate well to GIS concepts, which employ a number of conventions, such as geographical (vector and raster) object definitions, common geographical referencing and topological structures, labeling and legend conventions, and others. Future research and development in GIS, if properly informed by the global change community, can provide systems that are better suited for environmental characterization.

    Linkage with GIS is important in forming the required structural and functional aspects of an integrated database, and in performing many of the processing and quality control operations. There is also a need for a common environment to support peer-review and communication with a wide global change community of data users and developers.

    Advances in data structures and metadata are strongly influenced by developments in GIS functionality. The reverse has also been true, with system enhancements resulting from data integration requirements and the effort to track more kinds of data and metadata to meet demands of global change research. As clear as this relationship is, it is still not easy to maintain such links for generic purposes or to disseminate general system requirements to GIS developers. By maintaining strong links to GIS development, these relationships can become better defined.

    As data are processed into a common structure that is intended to be compatible with emerging GIS conventions, we must also deal with changing conventions and improvements in GIS structures. This is accomplished by linking the database with more-or-less "generic" GIS's that are currently available and offer the possibility for collaborative improvement. The use of IDRISI (from Clark University) began during the pilot and prototype phases and has continued because of the value of having a "common denominator" to facilitate integration, review, data exchange, and technical communication. UNIX, though common among global change researchers, does not provide the widest commonality among reviewers, data developers, and internationally distributed scientists and data sources; whereas the IBM-DOS compatible environment has made significant inroads in most technical facilities. Furthermore, maintaining operability at the lowest operable system level has introduced little if any sacrifice of quality or operability in more capable systems (other than the problem of format conversion).

    Collaboration with Clark University (IDRISI) has been fruitful in showing how structure and function can co-evolve. Based on this interaction, IDRISI has been significantly modified to support global change studies, and in turn, the database structure has evolved through use within the GIS environment. Using this example, similar relationships may be developed with other GIS developers.

    The GED project goes beyond the issue of providing a common format for data exchange between systems and users, which is already the focus of several major efforts within Government and other institutions (formats for the integrated database are relatively generic and easily converted to other standards, which generally require less detail about the data). An integrated database is a more in-depth approach that correspondingly sacrifices breadth in its coverage of overall data availability. Instead, the focus is on the intercomparability, inter-operability, and verifiability of datasets, using a selected portion of the overall data pool that must be prioritized by specific research.

    The structure and function of a characterization database is strongly related to Geographic Information Systems (GIS) methodology, to such an extent that GIS integration and improvements in GIS methods are of primary concern, along with the provision of useful data. For this reason, agreements may be formed with GIS researchers, and experimentation by GIS developers is encouraged to promote the inter-operability of the database within the established GIS community and among global change researchers using GIS methods.

    This will involve experimentation with GIS structures using at least one existing GIS as a development platform, and working closely with the software designers to incorporate improvements in:

    (a) data structures and formats (raster, vector, tabular, data types, compression, etc.),

    (b) raster/vector integration,

    (c) integration of tabular data structures,

    (d) export to other systems and formats,

    (e) data exploration and statistical capabilities,

    (f) scale integration and interpolation functions (including re-projection),

    (g) lineage tracking,

    (h) techniques for error representation,

    (i) error estimation and tracking.
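
    To make the idea of lineage tracking in the list above more concrete, the following sketch records each processing operation applied to a dataset so that its history can be reported alongside the data. The class names, field names, and sample entries are hypothetical and do not describe the metadata structures of IDRISI or the GED.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LineageStep:
    operation: str     # kind of processing, e.g. "re-sampling"
    description: str   # what was done, in words

@dataclass
class DatasetRecord:
    name: str
    source: str
    steps: List[LineageStep] = field(default_factory=list)

    def apply(self, operation, description):
        """Record one processing step; the data themselves are not modified here."""
        self.steps.append(LineageStep(operation, description))
        return self

    def history(self):
        """Return the lineage as printable lines, oldest step first."""
        lines = ["%s (source: %s)" % (self.name, self.source)]
        for number, step in enumerate(self.steps, start=1):
            lines.append("  %d. %s: %s" % (number, step.operation, step.description))
        return lines

# Hypothetical dataset and processing steps.
record = DatasetRecord("example_vegetation_index", "hypothetical source archive")
record.apply("format conversion", "ASCII to 16-bit binary")
record.apply("re-sampling", "10-minute grid to 30-minute grid")
```

    Keeping the steps in order matters because error estimates for a derived dataset depend on every operation between it and the original source.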



    TECHNICAL DEVELOPMENT

    TASK 4: Quality Assessment and Documentation

    The greatest problem in the distribution of environmental and ecological datasets is their usefulness outside the institutions or programs that created them, and within a rapidly expanding global change community. While review and publication standards provide effective quality assurance for research in general (including the production of datasets), they apply less directly to the distribution of datasets for subsequent uses, which is often a second or lower priority in research programs and funding. The result is that when data are removed from their original research context, we have inadequate mechanisms for evaluating their design or verifying their accuracy for given purposes. This issue of verifiability becomes critical in the context of large multi-disciplinary system studies such as global change, which must rely on a common data pool.

    Documentation therefore involves more than providing information about formats and data structures. Verifiability means the ability to assess the entire production method, statistical nature, and accuracy of the dataset, given its original purpose and present use. Ideally, the overall effort will provide the means for users to "re-design" or synthesize data for specific purposes with known confidence limits. This requires considerable knowledge about the data, i.e., "metadata" and documentation. In the GED project, such information is obtained from existing documentation and published articles, returns from the external review process, and results of internal quality assessment efforts. Documentation is produced and added to the database User's Guide. The User's Guide contains all available information needed to understand the design and nature of the data, from a technical/statistical (and mathematical, in the case of models) perspective.

    Internal Quality Control procedures (which extend to all tasks) are implemented to ensure error-free processing and full and accurate reproduction of datasets in their complete form, as close to their original design as possible, even though structures and formats may be changed to achieve full integration with the database. The need for retrospective quality assessment of many datasets themselves (which typically have inadequate documentation or metadata) has led to an emphasis on "Exploratory Data Analysis" methods, some of which have been experimentally incorporated into GIS software. In some cases, this has also spawned research projects related to specific datasets, for example on the quality and use of AVHRR and Vegetation Index data, and other examples in topography, vegetation classifications, climate data, etc. Some of this research also results in new methods for data integration, such as gridding or re-gridding techniques, error representation, etc. Methods in quality assessment are also being improved cooperatively with independent software developers.

    TASK 5: Distribution and Peer Review

    Three levels of distribution and review are described in the main body of the User's Guide, along with a diagram of the annual cycle. Pre-release review by 3-5 reviewers selected specifically for each dataset ensures adequate quality and appropriate representation of new datasets. A general review of the overall product (which began with a targeted review of the prototype) is maintained through feedback from users. A number of issues are addressed, including:

    1. Technical quality and completeness of data and documentation
    2. Database design
    3. GIS functionality
    4. Scientific content
    5. Potential applications
    6. Recommended improvements
    7. Overall methods

    The review concept, however, goes farther. As "reviewers" use the database (and GIS methods) in their research, they also participate in an experiment in scientific networking which may have broader implications for future collaboration. Each scientific discipline must achieve a certain "critical mass" in establishing useful methodological conventions that allow scientific exchange to proceed smoothly and quickly between colleagues. Study of Global Change, as a field that crosses traditional disciplinary and institutional boundaries, requires an unprecedented level of cooperation and exchange to address the key scientific questions. Efforts such as these may provide viable technological and methodological approaches to this greatest of problems, and to the increasing dependence on verifiable multi-disciplinary data.

    The importance of review is especially obvious in contrasting the properties of "ecosystems" data with better-defined physical data historically handled by the World Data Centers. Remotely sensed data, particularly derived satellite products such as the Global Vegetation Index (produced from NOAA polar-orbiting satellite AVHRR data), are less well defined and thus more subject to interpretation. For such derived data, precise definition is nearly impossible except on an application-specific, empirical basis. Many of the ancillary datasets included in this project go even one step farther, being subjective classifications based on individual scientists' work. Nevertheless, such derived, and even synthetic, data are essential in providing the variables needed to characterize systems, even though their value can only be determined in the context of experimentation. Thus, unlike many other fields where an individual expert can effectively represent the scientific community, global change is too diverse and broadly defined to be served so easily. A reasonable approach, adopted by many multi-disciplinary projects, is to bring a sufficient number of experts together with a common framework that can facilitate consensus. The current effort attempts to do this through a common methodological framework and review process.

    TASK 6: Dataset Development

    Part of the review effort also involves contact with the original investigator(s), which can often result in improved documentation and/or new data. In addition, various in-house projects may be conducted to produce new or improved datasets from available sources. Some examples are monthly GVI data produced from weekly and bi-weekly data, integrated topography data, and new versions of boundary data.
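
    The text above does not state how the monthly GVI data are produced from weekly data. One common compositing rule for vegetation indices is to keep the maximum value observed in each cell during the period, and the sketch below shows that rule only as an assumed example. The function name max_value_composite and the sample grids are hypothetical.

```python
def max_value_composite(weekly_grids, missing=None):
    """Combine several co-registered grids by keeping the largest valid value per cell."""
    rows, cols = len(weekly_grids[0]), len(weekly_grids[0][0])
    composite = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Gather this cell's value from every week, ignoring missing data.
            vals = [g[r][c] for g in weekly_grids if g[r][c] != missing]
            row.append(max(vals) if vals else missing)
        composite.append(row)
    return composite

# Three hypothetical weekly grids (2 x 2 cells); None marks missing data.
weeks = [
    [[0.10, None], [0.30, None]],
    [[0.25, 0.40], [0.20, None]],
    [[0.15, 0.35], [None, None]],
]
monthly = max_value_composite(weeks)
```

    A cell that is missing in every week remains missing in the composite, so gaps in the source data stay visible in the derived product.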

    International efforts to enhance the database were initiated during the IGBP Global Change Database Pilot Project for Africa (known then as the "Diskette Project"), a joint effort by NGDC and the ICSU Panel on World Data Centres (WDC). Plans were developed to expand that effort through collaboration between the ICSU Panel on World Data Centres and the IGBP Data and Information System office (IGBP-DIS): NGDC participates in this effort through its affiliated World Data Center-A. Where there are suitable facilities and interest, the World Data Centres help implement the database and GIS capabilities along with data exchange agreements. This is currently taking place with institutions in China, South America, and Africa.

    The data integration effort became bi-directional in 1992. First, there is a need to respond to the on-going specific data requirements of the ERL-C characterization and modeling efforts to provide needed input data for research (Tasks 1 and 2). Second, the FY92 work will begin integrating some of the outputs of the ERL-C research in the form of derived or predicted numerical sets. Since these numerical sets are intimately connected with the models that produced them, documentation will need to be extended to include adequate representation of the models themselves (Task 4). This may be done in several ways. First, traditional written descriptions will be included in the documentation as provided by ERL-C. Second, NGDC will develop ways of including the operational models themselves in the distribution products, integrated, if possible, with the basic access and analysis software provided with the databases, with links to third-party software as needed.



    SCIENTIFIC SUPPORT AND METHODOLOGY

    TASK 7: Development of a Characterization Method

    The key to understanding the role of an adaptive, integrated database in environmental and ecological characterization is to distinguish between conditions which can be observed (description and analysis), which GIS methods are presently well suited for, and dynamic processes that one may infer (theory), which is largely the realm of research and modeling. This does not overlook the fact that theory is required for description, and vice-versa, and that the two overlap.

    Characterization applies existing theory, models, and data (including flows and rates) for the purpose of empirical analysis and description. It may include the development of indices and predictions, as well as descriptive models of critical processes that determine the time-dependent nature of system function and behavior. (Watson, 1978)

    The advantage of GIS is that it seems optimal for descriptive analysis (including quality assessment, exploratory analysis, statistical comparisons, error analysis, etc.) and static derivations or data synthesis (e.g. overlay operations, distance analysis, interpolation, and even complex derivations of predictive indices). It may also be well suited for statistical modeling, i.e. the search for underlying patterns and trends in the database, which may be used as a basis for prediction. Because it optimizes for descriptive analysis, however, the GIS approach may not be optimal for dynamic simulation and theoretical prediction, which typically employ a different philosophical approach, emphasizing mathematical and statistical formulation of theory rather than the analysis of data objects.

    The GIS-database approach can thus be designed to support characterization work. This approach does not attempt to represent the mathematical form of processes themselves, but rather the observed, derived, or predicted results of such models, i.e. model runs based on known states (in numerical and geographical form). Naturally, this approach allows for linkage to other mathematical or statistical modeling sub-systems, or to narrative descriptions of processes; but nevertheless remains a distinct activity of its own. Unlike computer systems that are designed to implement formula objects (i.e. simulation models), GIS typically deals with formula objects in a transitory and piece-wise manner, using known relations to calculate results in simplified steps. This approach is ideal for exploratory analysis and data development, since each step can be independently confirmed before proceeding. In this approach, the results of applied models and data derivations may be added to the database, as derived (but confirmed) "data."

    This philosophy of GIS methods also has equipment implications. Being more data-intensive than computation-intensive, the approach lends itself well to single-processor computers, even micro-computers, with sufficient disk storage. Improved performance requires optimization for I/O, rather than optimization for complex mathematical computation from limited data inputs. This implies that not only are the two methods distinct (in purpose and character), but that the best overall implementation may be to link separately optimized systems.

    Nevertheless, the characterization database must be driven by conceptual models of the system, whether or not it represents them in mathematical form. These models, derived from research, indicate critical processes and important phenomena, and thus determine what variables should be represented, and perhaps in what form (i.e. scale, time-step, tessellation, precision and accuracy, etc.).

    TASK 8: Scale Integration and Linkage

    Environmental and ecological phenomena are a function of scale (Rosswall, Woodmansee, and Riser, 1988) and accordingly scale is an important defining factor for this project. The scope of the project is largely defined by scale. The database defined by this project fills a gap in scale between site-specific studies and global simulation modeling. As such, it can serve as a link between these other two well-established activities, scaling up from site studies and scaling down from coarse global models. Furthermore, the range of scales defined by the project encompasses interesting phenomena in its own right, and may serve well as the scale for communication between studies and with various global change outreach efforts.

    Assuming that there are appropriate ranges of scales for representing given phenomena, the issue of scale integration becomes critical to global characterization. When is it appropriate to develop correspondences between data of different scale and when is it not? Since any study can define its "natural" scale boundaries, this question reduces to one of determining the boundaries of natural scale groups, for a given field of study. It is useful and necessary to attempt scale conversions and correspondences between data within the scale boundaries of a given study (e.g. using cover class data at 1/2 degree to help analyze satellite data on 10-minute grids). Between such scale groups, however, it may not be so reasonable (e.g., there is often no direct correspondence between land cover classes at widely differing scales because they may be based on different properties).
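
    The 1/2-degree and 10-minute example above works out neatly because 30 minutes divided by 10 minutes gives exactly 3, so each half-degree cell contains a 3 x 3 block of 10-minute cells. The sketch below pairs each fine-grid value with the cover class of the coarse cell that contains it. It assumes both grids share the same origin and nest exactly. The function names and sample data are hypothetical.

```python
FINE_MINUTES = 10
COARSE_MINUTES = 30
RATIO = COARSE_MINUTES // FINE_MINUTES  # 3 fine cells per coarse cell, per axis

def coarse_cell(fine_row, fine_col, ratio=RATIO):
    """Index of the coarse cell that contains a given fine cell."""
    return fine_row // ratio, fine_col // ratio

def attach_cover_class(fine_grid, cover_grid, ratio=RATIO):
    """Pair each fine-grid value with the cover class of its containing coarse cell."""
    pairs = []
    for r, row in enumerate(fine_grid):
        for c, value in enumerate(row):
            cr, cc = coarse_cell(r, c, ratio)
            pairs.append((cover_grid[cr][cc], value))
    return pairs

# Hypothetical data: a 3 x 6 fine grid covering two half-degree cells.
fine_grid = [[1, 2, 3, 4, 5, 6],
             [1, 2, 3, 4, 5, 6],
             [1, 2, 3, 4, 5, 6]]
cover_grid = [["forest", "grassland"]]
pairs = attach_cover_class(fine_grid, cover_grid)
```

    The resulting pairs can then be grouped by class, for example to summarize the satellite values within each cover type.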

    Instead, linkage of information between widely divergent scales must rely on linking the analytical results of complete studies, not data. Such linkage is accomplished by predicting phenomena at different scales, from analysis within a given scale range. For example, it may be difficult or impossible to predict distributions of non-dominant species based on GCM predictions on 2.5 to 5 degree grids (because they may respond more to the micro environment and biological interactions than to general conditions). However, it would be a reasonable approach to predict climatic changes on this scale, then modify the prediction with finer scale data, for example elevation and soils data, and then apply the prediction to species models at site scales. Similarly, it is a reasonable approach to extrapolate biogeochemical emissions from site scales to medium scales, using appropriate data on the geographic extent of similar sites. These may then be aggregated to produce the required coarse inputs to GCM models.
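
    The scaling-up example above (site scale to medium scale to coarse model input) can be sketched as simple arithmetic: multiply a site-derived flux rate by the area of similar sites in each medium-scale cell, then sum the medium-scale cells that fall within one coarse cell. All names, class labels, flux rates, and areas below are hypothetical, and the units are arbitrary.

```python
def cell_emission(flux_by_class, area_by_class):
    """Total emission for one medium-scale cell: sum of flux rate x area per site class."""
    return sum(flux_by_class[name] * area for name, area in area_by_class.items())

def coarse_emission(flux_by_class, medium_cells):
    """Sum the emissions of all medium-scale cells that fall within one coarse cell."""
    return sum(cell_emission(flux_by_class, cell)
               for row in medium_cells
               for cell in row)

# Hypothetical flux rates per unit area, as measured at representative sites.
flux_by_class = {"wetland": 2.0, "upland": 0.5}

# Hypothetical area of each site class within a 2 x 2 block of medium-scale cells.
medium_cells = [
    [{"wetland": 10, "upland": 90}, {"wetland": 0, "upland": 100}],
    [{"wetland": 50, "upland": 50}, {"wetland": 20, "upland": 80}],
]

totals = [[cell_emission(flux_by_class, cell) for cell in row] for row in medium_cells]
coarse_total = coarse_emission(flux_by_class, medium_cells)
```

    This treats every site of a given class as behaving like the measured site, which is the main source of uncertainty in such an extrapolation.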

    For both scaling up and scaling down, the "medium" scale of this database becomes important. On the one hand, we need aggregation methods to characterize (or "parameterize") variables at scales required by models, from finer scale states and processes. On the other hand, we need to understand variability of coarse scale phenomena due to more local conditions, to apply the predictions of global models. The same issue exists in the time dimension. Establishing methods for characterization across scales is a prerequisite for linking modeling with observational data.

    TASK 9: Linking Characterization to Research and Modeling

    Ultimately, the project hopes to aid research and modeling in global change and landscape ecology, perhaps best serving a role in environmental, ecological, or socio-ecological characterization supporting comparative ecosystems analysis (e.g., Cole, Lovett and Findlay, 1991) and global change modeling. An integrated database of published and easily compared datasets is perceived as a general first step toward this goal. While the GED may serve as a useful data publication mechanism, and may aid evaluation and use of datasets, its extension to directly support characterization and modeling requires close collaboration with investigators and the establishment of specific data and data integration priorities. It also requires further development of appropriate links between ecological characterization and modeling at various scales.

    Ecological characterization differs from research in that it is focused on synthesis of existing data and process information. Dynamic modeling, on the other hand, is usually designed for research purposes, particularly to study processes or create simulations. Some other common uses of the term "modeling" include statistical modeling, which is the search for underlying patterns that accurately describe data, and GIS or "data" modeling, which is really data synthesis -- the production of a derived dataset from the static combination and analysis of others. Perhaps the clearest distinction between characterization and modeling is that dynamic modeling (which may include probability models) claims to be valid in space and time beyond the range of available data, whereas static forms of modeling are only tenuously extrapolated beyond or between the available data, with increasing statistical error.

    The goal of environmental, ecological, and socio-ecological characterization using GIS technology, is to apply accepted integration and synthesis methods to datasets and other information to provide empirically valid representations of a system. Though guided by modeling efforts, it can and perhaps should remain entirely complementary to them (i.e. as independent as possible to provide new information and valid tests).

    TASK 10: Extension to Education, Outreach, and Policy Support

    While the Global Ecosystems Project itself is focused on establishing and improving the scientific database, other activities of the NGDC/WDC Global Change Database Program are concerned with public dissemination. This includes cooperative projects between the World Data Center (housed at NGDC) and the IGBP, as begun with the Africa Pilot Project, database support and consulting to the International Space Year's Global Change Encyclopedia, data planning committees such as CODATA, and support to various educational outreach projects, including United Nations training programs and developers of educational materials.

    The current database is experimental and not well suited to teach global change phenomena without considerable analysis and interpretation. Nevertheless, education and outreach need not wait for full scientific development, if the education/outreach program is appropriately tuned to the existing level of development and knowledge.

    For example, it may be appropriate to teach and extend expertise in GIS and data management for global change using this database, if it is clear that the effort is targeted to appropriate experts and applied scientists. Similarly, at the level of technical development, the issues of statistical designs, quality assessment, comparative techniques, etc., are relevant for curricula. At the level of scientific support, it may be appropriate to explore concepts of characterization, scale, and modeling, extending results to other scientific groups, as well as developing corresponding education programs for global change phenomena, if the effort is appropriately combined with expert knowledge. It would not be appropriate, however, to use the data contained here as definitive information, without considerable study and interpretation.

    Although not a specific goal of the project, eventual application to policy represents a fourth level of development that relies on prior development of all three preceding levels (Database Integration and Design, Technical Development, and Scientific Support and Methodology). As such, it is clear that by establishing a credible scientific effort in global characterization, we are also establishing a methodical way to support policy and avoid the inappropriate transfer of raw data to information levels.



    REFERENCES

    Cole, J., G. Lovett, and S. Findlay (eds.). 1991. Comparative Analysis of Ecosystems: Patterns, Mechanisms, and Theories. New York: Springer-Verlag.

    Kineman, J.J. 1993. "What is a scientific database? Design considerations for Global Characterization in the NOAA-EPA Global Ecosystems Database Project." In: GIS and Modeling: Proceedings of the First International Workshop on Integrating GIS with Environmental Modeling, Boulder, CO., September, 1991. London: Oxford University Press. [In press]

    Rosswall, T., R.G. Woodmansee, and P.G. Riser (eds.). 1988. Scales and Global Change: Spatial and Temporal Variability in Biospheric and Geospheric Processes. SCOPE 35. New York: J. Wiley. 355p.

    Watson, J.F. 1978. "Ecological characterization of the coastal ecosystems of the United States and its territories." Proceedings: Energy/Environment '78. Los Angeles: Society of Petroleum Industry Biologists. pp. 47-53.

    (Also subsequent publications of the Coastal Ecosystems Project, Office of Biological Services, Fish and Wildlife Service, U.S. Dept. of the Interior, Washington, D.C. 20240.)