Established priorities may range from acquisition of known datasets that require little processing, to recommendations on derivation methods or specific algorithms to produce the desired data from existing sources. The needs assessment also distinguishes between data requirements that can be generalized for wide application, versus unique requirements of the supporting agencies and their immediate research targets.
(a) Data Management:
Data are acquired through various channels and made available in integrated form online as well as on CD-ROM. Data acquisition efforts seek source datasets that are as close to original investigations as possible, or are especially appropriate for specific requirements. Data integration efforts seek to represent these datasets in as close to their source condition as possible, given the requirements of integration. Since error estimates are typically lacking for most data, it is especially important to preserve full representation of quality information within or accompanying each dataset. Similarly, independently derived, multiple datasets are included to increase intercomparison opportunities between similar data within the database, for quality and error estimation.
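As an illustration of the intercomparison idea, the following minimal sketch summarizes the cell-by-cell difference between two independently derived grids. It assumes the grids are already co-registered (same origin, cell size, and dimensions) and that missing cells carry a flag value; the function name `compare_grids` and the flag `-9999` are hypothetical choices for this example, not conventions taken from the database.

```python
import math

def compare_grids(a, b, nodata=-9999):
    """Return (n, bias, rmse) over cells that are valid in both grids.

    Assumes a and b are co-registered lists of rows of equal shape.
    """
    diffs = [
        x - y
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
        if x != nodata and y != nodata
    ]
    n = len(diffs)
    if n == 0:
        raise ValueError("no overlapping valid cells")
    bias = sum(diffs) / n                             # mean difference (a - b)
    rmse = math.sqrt(sum(d * d for d in diffs) / n)   # root-mean-square difference
    return n, bias, rmse

# Two small hypothetical grids; one cell is missing in grid_a.
grid_a = [[10.0, 12.0], [14.0, -9999]]
grid_b = [[11.0, 12.0], [12.0, 15.0]]
n, bias, rmse = compare_grids(grid_a, grid_b)
```

A difference summary of this kind does not say which dataset is correct; it only indicates how far two independent representations disagree, which is the starting point for the quality and error estimation described above.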
As data are ingested into the NGDC system, all versions of the datasets are archived, including the original source and various modifications produced for the purpose of integration. Documentation is archived with the data, and is compiled into the database User's Guide.
The existing database is augmented by new datasets acquired from external sources or derived at NGDC, with publication clearances for both data and documentation. All portions of the database, with the exception of re-printed journal articles and custom software, are placed into the public domain. Some individual source datasets are not distributable by NGDC, due to restrictions placed on them by the investigators or source institutions; or more often, in deference to other distribution agreements. Derivations from such source datasets, however, as part of an integrated and operational database, are developed and distributed as different products.
(b) Data Processing:
Once data are in the system, they are processed for data management purposes, to inspect quality and content, to verify documentation, and to determine the optimal data structure for integration into the main database. Depending on the nature of the dataset, integration may involve anything from simple media transfers and format conversions, to re-structuring operations such as geographic registration, grid re-sampling, gridding from point data, re-projecting, vectorizing, rasterizing, tabularizing, and/or other forms of interpolation. This work incorporates variously compatible datasets into a common data structure based on geographical objects and community-wide GIS conventions.
The most common changes made to datasets are conversions between numerical types and encodings (e.g., real, integer, ASCII, binary). Re-classification is avoided unless the source data use an extremely unusual or cumbersome classification scheme that makes it absolutely necessary, in which case every effort is made to involve the principal investigator(s).
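A conversion between real and integer representations can be sketched as follows. This is only an illustration, assuming a fixed scale factor and a 16-bit signed integer range; the names `pack` and `unpack` and the scale of 0.01 are hypothetical and do not describe the formats actually used in the database.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def pack(values, scale):
    """Store real values as integers: stored = round(value / scale)."""
    packed = []
    for v in values:
        q = round(v / scale)
        if not INT16_MIN <= q <= INT16_MAX:
            raise OverflowError(f"{v} does not fit in 16 bits at scale {scale}")
        packed.append(q)
    return packed

def unpack(packed, scale):
    """Recover approximate real values from the stored integers."""
    return [q * scale for q in packed]

# Hypothetical index values stored to two decimal places.
reals = [0.123, -0.057, 0.801]
stored = pack(reals, 0.01)
restored = unpack(stored, 0.01)
max_err = max(abs(a - b) for a, b in zip(reals, restored))
```

The round trip is not exact: each value can move by up to half the scale factor, so the scale used should be recorded in the documentation along with the converted data.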
(c) Data Derivation:
A number of datasets represent "derivations" from raw observational data, such as NDVI derivations from the Advanced Very High Resolution Radiometer (AVHRR) satellite sensor data, or monthly composites using a variety of methods. "Derivation" in this context is defined as a modification of raw data from a single thematic source, using accepted methods of calibration and/or correction. Although such work will defer, where possible, to other established sources (that contribute such derivations into the public domain), some specific derivations may be produced at NGDC. Priorities for derivation are jointly established by the supporting programs, according to needs and resources.
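For reference, the NDVI is conventionally computed from the AVHRR visible (channel 1) and near-infrared (channel 2) measurements as (NIR - VIS) / (NIR + VIS). The sketch below applies that definition to single values; it assumes the inputs have already been calibrated, and the function name and the use of `None` for undefined results are choices made for this example only.

```python
def ndvi(ch1_visible, ch2_near_ir):
    """Normalized Difference Vegetation Index for one pair of channel values.

    Returns None when the channel sum is zero and the ratio is undefined.
    """
    total = ch2_near_ir + ch1_visible
    if total == 0:
        return None
    return (ch2_near_ir - ch1_visible) / total

# Hypothetical calibrated values for a vegetated and a bare surface.
vegetated = ndvi(0.1, 0.5)   # strong near-infrared response
bare = ndvi(0.3, 0.3)        # equal response in both channels
undefined = ndvi(0.0, 0.0)
```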
(d) Data Synthesis:
Simulation models and other applications may require spatial variables that are missing from the observational database or cannot be derived directly from it. In such cases, it may be possible to substitute for the missing variable through a complex derivation using established relationships and transformations between multiple datasets. This is a form of characterization that can be performed at NGDC and elsewhere using GIS methods, given suitable guidance on transformation equations and derivation procedures.
Descriptive analysis functions that are required for characterization are listed in the table below (Kineman, 1993):
| FUNCTION | PURPOSE |
| --- | --- |
| Data Integration | Multi-thematic representation of variables, within a common analytical structure, with metadata for verification and error analysis. |
| Exploratory Data Analysis | Visualization; exploration; hypothesis formation |
| Inter-comparison and statistical analysis | Quality and error assessment; empirical testing (hypotheses, models, and characterizations); statistical modeling |
| Re-sampling and error analysis | Experimental Design / representation for different applications |
| Environmental Characterization & Data Synthesis | Multi-disciplinary representation of phenomena; monitoring and assessment; description of patterns and trends; model inputs and tests |
These functions relate well to GIS concepts, which employ a number of conventions, such as geographical (vector and raster) object definitions, common geographical referencing and topological structures, labeling and legend conventions, and others. Future research and development in GIS, if properly informed by the global change community, can provide systems that are better suited for environmental characterization.
Linkage with GIS is important in forming the required structural and functional aspects of an integrated database, and in performing many of the processing and quality control operations. There is also a need for a common environment to support peer-review and communication with a wide global change community of data users and developers.
Advances in data structures and metadata are strongly influenced by developments in GIS functionality. The reverse has also been true, with system enhancements resulting from data integration requirements and the effort to track more kinds of data and metadata to meet demands of global change research. As clear as this relationship is, it is still not easy to maintain such links for generic purposes or to disseminate general system requirements to GIS developers. By maintaining strong links to GIS development, these relationships can become better defined.
As data are processed into a common structure that is intended to be compatible with emerging GIS conventions, we must also deal with changing conventions and improvements in GIS structures. This is accomplished by linking the database with more-or-less "generic" GISs that are currently available and offer the possibility for collaborative improvement. The use of IDRISI (from Clark University) began during the pilot and prototype phases and has continued because of the value of having a "common denominator" to facilitate integration, review, data exchange, and technical communication. UNIX, though common among global change researchers, does not provide the widest commonality among reviewers, data developers, and internationally distributed scientists and data sources, whereas the IBM-DOS compatible environment has made significant inroads in most technical facilities. Furthermore, maintaining operability at the lowest operable system level has introduced little if any sacrifice of quality or operability in more capable systems (other than the problem of format conversion).
Collaboration with Clark University (IDRISI) has been fruitful in showing how structure and function can co-evolve. Based on this interaction, IDRISI has been significantly modified to support global change studies, and in turn, the database structure has evolved through use within the GIS environment. Using this example, similar relationships may be developed with other GIS developers.
The GED project goes beyond the issue of providing a common format for data exchange between systems and users, which is already the focus of several major efforts within Government and other institutions (formats for the integrated database are relatively generic and easily converted to other standards, which generally require less detail about the data). An integrated database is a more in-depth approach that correspondingly sacrifices breadth in its coverage of overall data availability. Instead, the focus is on the intercomparability, inter-operability, and verifiability of datasets, using a selected portion of the overall data pool that must be prioritized by specific research.
The structure and function of a characterization database is strongly related to Geographic Information Systems (GIS) methodology, to such an extent that GIS integration and improvements in GIS methods are of primary concern, along with the provision of useful data. For this reason, agreements may be formed with GIS researchers, and experimentation by GIS developers is encouraged to promote the inter-operability of the database within the established GIS community and among global change researchers using GIS methods.
This will involve experimentation with GIS structures using at least one existing GIS as a development platform, and working closely with the software designers to incorporate improvements in:
(a) data structures and formats (raster, vector, tabular, data types, compression, etc.),
(b) raster/vector integration,
(c) integration of tabular data structures,
(d) export to other systems and formats,
(e) data exploration and statistical capabilities,
(f) scale integration and interpolation functions (including re-projection),
(g) lineage tracking,
(h) techniques for error representation,
(i) error estimation and tracking.
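To make the idea of lineage tracking concrete, the sketch below shows one way a processing history could be carried with a dataset. It is purely illustrative: the classes `Lineage` and `LineageStep`, their fields, and the example operations are hypothetical and do not describe any existing GIS structure.

```python
from dataclasses import dataclass, field

@dataclass
class LineageStep:
    operation: str     # name of the processing step
    parameters: dict   # settings needed to repeat the step

@dataclass
class Lineage:
    source: str                                # description of the original dataset
    steps: list = field(default_factory=list)  # ordered processing history

    def record(self, operation, **parameters):
        """Append one processing step to the history."""
        self.steps.append(LineageStep(operation, parameters))
        return self

    def describe(self):
        """Return the history as plain text suitable for documentation."""
        lines = [f"source: {self.source}"]
        for number, step in enumerate(self.steps, start=1):
            params = ", ".join(
                f"{key}={value}" for key, value in sorted(step.parameters.items())
            )
            lines.append(f"{number}. {step.operation}({params})")
        return "\n".join(lines)

# Hypothetical history for a re-gridded dataset.
history = Lineage(source="weekly vegetation index, as received")
history.record("reproject", target="latitude/longitude")
history.record("resample", cell_minutes=10, method="nearest neighbor")
report = history.describe()
```

Keeping the history as an ordered list of named steps with their parameters means the same record can serve both as documentation and as a recipe for repeating the processing.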
Documentation therefore involves more than providing information about formats and data structures. Verifiability means the ability to assess the entire production method, statistical nature, and accuracy of the dataset, given its original purpose and present use. Ideally, the overall effort will provide the means for users to "re-design" or synthesize data for specific purposes with known confidence limits. This requires considerable knowledge about the data, i.e., "metadata" and documentation. In the GED project, such information is obtained from existing documentation and published articles, returns from the external review process, and results of internal quality assessment efforts. Documentation is produced and added to the database User's Guide. The User's Guide contains all available information needed to understand the design and nature of the data, from a technical/statistical (and mathematical, in the case of models) perspective.
Internal Quality Control procedures (which extend to all tasks) are implemented to ensure error-free processing and full and accurate reproduction of datasets in their complete form, as close to their original design as possible, even though structures and formats may be changed to achieve full integration with the database. The need for retrospective quality assessment of many datasets themselves (which typically have inadequate documentation or metadata) has led to an emphasis on "Exploratory Data Analysis" methods, some of which have been experimentally incorporated into GIS software. In some cases, this has also spawned research projects related to specific datasets, for example on the quality and use of AVHRR and Vegetation Index data, and other examples in topography, vegetation classifications, climate data, etc. Some of this research also results in new methods for data integration, such as gridding or re-gridding techniques, error representation, etc. Methods in quality assessment are also being improved cooperatively with independent software developers.
TASK 5: Distribution and Peer Review
Three levels of distribution and review are described in the main body of the User's Guide, along with a diagram of the annual cycle. Pre-release review by 3-5 reviewers selected specifically for each dataset ensures adequate quality and appropriate representation of new datasets. A general review of the overall product (which began with a targeted review of the prototype) is maintained through feedback from users. A number of issues are addressed, including:
1. Technical quality and completeness of data and documentation
2. Database design
3. GIS functionality
4. Scientific content
5. Potential applications
6. Recommended improvements
7. Overall methods
The review concept, however, goes farther. As "reviewers" use the database (and GIS methods) in their research, they also participate in an experiment in scientific networking which may have broader implications for future collaboration. Each scientific discipline must achieve a certain "critical mass" in establishing useful methodological conventions that allow scientific exchange to proceed smoothly and quickly between colleagues. The study of Global Change, as a field that crosses traditional disciplinary and institutional boundaries, requires an unprecedented level of cooperation and exchange to address the key scientific questions. Efforts such as these may provide viable technological and methodological approaches to this greatest of problems, and to the increasing dependence on verifiable multi-disciplinary data.
The importance of review is especially obvious in contrasting the properties of "ecosystems" data with better-defined physical data historically handled by the World Data Centers. Remotely sensed data are less well defined, and thus more subject to interpretation; an example is a derived satellite product such as the Global Vegetation Index, produced from NOAA polar-orbiting satellite AVHRR data. For such derived data, precise definition is nearly impossible except on an application-specific, empirical basis. Many of the ancillary datasets included in this project go even one step farther, being subjective classifications based on individual scientists' work. Nevertheless, such derived, and even synthetic, data are essential in providing the variables needed to characterize systems, even though their value can only be determined in the context of experimentation. Thus, unlike many other fields where an individual expert can effectively represent the scientific community, global change is too diverse and broadly defined to be served so easily. A reasonable approach, adopted by many multi-disciplinary projects, is to bring a sufficient number of experts together with a common framework that can facilitate consensus. The current effort attempts to do this through a common methodological framework and review process.
TASK 6: Dataset Development
Part of the review effort also involves contact with the original investigator(s), which can often result in improved documentation and/or new data. In addition, various in-house projects may be conducted to produce new or improved datasets from available sources. Some examples are monthly GVI data produced from weekly and bi-weekly data, integrated topography data, and new versions of boundary data.
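One way to produce a monthly grid from weekly grids is sketched below. It assumes maximum-value compositing, a commonly used method for vegetation index data in which each cell keeps the highest valid value observed during the period; the source does not state which compositing method was used for the monthly GVI data, so the method, the function name, and the missing-data flag are all assumptions made for this example.

```python
def max_value_composite(period_grids, nodata=-9999):
    """Combine co-registered grids by keeping the largest valid value per cell."""
    rows = len(period_grids[0])
    cols = len(period_grids[0][0])
    composite = [[nodata] * cols for _ in range(rows)]
    for grid in period_grids:
        for r in range(rows):
            for c in range(cols):
                value = grid[r][c]
                if value == nodata:
                    continue  # skip missing observations
                if composite[r][c] == nodata or value > composite[r][c]:
                    composite[r][c] = value
    return composite

# Three hypothetical weekly grids; one cell is never observed.
week1 = [[0.20, -9999], [0.10, 0.40]]
week2 = [[0.30, -9999], [0.05, 0.60]]
week3 = [[0.25, -9999], [-9999, 0.50]]
monthly = max_value_composite([week1, week2, week3])
```

A cell that is missing in every input remains flagged as missing in the composite, so gaps in the source data stay visible in the derived product.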
International efforts to enhance the database were initiated during the IGBP Global Change Database Pilot Project for Africa (known then as the "Diskette Project"), a joint effort by NGDC and the ICSU Panel on World Data Centres (WDC). Plans were developed to expand that effort through collaboration between the ICSU Panel on World Data Centres and the IGBP Data and Information System office (IGBP-DIS). NGDC participates in this effort through its affiliated World Data Center-A. Where there are suitable facilities and interest, the World Data Centres help implement the database and GIS capabilities along with data exchange agreements. This is currently taking place with institutions in China, South America, and Africa.
The data integration effort became bi-directional in 1992. In one direction, there is a need to respond to the on-going specific data requirements of the ERL-C characterization and modeling efforts by providing needed input data for research (Tasks 1 and 2). In the other direction, the FY92 work will begin integrating some of the outputs of the ERL-C research in the form of derived or predicted numerical sets. Since these numerical sets are intimately connected with the models that produced them, documentation will need to be extended to include adequate representation of the models themselves (Task 4). This may be done in several ways. First, traditional written descriptions will be included in the documentation as provided by ERL-C. Second, NGDC will develop ways of including the operational models themselves in the distribution products, integrated, if possible, with the basic access and analysis software provided with the databases, with links to third-party software as needed.
Characterization applies existing theory, models, and data (including flows and rates) for the purpose of empirical analysis and description. It may include the development of indices and predictions, as well as descriptive models of critical processes that determine the time-dependent nature of system function and behavior. (Watson, 1978)
The advantage of GIS is that it seems optimal for descriptive analysis (including quality assessment, exploratory analysis, statistical comparisons, error analysis, etc.) and static derivations or data synthesis (e.g. overlay operations, distance analysis, interpolation, and even complex derivations of predictive indices). It may also be well suited for statistical modeling, i.e. the search for underlying patterns and trends in the database, which may be used as a basis for prediction. Because it optimizes for descriptive analysis, however, the GIS approach may not be optimal for dynamic simulation and theoretical prediction, which typically employ a different philosophical approach, emphasizing mathematical and statistical formulation of theory rather than the analysis of data objects.
The GIS-database approach can thus be designed to support characterization work. This approach does not attempt to represent the mathematical form of processes themselves, but rather the observed, derived, or predicted results of such models, i.e. model runs based on known states (in numerical and geographical form). Naturally, this approach allows for linkage to other mathematical or statistical modeling sub-systems, or to narrative descriptions of processes; but nevertheless remains a distinct activity of its own. Unlike computer systems that are designed to implement formula objects (i.e. simulation models), GIS typically deals with formula objects in a transitory and piece-wise manner, using known relations to calculate results in simplified steps. This approach is ideal for exploratory analysis and data development, since each step can be independently confirmed before proceeding. In this approach, the results of applied models and data derivations may be added to the database, as derived (but confirmed) "data."
This philosophy of GIS methods also has equipment implications. Being more data-intensive than computation-intensive, the approach lends itself well to single-processor computers, even micro-computers, with sufficient disk storage. Improved performance requires optimization for I/O, rather than optimization for complex mathematical computation from limited data inputs. This implies that not only are the two methods distinct (in purpose and character), but that the best overall implementation may be to link separately optimized systems.
Nevertheless, the characterization database must be driven by conceptual models of the system, whether or not it represents them in mathematical form. These models, derived from research, indicate critical processes and important phenomena, and thus determine what variables should be represented, and perhaps in what form (i.e. scale, time-step, tessellation, precision and accuracy, etc.).
Assuming that there are appropriate ranges of scales for representing given phenomena, the issue of scale integration becomes critical to global characterization. When is it appropriate to develop correspondences between data of different scales and when is it not? Since any study can define its "natural" scale boundaries, this question reduces to one of determining the boundaries of natural scale groups, for a given field of study. It is useful and necessary to attempt scale conversions and correspondences between data within the scale boundaries of a given study (e.g. using cover class data at 1/2 degree to help analyze satellite data on 10-minute grids). Between such scale groups, however, it may not be so reasonable (e.g., there is often no direct correspondence between land cover classes at widely differing scales because they may be based on different properties).
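The within-group example mentioned above (cover classes at 1/2 degree used with satellite data on a 10-minute grid) can be sketched as a simple correspondence in which each half-degree cell covers a 3 x 3 block of 10-minute cells. The sketch assumes that the two grids share the same origin and are exactly aligned; the function name, class labels, and values are hypothetical.

```python
from collections import defaultdict

FACTOR = 3  # 30 arc-minutes per coarse cell / 10 arc-minutes per fine cell

def class_means(coarse_classes, fine_values, factor=FACTOR, nodata=-9999):
    """Average the fine-grid values falling within each coarse-grid class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, row in enumerate(fine_values):
        for c, value in enumerate(row):
            if value == nodata:
                continue
            # Integer division maps a fine cell to the coarse cell that contains it.
            label = coarse_classes[r // factor][c // factor]
            sums[label] += value
            counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

# One row of two half-degree cells over a 3 x 6 block of 10-minute cells.
cover = [["forest", "desert"]]
index = [
    [0.6, 0.6, 0.6, 0.1, 0.1, 0.1],
    [0.6, 0.6, 0.6, 0.1, 0.1, 0.1],
    [0.6, 0.6, -9999, 0.1, 0.1, 0.4],
]
means = class_means(cover, index)
```

The coarse classes are used here only to group the finer data for analysis; no attempt is made to assign a class to an individual 10-minute cell with more confidence than the half-degree source supports.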
Instead, linkage of information between widely divergent scales must rely on linking the analytical results of complete studies, not data. Such linkage is accomplished by predicting phenomena at different scales, from analysis within a given scale range. For example, it may be difficult or impossible to predict distributions of non-dominant species based on GCM predictions on 2.5 to 5 degree grids (because they may respond more to the micro environment and biological interactions than to general conditions). However, it would be a reasonable approach to predict climatic changes on this scale, then modify the prediction with finer scale data, for example elevation and soils data, and then apply the prediction to species models at site scales. Similarly, it is a reasonable approach to extrapolate biogeochemical emissions from site scales to medium scales, using appropriate data on the geographic extent of similar sites. These may then be aggregated to produce the required coarse inputs to GCM models.
For both scaling up and scaling down, the "medium" scale of this database becomes important. On the one hand, we need aggregation methods to characterize (or "parameterize") variables at scales required by models, from finer scale states and processes. On the other hand, we need to understand variability of coarse scale phenomena due to more local conditions, to apply the predictions of global models. The same issue exists in the time dimension. Establishing methods for characterization across scales is a prerequisite for linking modeling with observational data.
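A simple aggregation from a finer to a coarser latitude/longitude grid is sketched below. It assumes that cell area can be approximated by the cosine of the latitude at the cell center, which is reasonable for small cells but is only an approximation; the function name, arguments, and example values are hypothetical, and other aggregation rules (for example, a modal class for categorical data) would be needed for other kinds of variables.

```python
import math

def aggregate(fine, factor, north_edge_deg, cell_deg, nodata=-9999):
    """Area-weighted block mean of a fine grid whose first row is northernmost."""
    rows, cols = len(fine), len(fine[0])
    coarse = []
    for r0 in range(0, rows, factor):
        coarse_row = []
        for c0 in range(0, cols, factor):
            weight_sum = 0.0
            value_sum = 0.0
            for r in range(r0, min(r0 + factor, rows)):
                # Latitude of the cell center, used as an approximate area weight.
                lat = north_edge_deg - (r + 0.5) * cell_deg
                weight = math.cos(math.radians(lat))
                for c in range(c0, min(c0 + factor, cols)):
                    value = fine[r][c]
                    if value != nodata:
                        weight_sum += weight
                        value_sum += weight * value
            coarse_row.append(value_sum / weight_sum if weight_sum > 0 else nodata)
        coarse.append(coarse_row)
    return coarse

# A 2 x 2 block of 1-degree cells, northern edge at 60 degrees, merged into one cell.
result = aggregate([[1.0, 1.0], [3.0, 3.0]], factor=2, north_edge_deg=60.0, cell_deg=1.0)
```

In this example the southern row carries slightly more weight than the northern row, so the aggregated value is a little above the unweighted mean of 2.0.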
Ecological characterization differs from research in that it is focused on synthesis of existing data and process information. Dynamic modeling, on the other hand, is usually designed for research purposes, particularly to study processes or create simulations. Some other common uses of the term "modeling" include statistical modeling, which is the search for underlying patterns that accurately describe data, and GIS or "data" modeling, which is really data synthesis -- the production of a derived dataset from the static combination and analysis of others. Perhaps the clearest distinction between characterization and modeling is that dynamic modeling (which may include probability models) claims to be valid in space and time beyond the range of available data, whereas static forms of modeling are only tenuously extrapolated beyond or between the available data, with increasing statistical error.
The goal of environmental, ecological, and socio-ecological characterization using GIS technology, is to apply accepted integration and synthesis methods to datasets and other information to provide empirically valid representations of a system. Though guided by modeling efforts, it can and perhaps should remain entirely complementary to them (i.e. as independent as possible to provide new information and valid tests).
The current database is experimental and not well suited to teach global change phenomena without considerable analysis and interpretation. Nevertheless, education and outreach need not wait for full scientific development, if the education/outreach program is appropriately tuned to the existing level of development and knowledge.
For example, it may be appropriate to teach and extend expertise in GIS and data management for global change using this database, if it is clear that the effort is targeted to appropriate experts and applied scientists. Similarly, at the level of technical development, the issues of statistical designs, quality assessment, comparative techniques, etc., are relevant for curricula. At the level of scientific support, it may be appropriate to explore concepts of characterization, scale, and modeling, extending results to other scientific groups, as well as developing corresponding education programs for global change phenomena, if the effort is appropriately combined with expert knowledge. It would not be appropriate, however, to use the data contained here as definitive information, without considerable study and interpretation.
Although not a specific goal of the project, eventual application to policy represents a fourth level of development that relies on prior development of all three preceding levels (Database Integration and Design, Technical Development, and Scientific Support and Methodology). As such, it is clear that by establishing a credible scientific effort in global characterization, we are also establishing a methodical way to support policy and avoid the inappropriate transfer of raw data to information levels.
Kineman, J.J. 1993. "What is a scientific database? Design considerations for Global Characterization in the NOAA-EPA Global Ecosystems Database Project." In: GIS and Modeling: Proceedings of the First International Workshop on Integrating GIS with Environmental Modeling, Boulder, CO., September, 1991. London: Oxford University Press. [In press]
Rosswall, T., R.G. Woodmansee, and P.G. Risser (eds.). 1988. Scales and Global Change: Spatial and Temporal Variability in Biospheric and Geospheric Processes. SCOPE 35. New York: J. Wiley. 355p.
Watson, J.F. 1978. "Ecological characterization of the coastal ecosystems of the United States and its territories." Proceedings: Energy/Environment '78. Los Angeles: Society of Petroleum Industry Biologists. pp. 47-53.
(Also subsequent publications of the Coastal Ecosystems Project, Office of Biological Services, Fish and Wildlife Service, U.S. Dept. of the Interior, Washington, D.C. 20240.)