Stephen S. Murray
and
Robert J. Hanisch
December 1995
Smithsonian Astrophysical Observatory
Space Telescope Science Institute
The astronomical research community has assembled large and diverse data holdings - catalogs, databases, simulations, and digital data archives from both space- and ground-based telescopes - most of which are available on-line via network services. The goal of this workshop was to bring together representatives of major data providers, astronomical researchers, and technology experts to find a way to link these data resources so that astronomers can easily locate data of interest regardless of which of the many facilities actually holds that data. Participants in the workshop reviewed current data resources and technologies and defined the requirements for a common data search facility. An approach to providing this facility that imposes minimal impact on data providers, utilizes existing network access tools, and requires little or no addition to project budgets for implementation, was discussed and developed. A prototype implementation is planned for early 1996.
The Astrophysics Science Operations Management Operations Working Group (SOMOWG) recommended to the NASA Astrophysics Science Operations Branch that a workshop be organized in order to bring together data providers, representative users, and network information system specialists to see if there is a low impact (low cost) solution to providing a general data locator function for astronomical data. This workshop was organized by Dr. Stephen Murray (SAO) and Dr. Robert Hanisch (STScI) and held at SAO on 27-28 September 1995. Participants in the workshop are listed below.
| Roger Brissenden | SAO/ASC |
| Cynthia Cheung | NASA/GSFC NSSDC |
| Richard Crutcher | NCSA/U Illinois |
| Daniel Durand | DAO/CADC |
| Jim Fullton | CNIDR |
| Robert Hanisch | STScI |
| George Helou | IPAC/NED (e-mail input) |
| Susan Kleinmann | U. Massachusetts |
| Tom McGlynn | NASA/GSFC HEASARC |
| Stephen Murray | SAO |
| Eric Olson | UC Berkeley CEA |
| Benoit Pirenne | ST-ECF |
| Karen Strom | Five College Observatory |
| Douglas Tody | NOAO |
| Gustaaf van Moorsel | NRAO |
| Michael Van Steenberg | NASA/GSFC NSSDC |
| Archibald Warnock | A/WWW Enterprises |
| Marc Wenger | CDS |
Participants worked to define the primary user requirements for a data locator service, to determine the technology tools needed and available for implementing such a service, and to understand the impact on existing data services. That impact was required to be minimal: the data locator service is not intended as a replacement for the user interfaces and data services already provided by each data center.
The diversity and volume of astronomical data available via electronic means have become so large that it is often impossible to keep track of what information exists about a particular object or region of the sky. With current technology (especially the World Wide Web) it should be possible to provide an easy way to determine the existence and location of data sets of interest via queries to a relatively simple network service.
Astronomers have a wealth of network-based research tools available to them, ranging from mission- and facility-based data archive and catalog services (with great depth of information for particular projects) to very broad, loosely coordinated facilities such as the AstroWeb (a volunteer-based effort to maintain a list of WWW sites relevant to astronomy). However, it is not possible to ask a simple question such as "where can I find optical or UV images of 3C273" without doing extensive, manual searches on separate data services.
The purpose of the workshop was to come up with a plan for an enabling technology or standard to allow, but not mandate, interoperability of the various astronomy data holdings scattered around the world.
The point was to define something that each site operating a data server could implement to tie into this loose network. A distributed, heterogeneous approach with a low threshold of acceptance was preferred over a "big systems" approach mandating that things be done a certain way or emphasizing any form of centralization. In other words, the goal was to utilize technology like the World Wide Web, which is simple enough to allow any site to participate, and which is fully distributed.
A major component of the workshop was a series of presentations by the participants representing data providers on the services already in existence. All data providers have WWW services that allow at least a basic search capability, and many also provide more specialized interfaces that support complex queries and/or complex data structures. As WWW services improve it is likely to be possible to support even these more complicated user access functions via the WWW. These data services are summarized below, and a more detailed listing is given in the Appendix.
Perhaps the strongest element of commonality amongst the astronomical data providers is that the various user interfaces allow specification of an object position (often provided via either the SIMBAD or NED name resolution services) and an object name. This in itself indicates that a unified search facility should not be difficult to achieve.
Archives of both ground- and space-based data sets are of substantial size (exceeding 1 TB and growing at rates of 1 TB/year or more) and have associated catalogs of varying completeness and complexity. Ground-based archive facilities are generally less complete or sophisticated, often acting more as back-up services than true archives, but there is clearly a desire to expand upon these services. Observation catalogs are now available from a number of ground-based observatories.
Data centers also have a large number of astronomical catalogs available and on-line searchable. The CDS, for example, provides over 800 catalogs. Other data providers are beginning to archive reduced data (e.g., the BIMA mm-array consortium now requires observers to provide their reduced, final images to the NCSA/UIUC Digital Image Library), and we expect that simulation data sets will also be available for on-line retrieval.
The problem remains, however, that a general search for data on a given object or given region of the sky requires the astronomer to manually search each data set or catalog.
A number of tools and resources have been developed in the past few years that enable one to integrate network information resources and build distributed systems. The World Wide Web, based on hypertext documents and HTTP servers, is the foremost example of this technology. However, the implementation of a distributed data search and retrieval system requires one to exchange requests for information that are well structured. The ANSI Z39.50 protocol provides a mechanism for the exchange of structured queries and responses. HTML/HTTP is probably adequate for prototyping this system and has the immediate benefit of widespread distribution and support.
A distributed astronomical software documentation service (ASDS) is now being developed by two of the workshop participants (Hanisch and Warnock) using ISITE, software developed by CNIDR. Aspects of this system may be appropriate for use in a distributed data searching service, though ASDS is really designed for text search and retrieval.
The key element for success is defining minimal acceptable standards to be used within these technological frameworks, standards that support the resolution of queries establishing the existence of data that meet the user's specifications.
From the perspective of an astronomer using the system, a search for data on a given object or region of the sky must carry at least a modest set of qualifications in order to avoid irrelevant information. (This is important from the perspective of data providers as well, who will be concerned about handling poorly qualified queries that result in hits against a great many records in their catalogs and databases.) The minimum set of qualifications for a query is given in Table 2.
| NAME | object name specification |
| POSITION | a,d,r or amin,amax, dmin,dmax |
| | (positions should be available through name resolution services: NED, SIMBAD) |
| DATA CLASS | pointed observation |
| | catalog |
| | survey |
| | reference data |
| | simulation |
| DATA TYPE | image |
| | spectrum |
| | time series |
| | flux (photometric) measurement |
| | visibility data |
| BANDPASS | gamma-ray |
| | x-ray (high energy, low energy) |
| | UV (EUV, FUV, UV, NUV) |
| | optical |
| | IR |
| | mm/sub-mm |
| | radio |
| TIME | date or range of dates |
| OBSERVATORY | mission, facility, or observatory |
Given a specification of the sort shown in Table 2 (e.g., via an HTML form), queries would be distributed to known data providers. Responses would be returned if a data provider holds data relevant to the query, in the form of a simple yes/no or preferably an indication of the amount of data (e.g., number of hits in the catalog or database) that pertain. The return response gives a hypertext link to the data provider that allows for further refinement of the search criteria and retrieval of the data.
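The query-and-response flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design agreed at the workshop: the class names `LocatorQuery` and `LocatorResponse`, the field names, the stand-in provider functions, and the example URL are all hypothetical. Only the qualifier categories themselves come from Table 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LocatorQuery:
    """Query qualifiers modeled on Table 2 (field names are assumptions)."""
    name: Optional[str] = None         # object name, e.g. "3C273"
    data_class: Optional[str] = None   # e.g. "pointed observation", "catalog"
    data_type: Optional[str] = None    # e.g. "image", "spectrum"
    bandpass: Optional[str] = None     # e.g. "optical", "UV"
    observatory: Optional[str] = None  # mission, facility, or observatory


@dataclass
class LocatorResponse:
    provider: str  # which data provider answered
    hits: int      # number of matching records
    link: str      # hypertext link for refining the search at the provider


# A provider is modeled as a function that returns a response,
# or None when it holds nothing relevant to the query.
Provider = Callable[[LocatorQuery], Optional[LocatorResponse]]


def distribute(query: LocatorQuery, providers: List[Provider]) -> List[LocatorResponse]:
    """Send the query to every known provider and keep only positive answers."""
    responses = []
    for ask in providers:
        answer = ask(query)
        if answer is not None and answer.hits > 0:
            responses.append(answer)
    return responses


# Two stand-in providers with made-up holdings, for illustration only.
def provider_a(q: LocatorQuery) -> Optional[LocatorResponse]:
    if q.name == "3C273" and q.bandpass in (None, "optical"):
        return LocatorResponse("Provider A", 12,
                               "http://provider-a.example/search?name=3C273")
    return None


def provider_b(q: LocatorQuery) -> Optional[LocatorResponse]:
    return None  # holds nothing relevant


results = distribute(
    LocatorQuery(name="3C273", data_type="image", bandpass="optical"),
    [provider_a, provider_b],
)
for r in results:
    print(f"{r.provider}: {r.hits} hits -> {r.link}")
```

Providers with nothing to report simply drop out of the result list, so the user sees only the sites worth visiting, each with a link for refining the search there.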
It was not possible during a 1 1/2 day workshop to fully define all aspects of the system implementation. However, a few basic design principles were generally agreed upon, and details will be fleshed out in the coming weeks. There are three primary components needed in order to implement this type of service:
The fields or attributes required for the data holdings and data locator queries still need to be defined. At any given time these will be fixed and predefined, but new fields could be added in the future as the facility evolves. A given server may not support all of the possible fields, in which case they are not used to refine the query.
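One way a server might ignore fields it does not support, as described above, is sketched here. The supported-field set, the toy holdings list, and the function name `count_hits` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Fields this hypothetical server understands; all others are ignored.
SUPPORTED_FIELDS = {"name", "bandpass", "data_type"}

# Toy holdings list standing in for a provider's catalog.
HOLDINGS: List[Dict[str, str]] = [
    {"name": "3C273", "bandpass": "optical", "data_type": "image"},
    {"name": "3C273", "bandpass": "UV", "data_type": "spectrum"},
    {"name": "M87", "bandpass": "radio", "data_type": "image"},
]


def count_hits(query: Dict[str, str]) -> int:
    """Count records matching the query, using only the supported fields."""
    usable = {k: v for k, v in query.items() if k in SUPPORTED_FIELDS}
    return sum(all(rec.get(k) == v for k, v in usable.items()) for rec in HOLDINGS)


# "observatory" is not supported here, so it does not narrow the search.
print(count_hits({"name": "3C273", "observatory": "HST"}))  # 2
print(count_hits({"name": "3C273", "bandpass": "UV"}))      # 1
```

Because an unsupported field is dropped and not treated as a mismatch, a server that knows fewer fields returns a superset of the ideal answer. That suits a locator whose job is only to establish that relevant data may exist.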
The implementation strategy described above places minimal requirements on data providers. An agreement needs to be reached on interfaces - data profiles and standard/minimal conformance for responses to queries - but there is no need for data providers to modify their existing data access facilities.
Data providers would have to either register their servers with a centralized coordinator, or broadcast the services they provide using an agreed upon syntax. This registry information would need to include
Data providers need to be protected against poorly qualified queries that can result in huge numbers of hits against their databases. A simple way to limit such queries is to limit the number of hits the server will provide (e.g., 50) in response to a single query. In principle, however, the vulnerability here is no different than what already exists given that users can access the data providers' catalogs and databases now, one at a time.
While we envision participants in the workshop as being the first to (jointly?) implement a client user interface, the use of standard interfaces will allow any number of users and data providers to develop or customize the client applications. Browsers can be made as simple or as sophisticated as desired, or as resources permit.
The workshop participants agreed that a prototype system could be developed in a matter of a few months, and that future conferences (ADASS meeting, October 1995; AAS meeting, January 1996) will provide a chance for ongoing discussion. It should be possible to get a prototype going which links a limited number of data providers by early 1996.
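A hit limit of this kind could be as simple as the sketch below. The limit of 50 is the example value from the text. The function name `limited_response` and the flag telling the client that results were cut off are assumptions added for illustration.

```python
from typing import List, Tuple

MAX_HITS = 50  # the example limit mentioned in the text


def limited_response(matches: List[str], max_hits: int = MAX_HITS) -> Tuple[List[str], bool]:
    """Return at most max_hits records plus a flag saying whether more exist."""
    truncated = len(matches) > max_hits
    return matches[:max_hits], truncated


# A poorly qualified query that matches 120 records is cut off at 50.
records = [f"record-{i}" for i in range(120)]
returned, truncated = limited_response(records)
print(len(returned), truncated)  # 50 True
```

Returning a truncation flag lets the client suggest that the user add further qualifiers.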
The prototype can probably be built using existing resources or with extremely modest augmentations to already funded projects. Participants felt it was important to continue the coordination and planning efforts as exemplified by this first workshop on an annual basis, with meetings planned in conjunction with events such as the ADASS Conference. A modest incremental budget might be needed to support these meetings. Overall, however, the cost of implementing a distributed data locator service should be trivial compared to the cost of acquiring the original data and supporting the mission- and facility-based archives and databases.
The table below gives a synopsis of the data services provided by the organizations represented at the workshop; it does not list all data services in astronomy.
| Organization | Data Holdings | Volume |
| HEASARC | Various high-energy data sets (ROSAT, Einstein, etc.) | 400 GB + 2-4 GB/day for XTE operations |
| | Astronomical catalogs, all wavelength regions | 144 catalogs |
| | "skymap" virtual telescope | |
| NSSDC/NDADS | Various NASA mission data sets | ?? GB |
| | Astronomical catalogs, all wavelength regions | ?? catalogs |
| CEA | EUVE data | 100 GB reduced, 500 GB raw |
| NCSA/U Ill | BIMA data (mm-array radio telescope) | just beginning |
| | Astronomy Digital Image Library | just beginning |
| DAO/CADC | HST data (partial copy) | 500 GB ?? |
| | CFHT archive | ?? GB |
| ST-ECF | HST data (full copy) | 1800 GB |
| | ESO archives | ?? GB |
| | Infrared Space Observatory archive | not yet launched, 2000 GB when complete |
| STScI | HST data | 1800 GB + 750 GB/yr |
| | VLA FIRST Survey, radio data | 10 GB |
| | Digitized Sky Survey | 60 GB |
| NRAO | VLA data (visibilities) | 1500 GB + 200 GB/yr |
| NOAO | KPNO data | 900 GB + 500 GB/yr |
| SAO/ASC | Einstein mission | ?? GB |
| | AXAF mission | not yet launched, 2000 GB when complete |
| CDS | SIMBAD database (1,000,000 objects), 800 catalogs | |