CIRSS Researchers at 2012 Research Data Access and Preservation (RDAP) Summit
CIRSS researchers presented two posters at the third annual ASIS&T Research Data Access and Preservation (RDAP) Summit (http://rdap12.posterous.com), held 22-23 March 2012 in New Orleans, LA. Topics explored at this year's summit included data management plans and policies; training of data management practitioners; discovery of research data; data curation service models; sustainability of data management; and data curation.
The two posters report on CIRSS activities on the Data Conservancy project (http://dataconservancy.org), funded by NSF and led by partners at Johns Hopkins University.
-------------
What Dataset Descriptions Actually Describe: Using the Systematic Assertion Model to Connect Theory and Practice
Karen Wickett, Andrea Thomer, Simone Sacchi, Karen S. Baker, David Dubin
Available at: http://hdl.handle.net/2142/30470
Scientific data is encoded and described with the aim of supporting retrieval, meaningful interpretation, and reuse. Encoding standards for datasets, such as FGDC, DwC, and EML, typically include tagged metadata elements along with the encoded data, suggesting that, per the Dublin Core 1:1 principle, those elements apply to one and only one entity (a specimen, observation, dataset, etc.). In practice, however, these vocabularies are often used to describe different dimensions of scientific data collection and communication processes. Discriminating these aspects offers a more precise account of how symbols and the propositions they express acquire the status of “data” and “data content,” respectively.
In this poster we present an analysis of species occurrence records based on the Systematic Assertion Model (SAM) [DWS]. SAM is a framework for describing the encoding and representation of scientific data, bridging the gap between data preservation models and discipline-specific scientific ontologies. The model is intended to be general enough for any scientific domain, and not bound to any particular methodology or field of study. Because species occurrence records are frequently re-used, migrated across systems, and shared, they are a good target for analysis.
Sample data is reviewed in the context of SAM, and analyzed with respect to the provenance events, entities, and relationships governing our definitions of data and data content. The exercise serves to:
1. highlight targets for data description (expression, content, assertion, justification).
2. inform the discovery of anomalous or missing contextual/background information.
3. frame a comparison of generic metadata standards (e.g. Dublin Core) with standards created specifically for scientific use (FGDC, DwC, EML).
4. clarify competing criteria for the identification of data that is tied to the scientific assertions carried by a dataset, and not specific to the details of a format or encoding.
-----------------
Integrating Conceptual and Empirical Studies of Data to Guide Curatorial Processes
Carole L. Palmer, Tiffany C. Chao, Nicholas M. Weber, Simone Sacchi, Karen M. Wickett, Allen H. Renear, Karen Baker, Andrea Thomer, & David Dubin
Two research teams within the Data Conservancy (http://dataconservancy.org/) project are investigating different aspects of scientific data curation. Data Concepts is developing a conceptual model to foster shared understanding of identity conditions and representation levels for data sets. Data Practices is conducting qualitative studies of data production and use in the earth and life sciences, analyzing curation needs, cultures of sharing, and re-use potential across disciplines. This poster will illustrate the integration of results from three phases of research to develop a more comprehensive and practical analysis of fundamental aspects of data curation.
• Phase 1, Data Concepts team - Preliminary framework for definitions of “dataset” based on review of technical documentation and scientific literature, to support curation and integration of data across disciplines. Found four common features across definitions (grouping, content, relatedness, and purpose) and elaborated each based on evidence from the literature.
• Phase 2, Data Practices team - Conceptual mapping of data characteristics, data practices, and curation activities, consisting of approximately 145 terms. Emphasizes relationships between data practices and curatorial activities for application to description and assessment of curation services.
• Phase 3, Data Practices team - Analytic potential concept developed as a theoretical approach to assessing the value of data beyond its original intended use. Extends Hjørland’s (1997) notion of “epistemological potential”, acknowledging the essential condition of preservation readiness and two key interrelated factors, potential user communities and fit for purpose.
We will demonstrate how the Phase 1 framework has been tested and extended based on empirical data and analysis from Phases 2-3. In particular, we show how scientists’ practices and ideas about meaningful units of data adhere to and diverge from the framework’s conception of “grouping”. We also identify and discuss additional elements of “purpose” needed to inform the curatorial processes of selection and appraisal and set curation priorities for making data fit for long-term use.
References
Cragin, M.H., Palmer, C.L., & Chao, T.C. (2010). Relating data practices, types, and curation functions: An empirically derived framework. Poster presented at the Annual Meeting of the American Society for Information Science & Technology (ASIS&T), October 22-27, 2010, Pittsburgh, PA.
Hjørland, B. (1997). Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science. Westport, CT: Greenwood.
Palmer, C.L., Weber, N.M., & Cragin, M.H. (2011). The analytic potential of scientific data: Understanding re-use value. Proceedings of the Annual Meeting of the American Society for Information Science & Technology (ASIS&T), October 9-12, 2011, New Orleans, LA.
Renear, A., Sacchi, S., & Wickett, K. (2010). Definitions of dataset in the scientific and technical literature. Proceedings of the Annual Meeting of the American Society for Information Science & Technology (ASIS&T), October 22-27, 2010, Pittsburgh, PA.
