GSLIS Interim Dean and affiliated CIRSS faculty member Allen Renear will deliver the opening keynote at the National Information Standards Organization (NISO) Forum, "Tracking it Back to the Source: Managing and Citing Research Data."
Taking place September 24, 2012 in Denver, Colorado, the NISO forum will address challenges posed by the exponential rise of data creation across nearly all scholarly disciplines by focusing on several new initiatives to improve community practice on data citation and data discovery.
More information is available here:
Catherine Blake will present at the Center for the Analysis of Cellular Mechanisms and Systems Biology, Montana State University, Bozeman, Montana.
Her presentation, "Claim Jumping: Bridging disciplinary boundaries using the Claim Framework," will be part of the three-day workshop "Making Sense of Biological Systems: Using Knowledge Mining to Improve and Validate Models of Living Systems," held August 23-25.
More information is available at http://www.chemistry.montana.edu/cobre/workshop/Program.html
CIRSS is one of three collaborators on a recently announced NEH award, led by the Maryland Institute for Technology in the Humanities (MITH), to develop a series of data curation workshops for humanities scholars, librarians, and archivists. The NEH award announcement and further details on the project are reproduced below from MITH PI Trevor Muñoz's 26 July 2012 blog post:
MITH is pleased to announce an award from the National Endowment for the Humanities 2012 Institutes for Advanced Topics in the Digital Humanities competition for a series of workshops on data curation for humanities scholars, librarians, and archivists interested in sustaining meaningful access to humanities research materials.
The Digital Humanities Data Curation Institutes project, directed by Trevor Muñoz, Associate Director of MITH and Assistant Dean for Digital Humanities Research, University Libraries, will facilitate a multi-institutional collaboration between MITH and the University Libraries at the University of Maryland, the Women Writers Project (WWP) at Brown University, and the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science (GSLIS), at the University of Illinois, Urbana-Champaign to provide three workshops during 2013.
The practice of cutting-edge humanities research increasingly involves acquisition, synthesis, and management of data in digital form. The theoretical knowledge and practical skills of information science, librarianship, and archival science represent a vital component of the skill set that will be required to succeed in the rapidly transforming landscape of the academy and the wider society.
Digital Humanities Data Curation institutes will serve as opportunities for participants with all levels of expertise, from beginners to the most advanced, to receive guidance in understanding the role of data curation in enriching humanities research projects. By the conclusion of each institute, participants will be adept at formulating solutions for existing challenges and will be able to document their data curation strategies in the form of data curation plans and strategic risk assessments, key elements of innovative digital scholarship.
A core resource for the Institute will be the Digital Humanities Curation Guide (DH Curation: http://guide.dhcuration.org) developed at GSLIS. The Guide allows instructors and participants to share scholarly knowledge about literature, tools, projects, and standards relevant to curating humanities data. A forum through which knowledge developed at the institute can be shared with the broader research community, the Guide will allow for the aggregation of resources and responses from across the Institute's three events. Julia Flanders (WWP) and Dorothea Salo (Faculty Associate in the School of Library and Information Studies at the University of Wisconsin at Madison) will serve as co-instructors alongside Muñoz for the three institute events and will contribute resources to the Guide.
Applications to join this cohort of scholars focused on discipline-specific curation practices and skills will be announced in late fall 2012 with the Institute beginning in Spring 2013. For more information, please visit: http://mith.umd.edu/research/project/data-curation/.
[Reproduced from MITH PI Trevor Muñoz's 26 July 2012 blog post.]
CIRSS RA Ashley Clark will be presenting her paper entitled "Meta: Exploring the Provenance of XSL Transformations" at Balisage in early August. The paper is the product of her work with the Data Curation Education Program in the Humanities. The following is the article abstract:
"When documents are transformed with XSLT, what methods can be used to understand and record those transformations? Though they aren't specifically meant for provenance capture, existing tools and informal practices can be used to manually piece together the provenance of XSLTs. However, a meta-stylesheet approach has the potential to generate provenance information by creating a copy of XSLT stylesheets with provenance-specific instructions. This method is currently being implemented, using the strategies and workflows detailed here. Even with the complications and limitations of the method, XSLT itself enables a surprising amount of provenance capture."
The DH Curation Guide will be officially launched during the Digital Humanities 2012 conference, held July 16-22 at the University of Hamburg in Germany. A product of the Data Curation Education Program for the Humanities (DCEP-H), funded by IMLS, the DH Curation Guide is an online educational resource for students and professionals that offers expert-written articles about digital humanities data curation concepts (http://guide.dhcuration.org/). Trevor Muñoz (UMD) and Robin Davis, both alums of CIRSS, and Julia Flanders (Brown) are the managing co-editors of the project, which features six articles written by contributing editors who are experts in their fields, with six more articles coming soon. The web site, currently available as a beta release, has seen a positive response from the data curation and digital humanities communities, with over 500 page views since May 8 and supportive feedback through Twitter (@dhcuration).
CIRSS Associate Director Cathy Blake has been awarded a Laura Bush 21st Century Librarian Program grant from the Institute of Museum and Library Services totaling $498,777 to create a specialization in Sociotechnical Data Analytics (SODA) within both the master’s and doctoral degrees.
"One of the exciting aspects about the SODA education program is the dual emphasis on social and technical aspects of data analytics," Blake, who will serve as principal investigator on the project. In addition to the mathematical modeling that typifies data analytics, students who graduate from the GSLIS SODA program will also understand the social, ethical, and policy aspects of big data. "That combination will make students uniquely prepared to fill the growing workforce gap in people who can effectively manage and analyze big data—a gap that, according to The McKinsey Global Institute report on Big Data, will culminate in a shortfall of 1.5 million data-savvy managers and analysts by 2018," she said.
The SODA research group was formed in 2010. The group, which includes faculty members Jana Diesner, J. Stephen Downie, Miles Efron, Brant Houston, Jerome McDonough, Vetle Torvik, and Michael Twidale, is part of the Center for Informatics Research in Science and Scholarship (CIRSS), where Blake serves as associate director. SODA research explores how best to design, develop, and evaluate new technologies in order to better understand the dynamic interplay between information, people, and technology. Group members conduct research in information retrieval, data and text mining, knowledge discovery, social computing, collaboration, and, most recently, network analysis.
"I am thrilled that we will now be able to formalize an educational program that mirrors the outstanding faculty research," Blake said. "In addition to our faculty, we have some great partners who will enable us to better integrate real-world data sets into the classroom, and augment the classroom experience with a hands-on practicum and projects where students work side-by-side with scientists and business analysts."
The new program will complement an existing Specialization in Data Curation led by Carole Palmer, GSLIS professor and director of CIRSS, as well as the Certificate of Advanced Study in Digital Libraries led by Jerome McDonough. "SODA is just one more piece in the evolving constellation of programs that give the next generation of information professionals the expertise they need to thrive in the information age," said Blake.
"Centuries of Knowledge: Graduate School of Library and Information Science Data Curation Education Program" (DCEP), one of our most successful and long-running grant projects, concluded this December, and the final report is now available through IDEALS: http://hdl.handle.net/2142/30845.
DCEP was funded by the Institute of Museum and Library Services (RE-05-05-0036) and was designed to increase educational and research capacity in data curation at GSLIS. In the first year of the project we developed the Data Curation Education Program, a specialization within our Master of Science degree program. As of May 2012, 54 students have completed the specialization, and our alumni have gone on to secure positions in a variety of institutions, including research centers, academic libraries, government agencies, and corporate industry.
In developing the curriculum for the specialization, we also created two new courses: Foundations of Data Curation, a survey course on the emerging field, and Digital Preservation. Along with Systems Analysis and Management, they constitute the core required courses in the specialization, and they are consistently enrolled at capacity each semester.
In 2008, we conducted the first annual Summer Institute on Data Curation for practicing information professionals, facilitating the development of a community of practice across U.S. and Canadian academic and research organizations. Our outreach and service activities have led to a range of new partnerships, which have produced student fieldwork opportunities and new collaborative research and education activities at CIRSS and have resulted in four successful grant proposals.
Ana Lucic will be defending her CAS project entitled "Characterizing Authorship Style: Contrasting Linguistic and Statistical Strategies."
Abstract: Of the five categories of features used in authorship attribution studies, namely lexical, character, syntactic, semantic, and application-specific (Stamatatos, 2009, p. 540), it is the lexical and character features that have been tested most extensively. Syntactic, semantic, and application-specific features, also called high-level features, have been tested with far less consistency. The process of extraction of high-level features is more complex, usually depends on the availability of a parser, and generally is not as reliable as the extraction of surface-level features. High-level features, however, provide a more complex view of the text and thus have the potential to be a stronger marker of difference between different authorial styles than surface-level features. In this study, we explore the potential of syntactic dependencies which hold between two words in a sentence to separate writing styles in an authorship attribution study which uses a collection of movie reviews downloaded from the film database service imdb.com (Seroussi, 2010). Rather than focusing on syntactic dependencies on the level of the entire text of the review, we focus on personal names and on syntactic dependencies which occur immediately before and immediately after personal names. The references to personal names and the grammatical structures that govern these references thus become the key feature for this analysis. The exploratory principal component analysis conducted on this feature revealed its high variability, which speaks to the potential of diverse ways reviewers refer to people to be a strong marker of difference between authorial styles.
The final manuscript is available from the front office.
Date: Thursday, May 10
Location: Room 242
Project Advisor: Dr. Catherine Blake
Committee Members: Dr. David Dubin, Dr. John Unsworth
GSLIS Professor and CIRSS Director Carole Palmer recently shared her thoughts in the University of Illinois feature, "A Minute With . . .," following the Obama administration's announcement of a $200 million research initiative in "big data" computing:
- Informatics is about methods and strategies for using information in organizations, networks, cultures, and societies. Our job is to make advances that help people get access to and work with information to solve problems and make new discoveries.
- The definition of data curation that we promote is the active and ongoing management of data through its life cycle of interest and usefulness to scholarship, science, and education.
- Data are very valuable assets—the raw materials of research—with tremendous potential for re-use in new and innovative ways. But digital data are high risk—extremely fragile and with few standards of good practice.
- We study how to collect and add value to data, to promote sharing and integration across institutions and fields of research, looking at both technical and social problems in making data a collective, shared resource.
- The Data Conservancy (http://dataconservancy.org) is a large multi-institutional collaboration led by Johns Hopkins University. We are partners, contributing to research and education through our data curation initiatives at CIRSS.
GSLIS has been at the forefront of data curation education since launching its specialization within the Master of Science degree in 2006, beginning with a focus on the sciences and expanding to include the humanities in 2008. Currently, more than 50 students enroll each year in the Foundations of Data Curation course, with many completing the GSLIS Specialization in Data Curation (http://www.lis.illinois.edu/academics/programs/ms/data_curation).
The interview with Professor Palmer was conducted by Dusty Rhodes, news editor for the U of I News Bureau. Read the full interview at http://illinois.edu/lb/article/72/62055.
CIRSS researchers presented two posters at the third annual ASIS&T Research Data Access and Preservation (RDAP) Summit (http://rdap12.posterous.com), held 22-23 March 2012 in New Orleans, LA. Topics explored at this year's summit included data management plans and policies; training of data management practitioners; discovery of research data; data curation service models; sustainability of data management; and data curation.
The two posters report on CIRSS activities on the Data Conservancy project (http://dataconservancy.org), funded by NSF and led by partners at Johns Hopkins University.
What Dataset Descriptions Actually Describe: Using the Systematic Assertion Model to Connect Theory and Practice
Karen Wickett, Andrea Thomer, Simone Sacchi, Karen S. Baker, David Dubin
Available at: http://hdl.handle.net/2142/30470
Scientific data is encoded and described with the aim of supporting retrieval, meaningful interpretation and reuse. Encoding standards for datasets such as FGDC, DwC, and EML typically include tagged metadata elements along with the encoded data, suggesting that, per the Dublin Core 1:1 principle, those elements apply to one and only one entity (a specimen, observation, dataset, etc.). However, in practice vocabularies are often used to describe different dimensions of scientific data collection and communication processes. Discriminating these aspects offers a more precise account of how symbols and the propositions they express acquire the status of “data” and “data content,” respectively.
In this poster we present an analysis of species occurrence records based on the Systematic Assertion Model (SAM) [DWS]. SAM is a framework for describing the encoding and representation of scientific data, bridging the gap between data preservation models and discipline-specific scientific ontologies. The model is intended to be general enough for any scientific domain, and not bound to any particular methodology or field of study. Since species occurrence records are a kind of data that is frequently re-used, migrated across systems, and shared, they are a good target for analysis.
Sample data is reviewed in the context of SAM, and analyzed with respect to the provenance events, entities, and relationships governing our definitions of data and data content. The exercise serves to:
1. highlight targets for data description (expression, content, assertion, justification).
2. inform the discovery of anomalous or missing contextual/background information.
3. frame a comparison of generic metadata standards (e.g. Dublin Core) with standards created specifically for scientific use (FGDC, DwC, EML).
4. clarify competing criteria for the identification of data that is tied to the scientific assertions carried by a dataset, and not specific to the details of a format or encoding.
Integrating Conceptual and Empirical Studies of Data to Guide Curatorial Processes
Carole L. Palmer, Tiffany C. Chao, Nicholas M. Weber, Simone Sacchi, Karen M. Wickett, Allen H. Renear, Karen Baker, Andrea Thomer, & David Dubin
Two research teams within the Data Conservancy (http://dataconservancy.org/) project are investigating different aspects of scientific data curation. Data Concepts is developing a conceptual model to foster shared understanding of identity conditions and representation levels for data sets. Data Practices is conducting qualitative studies of data production and use in the earth and life sciences, analyzing curation needs, cultures of sharing, and re-use potential across disciplines. This poster will illustrate the integration of results from three phases of research to develop a more comprehensive and practical analysis of fundamental aspects of data curation.
• Phase 1, Data Concepts team - Preliminary framework for definitions of “dataset” based on review of technical documentation and scientific literature, to support curation and integration of data across disciplines. Found four common features across definitions (grouping, content, relatedness, and purpose) and elaborated each based on evidence from the literature.
• Phase 2, Data Practices team - Conceptual mapping of data characteristics, data practices, and curation activities, consisting of approximately 145 terms. Emphasizes relationships between data practices and curatorial activities for application to description and assessment of curation services.
• Phase 3, Data Practices team - Analytic potential concept developed as a theoretical approach to assessing the value of data beyond its original intended use. Extends Hjørland’s (1997) notion of “epistemological potential”, acknowledging the essential condition of preservation readiness and two key interrelated factors, potential user communities and fit for purpose.
We will demonstrate how the Phase 1 framework has been tested and extended based on empirical data and analysis from Phases 2-3. In particular, we show how scientists’ practices and ideas about meaningful units of data adhere to and diverge from the framework’s conception of “grouping”. We also identify and discuss additional elements of “purpose” needed to inform the curatorial processes of selection and appraisal and set curation priorities for making data fit for long-term use.
Cragin, M.H., Palmer, C.L., & Chao, T.C. (2010). Relating data practices, types, and curation functions: An empirically derived framework. Poster presented at the Annual Meeting of the American Society for Information Science & Technology (ASIS&T), October 22-27, 2010, Pittsburgh, PA.
Hjørland, B. (1997). Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science. Westport, CT: Greenwood.
Palmer, C.L., Weber, N.M., & Cragin, M.H. (2011). The analytic potential of scientific data: Understanding re-use value. Proceedings of the Annual Meeting of the American Society for Information Science & Technology (ASIS&T), October 9-12, 2011, New Orleans, LA.
Renear, A., Sacchi, S., & Wickett, K. (2010). Definitions of dataset in the scientific and technical literature. Proceedings of the Annual Meeting of the American Society for Information Science & Technology (ASIS&T), October 22-27, 2010, Pittsburgh, PA.