Quality Assurance of Data Sets: A 2023 NISO Training Series

Scope

This 8-week course on Quality Assurance of Data Sets will cover three main topics: Cleaning, Provenance, and Discoverability. Participants will learn about the importance of data quality and its impact on decision-making, as well as techniques for improving data quality. By the end of the course, participants will be able to:

  • Use data cleaning techniques to detect and correct errors, missing values, and inconsistencies in data sets
  • Understand and capture data provenance information to track and explain data cleaning operations 
  • Understand the impact of data cleaning on downstream analysis and detect intentional and unintentional problems in data

Training Facilitator

Leslie D. McIntosh, PhD, VP, Research Integrity | Digital Science

Dr. Leslie McIntosh is the VP of Research Integrity at Digital Science and dedicates her work to improving research and investigating and reducing mis- and dis-information in science.

As an academic turned entrepreneur, she founded Ripeta in 2017 to improve research quality and integrity. Ripeta is now part of Digital Science, and its algorithms lead in detecting trust markers in research manuscripts. She works around the globe with governments, publishers, institutions, and companies to improve research and scientific decision-making. She has given hundreds of talks, including to the US-NIH, NASA, and the World Congress on Research Integrity, and has consulted with the US, Canadian, and European governments. Dr. McIntosh’s work was the most-read Retraction Watch post of 2022.

Dr. McIntosh has dedicated her work to improving science. Since 2014, this work has focused on the need for reproducible science, transparent reporting of science, and trust in science.

She has experience leading diverse teams to develop and deliver meaningful data to improve scientific decisions. Dr. McIntosh is an accomplished biomedical informatician and data scientist. As an internationally known consultant and speaker, she is passionate about mentoring the next generation of data scientists. Like others, she stands on the shoulders of her ancestors. First, her great-grandmother, who labored picking cotton while supporting her seven children. Second, one who would be her grandmother. Third, one who would become the first person in her family to obtain a university-level education, going on to become a scientist and an inspiration.  

She holds a Master's and a PhD in Public Health with concentrations in Biostatistics and Epidemiology from Saint Louis University and a certificate from the Women’s Leadership Forum at Washington University’s Olin Business School.

Event Sessions

Thursday, April 27, 2023: Introduction to Data Quality Assurance

  • Overview of the course and learning objectives
  • The importance of data quality and its impact on decision-making
  • Basic concepts and terminology of data quality assurance
  • Introduction to the three main topics: Cleaning, Provenance, and Discoverability

Resource shared by an attendee:

From the Journal of Clinical Epidemiology: "Many researchers were not compliant with their published data sharing statement: a mixed-methods study" by Mirko Gabelica, Ružica Bojčić, and Livia Puljak - The objective of the study was to analyze researchers’ compliance with their data availability statement (DAS) in manuscripts published in open-access journals with a mandatory DAS.

Thursday, May 4, 2023: Cleaning Data Sets

  • Understanding data cleaning and its role in data quality assurance
  • Techniques for detecting and correcting errors, missing values, and inconsistencies in data sets
  • Data quality metrics and standards
  • Hands-on exercises with cleaning data sets
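
As a rough illustration of the detection techniques listed above, the following minimal sketch flags missing values, invalid values, inconsistent spellings, and duplicate identifiers in a small table. It is not taken from the course materials: the records, the field names (id, country, age), and the helper functions normalize and find_issues are all hypothetical, and the rules are deliberately simple.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": "1", "country": "USA", "age": "34"},
    {"id": "2", "country": "usa ", "age": ""},
    {"id": "3", "country": "Canada", "age": "-5"},
    {"id": "3", "country": "Canada", "age": "41"},
]


def normalize(text):
    """Collapse case and surrounding whitespace so spelling variants compare equal."""
    return text.strip().casefold()


def find_issues(rows):
    """Return (row_index, issue) pairs; the input rows are not modified."""
    issues = []
    id_counts = Counter(row["id"] for row in rows)

    # Collect every raw spelling seen for each normalized country value.
    spellings = {}
    for row in rows:
        spellings.setdefault(normalize(row["country"]), set()).add(row["country"])

    for i, row in enumerate(rows):
        age = row["age"].strip()
        if age == "":
            issues.append((i, "missing age"))
        elif not age.isdigit():
            # Rejects negatives and non-numeric text in this simple rule.
            issues.append((i, "invalid age"))
        if len(spellings[normalize(row["country"])]) > 1:
            issues.append((i, "inconsistent country spelling"))
        if id_counts[row["id"]] > 1:
            issues.append((i, "duplicate id"))
    return issues


for index, issue in find_issues(records):
    print(index, issue)
```

The sketch only reports problems and leaves correction as a separate, explicit step, so that each change to the data can be reviewed and recorded.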

Important Resource:

The Retraction Watch Database

Thursday, May 11, 2023: Data Provenance

  • Introduction to data provenance and its importance in data quality assurance
  • Techniques for capturing and representing data provenance information
  • Data lineage and data pedigree
  • Case studies of data provenance in scientific and business applications

Thursday, May 25, 2023: Discoverability

  • Introduction to data discoverability and its role in data quality assurance
  • Techniques for improving the discoverability of data sets, including metadata and documentation
  • Standards for describing and sharing data sets, such as Dublin Core and DataCite
  • Hands-on exercises with metadata and documentation tools

Important Resources:

Data-PASS - Data-PASS has acquired and preserved social science data at risk of being lost to the research community, including opinion polls, voting records, large-scale surveys and other social science studies. While this information provides the full story of the social and cultural experience of America, a huge quantity of this data is missing or at risk. Significant data collections that had not been deposited in a permanent archive have been identified and acquired by the partner institutions. Through NDIIPP funding, these data have been preserved and made available through a shared catalog.

What is Differential Privacy? - Differential privacy is a rigorous mathematical definition of privacy for statistical analysis and machine learning.

Dimensions App - The world’s largest linked research database

From the Harvard 'Dataverse' - Replication Data for: Do Politicians in Power Receive Special Treatment in Courts? Evidence from India

PLOS One - The citation advantage of linking publications to research data

Thursday, June 1, 2023: Data Catalogs and Repositories

  • Introduction to data catalogs and repositories and their role in data discoverability and data quality assurance
  • Examples of data catalogs and repositories, such as Data.gov and figshare
  • Best practices for sharing and publishing data sets in repositories
  • Hands-on exercises with data repository platforms

Important Resources: 

Figshare Repository at Macquarie University - Macquarie University's research data repository (RDR). It is a specialised implementation of figshare.com, dedicated and tailored to the needs of Macquarie University-based researchers.

CoreTrustSeal - CoreTrustSeal offers to any interested data repository a core level certification based on the Core Trustworthy Data Repositories Requirements. This universal catalogue of requirements reflects the core characteristics of trustworthy data repositories.

Dataset Search (Google) - Dataset Search is a search engine for datasets. Using a simple keyword search, users can discover datasets hosted in thousands of repositories across the Web.

Ag Data Commons: U.S. Department of Agriculture - Providing Central Access to USDA’s Open Research Data

U.S. Government's Open Data - Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.

Data Carpentry - Data Carpentry develops and teaches workshops on the fundamental data skills needed to conduct research. Our mission is to provide researchers high-quality, domain-specific training covering the full lifecycle of data-driven research.

OpenRefine - OpenRefine is a powerful free, open source tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

Thursday, June 8, 2023: Data Quality Assessment & Fraud Detection

  • Techniques for assessing data quality, including data profiling and data validation
  • Case studies of data quality assessment in different domains

Tuesday, June 13, 2023: Provenance for Data Cleaning - RESCHEDULED from May

  • Provenance techniques for tracking and explaining data cleaning operations
  • Examples of using data provenance to understand the impact of data cleaning on downstream analysis
  • Hands-on exercises with provenance tools and libraries
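
One way to picture provenance tracking for cleaning operations is a log that records, for every step, what was done and which state of the data it was applied to. The sketch below is an assumption-laden illustration, not a tool used in the course: the functions fingerprint, apply_step, trim_country, and upper_country, the log fields, and the sample rows are all hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable SHA-256 digest of the data, tying a log entry to one data state."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apply_step(rows, log, name, func):
    """Apply one cleaning function to every row and append a provenance record."""
    before = fingerprint(rows)
    # Copy each row so the input state stays intact for comparison.
    cleaned = [func(dict(row)) for row in rows]
    changed = sum(1 for old, new in zip(rows, cleaned) if old != new)
    log.append({
        "step": name,
        "input_sha256": before,
        "output_sha256": fingerprint(cleaned),
        "rows_changed": changed,
    })
    return cleaned


def trim_country(row):
    row["country"] = row["country"].strip()
    return row


def upper_country(row):
    row["country"] = row["country"].upper()
    return row


# Hypothetical data, invented for illustration.
rows = [{"country": " usa"}, {"country": "USA"}]
log = []
rows = apply_step(rows, log, "trim whitespace in country", trim_country)
rows = apply_step(rows, log, "uppercase country", upper_country)

for entry in log:
    print(entry["step"], entry["rows_changed"])
```

Because each entry stores the digest of its input and output, the steps form a chain: the output digest of one step should equal the input digest of the next, which makes a missing or reordered step detectable when the log is reviewed later.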

Important Resources:

OpenRefine GREL functions - In places where OpenRefine will accept a string (s) or a regex pattern (p), you can supply a string by putting it in quotes. If you wish to use any regex notation, wrap the pattern in forward slashes.
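
The difference between a quoted string and a slash-wrapped regex pattern can be illustrated with a Python analogy. This is not GREL and does not use OpenRefine; it is only a sketch of the same distinction, in which a literal such as "a.b" is matched character for character while a pattern such as /a.b/ treats the dot as a wildcard. The sample text is invented.

```python
import re

text = "a.b axb"

# Literal string: the dot is an ordinary character, so only "a.b" matches.
literal_result = text.replace("a.b", "X")

# Regex pattern: the dot matches any single character,
# so both "a.b" and "axb" match.
regex_result = re.sub(r"a.b", "X", text)

print(literal_result)
print(regex_result)
```

The practical consequence is the same in both settings: use the literal form when the search text contains punctuation that should be matched exactly, and the pattern form only when regex notation is actually wanted.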

"Automating the Spreadsheet Conversion Process": ExcelArchivalTool - a script that was created to programmatically convert Microsoft Excel files into open source formats suitable for long-term archival.

ExceLint-addin - An add-in for Excel that finds formula errors

Make your Excel documents accessible to people with disabilities - This topic gives you step-by-step instructions and best practices for making your Excel spreadsheets accessible and unlocking your content for everyone, including people with disabilities.

Data Curation Network Modules - What is CURATE(D) Training? Data curation is the encompassing work and actions taken by researchers and curators of a research data repository or data archive in order to provide meaningful and enduring access to research data. Within the Data Curation Network we’ve identified over 50 data curation activities that may be performed by researchers or curators to make data more Findable, Accessible, Interoperable, and Reusable (FAIR).

Accessibility Data Primer - Although accessibility is a commonly-used word in the context of data curation, it’s important to clarify that it has most often been used in the more general sense of data that is findable and widely available. This includes considerations around indexing, clear context (and language), well-defined and standardized access paths, using open file formats, and paywalls/affordability. However, accessibility that explicitly takes into account the needs of a range of potential users including those with disabilities or neurodivergence is most often not included in this approach, and this shortfall needs to be addressed.

Tableau Primer - Tableau Software is a proprietary suite of products for data exploration, analysis, and visualization with an initial concentration in business intelligence. This primer focuses on the Tableau workbook files – .twb and .twbx – produced using Tableau Desktop. Like Microsoft Excel, Tableau Desktop uses a workbook and sheet file structure. Workbooks can contain worksheets, dashboards, and stories.

Thursday, June 15, 2023: Data Quality Management

  • Strategies for managing data quality throughout the data lifecycle
  • Roles and responsibilities of data quality management in organizations
  • Best practices for implementing data quality management programs
  • Review of course content and final project presentations

Important Resources:

DATA AB INITIO: Managing Research Data Right, From the Start

COPE (Committee on Publication Ethics) - COPE brings together all those involved in scholarly research and its publication to strengthen the network of support, education and debate in publication ethics

Mapping A Discipline: A Guide to Using VOSviewer for Bibliometric and Visual Analysis by James T. McAllister III, Lora Lennertz, and Zayuris Atencio Mojica

Cytoscape - Network Data Integration, Analysis, and Visualization in a Box

Additional Information

Registrants will receive unique detailed instructions about accessing each session via e-mail three business days prior to the event. Due to the widespread use of spam blockers, filters, out of office messages, etc., it is your responsibility to contact the NISO office if you do not receive login instructions before the start of the session.

If you have not received your Login Instruction e-mail by 10 a.m. (ET) on the day before the training session, please contact the NISO office at nisohq@niso.org for immediate assistance.

If you are registering someone else from your organization, either use that person's e-mail address when registering or contact nisohq@niso.org to provide alternate contact information.

Registrants will receive an e-mail message containing access information to the archived conference recording within 48 hours after the event. This recording access is only to be used by the registrant's organization. We strongly encourage you to download archived event recordings to your own local storage space to ensure that you can continue to access them.

Broadcast Platform

NISO uses the Zoom platform to broadcast its live events. Zoom provides apps for a variety of computing devices (tablets, laptops, etc.). To view the broadcast, you will need a device that supports the Zoom app. Attendees may also choose to listen to the audio only on their phones; sign-on credentials include the necessary dial-in numbers. Once notified of their availability, recordings may be downloaded from the Zoom platform to your machine for local viewing.

Please check your system in advance to make sure it meets the Zoom meeting requirements. It is your responsibility to ensure that your system is compatible prior to the beginning of each session.