Data Quality Help and Hints
Curating data to best support reproducible and FAIR use means we all need ways to address data quality, completeness, and consistency. Here we gather our collective tips on finding and fixing (and preventing) some of the more common issues.
Our TW Philosopy on data quality: we try to build in methods to prevent issues in the first place. Where we know they can happen, we try to build in tools to help you both find and fix. We also plan further development to extend our soft validation
tools which will discover issues for you and offer to fix them on click
. Note that when, where, and how you find any data anomalies will vary. And in turn, this influences the options and methods for fixing them (e. g. one-by-one, bulk annotation, scripts). For example, you might notice issues when:
- cleaning data up in a spreadsheet before upload to any CMS
- exploring your exported data with tools like OpenRefine, or via R, or via another API
- looking at feedback from another source (e. g. GBIF or iDigBio or ALA or OBIS or Bionomia)
- someone on the internet sees something and contacts you
- perusing data already in your own database
- mapping you data to migrate to another database or share with an aggregator
- using your database data visualization tools to see distinct values in a given field (e. g. Project vocabulary task in TaxonWorks) or on a map. See also Distinct Values - Why This Data Directory?
- reviewing your software repository issue-tracker (e. g. gitHub for TaxonWorks)
In structuring these hints, we group the known issues into categories: Identifiers
(e .g. catalog numbers), Time
(e. g. dates), Place
(aka geography, location), Taxon
, and Other
and Tools and Resources
Identifiers
Time
Date out-of-bounds
In TaxonWorks, different types of records have dates associated, for example: the event date
for a given Collecting Event
, or the date identified
(that is, date determined
), or the date georeferenced
. Dates out-of-expected bounds would include several kinds of impossible dates. That is, dates in the future or dates before the objects were actually ever collected or dates that are not possible with the birth and death dates for the person who collected/identified/observed/georeferenced/imaged the object/s. These could be grouped as
- date hasn't happened yet
- date is suspiciously old and
- flourit date and event date not compatible.
Filter Collecting Event by Date
Find outlier dates using the Filter Collecting Event
task, the Filter Collection Object
task, and (in development) the Project Vocabulary
task.
- Navigate to the
Filter Collecting Event
task - Scroll down to the
Collecting Event
filter section - Enter date range to search
- e. g. to check for future out-of-bounds dates try putting "tomorrow's" date in for the
start date
and some date way into the future for theEnd date
- e. g. to check for future out-of-bounds dates try putting "tomorrow's" date in for the
- Click
Filter
to see resulting records.
Find outlier dates using the (in development) Project Vocabulary
task. With this task, one can see the unique values present for a given field and how many times that string/value occurs. In the future, you will be able to then click on one of the results of the output and see the associated records having that value. For the out-of-expected-bounds-date use case, one could see odd unexpected dates easily.
Find outlier dates based on someone's lifespan, when known. In the future, you can expect that if you have the active years for a given person entered into the database, and that person is linked to a record where the date collected or identified is not within their active years, you will be able to find these records.
Fix the outlier dates found from the above Filter Collecting Event
task search.
- In the result set, you can navigate to a single record and edit that one
- You can use the download csv version of the results if you have a lot of records and want to sort by year in a spreadsheet to see the extent of the year bounds.
- You can sort by year by clicking on a given column, however, it is only sorting the records on that page (note the number of records per/page can be increased).
Place
Taxon
Other
Tools and Resources
- Data Carpentry Data Cleaning with OpenRefine
- Data Carpentry Data Organization in Spreadsheets
- OpenRefine as a great tool for
- cleaning and structuring messy data
- extending and enhancing your data
- tracking and sharing your data cleaning steps automatically
- Voyant Tools for visualizing and exploring text data
- ChatGPT proves useful in some instances too (e. g. for finding less common datums)
- Bob Mesibov’s Data Cleaner’s Cookbook
- GBIF’s data quality flags NOTE: in TaxonWorks, we pull GBIF DQ data back for you into the user-interface so you can evaluate the feedback without leaving TaxonWorks.
- iDigBio’s data quality flags
- OBIS Manual (e. g. their Data Laundry help and community) See also Gan Y-M, Perez Perez R, Provoost P, Benson A, Peralta Brichtova AC, Lawrence E, Nicholls J, Konjarla J, Sarafidou G, Saeedi H, Lear D, Penzlin A, Wambiji N, Appeltans W (2023) Promoting High-Quality Data in OBIS: Insights from the OBIS Data Quality Assessment and Enhancement Project Team . Biodiversity Information Science and Standards 7: e112018. https://doi.org/10.3897/biss.7.112018
- Linter tools offer another way to evaluate and tidy up your data. For example, you have a BibTeX file (from EndNote, Zotero, or elsewhere) that seems to have errors. You can use online tools that help you find and fix formatting (syntax) errors.
- See BibTeX Tidy as an example.
- Need to create or convert data into other formats? Some tools that help you with this part of any data transformation processes include:
- Tables Generator (e. g. HTML, LaTeX, MediaWiki, Markdown)
- PanDoc - a universal document converter