Data Quality Help and Hints

Curating data to best support reproducible and FAIR use means we all need ways to address data quality, completeness, and consistency. Here we gather our collective tips on finding and fixing (and preventing) some of the more common issues.

Our TW philosophy on data quality: we try to build in methods to prevent issues in the first place. Where we know they can happen, we try to build in tools to help you both find and fix them. We also plan further development to extend our soft validation tools, which will discover issues for you and offer to fix them on click. Note that when, where, and how you find any data anomalies will vary. This in turn influences the options and methods for fixing them (e.g. one-by-one, bulk annotation, scripts). For example, you might notice issues when:

  • cleaning data up in a spreadsheet before upload to any CMS
  • exploring your exported data with tools like OpenRefine, or via R, or via another API
  • looking at feedback from another source (e.g. GBIF, iDigBio, ALA, OBIS, or Bionomia)
  • someone on the internet sees something and contacts you
  • perusing data already in your own database
  • mapping your data to migrate to another database or share with an aggregator
  • using your database's data visualization tools to see distinct values in a given field (e.g. Project vocabulary task in TaxonWorks) or on a map. See also Distinct Values - Why This Data Directory?
  • reviewing your software repository issue tracker (e.g. GitHub for TaxonWorks)

In structuring these hints, we group the known issues into categories: Identifiers (e.g. catalog numbers), Time (e.g. dates), Place (aka geography, location), Taxon, Other, and Tools and Resources.

Identifiers

Time

Date out-of-bounds

In TaxonWorks, different types of records have associated dates, for example the event date for a given Collecting Event, the date identified (that is, date determined), or the date georeferenced. Dates outside the expected bounds include several kinds of impossible dates: dates in the future, dates before the objects could have been collected, and dates that are incompatible with the birth and death dates of the person who collected, identified, observed, georeferenced, or imaged the object(s). These can be grouped as:

  • date hasn't happened yet
  • date is suspiciously old
  • floruit date and event date are not compatible
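
If you are working with exported data outside TaxonWorks, the three groups above can be expressed as simple date comparisons. The following is only a minimal sketch: the function name `classify_date`, the labels it returns, and the cutoff for "suspiciously old" are all hypothetical choices, not part of TaxonWorks.

```python
from datetime import date
from typing import List, Optional


def classify_date(
    event_date: date,
    today: date,
    earliest_plausible: date,
    active_start: Optional[date] = None,
    active_end: Optional[date] = None,
) -> List[str]:
    """Return the out-of-bounds groups that apply to event_date (empty if none)."""
    problems = []
    # Group 1: the date has not happened yet.
    if event_date > today:
        problems.append("future")
    # Group 2: the date predates a cutoff chosen by the curator (an assumption).
    if event_date < earliest_plausible:
        problems.append("suspiciously old")
    # Group 3: the date falls outside the person's known active (floruit) years.
    before_active = active_start is not None and event_date < active_start
    after_active = active_end is not None and event_date > active_end
    if before_active or after_active:
        problems.append("outside floruit")
    return problems


today = date(2024, 6, 1)
cutoff = date(1700, 1, 1)  # hypothetical cutoff; pick one that suits your collection
print(classify_date(date(2203, 5, 4), today, cutoff))  # ['future']
print(classify_date(date(1923, 5, 4), today, cutoff,
                    date(1950, 1, 1), date(1990, 12, 31)))  # ['outside floruit']
```

A record can fall into more than one group, which is why the function returns a list rather than a single label.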

Filter Collecting Event by Date

Find outlier dates using the Filter Collecting Event task, the Filter Collection Object task, and (in development) the Project Vocabulary task.

Using the date range method to find outlier dates with the Filter Collecting Event task
  • Navigate to the Filter Collecting Event task
  • Scroll down to the Collecting Event filter section
  • Enter date range to search
    • e.g. to check for future out-of-bounds dates, try entering tomorrow's date as the Start date and a date far in the future as the End date
  • Click Filter to see resulting records.
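
The start and end values for the future-date check in the steps above can also be computed rather than typed by hand. This is a small sketch only: `future_check_range` and its `years_ahead` parameter are hypothetical, and the ISO `YYYY-MM-DD` output format is an assumption about what you will paste or pass on, not a documented requirement of the filter.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def future_check_range(today: Optional[date] = None, years_ahead: int = 500) -> Tuple[str, str]:
    """Return (start, end) as ISO strings: tomorrow through a far-future date."""
    today = today or date.today()
    start = today + timedelta(days=1)
    # Pin the end to 31 December so leap days never produce an invalid date.
    end = date(today.year + years_ahead, 12, 31)
    return start.isoformat(), end.isoformat()


start, end = future_check_range()
print(f"Start date: {start}  End date: {end}")
```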

Find outlier dates using the (in development) Project Vocabulary task. With this task, you can see the unique values present for a given field and how many times each value occurs. In the future, you will be able to click on one of the results and see the associated records having that value. For the out-of-expected-bounds date use case, odd or unexpected dates should stand out easily.
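
The same distinct-values idea can be tried on exported data while the task is in development. The sketch below is an illustration only, not how the Project Vocabulary task is implemented; `distinct_values` is a hypothetical helper and the sample years are made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def distinct_values(values: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (value, count) pairs, rarest first, since outliers tend to be rare."""
    # Strip surrounding whitespace so "1998 " and "1998" count as one value.
    counts = Counter(v.strip() for v in values)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


years = ["1998", "1998", "2203", "1999", "1998 "]  # made-up sample values
for value, count in distinct_values(years):
    print(value, count)
```

Listing the rarest values first is a design choice: a year that occurs once among thousands of records is a good candidate for a closer look.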

Find outlier dates based on someone's lifespan, when known. In the future, you can expect that if you have the active years for a given person entered into the database, and that person is linked to a record where the date collected or identified is not within their active years, you will be able to find these records.

Fix the outlier dates found from the above Filter Collecting Event task search.

  • In the result set, you can navigate to a single record and edit that one
  • If you have a lot of records, you can download the results as CSV and sort by year in a spreadsheet to see the extent of the year bounds.
    • You can also sort by year by clicking on a given column; however, this sorts only the records on the current page (note that the number of records per page can be increased).
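
If you prefer a script to a spreadsheet, the downloaded CSV can be sorted across all rows at once, which avoids the per-page limitation. This is a hedged sketch: `year_bounds` is a hypothetical helper, and the column name `start_date_year` is an assumption, so check the header row of your own download.

```python
import csv
import io
from typing import List, Tuple


def year_bounds(csv_text: str, year_column: str = "start_date_year") -> Tuple[int, int, List[dict]]:
    """Return (lowest year, highest year, rows sorted by year) for rows with a numeric year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the year is blank or not a plain number.
    rows = [row for row in reader if (row.get(year_column) or "").strip().isdigit()]
    if not rows:
        raise ValueError(f"no numeric values found in column {year_column!r}")
    rows.sort(key=lambda row: int(row[year_column]))
    return int(rows[0][year_column]), int(rows[-1][year_column]), rows


# Made-up sample; in practice read the text of your downloaded file instead.
sample = "id,start_date_year\n1,1998\n2,\n3,2203\n4,1875\n"
lowest, highest, ordered = year_bounds(sample)
print(lowest, highest)  # 1875 2203
```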

Place

Taxon

Other

Tools and Resources