DataGene: A Framework for Dataset Similarity

6 Pages. Posted: 29 Jun 2020. Last revised: 22 Oct 2020

Derek Snow

The Alan Turing Institute

Date Written: June 5, 2020

Abstract

DataGene is developed to identify dataset similarity between real and synthetic datasets, as well as between train, test, and validation datasets. Many modelling and software development tasks require datasets that share similar characteristics. This similarity has traditionally been assessed with visualizations; DataGene seeks to replace these visual methods with a range of novel quantitative methods. Please see the GitHub repository to inspect and install the Python code.
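
To illustrate what replacing a visual check with a quantitative one can look like, the sketch below scores how closely each column of a synthetic dataset matches the corresponding real column using the two-sample Kolmogorov-Smirnov statistic. This is a generic illustration and not DataGene's API: the function names, column names, and generated data are all assumptions made for the example, and the metrics DataGene actually implements may differ.

```python
import bisect
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def column_similarity(real, synthetic):
    """Hypothetical helper: KS statistic for every column present in both
    datasets, where each dataset is a dict of column name -> list of floats."""
    return {
        name: ks_statistic(real[name], synthetic[name])
        for name in real
        if name in synthetic
    }


# Illustrative data only: one "real" dataset and two synthetic candidates.
rng = random.Random(0)
n = 500
real = {
    "price": [rng.gauss(100, 10) for _ in range(n)],
    "volume": [rng.gauss(50, 5) for _ in range(n)],
}
synthetic_close = {
    "price": [rng.gauss(100, 10) for _ in range(n)],
    "volume": [rng.gauss(50, 5) for _ in range(n)],
}
synthetic_far = {
    "price": [rng.gauss(130, 10) for _ in range(n)],  # shifted mean
    "volume": [rng.gauss(50, 20) for _ in range(n)],  # inflated spread
}

scores_close = column_similarity(real, synthetic_close)
scores_far = column_similarity(real, synthetic_far)

for name in real:
    print(f"{name}: close={scores_close[name]:.3f} far={scores_far[name]:.3f}")
```

A lower score means the synthetic column follows the real column's distribution more closely, so the two candidates can be ranked numerically rather than by comparing histograms by eye. A per-column statistic like this ignores relationships between columns and ordering in time, which a fuller similarity framework would also need to cover.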

Keywords: Distance, Time Series, Similarity, Metrics, Python, Data, Machine Learning, Data Science

JEL Classification: C

Suggested Citation

Snow, Derek, DataGene: A Framework for Dataset Similarity (June 5, 2020). Available at SSRN: https://ssrn.com/abstract=3619626 or http://dx.doi.org/10.2139/ssrn.3619626

Derek Snow (Contact Author)

The Alan Turing Institute

British Library, 96 Euston Rd
London, NW1 2DB
United Kingdom

HOME PAGE: http://www.turing.ac.uk/
