DataGene: A Framework for Dataset Similarity
6 Pages. Posted: 29 Jun 2020. Last revised: 22 Oct 2020.
Date Written: June 5, 2020
Abstract
DataGene is developed to identify dataset similarity between real and synthetic datasets, as well as between train, test, and validation datasets. Many modelling and software development tasks require datasets that share similar characteristics. Similarity has traditionally been assessed with visualizations; DataGene seeks to replace these visual methods with a range of novel quantitative methods. Please see the GitHub repository to inspect and install the Python code.
Keywords: Distance, Time Series, Similarity, Metrics, Python, Data, Machine Learning, Data Science
JEL Classification: C