Big Data's Dirty Secret

26 pages. Posted: 11 Jul 2018

Harvey J. Stein

Two Sigma; Columbia University - Department of Mathematics

Yan Zhang

Bloomberg LP

Date Written: June 29, 2018

Abstract

Amid the avalanche of articles on big data and machine learning, the phrase "after cleaning the data" appears often. Here we focus on the work hidden behind this phrase. We analyze the types of dirty data found in financial time series, the problems that dirty data causes, and the performance of data cleaning algorithms. We also extend the MSSA hole-filling algorithm of Kondrashov and Ghil to improve its performance on CDS spread data, and combine it with clustering techniques from data science to detect bad data.

Keywords: Data cleaning, big data, machine learning, SSA, MSSA, PCA, Data science, outlier detection, anomaly detection

JEL Classification: G32, C10, C45, C55, C51, G12

Suggested Citation

Stein, Harvey J. and Zhang, Yan, Big Data's Dirty Secret (June 29, 2018). Available at SSRN: https://ssrn.com/abstract=3205524 or http://dx.doi.org/10.2139/ssrn.3205524

Harvey J. Stein (Contact Author)

Two Sigma

100 6th Ave
New York, NY 10013
United States

Columbia University - Department of Mathematics

New York, NY
United States

Yan Zhang

Bloomberg LP

731 Lexington Ave
New York, NY 10022
United States
