Big Data's Dirty Secret

Stein, Harvey J.; Zhang, Yan

doi:10.2139/ssrn.3205524

26 Pages Posted: 11 Jul 2018

Harvey J. Stein

Two Sigma; Columbia University - Department of Mathematics

Amidst the avalanche of articles on big data and machine learning, the phrase "after cleaning the data" appears often. Here we focus on the work hidden behind this phrase. We analyze the types of dirty data found in financial time series, the problems caused by dirty data, and the performance of data cleaning algorithms. We also extend the MSSA hole filling algorithm of Kondrashov and Ghil to improve its performance on CDS spread data, and combine it with clustering techniques from data science to detect bad data.

Keywords: Data cleaning, big data, machine learning, SSA, MSSA, PCA, Data science, outlier detection, anomaly detection

JEL Classification: G32, C10, C45, C55, C51, G12

Suggested Citation:

Stein, Harvey J. and Zhang, Yan, Big Data's Dirty Secret (June 29, 2018). Available at SSRN: https://ssrn.com/abstract=3205524 or http://dx.doi.org/10.2139/ssrn.3205524

Harvey J. Stein (Contact Author)

Two Sigma

100 6th Ave
New York, NY 10013
United States

Columbia University - Department of Mathematics

New York, NY
United States

Yan Zhang

Bloomberg LP

731 Lexington Ave
New York, NY 10022
United States

Related eJournals

Risk Management eJournal

Econometrics: Econometric & Statistical Methods - Special Topics eJournal

Risk Management & Analysis in Financial Institutions eJournal

Econometrics: Data Collection & Data Estimation Methodology eJournal

Artificial Intelligence eJournal

Information Systems eJournal

Innovation & Geography eJournal

International Political Economy: Globalization eJournal

Psychology Research Methods eJournal