TemporalDedup: Domain-Independent Deduplication of Redundant and Errant Temporal Data
Department
Computer Science
Document Type
Article
Publication Date
6-1-2023
Abstract
Deduplication is a key component of the data preparation process, a bottleneck in the machine learning (ML) and data mining pipeline that is very time-consuming and often relies on domain expertise and manual involvement. Further, temporal data is increasingly prevalent and is not well suited to traditional similarity and distance-based deduplication techniques. We establish a fully automated, domain-independent deduplication model for temporal data domains, known as TemporalDedup, that infers the key attribute(s), applies a base set of deduplication techniques focused on value matches for key, non-key, and elapsed time, and further detects duplicates through inference of temporal ordering requirements using Longest Common Subsequence (LCS) for records of a shared type. Using LCS, we split each record's temporal sequence into constrained and unconstrained sequences. We flag suspicious (errant) records that are non-adherent to the inferred constrained order and we flag a record as a duplicate if its unconstrained order, of sufficient length, matches that of another record. TemporalDedup was compared against a similarity-based Adaptive Sorted Neighborhood Method (ASNM) in evaluating duplicates for two disparate datasets: (1) 22,794 records from Sony's PlayStation Network (PSN) trophy data, where duplication may be indicative of cheating, and (2) emergency declarations and government responses related to COVID-19 for all U.S. states and territories. TemporalDedup (F1-scores of 0.971 and 0.954) exhibited combined sensitivities above 0.9 for all duplicate classes whereas ASNM (0.705 and 0.732) exhibited combined sensitivities below 0.2 for all time and order duplicate classes.
Journal Title
International Journal of Semantic Computing
Journal ISSN
1793351X
Volume
17
Issue
2
First Page
309
Last Page
343
Digital Object Identifier (DOI)
10.1142/S1793351X23500010
- Citations
- Citation Indexes: 1
- Usage
- Abstract Views: 2
- Captures
- Readers: 4