Progressive Ensemble Learning for in-Sample Data Cleaning

Wang, Jung-Hua; Lee, Shih-Kai; Wang, Ting-Yuan; Chen, Ming-Jer; Hsu, Shu-Wei

DC Field	Value	Language
dc.contributor.author	Wang, Jung-Hua	en_US
dc.contributor.author	Lee, Shih-Kai	en_US
dc.contributor.author	Wang, Ting-Yuan	en_US
dc.contributor.author	Chen, Ming-Jer	en_US
dc.contributor.author	Hsu, Shu-Wei	en_US
dc.date.accessioned	2024-11-01T09:18:05Z	-
dc.date.available	2024-11-01T09:18:05Z	-
dc.date.issued	2024/1/1	-
dc.identifier.issn	2169-3536	-
dc.identifier.uri	http://scholars.ntou.edu.tw/handle/123456789/25515	-
dc.description.abstract	We present an ensemble learning-based data cleaning approach (termed ELDC) capable of identifying and pruning anomalous data. ELDC is characterized in that an ensemble of base models is trained directly on the noisy in-sample data and dynamically provides clean data during iterative training. Each base model uses a random subset of the target dataset, which may initially contain up to 40% label errors. After each training iteration, anomalous data are distinguished from clean data by a majority voting scheme, and three types of anomaly (mislabeled, confusing, and outlier samples) are identified using a statistical pattern jointly determined by the prediction outputs of the base models. By iterating this train-vote-remove cycle, noisy in-sample data are progressively removed until a prespecified condition is reached. Comprehensive experiments, including out-of-sample data tests, are conducted to verify the effectiveness of ELDC in simultaneously suppressing the bias and variance of the prediction output. The ELDC framework is highly flexible, as it is not bound to a specific model and allows different transfer-learning configurations. AlexNet, ResNet50, and GoogleNet neural networks are used as base models and trained on various benchmark datasets; the results show that ELDC outperforms state-of-the-art cleaning methods.	en_US
dc.language.iso	English	en_US
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC	en_US
dc.relation.ispartof	IEEE ACCESS	en_US
dc.subject	Training	en_US
dc.subject	Data models	en_US
dc.subject	Cleaning	en_US
dc.subject	Noise measurement	en_US
dc.subject	Image classification	en_US
dc.subject	Complexity theory	en_US
dc.subject	Training data	en_US
dc.subject	Ensemble learning	en_US
dc.subject	Data integrity	en_US
dc.subject	Transfer learning	en_US
dc.subject	Convolutional neural networks	en_US
dc.subject	Noisy data	en_US
dc.subject	ensemble learning	en_US
dc.subject	data cleaning	en_US
dc.title	Progressive Ensemble Learning for in-Sample Data Cleaning	en_US
dc.type	journal article	en_US
dc.identifier.doi	10.1109/ACCESS.2024.3468035	-
dc.identifier.isi	WOS:001329024200001	-
dc.relation.journalvolume	12	en_US
dc.relation.pages	140643-140659	en_US
item.openairecristype	http://purl.org/coar/resource_type/c_6501	-
item.openairetype	journal article	-
item.languageiso639-1	English	-
item.cerifentitytype	Publications	-
item.grantfulltext	none	-
item.fulltext	no fulltext	-
crisitem.author.dept	College of Electrical Engineering and Computer Science	-
crisitem.author.dept	Department of Electrical Engineering	-
crisitem.author.dept	National Taiwan Ocean University,NTOU	-
crisitem.author.parentorg	National Taiwan Ocean University,NTOU	-
crisitem.author.parentorg	College of Electrical Engineering and Computer Science	-
Appears in Collections:	電機工程學系
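
The train-vote-remove cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy one-dimensional nearest-centroid classifier stands in for the CNN base models, the stopping condition (no sample flagged, or an iteration cap) is an assumption, and the three anomaly types are not distinguished. All function and parameter names (`train_centroids`, `predict`, `clean`, `demo_data`, `n_models`, `subset_frac`, `max_iters`) are hypothetical.

```python
import random
from collections import Counter


def train_centroids(samples):
    """Fit a toy base model: the mean feature value for each label."""
    sums, counts = {}, Counter()
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid lies closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def clean(data, n_models=7, subset_frac=0.7, max_iters=10, seed=0):
    """Iterate train-vote-remove; return (kept, removed) index lists."""
    rng = random.Random(seed)
    kept = list(range(len(data)))
    removed = []
    for _ in range(max_iters):
        # Train: each base model sees its own random subset of the kept data.
        k = min(len(kept), max(2, int(subset_frac * len(kept))))
        models = [
            train_centroids([data[i] for i in rng.sample(kept, k)])
            for _ in range(n_models)
        ]
        # Vote: flag a sample when a strict majority of models predicts
        # one label that differs from the sample's given label.
        flagged = set()
        for i in kept:
            x, y = data[i]
            votes = Counter(predict(m, x) for m in models)
            label, count = votes.most_common(1)[0]
            if label != y and count > n_models / 2:
                flagged.add(i)
        # Assumed stopping condition: nothing left to remove.
        if not flagged:
            break
        # Remove: drop flagged samples before the next iteration.
        removed.extend(sorted(flagged))
        kept = [i for i in kept if i not in flagged]
    return kept, removed


def demo_data():
    """Two well-separated classes with five deliberately flipped labels."""
    data = [(i * 0.1, 0) for i in range(20)]
    data += [(10.0 + i * 0.1, 1) for i in range(20)]
    for i in (0, 1, 2):
        data[i] = (data[i][0], 1)   # class-0 points mislabeled as 1
    for i in (20, 21):
        data[i] = (data[i][0], 0)   # class-1 points mislabeled as 0
    return data


kept, removed = clean(demo_data())
```

On this toy dataset the two clusters are far enough apart that every base model classifies every point by its true cluster, so the five flipped labels are flagged in the first pass and the second pass finds nothing further to remove.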

Page view(s)

102

checked on Jun 30, 2025
