Publication:
OCR error correction using correction patterns and self-organizing migrating algorithm

datacite.subject.fos oecd::Engineering and technology
dc.contributor.author Quoc-Dung Nguyen
dc.contributor.author Duc-Anh Le
dc.contributor.author Nguyet-Minh Phan
dc.contributor.author Ivan Zelinka
dc.date.accessioned 2022-11-02T02:17:56Z
dc.date.available 2022-11-02T02:17:56Z
dc.date.issued 2020
dc.description.abstract Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques result in different kinds of errors in OCR outputs. Post-processing is an essential step in improving the output quality of OCR systems by detecting and correcting these errors. In this paper, we present an automatic model consisting of both error detection and error correction phases for OCR post-processing. We propose a novel approach to OCR post-processing error correction that uses correction pattern edits and an evolutionary algorithm, a class of methods mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. Through efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches as evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.
dc.identifier.doi 10.1007/s10044-020-00936-y
dc.identifier.uri http://repository.vlu.edu.vn:443/handle/123456789/615
dc.language.iso en_US
dc.relation.ispartof Pattern Analysis and Applications
dc.relation.issn 1433-7541
dc.relation.issn 1433-755X
dc.subject OCR
dc.subject N-grams
dc.subject Similarity
dc.subject Context
dc.subject Correction pattern
dc.subject Evolutionary algorithm
dc.title OCR error correction using correction patterns and self-organizing migrating algorithm
dc.type journal-article
dspace.entity.type Publication
oaire.citation.issue 2
oaire.citation.volume 24
Files
Original bundle
Name:
AS290.pdf
Size:
3.36 MB
Format:
Adobe Portable Document Format