Tokenization Repair in the Presence of Spelling Errors

CoNLL 2021 paper and presentation

Paper Video Slides Poster

Reproducibility Material

Web demo

Try our tokenization repair methods in the interactive web demo.

Web demo

Evaluation web applications

Click through our benchmarks and view visualisations of the results in the evaluation web apps.

Tokenization Repair Evaluation

Spelling Correction Evaluation

Data

The data (1 GB compressed) contains the benchmarks described in the paper, the trained models, and the predicted sequences from all our methods. Our training data is available as a separate download.

Download data

Download training data

Download training data with synthetic OCR and spelling errors

Corrected ACL anthology corpus

A whitespace-corrected version of the ACL Anthology corpus is available for download, released with the publication of our paper. You can also browse the corrected corpus below.

Browse corpus

Download corpus

Code

You can download our code from GitHub. It comes with a Docker setup for easy reproducibility. A readme file in the code directory explains how to set up the Docker container. If you are not familiar with Docker, please visit docker.com.

The Docker container allows you to try our methods interactively, run them on our benchmarks (or on yours!), and run the evaluation. Make targets simplify the program calls and give further explanations.

arXiv 2020 material

Download the models, benchmarks and results from our arXiv 2020 paper.

arXiv 2020 material