Spell checkers, word2vec, fastText

kiran
3 min read · Mar 13, 2022

This is more about my journey around the web to understand the nuances of custom word dictionaries, and in the process coming across spell checkers.

Let's assume you have a misspelled word and you want a list of possible corrections.

Peter Norvig’s post is one of the famous ones that you come across when you look for the theory behind a spell checker.

For the given word, generate all possible corrections by deleting a character, then expand the candidate set in the same way with transpositions, replacements and insertions. Then look up each candidate in a word-frequency dictionary and return the candidate with the maximum probability as the correction.
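For reference, here is a minimal Python sketch of that idea, trimmed down from Norvig's post to single-edit candidates (big.txt is the word-frequency corpus he uses):

from collections import Counter
import re

# Word-frequency dictionary built from Norvig's big.txt corpus.
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    # Keep only candidates seen in the corpus and pick the most frequent one.
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correction('helo'))  # prints the most frequent known word one edit away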

I came across this well-presented blog post on how to use fastText for spelling correction. When the pretrained fastText word vectors are used, get_nearest_neighbors suggests misspelled words as neighbours.
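For illustration, this is roughly what that looks like with the Python fasttext bindings, assuming the pretrained English vectors (cc.en.300.bin) have already been downloaded:

import fasttext

# Load the pretrained English vectors published by fastText (a large download).
model = fasttext.load_model('cc.en.300.bin')

# The nearest neighbours of a correctly spelled word often include misspelled
# variants of it, which is the "bad output" discussed below.
print(model.get_nearest_neighbors('hello'))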

Why do we see bad output?

The blog post's reasoning is that the pretrained model was originally trained by fastText with a >1 neighborhood window. This fastText documentation page confirms that wordNgrams (the max length of word n-grams) was set to 5 during training. If we train our own word vectors, we could keep the wordNgrams hyperparameter at 1, so that fastText trains with 0 neighbors (i.e. each word is considered a line on its own).

Hence the author uses the dataset used by Peter Norvig for training a custom fastText model.
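A sketch of what that retraining might look like, assuming Norvig's big.txt is on disk (wordNgrams=1 is the setting the blog post argues for; as discussed below, it may not actually matter in unsupervised mode):

import fasttext

# Train skipgram vectors on Norvig's corpus with wordNgrams set to 1.
model = fasttext.train_unsupervised('big.txt', model='skipgram', wordNgrams=1, minCount=1)
print(model.get_nearest_neighbors('hello'))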

Perspective
I am a bit skeptical that the above line of thinking explains why misspelled words are suggested as corrections for a misspelled word.
In the fastText documentation there is no reference to wordNgrams being 5 for the pretrained word vectors. Those docs state that the character n-gram length is 5 and the window size (the ws param) is 5. The word n-gram size itself is not specified.
As per this, I think the default wordNgrams size is 1.

As per this and this, wordNgrams does not affect training in unsupervised mode, and the fastText pretrained word vectors are trained in unsupervised mode using the skipgram algorithm.

So the reason why misspelled words were suggested could be the corpus itself, together with the character n-grams used by fastText. As the corpus (Common Crawl) for the fastText skipgram model itself might have contained misspelled words, the suggestions contained them too.

Let's try the simple program below.

cat sample.txt
my name is how are hello you when is this going to happen for what that only when helle helo
workspace/fasttext$ python
Python 3.8.0 (default, Oct 8 2020, 21:35:46)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fasttext
>>> model = fasttext.train_unsupervised('sample.txt', model='skipgram', minCount=1)
Read 0M words
Number of words: 19
Number of labels: 0
Progress: 100.0% words/sec/thread: 51532 lr: 0.000000 avg.loss: -nan ETA: 0h 0m 0s
>>> model.get_nearest_neighbors('hello')
[(0.40682557225227356, 'helle'), (0.19969063997268677, 'helo'), (0.1260446161031723, 'that'), (0.0739162266254425, 'only'), (0.06943869590759277, 'to'), (0.06371329724788666, 'my'), (0.053224191069602966, 'when'), (0.020248744636774063, 'name'), (0.00697528850287199, '</s>'), (-0.00836258102208376, 'happen')]
>>>
>>>

As you can see, for the word hello the nearest neighbours are helle and helo, as cosine similarity is used for get_nearest_neighbors. Note that these suggestions are outside the window size for the input word.
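The character n-gram effect is easy to see with the model trained above: fastText builds a word's vector from its subword n-grams, and a misspelling shares most of them with the correct word, so the two vectors end up close even without co-occurring.

# Subword n-grams fastText uses for each token (continuing the session above).
subwords_hello, _ = model.get_subwords('hello')
subwords_helle, _ = model.get_subwords('helle')
print(set(subwords_hello) & set(subwords_helle))  # large shared overlap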

Google’s word2vec corpus has misspellings as well.

You can train your own word2vec on a custom dataset using gensim to generate apt suggestions.
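A minimal gensim sketch, with a hypothetical toy corpus standing in for your own tokenised sentences:

from gensim.models import Word2Vec

# Hypothetical toy corpus; replace with your own tokenised dataset.
sentences = [
    ['hello', 'how', 'are', 'you'],
    ['hello', 'when', 'is', 'this', 'going', 'to', 'happen'],
]

# Plain word2vec has no subword information, so only words that actually occur
# in the (clean) corpus get vectors, and unseen misspellings are never suggested.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)
print(model.wv.most_similar('hello', topn=5))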

