Spell checkers, word2vec, fastText
This post is about my journey across the web to understand the nuances of custom word dictionaries, and, in the process, coming across spell checkers.
Let’s assume you have a misspelled word and you want a list of possible corrections.
Peter Norvig’s post is one of the famous ones that you come across when you look for the theory behind a spell checker.
For the given word, generate all candidate corrections by deleting a character, then expand the candidate set in the same way with transpositions, replacements, and insertions; finally, look up each candidate in a word frequency dictionary and pick the candidate with the maximum probability as the correction.
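The steps above can be sketched roughly as follows (a toy frequency dictionary stands in for the real one Norvig builds from a large text file):

```python
from collections import Counter

# Toy word frequency dictionary; a stand-in for counts from a real corpus.
WORDS = Counter({'hello': 50, 'help': 30, 'hell': 10})

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Pick the known candidate with the highest corpus frequency."""
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correction('helo'))  # 'hello' — one insert away, and highest count
```

Here 'helo' generates 'hello' (insert), 'help' and 'hell' (replace) among its edits; the frequency dictionary breaks the tie in favour of 'hello'.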
Can we do the same using word embeddings? There are many posts on what word2vec is and how it works. I am just listing the different directions in which I wandered when I jumped into this problem space:
word2vec crash course
how fastText generates word vectors for OOV words
I came across this well-presented blog post on how to use fastText for spelling correction. When the pretrained fastText word vectors are used, get_nearest_neighbors suggests misspelled words.
Why do we see bad output?
One reason for this behavior, the post argues, could be that the pretrained model was originally trained by fastText with a >1 neighborhood window. It points to a fastText documentation page to claim that wordNgrams (the max length of word ngrams) was set to 5 during training, and that if we train our own word vectors we could keep wordNgrams at 1, so that fastText trains with 0 neighbors (i.e. each word is considered a line on its own).
Hence the author trains a custom fastText model on the dataset used by Peter Norvig.
I am a bit skeptical that the above line of thinking explains why misspelled words are suggested as corrections for a misspelled word.
In the fastText documentation there is no reference to wordNgrams being 5 for the pretrained word vectors. Those docs state that the character ngram length is 5 and the window size (the ws param) is 5; the word ngram size itself is not specified. As per the fastText reference, the default wordNgrams size is 1.
So the reason misspelled words were suggested could be the corpus itself, together with the character ngrams used by fastText. As the corpus (Common Crawl) for the fastText skipgram model itself might have had misspelled words, the suggestions also had them.
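The character-ngram effect is easy to see directly. The sketch below is only an illustration (it ignores fastText’s hashing and vector averaging): it just enumerates the boundary-wrapped character ngrams the way fastText does and shows that a word and its misspelling share a large chunk of them, so their subword-built vectors start out correlated.

```python
# Illustration only — not fastText's actual implementation.
def char_ngrams(word, n_min=3, n_max=5):
    """Character ngrams of a word wrapped in fastText's boundary symbols."""
    w = f'<{word}>'
    return {w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

a, b = char_ngrams('hello'), char_ngrams('helle')
print(sorted(a & b))        # shared ngrams: '<he', '<hel', '<hell', 'ell', 'hel', 'hell'
print(len(a & b) / len(a | b))  # Jaccard overlap of the two ngram sets
```

Because 'hello' and 'helle' share their entire prefix, a third of their combined ngram sets coincide; two unrelated words of the same length would share almost nothing.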
Let’s try the simple program below, with sample.txt containing:
my name is how are hello you when is this going to happen for what that only when helle helo
workspace/fasttext$ python
Python 3.8.0 (default, Oct 8 2020, 21:35:46)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.>>> import fasttext
>>> model = fasttext.train_unsupervised('sample.txt', model='skipgram', minCount=1)
Read 0M words
Number of words: 19
Number of labels: 0
Progress: 100.0% words/sec/thread: 51532 lr: 0.000000 avg.loss: -nan ETA: 0h 0m 0s
>>> model.get_nearest_neighbors('hello')
[(0.40682557225227356, 'helle'), (0.19969063997268677, 'helo'), (0.1260446161031723, 'that'), (0.0739162266254425, 'only'), (0.06943869590759277, 'to'), (0.06371329724788666, 'my'), (0.053224191069602966, 'when'), (0.020248744636774063, 'name'), (0.00697528850287199, '</s>'), (-0.00836258102208376, 'happen')]
As you can see, for the word hello the nearest neighbours are helle and helo, since get_nearest_neighbors uses cosine similarity over the learned vectors. Note that these suggestions fall outside the window size for the input word.
Google’s word2vec corpus has misspellings as well.
You can train your own word2vec model on a custom dataset using gensim to generate apt suggestions.