NGramLanguageModel

I have been playing around with ChatGPT, Claude, and other LLMs like many other programmers I know, and I decided to take a deeper dive into this AI thing. Today I am looking at an N-Gram Language Model implemented in Ruby to get an idea of how it works. This is all new to me, so take this as a walkthrough of the technology from a web developer who is an AI beginner rather than a take by an experienced AI programmer.

Implementing an N-Gram Language Model in Ruby

Natural Language Processing (NLP) is a fascinating field that explores how computers can understand, interpret, and generate human language. One of the core techniques in NLP is the N-Gram Language Model, which is a powerful tool for predicting the next word in a sequence of text.

In this blog post, I’ll walk you through how to implement an N-Gram Language Model in Ruby from scratch. We’ll cover the core concepts, the implementation details, and some practical applications of this technique.

What is an N-Gram Language Model?

An N-Gram Language Model is a statistical model that predicts the next word in a sequence based on the previous N-1 words. The “N” in N-Gram refers to the number of words that the model considers at a time.

For example, in a 3-Gram (or trigram) model, the probability of a word depends on the previous two words. Given the context “the cat”, the model might predict “sat” as the next word if that three-word sequence appears frequently in the training text.
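To make that concrete, here is a quick sketch (the sentence is just an invented example) of the trigrams that Ruby's each_cons would extract from a sentence:

```ruby
# Extract all trigrams (3-grams) from an example sentence.
words = "the cat sat on the mat".split
trigrams = words.each_cons(3).to_a
# => [["the", "cat", "sat"], ["cat", "sat", "on"],
#     ["sat", "on", "the"], ["on", "the", "mat"]]
```

Each trigram pairs a two-word context with the word that followed it in the text; counting how often each pairing occurs is the heart of the model.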

The core idea is that by analyzing a large corpus of text, we can build a statistical model that captures the patterns and dependencies between words. This allows us to make informed guesses about the most likely next word in a sequence.

Implementing an N-Gram Language Model in Ruby

Let’s dive into the implementation. We’ll build a simple N-Gram Language Model that can be used to generate text based on a given training corpus.

First, let’s define the NGramLanguageModel class:

class NGramLanguageModel
  def initialize(n, corpus)
    @n = n
    @corpus = corpus
    @ngram_counts = build_ngram_counts
    @total_ngrams = @ngram_counts.values.sum
  end

  # ... other methods go here ...
end

The initialize method takes two arguments:

n: the size of the n-gram (e.g., 3 for a trigram model)
corpus: the training text corpus, which should be an array of strings (one string per sentence or paragraph)

In the constructor, we also build the n-gram counts and calculate the total number of n-grams in the corpus.
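For instance, a corpus argument might look like this (a toy example, not from any real dataset):

```ruby
# A tiny training corpus: one string per sentence.
corpus = [
  "the cat sat on the mat",
  "the dog slept on the rug"
]
# A trigram model (n = 3) would learn its counts from these sentences.
```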

Now, let’s implement the build_ngram_counts method:

def build_ngram_counts
  ngram_counts = Hash.new(0)

  @corpus.each do |sentence|
    words = ["<s>"] * (@n - 1) + sentence.split + ["</s>"]
    words.each_cons(@n) do |ngram|
      ngram_counts[ngram.join(" ")] += 1
    end
  end

  ngram_counts
end

This method iterates through the corpus, extracting all n-grams and counting their occurrences. We add special <s> and </s> tokens to the beginning and end of each sentence to capture the context at the start of a sequence and to mark where sentences end.
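As a sanity check, here is the same counting logic inlined for a bigram (n = 2) model over a two-sentence toy corpus:

```ruby
# Bigram counts over a toy corpus, using the same <s>/</s> padding.
n = 2
corpus = ["the cat sat", "the dog sat"]

ngram_counts = Hash.new(0)
corpus.each do |sentence|
  words = ["<s>"] * (n - 1) + sentence.split + ["</s>"]
  words.each_cons(n) { |ngram| ngram_counts[ngram.join(" ")] += 1 }
end

ngram_counts["<s> the"]  # => 2 (both sentences start with "the")
ngram_counts["sat </s>"] # => 2 (both sentences end with "sat")
ngram_counts["the cat"]  # => 1
```

Note how the boundary tokens let the model learn which words tend to start and end sentences, not just which words follow each other mid-sentence.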

Next, we’ll implement the predict_next_word method, which uses the n-gram counts to predict the most likely next word:

def predict_next_word(previous_words)
  # Trim to the last n-1 words, then pad with <s> if we have fewer.
  previous_words = previous_words.last(@n - 1)
  previous_words = ["<s>"] * (@n - 1 - previous_words.length) + previous_words
  context = previous_words.join(" ")

  candidates = {}
  @ngram_counts.each do |key, count|
    if key.start_with?("#{context} ")
      candidate = key.split(" ")[-1]
      candidates[candidate] = count
    end
  end

  return nil if candidates.empty?
  candidates.max_by { |_, count| count }[0]
end

The predict_next_word method takes a list of previous words and uses the n-gram counts to find the most likely next word. It first trims the input to the last N-1 words and pads it with <s> tokens to ensure we have the correct context. Then it scans all the n-grams in the model, collects the ones that match the given context, and selects the most common next word, returning nil if the context never appeared in the corpus.
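The candidate-selection step can be seen in isolation with a hand-built hash of trigram counts (made-up numbers, just for illustration):

```ruby
# Hand-built trigram counts; the context is the first two words of each key.
ngram_counts = { "the cat sat" => 3, "the cat ran" => 1, "the dog ran" => 2 }
context = "the cat"

candidates = {}
ngram_counts.each do |key, count|
  candidates[key.split(" ")[-1]] = count if key.start_with?("#{context} ")
end
# candidates => { "sat" => 3, "ran" => 1 }

best = candidates.max_by { |_, count| count }[0]
# best => "sat"
```

The trailing space in "#{context} " matters: it stops the context "the cat" from accidentally matching a key like "the cats purr".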

Finally, let’s add a generate_text method to generate new text based on the language model:

def generate_text(seed_text, length)
  text = seed_text.split
  length.times do
    next_word = predict_next_word(text.last(@n - 1))
    break if next_word.nil?
    text << next_word
  end
  text.join(" ")
end

This method takes a seed text and a desired length, then generates new text by repeatedly calling predict_next_word to find the most likely next word, stopping early when no prediction is available.

Applications of N-Gram Language Models

N-Gram Language Models have a wide range of applications in natural language processing, including:

Text Generation: As we've seen, language models can be used to generate new text that mimics the style and content of the training corpus.
Autocomplete and Spelling Correction: Language models can be used to predict the most likely next word or correct misspellings in user input.
Machine Translation: In statistical machine translation systems, n-gram language models score how fluent candidate translations are in the target language.
Sentiment Analysis: The patterns captured by language models can be used to infer the sentiment (positive, negative, or neutral) of a piece of text.

N-Gram Language Models are a foundational technique in NLP, and understanding how to implement and apply them is a valuable skill for any data scientist or machine learning engineer working with text data.

I hope this blog post has given you a better understanding of how to build an N-Gram Language Model in Ruby. Feel free to experiment with different corpus sizes, n-gram lengths, and so forth to build on what I set out here.
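For reference, here is the complete class from this post assembled into one runnable script, with a tiny invented corpus to try it on:

```ruby
# Full N-Gram Language Model assembled from the methods above.
class NGramLanguageModel
  def initialize(n, corpus)
    @n = n
    @corpus = corpus
    @ngram_counts = build_ngram_counts
    @total_ngrams = @ngram_counts.values.sum
  end

  def predict_next_word(previous_words)
    # Trim to the last n-1 words, then pad with <s> if we have fewer.
    previous_words = previous_words.last(@n - 1)
    previous_words = ["<s>"] * (@n - 1 - previous_words.length) + previous_words
    context = previous_words.join(" ")

    candidates = {}
    @ngram_counts.each do |key, count|
      candidates[key.split(" ")[-1]] = count if key.start_with?("#{context} ")
    end

    return nil if candidates.empty?
    candidates.max_by { |_, count| count }[0]
  end

  def generate_text(seed_text, length)
    text = seed_text.split
    length.times do
      next_word = predict_next_word(text.last(@n - 1))
      break if next_word.nil?
      text << next_word
    end
    text.join(" ")
  end

  private

  def build_ngram_counts
    ngram_counts = Hash.new(0)
    @corpus.each do |sentence|
      words = ["<s>"] * (@n - 1) + sentence.split + ["</s>"]
      words.each_cons(@n) { |ngram| ngram_counts[ngram.join(" ")] += 1 }
    end
    ngram_counts
  end
end

corpus = [
  "the quick brown fox jumps",
  "the quick brown dog barks"
]
model = NGramLanguageModel.new(3, corpus)
model.generate_text("the quick", 3)
# => "the quick brown fox jumps" (ties go to the n-gram seen first)
```

Because generation always greedily picks the single most common next word, the output is deterministic for a given corpus; sampling from the candidate counts instead would produce more varied text.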