In computational linguistics and natural language processing, understanding how frequently individual words appear in a text collection is a fundamental building block for many language models. The unigram model treats each word as an independent event, computing the probability of any word based solely on its relative frequency in the corpus—ignoring any contextual dependencies between words.
This concept forms the foundation of language modeling, where we estimate the probability distribution over sequences of words. By computing the likelihood of isolated words (unigrams), we establish the simplest probabilistic model of language, which serves as a baseline for more sophisticated n-gram and neural language models.
Sentence Boundary Markers: In language modeling, it is essential to model where sentences begin and end. We use special tokens to mark these boundaries: `<s>` marks the start of a sentence and `</s>` marks its end.
These markers are treated as regular tokens when computing probabilities, allowing the model to learn patterns about sentence structure and length distributions.
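To illustrate that boundary markers count like any other token, here is a minimal sketch using Python's `collections.Counter` on the first example corpus:

```python
from collections import Counter

corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
counts = Counter(corpus.split())   # split() tokenizes on whitespace

# The boundary markers are counted just like ordinary words.
print(counts["<s>"], counts["</s>"])   # 2 2
```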
Computing Word Likelihood: Given a corpus represented as a space-separated string of tokens (including boundary markers), the likelihood of a specific word is calculated as:
$$P(word) = \frac{\text{count}(word)}{\text{total tokens in corpus}}$$
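The formula translates directly into code. The following is a sketch, with the function name `unigram_probability` chosen for illustration:

```python
def unigram_probability(corpus: str, word: str) -> float:
    """P(word) = count(word) / total tokens, boundary markers included."""
    tokens = corpus.split()                 # whitespace tokenization
    return tokens.count(word) / len(tokens)
```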
Your Task: Implement a function that calculates the probability of a given word appearing in a corpus of tokenized sentences. Your implementation should: • Tokenize the corpus by splitting on whitespace (keeping the boundary markers as tokens) • Count the occurrences of the target word • Divide by the total number of tokens • Round the result to 4 decimal places
Example 1:
Input:
corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
word = "Jack"
Output: 0.1818
Explanation: First, we tokenize the corpus by splitting on whitespace:
Tokens: ["<s>", "Jack", "I", "like", "</s>", "<s>", "Jack", "I", "do", "like", "</s>"]
• Total token count = 11
• Occurrences of "Jack" = 2
• Probability = 2 / 11 = 0.18181818...
• Rounded to 4 decimal places = 0.1818
Example 2:
Input:
corpus = "<s> the cat is big </s> <s> the dog is small </s>"
word = "the"
Output: 0.1667
Explanation: Tokenizing the corpus:
Tokens: ["<s>", "the", "cat", "is", "big", "</s>", "<s>", "the", "dog", "is", "small", "</s>"]
• Total token count = 12
• Occurrences of "the" = 2
• Probability = 2 / 12 = 0.16666...
• Rounded to 4 decimal places = 0.1667
Example 3:
Input:
corpus = "<s> I love programming </s> <s> Python is great </s>"
word = "love"
Output: 0.1
Explanation: Tokenizing the corpus:
Tokens: ["<s>", "I", "love", "programming", "</s>", "<s>", "Python", "is", "great", "</s>"]
• Total token count = 10
• Occurrences of "love" = 1
• Probability = 1 / 10 = 0.1
• Rounded to 4 decimal places = 0.1
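The three worked examples above can be reproduced with one compact end-to-end sketch (the function name is illustrative, not prescribed by the task):

```python
def unigram_probability(corpus: str, word: str) -> float:
    # Tokenize on whitespace, count the word, normalize, round to 4 places.
    tokens = corpus.split()
    return round(tokens.count(word) / len(tokens), 4)

print(unigram_probability("<s> Jack I like </s> <s> Jack I do like </s>", "Jack"))      # 0.1818
print(unigram_probability("<s> the cat is big </s> <s> the dog is small </s>", "the"))  # 0.1667
print(unigram_probability("<s> I love programming </s> <s> Python is great </s>", "love"))  # 0.1
```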
Constraints