In computational linguistics and natural language processing, understanding how frequently individual words appear in a text collection is a fundamental building block for many language models. The unigram model treats each word as an independent event, computing the probability of any word based solely on its relative frequency in the corpus—ignoring any contextual dependencies between words.
This concept forms the foundation of language modeling, where we estimate the probability distribution over sequences of words. By computing the likelihood of isolated words (unigrams), we establish the simplest probabilistic model of language, which serves as a baseline for more sophisticated n-gram and neural language models.
Sentence Boundary Markers: In language modeling, it is essential to model where sentences begin and end. We use special tokens to mark these boundaries: `<s>` marks the start of a sentence and `</s>` marks its end.
These markers are treated as regular tokens when computing probabilities, allowing the model to learn patterns about sentence structure and length distributions.
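To illustrate that boundary markers count like any other token, here is a minimal sketch using Python's `collections.Counter` on the first example corpus:

```python
from collections import Counter

corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
counts = Counter(corpus.split())   # split() tokenizes on whitespace

# The boundary markers are counted just like ordinary words.
print(counts["<s>"], counts["</s>"])   # 2 2
```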
Computing Word Likelihood: Given a corpus represented as a space-separated string of tokens (including boundary markers), the likelihood of a specific word is calculated as:
$$P(word) = \frac{\text{count}(word)}{\text{total tokens in corpus}}$$
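The formula translates directly into code. The following is a sketch, with the function name `unigram_probability` chosen for illustration:

```python
def unigram_probability(corpus: str, word: str) -> float:
    """P(word) = count(word) / total tokens, boundary markers included."""
    tokens = corpus.split()                 # whitespace tokenization
    return tokens.count(word) / len(tokens)
```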
Your Task: Implement a function that calculates the probability of a given word appearing in a corpus of tokenized sentences. Your implementation should: • Tokenize the corpus by splitting on whitespace (keeping the boundary markers as tokens) • Count the occurrences of the target word • Divide by the total number of tokens • Round the result to 4 decimal places
Example 1:
Input:
corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
word = "Jack"
Output: 0.1818
Explanation: First, we tokenize the corpus by splitting on whitespace:
Tokens: ["<s>", "Jack", "I", "like", "</s>", "<s>", "Jack", "I", "do", "like", "</s>"]
• Total token count = 11
• Occurrences of "Jack" = 2
• Probability = 2 / 11 = 0.18181818...
• Rounded to 4 decimal places = 0.1818
Example 2:
Input:
corpus = "<s> the cat is big </s> <s> the dog is small </s>"
word = "the"
Output: 0.1667
Explanation: Tokenizing the corpus:
Tokens: ["<s>", "the", "cat", "is", "big", "</s>", "<s>", "the", "dog", "is", "small", "</s>"]
• Total token count = 12
• Occurrences of "the" = 2
• Probability = 2 / 12 = 0.16666...
• Rounded to 4 decimal places = 0.1667
Example 3:
Input:
corpus = "<s> I love programming </s> <s> Python is great </s>"
word = "love"
Output: 0.1
Explanation: Tokenizing the corpus:
Tokens: ["<s>", "I", "love", "programming", "</s>", "<s>", "Python", "is", "great", "</s>"]
• Total token count = 10
• Occurrences of "love" = 1
• Probability = 1 / 10 = 0.1
• Rounded to 4 decimal places = 0.1
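The three worked examples above can be reproduced with one compact end-to-end sketch (the function name is illustrative, not prescribed by the task):

```python
def unigram_probability(corpus: str, word: str) -> float:
    # Tokenize on whitespace, count the word, normalize, round to 4 places.
    tokens = corpus.split()
    return round(tokens.count(word) / len(tokens), 4)

print(unigram_probability("<s> Jack I like </s> <s> Jack I do like </s>", "Jack"))      # 0.1818
print(unigram_probability("<s> the cat is big </s> <s> the dog is small </s>", "the"))  # 0.1667
print(unigram_probability("<s> I love programming </s> <s> Python is great </s>", "love"))  # 0.1
```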
Constraints