The Shape of a Word: Intro to Word Embeddings

To understand the magic, we first have to appreciate the problem. Computers don’t understand “words”; they understand numbers. For decades, the primary way to convert words into numbers was a brute-force method called one-hot encoding.

From Words to Numbers: The Old Way

Imagine you have a tiny vocabulary of just five words: “cat”, “dog”, “king”, “queen”, and “runs.” With one-hot encoding, you would assign each word a unique position in a list (or vector) of zeros. A “1” in a word’s position indicates that specific word.

  • cat: [1, 0, 0, 0, 0]
  • dog: [0, 1, 0, 0, 0]
  • king: [0, 0, 1, 0, 0]
  • queen: [0, 0, 0, 1, 0]
  • runs: [0, 0, 0, 0, 1]

This works, but it has two crippling flaws. First, it’s incredibly inefficient. The English language has hundreds of thousands of words, so each word would be represented by a vector with hundreds of thousands of dimensions, almost all of which are zero. Second, and more importantly for linguistics, it treats every word as a completely isolated island. The mathematical distance between “cat” and “dog” is exactly the same as the distance between “cat” and “king.” The system has no inkling that “cat” and “dog” are both small, furry pets, while “king” is a type of monarch. There is no concept of similarity or relationship.
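
To make that second flaw concrete, here is a minimal sketch in NumPy using the five-word toy vocabulary above (the distance check is just for illustration). Every pair of distinct words comes out exactly the same distance apart.

    import numpy as np

    vocab = ["cat", "dog", "king", "queen", "runs"]

    # One-hot encoding: each word is a vector of zeros with a single 1
    # at its own index in the vocabulary.
    one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

    def distance(a, b):
        return np.linalg.norm(one_hot[a] - one_hot[b])

    # Every pair of distinct words is equally far apart (sqrt(2) ~ 1.414),
    # so the representation carries no notion of similarity.
    print(distance("cat", "dog"))   # 1.414...
    print(distance("cat", "king"))  # 1.414...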

The Distributional Hypothesis: You Shall Know a Word by the Company It Keeps

The breakthrough came from a simple but profound linguistic idea, famously articulated by the British linguist J.R. Firth in 1957:

“You shall know a word by the company it keeps.”

This is the distributional hypothesis. The meaning of a word is not an intrinsic property but is defined by the words that tend to appear around it. Think about it. The word “thermonuclear” is likely to appear in contexts with words like “fusion”, “physics”, “bomb”, and “energy.” The word “sourdough” is likely to appear near “bread”, “starter”, “bake”, and “flour.” Even if you didn’t know what “sourdough” was, its neighbors give you a very strong clue.

Words with similar meanings will tend to appear in similar contexts. “Cat” and “kitten” will both be found near “meow”, “purr”, “claws”, and “pet.” “King” and “queen” will both be found near “royal”, “crown”, “throne”, and “palace.” What if a computer could learn these contextual patterns by reading a massive amount of text, like all of Wikipedia or a library of books?
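
Here is a minimal sketch of that idea, assuming an invented toy corpus and a small context window (real systems count contexts over billions of words): tally which words appear near which, and words with similar meanings end up with similar tallies.

    from collections import defaultdict

    # Toy corpus, invented for illustration; real models read billions of words.
    corpus = [
        "the cat purrs and the kitten purrs",
        "the cat has claws and the kitten has claws",
        "the royal king sat on the throne",
        "the royal queen sat on the throne",
    ]

    window = 2  # how many neighbors on each side count as "context"
    cooccurrence = defaultdict(lambda: defaultdict(int))

    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    cooccurrence[word][tokens[j]] += 1

    # "cat" and "kitten" end up with overlapping context counts ("purrs",
    # "claws"), while "king" and "queen" both co-occur with "royal".
    print(dict(cooccurrence["cat"]))
    print(dict(cooccurrence["kitten"]))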

Enter Word Embeddings: Giving Words a Place in Space

This is exactly what word embedding models, like the pioneering Word2Vec, do. Instead of a sparse, high-dimensional one-hot vector, they represent each word as a dense, lower-dimensional vector—typically with 50 to 300 dimensions. Each of these numbers is a coordinate, placing the word in a complex, high-dimensional “meaning space.”
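
As a rough illustration of what that looks like in code, here is a sketch using the gensim library (assuming gensim 4.x is installed). On a toy corpus this small the resulting vectors are essentially noise, but the API and the shape of the output are the same as on a real corpus.

    from gensim.models import Word2Vec

    # Toy tokenized corpus; real training uses millions of sentences.
    sentences = [
        ["the", "cat", "purrs", "near", "the", "kitten"],
        ["the", "king", "and", "the", "queen", "sit", "on", "the", "throne"],
        ["the", "dog", "runs", "to", "the", "cat"],
    ]

    # vector_size is the number of dimensions in the embedding space;
    # real models typically use 50 to 300.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    print(model.wv["cat"].shape)                # (50,): one dense vector per word
    print(model.wv.most_similar("cat", topn=3))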

Let’s use a simple 2D analogy. Imagine a map where one axis represents “Royalty” and the other represents “Gender” (from masculine to feminine).

  • “King” might have coordinates like [0.9 Royalty, -0.8 Gender].
  • “Queen” might be at [0.88 Royalty, 0.85 Gender].
  • “Man” might be at [0.1 Royalty, -0.9 Gender].
  • “Woman” might be at [0.08 Royalty, 0.92 Gender].
  • “Castle” might be at [0.7 Royalty, 0.0 Gender].

In this simplified space, words with similar meanings cluster together. “King” and “Queen” are close to each other on the “Royalty” axis. But now, the relationships between words also have a shape. The line you draw from “Man” to “Woman” is very similar in length and direction to the line you draw from “King” to “Queen.”
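
A short NumPy sketch makes that concrete, using the invented 2D coordinates from the list above (toy numbers, not real embeddings): "king" and "queen" sit at nearly the same spot on the Royalty axis, and the offset from "man" to "woman" is nearly the same vector as the offset from "king" to "queen".

    import numpy as np

    # The invented [Royalty, Gender] coordinates from the example above.
    words = {
        "king":   np.array([0.90, -0.80]),
        "queen":  np.array([0.88,  0.85]),
        "man":    np.array([0.10, -0.90]),
        "woman":  np.array([0.08,  0.92]),
        "castle": np.array([0.70,  0.00]),
    }

    # "king" and "queen" sit at almost the same point on the Royalty axis...
    print(words["king"][0], words["queen"][0])   # 0.9 0.88

    # ...and the "line" from man to woman is nearly the same as the one
    # from king to queen.
    print(words["woman"] - words["man"])         # [-0.02  1.82]
    print(words["queen"] - words["king"])        # [-0.02  1.65]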

Now, scale this up from two dimensions to 300. We can no longer visualize it, but the mathematical principles hold. The dimensions don’t represent clean human concepts like “Royalty”; they are abstract features of meaning that the model learned from the data. One dimension might lean toward something like animacy, another toward something like verb tense, but in practice a human-readable concept is usually spread across many dimensions rather than living in any single one.

The Surprising Math of Meaning

This geometric arrangement leads to something that feels like magic. Because these word representations are vectors, we can do math on them. This is where we finally solve our opening riddle.

If we take the vector for “king”, subtract the vector for “man”, and then add the vector for “woman”, the resulting vector will be extremely close to the vector for “queen.”

vec("king") - vec("man") + vec("woman") ≈ vec("queen")

What’s happening here? The operation vec("king") - vec("man") isolates something like a “royalty vector”: it captures what separates a king from an ordinary man, while the shared “male” component cancels out. When we add that royalty vector to “woman”, we land squarely on “queen.”

This works for many other relationships:

  • Capitals: vec("Paris") - vec("France") + vec("Japan") ≈ vec("Tokyo")
  • Verb Tense: vec("walking") - vec("walked") + vec("swam") ≈ vec("swimming")
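
You can try these analogies yourself with gensim’s downloader and a set of pretrained vectors. The sketch below assumes the "glove-wiki-gigaword-100" vectors are available (the first call downloads them, roughly 130 MB); these are GloVe embeddings rather than Word2Vec, but they support the same arithmetic, and the exact top result can vary with the training corpus.

    import gensim.downloader as api

    # Pretrained 100-dimensional GloVe vectors (downloaded on first use).
    vectors = api.load("glove-wiki-gigaword-100")

    # king - man + woman: "queen" should appear at or near the top.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # paris - france + japan: "tokyo" should appear at or near the top.
    # (This vocabulary is lowercased, so city and country names are too.)
    print(vectors.most_similar(positive=["paris", "japan"], negative=["france"], topn=3))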

The AI isn’t “thinking.” It’s performing vector arithmetic in a carefully constructed space where mathematical relationships mirror the semantic relationships in human language.

Cultural Fingerprints in the Data

This leads to a crucial point, especially for a blog about language and culture. Where do these models learn their “meaning space” from? They learn it from us. They are trained on vast corpora of human text—books, news articles, web pages, scientific papers. As a result, the embeddings don’t just capture linguistic truth; they capture a snapshot of our culture, complete with its biases and stereotypes.

For example, word embeddings trained on a large corpus of news articles famously produced the analogy: “man is to computer programmer as woman is to homemaker.” This isn’t a failure of the algorithm; it’s a faithful reflection of the biased patterns in the data it was fed. The proximity of words in the vector space mirrors their proximity in the cultural texts it was trained on.

This has become a major field of study in AI ethics. How do we build fair and representative models? How do we debias these embeddings so that they don’t perpetuate harmful stereotypes? A model trained on a Japanese corpus will embed different cultural assumptions about politeness and social hierarchy than one trained on an English corpus. The shape of a word is defined by its cultural context.

The Shape of Thought

Word embeddings revolutionized natural language processing. They are the foundational building block for the technologies we use every day, from Google Search and machine translation to the large language models like GPT that power chatbots. They succeeded because they moved beyond treating words as discrete symbols and started treating them as points in a rich, relational space.

By teaching machines to see the “shape” of words and the geometric relationships between them, we’ve not only given them a semblance of understanding but also created a fascinating mirror. In these complex, high-dimensional spaces, we can see the reflection of our language, our history, and the very structure of our cultural consciousness.