Week 9 - Embeddings - Chinese Characters

I explored how Chinese characters can be represented as vectors and used those vectors to find similar characters based on meaning or visual similarity.

Python notebook

I loaded a pretrained embedding file containing thousands of Chinese characters, each mapped to a 2048-dimensional vector.

https://github.com/cqx931/chineseVisualEmbeddings
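As a rough illustration of the loading step, here is a minimal sketch of a parser. It assumes each line of the file holds one character followed by whitespace-separated numbers; that layout is an assumption, not something confirmed by the repository, and parse_embeddings is a hypothetical helper name. The sample uses 3 values per character instead of 2048 to stay readable.

```python
def parse_embeddings(lines):
    """Parse lines of the assumed form '<character> <v1> <v2> ... <vN>' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        char, values = parts[0], parts[1:]
        vectors[char] = [float(v) for v in values]
    return vectors


# Tiny made-up stand-in for the real file, which would hold 2048 values per character.
sample = [
    "水 0.9 0.1 0.0",
    "河 0.8 0.2 0.1",
    "",
    "火 0.1 0.9 0.3",
]
embeddings = parse_embeddings(sample)
print(len(embeddings), len(embeddings["水"]))
```

Because the function accepts any iterable of lines, an open file handle (opened with encoding="utf-8") could be passed in place of the sample list.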

I used the simpleneighbors library to search for characters that are similar to a chosen one, like “水” (water).
I experimented with adding random noise to a character’s vector and observing which other characters fall near the perturbed vector.
Finally, I visualized part of this vector space using a 2D projection technique called t-SNE, to better understand how characters group together in meaning.
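The search and noise experiments described above can be sketched in plain Python. This is not the notebook's code: it swaps the simpleneighbors index for a brute-force cosine-similarity search, and the vectors, the helper names (nearest, add_noise), and the noise scale are all made up for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vectors, query, n=3):
    """Return the n characters whose vectors are most similar to query."""
    ranked = sorted(
        vectors,
        key=lambda char: cosine_similarity(vectors[char], query),
        reverse=True,
    )
    return ranked[:n]


def add_noise(vector, scale, rng):
    """Add Gaussian noise with standard deviation `scale` to every component."""
    return [x + rng.gauss(0.0, scale) for x in vector]


# Made-up 3-dimensional vectors; the real ones have 2048 dimensions.
toy = {
    "水": [0.9, 0.1, 0.0],
    "河": [0.8, 0.2, 0.1],
    "海": [0.7, 0.1, 0.3],
    "火": [0.1, 0.9, 0.3],
}

rng = random.Random(0)  # fixed seed so the experiment is repeatable
print(nearest(toy, toy["水"], n=2))   # neighbors of the clean vector
noisy = add_noise(toy["水"], scale=0.5, rng=rng)
print(nearest(toy, noisy, n=2))       # neighbors after perturbation
```

A brute-force scan like this compares the query against every vector, which is fine for a toy set; an approximate index is what makes the same query fast across thousands of 2048-dimensional vectors.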

What This Embedding Code Does

This notebook demonstrates how to load and search an embedding dataset built for Chinese characters rather than English words. In English, words are made of letters, and embeddings usually target whole words or sentences. Chinese characters, by contrast, are standalone semantic units that often carry complex visual and conceptual meaning.

Using a pretrained dataset of visual Chinese embeddings, the code:

Loads character + vector pairs from a .txt file
Uses a nearest-neighbor library (simpleneighbors, built on annoy) to quickly find similar vectors
Allows semantic search: finding related or similar Chinese characters based on vector proximity, not spelling
Adds noise to vectors to simulate visual or semantic variation
Visualizes how characters cluster by projecting part of the high-dimensional space into 2D with t-SNE

This method is helpful for understanding how meaning and similarity can be encoded numerically, even for a logographic writing system like Chinese.
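One common way to make "similarity" concrete is the cosine of the angle between two vectors u and v. Annoy's documentation describes its angular metric as the Euclidean distance between the normalized vectors, which works out to the second expression below. Whether the notebook relied on that metric is an assumption here.

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
d_{\text{angular}}(u, v) = \sqrt{2\,\bigl(1 - \cos(u, v)\bigr)}
```

Identical directions give a cosine of 1 and a distance of 0; orthogonal vectors give a cosine of 0 and a distance of about 1.41.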

How Chinese Embeddings Are Done (vs. English)

Each Chinese character is a semantic unit on its own (unlike English letters, which must combine into words to carry meaning).
Many characters carry both visual and phonetic cues, so embedding methods sometimes combine image-based features with linguistic ones.
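To show what combining image-based and linguistic features could look like, here is a minimal sketch that normalizes each feature vector and concatenates them. This is one simple possibility, not a description of how the dataset used here was built; the function names, the weighting parameter, and the numbers are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def combine(visual, linguistic, visual_weight=0.5):
    """Concatenate two normalized feature vectors, weighting each part."""
    v = [visual_weight * x for x in l2_normalize(visual)]
    t = [(1.0 - visual_weight) * x for x in l2_normalize(linguistic)]
    return v + t


# Made-up features for a single character.
visual_features = [3.0, 4.0]            # e.g. derived from an image of the glyph
linguistic_features = [0.0, 2.0, 0.0]   # e.g. derived from text usage
combined = combine(visual_features, linguistic_features)
print(len(combined))
```

Normalizing before concatenating keeps one feature type from dominating purely because its raw values are larger, and visual_weight lets the balance be tuned.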