I explored how Chinese characters can be represented as vectors and used those vectors to find similar characters based on meaning or visual similarity.

Python notebook

  1. I loaded a pretrained embedding file containing thousands of Chinese characters, each mapped to a 2048-dimensional vector.

https://github.com/cqx931/chineseVisualEmbeddings

  2. I used the simpleneighbors library to search for characters that are similar to a chosen one, like “水” (water).
  3. I experimented by adding random noise to a character’s vector and observing which characters end up closest to the perturbed vector.
  4. Finally, I visualized part of this vector space using a 2D projection technique called t-SNE, to better understand how characters group together.
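
The search and noise steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the vectors are made-up 4-dimensional stand-ins for the real 2048-dimensional ones, the search is a brute-force cosine comparison rather than the simpleneighbors index used in the notebook, and the helper names `nearest` and `add_noise` are hypothetical.

```python
import math
import random

# Toy 4-dimensional vectors standing in for the real 2048-dimensional ones.
# The values are invented for illustration; they do not come from the dataset.
embeddings = {
    "水": [0.9, 0.1, 0.0, 0.2],
    "冰": [0.8, 0.2, 0.1, 0.3],
    "火": [0.1, 0.9, 0.3, 0.0],
    "山": [0.0, 0.2, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, table, n=3):
    """Return the n characters whose vectors are most similar to `vector`."""
    ranked = sorted(
        table,
        key=lambda ch: cosine_similarity(vector, table[ch]),
        reverse=True,
    )
    return ranked[:n]


def add_noise(vector, scale, rng):
    """Add independent Gaussian noise with standard deviation `scale`."""
    return [x + rng.gauss(0.0, scale) for x in vector]


# Neighbors of the unmodified vector: the character itself comes first.
print(nearest(embeddings["水"], embeddings, n=2))  # → ['水', '冰']

# Neighbors of a perturbed copy; a larger `scale` drifts further away.
rng = random.Random(0)
noisy = add_noise(embeddings["水"], 0.05, rng)
print(nearest(noisy, embeddings, n=2))
```

A brute-force scan like this is fine for a handful of vectors. For thousands of 2048-dimensional vectors, an index such as the one simpleneighbors builds avoids comparing the query against every character each time.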

What This Embedding Code Does

This notebook demonstrates how to load and search a special kind of embedding dataset, one built for Chinese characters rather than English words. In English, words are made of letters, and embeddings usually focus on full words or sentences. Each Chinese character, by contrast, is a standalone semantic unit that often carries complex visual and conceptual meaning.

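Using a pretrained dataset of visual Chinese embeddings, the code loads the character vectors, searches for the nearest neighbors of a chosen character, perturbs a vector with random noise to see which characters land nearby, and plots part of the space with a 2D t-SNE projection.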

This method is helpful for understanding how meaning and similarity can be encoded numerically, even for a logographic writing system like Chinese.

How Chinese Embeddings Are Done (vs. English)