20/06/2025
Similar Whatever
I shared my similar movies chart at Townspeople #009.
Who Doesn’t Love a Good Embedding?
I was looking for a movie, TV show, or really anything to pass the time. Sentence embeddings can take a sentence, a paragraph, or even an entire book (though not quite yet, sadly) and convert it into a single fixed-length vector, whose size is set by the model rather than by the text. These vectors can be compared to determine how similar two pieces of text are. This concept of similarity or relatedness is also at the core of RAG systems—and they’re fantastic.
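If you’ve never touched embeddings, the whole trick fits in a few lines. Here’s a minimal sketch using the sentence-transformers library and the off-the-shelf all-MiniLM-L6-v2 model (illustrative choices, not necessarily what I used): two pieces of text become two vectors, and cosine similarity tells you how related they are.

```python
# Minimal sketch: embed two sentences and compare them.
# Library and model here are illustrative choices, not my actual setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("A farm boy joins a rebellion against a galactic empire.")
b = model.encode("A young pilot helps rebels fight an evil space regime.")

# Cosine similarity: values near 1.0 mean "these texts are about the same thing".
print(util.cos_sim(a, b))
```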
After embedding two public datasets containing around 44,500 movies, I ended up with 44,500 vectors. I used only the plot summary data—not genre, release year, origin, or title. That’s because, even though embedding models can tell thematically different movies apart despite similar metadata, dimensionality reduction algorithms tend to “cling” to those metadata similarities and produce unsatisfying results. The model still aggressively separates Bollywood from Hollywood films—likely because the character names in Indian films dominate those vectors.
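The embedding pass itself is nothing fancy. Roughly like this, assuming the plots sit in a DataFrame column called plot and a 768-dimensional model like all-mpnet-base-v2 (both the column name and the model are stand-ins here, not my exact setup):

```python
# Rough sketch of the embedding step; file name, column name, and model
# are placeholders, not my exact setup.
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("movies.csv")                    # ~44,500 rows
model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional output

# Only the plot text goes in; genre, year, origin, and title stay out.
embeddings = model.encode(
    df["plot"].tolist(),
    batch_size=64,
    show_progress_bar=True,
)
print(embeddings.shape)  # (44500, 768)
```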
Each of the 44,500 vectors has 768 dimensions. Since it’s hard to conceptualize data in that many dimensions, we reduce it using UMAP, t-SNE, or something similar. I went with UMAP out of habit, but plan to try PCA and t-SNE on the same set. With some tuning, you can represent the entire dataset in a 2D plot, where similar movies appear closer together. For example, Interstellar ends up closer to Inception than to RRR.
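The reduction step, in sketch form; the parameter values below are placeholders rather than the ones I actually settled on after tuning:

```python
# Reduce the 768-dimensional embeddings to 2D for plotting.
# Parameter values are illustrative; the real ones took some tuning.
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(
    n_neighbors=30,    # larger values emphasize global structure
    min_dist=0.1,
    metric="cosine",   # matches how the embeddings are compared
    random_state=42,
)
coords = reducer.fit_transform(embeddings)  # (44500, 2)

plt.scatter(coords[:, 0], coords[:, 1], s=1)
plt.title("Movies by plot embedding (UMAP)")
plt.show()
```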
Want to take a wild guess where the Bollywood and Hollywood movies are? What about European films? :^)
It would also be interesting to search for nearest neighbors before reducing dimensions, since that process inevitably loses some information. The idea is: the user selects a movie from a list and gets the top 5 most similar ones—no plot, no metadata, just raw embedding similarity.
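In sketch form, that lookup could be as simple as this, assuming the df and embeddings from earlier share row order and use a plain integer index:

```python
# Top-5 lookup on the raw 768-dim vectors, before any reduction.
# Assumes `df` and `embeddings` from the earlier steps share row order.
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(embeddings)

def similar_movies(title, k=5):
    idx = df.index[df["title"] == title][0]
    _, neighbors = nn.kneighbors(embeddings[idx].reshape(1, -1))
    # The first neighbor is the movie itself, so skip it.
    return df["title"].iloc[neighbors[0][1:k + 1]].tolist()

print(similar_movies("Interstellar"))
```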
At this point, my idea felt complete—but there’s always room to expand.
Next, I clustered the points based on proximity, density, and frequency. It’s all about similarity, baby!
I used HDBSCAN*—again, out of habit. I’m still tweaking the HDBSCAN parameters to improve the results, so the current clusters aren’t final. Also, a lot of points end up labeled “None”—that’s a bug on my end.
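For the curious, the clustering step looks roughly like this. I’m assuming here that it runs on the 2D UMAP coordinates from above (it could just as well run on the full embeddings), and these parameter values are exactly the kind of thing I’m still tweaking:

```python
# Density-based clustering on the 2D coordinates; values are placeholders.
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,   # smallest group worth calling a cluster
    min_samples=10,        # higher = more points end up as noise
)
labels = clusterer.fit_predict(coords)  # -1 means "noise", i.e., no cluster

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters")
```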
Problems
How can an embedding model understand that Up and Gran Torino are basically the same movie? Or that High-Rise and Snowpiercer share themes? Well, it can’t—at least not yet.
In my dataset, the “plot summary” column was more like a full synopsis than a summary. I strongly suspect that properly summarized versions of these plots would produce better embedding results.
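If I ever test that hunch, it would probably be a summarize-then-embed pass along these lines; this is purely hypothetical (model choice, length limits, all of it), not something I’ve actually run:

```python
# Hypothetical "summarize first, then embed" pass; none of this is tested.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def shorten(synopsis):
    # BART tops out around 1024 tokens, so very long synopses would need
    # chunking (or a long-context model) before this step.
    out = summarizer(synopsis, max_length=130, min_length=40, truncation=True)
    return out[0]["summary_text"]
```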