bc569a80a344f9c 4 days ago

Very nice! This took me about 30 minutes to re-implement for Magic: The Gathering cards (with data from mtgjson.com), and then about 40 minutes or so to create the embeddings. It does rather well at finding similar cards for when you want more than a 4-of, or of course for Commander. That's quite useful for weirder effects where one doesn't have the common options memorized!

  • minimaxir 4 days ago

    I was thinking about redoing this with Magic cards too (I have quite a lot of code for preprocessing that data already), so it's good to know it works there too! :)

rahimnathwani 3 days ago

There seem to be a lot of properties that are numeric or boolean, e.g.

    "base_happiness": 50,
    "capture_rate": 190,
    "forms_switchable": false,
    "gender_rate": 4,
    "has_gender_differences": true,
    "hatch_counter": 10,
    "is_baby": false,
    "is_legendary": false,
    "is_mythical": false,
Why not treat each of those properties as an extra dimension, and have the embedding model handle only the remaining (non-numeric) fields?

Is it because:

A) It's easier to just embed everything, or

B) Treating those numeric fields as separate dimensions would mean their interactions wouldn't be considered (without PCA), or

C) Something else?

  • dcl 3 days ago

    Because then you couldn't use a pretrained LLM to give you the embeddings. If you added these numerics as extra dimensions, you would need to train a new model that somehow learns the meaning of those extra dimensions based on some measure.

    • rahimnathwani 2 days ago

      The embedding model outputs a vector, which is a list of floats. If we wrap the embedding model with a function that adds a few extra dimensions (one for each of these numeric variables, perhaps compressed into the range zero to one) then we would end up with vectors that have a few extra dimensions (e.g. 800 dimensions instead of 784 dimensions). Vector similarity should still just work, no?
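
      A rough sketch of what I have in mind, assuming sentence-transformers; the field choices and scaling ranges below are just illustrative:

          import numpy as np
          from sentence_transformers import SentenceTransformer

          model = SentenceTransformer("all-MiniLM-L6-v2")

          # Hypothetical max values used to squash each numeric field into [0, 1].
          NUMERIC_RANGES = {"base_happiness": 255, "capture_rate": 255, "hatch_counter": 120}

          def embed_with_numeric(text, record):
              # text embedding, already unit length
              text_vec = model.encode(text, normalize_embeddings=True)
              # a few extra dimensions from the numeric fields
              extra = np.array([record[k] / NUMERIC_RANGES[k] for k in NUMERIC_RANGES])
              vec = np.concatenate([text_vec, extra])
              return vec / np.linalg.norm(vec)  # re-normalize so cosine similarity still works

          a = embed_with_numeric("legendary psychic", {"base_happiness": 100, "capture_rate": 45, "hatch_counter": 120})
          b = embed_with_numeric("mythical psychic", {"base_happiness": 100, "capture_rate": 45, "hatch_counter": 80})
          print(float(a @ b))  # cosine similarity of the combined vectors

      You'd probably also want to weight the extra dimensions so the numeric fields neither drown out nor get drowned out by the text part.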

jszymborski 4 days ago

I would be interested in how this might work with just looking for common words between the text fields of the JSON file weighted by e.g. TF-IDF or BM25.

I wonder if you might get similar results. I'd also be interested in the comparative computational resources it takes: encoding takes a lot of resources, but I imagine lookup would be a lot less resource-intensive (i.e. time and/or memory).
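
A rough sketch of the sparse baseline with scikit-learn; the card_texts below are made-up stand-ins for the concatenated text fields, and for BM25 you'd reach for something like the rank_bm25 package instead:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # One concatenated string of text fields per JSON record (stand-in data).
    card_texts = [
        "grass poison seed pokemon overgrow chlorophyll",
        "fire lizard pokemon blaze solar power",
        "water turtle pokemon torrent rain dish",
    ]

    vectorizer = TfidfVectorizer()               # cheap, CPU-only "encoding"
    tfidf = vectorizer.fit_transform(card_texts)

    # Similarity of record 0 against all records; argsort for nearest neighbours.
    sims = cosine_similarity(tfidf[0], tfidf)
    print(sims)

No GPU involved, and both the fitting and the lookup are fast, so it would at least make a cheap baseline to compare the embedding results against.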

refulgentis 4 days ago

Almost everyone uses MiniLM-L6-v2.

You almost certainly don't want to use MiniLM-L6-v2.

MiniLM-L6-v2 is for symmetric search: i.e. finding documents similar to the query text.

MiniLM-L6-v3 is for asymmetric search: i.e. finding documents that would have answers to the query text.

This is also an amazing lesson in... something: sentence-transformers spells this out in its docs, over and over, just never this directly. It has a doc on how to build a proper search pipeline, and a doc on the correct model for each type of search, but not a doc saying "hey, use this one."

And yet, I'd wager there's $100+M invested in vector DB startups who would be surprised to hear it.
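
For concreteness, a minimal sketch of the two setups with sentence-transformers; the asymmetric model below (multi-qa-MiniLM-L6-cos-v1) is one of their pretrained QA-search models, picked as an example rather than a claim about which exact checkpoint to use:

    from sentence_transformers import SentenceTransformer, util

    # Symmetric search: query and documents are the same kind of text.
    sym = SentenceTransformer("all-MiniLM-L6-v2")

    # Asymmetric search: a short question against longer passages that might answer it.
    asym = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

    docs = ["Bulbasaur is a dual-type Grass/Poison Pokemon introduced in Generation I."]
    query = "which starter is part Poison type?"

    # For question-style retrieval, encode both sides with the asymmetric model.
    scores = util.cos_sim(asym.encode(query), asym.encode(docs))
    print(scores)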

  • radarsat1 3 days ago

    It would be nice if you spelled out in your post how you know this, then. Is it written somewhere? A relevant paper, for example?

    • refulgentis 3 days ago

      > it has a doc on how to make a proper search pipeline, and a doc on the correct model for each type of search, but not a doc saying "hey use this"

bfung 4 days ago

> minimaxir uses Embeddings!

> It’s super effective!

> minimaxir obtains HN13

axpy906 4 days ago

Nice article. I remember the original work. Can you elaborate on this one, Max?

> Even if the generative AI industry crashes

  • minimaxir 4 days ago

    It's a note that embeddings R&D is orthogonal to whatever happens with generative AI even though both involve LLMs.

    I'm not saying that generative AI will crash, but if it's indeed at the top of the S-curve there could be issues, notwithstanding the cost and legal issues that are only increasing.

    • qeternity 4 days ago

      While there is no real definition of LLM, I'm not sure I would say both involve LLMs. There is a trend towards using the hidden state of an LLM as an embedding, but this is relatively recent, and overkill for most use cases. Plenty of embedding models are not large, and it's fairly trivial to train a small domain-specific embedding model that has incredible utility.
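
      To give a sense of "fairly trivial", a sketch using sentence-transformers' classic fine-tuning API; the base model and training pairs below are just placeholders:

          from torch.utils.data import DataLoader
          from sentence_transformers import SentenceTransformer, InputExample, losses

          # Pairs of texts that should land near each other in your domain (made-up examples).
          train_examples = [
              InputExample(texts=["refund policy question", "how do I get my money back?"]),
              InputExample(texts=["shipping delay", "my order has not arrived yet"]),
          ]

          model = SentenceTransformer("all-MiniLM-L6-v2")  # small base model to fine-tune
          loader = DataLoader(train_examples, shuffle=True, batch_size=2)
          loss = losses.MultipleNegativesRankingLoss(model)

          # In-batch negatives: each pair's partner is the positive, the rest of the batch the negatives.
          model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)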

      • refulgentis 4 days ago

        To some approximation, if you understood what BERT was at the time it was released, you'd consider it the first modern LLM. GPT-1 was OpenAI's BERT.

        Timeline would be viewed as:

        2017: Transformers

        2018: BERT

        2018: GPT-1

        2019: GPT-2

        2020: GPT-3

        2022: GPT-3.5 (ChatGPT)

        • rolisz 3 days ago

          BERT was not large (it was under a billion parameters), and it wasn't an autoregressive language model like GPT.

          • refulgentis 3 days ago

            Relative to today's models, it is small.

            For the time, it was large.

            Re: that it's not autoregressive, that's correct.

            Things built on each other smoothly: Transformers to BERT to GPT.

  • pqdbr 4 days ago

    I think the author is implying that even if you can't extract real-world value from generative AI, the current AI hype has advanced embeddings to the point where they can provide real-world value to a lot of projects (like the semantic search demonstrated in the article, where no generative AI was used).

moralestapia 4 days ago

Nice.

Can you compare distances just like that in a 2D space post-UMAP?

I was under the impression that UMAP makes metrics meaningless.
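
For reference, this is the projection step I mean, with random data standing in for the real embeddings; my impression is that UMAP preserves neighborhood structure rather than global distances, hence the question:

    import numpy as np
    import umap  # umap-learn

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(200, 384))  # stand-in for the real embedding matrix

    # 2D layout for plotting; similarity scores should come from the original vectors.
    coords = umap.UMAP(n_components=2, metric="cosine", random_state=0).fit_transform(embeddings)
    print(coords.shape)  # (200, 2)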

flipflopclop 4 days ago

Great post, really enjoyed the narrative flow and the quality of the deep technical details.

Woshiwuja 3 days ago

It's kinda funny that Arceus is as close to Mew as Rampardos is.

vasco 4 days ago

> man + women - king = queen

Useless correction: it's king - man, not man - king.

  • 01HNNWZ0MV43FF 4 days ago

    It's also woman not women

    • PaulHoule 4 days ago

      Also, you hear that example over and over again because you can't get other ones to work reliably with Word2Vec; you'd have thought you could train a good classifier for color words or nouns or something like that if it worked, but actually you can't.

      Because it could not tell the difference between word senses, I think Word2Vec introduced as many false positives as true positives. BERT was the revolution we needed.

      I use similar embedding models for classification and it is great to see improvements in this space.

      • craigacp 4 days ago

        There are a bunch of these things in a word2vec space. Years ago I wrote a post on my group's blog that trained word2vec on a bunch of wikias so we could find out who the Han Solo of Doctor Who is (which, I think somewhat inexplicably, was Rory Williams). You need to carefully implement word2vec, and then the similarity search, but there are plenty of vaguely interesting things in there once you do.
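
        A minimal sketch with gensim, using a pretrained vector set as a stand-in for the wikia-trained model:

            import gensim.downloader as api

            # Pretrained vectors (sizeable download); a wikia-trained model loads the same way.
            wv = api.load("glove-wiki-gigaword-100")

            # The canonical analogy, in the corrected "king - man + woman" direction.
            print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

            # "Who is the X of Y" lookups are the same call with different anchor words.
            print(wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))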

        • radarsat1 3 days ago

          It's a good point about true and false positives though, which makes me wonder if anyone's taken a large database of expected outputs from such "equations" and used it to calculate validation scores for different models in terms of precision and recall.

jpz 4 days ago

Great article - thanks.