anotherpaulg 5 days ago

This is really interesting. They fine-tune on instances of this sort of task:

  Do a task using the list of dictionaries below.
  Dictionary [1] {122: 765, 4548: 1475, 4818: 4782}
  Dictionary [2] {526: 290, 9205: 9318, 9278: 1565} ...
  Dictionary [32] {2931: 8364, 196: 1464, 812: 5363} ...
  Dictionary [85] {344: 1579, 116: 617, 330: 411}
  Above is a list of dictionaries such that each key and value is an integer. Report the
  value of key 2931 and the dictionary it is in.
  Desired answer: The value of key 2931 is 8364 and it is in Dictionary [32].

This task doesn't teach any new facts, but seems to encourage better ability to random-access data from a large context.
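
A minimal sketch of how one of these instances might be generated (my reconstruction from the quoted example, not the paper's code; the sizes and integer ranges are guesses):

  import random

  def make_example(n_dicts=85, keys_per_dict=3, max_int=9999):
      """Generate one retrieval instance in the style quoted above.
      Keys are sampled without replacement so the needle is unambiguous."""
      keys = random.sample(range(1, max_int + 1), n_dicts * keys_per_dict)
      dicts = [
          {k: random.randint(1, max_int)
           for k in keys[i * keys_per_dict:(i + 1) * keys_per_dict]}
          for i in range(n_dicts)
      ]
      idx = random.randrange(n_dicts)           # dictionary holding the needle
      needle = random.choice(list(dicts[idx]))  # key the model must report
      prompt = (
          "Do a task using the list of dictionaries below.\n"
          + "\n".join(f"Dictionary [{i + 1}] {d}" for i, d in enumerate(dicts))
          + "\nAbove is a list of dictionaries such that each key and value"
          f" is an integer. Report the value of key {needle} and the"
          " dictionary it is in."
      )
      answer = (f"The value of key {needle} is {dicts[idx][needle]}"
                f" and it is in Dictionary [{idx + 1}].")
      return prompt, answer
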
  • avree 5 days ago

    If this type of fine-tuning severely harms the general-purpose capabilities of the model, why is it advantageous versus using a slimmer model more suited to your goal from the start? I guess I don't fully grok the paper.

  • vessenes 5 days ago

Interesting. You could add counting and other jobs as well, e.g. “how many times does value 8364 appear, and in which dictionaries?”

    A lot of the needle-trained models clearly can’t reason over stuff scattered through a long context, even if they can retrieve it. I would like to see some extensions that require some reasoning.

    I guess you could use words as dict values and ask it what sentence a set of keys spells, and to answer that sentence with a set of keys. Lots of interesting possibilities.
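
    A sketch of that counting variant, with the value planted in several dictionaries so the count is nontrivial (my framing, not the paper's):

      import random

      def make_counting_example(n_dicts=32, keys_per_dict=3, repeats=3, max_int=9999):
          """Ask how many times a value appears, and in which dictionaries."""
          keys = random.sample(range(1, max_int + 1), n_dicts * keys_per_dict)
          dicts = [{k: random.randint(1, max_int)
                    for k in keys[i * keys_per_dict:(i + 1) * keys_per_dict]}
                   for i in range(n_dicts)]
          target = random.randint(1, max_int)
          # Plant the target value in `repeats` randomly chosen dictionaries.
          for i in random.sample(range(n_dicts), repeats):
              dicts[i][random.choice(list(dicts[i]))] = target
          # Compute the answer from the data itself, in case of chance collisions.
          count = sum(list(d.values()).count(target) for d in dicts)
          where = [i + 1 for i, d in enumerate(dicts) if target in d.values()]
          question = (f"How many times does value {target} appear,"
                      " and in which dictionaries?")
          answer = f"Value {target} appears {count} times, in Dictionaries {where}."
          return dicts, question, answer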

  • surfingdino 5 days ago

    grep '2931:' list_of_dicts.txt

    • optimalsolver 5 days ago

      This is why you need symbolic reasoning. The above is vastly more efficient than whatever these LLMs are doing.

      • surfingdino 5 days ago

        I had a chat with a dev friend of mine. He's building an app and will have users enter venue booking info. He then plans to use ... AI to check venue availability and sort bookings, something a couple of simple SQL queries and views would easily take care of. Where did we go wrong? How did the LLM monkey gang manage to embed their shitty ideas into the minds of otherwise intelligent people?
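
        For what it's worth, the availability check really is a one-query job. A sketch against a made-up bookings schema (sqlite3 only to keep it self-contained):

          import sqlite3

          conn = sqlite3.connect(":memory:")
          conn.executescript("""
              CREATE TABLE bookings (venue_id INTEGER, starts TEXT, ends TEXT);
              INSERT INTO bookings VALUES (1, '2024-07-01 18:00', '2024-07-01 23:00');
          """)

          def is_available(venue_id, starts, ends):
              """Free iff no existing booking overlaps the requested slot."""
              (n,) = conn.execute(
                  """SELECT COUNT(*) FROM bookings
                     WHERE venue_id = ? AND starts < ? AND ends > ?""",
                  (venue_id, ends, starts),
              ).fetchone()
              return n == 0

          print(is_available(1, '2024-07-01 20:00', '2024-07-02 01:00'))  # False
          print(is_available(1, '2024-07-02 09:00', '2024-07-02 12:00'))  # True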

        • EGreg 4 days ago

          Sounds like blockchain.

          Actually very much like blockchain, in that it’s slower and less efficient, yet they put it into everything, including iced tea.

          The only difference is that AI weights are inscrutable and people don’t know what the model will do, while blockchain’s actual achievement is precisely the opposite: knowing and proving what the code will do. But still, it is not a hammer for every nail!

        • sgaur 5 days ago

          Hope he’s using the AI to convert user inputs into query parameters and doing the actual retrieval through queries. Or perhaps there are subjective descriptions or reviews of the venues that need to be queried?
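
          That split is easy to sketch: the model's only job is to emit query parameters, and retrieval stays in SQL. The completion call below is a stub, not any real API:

            import json

            def extract_params(user_text, complete):
                """`complete` stands in for whatever chat-completion call you use."""
                prompt = (
                    "Extract venue_id (int), starts and ends (ISO 8601 timestamps)"
                    " from the booking request below. Reply with a JSON object only.\n\n"
                    + user_text
                )
                return json.loads(complete(prompt))

            def stub(_prompt):
                # Canned response so the sketch runs without a model.
                return ('{"venue_id": 1, "starts": "2024-07-02 18:00",'
                        ' "ends": "2024-07-02 23:00"}')

            params = extract_params("Book hall 1 on July 2nd, 6pm to 11pm", stub)
            print(params)  # feeds a parameterised query, never raw SQL from the model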

        • beefnugs 3 days ago

          Yeah, I don't know how we can avoid this circle jerk of inevitability: flood the internet with cheap generated content, so that the rest of us need AI to try to filter out the vast amount of shit and get to the meat?

          Then we finally get back to the shit-covered meat we all crave.

      • EGreg 4 days ago

        In 10 years:

        “LLM, make your lookup more efficient. kthxbai”

dvt 5 days ago

I've seen a lot of papers recently tackle the needle-in-a-haystack problem wrt LLMs, and I think this approach (and more generally, any in-context solution) is a mistake.

Imo the best way to handle this is RAG + multi-shot prompting (+ symbolic mapping to an actual data structure). For example: a pre-processing step where you partition the context into "records," another step where you insert the records (potentially splitting them up) into a RAG database, and another step where you make fuzzy queries. So, if you ask for record 1234 you get an exact match on that line (or set of lines, or record, or whatever) of the original context. And if you ask for "elephant" but there's no "elephant" in the context, you might get the "hippo" record because of the RAG reranking.

This is a lot of work, and is essentially a data pipeline, but the results are much better-curated than just fine-tuning and hoping that generalized needle-in-a-haystack search will work reliably as part of a language model.
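
A toy version of that pipeline, with difflib's string similarity standing in for a real embedding index and reranker (a semantic match like "elephant" -> "hippo" would need actual embeddings):

  import difflib

  def build_index(context):
      """Partition the context into records, one per 'key: value' line.
      A real pipeline would chunk, embed, and store these in a vector DB."""
      index = {}
      for line in context.splitlines():
          key, sep, value = line.partition(":")
          if sep:
              index[key.strip()] = value.strip()
      return index

  def query(index, key, n=3):
      """Exact match first; fuzzy fallback plays the role of the reranker."""
      if key in index:
          return [(key, index[key])]
      close = difflib.get_close_matches(key, list(index), n=n, cutoff=0.3)
      return [(k, index[k]) for k in close]

  index = build_index("1234: hippo record\n5678: giraffe record")
  print(query(index, "1234"))  # exact hit on the record
  print(query(index, "124"))   # fuzzy fallback finds 1234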

  • pilooch 5 days ago

    No. Because the model should, and will, replace any piece of code. This has already happened for other tasks: in computer vision, text (entity recognition, etc.), audio, and so on.

    RAG will go away; decision-making multimodal models / LLMs will take over. Not here yet, but inevitable, I believe.

    • snovv_crash 5 days ago

      End to end networks can sometimes have higher performance, but the failure mechanisms aren't explainable, and are unintuitive to humans.

      If you're building something that needs to be easy to work with, and that humans can understand the limitations of, splitting the network up into stages and having human-interpretable intermediate values is a good architecture choice.

    • jimmySixDOF 5 days ago

      Not so sure these are even mutually exclusive positions. The BYO-data needs will far exceed the longest contexts for a long time to come. LLMs might integrate RAG to the point where it is hard to talk about one without the other (DSPy is close), but there will still be some kind of private knowledge-base graph feeding in-context learning, so improvements in either area are positive.

    • achierius 5 days ago

      This seems like a nonsensical position. Computer vision models have not replaced every piece of code: there's still harness code, formatting, even old-fashioned classical vision processing that goes on both before and after the model runs. It's perfectly reasonable to couch AI models inside other, classical code.

      • imtringued 4 days ago

        We have end to end robot learning models that take nothing but camera input and instructions and directly produce the robot motions. There is literally no code left except for the sensors and actuators. It's all in the model.

    • vatsadev 5 days ago

      Not necessarily.

      Like, I would rather use code to find AprilTags than a ViT or some other model.
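
      For example (assuming OpenCV >= 4.7, where the ArUco module ships AprilTag dictionaries; worth double-checking the API against your version):

        import cv2

        # Classical AprilTag detection, no neural network involved.
        dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_APRILTAG_36h11)
        detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

        img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
        corners, ids, _rejected = detector.detectMarkers(img)
        print(ids)  # detected tag IDs, or None if nothing found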

kristjansson 4 days ago

The comments here are kinda silly… the haystack test measures how well a model can natively attend to its entire context window. Of course a more elaborate pipeline, or a way for the model to use a shell, or whatever will easily (trivially) solve the problem.

But that’s not the point, the point is a task that’s trivial to generate and exercises 10s-100s of thousands of tokens of context in a falsifiable way.

yousif_123123 4 days ago

Haven't read the paper yet, but it looks like this could improve how well the model's attention works across a long context, since many practical tasks end up resembling these generic ones.

Even GPT-4 gets tripped up when too many exact instructions need to be executed on an input. That's why breaking a task into multiple steps across multiple calls so often improves performance.
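
A sketch of that decomposition pattern; `complete` stands in for whatever chat-completion call you use, and the steps are invented for illustration:

  def run_pipeline(text, complete):
      """One instruction per call, instead of a single overloaded prompt."""
      steps = [
          "Extract every date mentioned in the text below.",
          "Normalize the dates below to ISO 8601.",
          "Sort the ISO dates below ascending; return them as a JSON list.",
      ]
      out = text
      for step in steps:
          out = complete(f"{step}\n\n{out}")  # each call sees one instruction
      return out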

It's wonderful to see improvements possible on smaller models.

viksit 5 days ago

anyone have pointers on progress in symbolic reasoning vs context forcing approaches in LLMs?