jjrv 5 days ago

I found a losslessly compressed version: https://github.com/LeanModels/Bagel-DFloat11

It works following the readme instructions, at least on Ubuntu, on my RTX 3090 GPU with 24 gigs of memory, but just barely: I have to close most other windows and lower the screen resolution to be able to load the model. Then it generates or edits images in 2-3 minutes. I only have this one GPU and am using Chrome on the same machine for the browser interface.

The original release won't run on this hardware, but the compressed one is supposed to give identical results.

  • jjrv 5 days ago

    I also asked it to explain what's funny in some newspaper comic strips in Finnish. It misunderstands some words and makes up nutty explanations, but most phrases still get translated correctly and its explanations do fit the drawn scenes once you factor in those misunderstandings. For such a small model that seemed impressive.

spuz 5 days ago

I'm interested in potential alternatives to ChatGPT's advanced voice mode. When I see the word "multimodal" I'm hopeful the model understands text + voice but instead it almost always seems to refer to text + images. Is there a keyword that I can use to look for models that work with voice similar to ChatGPT's advanced voice mode?

  • cjbprime 5 days ago

    I don't know that ChatGPT's voice mode is using audio as a transformer input directly.

    It could just be using speech to text (e.g. Whisper) on your input, and then using its text model on the text of your words. Or has OpenAI said that they aren't doing this?
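
    For what it's worth, that kind of cascade is simple to build. Here's a rough sketch with the OpenAI Python SDK (the model names are placeholders and this only illustrates the pipeline, not a claim about what ChatGPT actually does):

      # Hypothetical speech-to-text -> text-model cascade, not how Advanced Voice works.
      # Anything non-verbal (tone, pace, emotion) is lost at the transcription step.
      from openai import OpenAI

      client = OpenAI()

      with open("question.wav", "rb") as audio:
          transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

      reply = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder text model
          messages=[{"role": "user", "content": transcript.text}],
      )
      print(reply.choices[0].message.content)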

    • mrshu 4 days ago

      OpenAI does not provide many details about their models these days but they do mention that the "Advanced voice" within ChatGPT operates on audio input directly:

      > Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.

      From https://help.openai.com/en/articles/8400625-voice-mode-faq

  • amrrs 5 days ago

    Google Gemini Live is pretty good.

    If you want to try voice only, try unmute.sh by Kyutai, which will eventually be open-sourced.

    • spuz 4 days ago

      Thanks - it seems that Gemini Live is pretty far behind advanced voice mode at the moment. For example, I can't get it to speak slower when I want to understand what it is saying.

      I'm still interested in what keyword I could use to search for the latest research in voice models.

akacrobat 5 days ago

This looks exciting! There is a serious dearth of high-quality open-source models with multimodal capabilities. So, really looking forward to playing with this one.

Has anyone here experimented with fine-tuning this for domain-specific applications?

charcircuit 5 days ago

The demo shows pretty weak performance compared to other small models. It misunderstood my question because it picked an uncommon way to interpret it, and after I clarified what I wanted, it lost all the context I had provided in the previous message. My benchmark query is intentionally ambiguous; I use it to see how models handle ambiguity, deal with information that can be outdated, and avoid hallucination. Usually weak models will just hallucinate an answer, but this model was the first that wasn't able to understand the question at all.

LourensT 5 days ago

These days, papers come with an advertisement video

  • jxjnskkzxxhx 5 days ago

    As someone who used to be in academia, I think it isn't bad in itself; I just worry that by comparison it raises the bar of effort one has to clear to get their work noticed.

    • kleiba 5 days ago

      Compared to the effort required to play in that field at all, making a video is almost negligible.

      • jxjnskkzxxhx 5 days ago

        If it's an obligation, it's admin. Scientists hate admin.

  • lern_too_spel 5 days ago

    This has been common for CG papers for two decades. Image generation is CG.

pleone 5 days ago

It's from the ByteDance team, right? The team behind TikTok, CapCut, BuzzVideo and more. Any thoughts on that?

  • rvnx 5 days ago

    Like BYD vs Tesla. The US is falling further and further behind and is more closed than ever (e.g. China's Qwen LLMs versus LLaMA). So long term, China may emerge as the dominant force in tech.

mdrzn 5 days ago

A quick test in the "demo" link doesn't show it to be "as smart" as it appears in the demos on the page. I really hope it does all it's promising to do, but I'm skeptical so far.

  • mrec 5 days ago

    I found it surprising that even one of the demos on the page appeared to get it wrong. (Chat example #5, explaining the "My Handwriting In Exams" meme.) Not horribly wrong, but still an odd example to cherry-pick for publicity material.

    ETA: oof, and it's still getting hands wrong. (Editing demo #12)

moffkalast 5 days ago

Oh no it's The Everything Bagel.

mnky9800n 5 days ago

I couldn’t find it: what are the hardware requirements for Bagel?

  • tonii141 5 days ago

    If the model uses FP16 precision and has 7 billion active parameters, it would require approximately 14 GB of VRAM. I didn't read the paper.

    • sfphoton 5 days ago

      How can you calculate required VRAM from precision and parameter number?

      • Havoc 5 days ago

        Realistically you probably just want to look at the file size on huggingface and add ~2 gigs for OS/Firefox tabs, plus a bit for context (depends, but let's say 1-2 GB).

        The direct parameter-conversion math tends to be much less reliable than one would expect once quants are involved.

        e.g.

        7B @ Q8 = 7.1 GB [0]

        30B @ Q8 = 34.6 GB [1]

        Btw, you can also roughly estimate expected output speed if you know the device's memory throughput (rough sketch after the links below). Note that this doesn't work for MoEs.

        Also, I recently discovered that in CPU mode llama.cpp memory-maps the weights; for some models it loads less than a quarter of the file into memory.

        [0] https://huggingface.co/TheBloke/Llama-2-7B-GGUF/tree/main

        [1] https://huggingface.co/TheBloke/LLaMA-30b-GGUF/tree/main
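
        A minimal sketch of that kind of estimate, assuming the quantised file size on huggingface approximates the weight memory and that every generated token reads the full model from memory (numbers are illustrative, not measured):

          # Napkin estimate only, not a measurement.
          def estimate_vram_gb(file_size_gb, overhead_gb=2.0, context_gb=1.5):
              """Weights + OS/browser overhead + context/KV-cache headroom."""
              return file_size_gb + overhead_gb + context_gb

          def estimate_tokens_per_sec(file_size_gb, mem_bandwidth_gbps):
              """Upper bound for dense models: each token reads every weight once (not valid for MoE)."""
              return mem_bandwidth_gbps / file_size_gb

          # e.g. the 7B Q8 GGUF above (~7.1 GB) on an RTX 3090 (~936 GB/s memory bandwidth)
          print(estimate_vram_gb(7.1))              # ~10.6 GB total
          print(estimate_tokens_per_sec(7.1, 936))  # ~130 tok/s upper bound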

      • NitpickLawyer 5 days ago

        Rule of thumb is parameter_count * precision. Precision can be anything [32,16,8,4] bits. 32bits is sometimes used in training (although less now I guess), and rarely in inference. For a while now "full" precision is 16bit (fp16, bf16), fp8 is 8bit, int4 is 4bit, and so on. Everything that's not "full" precision is also known as quantised. fp8 is a quantised version of the "full" model.

        So quick napkin math can give you the VRAM usage for loading the model. 7b can be ~14GB full, 7GB in fp8 and ~3.5GB in 4bit (AWQ, int4, q4_k_m, etc). But that's just to load the model in VRAM. You also need some available VRAM to run inference, and there are a lot of things to consider there too. You need to be able to run a forward pass on the required context, you can keep a kv cache to speed up inference, you can do multiple sessions in parallel, and so on.

        Context length is important to take into account because images take a lot of tokens. So what you could do with a 7b LLM at full precision on a 16GB VRAM GPU might not be possible with a VLM, because the context of your query might not fit into the remaining 2GB.
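
        A tiny illustration of that napkin math (the bytes-per-parameter values are the rough rules of thumb above, not exact figures for any particular model):

          # Napkin math only: weights = parameter_count * bytes_per_parameter.
          # Real usage also needs headroom for activations and the KV cache,
          # which grows with context length (and images eat a lot of tokens).
          BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1, "int4": 0.5}

          def weights_gb(params_billions, precision):
              # billions of params * bytes per param ~= gigabytes of weights
              return params_billions * BYTES_PER_PARAM[precision]

          for p in ("fp32", "fp16", "fp8", "int4"):
              print(p, weights_gb(7, p), "GB")  # 28, 14, 7, 3.5 GB for a 7B model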

      • a_t48 5 days ago

        A float16 is 2 bytes. 7B * 2 bytes = 14GB. I can't say if that's an accurate number, but that's almost certainly how tonii141 calculated it.

        • sfphoton 5 days ago

          Oh, so FP16 means FloatingPoint16? I'm glad to learn something today, thanks!

GrantMoyer 5 days ago

Nice, it's really an open source model, Apache 2.0.

sandra_vu 5 days ago

Hi good job, team. Any plans to commercialize the model?

saretup 5 days ago

> Scalable Perceptual Generative Model

If you wanna call it Bagel, just call it Bagel. No need to make up a justification.