It seems like some kind of review process would need to be included to reduce abuse possibilities.
But yes, that would be quite nice, if there are enough people who don't mind uploading the file names of the media they play (or some other unique media identifier?) along with the subtitles to that service, presumably tied to user credentials. It would certainly need to be opt-in, which makes one wonder whether it would be very effective at all.
Go-to player with easy Wi-Fi loading and the ability to connect to file shares to find files. Simple and actually easy to use (of course, having a file server is another question).
This boils down to software development not being free. In VLC's case, development is funded by several for-profit companies (like Videolabs) that make their money from stuff they do with VLC (consulting, commercial services, etc.).
VLC is a good example of an OSS project that is pretty well run with decades of history that has a healthy ecosystem of people and companies earning their living supporting all that and a foundation to orchestrate the development. I don't think there ever was a lot of VC money they need to worry about. This is all organic growth and OSS working as it should.
So, this boils down to what the paying customers of these companies are paying for. The project also accepts donations, but those go to the foundation and not the companies. It's the companies that employ most of the developers, and you can't fault them for working on things that they value. If AI features are what they pay for, then that is what they work on.
I happen to share your reservations about the UX. It's a bit old school, to put it mildly. And they obviously don't have professional designers that they work with. Like many OSS products, it looks and feels like a product made by techies for techies. It doesn't bother me that much, but I do notice these things. I actually talked to one of their iOS developers a few years ago. Pretty interesting person, and not a huge team as I recall. I remember talking to her about some of the frustrations she had with the UX and the lack of appreciation of that. I think she moved to Netflix afterwards.
Like with most OSS projects you are welcome to take part in the meritocracy and push your favorite features or pay somebody to do that for you. But otherwise, you should just be grateful for this awesome thing existing and prospering.
> Like many OSS products, it looks and feels like product made by techies for techies.
That's not the problem. mpv is another media player that is arguably even more "made by techies for techies", yet it doesn't have the usability issues of VLC, and is a much more robust piece of software.
VLC is just poorly designed from the ground up, and the project's priorities are all over the place, as this AI initiative demonstrates.
They don't need designers when they are the free media player that has stood the test of time and is used by the masses. It's true organic, bottom-up design, tweaked little by little over the years.
It's not required. But it could make their product easier to use and more usable for their users. But that's clearly not something the core team values or is passionate about and I appreciate that they have other priorities.
It's common with many OSS projects. There are a few positive exceptions. But this stuff is hard.
AI subtitle generation seems like a useful feature. Hopefully they'll integrate with a subtitle sharing service so we don't have computers repeatedly duplicating work.
In the ideal case you'd probably generate a first pass of subtitles with AI, have a human review and tweak it as needed, and then share that. There's no reason for people to be repeatedly generating their own subtitles in most cases.
Android and iOS already support live captions and AI Accelerators are becoming more common in PC hardware. If you can generate it with little compute at home, then there is no need to set up a share system.
You also want local generation in a lot of cases. If you have your own videos you need to generate them yourself. For Accessibility it's fantastic if they can have subtitles on every Video.
If generating your own is fast and good enough and takes little compute, then it isn't needed to share them. Having subtitles generated by the best models and optimized by humans is better, but not needed in most cases.
A system like that would be pretty nice as long as it wasn't a privacy problem. You wouldn't really need LLMs to do the subtitles as all then though, for any video common enough to be sharable the subtitles probably already exist from the original source.
I wonder if VLC tracks their downloads per day or region. Similar to Pornhub Trends, it'd be interesting to correlate a spike of downloads with events like Netflix hiking up their prices/enforcing anti-password-sharing policies, dropping some series from their catalogue, or some hyped-up movie being released...
Would be great to add a system for sharing the results instead of having to waste resources transcribing the same video over and over again with no chance of correction.
Likely due to cost, Google has decided not to use any LLM (what basically everybody means when they say AI now) to generate the subtitles, and by modern standards they aren't very good.
(Like open source software, so that, in theory, someone could see the source code, source data, and process for how it was trained, and reproduce that? And could they change how that model is built, not just tuning atop it, and distribute that under the same open source license?)
Amazon has already rolled out AI-generated subtitles for Crunchyroll anime. I discovered this while watching Last Exile, a show that has already been subtitled and dubbed in English.
It's funny for a second, and then incredibly frustrating after that. Character names are regularly wrong, and sentences get rewritten to have different meanings. Instead of just reading subtitles, you have to mentally evaluate them for errors.
At best, AI generated subtitles should be a last resort.
That's the part that really confuses me. Somebody already did pay for translation from Japanese to English subtitles, then for a voice cast to dub the anime. The original subtitles are nowhere to be found; the closed captions are based on the English language dub.
Already happened last year, both with fansubs and official releases - Nokotan springs to mind. It was indeed a trainwreck. Although having it done real time should also be funny.
Wouldn't be new though, entertainment companies are already exploring AI subtitles; and official anime subtitles can be a trainwreck too.
> Many anime fans were concerned about the low-quality subtitles in The Yuzuki Family's Four Sons' debut episode, leading some to believe that they had been generated by A.I. These subtitles were notoriously awkward -- occasionally to the point of being nonsensical -- and featured numerous grammar issues; Crunchyroll was forced to delete the episode following outrage by fans on social media, with many asking why they didn't pay professionals to do a better job.
IMO this kind of thing is a symptom of so few people knowing multiple languages. It doesn’t take much time in a second language to realize how much of an art translation is. Heck, even if you only know English, reading a few translations of classic literature should make it obvious. I really hope AI doesn’t totally ruin the market for actually decent translations of books, films, and television by making something “good enough” so cheap that nobody gets into the industry anymore.
DeepL's business model is exactly this. The number of words translated per day has increased a lot over the past few years because of it. You let DeepL translate the text, and real translators use it as a starting point.
DeepL is already pretty good, but it still needs a proper translator for the optimal output. That translator just saves a lot of time not having to translate every word.
Totally personally skewed perspective, but thinking about the number of systems I've installed VLC on, some multiple times through rebuilds etc, 6 billion feels like a huge underestimate!
I'm guessing that's just from the main website. It's distributed so widely I'd wager we'll never know the complete number. But yeah, I also must have installed it at least 50 times or so since I first came across it.
Given modern mirrorless cameras can detect and track humans or mammals using DSPs and NNs without any discernible battery penalty at 30/60/120/240FPS (depending on camera class), I think they can do it with very small and efficient models with some frequency clamping and downsampling.
Initial versions will be more power hungry, but it'll be negligible in the end, given that modern processors have accelerators for image- and voice-based AI applications.
I wonder if there could be an option to share the generated AI subtitles so there is no duplicated effort. For example, the subs for Movie_torrent.vlc could be shared with some other user's machine that has the same Movie_torrent.vlc file.
Obviously you can find the majority of subs online for popular files, but niche cases usually have few subs available, and sharing the AI-generated ones could help the ecosystem.
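One way such matching could work without relying on file names is a content fingerprint. Below is a minimal sketch of the hash scheme OpenSubtitles-style lookups use (the file size plus 64-bit sums of the first and last 64 KiB); the function name is mine and the error handling is minimal:

```python
import struct

def media_fingerprint(path: str) -> str:
    """64-bit fingerprint in the style of the OpenSubtitles hash:
    file size plus the little-endian uint64 sums of the first and
    last 64 KiB, truncated to 64 bits."""
    chunk = 64 * 1024
    with open(path, "rb") as f:
        f.seek(0, 2)                          # jump to the end to get the size
        size = f.tell()
        if size < 2 * chunk:
            raise ValueError("file too small to fingerprint")
        total = size
        f.seek(0)
        for (value,) in struct.iter_unpack("<Q", f.read(chunk)):
            total += value
        f.seek(size - chunk)
        for (value,) in struct.iter_unpack("<Q", f.read(chunk)):
            total += value
    return f"{total & 0xFFFFFFFFFFFFFFFF:016x}"
```

Two copies of the same release fingerprint identically regardless of how the files are named, so a shared cache could key generated subtitles on this value plus a language code.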
There’s a reason it’s being implemented in an open source application that gets actual use instead of a random-word-generated-elevator-pitch being chased by VC funds and dinosaurs in the stock market.
Windows has this as a feature in the OS with their stupidly-named Copilot+ PCs. I tried it a while ago and it seemed ok-ish at translating. Wasn’t terrible for doing it all on-device
As of yet, big auto-subtitle providers like YouTube fail miserably at this, and as far as I've tested, the Whisper models do too. Would be nice if embedding it didn't become the default.
Yet another classic case of garbage-in-garbage-out to be expected from AI-generated subtitles. But it is an excellent undertaking nevertheless. It will help improve viewership in general.
The OP means to be able to play backwards frame by frame, which I believe VLC can't.
To be fair, technically speaking this isn't nearly as easy as stepping forward due to how codecs are constructed, but lots of players have this feature already.
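The reason it's hard: most codecs only predict forward from the last keyframe, so showing "the previous frame" means seeking back to a keyframe and decoding forward again. Here is a rough sketch of that dance using PyAV (Python bindings for FFmpeg); it illustrates the general approach, not VLC's actual code:

```python
import av  # PyAV, assumed installed alongside FFmpeg

def frame_before(path: str, target_pts: int):
    """Return the decoded video frame immediately before target_pts.

    Backward stepping can't just "decode the previous packet": inter
    frames depend on earlier ones, so we seek to the keyframe at or
    before the target and decode forward until we pass it.
    """
    container = av.open(path)
    stream = container.streams.video[0]
    container.seek(target_pts, stream=stream, backward=True)  # lands on a keyframe
    previous = None
    for frame in container.decode(stream):
        if frame.pts is not None and frame.pts >= target_pts:
            break
        previous = frame  # last frame strictly before the target
    return previous
```

Doing this on every backward step is expensive for long GOPs, which is why players that support it typically keep a small ring buffer of recently decoded frames.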
Unfortunately the term "open-source" AI alone is meaningless without knowing the user's definition of open-source. More often than not it just means "openly available". In this case, for example, they could be using Whisper, which is openly available, but is from OpenAI and still has all the caveats around the data used, resource usage for training, etc.
The unmitigated tendency for people to selectively morally approve homeomorphic elements of the same set based on whether or not each element is personally useful to them is extremely alarming.
A while ago I used Whisper (or rather an OSS subtitle tool that used Whisper, sadly I can't remember the name; it also converted burned-in subs to proper ones via OCR) to generate subtitles for a season of a show (a 4-season DVD set: one season had proper subs, two had burned-in subs, one had none -.-) that was too old and not popular enough to have "scene" subs. It worked impressively well. The most memorable thing for me was that a character's scream was properly attributed to the correct character.
I’d love a feature like that for Jellyfin eventually.
You can set up a Bazarr instance to use Whisper as a subtitle provider.
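For context, Bazarr's Whisper provider just talks HTTP to a small ASR container. A rough sketch of what such a request could look like, assuming the commonly used whisper-asr-webservice image listening on localhost:9000 and its /asr endpoint (parameter details may differ between versions):

```python
import requests

def transcribe_to_srt(media_path: str) -> str:
    """Send a media file to a local Whisper ASR web service and get SRT text back."""
    with open(media_path, "rb") as f:
        response = requests.post(
            "http://localhost:9000/asr",        # assumed local ASR container
            params={"task": "transcribe", "output": "srt"},
            files={"audio_file": f},
            timeout=600,                        # transcription can be slow
        )
    response.raise_for_status()
    return response.text

# print(transcribe_to_srt("episode01.mkv"))  # hypothetical file name
```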
Looks like it only works if you use the *arr stack, which I don’t.
If you're already using Jellyfin, then why not? Don't want to complicate the stack?
I've seen this sentiment a lot, and it's always confused me a bit. I only use Jellyfin as well; I've never really seen the appeal of the *arr stack. I don't see the point of having a firehose pointed at my Jellyfin instance that requires like eight additional services to run.
> I've never really seen the appeal of the *arr stack
Automation I suppose. I only run Sonarr/Radarr, so two additional services. They both work exactly the same, so it's basically "configure once, run twice".
I eventually ran out of patience for finding the right releases and whatnot, and now enjoy a life where I can just input the title, then return an hour later to everything set up for me. Any new episodes/movies will just be available automatically in Jellyfin next time I open it up on the TV. Helps that other people in the household are also able to use it, so we've gotten rid of all streaming services now.
I am mostly in the same camp but there are a few quality of life improvements that I recommend:
1. Overseerr - add tv/movies in one place
2. Custom feeds as import lists so that new, popular stuff is automatically getting added
3. Kometa (for Plex users) - custom collections via trakt and burning imdb/rt/metacritic scores into cover art
> 2. Custom feeds as import lists so that new, popular stuff is automatically getting added
What do you mean with this? Are you automatically downloading things based on popularity solely, or does this mean something else?
There's an art to subtitling that goes beyond mere speech-to-text processing. Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read. Sometimes you need to name a voice as unknown, to avoid spoilers. Sometimes the positioning on the screen matters. I hope the model can be made to understand all this.
> Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read
Please no. Some subtitle companies do think like this, and it's really weird, like when they try to "convert" cultural jokes and then layer on a bunch of additional assumptions about which cultures you're aware of depending on the subtitle language, making it even harder to understand...
Just because I want my subtitles in English doesn't mean I want every Spanish food name in the dialogue to be replaced by a "kind of the same" British one, yet I've come across exactly that before. Horrible.
I totally get this. When I'm watching videos for the purpose of learning a language, I want all the actual words in the subtitles. But if I'm watching just to enjoy, say in a language I don't care to learn, I don't mind someone creatively changing the dialog to how it probably would have been written in English. This happens with translations of novels all the time. People even seek out specific translators who they feel are especially talented at this kind of thing.
It depends on the context! Trying to Americanize Godzilla, for instance, has largely failed because Godzilla is an allegory for the unique horror of nuclear bombing which Japan experienced. Making him just a lizard that walks through New York is kind of stupid.
Jokes are an example of something translators can do really well - things like puns don't work 1:1 across languages. A good translator will find a corresponding, appropriate line of dialogue and basically keep the intent without literally translating the words.
Food is kind of silly because it's tied to place - if a setting is clearly Spanish, or a character is Spanish, why wouldn't they talk about Spanish food? Their nationality ostensibly informs something about their character (like Godzilla) and can't just be swapped out or replaced.
> Jokes are an example of something translators can do really well - things like puns don't work 1:1 across languages. A good translator will find a corresponding, appropriate line of dialogue and basically keep the intent without literally translating the words.
Again, those aren't "cultural translations" but "idioms translations", which I do agree should be translated into something understandable in the language, otherwise you wouldn't understand it.
What I was aiming at in my original comment was examples like these:
> Family Guy original voice-overs + subtitles making a joke about some typical father figure in Hollywood for example. Then the Latin Spanish subtitles will have translated that joke but replaced the specific actor with some typical father figure from the Mexican movie industry, while the Castilian subtitles would have replaced it with someone from the Spanish movie industry.
More precisely speaking, there are two kinds of subtly different subtitles with different audiences: those with auditory impairments and those with less understanding of the given language. The former will benefit from paraphrasing while the latter will be actively disadvantaged due to the mismatch.
There was an eminent Russian voiceover artist (Goblin?) who translated pirated Western movies with his own interpretation.
His translations were nowhere near what the movie was about, but they were hilarious and fit the plot perfectly.
Not all of his translations were that far from the original. For example, his translation of Guy Ritchie's Snatch was excellent (in my opinion, of course) and is still quoted to this day. I'd say it's the only one that absolutely nails it and then some.
On the other hand, his Lord of The Rings was an "alternative" dub as you described. Didn't watch that one though.
I know a little Spanish and even I get annoyed when the English subtitles don’t match what they said in Spanish. Of course I expect grammatically correct Spanish to be translated into grammatically correct English.
> Spanish food names to be replaced by "kind of the same" British names
The purpose of a translation is after all to convey the meaning of what was said. So for example you'd want the English "so so" to be translated in Spanish as "más o menos" instead of repeating the translation of "so" twice. You don't want to just translate word for word, venir infierno o alta agua.
A lot of dialog needs language specific context, many expressions don't lend themselves to literal translation, or the translation in that language is long and cumbersome so paraphrasing is an improvement.
Like with anything else, the secret is using it sparingly, only when it adds value.
> But for example you'd want the English "so so" to be translated in Spanish as "más o menos" instead of doubling down on whatever literal translation for "so" they choose.
Agree, but I don't think those are "cultural translations" but more like "idioms translations", which mostly makes sense to do.
What I originally wrote about are things like Family Guy original voice-overs + subtitles making a joke about some typical father figure in Hollywood for example. Then the Latin Spanish subtitles will have translated that joke but replaced the specific actor with some typical father figure from the Mexican movie industry, while the Castilian subtitles would have replaced it with someone from the Spanish movie industry.
I think I've encountered cases like this watching Netflix movies originally in my native language but subtitled in English. But in every case I can remember the substitution made perfect sense.
Without adapting the translation the natives will immediately understand the reference while you, the non-native with no sense of who they're talking about are left wondering what's the real message. Calling someone a "Mother Teresa" might miss the mark somewhere in China. Same if an Italian movie made references to food like Casu Marzo and the average American would probably miss a lot of the context.
Just recently I saw this in a series where some workers in the oil extraction industry were staring at a pan of paella asking what it is and calling it jambalaya. Paella is world famous, how about khash?
That's why I said I understand the practice, but it should only be done when it really helps comprehension, not out of a principle of gratuitously adapting everything to one language.
> Without adapting the translation the natives will immediately understand the reference while you, the non-native with no sense of who they're talking about are left wondering what's the real message. Calling someone a "Mother Teresa" might miss the mark somewhere in China. Same if an Italian movie made references to food like Casu Marzo and the average American would probably miss a lot of the context.
Right, but isn't that why the American is watching this Italian movie anyway, to get a wider understanding of Italian culture? I don't watch foreign movies with the expectation that they're adapted to my local culture; then there wouldn't be much point in watching them.
> isn't that why the American is watching this Italian movie anyways
I don't know. I always thought people would prefer to hear the original voice of the actor, since speech is a big part of the acting. But movies are dubbed in a lot of countries, and most people from those countries I've spoken to say that they find the idea of subtitles very odd because it distracts them from the movie.
I am sure many if not most people simply want to understand the message behind the conversation on the screen, first and foremost. Learning something new only works if, while you try to keep up with the dialogue, you also keep track of all the expressions you heard for the first time, to look them up later.
This is why I think it's the translator's job to balance translation and adaptation. Directly translate the original where context helps you understand the meaning so you get the "original" experience, and adapt where leaving just the 1:1 translation will make you lose the thread or miss some details.
That's a good example of translation where there's only really so many ways to do it. A bad example like people are talking about is the original 4Kids Pokemon where every time someone brought out an Onigiri (rice ball), they would call them jelly donuts.
There is the art of subtitling, and then there is the technical reality that sometimes you have some content with no subtitles and just want a solution now, but the content didn't come with an SRT or better yet VTT and OpenSubtitles has no match.
They're using Whisper for speech to text, and some other small model for basic translation where necessary. It will not do speaker identification (diarization), and certainly isn't going to probe into narrative plot points to figure out whether naming a character is a reveal. It isn't going to place text on the screen according to the speaker's position in the frame, nor to minimize intrusion. It's just going to have a fixed area where a best effort at speech to text is displayed, as a last resort where the alternative is nothing.
Obviously it would be preferred to have carefully crafted subtitles from the content creator, translating if the desired language isn't available but still using all the cues and positions. Secondly to have some carefully crafted community subtitles from opensubtitles or the like, maybe where someone used "AI" and then hand positioned/corrected/updated. Failing all that, you fall to this.
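To make concrete what that last-resort output looks like, here is a minimal sketch using the openai-whisper Python package; the VLC work is based on whisper.cpp rather than this package, so treat it purely as an illustration of the shape of the result (timestamped segments of text, no speakers, no positioning):

```python
import whisper  # pip install openai-whisper; also needs ffmpeg on the PATH

def generate_srt(media_path: str, srt_path: str, model_name: str = "base") -> None:
    """Transcribe a media file and write plain SRT: timestamps and text only."""
    model = whisper.load_model(model_name)   # downloads the weights on first use
    result = model.transcribe(media_path)    # ffmpeg handles the audio extraction

    def stamp(seconds: float) -> str:
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    with open(srt_path, "w", encoding="utf-8") as out:
        for i, seg in enumerate(result["segments"], start=1):
            out.write(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n")
            out.write(seg["text"].strip() + "\n\n")
```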
> better to paraphrase dialog to reduce the amount of text that needs to be read.
That's just bad destructive art, especially for a foreign language that you partially know.
> Sometimes you need to name a voice as unknown, to avoid spoilers.
Don't name any, that's what your own eye-ear voice recognition/matching and positioning are for (also reduces the amount of text)
> Sometimes the positioning on the screen matters.
This is rather valuable art indeed! Though it's unlikely to be modelled well.
> Don't name any, that's what your own eye-ear voice recognition/matching and positioning are for
That’s tricky when one or more speakers aren’t visible.
They'd also have to start speaking at the same time and have similar voices to make it tricky
You’re assuming that the subtitle user can hear the voice track here.
Yes I am; the other option is the "hearing impaired" variant, so let that one have all the extra info.
I've firsthand encountered several situations with subtitles where it would have been ambiguous who was speaking without speaker annotations, despite the voices being distinctive, me being able to hear them clearly, etc. Just think of a rapid exchange with neither speaker on-screen for more than two or three sentences and replies.
You can probably get 99% there without that for a lot of content, but I'd challenge the notion that this is somehow only important for hearing impaired viewers (or people just watching without clearly audible sound for other reasons).
I guess you can have that in a real-life situation as well, where you don't have subtitles at all (hello AR) and still manage. Do you want your subtitles full of metadata for the whole movie, for every movie, every day, just for those few situations where the director made a mess of the scene? You can always replay the confusing scene.
I definitely prefer subtitles to be as helpful as possible, yes. That includes having situationally appropriate metadata (which is different from "all the metadata, all the time").
I don't think that's an unrealistic goal to have for AIs; they're already extremely good at semantic scene description after all. By looking at the image in addition to just the audio track they probably also get a lot more metadata, which a refined world model will eventually be able to use just like a human subtitle editor can today.
So you mean, the AI should figure out when a problematic scene is coming, and only then add labels and whatnot? Not impossible, just somebody must teach them, same with subtitles positioning.
Like other commenters pointed out, we are talking about two different types of subtitles - those for the hearing impaired have very different requirements. I'm not sure which one VLC is going to cover. Best would be both, just don't mix them up please.
Subtitles aren't just for foreign viewers though, they're also for native speakers who are now hearing impaired.
Sure, that's what the HI version is for.
> Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read.
Questionable. It drives me crazy to have subtitles that are paraphrased in a way that changes the meaning of statements.
AI subtitles are just text representation of the sound track.
There is no need for artistic interpretation, substituting words, or hiding information. If it’s in the audio, there’s no reason to keep it out of the subtitle.
An AI subtitle generator that takes artistic license with the conversion is not what anyone wants.
That doesn't work for idioms, certainly in Italian, which has multiple colorful metaphors which would be mystifying if translated directly.
I don't think anybody's talking about translations.
> Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read
I really hope people to stop doing that.
This is horrible for people who learn languages using TV Shows and Movies. One of the most frustrating things I've encountered while learning German is the "paraphrase" thing, it makes practicing listening very hard, because my purpose wasn't to understand what was being said, but rather familiarizing my ear with spoken German.
So, knowing exactly the words being said is of utter importance.
> Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read
NO!
I speak and understand 90% of English but I still use subtitles because sometimes I don't understand a word, or the sound sucks, or the actor thought speaking in a very low voice was a good idea. When the subtitles don't match what's being said, it's a terrible experience.
As long as they're synced properly I don't care much, some movies/shows have really bad sound mix and it's not always possible to find good subs in the first place
I suppose this feature should have been termed closed captioning and not subtitling. It seems you're not going to get much sympathy for human translation here.
> There's an art to subtitling that goes beyond mere speech-to-text processing.
Agreed.
> Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read.
Hard no. If it’s the same language, the text you read should match the text you listen to. Having those not match makes parsing confusing and slow.
> Sometimes you need to name a voice as unknown, to avoid spoilers.
Subtitles don't usually mention who's talking, because you can see that. Naming the source of a voice is uncommon and not something I expect these systems to get right anyway.
The best subtitles I've had were fantrads
> Sometimes it's better to paraphrase dialog to reduce the amount of text that needs to be read.
Pretty sure this is a violation of the Americans with Disabilities Act, so illegal in the U.S. at least. Being Deaf doesn't mean you need "reduced" dialogue.
I recently used some subtitles that I later found out had been AI generated.
The experience wasn't really good to be honest - the text was technically correct but the way multiline phrases were split made them somehow extremely hard to parse. The only reason I learned AI was involved is that it was bad enough for me to stop viewing and check what was wrong.
Hopefully it's an implementation detail and we'll see better options in the future, as finding subtitles for foreign shows is always a pain.
This reminds me of Prime Video subtitles. Anything that isn't a Hollywood blockbuster will only have one language (seemingly chosen at random) of garbage quality (not sure whether AI generated, though). But there's worse anyway - some Asian titles are ONLY available in badly dubbed versions, again in some random language (hello Casshern in... German???). So I see this VLC initiative as an improvement over this very, very low bar.
On Youtube I really like the auto-generated captions and often prefer them to the creator's because:
- Sometimes the creator bases their captions on the script and misses changes in edit
- Sometimes the creator's captions are perfect transcriptions but broken up and timed awkwardly
Auto-generated captions aren't always perfect, but unlike human captions they provide word-by-word timing.
Yeah, I definitely wish other media would experiment with Youtube-style word-at-a-time subtitles. They often feel a lot more natural than full-sentence subtitles, the way they stream in is better at providing "connecting tissue", they never spoil upcoming reveals the way subtitles tend to, etc.
(By "connecting tissue", I mean they don't have the problem where sentence A is "I like chocolate", sentence B is "only on special occasions", and at the time B appears A is completely gone, but you really need A and B to be onscreen at the same time to parse the full meaning intuitively.)
Proper subtitles are obviously better, but it's impossible to do on everything. The tech is going to get better, and is already a game changer for hearing impaired people. Subtitles that are mostly correct are much better than none at all.
VLC has the option to find subtitles. If you use Plex or Jellyfin there are add-ons, or Bazarr, which does it automatically.
I had exactly the same experience. Human expertise in subtitling makes the experience much better.
If you have the luxury of requiring subtitles in English, sure. There's a huge scene of people making them and high quality subtitles available for pretty much everything. If you need subs in another language though your experience might change dramatically. Especially for any media that is old or less popular, in which case your options are probably either really bad subs, out of sync subs, or most likely, none whatsoever.
As a Romanian, I'm so sick of AI translations on YouTube, especially since they use Google's translation (OpenAI's at least works quite well). Here's an example (translated back to English):
> Man Builds Background of Tan for Swedish House
It's completely puzzling. To understand it you have to know both English and Romanian. "Background of tan" is the term for "foundation" (makeup) in Romanian. That is, "foundation" has two meanings in English: for a house and for makeup, but Google uses the wrong one.
Automatic translation is full of these bad associations. I have no idea how people who don't speak English understand these translations.
It's really sad that I'm reading "open source model" and think "hmhm, as if".
Maybe they're really using a truly open source model (probably not) but the meaning of the word is muddied already.
https://code.videolan.org/videolan/vlc/-/merge_requests/5155
Here they are working on integrating Whisper.cpp
In the search bar it says "Updated 2 weeks ago", as if there were additional recent comments or actions in that thread that we cannot see.
So it could actually be the OpenAI Whisper model, for which we have the final binary format (the weights) but not the source training data; still, it is the best you can get for free.
The meaning of "AI" and "open source model" have both been muddied enough to be pretty meaningless.
Yeah, it'd be nice if we could all use 'open source' to mean 'open weights' + 'open training set', instead of just 'open weights'. I fear that ship has sailed though. Maybe call it a 'libre' model or something?
Very excited for this, but it's a waste of energy if everyone needs to process their video in real time.
Why are we still talking about this? Computers are INCREDIBLY efficient and are still becoming orders of magnitude more efficient. Computation is really negligible in the grand scheme of things. In the 80s some people also said that our whole world energy supply would go to computation in the future. And look today: it's less than 1%. We do orders of magnitude more computation, but the computers have become orders of magnitude more efficient too.
As another way to look at this, where does this questioning of energy use end? Should I turn off my laptop when I go to the supermarket? When I go to the toilet? Should I turn off my lights when I go to the toilet?
My point is, we do a lot of inefficient things and there is certainly something to being more efficient. But asking "is it efficient" immediately when something new is presented is completely backwards if you ask me. It focuses our attention on new things even though many old things are WAY more inefficient.
> Computers are INCREDIBLY efficient and still become orders of magnitude more efficient.
That's what a software engineer would say who views resources as unlimited and free.
Do you not pay for the energy you use?
> don't cross-examine
https://news.ycombinator.com/newsguidelines.html
> That's what a software engineer would say who views resources as unlimited and free.
from the same guidelines: don't be snarky; don't post thoughtless insubstantive comments; reply to the argument instead of calling names; don't sneer; respond to the strongest interpretation - assume good faith; eschew flamebait; don't trample curiosity.
> from the same guidelines: don't be snarky; don't post thoughtless insubstantive comments; reply to the argument instead of calling names; don't sneer; respond to the strongest interpretation - assume good faith; eschew flamebait; don't trample curiosity.
You're making a litany of unsolicited accusations while subtracting value in a conversation you're not a part of. Please stop with the self-appointed fixation on microaggression policing because it just creates pointless drama. Thanks.
... who would "solicit accusations"? How am I "not a part of" it? How can I stop self-appointing what I comment about? Why are you knowingly posting microaggressions at all? Why do you think it creates drama, and why do you think said drama is pointless? And why aren't you responding to the strongest good-faith interpretation of my comment?
> In the 80s some people also said that our whole world energy would go to computations in the future. And look today.
Today we consume twice as much energy as we did in the 80s (and that increase comes mostly from fossil fuel consumption). Datacenters alone consume more than 1% of global energy production, and that doesn't include the network, the terminals, or the energy needed to produce all of the hardware.
> Today we consume twice as much energy as we did in the 80s
Who's "we"?
Worldwide, per-capita CO2 emissions have gone from 4.4 tons/person to 4.7 tons/person (a 7% increase) between 1980 and 2023: https://ourworldindata.org/grapher/co-emissions-per-capita?t...
> Datacenters alone consume more than 1% of global energy production
Energy-use or electricity-use? Only about 20% of total energy use is electricity [1]. I went through the math for EUV machines in more detail at https://news.ycombinator.com/threads?id=huijzer#42600790.
[1]: https://ourworldindata.org/energy/country/united-states
> Should I turn off my laptop when I go to the supermarket?
Yes.
> Why are we still talking about this? Computers are INCREDIBLY efficient and still become orders of magnitude more efficient.
Because today is today, and if we can project that the energy cost of doing a task n times on the client side outweighs the cost and complexity of doing it once and then distributing the result to all n clients, we should arguably still do the latter.
Sometimes it's better to wait; sometimes it's better to ship the improved version now.
> Computation is really negligible in the grand scheme of things.
Tell that to my phone burning my hand and running through a quarter of its battery for some local ML task every once in a while.
What is the alternative?
The results could be cached, but it's unlikely they would need to be used again later, as I imagine most videos are watched only once.
Another option would be to upload the generated subtitles to some service or share them over p2p, but I believe that would also have problems (e.g. privacy, who runs the service, firewalls for p2p, etc.).
VLC can actually send information about what you are playing to Google:
https://github.com/videolan/vlc/blob/f908ef4981c93a8b76805ad...
and to their own servers:
https://github.com/videolan/vlc/blob/f908ef4981c93a8b76805ad...
so it could fetch subtitles at the same time?
edit: cf. what "a3w" says too.
Downloading is still much easier to handle than uploading.
Tangentially related: funnily enough, this site was just submitted to HN yesterday - https://exampl.page/
You can navigate to $foo.exampl.page and it will generate a website on the fly with text and graphics using AI. It will then save and cache the page.
It’s admittedly a useless but cool little demo.
This is exactly the answer imo. We have subtitle files, which is in effect a cache. Process once, read many times.
So while I'm excited this feature is now available, the real answer, imo, is having high-quality AI-generated subtitles cached in one place.
Authoritatively-correct subtitles rather than distributed generation and/or publication by anyone and everyone, including AI.
I don't know how many times I've seen subtitles that appear to be based on a script or were half-assed, and don't match the dialogue as spoken at all.
VLC has a "check my media online" feature, next to "check for updates to VLC on startup" already. Could they offer subtitle downloads?
Previously, that was used for mp3 album covers or something?
OpenSubtitles as a cache, maybe? (With their collaboration, of course.)
It seems like some kind of review process would need to be included to reduce abuse possibilities.
But yes, that would be quite nice, if there are enough people who don't mind uploading the file names of the media they play (or some other unique media identifier?) along with the subtitles to that service—and with user credentials? It would certainly need to be opt-in, which makes one wonder whether it would be very effective at all.
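For what it's worth, one common "unique media identifier" is the classic OpenSubtitles-style moviehash (the file size plus 64-bit sums of the first and last 64 KiB), which avoids relying on file names entirely. A minimal Python sketch, purely illustrative; the cache or service it would key into is hypothetical:

    import struct

    def moviehash(path: str) -> str:
        # Classic OpenSubtitles-style hash: file size + 64-bit sums of the
        # first and last 64 KiB, truncated to 64 bits.
        chunk = 64 * 1024
        fmt = "<%dQ" % (chunk // 8)          # 8192 little-endian uint64 values
        with open(path, "rb") as f:
            f.seek(0, 2)
            size = f.tell()
            if size < 2 * chunk:
                raise ValueError("file too small to hash")
            total = size
            f.seek(0)
            total += sum(struct.unpack(fmt, f.read(chunk)))
            f.seek(size - chunk)
            total += sum(struct.unpack(fmt, f.read(chunk)))
        return "%016x" % (total & 0xFFFFFFFFFFFFFFFF)

    # Hypothetical usage: key uploaded/downloaded subtitles by this hash
    # instead of by file name.
    # print(moviehash("movie.mkv"))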
VLC is excellent on iOS -- highly recommended!
It's my go-to player, with easy Wi-Fi loading and the ability to connect to file shares to find files. Simple and actually easy to use (of course, having a file server is another question).
me: waiting years for VLC to fix basic usability issues and reconcile UI across different platforms
VLC: we're gonna work on AI
The blanket criticism of AI is ridiculous. AI generated subtitles solves real issues for users.
How is criticism of VLC project management a criticism of AI?
Dude you need to level up your reasoning skills.
When you write "VLC: we're gonna work on AI" you’re clearly implying AI is worthless.
This boils down to software development not being free. In VLC's case, the development is funded by several for profit companies (like videolabs) that make their money from stuff they do with VLC (consulting, commercial services, etc.).
VLC is a good example of an OSS project that is pretty well run with decades of history that has a healthy ecosystem of people and companies earning their living supporting all that and a foundation to orchestrate the development. I don't think there ever was a lot of VC money they need to worry about. This is all organic growth and OSS working as it should.
So, this boils down to what paying customers of these companies are paying for. The project also accepts donations, but those go to the foundation and not the companies. It's the companies that employ most of the developers. And you can't fault them for working on things that they value. If AI features are what they pay for, then that is what they work on.
I happen to share your reservations about the UX. It's a bit old school, to put it mildly. And they obviously don't have professional designers that they work with. Like many OSS products, it looks and feels like a product made by techies for techies. It doesn't bother me that much, but I do notice these things. I actually talked to one of their iOS developers a few years ago. Pretty interesting person, and not a huge team as I recall. I remember talking to her about some of the frustrations she had with the UX and the lack of appreciation of that. I think she moved to Netflix afterwards.
Like with most OSS projects you are welcome to take part in the meritocracy and push your favorite features or pay somebody to do that for you. But otherwise, you should just be grateful for this awesome thing existing and prospering.
> Like many OSS products, it looks and feels like product made by techies for techies.
That's not the problem. mpv is another media player that is arguably even more "made by techies for techies", yet it doesn't have the usability issues of VLC, and is a much more robust piece of software.
VLC is just poorly designed from the ground up, and the project's priorities are all over the place, as this AI initiative demonstrates.
Those are just your priorities. I don't have any usability issues with VLC, but would use the AI subtitles.
They don't need designers when they are the free media player that has stood the test of time and is used by the masses. It's true organic, bottom-up design, tweaked little by little over the years.
It's not required. But it could make their product easier to use and more usable for their users. But that's clearly not something the core team values or is passionate about and I appreciate that they have other priorities.
It's common with many OSS projects. There are a few positive exceptions. But this stuff is hard.
Being a designer in an OSS project must be quite a unique job, too.
Don't they still consider "app locks up when playlist free-loops" not a bug?
AI subtitle generation seems like a useful feature. Hopefully they'll integrate with a subtitle sharing service so we don't have computers repeatedly duplicating work.
In the ideal case you'd probably generate a first pass of subtitles with AI, have a human review and tweak it as needed, and then share that. There's no reason for people to be repeatedly generating their own subtitles in most cases.
Android and iOS already support live captions, and AI accelerators are becoming more common in PC hardware. If you can generate them with little compute at home, then there is no need to set up a sharing system.
You also want local generation in a lot of cases. If you have your own videos, you need to generate subtitles yourself. For accessibility, it's fantastic if every video can have subtitles.
If generating your own is fast, good enough, and takes little compute, then there is no need to share them. Having subtitles generated by the best models and refined by humans is better, but not needed in most cases.
A system like that would be pretty nice, as long as it wasn't a privacy problem. You wouldn't really need LLMs to do the subtitles at all then, though: for any video common enough to be shareable, the subtitles probably already exist from the original source.
How do you do this in a privacy preserving way?
The only feasible way I can think of is a locally run model. Perhaps whisper?
I think they meant the sharing
Whoopsie, good point. I agree.
Run the model on your PC. Done?
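To make "run the model on your PC" concrete, here's a minimal sketch using the openly downloadable Whisper package (nothing VLC-specific; the model size, file names, and the hand-rolled SRT writer are just placeholders, and ffmpeg needs to be installed so Whisper can read the audio track):

    # pip install openai-whisper  (weights are fetched once, then everything runs offline)
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("movie.mkv")

    def ts(t: float) -> str:
        # Format seconds as an SRT timestamp (HH:MM:SS,mmm).
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    with open("movie.srt", "w", encoding="utf-8") as out:
        for i, seg in enumerate(result["segments"], start=1):
            out.write(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n\n")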
I wonder if VLC tracks their downloads per day or region. Similar to Pornhub Trends, it'd be interesting to correlate a spike of downloads with events like Netflix hiking up their prices/enforcing anti-password-sharing policies, dropping some series from their catalogue, or some hyped-up movie being released...
Would be great to add a system for sharing the results instead of wasting resources transcribing the same video over and over again with no chance of correction.
interesting for comparison, tweaking, analysis,..
Isn't this what YouTube has been doing with automatically generated subtitles for videos for years?
I've always found their quality to be somewhat lacking.
Yes and no.
Likely due to cost, Google hasn't moved to an LLM-based model (which is basically what everybody means when they say AI now) to generate the subtitles, and by modern standards they aren't very good.
Quality definitely can be lacking, on the other hand, it is significantly better than no subtitles at all.
Will the ML model be fully open?
(Like open source software, so that, in theory, someone could see the source code, source data, and process for how it was trained, and reproduce that? And could they change how that model is built, not just tune atop it, and distribute that under the same open source license?)
It's highly likely that they'll be using Whisper; if so, it won't be fully open. But I could be wrong, of course.
Hmm I wonder if this had anything to do with this? https://www.youtube.com/watch?v=0dUnY1641WM&themeRefresh=1
Can’t wait to watch anime with AI generated subs, it will be a beautiful trainwreck.
Amazon has already rolled out AI-generated subtitles for Crunchyroll anime. I discovered this while watching Last Exile, a show that has already been subtitled and dubbed in English.
It's funny for a second, and then incredibly frustrating after that. Character names are regularly wrong, and sentences get rewritten to have different meanings. Instead of just reading subtitles, you have to mentally evaluate them for errors.
At best, AI generated subtitles should be a last resort.
You get what you pay for I guess. Translators are paid pennies and AI is making it much worse.
Who in their right mind would aspire to become a professional translator today?
That's the part that really confuses me. Somebody already did pay for translation from Japanese to English subtitles, then for a voice cast to dub the anime. The original subtitles are nowhere to be found; the closed captions are based on the English language dub.
Already happened last year, both with fansubs and official releases - Nokotan springs to mind. It was indeed a trainwreck. Although having it done real time should also be funny.
Wouldn't be new though, entertainment companies are already exploring AI subtitles; and official anime subtitles can be a trainwreck too.
> Many anime fans were concerned about the low-quality subtitles in The Yuzuki Family's Four Sons' debut episode, leading some to believe that they had been generated by A.I. These subtitles were notoriously awkward -- occasionally to the point of being nonsensical -- and featured numerous grammar issues; Crunchyroll was forced to delete the episode following outrage by fans on social media, with many asking why they didn't pay professionals to do a better job.
https://www.cbr.com/crunchyroll-ai-anime-subtitles-investmen...
As a former fansub translator, it's better for mankind this way.
Further thoughts on the matter: https://old.reddit.com/r/grandorder/comments/dnpzrh/everyone...
IMO this kind of thing is a symptom of so few people knowing multiple languages. It doesn’t take much time in a second language to realize how much of an art translation is. Heck, even if you only know English, reading a few translations of classic literature should make it obvious. I really hope AI doesn’t totally ruin the market for actually decent translations of books, films, and television by making something “good enough” so cheap that nobody gets into the industry anymore.
DeepL's business model is exactly this. The number of words translated per day has increased a lot over the past few years because of it. You let DeepL translate the text, and real translators use it as a starting point.
DeepL is already pretty good, but it still needs a proper translator for the optimal output. That translator just saves a lot of time not having to translate every word.
You don't need to wait; you can use https://github.com/m-bain/whisperX right now for STT with timestamps and diarization.
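A rough Python sketch of the whisperX flow, following its README as I remember it (function names and arguments may differ between versions; the model size, device, file name, and Hugging Face token are placeholders):

    import whisperx

    device = "cpu"                                   # or "cuda" with a GPU
    audio = whisperx.load_audio("episode.mkv")

    # 1. Transcribe
    model = whisperx.load_model("small", device, compute_type="int8")
    result = model.transcribe(audio, batch_size=8)

    # 2. Align for word-level timestamps
    align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 3. Diarize and attach speaker labels (pyannote needs a Hugging Face token)
    diarize = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
    result = whisperx.assign_word_speakers(diarize(audio), result)

    for seg in result["segments"]:
        print(seg.get("speaker", "SPEAKER_??"), seg["start"], seg["text"])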
Totally personally skewed perspective, but thinking about the number of systems I've installed VLC on, some multiple times through rebuilds etc, 6 billion feels like a huge underestimate!
I'm guessing that's just from the main website. It's distributed so widely I'd wager we'll never know the complete number. But yeah, I also must have installed it at least 50 times or so since I first came across it.
Finally, we can daisy-chain translations onto freshly generated anime subtitles, and the world will find out what "dattebayo" stands for.
The problem with this one is that the quality is horrible, especially if the language is not English. Sometimes it's completely wrong.
Out of curiosity, what's VLC doing at CES? Do they have a commercial offering?
This is how a media player should work and be developed. Learn that, RealPlayer.
I wonder how powerful your PC will have to be to run this locally.
Given that modern mirrorless cameras can detect and track humans or mammals using DSPs and NNs at 30/60/120/240 FPS (depending on camera class) without any discernible battery penalty, I think this can be done with very small and efficient models plus some frequency clamping and downsampling.
Initial versions will be more power hungry, but it'll end up negligible, given that modern processors have accelerators for image- and voice-based AI applications.
I wonder if there could be an option to share the generated AI subtitles so there is no duplicate effort. For example, subs for Movie_torrent.vlc could be shared with another user's machine that has the same Movie_torrent.vlc file.
Obviously you can find the majority of subs online for popular files, but niche cases usually have few subs available, and sharing AI-generated ones could help the ecosystem.
Whisper.cpp ran fine on my GPU-less 2013 era 8GB i5 Dell laptop.
Jesus Christ, is any software safe from AI? How long until curl gets AI? How about ls? Git?
No, time to find an AI-less music player.
now here's an AI integration that doesn't seem pointless and stupid
There’s a reason it’s being implemented in an open source application that gets actual use instead of a random-word-generated-elevator-pitch being chased by VC funds and dinosaurs in the stock market.
In fact it's a feature I've been waiting for someone to do!
Windows has this as a feature in the OS with their stupidly-named Copilot+ PCs. I tried it a while ago and it seemed ok-ish at translating. Wasn’t terrible for doing it all on-device
https://support.microsoft.com/en-us/windows/use-live-caption...
Finally something that can potentially run on those new shoehorned NPUs and isn't just more spying.
As of yet, big auto-subtitle providers like YouTube fail miserably at this, and as far as I've tested, the Whisper models do too. It would be nice if embedding it didn't become the default.
I wonder which STT model VLC could use for this application that is both fast enough and accurate.
What about fixing subtitle regressions first? Then put the subtitles on a separate layer.
Yet another classic case of garbage-in, garbage-out is to be expected from AI-generated subtitles. But it is an excellent undertaking nevertheless; it will help improve viewership in general.
Maybe we will get AI generated frame-stepping backwards by 2035? /s
Which is a very common feature; even YouTube can do it if you press "," (or the button next to "M", I guess).
Edit: I originally mentioned frame interpolation, but it seems this was about something else: being able to go one frame backward.
The OP means being able to step backwards frame by frame, which I believe VLC can't do.
To be fair, technically speaking this isn't nearly as easy as stepping forward due to how codecs are constructed, but lots of players have this feature already.
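Roughly why it's hard: most codecs only let you decode forward from a keyframe, so "previous frame" usually means seeking back to the last keyframe and re-decoding everything up to the target. A conceptual sketch with a made-up decoder interface, just to show the shape of the work:

    # Pseudo-decoder API (hypothetical), illustrating backward frame-stepping.
    def step_backward(decoder, current_index):
        target = current_index - 1
        key = decoder.previous_keyframe(target)   # nearest keyframe at or before target
        decoder.seek(key)
        frame = None
        # P/B frames reference earlier frames, so everything from the keyframe
        # up to the target has to be decoded again.
        for _ in range(key, target + 1):
            frame = decoder.decode_next()
        return frame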
> frame-stepping backwards
That's not frame interpolation. Just being able to step back one frame at a time.
Implement it yourself, then? Or how is this relevant to the feature this post is about?
Please no AI.
> using open-source AI models that run locally on users’ devices, eliminating the need for internet connectivity or cloud services
this isn't the bad kind of AI we hate... it could actually be useful!
Unfortunately the term "open-source AI" alone is meaningless without knowing the user's definition of open source. More often than not it just means "openly available". In this case, for example, they could be using Whisper, which is openly available but comes from OpenAI and still has all the caveats around training data, resource usage for training, etc.
The unmitigated tendency for people to selectively morally approve homeomorphic elements of the same set based on whether or not each element is personally useful to them is extremely alarming.
There isn't enough information to judge. For example, the article doesn't mention if the training data was legally and ethically acquired.
Sorry if this is a hot take, but this is of no importance to me.
Ok, stop using your CPU then