Phonetic Matching

77 points by raybb 4 days ago

asveikau 3 days ago

The idea that "shore" and "sure" are pronounced "almost identically" would depend pretty heavily on your accent. The vowel is pretty different to me.

Also, the matches for "sorI" and "sorY" would seem to me to misinterpret the words as having a vowel at the end, rather than a silent vowel. If you're using data meant for foreign surnames, the rules of which may differ from English and which might have silent vowels be very rare depending on the original language, of course you may mispronounce English words like this, saying both shore and sure as "sore-ee".

I'm sure there are much better ways to transcribe orthography to phonetics, probably people have published libraries that do it. From some googling, it seems like some people call this type of library a phonemic transcriber or IPA transcriber.

smoores 3 days ago

It's true, "sure" and "shore" are not pronounced exactly the same, and accents absolutely can vary, which is part of why Beider-Morse produces multiple encodings for each word. But the goal of Soundex-style phonetic encoding systems isn't to perfectly encode a word with a precise alphabet like the IPA. Rather, they intentionally introduce fuzziness so that words (really, names) that are pronounced similarly will be encoded the same way.
Perhaps "sure" and "shore" was a bad example; it's tricky to come up with these! And you're right that the encodings that happen to overlap for those words are technically "incorrect" pronunciations; again, these Soundex-style encoders are designed for surnames, not general English words. Some Storyteller users are testing out a version of Storyteller using this encoder to see if it makes any improvements (so far it seems like it's not worse, but not necessarily better!), but I won't be surprised if it doesn't end up making it into Storyteller long term.
Mostly I wrote this piece not to advocate for using BMPM to support forced alignment, but as a way to express the emotional journey that I found myself on as I learned more about these systems and where they came from.
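
To make the "intentional fuzziness" concrete, here is a minimal sketch of classic American Soundex in Python. It is not Beider-Morse and not what Storyteller runs; the name `soundex` and the table layout are illustrative choices. It only shows how a coarse code drops vowels and merges similar consonants, so that differently spelled words can land on the same key.

```python
# Consonant groups of classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate duplicate codes
        code = CODES.get(ch, "")  # vowels (and unknown letters) map to ""
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated consonant is kept
    return (encoded + "000")[:4]  # pad or truncate to letter + 3 digits


for word in ("sure", "shore", "sore", "Robert", "Rupert"):
    print(word, soundex(word))
```

With this sketch "sure", "shore" and "sore" all come out as S600, and "Robert" and "Rupert" both as R163, which is the kind of over-merging a surname-oriented encoder accepts on purpose.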

xdennis 3 days ago

> The idea that "shore" and "sure" are pronounced "almost identically" would depend pretty heavily on your accent.
For an idea of how badly various accents can complicate recognition, see how Baltimoreans pronounce "Aaron Earned An Iron Urn": https://www.youtube.com/watch?v=Oj7a-p4psRA

woodrowbarlow 3 days ago

IPA is the most-used tool by linguistic researchers for encoding pronunciation in a standardized way. IPA is criticized for being a little bit anglo-centric and falls short for some languages and edge cases, but overall it performs pretty well. (learned from an ex who studies linguistics.)
- bane 3 days ago
  
  This is sort of the inverse of the problem IPA is trying to solve. You're correct in that IPA is used to try to encode pronunciation. But phonetic matching is trying to encode those areas where different people, in different accents (maybe languages), say or write semantically the same thing, but differently -- and you need to find all the others using only one of the different versions, without also finding things that are different or irrelevant.
  Basically it's trying to smush all the different versions together into a single sort of cluster, where the identity of the cluster is any of the versions.
  I used to work in this field about 30 years ago, specifically how names can end up being latinized when coming from non-latin languages. We were very focused on trying to collapse variants into a complex ruleset that could be used both to recognize the cluster of names as being the same "thing", and then that ruleset could also produce all the valid variants. It was very much a kind of applied "expert systems" approach that predated ML.
  The rulesets were more or less context free grammars and regular expressions that we could use to "decompile" a name token into a kind of limited regular expression (no infinite closures) and then recompile the expression back into a list of other variants. Each variant in turn was supposed to "decompile" back into the same expression so a name could be part of a kind of closed algebra of names all with the same semantic meaning.
  For example:
  A Korean name like "Park" might turn into a {rule} that would also generate "Pak", "Paek", "Baek", etc.
  Any one of those would also generate the same {rule}.
  In practice it worked surprisingly well, and the struggle was mostly in identifying the areas where the precision/recall in this scheme caused the names to not form a closed algebra.
  Building the rules was an ungodly amount of human labor though, with expert linguists involved at every step.
  These days I'm sure the problem would be approached in an entirely different way.
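
  A toy sketch of that "decompile / recompile" round trip, in Python. Everything here is hypothetical: the slot-based rule format, the `PARK_RULE` contents and the function names are made up for illustration and are not the actual system described above. A rule is a finite sequence of slots (no infinite closures), each listing interchangeable spellings; expanding it yields the variants, and every variant matches back to the same rule.

```python
import itertools
import re

# Hypothetical rule: each slot lists spellings treated as interchangeable.
PARK_RULE = (("P", "B"), ("a", "ae"), ("r", ""), ("k",))


def expand(rule):
    """Recompile: generate every variant the rule allows."""
    return {"".join(parts) for parts in itertools.product(*rule)}


def to_regex(rule):
    """Turn the rule into a finite regular expression (alternations only)."""
    pattern = "".join(
        "(?:" + "|".join(re.escape(option) for option in slot) + ")"
        for slot in rule
    )
    return re.compile(pattern)


def decompile(name, rules):
    """Return the first rule whose expression matches the whole name."""
    for rule in rules:
        if to_regex(rule).fullmatch(name):
            return rule
    return None


variants = expand(PARK_RULE)
print(sorted(variants))
# Closure check: every generated variant decompiles to the same rule.
print(all(decompile(v, [PARK_RULE]) == PARK_RULE for v in variants))
```

  The expansion contains "Park", "Pak", "Paek" and "Baek", but also forms such as "Bark", which hints at the precision/recall trade-off: a rule loose enough to catch every real variant also generates some that no one would accept.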
  - rhdunn 3 days ago
    
    In English there was analysis by John Wells defining several lexical sets for vowels [1]. Various other linguists have extended that to cover other accents such as Scottish, Irish, and Welsh.
    Words in the same lexical sets are the result of splits and mergers. Usually through processes like you describe -- e.g. the Southern English BATH vowel resulted from a lengthening of the TRAP vowel and then changing quality to that of the PALM vowel.
    Lexical sets don't cover consonant changes between accents such as rhotic r, the /sh/-/sj/ merger in shore and sure, tapped t, glottalized t, etc.
    And on the thing you are talking about, Colin Gorrie has some YouTube videos on doing that comparative linguistics and rule construction. A lot of his videos are doing that for conlangs, but there are videos with real historical accents in several languages.
    An example in English is the shift in pronunciation of 's' before 'u' from the /sj/ glide to the /sh/ sibilant so that in accents with that shift <sure> and <shore> are homophones (especially with the CURE-FORCE merger).
    There are computer programs that you can use to express these rules and see how the pronunciations change over time. I think Colin uses one in some of his videos and Biblaridion uses one in some of his conlang videos to check the phonetic evolution of his conlangs.
    [1] https://en.wikipedia.org/wiki/Lexical_set
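
    As a rough idea of what such a rule applier does, here is a minimal Python sketch. The simplified non-rhotic transcriptions ("sjʊə" as an assumed older form of sure, "ʃɔː" for shore), the two rules and the name `evolve` are all illustrative assumptions, not the notation or rule format of any particular tool.

```python
# Ordered rewrite rules over a simplified phoneme string.
# Both the notation and the rule list are assumptions for illustration.
RULES = [
    ("sj", "ʃ"),   # the /sj/ glide coalesces into the /ʃ/ sibilant
    ("ʊə", "ɔː"),  # CURE-FORCE merger
]


def evolve(form, rules=RULES):
    """Apply each rule in order and return every distinct stage."""
    history = [form]
    for old, new in rules:
        form = form.replace(old, new)
        if form != history[-1]:
            history.append(form)
    return history


print(evolve("sjʊə"))  # sure, from the assumed older form
print(evolve("ʃɔː"))   # shore, untouched by both rules
```

    In an accent that applies both rules the two words end on the same form; drop either rule from the list and they stay distinct.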
    
    bane 3 days ago
    
    Great response!
    I forgot to mention that we struggled much more with cognate names that were more linguistically distant than with ones that were closer, e.g. Matthew and Mattieu were likely to be within the same lexical set, but Matityahu may have been a bit too far for us.
    It's interesting how some names tend to have more conserved features than others as they transit across larger distances in terms of language families. I worked later in genetics and was able to reapply many of the learnings from names to gene sequences.
    I used to test our own software with my given name. It's rather common, but our ruleset would produce some rather wild variations of it. We thought it was an error, but it turned out to be a completely valid name in Finnish!
- asveikau 3 days ago
  
  I've always found IPA to be deeply confusing for English, because different accents have different historical vowel mergers, so I am never sure about vowels. And I think linguists aren't always sure about them either. IIRC, I saw a video by Geoff Lindsey suggesting Americans don't really have a /ʌ/ phoneme. Most people who have written about this write as if we do. (By the way, Dr. Lindsey's YouTube videos are some of the more interesting content I've found about English phonetics)
  For other languages I have exposure to, IPA seems to make more sense. Possibly I have a bias in that they're not my native language, so I can analyze them instead of internalizing them. But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.
  - thaumasiotes 3 days ago
    
    > For other languages I have exposure to, IPA seems to make more sense.
    > But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.
    The first and last of those are essentially guaranteed to be false.
    > Possibly I have a bias in that they're not my native language
    The more likely bias is that you just don't know very much about those other languages.
    
    asveikau 3 days ago
    
    You assume too much. I'm talking about languages I'm fluent in, can read and write, etc.
    You'd have to be insane to think that, for example, IPA for Spanish isn't easier than IPA for English vowels. In contrast to English, most Spanish regional pronunciations are about consonants. And the orthography is very regular. If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.
    
    thaumasiotes 3 days ago
    
    > You assume too much. I'm talking about languages I'm fluent in, can read and write, etc.
    None of that takes anything away from what I said. The more likely bias is that you don't know much about the languages you're referring to. Whether you can read or write them doesn't even speak to these questions.
    "Cleaner phonetics" doesn't have a meaning. And the idea that there's less regional variation in Spanish than English is not plausible.
    > If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.
    Do you usually notate that [d] and [t] in standard Spanish are generally dental rather than alveolar?
    IPA is almost never used with a concern for the phonetic accuracy of the symbols. It's almost always used to indicate phonemes. You can even read John Wells arguing vehemently that correct IPA for English should use the symbols "e" and "r" because those are familiar to people who use the English alphabet.
    
    asveikau 3 days ago
    
    > And the idea that there's less regional variation in Spanish than English is not plausible.
    I didn't say there was not variation. I said there was less of a specific type. Specifically there are not phonemic vowel variations as in English. In terms of consonants, some regions have extra consonant phonemes such as /θ/ or /λ/. Some have a tendency to omit some consonants in some positions. Every other difference is around which allophones get expressed preferentially.
    > Do you usually notate that [d] and [t] in standard Spanish are generally dental rather than alveolar?
    Typically I've seen it notated that way between [] but not between //. It's an articulation detail rather than phonemic.
    This is honestly a not very intelligent point, you're just doing trivia on Spanish phonetics now. I know these things too. They're not at all relevant to my point.
    
    thaumasiotes 3 days ago
    
    > I didn't say there was not variation. I said there was less of a specific type. Specifically there are not phonemic vowel variations as in English.
    Here are your exact words:
    >> But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.
    >> You'd have to be insane to think that, for example, IPA for Spanish isn't easier than IPA for English vowels.
    If you don't want to defend what you've said... don't just pretend that you said something completely different.
    I might also ask whether you're sure that the different varieties of Spanish actually exhibit less variation in their vowels than the varieties of English do, as opposed to the impact of this variation being muted by the much smaller count of vowel phonemes.
    As for this:
    > I've always found IPA to be deeply confusing for English, because different accents have different historical vowel mergers, so I am never sure about vowels.
    > If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.
    Those two claims directly conflict with each other. The second one is more correct, in its first half. If you know your target dialect, you can produce conventional IPA for any given word. The orthography of English usually makes this easier by preserving information about the historical pronunciation of the word. If what you want is to produce IPA for an English word without knowing the dialect it's going to be pronounced in, that's no more possible in Spanish than it is in English, and you've already noted this fact.
    > This is honestly a not very intelligent point, you're just doing trivia on Spanish phonetics now. I know these things too. They're not at all relevant to my point.
    What point? What do you think you're complaining about, if not trivia on English phonetics?
    Try articulating an actual problem with the use of IPA for English that, in your opinion, doesn't occur in every other language.
    > Specifically there are not phonemic vowel variations [in Spanish] as in English. In terms of consonants, some regions have extra consonant phonemes such as /θ/ or /λ/. Some have a tendency to omit some consonants in some positions. Every other difference is around which allophones get expressed preferentially.
    Going purely from Wikipedia...
    > For those areas of southeastern Spain where the deletion of final /s/ is complete, and where the distinction between singular and plural of nouns depends entirely on vowel quality, it has been argued that a set of phonemic splits has occurred, resulting in a system with eight vowel phonemes in place of the standard five.
    
    asveikau 2 days ago
    
    You are just arguing to argue, man. My words are NOT inconsistent, the inconsistency is you and you reading them in a combative fashion. You don't actually know what you are talking about and you project your own lack of knowledge onto me.
    Yes, in Andalucía the vowel that precedes an aspirated /s/ changes to a different allophone of the vowel. It doesn't cease being an allophone of the vowel. If you ask a speaker who aspirates or omits their /s/ they'd say there's an /s/ there. That's why the /s/ can fully re-emerge if there's a vowel after it. It's not a phonemic difference, it's more like the /s/ is difficult for them to articulate in that position and that fact sometimes bleeds into the vowel, similar to /r/ for UK speakers of English. I think most dialects of Spanish do something like this with /s/ in that position, it's just a lot more frequent in Andalucía or the Caribbean and a few others.
    I came close to mentioning this exact phenomenon but I didn't want to lengthen my comment on really "in the weeds" shit that isn't very relevant.
    There's also the fact that in northern Mexico, I've heard the allophones they select for vowels are pretty different from most of the rest of the Spanish speaking world. I didn't mention it because I already said ... No phonemic difference.
- lupire 3 days ago
  
  Yes, but stay aware that IPA is for pronunciations.
  A word doesn't have a unique pronunciation. A (Speaker, Word) pair has a pronunciation, and even those are not unique. A (Speaker, Word, Utterance) triple has a pronunciation.
  - jjtheblunt 3 days ago
    
    even a speaker with a specified word in a specified utterance will vary pronunciation for the context of who is listening (imitation of local accent).
    (we worked on all this in Motorola in 2001 extensively....then they dropped it)
  - Funes- 3 days ago
    
    >Yes, but stay aware that IPA is for pronunciations. A word doesn't have unique pronunciation.
    No. IPA encodes sounds based on various aspects of articulation. A word has unique phonemes (enclosed in forward slashes, //), but not necessarily unique sounds (allophones, enclosed in brackets, []).
- tokinonagare 3 days ago
  
  The issue is not really in the IPA but in how it is used. If you stay at the phonemic level, it makes more words comparable but hides distinctions that occur only in dialects. Also, for a lot of languages, there are multiple competing analyses of the set of phonemes involved. If you go down the phonetic rabbit hole, the notation quickly becomes really hard to read. If you have to handle multiple variations, there are also diaphonemes, but then it's even less standardized.
thaumasiotes 3 days ago

> The idea that "shore" and "sure" are pronounced "almost identically" would depend pretty heavily on your accent. The vowel is pretty different to me.
That's not the similarity the author is trying to point out. The idea is that the spelling is a lot more different than the pronunciation is, and that's true. The pronunciations are as similar as it's possible to be, measured by substitution count, without actually being identical. (You could use a measure of phonetic similarity, in which case e.g. fought and thought would be much more similar than fought and caught, but he's not doing that either.)
The pronunciation of sure comes from (1) the old, dead idea that the letter u should be pronounced /ju/ rather than /u/ (compare cure); and (2) the still vital English reduction of /sj/ to /ʃ/. Shore has to indicate the same sound in a radically different way, since it doesn't have and never had a medial /j/ to transform a bare s.
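
A minimal Python sketch of that comparison, counting edits over letters versus over phonemes. The phoneme lists are one assumed transcription (other accents will differ), and `edit_distance` is a plain Levenshtein distance, so on two equal-length sequences that differ in one position it is the same as the substitution count.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution, free on a match
            ))
        prev = curr
    return prev[-1]


# Assumed phoneme transcriptions, one symbol per list element.
SURE = ["ʃ", "ʊ", "r"]
SHORE = ["ʃ", "ɔ", "r"]

print(edit_distance("sure", "shore"))  # spelling
print(edit_distance(SURE, SHORE))      # pronunciation
```

The spellings are two edits apart (insert "h", change "u" to "o"), while the assumed pronunciations differ by a single vowel substitution, the smallest nonzero distance.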
- asveikau 3 days ago
  
  > the old, dead idea that the letter u should be pronounced /ju/ rather than /u/ (compare cure)
  Tell me your accent has yod dropping without telling me.

WarOnPrivacy 3 days ago

This short epilogue struck me.

    This past Yom Kippur, my wife and I drove two hours to spend the afternoon at my aunt’s house, with my cousins. As the night drew on, conversation roamed from television shows and books to politics and philosophy. The circle grew as we touched on increasingly sensitive and challenging topics, drawing us in.

    We didn’t agree, per se. We were engaging in debate as often as we were engaging in conversation. But we all love each other deeply, and the amount of care and restraint that went into how each person expressed their disagreement was palpable.

cess11 3 days ago

It's about someone who was using Levenshtein distance to fit phonetics against text and then learned about Soundex.

One way to start playing around with it is to put some stuff in a database: https://dev.mysql.com/doc/refman/8.4/en/string-functions.htm...

(or this module, https://www.postgresql.org/docs/current/fuzzystrmatch.html if you're stuck with PG)

ajuc 3 days ago

This is one of those cases where inheriting a hacked-together piece of crap (English spelling) makes a lot of additional work higher up.

Another example is poetry. A regex can find rhymes in Polish. Same postfix == it rhymes.

In English it's a feat of engineering.
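
A naive Python sketch of the "same suffix" idea. The vowel list, the choice to compare from the second-to-last vowel onward, and the names `rhyme_key` and `rhymes` are simplifying assumptions; it ignores digraphs and the role of "i" before another vowel, so treat it as an illustration rather than a real Polish rhyme detector.

```python
import re

VOWELS = "aąeęioóuy"
# From the second-to-last vowel (or the last one, in one-vowel words) to the end.
RHYME_TAIL = re.compile(rf"[{VOWELS}][^{VOWELS}]*[{VOWELS}]?[^{VOWELS}]*$")


def rhyme_key(word):
    """Return the spelling tail used as a rhyme key."""
    word = word.lower()
    match = RHYME_TAIL.search(word)
    return match.group(0) if match else word


def rhymes(a, b):
    """Two different words 'rhyme' if their spelling tails are identical."""
    return a.lower() != b.lower() and rhyme_key(a) == rhyme_key(b)


print(rhymes("kotek", "płotek"))  # Polish: shared tail "otek"
print(rhymes("cough", "though"))  # English: shared tail "ough"
```

Both calls return True, but only the first is a real rhyme: "cough" and "though" share their spelling tail and still do not rhyme, which is why the spelling-only approach breaks down for English.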

thechao 3 days ago

English orthography isn't really hacked together. Most of the "examples" I see people bandy about are because you're reading the wrong English: try Old English, instead. For example, knight: it was pronounced "k-ng-ee-h-tuh" (my IPA is too rusty to use). That's, like, precisely how it's spelled? What's gone wrong is that our modern pronunciation is poor.
Other languages have this even worse. Try comparing Egyptian Colloquial Arabic vs literary Arabic. I mean... these are different languages. Or, for instance, American Sign Language (ASL) vs. written English: the former is more like Chinese than English.

wavemode 3 days ago

It's really just a feat of data collection (e.g. rhymezone.com). You just compile all English words and record which ones rhyme with which.
(Yeah it's labor-intensive, but probably not moreso than, say, writing a dictionary.)
- williamdclt 3 days ago
  
  > You just compile all English words and record which ones rhyme with which
  I suppose, if we ignore accents and heteronyms... both of which English is famous for, unfortunately!
  - nyrikki 3 days ago
    
    Shakespeare in RP loses most of the raunchy jokes as an example of the above.
    My high school English teacher was horrified when she figured out why us boys were laughing when reading her copy of the First Folio; our hick accent meant we were getting some of the jokes she didn't even notice.
    "Theme" rhyming with "sixteen" in the Cranberries' song Zombie is another.
    
    rhdunn 3 days ago
    
    There was an effort to revive the Shakespearean pronunciation by David Crystal. You can see YouTube videos of David and his son Ben talking about it and reciting parts of Shakespeare to highlight the jokes and word play.

arunc 3 days ago

I created this sheet[0] to teach my kid Tamil using Roman letters and in the process figured it could be useful for kids learning other Indian languages as well.

With the history of reading and speaking (Indian) phonetic languages, I think, English would've been much nicer and uniform if the vowels sounded right, esp the long forms.

Extending the long forms using orthogonal vowels probably made it complex, especially with the lack of ii and uu.

Say for instance, to extend the long form of "o", "a" was used. Eg: boat, goat. The correct spelling could've been boot, with the original boot spelled as buut.

With that notion, door is probably the only word that's written and pronounced phonetically correct, with two oo.

Curious to know how such a correct phonetic transcription would aid in the encoding, matching and compression.

[0] https://docs.google.com/spreadsheets/d/15hdVh-oBUngTyigqDdjg...

smoores 3 days ago

Oh, hello, I didn't realize this was shared here! I guess let me know if anyone has any questions. I mostly wrote this piece as part of processing some hard feelings I've been having and seeing shared among Jewish folks around me, but I also ended up learning quite a bit about phonetic encoding algorithms, and I've spent several years at this point steeped in forced alignment via Storyteller.

arunc 3 days ago

Just wanted to share this. Be strong. The only thing that shall die in this world is hate! There is and will always be hope!

msgerbush 3 days ago

I'm using a library, stable-ts, for a similar issue with short audio clips and it works well: https://github.com/jianfch/stable-ts/tree/main

Not sure how it will perform on something long like an audiobook.

Der_Einzige 3 days ago

Highly related to my paper on why tokenization in LLMs is the devil: https://paperswithcode.com/paper/most-language-models-can-be...

qrian 2 days ago

I also had to do this in my previous work: I took the phonetic embeddings of the reference and transcribed text and ran dynamic time warping on them.
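
A minimal sketch of dynamic time warping over two embedding sequences, in plain Python. The 2-D vectors below are made-up placeholders standing in for phonetic embeddings, and Euclidean distance as the local cost is an assumption; the setup described above may have used something else. Both sequences are assumed to be non-empty.

```python
import math


def dtw(ref, hyp, cost=math.dist):
    """Return (total cost, alignment path) between two non-empty sequences."""
    n, m = len(ref), len(hyp)
    inf = float("inf")
    # table[i][j]: cheapest cost of aligning ref[:i] with hyp[:j].
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(ref[i - 1], hyp[j - 1])
            table[i][j] = step + min(
                table[i - 1][j - 1],  # advance both sequences
                table[i - 1][j],      # advance ref only
                table[i][j - 1],      # advance hyp only
            )
    # Walk back from the end, always taking the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (table[i - 1][j - 1], i - 1, j - 1),
            (table[i - 1][j], i - 1, j),
            (table[i][j - 1], i, j - 1),
        )
    path.reverse()
    return table[n][m], path


# Placeholder "embeddings": the transcript repeats the first frame once.
reference = [(0, 0), (1, 0), (1, 1)]
transcribed = [(0, 0), (0, 0), (1, 0), (1, 1)]
print(dtw(reference, transcribed))
```

The returned path pairs each reference index with the transcribed indices it aligns to; here the first reference frame absorbs both repeated frames and the total cost is zero.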

willwade 3 days ago

I'm intrigued. Is this not done just with a phonemizer?

    from phonemizer.phonemize import phonemize

    text = "hello world"
    # Phonemize the same text with several espeak English voices.
    variations = [
        phonemize(text, backend="espeak", language="en-us", strip=True),
        phonemize(text, backend="espeak", language="en-gb", strip=True),
        phonemize(text, backend="espeak", language="en-au", strip=True),
    ]

    for i, variation in enumerate(variations, 1):
        print(f"Variation {i}: {variation}")

I mean, espeak isn't the best, but a lot of folks in the ASR/speech world are still using this, right?

(NB: If you are on iOS, check out the inbuilt one: Settings -> Accessibility -> Spoken Content -> Pronunciations. When you add one, it can phonemize your spoken message to IPA. If someone can tell me where the SDK/API they use for that is, I'd love to know.)

rahimnathwani 3 days ago

It seems like Beider-Morse outputs more variations of each word, which I guess means fewer false negatives, and using only equality tests?
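
A small Python sketch of what "only equality tests" could look like: index every word under each of its codes, then treat two words as candidates if they share at least one code. The codes below are made-up placeholders, NOT real Beider-Morse output, and `candidates` is a hypothetical helper name.

```python
from collections import defaultdict

# Placeholder codes for illustration only; not real Beider-Morse encodings.
ENCODINGS = {
    "sure": {"c1", "c2", "c3"},
    "shore": {"c2", "c4"},
    "share": {"c5"},
}

# Inverted index: code -> every word that produced it.
index = defaultdict(set)
for word, codes in ENCODINGS.items():
    for code in codes:
        index[code].add(word)


def candidates(word):
    """Words sharing at least one code with `word` (exact lookups only)."""
    found = set()
    for code in ENCODINGS[word]:
        found |= index[code]  # a dictionary lookup, i.e. an equality test
    found.discard(word)
    return found


print(candidates("sure"))
print(candidates("share"))
```

More codes per word means more chances for two sets to overlap, so presumably fewer false negatives, at the price of more false positives to filter out afterwards.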