Is anyone aware of techniques or research on this?

  • lanigerous@feddit.uk · 1 year ago

    Personally I don’t know of any specific research, but afaik the reason our voices sound terrible to us on recordings is that they’re missing the bass transmitted through the skull, i.e. recordings sound higher pitched than what we hear ourselves. So I’d try boosting the amplitude of the lower frequencies with an EQ; it’ll take a bit of fiddling to get the balance right.
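If anyone wants to experiment with that low-frequency boost, here’s a minimal sketch using the well-known RBJ “Audio EQ Cookbook” low-shelf biquad. The 150 Hz corner and +6 dB gain are just starting guesses to fiddle with, not values from any research:

```python
import numpy as np
from scipy.signal import lfilter

def low_shelf(x, fs, f0=150.0, gain_db=6.0, S=1.0):
    """RBJ 'Audio EQ Cookbook' low-shelf biquad: boost below f0 by gain_db."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / 2 * np.sqrt((A + 1 / A) * (1 / S - 1) + 2)
    cosw = np.cos(w0)
    b0 = A * ((A + 1) - (A - 1) * cosw + 2 * np.sqrt(A) * alpha)
    b1 = 2 * A * ((A - 1) - (A + 1) * cosw)
    b2 = A * ((A + 1) - (A - 1) * cosw - 2 * np.sqrt(A) * alpha)
    a0 = (A + 1) + (A - 1) * cosw + 2 * np.sqrt(A) * alpha
    a1 = -2 * ((A - 1) + (A + 1) * cosw)
    a2 = (A + 1) + (A - 1) * cosw - 2 * np.sqrt(A) * alpha
    # Normalize so the denominator's leading coefficient is 1
    b = np.array([b0, b1, b2]) / a0
    a = np.array([1.0, a1 / a0, a2 / a0])
    return lfilter(b, a, x)
```

Frequencies well below the corner come out roughly 6 dB louder, while the rest of the spectrum passes through unchanged.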

  • Tigwyk@lemmy.vrchat-dev.tech · 1 year ago

    I’ve found in my experience that over the years my internal voice updated to match how I sound when recorded. So when I hear myself speaking on a recording it’s much less jarring now since it feels much more like how I predicted it would sound.

    I know this doesn’t help your current situation, but it’s a fun effect I noticed, since you mentioned it.

  • Bassman1805@lemmy.world · 1 year ago

    A degree in audio engineering.

    It’s possible and there has been plenty of research done into it, but it’s a hard problem and not just a question of slapping an off-the-shelf EQ on it.

    • Bizarroland@kbin.social · 1 year ago

      I think you could probably get a close approximation like this:

      • Close-mic the recording, then apply a de-reverb filter (e.g. Adobe Audition’s DeReverb) to dry it out.
      • Roll off the highs, probably above 6–8 kHz or so, by 12–18 dB.
      • Boost the fundamental frequency and first overtone of the speaking voice by 6 dB or so. A rough guess would be around 120 Hz and 240 Hz for men and 200 Hz and 400 Hz for women; use the spectral graph to find them for sure.
      • Put a compressor on the whole thing to keep the difference between the loudest and quietest parts within about 70 dB.
      • Finally, add a very light reverb with a lot of damping; almost so little that you can’t tell reverb is applied unless you A/B the sound.

      Things like that should account for the difference in bone conduction, the difference in how a microphone picks up your voice and the rest of the room, etc., but it will probably need a lot of tweaking for your specific application to sound exactly right.
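The EQ part of that chain can be sketched with an RBJ peaking biquad plus a gentle Butterworth roll-off. The 120 Hz fundamental, +6 dB boosts, and 6 kHz cutoff are the commenter’s guesses, and `skull_ish` is a hypothetical name for the combined chain:

```python
import numpy as np
from scipy.signal import lfilter, butter

def peaking_eq(x, fs, f0, gain_db, Q=2.0):
    """RBJ peaking biquad: boost (or cut) a narrow band around f0 by gain_db."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * Q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a0 = 1 + alpha / A
    a = np.array([1.0, -2 * np.cos(w0) / a0, (1 - alpha / A) / a0])
    return lfilter(b / a0, a, x)

def skull_ish(x, fs, f1=120.0):
    """Guess at the comment's EQ chain: boost the fundamental and first
    overtone ~6 dB, then roll off the top end above ~6 kHz."""
    y = peaking_eq(x, fs, f1, 6.0)
    y = peaking_eq(y, fs, 2 * f1, 6.0)
    b, a = butter(2, 6000 / (fs / 2))  # 2nd-order low-pass, gentle roll-off
    return lfilter(b, a, y)
```

The compressor and light reverb would still need to be added separately; the de-reverb step has no simple off-the-shelf equivalent here.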

  • ArchTemperedKoala@lemmy.world · 1 year ago

    My guess is there’s a subjective component that fixed, one-size-fits-all changes can’t capture; each person would have to make the adjustments themselves to match what they hear.

  • j4k3@lemmy.world (OP) · 1 year ago

    So to clarify: I’m really curious about speech comprehension when a text-to-speech AI is trained to sound like ourselves (myself).

    Over the last couple of weeks I have been working on learning about and using offline AI. I don’t know why, but I decided to ask here first.

    If anyone is curious, this is what my current favorite LLM has to say (Llama 2 70B Q4 GGML):

    You

    what does it take to tune a recording of my voice and match it to how I hear my own voice?

    Assistant

    To tune your recorded voice so that it matches the way you perceive your own voice, there are several steps involved in audio processing. Here’s an overview of the process:

    • Recording: Firstly, ensure that you have a high-quality recording of your voice using proper equipment such as a microphone and soundproofing to minimize background noise. This will provide a good starting point for further editing.

    • Equalization (EQ): Use an equalizer to adjust the frequency balance in your recorded audio. You can boost or cut specific frequencies that are too prominent or lacking, respectively. For example, if you feel like your voice sounds nasal, you may want to reduce some of the mid-range frequencies around 1 kHz and above.

    • Compression: Apply compression to even out the dynamic range in your recording. This will help balance the loudness between different parts of your speech or singing, making it sound more consistent.

    • Reverb/Delay: Add a small amount of reverb or delay to create a sense of space and depth around your voice. This can make it feel like you’re hearing yourself in a room rather than through headphones or speakers.

    • Stereo Imaging: Use stereo imaging tools to adjust the width and positioning of your voice within the mix. This will help create an illusion that your voice is coming from a specific location, making it sound more natural.

    • Monitoring: Finally, listen back to your processed audio on different devices such as headphones or speakers in various environments (e.g., quiet room vs noisy cafe) to ensure that the changes you’ve made translate well across different listening scenarios. Make further adjustments if necessary until you are satisfied with how it sounds.
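The compression step in the list above can be sketched as a bare-bones feed-forward compressor. The threshold, ratio, and window length here are arbitrary starting points, not recommendations from the answer:

```python
import numpy as np

def compress(x, fs, threshold_db=-20.0, ratio=4.0, win_ms=10.0):
    """Very simple feed-forward compressor: where the signal level exceeds
    the threshold, the excess is reduced by `ratio`. The level estimate is
    a short moving-RMS envelope."""
    win = max(1, int(fs * win_ms / 1000))
    env = np.sqrt(np.convolve(x ** 2, np.ones(win) / win, mode="same"))
    level_db = 20 * np.log10(np.maximum(env, 1e-9))
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1 - 1 / ratio)  # gain reduction above threshold only
    return x * 10 ** (gain_db / 20)
```

Loud passages are pulled down toward the threshold while quiet ones pass through untouched, which is what evens out the dynamic range.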

    Remember, everyone perceives their own voice differently due to bone conduction and other factors, so there may be some trial-and-error involved in finding a sound that feels right for you. It’s also important to keep in mind that the goal is not necessarily to make your recorded voice identical to how it sounds when speaking but rather to create an enjoyable listening experience for others who will hear it through speakers or headphones.

    You

    Is there any formulaic approximation or rule regarding bone conduction and how humans perceive their own voice?

    Assistant

    Bone conduction is the process by which sound vibrations are transmitted directly to the inner ear via the skull, bypassing the outer and middle ears. This occurs when we speak or sing, causing us to hear our voices differently than others do because of this direct transmission pathway.

    There isn’t a specific formulaic approximation or rule that can accurately predict how each individual perceives their own voice through bone conduction since it varies from person to person due to factors such as head shape and size, vocal tract resonance, and the acoustics of the surrounding environment. However, in general, people tend to hear their voices as deeper or richer when speaking because low-frequency sounds are more efficiently transmitted through bone conduction than high frequencies.

    It’s important to note that while understanding how we perceive our own voice can be helpful for tuning a recording of it, the goal should ultimately be to create an enjoyable listening experience for others who will hear your recorded voice through speakers or headphones rather than trying to perfectly replicate what you hear when speaking.
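Since, as the assistant says, no exact formula exists, one toy stand-in for the low-frequency emphasis of bone conduction is a smooth bass tilt applied in the frequency domain. The 500 Hz corner and +6 dB boost are pure guesses for illustration:

```python
import numpy as np

def bass_tilt(x, fs, corner=500.0, boost_db=6.0):
    """Hand-wavy stand-in for bone-conduction emphasis: boost energy below
    `corner` Hz by up to `boost_db`, fading smoothly to 0 dB above it.
    The numbers are illustrative guesses, not measured values."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / fs)
    gain_db = boost_db / (1 + (f / corner) ** 2)  # smooth low-frequency shelf
    return np.fft.irfft(X * 10 ** (gain_db / 20), n=len(x))
```

Listening back while sweeping the corner frequency and boost amount would be the trial-and-error process the assistant describes.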

  • twistedtxb@lemmy.ca · 1 year ago

    Please God no. I cannot fathom how people manage to hear my real voice.

    I sound so much better inside my skull

    • Tetra@kbin.social · 1 year ago

      I mean isn’t that what OP is trying to figure out? A way for people to hear how you sound inside your skull.

  • ReallyKinda@kbin.social · 1 year ago

    A very simple approximation of your voice as it’s heard by someone facing you when you speak would be to use a unidirectional mic and record yourself with the mic pointing the opposite direction from how it normally would be (in other words, with its orientation reversed).

    A slightly better approximation would be doing the same thing but with two unidirectional mics pointed at slight angles (orientation still reversed) to simulate the placement of your ears.

    Obviously the quality of the mics factors in as well; you’d want mics with a flat frequency response. To get even pickier, you’d also want to listen back on headphones or speakers with a flat frequency response. Once you had the recording, you could even take impulse responses of particular rooms and process the audio with them to get an idea of how you sound to others in specific rooms!
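That last idea, processing a dry recording with a room impulse response, is just a convolution. Here is a small sketch; the impulse response below is made up for illustration (a real one would be measured in an actual room):

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_room(dry, ir):
    """Convolve a dry recording with a room impulse response, then normalize
    so the wet signal peaks at the same level as the dry one."""
    wet = fftconvolve(dry, ir, mode="full")
    return wet * (np.max(np.abs(dry)) / np.max(np.abs(wet)))

# Toy IR: direct sound plus two decaying early reflections. A measured
# impulse response of a real room would replace this array.
fs = 16000
ir = np.zeros(fs // 4)
ir[0], ir[800], ir[2400] = 1.0, 0.4, 0.15
```

With a library of measured impulse responses you could audition how the “reversed-mic” recording would sound in different spaces.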