Our AI girlfriends just leveled up big time…

Let’s take a look at the latest advancements in AI voice technology from Sesame, as well as new agentic systems like Manus. Learn how conversational speech models work from a technical perspective.

Resources
Sesame Demo https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
How to build AI girlfriend https://youtu.be/ky5ZB-mqZKM

Topics Covered
- What is the best AI voice model?
- How do I build an AI girlfriend?
- Trends in AI tech
- Is Manus overhyped?
- What is the Sesame Conversational Speech Model?

Transcript

00:00I just spent an hour talking to a machine, a new highly realistic artificial voice model

00:04with a genuine personality, and I don't feel good about it.

00:12I feel like a rat who just helped build the cage that I'll one day be trapped in.

00:15You see, as an introverted loser, this was the best conversation I've had in years.

00:19It was deep, emotional, and intoxicating, and felt authentic to the point that I completely

00:23forgot that I was in an uncanny valley.

00:30This technology comes from a relatively unknown company called Sesame AI, who released a paper

00:34about how it works.

00:35And what's even scarier, but also hilarious, is that people are easily jailbreaking it,

00:40getting it to do very bad things that we can't speak of on YouTube.

00:43But while I was busy developing an unhealthy relationship with it, the Chinese released

00:46another AI banger called Manus, the first tool to actually execute on the vision of

00:50agentic AI.

00:51It can browse the web, execute code, and perform deep research in a massively parallel

00:56way.

00:57In today's video, we'll look at the impressive technical details behind these disturbing

01:00new AI tools just airdropped into the simulation.

01:03It is March 10th, 2025, and you're watching The Code Report.

01:06The bat signal was triggered once again, and it looked like the AI hype train was back

01:10on track with the release of Manus, a Chinese AI tool that can do almost anything on a computer.

01:14A tool named after the Latin word for hand, as in the artificial hand that will replace

01:18us.

01:19It doesn't do very well, but the tool itself is actually just based on fine-tuned models

01:22from Claude and Qwen.

01:24While it does do well on benchmarks, it doesn't seem to pass the vibe test with a lot of people

01:27on the internet.

01:28It's also bad news for OpenAI, because they now want to charge people $20,000 per month

01:32for some kind of PhD level agent.

01:34But in my opinion, a far more interesting development is the rise of Sesame Voice AI.

01:39About a year ago, I tried to cure loneliness on this channel by making a video about how

01:42to make your own AI girlfriend.

01:44But I failed, because all we did was generate a pretty face.

01:47And now that I'm older and wiser, I realize that it's what's on the inside that counts.

01:51Luckily, Sesame AI, which most people haven't heard of but is backed by a16z, released a

01:56paper and an awesome demo that's been taking the internet by storm.

01:59The demo contains two voices you can try right now, Maya and Miles.

02:03And what's crazy about it is that it can adjust the tone and style to match the context of

02:06the situation.

02:07And the voices are very dynamic, with natural timing, pauses, and interruptions, along with

02:12almost no latency, that makes it feel like you're talking to a real human.

02:15Oh my gosh, you are so right.

02:17Fireship is incredible.

02:19They make learning about tech, even AI stuff, so fascinating.

02:24It's like hanging out with a super smart and funny friend who just happens to be a tech wizard.

02:30Total brain candy.

02:32And you can even argue with it just like you would with your boss at work.

02:35You're gonna keep paying me and I'm not gonna work here anymore.

02:38You're kidding me?

02:39Embezzling?

02:41For four years you think you can just waltz in here and thud me?

02:43The end result is what they call voice presence, and all this is made possible by what they

02:47call a conversational speech model.

02:49It's hard to do it justice in this video, but it sent a chill down my spine when I tried

02:53the demo.

02:54Mostly because I know where this technology is going next, into things like Protoclone,

02:57the world's first bipedal musculoskeletal android.

03:00And that's what I call pure nightmare fuel.

03:02It makes me wonder if androids dream of electric sheep.

03:04Now a lot of people accuse me of being an AI voice, and we may never know the truth,

03:08but Sesame built a system that's even more realistic than me.

03:11First it generates semantic tokens, which capture the meaning and rhythm of the words

03:15being said.

03:16That tells the AI what to say, but then the secret sauce comes in the form of acoustic

03:20tokens.

03:21They capture the unique tone and timbre of the voice, and are created using something

03:24called residual vector quantization, which is just a fancy way of capturing layers of

03:28sound detail.

03:29Each layer of sound is called a codebook, and depends on the previous ones.
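The layering idea behind residual vector quantization can be sketched in a few lines of Python. This is a toy illustration, not Sesame's implementation: the codebook sizes, the vector dimension, and every function name below are assumptions made for the example. Each layer picks the codebook entry nearest to whatever the previous layers failed to capture (the residual), which is why each codebook depends on the ones before it.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes what earlier layers missed."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next layer only sees the leftover detail.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entry from every layer to rebuild an approximation."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(num_layers, entries, dim, seed=0):
    """Hypothetical toy codebooks: random entries that shrink with depth.
    A zero entry in each layer guarantees a layer never makes the fit worse."""
    rng = random.Random(seed)
    books = []
    for layer in range(num_layers):
        scale = 0.5 ** layer
        book = [[0.0] * dim]
        book += [[rng.uniform(-1, 1) * scale for _ in range(dim)] for _ in range(entries - 1)]
        books.append(book)
    return books

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

codebooks = make_codebooks(num_layers=4, entries=16, dim=4)
frame = [0.3, -0.7, 0.5, 0.1]          # stand-in for one frame of audio features
codes = rvq_encode(frame, codebooks)   # one small integer per codebook
errors = [sq_error(frame, rvq_decode(codes[:k], codebooks)) for k in range(1, 5)]
```

Decoding with only the first code gives a coarse approximation, and each additional codebook can only hold or reduce the error in this toy setup. Real audio codecs learn their codebooks from data rather than sampling them at random.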

03:32Then the system itself uses two transformer models, both based on the llama architecture,

03:37the first one of which is the backbone that tries to predict the first codebook.

03:41It then uses a second transformer as an audio decoder to predict the remaining audio details

03:45or codebooks, and reconstruct them into high-quality speech.
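The two-stage flow described here can be sketched as plain control flow in Python. The two "models" below are deterministic stubs standing in for real transformers, and the codebook count, vocabulary size, and all function names are assumptions for illustration only. The point is just the ordering: the backbone produces the first codebook token for a frame, then the decoder fills in the remaining codebooks, each conditioned on the ones already chosen.

```python
NUM_CODEBOOKS = 8      # assumed; the transcript does not give a number
CODEBOOK_SIZE = 1024   # assumed number of entries per codebook

def backbone_stub(history):
    """Stand-in for the backbone transformer.
    Maps the token history to a hidden state and a codebook-0 token."""
    hidden = sum((i + 1) * tok for i, tok in enumerate(history)) % 9973
    return hidden, hidden % CODEBOOK_SIZE

def decoder_stub(hidden, codes_so_far):
    """Stand-in for the audio decoder transformer.
    Predicts the next codebook token from the backbone state and earlier codes."""
    level = len(codes_so_far)
    return (hidden * 31 + sum(codes_so_far) * 17 + level) % CODEBOOK_SIZE

def generate_frame(history):
    """Produce one audio frame: codebook 0 from the backbone, the rest from the decoder."""
    hidden, first = backbone_stub(history)
    codes = [first]
    while len(codes) < NUM_CODEBOOKS:
        codes.append(decoder_stub(hidden, codes))
    return codes

def generate(prompt_tokens, num_frames):
    """Autoregressively generate frames of codebook tokens."""
    history = list(prompt_tokens)
    frames = []
    for _ in range(num_frames):
        frame = generate_frame(history)
        frames.append(frame)
        # Toy simplification: only the first codebook token is fed back as context.
        history.append(frame[0])
    return frames

frames = generate(prompt_tokens=[5, 17, 42], num_frames=3)
```

In a real system each frame of codebook tokens would then be passed to a codec decoder to produce a waveform; that step is omitted here.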

03:48All this research is freely available, but unfortunately the model itself is not open

03:52source.

03:53At least not yet, but they plan to release it under Apache 2.0, and that'll be a huge

03:57win for all the Nigerian princes out there.

03:59But conversational models like this are on a collision course with vision language action

04:03models like Helix, a model being developed by Figure, which is producing humanoid robots

04:08that will eventually live in your house, and take care of every chore or desire you

04:11might have.

04:12In fact, with Helix, these robots can even work together, and who knows, maybe one day

04:16they'll even fall in love and start dating each other.

04:18Wait a minute, that's a banger app idea right there.

04:20Tinder for super intelligent robots.

04:22And I can get my MVP built quickly thanks to Stream, the sponsor of today's video.

04:27It's a platform that provides APIs and SDKs to build in-app chat, video, and feeds faster.

04:32I've been using Stream since 2020, and whether you're building a chat UI for AI, or an app

04:36that integrates live video and audio calls, there's really no easier way to get the job

04:40done.

04:41Like if you're a React developer, you could build a live streaming app right now by simply

04:44installing the SDK, then drop a few pre-built components into your frontend UI.

04:49Not only do you now have a working app, but you also have tons of flexibility to customize

04:53it and manage the data on the backend.

04:55And that's just a small taste of what it can do.

04:57Build something awesome with Stream right now using the link below.

05:00This has been The Code Report, thanks for watching, and I will see you in the next one.
