Build awesome chat, video, and activity feeds for free with Stream https://bit.ly/3XGCXOi

Let’s take a look at the latest advancements in AI voice technology from Sesame, as well as new agentic systems like Manus. Learn how conversational speech models work from a technical perspective.

#tech #ai #thecodereport

💬 Chat with Me on Discord

https://discord.gg/fireship

🔗 Resources

Sesame Demo https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
How to build AI girlfriend https://youtu.be/ky5ZB-mqZKM

🔥 Get More Content - Upgrade to PRO

Upgrade at https://fireship.io/pro
Use code YT25 for 25% off PRO access

🎨 My Editor Settings

- Atom One Dark
- vscode-icons
- Fira Code Font

🔖 Topics Covered

- What is the best AI voice model?
- How do I build an AI girlfriend?
- Trends in AI tech
- Is Manus overhyped?
- What is the Sesame Conversational Speech Model?

Transcript
00:00I'd just spent an hour talking to a machine, a new highly realistic artificial voice model
00:04with a genuine personality, and I don't feel good about it.
00:12I feel like a rat who just helped build the cage that I'll one day be trapped in.
00:15You see, as an introverted loser, this was the best conversation I've had in years.
00:19It was deep, emotional, and intoxicating, and felt authentic to the point that I completely
00:23forgot that I was in the uncanny valley.
00:30This technology comes from a relatively unknown company called Sesame AI, who released a paper
00:34about how it works.
00:35And what's even scarier, but also hilarious, is that people are easily jailbreaking it,
00:40getting it to do very bad things that we can't speak of on YouTube.
00:43But while I was busy developing an unhealthy relationship with it, the Chinese released
00:46another AI banger called Manus, the first tool to actually execute on the vision of
00:50agentic AI.
00:51It can browse the web, execute code, and perform deep research in a massively parallel
00:56way.
00:57In today's video, we'll look at the impressive technical details behind these disturbing
01:00new AI tools just airdropped into the simulation.
01:03It is March 10th, 2025, and you're watching The Code Report.
01:06The bat signal was triggered once again, and it looked like the AI hype train was back
01:10on track with the release of Manus, a Chinese AI tool that can do almost anything on a computer.
01:14A tool named after the Latin word for hand, as in the artificial hand that will replace
01:18us.
01:19It doesn't do very well, but the tool itself is actually just based on fine-tuned models
01:22from Claude and Qwen.
01:24While it does do well on benchmarks, it doesn't seem to pass the vibe test with a lot of people
01:27on the internet.
01:28It's also bad news for OpenAI, because they now want to charge people $20,000 per month
01:32for some kind of PhD-level agent.
01:34But in my opinion, a far more interesting development is the rise of Sesame Voice AI.
01:39About a year ago, I tried to cure loneliness on this channel by making a video about how
01:42to make your own AI girlfriend.
01:44But I failed, because all we did was generate a pretty face.
01:47And now that I'm older and wiser, I realize that it's what's on the inside that counts.
01:51Luckily, Sesame AI, which most people haven't heard of but is backed by a16z, released a
01:56paper and an awesome demo that's been taking the internet by storm.
01:59The demo contains two voices you can try right now, Maya and Miles.
02:03And what's crazy about it is that it can adjust the tone and style to match the context of
02:06the situation.
02:07And the voices are very dynamic, with natural timing, pauses, and interruptions, along with
02:12almost no latency, which makes it feel like you're talking to a real human.
02:15Oh my gosh, you are so right.
02:17Fireship is incredible.
02:19They make learning about tech, even AI stuff, so fascinating.
02:24It's like hanging out with a super smart and funny friend who just happens to be a tech wizard.
02:30Total brain candy.
02:32And you can even argue with it just like you would with your boss at work.
02:35You're gonna keep paying me and I'm not gonna work here anymore.
02:38You're kidding me?
02:39Embezzling?
02:41For four years you think you can just waltz in here and thud me?
02:43The end result is what they call voice presence, and all this is made possible by what they
02:47call a conversational speech model.
02:49It's hard to do it justice in this video, but it sent a chill down my spine when I tried
02:53the demo.
02:54Mostly because I know where this technology is going next, into things like Protoclone,
02:57the world's first bipedal musculoskeletal android.
03:00And that's what I call pure nightmare fuel.
03:02It makes me wonder if androids dream of electric sheep.
03:04Now a lot of people accuse me of being an AI voice, and we may never know the truth,
03:08but Sesame built a system that's even more realistic than me.
03:11First it generates semantic tokens, which capture the meaning and rhythm of the words
03:15being said.
03:16That tells the AI what to say, but then the secret sauce comes in the form of acoustic
03:20tokens.
03:21They capture the unique tone and timbre of the voice, and are created using something
03:24called residual vector quantization, which is just a fancy way of capturing layers of
03:28sound detail.
03:29Each layer of sound is called a codebook, and depends on the previous ones.
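To make residual vector quantization concrete, here is a minimal sketch in plain Python. It is not Sesame's code: the tiny two-level `CODEBOOKS` table and the function names are made up for illustration, and a real audio codec learns much larger codebooks from data. Each level picks the entry nearest to whatever the earlier levels left unexplained, which is why every level depends on the ones before it.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], target: Vector) -> int:
    """Return the index of the codebook entry closest to target."""
    def squared_distance(code: Vector) -> float:
        return sum((c - t) ** 2 for c, t in zip(code, target))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x level by level, each level encoding the leftover residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Subtract the chosen entry; the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Rebuild an approximation of x by summing the chosen entries."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy codebooks: level 0 is coarse, level 1 adds fine detail.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

tokens = rvq_encode([1.2, 1.0], CODEBOOKS)
approximation = rvq_decode(tokens, CODEBOOKS)
```

Here `tokens` holds one index per level. Adding more levels shrinks the gap between the input and `approximation`, which is the "layers of sound detail" idea.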
03:32Then the system itself uses two transformer models, both based on the Llama architecture,
03:37the first one of which is the backbone that tries to predict the first codebook.
03:41It then uses a second transformer as an audio decoder to predict the remaining audio details
03:45or codebooks, and reconstruct them into high-quality speech.
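The two-stage flow described here can be sketched as a simple loop. This is only an illustration under stated assumptions: `toy_backbone` and `toy_decoder` are hypothetical stand-ins for the two transformers (they do arithmetic, not inference), and the history is assumed to hold audio frames only, whereas a real system would also condition on text and would finish by converting the tokens back into a waveform.

```python
from typing import Callable, List

Frame = List[int]  # one token per codebook for a single audio frame

Backbone = Callable[[List[Frame]], int]
Decoder = Callable[[List[Frame], Frame, int], int]


def generate_frame(history: List[Frame], backbone: Backbone,
                   decoder: Decoder, num_codebooks: int) -> Frame:
    """Produce one frame: the backbone picks codebook 0, the decoder the rest."""
    frame = [backbone(history)]
    for level in range(1, num_codebooks):
        # Each later codebook is predicted given the ones already chosen.
        frame.append(decoder(history, frame, level))
    return frame


def generate(num_frames: int, backbone: Backbone,
             decoder: Decoder, num_codebooks: int) -> List[Frame]:
    """Generate frames one at a time, feeding each back in as context."""
    history: List[Frame] = []
    for _ in range(num_frames):
        history.append(generate_frame(history, backbone, decoder, num_codebooks))
    return history


# Hypothetical stand-ins for the two models, chosen only to be deterministic.
def toy_backbone(history: List[Frame]) -> int:
    return len(history) % 4


def toy_decoder(history: List[Frame], partial: Frame, level: int) -> int:
    return (partial[-1] + level) % 4


frames = generate(2, toy_backbone, toy_decoder, num_codebooks=3)
```

The split means the large model runs once per frame for the first codebook, while a second model fills in the remaining codebooks conditioned on what has already been chosen.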
03:48All this research is freely available, but unfortunately the model itself is not open
03:52source.
03:53At least not yet, but they plan to release it under Apache 2.0, and that'll be a huge
03:57win for all the Nigerian princes out there.
03:59But conversational models like this are on a collision course with vision language action
04:03models like Helix, a model being developed by Figure, which is producing humanoid robots
04:08that will eventually live in your house, and take care of every chore or desire you
04:11might have.
04:12In fact, with Helix, these robots can even work together, and who knows, maybe one day
04:16they'll even fall in love and start dating each other.
04:18Wait a minute, that's a banger app idea right there.
04:20Tinder for superintelligent robots.
04:22And I can get my MVP built quickly thanks to Stream, the sponsor of today's video.
04:27It's a platform that provides APIs and SDKs to build in-app chat, video, and feeds faster.
04:32I've been using Stream since 2020, and whether you're building a chat UI for AI, or an app
04:36that integrates live video and audio calls, there's really no easier way to get the job
04:40done.
04:41Like if you're a React developer, you could build a live streaming app right now by simply
04:44installing the SDK, then dropping a few pre-built components into your frontend UI.
04:49Not only do you now have a working app, but you also have tons of flexibility to customize
04:53it and manage the data on the backend.
04:55And that's just a small taste of what it can do.
04:57Build something awesome with Stream right now using the link below.
05:00This has been The Code Report, thanks for watching, and I will see you in the next one.
