Recently, Andrej Karpathy argued that the ideal interaction pattern for AI models is voice in, visuals out:

"Audio is the human-preferred input to AIs, but vision is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into brain."

The claim is that while "text in, markdown out" is how most people use LLMs today, what we should be building toward is a Jarvis-like mode where we primarily speak to AI, and it primarily responds with UI, video, or other visuals.

Let's check in on where we're at for both halves of this claim: visuals as output, and voice as input.

Visuals Out

Humans love looking at things! While it can be convenient to listen to our computers speak, waiting through a voice response feels kinda… ugh. You can increase the speaking rate, but fundamentally, the fastest way for a computer to give humans information is to display it. We're faster at…