It doesn’t get encoded into plaintext. First, the microphone picks up the sound and outputs a continuously varying electrical signal. An analog-to-digital converter measures that signal thousands of times per second, and recording software takes those values and encodes (and usually compresses) them into binary data. Then that binary data is saved onto storage. Depending on your storage, it’s stored magnetically (tape, floppy, HDD), as trapped electrical charge in flash memory cells (USB, SSD), or as microscopic pits and marks that a laser reads back (CD/DVD).
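If it helps to see the "values become binary data, binary data lands on storage" step, here is a rough sketch using only Python's standard library. The sample values are made up for illustration, the 8000 Hz rate is an arbitrary choice, and WAV is an uncompressed format, so a compressed format like MP3 would add an extra step that isn't shown. It also assumes a little-endian machine, which covers most desktops and phones.

```python
import io
import struct
import wave

# Made-up 16-bit sample values standing in for what a converter might output.
samples = [0, 12000, 23000, 12000, 0, -12000, -23000, -12000]

# Pack each number into 2 bytes (little-endian signed 16-bit): this is the "binary data".
raw = struct.pack(f"<{len(samples)}h", *samples)

# Write those bytes into a WAV container. An in-memory buffer stands in for a disk file.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 2 bytes per sample
    wav.setframerate(8000)   # assumed rate: 8000 samples per second
    wav.writeframes(raw)

data = buf.getvalue()
print(len(raw), "bytes of samples,", len(data), "bytes including the WAV header")
print(data[:4])  # the container's leading marker bytes
```

Everything after this point is just bytes. Whether they end up as magnetic domains, trapped charge, or pits on a disc is up to the storage device.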
It’s not getting turned into rocks, it’s getting written onto media.
Also, some numbers for scale…
My computer has a 3.5 GHz processor. Each core ticks 3.5 billion times every second, and it gets through roughly one simple instruction (often more) per tick. To put that in perspective, a commonly quoted figure for the shortest thing humans can perceive is ~13 ms. That processor can get through ~45 million cycles in that time frame. Computers perform very simple tasks, extremely quickly, and it gives the impression of intelligence.
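For anyone who wants to check the scale, the arithmetic is just clock rate times time window: 3.5 GHz times 13 ms comes to about 45.5 million cycles. A quick sketch, where the one-cycle-per-tick framing is a simplification (real chips vary in how many instructions they finish per cycle) and the names are just for illustration:

```python
CLOCK_HZ = 3.5e9   # 3.5 GHz means 3.5 billion cycles per second
WINDOW_S = 0.013   # the ~13 ms perception figure, in seconds


def cycles_in_window(clock_hz: float, window_s: float) -> int:
    """Number of clock cycles that fit in a time window."""
    return round(clock_hz * window_s)


cycles = cycles_in_window(CLOCK_HZ, WINDOW_S)
print(f"{cycles:,} cycles in {WINDOW_S * 1000:.0f} ms")
```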
It doesn’t get your exact voice. Your speech gets sampled into digital “steps” that closely mimic the continuous “analog” output of your voice.
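A small sketch of what those "steps" look like. A pure 440 Hz sine wave stands in for a voice, measured 8000 times a second and rounded to one of 16 levels (4 bits). All of those numbers are arbitrary choices to keep the example small, and real recordings use far more levels (65,536 for 16-bit audio).

```python
import math


def sample_and_quantize(freq_hz, sample_rate_hz, bits, num_samples):
    """Measure a sine wave at fixed intervals and round each value to a step."""
    levels = 2 ** bits
    steps = []
    for i in range(num_samples):
        t = i / sample_rate_hz                          # time of this measurement
        analog = math.sin(2 * math.pi * freq_hz * t)    # "continuous" value in [-1, 1]
        step = round((analog + 1) / 2 * (levels - 1))   # nearest of the available steps
        steps.append(step)
    return steps


def step_to_analog(step, bits):
    """Turn a step number back into a value in [-1, 1]."""
    return step / (2 ** bits - 1) * 2 - 1


steps = sample_and_quantize(440, 8000, 4, 16)
print(steps)

# The rounding error per sample is at most half a step, so more bits means a closer copy.
t3 = 3 / 8000
error = abs(step_to_analog(steps[3], 4) - math.sin(2 * math.pi * 440 * t3))
print(f"error on sample 3: {error:.4f}")
```

The stored version is never the exact waveform, only the nearest step at each measurement. With enough steps and enough measurements per second, the difference gets too small to hear.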