I like thinking about how everything can be and is represented by numbers. For example, plaintext is represented by a code like ASCII, and images are represented by RGB valu
The answers all relate to sampling frequency, but don't address the question. A particular snapshot of a sound would, I imagine, include individual amplitudes for a lot of different frequencies (say you hit both an A and a C simultaneously on a keyboard, with the A being louder). How does that get recorded in a single 16-bit number? If all you are doing is measuring amplitude (how loud the sound is), how do you get the different notes?
Ah! I think I get it from this comment: "This number is then converted to the linear displacement of the diaphragm of your speaker." The notes come from how fast the diaphragm is vibrating. That's why you need about 44,000 values per second (44,100 for CD audio). A note is somewhere on the order of 1000 hertz, so a pure note would make the diaphragm move in and out about 1000 times per second. A recording of a whole orchestra has many different notes all over the place, but their vibrations simply add together, so all of it can be captured as a single time history of diaphragm motion. About 44,000 times per second the diaphragm is instructed to move in or out a little bit, and that simple (long) list of numbers can represent anything from Beyoncé to Beethoven!
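To convince myself, here is a minimal Python sketch of the A-plus-C example, under some assumptions of my own: a 44,100 Hz sample rate, pure sine waves standing in for the keyboard notes, and made-up loudness values of 0.6 and 0.3. The function names `mix_notes` and `strength_at` are just illustrative helpers, not from any library. Each sample is one signed 16-bit number, yet correlating the list against a given frequency shows both notes are still in there.

```python
import math

SAMPLE_RATE = 44_100  # samples per second (assumed: the CD audio rate)
MAX_INT16 = 32_767    # largest value a signed 16-bit sample can hold


def mix_notes(notes, duration_s=0.1):
    """Return one 16-bit integer per sample for the given notes.

    `notes` is a list of (frequency_hz, loudness) pairs. The loudness
    values should sum to at most 1.0 so the mix stays within 16 bits.
    """
    n_samples = round(SAMPLE_RATE * duration_s)
    samples = []
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        # Sound pressures add, so the mix is just the sum of the sines.
        value = sum(a * math.sin(2 * math.pi * f * t) for f, a in notes)
        samples.append(round(value * MAX_INT16))
    return samples


def strength_at(samples, freq_hz):
    """Estimate the loudness of `freq_hz` by correlating the samples
    with a cosine and a sine at that frequency (a one-bin Fourier sum)."""
    n = len(samples)
    step = 2 * math.pi * freq_hz / SAMPLE_RATE
    re = sum(s * math.cos(step * i) for i, s in enumerate(samples))
    im = sum(s * math.sin(step * i) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / (n * MAX_INT16)


# A4 (440 Hz) played louder than C5 (about 523.25 Hz).
samples = mix_notes([(440.0, 0.6), (523.25, 0.3)])

for freq in (440.0, 523.25, 700.0):
    print(f"{freq:7.2f} Hz -> strength {strength_at(samples, freq):.3f}")
```

The strengths at 440 Hz and 523.25 Hz should come out close to the loudness values that went in, while a frequency that was never played (700 Hz here) should come out near zero. The numbers are only approximate because the clip is short and the samples are rounded to integers.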