With a microphone attached to your laptop, get 1000 people to speak the word "ARTICHOKE". Each recording is stored as a sequence of digital sample values. Cross-correlate each person's recording against a base sample (the way YOU say "artichoke"). You might want to do an FFT on each sample and correlate in the frequency domain. Some people will say "artichoke" fast, while others will say "aaarrrtttiiiccchhhoookkkeee" (much slower). The slow speaker might give you three times as many samples as the fast one, so you will probably have to manipulate the input data first (for example, stretch or compress each recording to a common length) before you can get a good correlation.
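Here is a rough sketch of that idea in Python with NumPy, just to make the steps concrete. The function names `resample_to_length` and `peak_correlation` are made up for this example, and the linear-interpolation stretch and circular (wrap-around) correlation are my own simplifying assumptions, not a tested speech-matching method.

```python
import numpy as np

def resample_to_length(signal, n):
    """Stretch or compress a recording to n samples by linear interpolation.

    Hypothetical helper: a crude way to put fast and slow speakers on the
    same time scale before correlating.
    """
    signal = np.asarray(signal, dtype=float)
    old_t = np.linspace(0.0, 1.0, num=len(signal))
    new_t = np.linspace(0.0, 1.0, num=n)
    return np.interp(new_t, old_t, signal)

def peak_correlation(base, sample):
    """Peak of the normalized circular cross-correlation, computed via FFT.

    Returns a value of at most 1.0; values near 1.0 mean the two
    recordings line up well at some time shift.
    """
    base = np.asarray(base, dtype=float)
    n = len(base)
    sample = resample_to_length(sample, n)

    # Remove the DC offset so a constant bias does not inflate the score.
    base = base - base.mean()
    sample = sample - sample.mean()

    # Correlation in time is multiplication by the complex conjugate
    # in the frequency domain.
    spectrum = np.fft.rfft(base) * np.conj(np.fft.rfft(sample))
    xcorr = np.fft.irfft(spectrum, n=n)

    denom = np.linalg.norm(base) * np.linalg.norm(sample)
    return float(xcorr.max() / denom) if denom > 0 else 0.0

# Usage with synthetic data standing in for real recordings:
t = np.linspace(0.0, 1.0, 800)
base = np.sin(2 * np.pi * 5 * t) * np.hanning(800)   # "your" artichoke
slow = resample_to_length(base, 2400)                # same shape, 3x longer
print(peak_correlation(base, slow))                  # close to 1.0
```

Keep in mind that stretching a raw waveform like this also shifts its pitch, so with real voices you would likely get better results correlating something like the loudness envelope or short-time spectra instead of the raw samples.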
I just used the word "artichoke" off the top of my head. In reality, you could use just about any word.