Tuesday, January 3, 2012

Book - Probability and Statistics - Chapter 1 - 1.3. Descriptive Statistics - Population and Sample

1.3. Descriptive Statistics - Population and Sample


As we have said, collecting data is often costly, time-consuming and arduous. So in practice we have to infer the interesting features of a population from a subset of it, called a “sample”.

In Statistics you might say that “a number is not just a number”! What do I mean by this? Every number comes with a measure of uncertainty or variability. Why? Because we do not collect all the information about the variable of interest: the number is computed from a subset of the data, which is the sample. This brings an intrinsic uncertainty to the process. The number we get comes from a "piece" of the population, the sample, and NOT from the ENTIRE population, so a different sample would give a somewhat different number.
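To see this uncertainty in action, here is a minimal simulation sketch. The population is invented for illustration (100,000 voters, 40% of whom support candidate X), and the function name `sample_proportions` is just a helper made up for this example. It draws several random samples of 650 voters and computes the percentage of supporters in each one.

```python
import random
import statistics


def sample_proportions(population, sample_size, n_samples, seed=0):
    """Draw several simple random samples; return the proportion of 1s in each."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    proportions = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)  # sampling without replacement
        proportions.append(sum(sample) / sample_size)
    return proportions


# Hypothetical population: 1 = supports candidate X, 0 = does not (true value: 40%).
population = [1] * 40_000 + [0] * 60_000

props = sample_proportions(population, sample_size=650, n_samples=5)
print(props)                     # five estimates of the same true 40%
print(statistics.pstdev(props))  # how much the estimates vary between samples
```

Each sample gives a slightly different estimate of the same underlying 40%, which is exactly the variability that goes with any number computed from a sample.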

The population is the collection of all elements whose characteristics we wish to know. The elements (or "individuals") in the population are not necessarily people! We might be interested in the population of whales in the ocean, stars in the sky, files on a computer, stocks that went up in a given year, etc.

The sample is the subset of the population whose characteristics are measured. The sample will be used to discover characteristics of the population.

Examples of population and sample
Population = voters in the city of Rio de Janeiro. Sample = 650 voters chosen at random (by chance). Characteristic of interest = the percentage of voters who plan to vote for candidate X in the upcoming elections.
Population = mobile phone users in the city of Rio de Janeiro. Sample = 10,000 randomly selected users. Characteristic of interest = whether the user sent text messages (SMS) in the last day.
Population = all TV households in the city of Rio de Janeiro. Sample = 1,000 households with cable TV selected at random. Characteristic of interest = the audience share (percentage) of each TV station on a certain day of the week between 6 and 10 PM.

In short: we collect information from a sample that lets us learn something interesting about the population. And why do it? It's cost-effective! The costs are far lower than those of surveying the entire population (that is, doing a "census"). One can show that, for very large populations, a simple random sample of about 600 to 1,000 "individuals" already provides reliable estimates of the population characteristics one wishes to measure: for a percentage, the margin of error at 95% confidence is roughly 3 to 4 percentage points, almost regardless of the size of the population.
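The claim about samples of 600 to 1,000 can be checked with the usual normal-approximation formula for the margin of error of a proportion, z * sqrt(p(1 - p) / n). The sketch below assumes a simple random sample from a very large population, a 95% confidence level (z = 1.96) and the worst case p = 0.5. The function name `margin_of_error` is chosen for this example only.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from n individuals.

    Assumes a simple random sample from a very large population and the
    normal approximation; p = 0.5 is the worst case (largest margin).
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (600, 1000, 10_000):
    # Print the margin in percentage points for each sample size.
    print(n, round(100 * margin_of_error(n), 1))
```

The margin shrinks with the square root of the sample size, so going from 1,000 to 10,000 individuals costs ten times more but only cuts the margin from about 3 points to about 1 point.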

A census is a survey of the entire population. The population census in Brazil (and in most countries) is carried out every 10 years. Why? Because a census is expensive and time-consuming. Indeed, in the 90's it was delayed because the government had no money.

Now what?
You have collected a sample and, within this sample, you have collected numerical data (for example, the average monthly electricity consumption in kWh of households in a certain area of town). What do you do with it? There are two possibilities:

- You can simply describe these numerical data using tables, graphs and numerical measures. This is called descriptive statistics. Most market research does just that, and it is undoubtedly very important.
- The second alternative is to try to draw conclusions about the characteristics of the population from the data observed in the sample. This is called inferential statistics (or just statistics!). To be able to do it, we must have a thorough understanding of probability theory. This might explain why descriptive statistics emerged long before inferential statistics, which is basically a 20th century science.
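The two possibilities above can be put side by side in a short sketch. The consumption values are made up purely for illustration, and the interval uses a rough normal approximation (z = 1.96), which is an assumption of this example; for a sample this small a t-based interval would be slightly wider.

```python
import math
import statistics

# Made-up monthly consumption in kWh for 30 sampled households (illustration only).
sample_kwh = [152, 180, 210, 165, 198, 240, 175, 160, 220, 190,
              205, 170, 185, 230, 195, 158, 176, 214, 188, 201,
              167, 182, 226, 193, 179, 208, 171, 186, 217, 199]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample_kwh)
median = statistics.median(sample_kwh)
stdev = statistics.stdev(sample_kwh)  # sample standard deviation (n - 1 denominator)
print("mean:", mean, "median:", median, "stdev:", stdev)

# Inferential statistics: say something about the population mean.
# Rough 95% interval: mean +/- 1.96 * stdev / sqrt(n).
n = len(sample_kwh)
half_width = 1.96 * stdev / math.sqrt(n)
interval = (mean - half_width, mean + half_width)
print("approximate 95% interval for the population mean:", interval)
```

The first part only describes the 30 households that were measured. The second part uses probability theory to make a statement, with stated uncertainty, about all the households in the area.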
