Tuesday, January 3, 2012

Book - Probability and Statistics - Chapter 1

Chapter 1 - Descriptive Statistics


1.1. Classification of Data


We constantly collect facts and figures for presentation and interpretation. Collecting data is usually hard and expensive (although it keeps getting easier, and the main difficulty may now be handling the huge amount of data we have gathered), so you should use your data carefully and wisely. This chapter is a very brief introduction to “types” of data and data summaries.

We can separate data into different classes according to their characteristics:
Qualitative (or categorical) Data
Generally non-numeric data, or numeric data that do not “really” represent numbers but serve as labels. Such data are measured on nominal or ordinal scales (see below).
Quantitative Data
Data that indicate numerical quantities, such as price, volume, pitch, duration, profit, height, weight, ...

We can also establish the scale of our data as nominal, ordinal, interval or ratio, as indicated in the following examples.

Nominal Scale
For example, suppose we divide a country into five different regions (Southeast, South, Northeast, North, Midwest). A numerical code can be assigned to each region (e.g., Southeast = 1, South = 2, ...), but there is no particular ordering to this code (we could have set North = 1, Northeast = 2, ... and it would, in principle, work just as well).

Another common practice is to interview people and assign a code to their gender, for example, 1 if male, 0 if female. Obviously this code can be reversed without major problems, although there are some situations where the analyst might wish to code variables in a particular way, but that is something we shall address much later in these notes.
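The point that nominal codes are arbitrary labels can be checked with a small sketch. The survey answers below are made up for illustration, and the second coding only follows the text for North = 1 and Northeast = 2; the remaining codes are an assumption. The frequencies per region do not depend on the coding, while the average of the codes does, which is why such an average has no meaning.

```python
from collections import Counter

# Hypothetical survey answers (illustrative data, not from the text).
regions = ["South", "North", "South", "Midwest", "North", "South"]

# Two arbitrary codings of the same five labels.
# coding_b follows the text for North and Northeast; the rest is assumed.
coding_a = {"Southeast": 1, "South": 2, "Northeast": 3, "North": 4, "Midwest": 5}
coding_b = {"North": 1, "Northeast": 2, "South": 3, "Southeast": 4, "Midwest": 5}


def counts_by_label(values, coding):
    """Encode the labels, count the codes, then map the counts back to labels."""
    decode = {code: label for label, code in coding.items()}
    coded = [coding[value] for value in values]
    return {decode[code]: n for code, n in Counter(coded).items()}


counts_a = counts_by_label(regions, coding_a)
counts_b = counts_by_label(regions, coding_b)
print(counts_a == counts_b)  # frequencies are the same under both codings

# The mean of the codes changes with the coding, so it carries no information.
mean_a = sum(coding_a[value] for value in regions) / len(regions)
mean_b = sum(coding_b[value] for value in regions) / len(regions)
print(mean_a, mean_b)
```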

Ordinal Scale
It is similar to the nominal scale, but there is an "order" inherent in the codes.
For example, "Excellent" = 4, "Good"= 3, "Fair"= 2, "Poor" = 1, "Bad" = 0.
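On an ordinal scale the order of the codes carries information, so sorting and order-based summaries such as the median make sense. The sketch below uses the codes from the example above; the list of answers is hypothetical.

```python
from statistics import median_low

# Codes from the example above: a higher code means a better rating.
rating_code = {"Bad": 0, "Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}
code_rating = {code: label for label, code in rating_code.items()}

# Hypothetical answers from five respondents (illustrative data).
answers = ["Good", "Fair", "Excellent", "Poor", "Good"]

# Sorting by code orders the answers from worst to best.
codes = sorted(rating_code[answer] for answer in answers)

# median_low always returns one of the observed codes,
# so the result can be mapped back to a label.
median_rating = code_rating[median_low(codes)]
print(median_rating)
```

The distance between codes is still not meaningful here: nothing says that "Good" is as far from "Fair" as "Fair" is from "Poor".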

Interval Scale
It applies to numeric data in which the differences between values are meaningful, but there is no true zero, so ratios between values are not. For example, the scores of a standardized examination such as the SAT (or GRE or GMAT), or of any test in which all grades are restricted to a pre-specified range.

Ratio Scale
It applies to numeric data with a true zero, so that ratios between values are meaningful. For example, the price of a car. If one car costs $40,000 and another $20,000, the prices are proportional: the first car costs twice as much as the second.
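One way to see the difference between the ratio and interval scales is to change the unit of measurement. In the sketch below, the car prices come from the example above. The temperature readings are not from the text; they are a common textbook illustration of an interval scale, used here as an assumption. The price ratio survives a change of unit, the temperature ratio does not.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Ratio scale: car prices from the example above, in dollars and in cents.
price_ratio_dollars = 40_000 / 20_000
price_ratio_cents = (40_000 * 100) / (20_000 * 100)

# Interval scale (assumed illustration): two temperature readings.
warm_c, cool_c = 20.0, 10.0
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

print(price_ratio_dollars, price_ratio_cents)  # same ratio in both units
print(ratio_celsius, ratio_fahrenheit)         # ratio depends on the unit
```

Because the temperature ratio depends on the unit, saying that one day is "twice as hot" as another has no meaning, whereas "twice as expensive" does.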

In addition to the classifications already shown, we need to know if our data are cross-sectional or time series representations. The difference between the two types is easy to spot, as you will see next.

Cross-Sectional Data
The data are obtained at the same time (or almost at the same time). For example, suppose you are interested in buying a used car of a certain make, model and year, and you check a web site for the ads in your city. Based on this sample, you can decide what price you are willing to offer for the car (with some variability depending on the mileage, maintenance and specific condition of a given car). The basic point is that all the ads you “sampled” refer to a single point in time: the moment when you were searching for a car and had the money to buy it.
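As a rough sketch of the used-car example, the asking prices below are invented for illustration and stand for ads collected on the same day. With cross-sectional data the order of the observations carries no information, so summaries such as the mean and the range are the natural starting point.

```python
from statistics import mean, stdev

# Hypothetical asking prices (in dollars) from ads seen on the same day.
ad_prices = [18_500, 19_200, 17_900, 20_100, 18_800]

mean_price = mean(ad_prices)
price_spread = stdev(ad_prices)  # sample standard deviation
lowest, highest = min(ad_prices), max(ad_prices)

print(f"mean price: {mean_price}")
print(f"range: {lowest} to {highest}, standard deviation: {price_spread:.0f}")

# Shuffling or sorting the ads would not change any of these summaries.
assert mean(sorted(ad_prices)) == mean_price
```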

Time Series Data
Data are obtained at different instants of time. The main idea is to observe how the variable evolves over time. For example, you collect the monthly sales data for a certain product of your company and check whether sales are increasing, decreasing or stable, and whether there is any seasonal pattern (due to events such as Christmas, Mother's Day, Valentine's Day, or “Black Friday”, traditionally the beginning of the Christmas shopping season in the United States). Time series analysis and forecasting is a separate area of statistics, with its own methods, and will not be the focus of this chapter.
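A minimal sketch of the monthly sales example follows. The sales figures are invented, and the function name `describe_trend` is a hypothetical helper, not part of any library. Unlike cross-sectional data, the observations must stay in time order, because the analysis looks at the change from one month to the next.

```python
# Hypothetical monthly sales (units sold), listed in time order.
monthly_sales = [
    ("2011-01", 120),
    ("2011-02", 125),
    ("2011-03", 131),
    ("2011-04", 140),
]

# Month-over-month changes: each value minus the previous one.
values = [sales for _, sales in monthly_sales]
changes = [later - earlier for earlier, later in zip(values, values[1:])]


def describe_trend(changes):
    """Classify a series from the sign of its period-to-period changes."""
    if all(change > 0 for change in changes):
        return "increasing"
    if all(change < 0 for change in changes):
        return "decreasing"
    if all(change == 0 for change in changes):
        return "stable"
    return "mixed"


trend = describe_trend(changes)
print(changes, trend)
```

This only looks at the direction of consecutive changes. Detecting a seasonal pattern would need several years of data and the dedicated methods mentioned above.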
