Tuesday, January 3, 2012

Book - Probability and Statistics - Chapter 1 - 1.4. Descriptive Statistics - Charts

1.3. Descriptive Statistics - Population and Sample

Collecting data, as we have said, is often a costly, time-consuming and arduous task. So in practice we are obliged to infer the interesting features of a population from a subset of it, called a “sample”.

In Statistics you might say that “a number is not just a number”! What do I mean by this? Associated with a number there is a measure of uncertainty or variability. Why? Simply because we do not collect all the information about the variable of interest: our information is based on a subset of the data, which is the sample. This brings an intrinsic uncertainty to the process. As we do not collect data from the entire population, the number we get has an associated uncertainty due to the fact that it was computed from a "piece" of the population, the sample, and NOT from the ENTIRE population.
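To make this idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration only: a made-up population of 100,000 voters, 40% of whom support a candidate X, and five independent random samples of 650 voters each. The point is just that each sample yields a slightly different number, even though the population never changes.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 voters, 40% of whom support candidate X
# (1 = supports X, 0 = does not). These figures are invented for illustration.
population = [1] * 40_000 + [0] * 60_000


def sample_share(pop, n):
    """Share of supporters in a simple random sample of size n (no replacement)."""
    sample = random.sample(pop, n)
    return sum(sample) / n


# Five independent samples of 650 voters give five different estimates
# of the same (fixed) population share of 0.40.
estimates = [sample_share(population, 650) for _ in range(5)]
for i, est in enumerate(estimates, start=1):
    print(f"sample {i}: estimated share = {est:.3f}")
```

The spread among the printed estimates is exactly the "uncertainty" attached to any single number obtained from a sample.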

The population is the collection of all elements whose characteristics we wish to know. The elements (or "individuals") in the population are not necessarily people! We might be interested in the population of whales in the ocean, stars in the sky, files in a computer, stocks that went up in a given year, etc…

The sample is the subset of the population whose characteristics are measured. The sample will be used to discover characteristics of the population.

Examples of population and sample

Population = voters in the city of Rio de Janeiro. Sample = 650 voters chosen at random (by chance). Feature of interest: the percentage of voters who plan to vote for a candidate X in the upcoming elections.

Population = mobile phone users in the city of Rio de Janeiro. Sample = 10,000 randomly selected users. Characteristic of interest: whether the user used text messages (SMS) in the last day.

Population = all TV households in the city of Rio de Janeiro. Sample = 1000 households with cable TV selected at random. Characteristic of interest: the audience share of each TV station on a certain day of the week between 6 and 10 PM.

In short: we collect information from a sample that lets us learn something interesting about the population. And why do it? It's cost effective! The costs are far lower than those of surveying the entire population (that is, doing a "census"). One can show that, even for very large populations, a random sample of about 600 to 1000 "individuals" provides quite reliable estimates of the population characteristics one wishes to measure.
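A rough way to see why 600 or 1000 respondents can be enough is the usual approximate margin of error for an estimated percentage. The sketch below assumes simple random sampling from a very large population, the normal approximation, a 95% confidence level (z = 1.96) and the worst case p = 0.5; the function name `margin_of_error` is just a label chosen for this example.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from n respondents.

    Assumes simple random sampling from a very large population and the
    normal approximation; p = 0.5 is the worst (widest) case.
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (600, 1000):
    moe = margin_of_error(n)
    print(f"n = {n:5d}: about +/- {100 * moe:.1f} percentage points")
```

Under these assumptions the margin is about 4.0 percentage points for n = 600 and about 3.1 for n = 1000. Note that the population size does not appear in the formula at all, which is why a sample of this size works for a city or for a whole country.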

A census is a survey of the entire population. The population census in Brazil (and in most countries) is performed every 10 years. Why? Because a census is expensive and time consuming. Indeed, in the 90's it was delayed because the government had no money....

Now what?

You have collected a sample and, within this sample, you collected numerical data (for example, the average monthly electricity consumption in kWh of households in a certain area of town). What to do with it? There are two possibilities….

- You can simply describe these numerical data using tables, graphs and numerical measures. This is called descriptive statistics. Most market research does just that, and it is undoubtedly very important.

- The second alternative is to try to draw conclusions about the population characteristics from the data observed in the sample. This is called inferential statistics (or just statistics!). To be able to do it, we must have a thorough understanding of probability theory. This might explain why descriptive statistics emerged long before inferential statistics, which is basically a 20th century science.
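As a small taste of the first alternative, the sketch below summarizes a sample with a few numerical measures using only Python's standard library. The ten consumption values are invented for illustration; they are not real data.

```python
import statistics

# Hypothetical monthly electricity consumption (kWh) of 10 sampled households.
kwh = [152, 210, 187, 305, 98, 176, 243, 199, 164, 221]

# A few common descriptive measures of the sample.
summary = {
    "n": len(kwh),
    "mean": statistics.mean(kwh),
    "median": statistics.median(kwh),
    "stdev": statistics.stdev(kwh),  # sample standard deviation (n - 1 in the denominator)
    "min": min(kwh),
    "max": max(kwh),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.1f}")
```

This only describes the ten households we measured. Saying something about all households in that area of town, and how sure we are about it, is the job of inferential statistics.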

Book - Probability and Statistics - Chapter 1 - Data Sources

1.2. Data Sources

You can find a lot of interesting things on Internet sites. I don't even dare to suggest where to look, as the list grows every day. Google itself has an abundance of data, and quite recently (in 2009) they published a very interesting applied working paper called “Predicting the Present with Google Trends” (the article is available at http://ec.europa.eu/bepa/pdf/seminars/google_predicting_the_present.pdf and a presentation at http://www.frbsf.org/economics/conferences/1103/Varian-part_1.pdf). More than ever, you should explore, investigate, snoop around, and be curious!

You can find data quite easily at price comparison sites, and I bet you also use them quite often!

If you are an economist, there are wonderful databases provided by several US government agencies, and also by multilateral and international agencies and institutions. You can check my blog or website for some links, but as I said before, I'm kind of discouraged about publishing links, as they change much faster than I can keep up with them. Just to get the conversation flowing, check the homepages of the St. Louis Fed and the Department of Energy – you’ll get LOTS of good data on these sites!

Most of the time, over the course of your professional life, you’ll be analyzing data obtained in your own company. This is a growing trend in companies of all sizes, as the collection and maintenance of data in companies becomes cheaper.

This is also a consequence of a change in managers’ mindsets, as businesspeople recognize the importance of the data stored within the company. These data are useful for estimating the demand for current and new products, and for predicting and controlling inventories, payment flows and financing. In short, companies themselves often hold a substantial amount of critical data whose analysis helps increase their competitiveness (and sometimes even ensures their survival).

Some public sites with free economic information which I consider interesting (but the list is very personal) are:

Ipeadata (Brazilian Economic data):

BCB (Brazil´s Central Bank): www.bcb.gov.br

IBGE (Brazil’s Census Bureau): www.ibge.gov.br

St. Louis Federal Reserve Bank (USA): http://www.stlouisfed.org/

Dept. of Energy (USA): http://www.eia.gov/

U.S. Census Bureau: http://www.census.gov/

Yahoo Finance - for free historical stock quotes: www.yahoofinance.com

Google Finance: http://www.google.com/finance

Book - Probability and Statistics - Chapter 1

Chapter 1 - Descriptive Statistics

1.1. Classification of Data

All the time we collect facts and figures for presentation and interpretation. Collecting data is usually hard and expensive (although it is getting easier and easier and the main difficulty might be to deal with the huge amount of data we have gathered…) so you might as well use your data carefully and wisely. This chapter is a VERY brief introduction to “types” of data and data summaries.

We can separate the "types" of data in different classes according to their characteristics:

Qualitative (or categorical) Data

Generally non-numeric data, or numeric data that do not “really” represent numbers: they are just labels, and we say that such data are measured on a nominal or ordinal scale (see below).

Quantitative Data

They indicate numerical quantities, such as price, volume, pitch, duration, profit, height, weight,...

We can also establish the scale of our data as nominal, ordinal, interval or ratio, as indicated in the following examples.

Nominal Scale

For example, suppose we divide a country into five different regions (Southeast, South, Northeast, North, Midwest). A numerical code can be assigned to each region (e.g., Southeast = 1, South = 2, ...), but there is no particular ordering to this code (we could have set North = 1, Northeast = 2, … and it would, in principle, work just as well).

Another common practice is to interview people and assign a code to their gender, for example, 1 if male, 0 if female. Obviously this code can be reversed without major problems, although there are some situations where the analyst might wish to code variables in a particular way, but that is something we shall address much later in these notes.
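The sketch below illustrates why the codes on a nominal scale are only labels. The list of interviewed regions and the two codings are made up for the example: counting how many observations fall in each region gives the same answer under either coding, while an "average code" changes with the coding and therefore means nothing.

```python
from collections import Counter

# Hypothetical answers to "In which region do you live?"
regions = ["South", "North", "Southeast", "South", "Midwest", "Southeast", "Southeast"]

# Two equally valid (and equally arbitrary) numerical codings of the same labels.
coding_a = {"Southeast": 1, "South": 2, "Northeast": 3, "North": 4, "Midwest": 5}
coding_b = {"North": 1, "Northeast": 2, "Southeast": 3, "South": 4, "Midwest": 5}

codes_a = [coding_a[r] for r in regions]
codes_b = [coding_b[r] for r in regions]

# Frequencies per region do not depend on which coding was used.
decode_a = {code: label for label, code in coding_a.items()}
decode_b = {code: label for label, code in coding_b.items()}
counts_a = Counter(decode_a[c] for c in codes_a)
counts_b = Counter(decode_b[c] for c in codes_b)
print(counts_a == counts_b)  # the two frequency tables are identical

# The mean of the codes DOES depend on the coding, so it is not a meaningful summary.
mean_a = sum(codes_a) / len(codes_a)
mean_b = sum(codes_b) / len(codes_b)
print(f"mean code under coding A: {mean_a:.2f}, under coding B: {mean_b:.2f}")
```

So for nominal data the sensible summaries are counts and percentages per category, not averages.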

Ordinal Scale

It is similar to the nominal scale, but there is an "order" inherent in the codes.

For example, "Excellent" = 4, "Good"= 3, "Fair"= 2, "Poor" = 1, "Bad" = 0.
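Because the order of the codes is meaningful, sorting and picking the middle answer makes sense for ordinal data. Here is a minimal sketch using the coding above; the seven answers are invented for illustration.

```python
import statistics

# The coding from the text: a higher code means a better rating.
scale = {"Bad": 0, "Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}
labels = {code: label for label, code in scale.items()}

# Hypothetical answers from seven respondents.
answers = ["Good", "Fair", "Excellent", "Good", "Poor", "Good", "Bad"]

# Sorting by code respects the inherent order of the categories.
ordered = sorted(answers, key=scale.get)
print(ordered)

# median_low always returns a value that actually occurs in the data,
# so it can be mapped back to a label.
median_code = statistics.median_low([scale[a] for a in answers])
median_label = labels[median_code]
print("median answer:", median_label)
```

Note that the distance between codes is still not meaningful: nothing guarantees that the step from "Fair" to "Good" is the same size as the step from "Good" to "Excellent".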

Interval Scale

It applies to ordered numerical data in which differences between values are meaningful, but the zero point is arbitrary, so ratios are not. For example, "scores" of a standardized examination, such as the SAT (or GRE or GMAT), or any test in which all grades are restricted to a pre-specified range.

Ratio Scale

It is data in the “usual” numerical sense, with a true zero. For example, the price of a car. If one car costs $40,000 and another $20,000, there is proportionality between the prices: the first costs twice as much as the second.
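A quick way to tell these last two scales apart is to ask whether "twice as much" survives a change of origin. The sketch below uses the car prices from the text and, as an assumption made only for illustration, two scores on a hypothetical test whose scale runs from 200 to 800.

```python
# Ratio scale: prices have a true zero, so ratios are meaningful.
price_a, price_b = 40_000, 20_000
price_ratio = price_a / price_b
print("price ratio:", price_ratio)  # the first car costs twice as much

# Interval-type scale: hypothetical test scores on a 200-800 range.
# The lowest possible score is 200, so the "zero" of the scale is arbitrary.
score_a, score_b = 600, 300
minimum = 200

raw_ratio = score_a / score_b                            # ratio of raw scores
shifted_ratio = (score_a - minimum) / (score_b - minimum)  # ratio of points above the minimum
print("ratio of raw scores:", raw_ratio)
print("ratio of points above the minimum:", shifted_ratio)

# Differences, on the other hand, are unaffected by where the scale starts.
diff_raw = score_a - score_b
diff_shifted = (score_a - minimum) - (score_b - minimum)
print("difference:", diff_raw, "=", diff_shifted)
```

The ratio of the scores changes when we move the origin, so saying "twice as good" is not justified there, while the 300-point difference stays the same.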

In addition to the classifications already shown, we need to know if our data are cross-sectional or time series representations. The difference between the two types is easy to spot, as you will see next.

Cross Sectional Data

The data are obtained at the same time (or almost at the same time). For example, suppose you are interested in buying a used car of a certain brand, model and year, and you check a web site looking for the ads in your city. Based on this sample, you can decide what price you are willing to offer for the car (with some variability depending on mileage, maintenance and the specific condition of a given car). The basic point is – all the ads you “sampled” basically referred to a given point in time – the time when you were willing to buy a car, were searching for one, and had the money in your bank account to do it!

Time Series Data

Data are obtained at different instants of time. The main idea is to observe how the variable evolves over time. Time series analysis and forecasting is a separate area of statistics, with its own methods, and will not be the focus of our chapter. For example, you collect the monthly sales data for a certain product of your company and check whether sales are increasing, decreasing or stable, and whether there is any seasonal pattern in them (due to events such as Christmas, Mother's Day, Valentine's Day, “Black Friday” - traditionally the beginning of the Christmas shopping season in the United States, etc.).
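The monthly sales example can be sketched in a few lines. The figures below are invented for illustration, and the "seasonal index" used here (average sales in a calendar month divided by the overall monthly average) is just one simple way of looking for a seasonal pattern, not a full time series method.

```python
# Hypothetical monthly sales (units) for two consecutive years, January to December.
sales = {
    2010: [100, 95, 105, 110, 130, 108, 112, 115, 109, 118, 140, 190],
    2011: [110, 104, 116, 120, 145, 119, 122, 127, 121, 130, 155, 210],
}
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Trend: compare the yearly totals.
totals = {year: sum(values) for year, values in sales.items()}
growth = totals[2011] / totals[2010] - 1
print(f"yearly totals: {totals}, growth: {100 * growth:.1f}%")

# Seasonality: average each calendar month across years and compare it
# with the overall monthly average (an index above 1 means a strong month).
overall_avg = sum(totals.values()) / (12 * len(sales))
seasonal_index = {
    month: (sum(sales[year][i] for year in sales) / len(sales)) / overall_avg
    for i, month in enumerate(months)
}
peak_month = max(seasonal_index, key=seasonal_index.get)
print(f"strongest month: {peak_month} (index {seasonal_index[peak_month]:.2f})")
```

With these made-up numbers, sales grow from one year to the next and December stands out, which is the kind of pattern you would expect from Christmas shopping.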

Book - Probability and Statistics - Introduction

Introduction

These lecture notes are an introduction to Probability and Statistics designed for undergraduate students in Engineering, Economics and the quantitative Sciences. They should also be useful for practitioners.

Their style is quite informal, and I would like my readers to feel as if I were talking to them.

I realize it is a very bold and maybe pretentious move for a non-native speaker to attempt writing a textbook in English, but the objective is to reach as wide an audience as possible and, in my humble opinion, English is the Latin of the early 21st century, at least as far as technical jargon is concerned.

Many examples and datasets in the text refer to Brazil, my native country, but hopefully they are common enough to be understood by readers worldwide. If not, please let me know.

These notes will be continuously published as blog posts and eventually will form a book.

Needless to say, all mistakes are solely my fault, and comments and suggestions will be much appreciated.

I hope that you, the reader, enjoy it and use it as part of a fascinating journey that I started long ago and whose end I haven't yet reached: the knowledge of Probability, Statistics and Data Analysis.

Let’s go!

Monica Barros, D.Sc.

January 2012

Book - Probability and Statistics - work in progress

As a New Year's resolution, I decided to start translating my lecture notes on Probability and Statistics into English and will post them here, together with exercises and solutions.

Hopefully this will widen the blog's audience.

Comments are more than welcome.

Cheers,

M.Barros

M. Barros (in English)


Sunday, January 15, 2012

Book - Probability and Statistics - Section 1.5. Descriptive Statistics - Numerical Measures