Sunday, December 23, 2012

Data Mining and Predictive Analytics: 6 Reasons You Hired the Wrong Data Miner

Thursday, December 13, 2012

A Compilation on Time Series and Stochastic Processes geared towards Quantitative Finance

Beware - this is NOT an original piece of work!

I just prepared a compilation from Wikipedia articles and repackaged it into a PDF file. I will later try to repackage it into Kindle format using Calibre. I couldn't upload it to my own site, so I am sending it to Slideshare so you can download it.

It amounts to a LARGE file, almost 200 pages long, so think before you print...

The idea was to put everything together (or almost everything...) in a halfway cohesive manner, to serve as an off-line and free reference for those interested in Time Series, Stochastic Processes and Quant Finance.

Once again, I am NOT responsible for the content, I just compiled it from Wikipedia in English.

The Economist Charts - Economic Opportunity for women (by country)

Economic opportunity for women: Where to be female | The Economist

The results are not surprising at all. I am, however, quite disappointed with Brazil's ranking. It seems that having a female president has had no significant effect on the opportunities for women so far. But let's be realistic: cultural changes don't happen easily or quickly most of the time.

The Brave New World of BI

One of many interesting articles on the subject. My guess is BI will be THE subject for the years to come and reshape business and academia.
The Brave New World of Business Intelligence

Wednesday, December 12, 2012

Free online courses in stat and data mining - Coursera

I discovered by chance a website that I thought was absolutely wonderful, with sensational and free courses taught by professors from major universities (so far I only noticed universities in the USA).

It looks so good that I signed up for three courses. Now I have no excuses not to learn R.

The link to the homepage is:

I was looking for courses related to Statistics and Data Mining, and found things that seemed wonderful. If you want to start where I began, see:

Monday, December 10, 2012

Presentation - International Symposium on Forecasting 2012

This is my presentation at ISF 2012. The purpose is to forecast the electricity spot price in Brazil through a hybrid neuro-fuzzy/neural network model.

The presentation can also be downloaded from my site at:

Noteworthy blog

I came across this blog today and it gave me ideas to write most of today's posts. It is really worth reading if you're interested in statistics. The link is:

Case study - Daily load forecasting

Some preliminary analysis I did a long time ago, which I believe is still relevant.

You can download it from my main site (under "Case Studies").

MOOC - an acronym for the future in teaching?

I just came across a blog post about a free statistics course (more about it later) and about MOOCs, a term I had never heard before.

Well, if you go to:

You'll find out MOOC stands for "Massive Open Online Course". MOOCs are a relatively new idea, and the blog author narrows the term to courses "that allow registration to anyone (for free or for a small price) and also allow some degree of two way communication". Traditional online courses are not necessarily MOOCs, as MOOCs require and promote an enormous amount of interaction among participants, and information is not concentrated, or stored, in a single place.

The next video explains the concept pretty well and is fun to watch.

If we go back to the beginning of this post and recall the online stat course I was talking about, the link is:

And the course (Statistics 110: Probability) is free and available on iTunes. It is a Harvard University course taught by Prof. Joe Blitzstein.

R introductory course - free

I haven't checked out the entire course, but it seems very nice and the registration process is quick and without hassles (just use your FB account).

I believe it is worth a look.

Big Data in practice - from NYTimes

A very interesting article from the NY Times with a direct application of big data.

Your car insurance will soon be a direct function of very measurable and personal parameters. Basically, the insurance policy works on a "pay per mile" basis, and charges less to people who drive shorter distances.

Naturally, this can be easily extended (and will, of course) to other areas, and even other types of insurance policies, as the NY Times article points out.

Read more at:

Sunday, November 4, 2012

Scilab - version 5.4.0. (new version and features)

YouTube video presenting Scilab's functionalities. I haven't tried it yet, but I think it's definitely worth a shot, since it's free, it is quite similar to Matlab and the number of users is growing fast.

Wednesday, October 3, 2012

Play time - Monty Hall Paradox - Videos and Links

Here are some links and videos about the Monty Hall paradox. I confess that the idea that switching doors was the right strategy was highly counterintuitive to me, but at least I am not alone (see the Wikipedia link).
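If the result still feels counterintuitive, a quick simulation can help. The sketch below is my own minimal illustration (it is not the Wolfram demonstration), and the function names, the number of rounds and the seed are all arbitrary choices. It assumes the standard rules: the host always opens a door that hides no car and is not the contestant's pick.

```python
import random


def play_monty_hall(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    first_pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    host_opens = rng.choice([d for d in doors if d != first_pick and d != car])
    if switch:
        # Switch to the only door that is still closed and was not picked first.
        final_pick = next(d for d in doors if d != first_pick and d != host_opens)
    else:
        final_pick = first_pick
    return final_pick == car


def win_rate(switch, rounds=100_000, seed=2012):
    """Fraction of rounds won with a fixed strategy (seed chosen arbitrarily)."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(rounds))
    return wins / rounds


print(f"Win rate when staying:   {win_rate(switch=False):.3f}")
print(f"Win rate when switching: {win_rate(switch=True):.3f}")
```

Staying should win about one third of the time and switching about two thirds, which is what the theory says.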

See the Wolfram simulation at:

Tuesday, January 3, 2012

Book - Probability and Statistics - Chapter 1 - 1.4. Descriptive Statistics - Charts

Book - Probability and Statistics - Chapter 1 - 1.3. Descriptive Statistics - Population and Sample

1.3. Descriptive Statistics - Population and Sample

Collecting data, as we have said, is often a costly, time-consuming and arduous task. So in practice we are obliged to infer some interesting features of a population from a subset of it, called a “sample”.

In Statistics you might say that “a number is not just a number”! What do I mean by this? Every number computed from a sample carries a measure of uncertainty or variability. Why? Because we did not collect data from the ENTIRE population: the number comes from a "piece" of the population, the sample, and a different sample would most likely give a slightly different number.

The population is the collection of all elements whose characteristics we wish to know. The elements (or "individuals") in the population are not necessarily people! We might be interested in the population of whales in the ocean, stars in the sky, files in a computer, stocks that went up in a given year, etc…

The sample is the subset of the population whose characteristics are measured. The sample will be used to discover characteristics of the population.
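To see why “a number is not just a number”, here is a small simulation sketch. Everything in it is made up for illustration: the population size, the 40% share of supporters and the sample size of 650 are assumptions, not real data. It draws several random samples from the same population and shows that each one gives a slightly different estimate.

```python
import random

rng = random.Random(42)  # arbitrary seed, only to make the run repeatable

# Hypothetical population: 1,000,000 voters, 400,000 of whom support candidate X.
SUPPORTERS = 400_000
NON_SUPPORTERS = 600_000
population = [1] * SUPPORTERS + [0] * NON_SUPPORTERS
true_share = SUPPORTERS / len(population)


def sample_share(population, n, rng):
    """Share of supporters in a simple random sample of size n."""
    return sum(rng.sample(population, n)) / n


# Five independent samples of 650 voters each.
estimates = [sample_share(population, 650, rng) for _ in range(5)]

print(f"True share in the population: {true_share:.3f}")
for i, estimate in enumerate(estimates, start=1):
    print(f"Sample {i}: estimated share = {estimate:.3f}")
```

The estimates should all land near the true share, but they will not coincide with it or with each other. That spread is the uncertainty attached to the number.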

Examples of population and sample
Population = voters in the city of Rio de Janeiro. Sample = 650 voters chosen at random (by chance). Characteristic of interest: the percentage of voters who plan to vote for candidate X in the upcoming elections.
Population = mobile phone users in the city of Rio de Janeiro. Sample = 10,000 randomly selected users. Characteristic of interest: whether the user sent text messages (SMS) in the last day.
Population = all TV households in the city of Rio de Janeiro. Sample = 1000 households with cable TV selected at random. Characteristic of interest: the audience share of each TV station on a certain day of the week between 6 and 10 PM.

In short: we collect information from a sample that lets us learn something interesting about the population. And why do it? It's cost effective! The costs are far lower than those of surveying the entire population (that is, doing a "census"). One can show that, even for very large populations, a simple random sample of about 600 to 1000 "individuals" provides reliable estimates of the population characteristics one wishes to measure: for a percentage, the margin of error at 95% confidence is roughly 4 and 3 percentage points, respectively.

A census is a survey of the entire population. The population census in Brazil (and in most countries) is performed every 10 years. Why? Because a census is expensive and time consuming. Indeed, in the 90's it was delayed because the government had no money....
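Where do sample sizes like 600 or 1000 come from? A rough sketch, assuming a simple random sample from a very large population and the usual normal approximation for a proportion: the 95% margin of error is about z * sqrt(p(1 - p) / n), with z = 1.96 and the worst case p = 0.5. The function name and the list of sample sizes below are my own choices.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample from a very large population and the
    normal approximation; p = 0.5 is the worst case, z = 1.96 gives 95%.
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (100, 600, 1000, 10_000):
    points = 100 * margin_of_error(n)
    print(f"n = {n:>6}: about +/- {points:.1f} percentage points")
```

Notice that the population size does not appear in the formula, and that the margin shrinks only with the square root of n: to halve the error you need about four times as many interviews.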

Now what?
You have collected a sample and, within this sample, you collected numerical data (for example, the average monthly electricity consumption in kWh of households in a certain area of town). What to do with it? There are two possibilities...

- You can simply describe these numerical data using tables, graphs and numerical measures. This is called descriptive statistics. Most market research does just that, and it is undoubtedly very important.
- The second alternative is to try to draw conclusions about the population characteristics from the data observed in the sample. This is called inferential statistics (or just statistics!). To be able to do it, we must have a thorough understanding of probability theory. This might explain why descriptive statistics emerged long before inferential statistics, which is basically a 20th century science.
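The two alternatives can be sketched side by side. The consumption figures below are invented for illustration, and the confidence interval uses a rough normal approximation (z = 1.96), which is only a first pass for such a small sample.

```python
import math
import statistics

# Hypothetical sample: average monthly consumption (kWh) of 12 households.
consumption_kwh = [152, 187, 201, 143, 230, 176, 165, 198, 210, 159, 184, 172]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(consumption_kwh)
median = statistics.median(consumption_kwh)
stdev = statistics.stdev(consumption_kwh)  # sample standard deviation

print(f"Sample mean:   {mean:.1f} kWh")
print(f"Sample median: {median:.1f} kWh")
print(f"Sample stdev:  {stdev:.1f} kWh")

# Inferential statistics: say something about the population mean.
# Rough 95% interval via the normal approximation (an assumption).
standard_error = stdev / math.sqrt(len(consumption_kwh))
ci_low = mean - 1.96 * standard_error
ci_high = mean + 1.96 * standard_error

print(f"Approximate 95% interval for the population mean: "
      f"{ci_low:.1f} to {ci_high:.1f} kWh")
```

The first three numbers only describe the 12 households we measured. The interval is a statement about all households in the area, and it is exactly there that probability theory comes in.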

Book - Probability and Statistics - Chapter 1 - Data Sources

1.2. Data Sources

You can find a lot of interesting things on the Internet. I don't even dare to suggest where to look, as the list grows exponentially every day. Google itself has an abundance of data, and quite recently (in 2009) they published a very interesting applied working paper called “Predicting the Present with Google Trends” (both the article and a presentation are available online). More than ever, you should explore, investigate, snoop around, and be curious!

You can find data quite easily at price comparison sites, and I bet you also use them quite often!

If you are an economist, there are wonderful databases provided by several US government agencies, and also multilateral and international agencies and institutions. You can check my blog or website for some links, but as I said before, I´m kind of discouraged about publishing links, as they evolve much faster than I can keep up with them. Just to get the conversation flowing, check the homepages of the St. Louis Fed and the Department of Energy – you’ll get LOTS of good data on these sites!

Most of the time, over the course of your professional life, you’ll be analyzing data obtained in your own company. This is a growing trend in companies of all sizes, as the collection and maintenance of data in companies becomes cheaper.

This is also a consequence of a change in managers' mindsets, as businesspeople recognize the importance of the data stored within the company. These data are useful for estimating the demand for current and new products, and for predicting and controlling inventories, payment flows and financing. In short, companies themselves often contain a substantial amount of critical data whose analysis helps increase their competitiveness (and sometimes even ensures their survival).

Some public sites with free economic information which I consider interesting (but the list is very personal) are:

Ipeadata (Brazilian Economic data):

BCB (Brazil´s Central Bank):

IBGE (Brazil’s Census Bureau):

St. Louis Federal Reserve Bank (USA):

Dept. of Energy (USA):

U.S. Census Bureau:

Yahoo Finance - for free historical stock quotes:

Book - Probability and Statistics - Chapter 1

Chapter 1 - Descriptive Statistics

1.1. Classification of Data

All the time we collect facts and figures for presentation and interpretation. Collecting data is usually hard and expensive (although it is getting easier and easier and the main difficulty might be to deal with the huge amount of data we have gathered…) so you might as well use your data carefully and wisely. This chapter is a VERY brief introduction to “types” of data and data summaries.

We can separate the "types" of data into different classes according to their characteristics:
Qualitative (or categorical) Data
Generally non-numeric data, or numeric data that do not “really” represent numbers: the values are just labels, and we say such data are on nominal or ordinal scales (see below).
Quantitative Data
They indicate numerical quantities, such as price, volume, pitch, duration, profit, height, weight, etc.

We can also establish the scale of our data as nominal, ordinal, interval or ratio, as indicated in the following examples.

Nominal Scale
For example, suppose we divide a country into five different regions (Southeast, South, Northeast, North, Midwest). A numerical code can be assigned to each region (e.g., Southeast = 1, South = 2, ...), but there is no particular ordering to this code (we could have set North = 1, Northeast = 2, ... and it would, in principle, work just as well).

Another common practice is to interview people and assign a code to their gender, for example, 1 if male, 0 if female. Obviously this code can be reversed without major problems, although there are some situations where the analyst might wish to code variables in a particular way, but that is something we shall address much later in these notes.

Ordinal Scale
It is similar to the nominal scale, but there is an "order" inherent in the codes.
For example, "Excellent" = 4, "Good"= 3, "Fair"= 2, "Poor" = 1, "Bad" = 0.

Interval Scale
It applies to ordered numerical data in which differences between values are meaningful, but there is no true zero, so ratios are not. For example, the "scores" of a standardized examination, such as the SAT (or GRE or GMAT), or any test in which all grades are restricted to a pre-specified range.

Ratio Scale
These are data on the “usual” numerical scale, with a true zero. For example, the price of a car. If one car costs $40,000 and another $20,000, the prices are proportional: the first car costs twice as much as the second.
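One practical way to remember the four scales is to ask which summaries make sense on each. The sketch below is only an illustration: the codes follow the examples above, but the observations (regions, ratings, scores, prices) are invented, and I am taking the interval scale in its usual sense, where differences are meaningful.

```python
import statistics

# Nominal: codes are arbitrary labels, so only counting makes sense.
regions = ["South", "Southeast", "Southeast", "North", "Southeast"]
most_common_region = statistics.mode(regions)

# Ordinal: the codes carry an order, so the median is a sensible summary.
rating_code = {"Bad": 0, "Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}
code_to_rating = {code: label for label, code in rating_code.items()}
ratings = ["Good", "Fair", "Excellent", "Good", "Poor"]
median_code = statistics.median_low(rating_code[r] for r in ratings)
median_rating = code_to_rating[median_code]

# Interval: differences between values are meaningful.
score_a, score_b = 600, 500  # hypothetical exam scores
score_difference = score_a - score_b

# Ratio: there is a true zero, so "twice as much" is meaningful.
price_first_car, price_second_car = 40_000, 20_000
price_ratio = price_first_car / price_second_car

print(f"Most common region (nominal): {most_common_region}")
print(f"Median rating (ordinal):      {median_rating}")
print(f"Score difference (interval):  {score_difference}")
print(f"Price ratio (ratio):          {price_ratio}")
```

Averaging the region codes would run without any error message, but the result would mean nothing. The scale of the data, not the software, decides which calculations are legitimate.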

In addition to the classifications already shown, we need to know if our data are cross-sectional or time series representations. The difference between the two types is easy to spot, as you will see next.

Cross Sectional Data
The data are obtained at the same time (or almost at the same time). For example, suppose you are interested in buying a used car of a certain brand, model and year, and you check a web site looking for the ads in your city. Based on this sample, you can decide what price you are willing to offer for the car (with some variability depending on mileage, maintenance and the specific condition of a given car). The basic point is that all the ads you “sampled” refer to roughly the same point in time: the moment when you were willing to buy a car, searching for one, and had the money in your bank account!

Time Series Data
Data are obtained at different instants of time. The main idea is to observe how the variable evolves over time. Time series analysis and forecasting is a separate area of statistics, with its own methods, and will not be the focus of our chapter. For example, you collect the monthly sales data for a certain product of your company and check whether sales are increasing, decreasing or stable, and whether there is any seasonal pattern in them (due to events such as Christmas, Mother's Day, Valentine's Day, “Black Friday”, traditionally the beginning of the Christmas shopping season in the United States, etc.).
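As a first look at the monthly sales example, the sketch below checks for a trend and for a seasonal pattern. The sales figures are invented for illustration, and the "seasonal index" used here (average for a calendar month divided by the overall monthly average) is just one simple convention, not a full time series method.

```python
import statistics

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical monthly sales (units) for two consecutive years, Jan to Dec.
sales = {
    2010: [100, 95, 105, 110, 130, 115, 110, 112, 108, 115, 140, 210],
    2011: [110, 104, 116, 120, 145, 126, 121, 124, 118, 127, 155, 235],
}

# Trend: compare yearly totals.
yearly_totals = {year: sum(values) for year, values in sales.items()}
growth = yearly_totals[2011] / yearly_totals[2010] - 1

# Seasonality: average of each calendar month relative to the overall average.
all_values = [value for values in sales.values() for value in values]
overall_mean = statistics.mean(all_values)
seasonal_index = [
    statistics.mean(values[m] for values in sales.values()) / overall_mean
    for m in range(12)
]
peak_month = MONTHS[seasonal_index.index(max(seasonal_index))]

print(f"Growth from 2010 to 2011: {100 * growth:.1f}%")
for month, index in zip(MONTHS, seasonal_index):
    print(f"{month}: seasonal index = {index:.2f}")
print(f"Peak month: {peak_month}")
```

An index well above 1 for a month (December in this made-up example) suggests a seasonal peak, while the comparison of yearly totals gives a rough idea of the trend.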

Book - Probability and Statistics - Introduction


These lecture notes are an introduction to Probability and Statistics designed for undergraduate students in Engineering, Economics and the quantitative Sciences. They should also be useful for practitioners.

Their style is quite informal, and I would like my readers to feel as if I were talking to them.

I realize it is a very bold and maybe pretentious move for a non-native speaker to attempt writing a textbook in English, but the objective is to reach as wide an audience as possible and, in my humble opinion, English is the Latin of the early 21st century, at least in what concerns technical jargon.

Many examples and datasets in the text refer to Brazil, my native country, but hopefully they are common enough to be understood by readers worldwide. If not, please let me know.

These notes will be continuously published as blog posts and eventually will form a book.

Needless to say, all mistakes are solely my fault, and comments and suggestions will be much appreciated.

I hope you, the reader, enjoy it and use it as part of a fascinating journey that I started long ago and whose end I haven't yet reached: the knowledge of Probability, Statistics and Data Analysis.

Let’s go!

Monica Barros, D.Sc.

January 2012

Book - Probability and Statistics - work in progress

As a New Year's resolution, I decided to start translating my lecture notes on Probability and Statistics into English and will post them here, together with exercises and solutions.

Hopefully this will widen the blog's audience.

Comments are more than welcome.