We all know that wrong data leads to wrong analysis, which can cost a lot of money. In this post I am highlighting five issues that lead to wrong data, and hence wrong analysis. Make sure you take care of them before spending any time on the analysis and on putting your presentation together.
- Manual entry by end users – When you rely on end users/customers to enter data in free form, you will get a lot of variations. For example, when you ask them to enter their city, you can get variations such as Redmond, Redmond WA, Redmond (Seattle), Remond, etc. Doing any analysis at the city level then poses a problem, since you have several variations of the same city. Unless you have thoroughly cleaned and accounted for every possible variation, your analysis will not be right. To avoid such issues, wherever possible provide the choices via a drop-down or auto-fill rather than letting users type in the answers (a small cleanup sketch for this case appears after the list).
- Manual entry by someone in the organization – Though this is more controlled than letting end users enter the data, you still have issues similar to number 1 above.
- Excel sheets (calculations) – This becomes an issue when you have a lot of calculated columns and some of them depend on other calculated columns and sheets. One simple mistake can cascade through multiple columns and mess up all of your analysis. Whenever possible, do not rely on calculated columns (prepared by someone else) in an Excel sheet; use the raw data and do your own calculations so that you can stand behind them (the CPM sketch after the list shows what that looks like).
- Data imports, connections and processing – In most companies the data resides in various places: databases, Excel sheets, flat files, Hadoop, 3rd parties, etc. To get a 360-degree view you will need to collect and combine data from all of these sources. Data corruption can occur at several points, including but not limited to mapping the wrong keys, missing some key data, writing wrong queries, importing partial data, etc. You should always verify the data that you are getting and make sure that it is clean and complete (a join-completeness check like the one sketched after this list is a good start). Since there are several owners of the data this is not always an easy task, particularly in large organizations. There is not much you can do on a daily basis to verify the quality of the data, but make sure you understand the underlying data sources and the process used to combine them. Make sure that you are in the loop on any changes being made to the data sources and the process. You cannot afford for your raw material (data) to be of low quality.
- Visualization tools – You expect these tools to just work, don’t you? However, be careful and double-check your calculations before you present your data. I recently ran into two issues that made me believe many of you will run into them as well and might present the wrong analysis.
- When you use averages, make sure that you are using the right columns. An average of averages is not the same as an average computed from the raw values in the individual rows. Recently I ran into a situation where CPM (Cost Per Thousand Impressions) was already calculated in the Excel sheet (see issue no. 3 above). When that data was brought over into Tableau, the analyst, in an effort to find the average CPM, averaged the pre-calculated column. Everything looked good on the surface, but it was wrong because it was an average of an average (see the sketch after this list).
- When using a map, make sure you use the correct values for longitude and latitude. I saw an example where the map rendered perfectly; however, a quick quality check showed that the average value for a state was higher than every individual city value in that state, which is not possible. On further investigation we found that the issue was with the way longitude and latitude were used to render the map. Once the issue was fixed, everything worked fine (a simple version of that quality check is sketched after this list).
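To make issue 1 concrete, here is a minimal cleanup sketch in Python. The `CANONICAL` table and `clean_city` helper are hypothetical; in real data the variant list keeps growing, which is exactly why a drop-down or auto-fill at entry time is the better fix.

```python
import re

# Hypothetical lookup of known variants -> canonical city name.
# In practice this table grows every time a new spelling shows up.
CANONICAL = {
    "redmond": "Redmond",
    "redmond wa": "Redmond",
    "redmond (seattle)": "Redmond",
    "remond": "Redmond",  # common typo
}

def clean_city(raw: str) -> str:
    """Normalize a free-form city entry to a canonical name."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return CANONICAL.get(key, raw.strip())

entries = ["Redmond", "Redmond WA", "Redmond (Seattle)", "Remond"]
print([clean_city(e) for e in entries])  # all map to "Redmond"
```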
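For issue 4, one lightweight way to catch wrong key mappings and partial imports when combining sources is to do the join with an indicator and look at what failed to match. The tables and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical extracts from two sources; in practice these would come
# from a database query, a flat file, an Excel export, etc.
orders = pd.DataFrame({"customer_id": [1, 2, 3, 4], "revenue": [100, 250, 80, 40]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["West", "East", "West"]})

# indicator=True flags rows that did not find a match on the join key,
# which is where bad key mappings and partial imports show up.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

print(f"{len(unmatched)} of {len(orders)} order rows have no matching customer")
print(unmatched)
```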
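Issues 3 and 5 come together in the CPM example. The numbers below are made up, but they show how averaging a pre-calculated per-row CPM column (an average of an average) differs from recomputing CPM from the raw cost and impression totals.

```python
import pandas as pd

# Made-up campaign data: cost in dollars, impressions delivered.
df = pd.DataFrame({
    "cost":        [100.0, 100.0],
    "impressions": [10_000, 100_000],
})
df["cpm"] = df["cost"] / df["impressions"] * 1000  # per-row CPM, as the Excel sheet had it

# Wrong: average the pre-calculated column (average of an average).
naive_cpm = df["cpm"].mean()                                  # (10.00 + 1.00) / 2 = 5.50

# Right: recompute from the raw totals.
true_cpm = df["cost"].sum() / df["impressions"].sum() * 1000  # 200 / 110,000 * 1000 ≈ 1.82

print(f"average of per-row CPM: {naive_cpm:.2f}")
print(f"CPM from raw totals:    {true_cpm:.2f}")
```
The two rows spend the same amount, but the second delivers ten times the impressions, so the naive average gives the cheap impressions far too little weight.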
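And the quality check from the map example boils down to a simple rule: an average for a state can never be higher than the largest individual city value in that state. A sketch of that check, with made-up numbers:

```python
import pandas as pd

# Made-up city-level metric per state.
cities = pd.DataFrame({
    "state": ["WA", "WA", "WA", "OR", "OR"],
    "value": [10.0, 12.0, 9.0, 7.0, 8.0],
})

# The state-level figures shown on the map; WA is deliberately wrong here.
map_values = pd.Series({"WA": 15.0, "OR": 7.5})

# An average can never exceed the maximum of the values it summarizes,
# so flag any state whose mapped figure is above its city maximum.
city_max = cities.groupby("state")["value"].max().reindex(map_values.index)
suspect = map_values[map_values > city_max]
print(suspect)  # WA shows up as impossible and prompts a closer look
```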
I would love to hear from you if you have encountered these or any other sources of data quality issues.