Big Data, Better Data?

Big Data Analysis

Businesses are experiencing a tsunami of data and a paradigm shift in how we make it meaningful. In 1965, Gordon Moore predicted that computing would increase in power and decrease in cost at an exponential rate. Moore’s Law has held up since that time. So as we move towards big data, is Moore’s Law still valid?

We can see the evolution of data management (see figure below) in the move from enterprise resource planning and customer relationship management, through web-based systems, and finally to big data. Data is now too large and too unstructured to justify the use of traditional relational database designs.





Though traditional database management and business intelligence systems (e.g. SQL) are still in use, the tide is turning towards the challenges of storing and manipulating big data. The biggest winner by far has been Hadoop, the open source software framework managed by the Apache Software Foundation. Businesses are starting to see how they can use big data tools to inform their decisions. Some examples of big data in action today are Spotify recommending playlists, Facebook suggesting friends and Netflix picking your next movie or box set.

Asking the right questions

Big data boils down to one thing when looking for an outcome – asking the right question!

The example from Douglas Adams’s ‘The Hitchhiker’s Guide to the Galaxy’ is an apt parable for big data. A supercomputer spends seven and a half million years calculating the meaning of life, the universe and everything. The computer calculates the answer to be ‘42’. After protests, it calmly explains that now they have the answer, they need to find the actual question – which requires a more sophisticated computer.

All data is meaningless without the skill to analyse it and yield results. The most successful companies make decisions based on facts and information. A business must create a strategy and be clear about what information it needs to achieve its goals. An example would be a company that wants to increase its customer numbers. Some good questions to ask would be ‘who are our current customers?’ and ‘what are the demographics of our most valued customers?’ – this makes it easier to identify the big data that can be gathered (Marr, 2015).


Sources of Big Data

Main Sources of Big Data

The real problem with big data is not storage or analysis – it is transforming the relevant data into useful information. This is not a new phenomenon; making data relevant has been an issue with all data sources, not only big data. Organisations that architect and design for big data can create a competitive advantage.

Volume, Variety, Velocity & (Veracity)

When analysing what characterises big data, prevailing theory outlines three distinct elements known as the 3 V’s – Volume (scale or size), Variety (sources) and Velocity (motion). Numerous studies have identified additional elements (see appendix below – all beginning with V!), but the seminal work from Douglas Laney is still relevant. The 3 V’s must be taken into consideration when designing an organisation’s Business Intelligence model. However, I would argue for the inclusion of a fourth V when characterising big data, namely Veracity.

Veracity is an important feature of big data because, no matter how accurate data seems, there will always be an inherent uncertainty. Examples include weather patterns, human sentiment, economic factors and future trends (IBM Report, 2012). No amount of data cleansing will make this data fully accurate. It is important to be able to analyse and forecast using this ‘uncertain data’ and still create valuable information. An organisation must factor in the Four V’s to create a competitive advantage from a big data strategic plan.

Big Data challenges

The main challenges arising from the characteristics of big data include (Jagadish et al., 2014):

  • Heterogeneity – the structure of the data must be interpreted, which requires metadata
  • Scale – the size of the data exceeds the capabilities of a single machine, requiring parallelism (multiple nodes) and cloud computing
  • Inconsistency and incompleteness – diverse sources and errors need to be identified and corrected/mitigated
  • Timeliness – real time techniques to filter and summarise data
  • Privacy and data ownership – laws to consider but also a philosophical argument on who ‘owns’ personal data.
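
To make the Scale point concrete, here is a minimal base R sketch of the split, process and combine pattern that frameworks such as Hadoop apply across many nodes. The sample text, the number of chunks and the helper names `count_words` and `merge_counts` are illustrative assumptions; everything runs on one machine here, with `lapply` standing in for work that a cluster would distribute.

```r
# Toy data set: a vector of text records (made-up sample text)
records <- c("big data big questions", "data needs questions",
             "big answers", "data data data")

# Split the records into two chunks, as a cluster would split data across nodes
chunks <- split(records, rep(1:2, length.out = length(records)))

# Map step (hypothetical helper): count the words in one chunk independently
count_words <- function(chunk) {
  counts <- table(unlist(strsplit(chunk, " ", fixed = TRUE)))
  setNames(as.vector(counts), names(counts))  # plain named numeric vector
}
partial_counts <- lapply(chunks, count_words)

# Reduce step (hypothetical helper): merge two sets of partial counts
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  # a[w] is NA when a word is missing from one side, hence na.rm = TRUE
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}
word_counts <- Reduce(merge_counts, partial_counts)

print(word_counts[order(names(word_counts))])
```

Each chunk is processed without reference to the others, which is what allows the work to be spread over many machines; only the small partial results need to be brought together at the end.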

These are the more technical challenges currently faced when working with big data. There are also wider economic, social and political issues, which need to be addressed at an international level. We have seen the fallout from the NSA’s collection of data and the impact it has had on people. Big data, and especially personal information, will become more freely available, and this has privacy and security implications. Organisations will have to face not only technical hazards but also moral ones in the future.

Conclusion – Future Direction

It is no longer feasible to handle increasing volumes of data as we did in the past. Big data has caused a fundamental shift, and CPU speeds and other resources alone are not able to keep pace with these data volumes. Moore’s Law is being seriously challenged but is, so far, holding up in the face of big data. If big data continues to proliferate, we may need to re-examine the validity of Moore’s Law in the future.






  • Marr, B. (2015) ‘Forbes – Big Data: Too Many Answers, Not Enough Questions’ Available at: (Accessed on 26 August 2015)
  • IBM Report (2012) ‘Analytics: The real-world use of big data How innovative enterprises extract value from uncertain data’ Available at: (Accessed on 29 August 2015)
  • Laney, D. (2001) ‘3D data management: Controlling data volume, velocity, and variety, Application Delivery Strategies’, META Group.
  • Moorthy, J., Lahiri, R., Biswas, N., Sanyal, D., Ranjan, J., Nanath, K., & Ghosh, P. (2015) ‘Big Data: Prospects and Challenges’, Vikalpa: The Journal For Decision Makers, 40, 1, pp. 74-96, Business Source Complete, EBSCOhost (Accessed on 25 August 2015)
  • Martin, K. E. (2015) ‘Ethical Issues in the Big Data Industry’, MIS Quarterly Executive, 14, 2, pp. 67-85, Business Source Complete, EBSCOhost (Accessed on 25 August 2015)
  • Jagadish, H., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J., Ramakrishnan, R., & Shahabi, C. (2014) ‘Big Data and Its Technical Challenges’, Communications Of The ACM, 57, 7, pp. 86-94, Business Source Complete, EBSCOhost (Accessed on 25 August 2015)






Appendix: Characteristics of Big Data

Big Data Dimension – Explanation

  • Volume – Quantum of data generated, stored and used is explosive now. Terabytes, Petabytes +
  • Variety – Data can now be generated through multiple channels. Examples such as Facebook and Twitter, call centres, chats, voice data, video from CCTVs of retail outlets, IoT, RFID, GIS, smart phone, SMS, etc.
  • Velocity – Real-time data is accessible in many cases such as mobile telephony, RFID, Barcode scan downs, Click stream, online transactions and blogs. The data generated from all such sources can be accumulated with the speed at which they are generated.
  • Veracity – Authenticity of the data increases with automation of data capture. With multiple sources of data, it would be possible to triangulate the results for authenticity.
  • Validity – The terms veracity and validity are often confusing. Perhaps the term validity should be understood as in the market research methodology that the data should be representing the concept that it is expected to represent.
  • Value – Return on Investment and business value are being emphasized more than value for multiple stakeholders.
  • Variability – Variance in the data is often treated as the information content in the data. With a large temporal and spatial data, there can be considerable difference in the data at different sub-set levels.
  • Venue – Multiple data platforms, data bases, data warehouses, format heterogeneity, data generated for different purposes and public and private data sources.
  • Vocabulary – New concepts, definitions, theories and technical terms are now emerging; they were not necessarily required in the earlier context, for example, MapReduce, Apache Hadoop, NoSQL and MetaData.
  • Vagueness – It relates to the confusion about the meaning and overall developments around Big Data. Though it is not necessarily characteristic of the Big Data deployment, it reflects the current context. This may change and more clarity is likely to emerge in the future.

Challenges of Big Data Deployment (Moorthy et al., 2015, p.76)





R You Ready

R – An Introduction
R is a programming language and software environment aimed at data scientists as a tool for computational statistics and visualisation. It has developed into a popular language among finance and data analytics companies. R is part of the open source revolution and has been created and supported entirely by developers and experts worldwide. R has a number of advantages, including free, downloadable packages for almost every data analysis technique, cutting edge community-reviewed methods, stunning data visualisations, faster results from a manageable programming language, and expert resources.

R Code School

The best way to get to grips with R is to take the online tutorial through ‘Try R Code School’. Though basic, it runs through the primary sections and gets you acquainted with the R programming language. The tutorial is pirate themed, which made the sections enjoyable, and the pirate in-jokes kept me entertained throughout. The seven sections in the tutorial were:
1. Using R
2. Vectors
3. Matrices
4. Summary Statistics
5. Factors
6. Data Frames
7. Real World Data
After completing each section, I was rewarded with a badge and each topic covered the basics to get me started with real world data sets.

Try R Code School Badges

Analysing the Data
Having previously worked in finance, I have an inherent interest (and experience) in financial analysis and reporting. I decided to use R to pull financial data from the Irish Stock Exchange (ISEQ), focusing on Aer Lingus shares over a ten year timeframe.
The first part of my research consisted of finding the most powerful R packages for analysing my data. The most popular packages for extracting financial time series data from internet sources were quantmod and Quandl. These packages work in a similar vein to a Bloomberg terminal but at no cost. As I was focusing on historic data, I used quantmod to extract the data. Quandl would be the preferred package when looking at futures.
I installed the quantmod package from the ‘Packages’ dropdown in R and then tested searching for data using the ticker symbol for Aer Lingus shares, AERL.L.
The getSymbols command queried Google Finance for the ticker symbol ‘AERL.L’ and retrieved any data since 01/Jan/2004. The data is returned as daily figures: Open price, High price, Low price, Close price and Volume traded.

R commands to pull AERL.L data

Now that we have the data set, it is time to analyse it and draw out some interesting information. The first chart I created was a time series showing the share price and the volume traded. It illustrates the shares following an almost U-shaped curve between 2006 and 2015.

Time Series of AERL.L data


We have a large data set giving daily prices of Aer Lingus shares over approximately a ten year period. Most time series modelling in R holds the data in an xts object, which makes it easy to extract subsets from the date range. This is widely used when extracting, say, monthly or quarterly data for additional analysis or reporting. This functionality is an example of how R can be used more effectively than older analytical tools.
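
As a rough illustration of that subsetting, the sketch below builds a small xts object and pulls out date ranges and monthly summaries. It assumes the xts package is installed; the prices are synthetic values invented for the example, not Aer Lingus data, and the variable names are hypothetical.

```r
library(xts)

# Synthetic daily closing prices for Q1 2015 (made-up values, not AERL.L data)
dates  <- seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by = "day")
prices <- xts(seq(1.00, by = 0.01, length.out = length(dates)), order.by = dates)
colnames(prices) <- "Close"

# Date strings select ranges directly
feb        <- prices["2015-02"]       # every observation in February 2015
from_mid_3 <- prices["2015-03-15/"]   # from 15 March to the end of the series

# Last observation of each month (the leading 0 from endpoints() is dropped)
month_end <- prices[endpoints(prices, on = "months")[-1]]

# Average closing price per month
monthly_mean <- apply.monthly(prices, mean)

print(month_end)
print(monthly_mean)
```

The same pattern works with `on = "quarters"` in `endpoints()` when quarterly figures are needed for reporting.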


R commands to create XTS file and view data sets

Using the capabilities of the data set, I want to plot a graph showing the closing price of the shares. This graph is exactly what would be used to present to management and is an excellent representation of the data set.


R Graph to visualise closing prices of Aer Lingus Shares

An interesting analysis is to plot the daily log return of the closing prices. The resulting time series graph shows the visual impact of volatility in the share price. We can see that during the financial crisis (2008-2009) the share price was in flux, as was the case for many traded shares at the time. Since 2010, the share price has continued to fluctuate (though at a lesser rate), which would indicate instability in the company. Based on preliminary research, the likely causes are the recovery in the business since 2010 and the recent speculation of a takeover by International Airlines Group (IAG).
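
For readers new to the term, a daily log return is the natural log of today's price divided by yesterday's price. The base R sketch below works through this on four made-up prices (not real AERL.L data); the variable names are illustrative.

```r
# Made-up closing prices for four consecutive days (not real AERL.L data)
closing <- c(1.00, 1.10, 1.05, 1.20)

# Daily log return: r_t = log(P_t / P_{t-1}), which equals diff(log(P))
log_ret <- diff(log(closing))

# Simple (percentage) returns, for comparison
simple_ret <- closing[-1] / closing[-length(closing)] - 1

# Log returns are additive: their sum is the log return over the whole period
total_log_ret <- sum(log_ret)

# A rough volatility measure: the standard deviation of the daily log returns
volatility <- sd(log_ret)

print(round(log_ret, 4))
print(round(simple_ret, 4))
```

Because log returns add up over time, they are convenient for comparing volatility across periods of different length.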


Closing daily prices (daily log return)



Summary statistics of Closing Prices
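
The four summary statistics used here (mean, standard deviation, skewness and kurtosis) can also be written out in base R, which shows what the numbers mean. This is a sketch using the common moment-based definitions; the helper names `skewness_base` and `kurtosis_base` are hypothetical and the returns are simulated, not taken from the Aer Lingus data.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skewness_base <- function(x) {
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

# Moment-based kurtosis (not excess kurtosis): roughly 3 for normal data
kurtosis_base <- function(x) {
  dev <- x - mean(x)
  mean(dev^4) / mean(dev^2)^2
}

# Simulated daily returns (illustrative only)
set.seed(42)
ret <- rnorm(250, mean = 0, sd = 0.02)

stats <- c(mean       = mean(ret),
           "std dev"  = sd(ret),
           skewness   = skewness_base(ret),
           kurtosis   = kurtosis_base(ret))
print(round(stats, 4))
```

A skewness near zero indicates gains and losses of similar size, while a kurtosis well above 3 points to more extreme daily moves than a normal distribution would produce.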

Concepts – If I had more time
It would be extremely difficult to propose a new, effective financial model in such a short timeframe. In the above example we are not using indicators, just taking data to determine market direction or trends. The example has shown the power of R for modelling data and presenting it in an excellent visual format. The data is current and can be easily updated from internet sources.
My analysis is limited in the sense that I have taken only past data, and from only one company. An excellent way to enhance the analysis would be to take competitor data and plot the series against each other. This analysis over time would give an insight into market factors.
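
One way to sketch such a comparison is to rebase each share price to 100 at the start of the period, so that companies trading at very different price levels can share one chart. The two series below are made-up numbers for two hypothetical airlines, and `rebase` is an illustrative helper.

```r
# Made-up closing prices for two hypothetical airlines (illustrative values only)
prices <- data.frame(
  date      = as.Date("2015-01-01") + 0:3,
  airline_a = c(2.00, 2.10, 2.20, 2.40),
  airline_b = c(8.00, 8.00, 7.60, 8.80)
)

# Rebase a series so that its first value equals 100
rebase <- function(x) 100 * x / x[1]

indexed <- data.frame(
  date      = prices$date,
  airline_a = rebase(prices$airline_a),
  airline_b = rebase(prices$airline_b)
)

# Plot both indexed series on one chart
plot(indexed$date, indexed$airline_a, type = "l", col = "red",
     ylim = range(indexed$airline_a, indexed$airline_b),
     xlab = "Date", ylab = "Indexed price (start = 100)",
     main = "Relative share price performance")
lines(indexed$date, indexed$airline_b, col = "blue")
legend("topleft", legend = c("Airline A", "Airline B"),
       col = c("red", "blue"), lty = 1)
```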
Quandl is another package, which looks at futures based on financial data. It would be an excellent tool for creating models and predicting future prices based on this information. Another approach would be to analyse internet trends and keywords to see if there is a correlation with market movement. R would be able to analyse large data sets over time, and the results could be plotted against the share price chart.
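
A first pass at the correlation idea could look like the sketch below. Both series are invented for illustration (a weekly keyword interest score and a weekly closing price); comparing week-on-week changes rather than raw levels is a design choice made here to avoid two trending series appearing related simply because both rise over time.

```r
# Made-up weekly figures (illustrative only, not real search or share data)
search_interest <- c(40, 42, 55, 70, 65, 80)
close_price     <- c(1.90, 1.92, 2.00, 2.15, 2.10, 2.30)

# Week-on-week log changes in each series
d_search <- diff(log(search_interest))
d_price  <- diff(log(close_price))

# Pearson correlation between the two sets of changes
rho <- cor(d_search, d_price)

# cor.test() adds a p-value and a confidence interval
result <- cor.test(d_search, d_price)

print(round(rho, 3))
print(result$p.value)
```

A high correlation on its own would not show that search interest drives the share price; it would only flag a relationship worth investigating further.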


  • Cookbook for R, (Accessed: 01 August 2015)
  • Irish Stock Exchange (2015) ‘Market Data’ (Accessed: 01 August 2015)
  • Playing Financial Data Series(1), Chenangen, (2014) (Accessed: 03 August 2015)
  • Quantitative Finance Applications in R (Internet Sources), Joseph Rickert (2013) (Accessed: 03 August 2015)
  • Quantitative Finance Applications in R (XTS), Joseph Rickert (2014) (Accessed: 03 August 2015)
  • Revolution Analytics (2015) ‘R is Hot’ Available at: (Accessed: 03 August 2015)
  • Revolution Analytics (2015) ‘What R’ (Accessed: 03 August 2015)
  • Try R Code School (2015) (Accessed: 28 July 2015)

R Editor Commands
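# Load quantmod, xts and moments
# Note: assignments that were lost from this listing are reconstructed
# from the surrounding comments and text
library(quantmod)
library(xts)
library(moments) # to get skew & kurtosis

# Query Google Finance and pull data since 01/08/2005
# (quantmod has since retired src = "google"; this records the 2015 call)
getSymbols("AERL.L", src = "google", from = "2005-08-01", to = "2015-08-14")

# Plot a time series chart of price and volume
# (original command missing; chartSeries() is an assumed stand-in)
chartSeries(AERL.L)

# Create an xts object of the closing prices
# (original right-hand side missing; Cl() is an assumed stand-in)
AERL.L.Close <- Cl(AERL.L)
is.xts(AERL.L.Close) # returns TRUE

# View the dataset
head(AERL.L.Close)

# Plot a graphic profile of the data
plot(AERL.L.Close, main = "Closing Daily Prices for Aer Lingus Shares (AERL.L)",
     col = "red", xlab = "Date", ylab = "Price", major.ticks = "years",
     minor.ticks = FALSE)

# Compute the daily log returns of the closing price and plot them
AERL.L.ret <- diff(log(AERL.L.Close), lag = 1)
AERL.L.ret <- AERL.L.ret[-1] # remove the NA in the first position

plot(AERL.L.ret, main = "Daily Log Returns of Closing Prices for Aer Lingus (AERL.L)",
     col = "red", xlab = "Date", ylab = "Return", major.ticks = "years",
     minor.ticks = FALSE)

# Find Mean, Std Dev (volatility), Skewness & Kurtosis of the returns
statNames <- c("mean", "std dev", "skewness", "kurtosis")
AERL.L.stats <- c(mean(AERL.L.ret), sd(AERL.L.ret),
                  skewness(AERL.L.ret), kurtosis(AERL.L.ret))
names(AERL.L.stats) <- statNames
AERL.L.stats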

Google Fusion Table

Irish Population per County (Source: CSO Census Data)

Google Fusion Tables – Visualise your data

Okay so you have gathered some awesome data and you want to impress your boss with some useful information. Now while bar charts have their place, here is a way to make data visually alive. Thankfully there is a useful application which will do the hard work for you, and impress your boss at the same time.

“Google Fusion Tables is an experimental data visualization web application to gather, visualize, and share data tables.”

Google Fusion Tables is a web application tool used to create a visual interpretation of data sets. Data tables can be gathered from public data or imported from your own data. The data is then visualised and can be published and shared on the web. There is a real collaborative feel to the application and the information can be communicated to your target audience with ease.
To set up Google Fusion Tables, first create a Google account and sign in to My Drive. Then simply connect Fusion Tables as a new application, for free, and you are ready to begin.

Designing an Irish Population Heat Map
To create a Heat Map of the Irish population by county we needed two specific data tables, namely:
• Population figures by county (.csv file)
• Counties of Ireland data map (.kml file)
There are various ways these can be created, but for this Heat Map the population figures were taken from the most recent CSO census, which was carried out in 2011.
The map data was derived from a KML data file and contained geometry data on all the counties in the Republic of Ireland. This data was used to plot the county boundaries in Google Maps.

The next step was to cleanse the data, which is important for any data exercise. The data from the CSO population table was converted into an Excel document, and it was noticed that some of the counties included subsets which needed to be amended. The ‘State’ and ‘Provinces’ rows were removed and the data for Tipperary North and South were combined into one county. This left the data with 26 counties and a corresponding population figure for each county.
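
The same cleansing steps could be scripted rather than done by hand in Excel. The base R sketch below uses a handful of rows with placeholder population values (not CSO figures), and the row labels for the State and provinces are assumptions about how the table names them.

```r
# Illustrative rows in the shape of a census table
# (population values are placeholders, not CSO figures)
census <- data.frame(
  area       = c("State", "Leinster", "Carlow",
                 "Tipperary North", "Tipperary South", "Kerry"),
  population = c(1000, 600, 50, 70, 80, 140),
  stringsAsFactors = FALSE
)

# Step 1: drop the aggregate rows (labels assumed for this example)
aggregates <- c("State", "Leinster", "Munster", "Connacht", "Ulster")
counties <- census[!census$area %in% aggregates, ]

# Step 2: relabel Tipperary North and South as a single county
counties$area <- sub("^Tipperary (North|South)$", "Tipperary", counties$area)

# Step 3: sum the population per county, which merges the two Tipperary rows
counties <- aggregate(population ~ area, data = counties, FUN = sum)

print(counties)
```

Applied to the full table, the result should contain one row per county, which is an easy check to run before uploading the file.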

The KML file was imported into Fusion Tables and contained 99 rows in total: the geometry data for the counties. This step is very important, as the data in the two tables must be compatible (for example, matching county names) or the files will not merge correctly.
These tables were uploaded into Google Fusion Tables ready to be ‘Merged’. This is where the power of Fusion Tables comes into its own. The Map file was opened and from File
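
The merge itself is a join on the county name, and the compatibility requirement can be checked before joining. The sketch below imitates this in base R with tiny made-up tables; the column names, the placeholder geometry strings and the population values are all assumptions for illustration.

```r
# Cleaned population table (placeholder values, not CSO figures)
population <- data.frame(
  county     = c("Carlow", "Kerry", "Tipperary"),
  population = c(50, 140, 150),
  stringsAsFactors = FALSE
)

# Geometry table: a county may appear on several rows, one per boundary shape
geometry <- data.frame(
  name     = c("Carlow", "Kerry", "Kerry", "Tipperary North"),
  geometry = c("<polygon 1>", "<polygon 2>", "<polygon 3>", "<polygon 4>"),
  stringsAsFactors = FALSE
)

# Compatibility check: which counties have no matching geometry row?
unmatched <- setdiff(population$county, geometry$name)
print(unmatched)

# Join the two tables on the county name; unmatched rows are dropped
merged <- merge(geometry, population, by.x = "name", by.y = "county")
print(merged)
```

Here "Tipperary" fails to match "Tipperary North", so it would silently vanish from the map. Running the check first shows which names need to be aligned in one of the two tables.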

A new tab was created, with the merged data given a visual representation of the population of Ireland by county for 2011. At this point, the map needs to be edited to give the Heat Map some visual meaning. It was decided to distribute the counties into six buckets based on population. The buckets were: 0 – 75,000 (6 counties), 75,000 – 100,000 (4), 100,000 – 125,000 (4), 125,000 – 180,000 (6), 180,000 – 250,000 (3) and 250,000 – 1,273,070 (3). Though the counties were not evenly distributed across the buckets, they were easier to distinguish and the map had a clearer visual impact. The counties could instead have been evenly distributed by breaking the data from the population table into equal-sized sets. Each bucket was given a colour which was incrementally darker as the population increased. A legend was created, which gives the Heat Map more context when distinguishing the counties’ population numbers.
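
The bucketing can be reproduced outside Fusion Tables with `cut()`. The sketch below uses the six bucket boundaries listed above on a few made-up population values (not CSO figures), and also shows the evenly distributed alternative using quantiles.

```r
# Made-up county populations (illustrative only, not CSO figures)
pop <- c(60000, 80000, 110000, 150000, 200000, 500000)

# Bucket boundaries from the text
breaks <- c(0, 75000, 100000, 125000, 180000, 250000, 1273070)

# Assign each county to a bucket; each interval includes its upper boundary
bucket <- cut(pop, breaks = breaks, include.lowest = TRUE, dig.lab = 7)
print(table(bucket))

# Alternative: six buckets holding roughly equal numbers of counties
even_breaks <- quantile(pop, probs = seq(0, 1, length.out = 7))
even_bucket <- cut(pop, breaks = even_breaks, include.lowest = TRUE, dig.lab = 7)
print(table(even_bucket))

# One incrementally darker colour per bucket for the map legend
shades <- gray(seq(0.9, 0.2, length.out = 6))
county_colour <- shades[as.integer(bucket)]
```

Fixed boundaries make the legend easy to read, while quantile boundaries guarantee that every colour is used by a similar number of counties.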

I have made my data public, which is an important feature of Google Fusion Tables. Anyone can now take my data and use it to carry out further research on population in Ireland.

Irish Population Data in action
The Heat Map of Irish population could be used in a number of interesting ways depending on the data gathered. The CSO website has a number of detailed databases with well-presented data sets on topics including housing, health, education, labour market, tourism and transport. These could be used at a macro level by the government to decide on future spending requirements in certain areas. The country is experiencing a housing shortage and the government is expected to deliver social housing projects. To identify where social housing is most needed, the government could combine data on social housing applicants by area with the population data. Plotting these two data sets would give a nationwide Heat Map and identify the areas of greatest need on a more local scale. The KML data would need to be more granular to target specific areas within counties. A well-presented Heat Map would give an excellent representation of specific area shortages and therefore of where funding is most needed.
Taking data from the 2011 CSO Census, another Heat Map was created showing the vacancy rates of housing per county. The Heat Map below shows properties that are left vacant per county. This is another example of using CSO data to present a visual Heat Map when reporting on social issues.

Further Practical uses for Fusion Tables
Google Fusion Tables offers a variety of functions for visually interpreting your data. Scene perception studies suggest that people understand pictures better when colour is used. The most recognisable example is weather reporting: news and weather reports present predicted weather patterns and forecasts visually. This information is consumed and used for sea crossings, floods, farming, heatwaves, icy roads and journey planning.
An excellent use of Heat Maps has been in research on global warming patterns. Predictive maps are powerful when publishing outcomes. Seeing global warming patterns visually can help readers understand, and give meaning to, often complicated data sets.

Google Fusion Tables is an excellent application to present data in a clear visual format. The application is extremely useful for taking geometry data in a KML file and creating a Heat Map using Google Maps. This visual representation, when shared, is an interactive way to present your data to a wider audience. The collaboration element gives the opportunity to enhance data and findings based on original data sources. The application has the potential to give a greater understanding of data sets, in a user friendly visual format.