Big Data, Better Data?

Big Data Analysis

Businesses are experiencing a tsunami of data and a paradigm shift in how we make it meaningful. In 1965, Gordon Moore predicted that computing power would increase, and its cost decrease, at an exponential rate. Moore’s Law has held up ever since. So, as we move towards big data, is Moore’s Law still valid?
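As a rough illustration of that exponential growth – Moore’s observation is commonly restated as a doubling roughly every two years – the sketch below projects transistor counts from an illustrative 1971 baseline (the Intel 4004’s roughly 2,300 transistors); the figures are for illustration, not a precise model:

```python
# Moore's Law as commonly stated: transistor counts double roughly
# every two years. Baseline values here are illustrative assumptions.
BASE_YEAR = 1971
BASE_TRANSISTORS = 2_300          # roughly the Intel 4004 era
DOUBLING_PERIOD_YEARS = 2

def projected_transistors(year):
    """Project a transistor count for `year` under a 2-year doubling rule."""
    return BASE_TRANSISTORS * 2 ** ((year - BASE_YEAR) / DOUBLING_PERIOD_YEARS)

for year in (1971, 1991, 2011):
    print(year, f"{projected_transistors(year):,.0f}")
# Doubling every 2 years compounds to ~1,000x every 20 years -- the
# exponential growth that big data volumes now strain against.
```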

The evolution of data management (see figure below) runs from enterprise resource planning, through customer relationship management and web-based systems, and finally to big data. Data is now too large and too unstructured to justify the use of traditional relational database designs.

[Figure: the evolution of data management – from ERP and CRM through web-based systems to big data]

Source: http://image.slidesharecdn.com/bigdata-140128092341-phpapp02/95/big-data-6-638.jpg?cb=1390901096

Though traditional database management and business intelligence systems (e.g. SQL-based relational databases) are still in use, the tide is moving towards the challenges of storing and manipulating big data for management. The biggest winner by far has been Hadoop, the open-source software framework managed by the Apache Software Foundation. Businesses are now starting to see how they can use big data tools to make decisions. Some examples of big data in action today are Spotify recommending playlists, Facebook suggesting friends and Netflix picking your next film or box set.
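To make the Hadoop approach concrete, here is a minimal sketch – in plain Python rather than Hadoop’s native Java API – of the MapReduce pattern at the heart of the framework: a map step emits key–value pairs, a shuffle groups them by key, and a reduce step aggregates each group. The documents and counts are illustrative only:

```python
from collections import defaultdict

# Map phase: emit (word, 1) for every word in a document.
def map_fn(document):
    for word in document.lower().split():
        yield word, 1

# Shuffle phase: group the emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values collected for each key.
def reduce_fn(key, values):
    return key, sum(values)

documents = ["big data big decisions", "data drives decisions"]
pairs = (pair for doc in documents for pair in map_fn(doc))
counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'decisions': 2, 'drives': 1}
```

Hadoop distributes exactly this pattern across a cluster of machines, which is what lets the same computation scale far beyond a single server.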

Asking the right questions

Big data boils down to one thing when looking for an outcome – asking the right question!

The example from Douglas Adams’s ‘The Hitchhiker’s Guide to the Galaxy’ is an apt parable for big data. A supercomputer spends seven and a half million years computing the answer to life, the universe and everything, and calculates that answer to be ‘42’. When the answer is met with protest, the computer explains that, now that they have the answer, they need to discover what the actual question is – which requires an even more sophisticated computer.

All data is meaningless without the skill to analyse it and yield results. The most successful companies make decisions based on facts and information. A business must create a strategy and be clear about what information it needs to achieve its goals. Take, for example, a company that wants to increase its customer base. Good questions to ask would be ‘who are our current customers?’ and ‘what are the demographics of our most valued customers?’ – this makes it easier to identify the big data that should be gathered (Marr, 2015).
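As a minimal sketch of turning those two questions into analysis – the column names, sample records and ‘valued customer’ spend threshold are all hypothetical – a few lines of pandas can answer both from a customer table:

```python
import pandas as pd

# Hypothetical customer records; real data would come from CRM,
# web and transaction sources.
customers = pd.DataFrame({
    "customer_id":  [1, 2, 3, 4, 5],
    "age_band":     ["18-24", "25-34", "25-34", "35-44", "35-44"],
    "region":       ["North", "South", "North", "North", "South"],
    "annual_spend": [120.0, 940.0, 610.0, 1480.0, 75.0],
})

# 'Who are our current customers?' -- simple counts per segment.
print(customers.groupby(["age_band", "region"]).size())

# 'What are the demographics of our most valued customers?' --
# an assumed spend threshold defines 'valued'.
VALUED_THRESHOLD = 500.0
valued = customers[customers["annual_spend"] >= VALUED_THRESHOLD]
print(valued.groupby("age_band")["annual_spend"].agg(["count", "mean"]))
```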


Sources of Big Data

The real problem with big data is not storage or analysis – it is transforming the relevant data into useful information. This is not a new phenomenon; making data relevant has been an issue with all data sources, not only big data. Organisations that architect and design their big data systems well can create a competitive advantage.

Volume, Variety, Velocity (& Veracity)

When analysing the dimensions that characterise big data, prevailing theory outlines three distinct elements known as the 3Vs – Volume (scale or size), Variety (sources) and Velocity (motion). Numerous studies have identified additional elements (see appendix below – all beginning with V!), but the seminal work from Douglas Laney (2001) is still relevant. The 3Vs must be taken into consideration when designing an organisation’s business intelligence model. However, I would argue for the inclusion of a fourth V when characterising big data, namely Veracity.

Veracity is an important feature of big data because, no matter how accurate data seems, there will always be an inherent uncertainty. Examples include weather patterns, human sentiment, economic factors and future trends (IBM Report, 2012). No amount of data cleansing will make such data fully accurate. The challenge is to gain information, analyse and forecast using this ‘uncertain data’ and still produce valuable information. An organisation must factor in all four Vs to create a competitive advantage from a big data strategic plan.
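One simple way to work with such ‘uncertain data’ – sketched below under invented assumptions (the readings, their sources and the confidence scores are all illustrative) – is to carry an explicit confidence score with each observation and weight any aggregate by it, rather than discarding uncertain sources outright:

```python
# Each reading pairs a value with an estimated confidence in [0, 1];
# the values and scores here are illustrative, not from any real source.
readings = [
    {"value": 102.0, "confidence": 0.9},  # clean, automated capture
    {"value": 118.0, "confidence": 0.6},  # manually keyed, possible errors
    {"value": 95.0,  "confidence": 0.3},  # inferred from social sentiment
]

# Confidence-weighted mean: trust accurate sources more, but still
# extract signal from uncertain ones instead of throwing them away.
total_weight = sum(r["confidence"] for r in readings)
weighted_mean = sum(r["value"] * r["confidence"] for r in readings) / total_weight
print(f"confidence-weighted estimate: {weighted_mean:.1f}")
```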

Big Data challenges

The main challenges arising from the characteristics of big data include (Jagadish et al., 2014):

  • Heterogeneity – the structure of the data must be interpreted, and metadata is required
  • Scale – data sizes exceed the capacity of individual machines, demanding parallelism across nodes and cloud computing
  • Inconsistency and incompleteness – errors arising from diverse sources must be identified and corrected or mitigated (see the sketch after this list)
  • Timeliness – real-time techniques are needed to filter and summarise data
  • Privacy and data ownership – there are laws to consider, but also a philosophical argument over who ‘owns’ personal data.
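To make two of these challenges concrete, here is a minimal sketch – the field names and mapping rules are hypothetical – of handling heterogeneity (normalising records from different sources onto one schema) and incompleteness (flagging gaps rather than silently dropping records):

```python
# Records arrive from heterogeneous sources with different field names
# and gaps; all names and rules here are hypothetical.
raw_records = [
    {"cust_id": "A1", "email": "a@example.com"},      # CRM export
    {"customerId": "B2", "mail": None},               # web log, incomplete
    {"id": "C3", "contact_email": "c@example.com"},   # third-party feed
]

# Map each source's field names onto one common schema (schema-on-read).
FIELD_MAP = {
    "cust_id": "customer_id", "customerId": "customer_id", "id": "customer_id",
    "email": "email", "mail": "email", "contact_email": "email",
}

def normalise(record):
    return {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}

clean, incomplete = [], []
for rec in map(normalise, raw_records):
    # Incomplete records are flagged for correction, not silently dropped.
    (clean if rec.get("email") else incomplete).append(rec)

print("clean:", clean)
print("flagged as incomplete:", incomplete)
```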

These are the more technical challenges currently faced when looking at big data. There are also wider economic, social and political issues, which need to be addressed at an international level. We have seen the fallout from the NSA’s collection of data and its impact on people. Big data, and especially personal information, will become more freely available, and this has privacy and security implications. Organisations will face a number of hazards in the future, not only technical but moral (Martin, 2015).

Conclusion – Future Direction

It is no longer feasible to handle ever-increasing volumes of data as we did in the past. Big data has caused a fundamental shift, and CPU speeds and other resources struggle to keep pace with these data volumes. Moore’s Law is being seriously challenged but is, so far, holding up in the face of big data. If big data continues to proliferate, we may need to re-examine the validity of Moore’s Law in the future.

References

  • Marr, B. (2015) ‘Big Data: Too Many Answers, Not Enough Questions’, Forbes. Available at: http://www.forbes.com/sites/bernardmarr/2015/08/25/big-data-too-many-answers-not-enough-questions/2/ (Accessed: 26 August 2015)
  • IBM (2012) ‘Analytics: The real-world use of big data – How innovative enterprises extract value from uncertain data’. Available at: http://www-03.ibm.com/systems/hu/resources/the_real_word_use_of_big_data.pdf (Accessed: 29 August 2015)
  • Laney, D. (2001) ‘3D data management: Controlling data volume, velocity, and variety’, Application Delivery Strategies, META Group.
  • Moorthy, J., Lahiri, R., Biswas, N., Sanyal, D., Ranjan, J., Nanath, K. and Ghosh, P. (2015) ‘Big Data: Prospects and Challenges’, Vikalpa: The Journal for Decision Makers, 40(1), pp. 74-96.
  • Martin, K.E. (2015) ‘Ethical Issues in the Big Data Industry’, MIS Quarterly Executive, 14(2), pp. 67-85.
  • Jagadish, H., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J., Ramakrishnan, R. and Shahabi, C. (2014) ‘Big Data and Its Technical Challenges’, Communications of the ACM, 57(7), pp. 86-94.

Appendix: Characteristics of Big Data

  • Volume – The quantum of data generated, stored and used is now explosive: terabytes, petabytes and beyond.
  • Variety – Data is now generated through multiple channels, such as Facebook and Twitter, call centres, chats, voice data, video from CCTV in retail outlets, IoT, RFID, GIS, smartphones, SMS, etc.
  • Velocity – Real-time data is accessible in many cases, such as mobile telephony, RFID, barcode scan-downs, click streams, online transactions and blogs. Data from all such sources can be accumulated at the speed at which it is generated.
  • Veracity – The authenticity of data increases with automated data capture. With multiple sources of data, results can be triangulated for authenticity.
  • Validity – Veracity and validity are often confused. Validity is best understood as in market research methodology: the data should represent the concept it is expected to represent.
  • Value – Return on investment and business value are emphasised more than value for multiple stakeholders.
  • Variability – Variance in the data is often treated as its information content. With large temporal and spatial data sets, there can be considerable differences between subsets.
  • Venue – Multiple data platforms, databases, data warehouses, format heterogeneity, data generated for different purposes, and public and private data sources.
  • Vocabulary – New concepts, definitions, theories and technical terms are emerging that were not required in the earlier context, for example MapReduce, Apache Hadoop, NoSQL and metadata.
  • Vagueness – Confusion about the meaning of, and overall developments around, big data. Though not strictly a characteristic of big data deployment, it reflects the current context; more clarity is likely to emerge in the future.
Characteristics of big data (adapted from Moorthy et al., 2015, p.76)