Under The Hood of Testing Big Data Applications

Author : Venkatagiri P

The Internet is filled with information on what big data is and on the tools used to capture, manage and process big data sets. However, there is limited content available on devising a test strategy for big data applications, or on how big data should be approached from a testing point of view. Through this series of articles, let’s try to understand how traditional data processing differs from processing large data sets, and look at how testers should strategize and approach testing for them.


In earlier times, the traditional tester wrote simple read/write queries against the database to store, retrieve and compare data. As data sizes grew with business needs and technologies such as data warehousing emerged, testing began to demand specialized skills, which gave rise to new designations within the tester community: “Database Testers”, “ETL Testers” and “Data Warehouse Testers”. Now, with the advent of big data, things are only getting more complex for the tester. What kind of challenges does it pose? Does the tester need new skills? How can previous testing experience be supplemented?


What is Big Data?
A classic dictionary definition is that big data is “extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions”.

Lately, the term “big data” tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

Examples of big data include social media data, trading data from stock exchanges, airline data, retail data and web page analytics (including search engine data).

Typically, big data can be divided into the following types of data:

  • Structured data : Relational data.
  • Semi Structured data : XML and JSON files.
  • Unstructured data : Social media, Images, Videos, Blogs, Media Logs.
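
To make the three categories concrete from a tester's point of view, here is a minimal Python sketch that reads one sample of each kind and reports what a test could reasonably assert on. The sample values and the helper name describe_sample are hypothetical and exist only for illustration; they are not taken from any particular tool.

```python
import csv
import io
import json


def describe_sample(kind, raw):
    """Return the things a test could assert on for each data type."""
    if kind == "structured":
        # Relational-style data: fixed columns, every row has the same shape.
        rows = list(csv.DictReader(io.StringIO(raw)))
        return sorted(rows[0].keys())
    if kind == "semi_structured":
        # JSON: keys exist, but they can be nested or optional per record.
        return sorted(json.loads(raw).keys())
    # Unstructured: no schema, so only generic properties can be checked.
    return ["length=%d" % len(raw)]


# Hypothetical samples, one per category.
structured = "id,name\n1,Asha\n"
semi_structured = '{"id": 1, "tags": ["a", "b"]}'
unstructured = "Loved the new release!"

print(describe_sample("structured", structured))
print(describe_sample("semi_structured", semi_structured))
print(describe_sample("unstructured", unstructured))
```

The point of the sketch is that the further the data moves from a fixed schema, the less a test can rely on column-by-column comparison and the more it has to rely on generic or statistical checks.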


When thinking of testing big data applications, we need to take the following key characteristics of Big Data into account at each stage of the Big Data lifecycle:

Volume – The application's ability to handle large volumes of data at all stages, from ingestion through transformation to consumption by the application, including the business logic that handles those volumes.
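
One simple volume-oriented check is record count reconciliation between two stages. The sketch below is only an illustration under stated assumptions: the function names count_records and reconcile are hypothetical, and the "stages" are plain Python iterables standing in for real ingestion and transformation outputs.

```python
def count_records(records):
    """Count items from any iterable without holding them all in memory."""
    return sum(1 for _ in records)


def reconcile(ingested, transformed, expected_dropped=0):
    """Return True if the two stages differ by exactly the expected number
    of records, i.e. nothing was silently lost or duplicated."""
    return count_records(ingested) - count_records(transformed) == expected_dropped


# Hypothetical example: the transformation is expected to filter out
# every tenth record from one million ingested records.
ingested = range(1_000_000)
transformed = (x for x in range(1_000_000) if x % 10)
print(reconcile(ingested, transformed, expected_dropped=100_000))
```

Counting with a generator keeps memory use flat, which matters once the data no longer fits on a single test machine.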


Velocity – The speed at which data is generated or refreshed. The approach to testing (its priority and focus) may change depending on this characteristic of the data.


Variety – The sources of the data, and the ability of the application to handle not only structured data but also semi-structured and, most often, unstructured data, will be a focus of testing.


Veracity – The level of “noise” in the data, the data cleansing algorithms used, and the resulting data set can all affect the focus and priority of the testing.
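
As a rough illustration of how noise and cleansing can be measured in a test, the sketch below computes a noise ratio before and after a cleansing step. The noise rule (a blank or missing user_id field) and the names is_noisy, cleanse and noise_ratio are assumptions made for this example; a real application would define noise according to its own data.

```python
def is_noisy(record):
    """Hypothetical rule: a record is noise if user_id is missing or blank."""
    return not str(record.get("user_id") or "").strip()


def cleanse(records):
    """Stand-in for a cleansing step: drop every noisy record."""
    return [r for r in records if not is_noisy(r)]


def noise_ratio(records):
    """Fraction of records that are noise (0.0 for an empty input)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_noisy(r) for r in records) / len(records)


# Hypothetical raw input with two bad records out of four.
raw = [{"user_id": "u1"}, {"user_id": ""}, {"user_id": None}, {"user_id": "u2"}]
print(noise_ratio(raw))
print(noise_ratio(cleanse(raw)))
```

A test could then assert that the ratio after cleansing stays below an agreed threshold, and that the number of dropped records matches the noise measured in the input.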


Variability – Another factor to keep in mind is variability: the data is constantly changing, which raises the question of how the end results can be validated.


In subsequent articles, we shall see how basing a testing approach on these 5 characteristics can sharpen the focus of testing at each stage of the big data lifecycle.