BIG DATA TESTING POINT OF VIEW

1.  Overview

As organizations adopt "Big Data" as their data analytics solution, they find it difficult to define a robust testing strategy and to set up an optimal test environment for Big Data.

This is mostly due to a lack of knowledge and understanding of Big Data testing. Big Data involves processing huge volumes of structured and unstructured data across different nodes using frameworks and languages such as MapReduce, Hive and Pig.

 A robust testing strategy needs to be defined well in advance in order to ensure that the functional and non-functional requirements are met and that the data conforms to acceptable quality.

In this document we define recommended test approaches for testing Big Data projects.


2.  Definition

We are living in the data age. Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.

Big Data refers to the massive amounts of data collected over time that are difficult to analyze and handle using common database management tools. Below are a few of the heterogeneous sources from which data is collected:

§  sensors used to gather climate information,
§  posts to social media sites,
§  digital pictures and videos,
§  purchase transaction records,
§  military surveillance,
§  e-commerce,
§  complex scientific research data, and
§  mobile phone GPS signals, to name a few.
In short, Big Data is an assortment of data sets so huge and complex that they are hard to capture, store, process, retrieve and analyze with traditional approaches.

Roughly 90% of the data collected is unstructured and only 10% is structured, so there is a pressing need to evaluate and analyze that unstructured majority.

Example:

Consumer product companies and retail organizations monitor social media such as Facebook and Twitter to get an unprecedented view into customer behavior, preferences and product perception.

3.  Characteristics of BIG DATA

3.1.      Volume
Big data implies enormous volumes of data.
Nowadays data is generated by machines, networks and human interaction on systems such as social media, and we see exponential growth in data storage because data is now much more than text. We find videos, music and large images on social media channels, measured in terabytes and even petabytes (e.g., blog text is a few kilobytes, voice calls or video files are a few megabytes, and sensor data, machine logs and clickstream data can run into gigabytes).
From a QA perspective, the big challenge is to ensure that the entire data set being processed is correct.
E.g. - Millions of smartphones send a variety of information to the network infrastructure, and factories, pipelines etc. generate readings from multiple sensors. Much of this data did not exist five years ago; the result is that more data sources of larger size combine to increase the volume of data that has to be analyzed and tested.

3.2.      Variety
Variety refers to the many sources and types of data: structured, semi-structured and unstructured.
We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of structured and unstructured data creates problems for storage, mining and analyzing data.
E.g. - Big Data comes in multiple formats as it ranges from emails to tweets to social media and sensor data. There is no control over the input data format or the structure of the data.

3.3.      Velocity
Big Data Velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction with things like social media sites, mobile devices, etc.
The flow of data is massive and continuous.
With the advent of Big Data, understanding the velocity of data is extremely important. The basic reason for this arises from the fact that in the early days of data processing, we used to analyze data in batches, acquired over time.
Typically, data is broken into fixed-size chunks and processed through different layers from source to targets, and the end result is stored in a data warehouse for further use in reporting and analysis. This data processing technique in batches or micro batches works great when the flow of input data is at a fixed rate and results are used for analysis with all process delays. The scalability and throughput of the data processing architecture is maintained due to the fixed size of the batches.
In the case of Big Data, the data streams in continuously, and the result sets are useful only when the acquisition and processing delays are short. This is where the need becomes critical for an ingestion and processing engine that can work at extremely scalable speeds on highly volatile sizes of data in a minimal amount of time.
E.g. - The most popular way to share pictures, music, and data today is via mobile devices. The sheer volume of data that is transmitted by mobile networks provides insights to the providers on the performance of their network, the amount of data processed at each tower, the time of day, the associated geographies, user demographics, location, latencies, and much more. The velocity of data movement is unpredictable, and sometimes can cause a network to crash.
The data movement and its study have enabled mobile service providers to improve the QoS (quality of service), and associating this data with social media inputs has enabled insights into competitive intelligence. 

4.  Testing BIG Data Implementations - How is this different from Testing DWH Implementations?

Whether it is a Data Warehouse (DWH) or a BIG Data Storage system, the basic component that's of interest to us, the testers, is the 'Data'.
At the fundamental level, data validation in both of these storage systems means validating the data against the source systems for the defined business rules.
Let us look at the differences between DWH testing and Big Data testing from the following 3 perspectives:

·         Data
·         Infrastructure
·         Validation tools


Data
·         DWH: A DWH tester has the advantage of working with 'structured' data.
·         Big Data: A Big Data tester has to work with 'unstructured or semi-structured' data (data with a dynamic schema).

Infrastructure
·         DWH: RDBMS-based databases (Oracle, SQL Server etc.) are installed on an ordinary file system, so testing DWH systems does not require any special test environment; it can be done from within the file system.
·         Big Data: Testing Big Data in HDFS requires a test environment that is itself based on HDFS. Testers need to learn how to work with HDFS, as it is different from working with an ordinary file system.

Validation tools
·         DWH: The DWH tester works with structured query language (SQL), so querying and validating the data is easy.
·         Big Data: NoSQL databases are document stores, key-value stores, graph databases or wide-column stores, so to validate Big Data the tester has to write JUnit or Unix scripts containing the validation logic.

5.   Big Data Testing Approach

Big Data deals with huge volumes of data processed across multiple nodes, so there is a high chance of bad data at each stage of the process.
Testing should be performed at each of the following phases to ensure that data is processed without any errors:

·         Pre- Hadoop processing
·         Map reduce process
·         Data extract and load in EDW
·         Reports

5.1.      Pre- Hadoop processing

1.    Comparing the input data files against the source system data to ensure the data is extracted correctly.
2.    Validating the data requirements and ensuring the right data is extracted.
3.    Validating that the files are loaded into HDFS correctly.
4.   Validating that the input files are split, moved and replicated across different data nodes (a minimal comparison sketch follows this list).
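As an illustration of the first and third checks above, a possible lightweight approach is to compare record counts and checksums of the source extract against the file landed in HDFS. The Python sketch below is only a sketch: the file paths are hypothetical and it assumes the 'hdfs' command-line client is available on the test machine.

import hashlib
import subprocess

SOURCE_FILE = "/staging/orders_2024.csv"          # hypothetical source system extract
HDFS_FILE = "/user/etl/input/orders_2024.csv"     # hypothetical file loaded into HDFS

def local_bytes(path):
    with open(path, "rb") as f:
        return f.read()

def hdfs_bytes(path):
    # 'hdfs dfs -cat' streams the HDFS file content to stdout
    return subprocess.run(["hdfs", "dfs", "-cat", path],
                          check=True, capture_output=True).stdout

src = local_bytes(SOURCE_FILE)
tgt = hdfs_bytes(HDFS_FILE)

assert len(src.splitlines()) == len(tgt.splitlines()), "record count mismatch"
assert hashlib.md5(src).hexdigest() == hashlib.md5(tgt).hexdigest(), "checksum mismatch"
print("Pre-Hadoop validation passed: record counts and checksums match")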

5.2.      Map Reduce Process:

1.    Validating that data processing is completed and the output file is generated.
2.    Validating the business logic on a standalone node and then against multiple nodes.
3.    Validating the MapReduce process to verify that key-value pairs are generated correctly.
4.    Validating the aggregation process after the reduce step.
5.   Validating the output data and its format against the source file and ensuring the data processing is complete (see the validation sketch below).
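One way to cover points 3 to 5 above is to recompute the expected aggregation independently from the source file and compare it with the reducer output. The sketch below assumes a simple tab-separated "key value" layout for both files; the file names and layout are illustrative, not a prescribed format.

from collections import defaultdict

SOURCE_FILE = "input/transactions.tsv"   # hypothetical source: "customer_id<TAB>amount" per line
JOB_OUTPUT = "output/part-r-00000"       # hypothetical reducer output: "customer_id<TAB>total"

# Independently recompute the expected aggregation from the source data
expected = defaultdict(float)
with open(SOURCE_FILE) as f:
    for line in f:
        key, amount = line.rstrip("\n").split("\t")
        expected[key] += float(amount)

# Read the key-value pairs actually produced by the reduce phase
actual = {}
with open(JOB_OUTPUT) as f:
    for line in f:
        key, total = line.rstrip("\n").split("\t")
        actual[key] = float(total)

missing = set(expected) - set(actual)
wrong = [k for k in expected if k in actual and abs(expected[k] - actual[k]) > 1e-6]

assert not missing, f"keys missing from job output: {missing}"
assert not wrong, f"aggregated values differ for keys: {wrong}"
print(f"{len(expected)} keys validated against the source data")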

5.3.      Data extract and load in EDW:

1.    Validating that transformation rules are applied correctly.
2.    Validating the data load into the target system.
3.    Validating the aggregation of data.
4.    Validating the data integrity in the target system (a reconciliation sketch follows this list).
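A minimal reconciliation sketch for these checks is shown below. It compares the row count and an aggregate between the extract staged from HDFS and the target warehouse table; pyodbc, the "DSN=EDW" data source and the table/column names are assumptions for illustration only.

import pyodbc

HDFS_EXPORT = "output/customer_totals.tsv"   # hypothetical extract staged from HDFS
conn = pyodbc.connect("DSN=EDW")             # hypothetical ODBC data source for the warehouse
cur = conn.cursor()

# Row count check: every record extracted from HDFS should land in the target table
with open(HDFS_EXPORT) as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
cur.execute("SELECT COUNT(*) FROM dw.customer_totals")       # hypothetical target table
assert len(rows) == cur.fetchone()[0], "row count mismatch between HDFS extract and EDW"

# Aggregation check: the warehouse total should match the staged extract
staged_total = sum(float(r[1]) for r in rows)
cur.execute("SELECT SUM(total_amount) FROM dw.customer_totals")
assert abs(staged_total - float(cur.fetchone()[0])) < 0.01, "aggregate mismatch"
print("EDW load reconciled with the HDFS extract")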

5.4.      Reports:

1.    Validate that the data coming in the reports is as expected.
2.    Validate the cube to verify that the pre-aggregated values are calculated correctly.
3.    Validate the dashboards to ensure that all objects are rendered properly.
4.   Validate the reports to ensure that data fetched from various web parts matches the database (see the sketch after this list).
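The sketch below shows one possible way to validate a pre-aggregated report or cube figure against the detail data: recompute the reported number independently and compare. The connection string, table and column names are hypothetical.

import pyodbc

conn = pyodbc.connect("DSN=REPORTING")   # hypothetical connection to the reporting database
cur = conn.cursor()

# Figure shown on the report / pre-aggregated in the cube (hypothetical table and columns)
cur.execute("SELECT total_sales FROM rpt.monthly_summary WHERE sales_month = '2024-01'")
report_value = float(cur.fetchone()[0])

# The same figure recomputed directly from the detail records
cur.execute("SELECT SUM(amount) FROM dw.sales WHERE sales_month = '2024-01'")
detail_value = float(cur.fetchone()[0])

assert abs(report_value - detail_value) < 0.01, \
    f"report shows {report_value} but the detail data sums to {detail_value}"
print("Report figure matches the underlying data")
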
6.  Big Data Testing Types

In this section we discuss the types of testing that can be performed on Big Data:

1.    Functional Testing
2.    Non-Functional Testing

6.1.      Functional Testing
Testing Big Data essentially means testing its three dimensions or characteristics, i.e. volume, velocity and variety, to ensure there are no data quality defects.

6.1.1.           Testing velocity
a)    Capture the performance of Pig/Hive jobs
b)    Validate job completion time against the benchmark (see the timing sketch below)
c)    Measure the throughput of the jobs
d)    Assess the impact of background processes on system performance
e)    Capture memory and CPU details of the task tracker
f)    Check the availability of the name node and data nodes
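For point (b), a very small timing harness can be enough. The sketch below times a Hive job from the command line and asserts against a benchmark; the query and the benchmark value are placeholders, and it assumes the 'hive' CLI is available on the test machine.

import subprocess
import time

BENCHMARK_SECONDS = 600   # hypothetical benchmark agreed for this job
QUERY = "INSERT OVERWRITE TABLE daily_totals SELECT dt, SUM(amount) FROM sales GROUP BY dt"

start = time.perf_counter()
subprocess.run(["hive", "-e", QUERY], check=True)   # run the Hive job and wait for completion
elapsed = time.perf_counter() - start

print(f"Job completed in {elapsed:.1f} s (benchmark {BENCHMARK_SECONDS} s)")
assert elapsed <= BENCHMARK_SECONDS, "job completion time exceeded the benchmark"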

6.1.2.           Testing volume

a.    Use a sampling strategy, since comparing the entire data set is rarely feasible
b.    Convert raw data into the expected result format to compare with the actual output data
c.    Prepare 'compare scripts' to compare the data present in HDFS file storage (a sample compare script follows below)
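A minimal 'compare script' using a sampling strategy might look like the sketch below. The file names, the key layout and the 10% sample rate are assumptions; in practice the actual output would typically be pulled from HDFS first.

import random

EXPECTED_FILE = "expected/orders_transformed.tsv"   # hypothetical expected result set
ACTUAL_FILE = "actual/part-r-00000"                 # hypothetical HDFS output copied locally
SAMPLE_RATE = 0.10                                  # check roughly 10% of the records

# Index the actual output by its key (first tab-separated column)
actual = {}
with open(ACTUAL_FILE) as f:
    for line in f:
        key, _, rest = line.rstrip("\n").partition("\t")
        actual[key] = rest

random.seed(42)   # fixed seed so a failing sample can be reproduced
checked = failures = 0
with open(EXPECTED_FILE) as f:
    for line in f:
        if random.random() > SAMPLE_RATE:
            continue
        key, _, rest = line.rstrip("\n").partition("\t")
        checked += 1
        if actual.get(key) != rest:
            failures += 1
            print(f"mismatch for key {key}")

print(f"sampled {checked} records, {failures} mismatches")
assert failures == 0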

6.1.3.           Testing variety

6.1.3.1.       Structured Data:

ü  Compare data using compare tools and identify the discrepancies

6.1.3.2.       Semi-structured Data:

ü  Convert semi-structured data into a structured format
ü  Format the converted raw data to match the expected results
ü  Compare the expected result data with the actual results (see the conversion sketch below)
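A minimal sketch of this convert-and-compare approach is shown below, assuming the semi-structured input is JSON lines and the expected results are tab-separated; the field names are purely illustrative.

import json

RAW_JSON = "raw/events.json"         # hypothetical semi-structured input, one JSON object per line
EXPECTED = "expected/events.tsv"     # hypothetical expected structured output

def to_row(obj):
    # Flatten the fields of interest into a fixed, tab-separated column order
    return "\t".join([str(obj.get("event_id", "")),
                      str(obj.get("user", {}).get("id", "")),
                      str(obj.get("type", "")),
                      str(obj.get("amount", 0))])

converted = []
with open(RAW_JSON) as f:
    for line in f:
        converted.append(to_row(json.loads(line)))

with open(EXPECTED) as f:
    expected = [line.rstrip("\n") for line in f]

assert sorted(converted) == sorted(expected), "converted data does not match the expected results"
print(f"{len(converted)} semi-structured records converted and verified")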

6.1.3.3.       Unstructured Data:

ü  Parse unstructured text data into data blocks and aggregate the computed data blocks
ü  Validate the aggregated data against the output data

6.2.      Non-Functional Testing

6.2.1.           Performance testing
A Big Data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, performance degrades because of poor design and architecture. Some of the areas where performance issues can occur are imbalanced input splits, redundant shuffles and sorts, and pushing most of the aggregation computation into the reduce phase. Performance testing is conducted by setting up a huge volume of data in the environment.

NoSQL solutions are very different from the usual RDBMS, but they are still bound by the usual performance constraints: CPU, I/O and, most importantly, how they are used.
We test Big Data performance to identify bottlenecks. Performance testing is conducted by setting up a huge volume of data on infrastructure comparable to production and checking performance metrics such as job completion time and throughput, along with system metrics such as memory utilization.

The process starts with setting up the Big Data cluster that is to be tested for performance.
Some tools for performance testing on Big Data are listed below, followed by a short usage sketch:

1)    YCSB: YCSB is a cloud service testing client that performs reads, writes and updates according to specified workloads.
2)    SandStorm: SandStorm is an automated performance testing tool that supports big data performance testing.
3)    JMeter: JMeter offers plugins for applying load to Cassandra; such a plugin acts as a Cassandra client and can send requests over Thrift.
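As an example of how such a tool might be driven from a test harness, the sketch below invokes YCSB's load and run phases for one of its bundled workloads. The launcher path, the 'basic' binding and the record/operation counts are assumptions chosen for illustration.

import subprocess

YCSB = "./bin/ycsb"   # hypothetical path to the YCSB launcher inside a YCSB installation

# Load phase: insert the initial data set defined by the workload file
subprocess.run([YCSB, "load", "basic", "-P", "workloads/workloada",
                "-p", "recordcount=1000000"], check=True)

# Run phase: execute the mixed read/update workload and capture YCSB's summary output,
# which includes overall throughput (ops/sec) and per-operation latencies
result = subprocess.run([YCSB, "run", "basic", "-P", "workloads/workloada",
                         "-p", "operationcount=1000000", "-threads", "16"],
                        check=True, capture_output=True, text=True)
print(result.stdout)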

6.2.2.           Failover testing
Failover testing is an important focus area in a Big Data implementation, with the objective of validating the recovery process and ensuring that data processing continues seamlessly when it switches to other data nodes.
Some of the validations to be performed during failover testing are verifying the checkpointing of edit logs and verifying recovery when data nodes fail or become corrupt.

6.2.3.           Security Testing
As data grows in variety and comes from different channels and sources, the associated risk grows as well; Big Data carries big risk with it, so security testing plays an important role here.

a)    Authentication and authorization: validate the roles and privileges used to collect the data. The ideal approach is to maintain a list of users, roles and authorizations against each source/system.

b)   Network security: data received from or transferred to other systems is sometimes encrypted to maintain confidentiality. We need to test all aspects of security, such as firewalls, network policies, anti-virus software and intrusion detection, to ensure that the data format and degree of confidentiality remain intact.

7.  Automation testing on big data testing

Manual testing does not scale for Big Data implementations, as they run on distributed environments and the data size is very large.
There are tools available in the market that support Big Data test automation, such as
QuerySurge, Zettaset and Informatica DVO.

