BIG DATA TESTING POINT OF VIEW
1. Overview
As organizations adopt "Big Data" as their data analytics solution, they find it difficult to define a robust testing strategy and to set up an optimal test environment for Big Data. This is mostly due to a lack of knowledge and understanding of Big Data testing. Big Data involves processing huge volumes of structured and unstructured data across different nodes using frameworks and languages such as MapReduce, Hive, and Pig.
A robust testing strategy needs to be defined well in advance to ensure that the functional and non-functional requirements are met and that the data conforms to an acceptable level of quality.
In this document we define recommended test approaches for Big Data projects.
2. Definition
We are living in the data age. Every day we create 2.5 quintillion bytes of data; so much, in fact, that 90% of the data in the world today has been created in the last two years alone.
Big Data refers to the massive amounts of data collected over time that are difficult to analyze and handle using common database management tools. A few of the heterogeneous sources from which this data is collected are:
§ sensors used to gather climate information,
§ posts to social media sites,
§ digital pictures and videos,
§ purchase transaction records,
§ military surveillance,
§ e-commerce,
§ complex scientific data, and
§ mobile phone GPS signals,
to name just a few.
In short, Big Data is an assortment of data sets so huge and complex that they are hard to capture, store, process, retrieve, and analyze with traditional approaches. Around 90% of the data collected is unstructured and only 10% is structured, so there is a strong need to evaluate and analyze that 90% of unstructured data.
Example: consumer product companies and retail organizations are monitoring social media such as Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.
3. Characteristics of Big Data
3.1. Volume
Big Data implies enormous volumes of data. Nowadays data is generated by machines, networks, and human interaction on systems like social media, and we see exponential growth in data storage because the data is no longer just text. We find data in the form of videos, music, and large images on our social media channels, running into terabytes and even petabytes (e.g., blog text is a few kilobytes; voice calls or video files are a few megabytes; sensor data, machine logs, and clickstream data can be in gigabytes).
In terms of QA, the big challenge is to ensure that the entire data set processed is correct.
E.g., millions of smartphones send a variety of information to the network infrastructure, along with readings from multiple sensors in factories, pipelines, etc. Much of this data did not exist five years ago, and the result is that more sources of data, each producing larger amounts of data, combine to increase the volume that has to be analyzed and tested.
3.2. Variety
Variety refers to the many sources and types of data: structured, semi-structured, and unstructured.
We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of structured and unstructured data creates problems for storing, mining, and analyzing data.
E.g., Big Data comes in multiple formats, ranging from emails to tweets to social media and sensor data. There is no control over the input data format or the structure of the data.
3.3. Velocity
Big Data velocity deals with the pace at which data flows in from sources like business processes, machines, networks, and human interaction with things like social media sites and mobile devices. The flow of data is massive and continuous.
With the advent of Big Data, understanding the velocity of data is extremely important. The reason is that in the early days of data processing we analyzed data in batches, acquired over time. Typically, data is broken into fixed-size chunks and processed through different layers from source to target, and the end result is stored in a data warehouse for further use in reporting and analysis. This technique of processing data in batches or micro-batches works well when the input data flows at a fixed rate and the results are used for analysis that can tolerate the processing delays. The scalability and throughput of the data processing architecture are maintained thanks to the fixed size of the batches.
In the case of Big Data, the data streams in continuously, and the result sets are only useful when the acquisition and processing delays are short. This is where an ingestion and processing engine becomes critical: one that can work at extremely scalable speeds on extremely volatile sizes of data in a relatively minimal amount of time.
E.g., the most popular way to share pictures, music, and data today is via mobile devices. The sheer volume of data transmitted by mobile networks provides insights to the providers on the performance of their network, the amount of data processed at each tower, the time of day, the associated geographies, user demographics, location, latencies, and much more. The velocity of data movement is unpredictable and can sometimes cause a network to crash.
Studying this data movement has enabled mobile service providers to improve quality of service (QoS), and associating it with social media inputs has enabled insights into competitive intelligence.
4. Testing Big Data Implementations - How Is This Different from Testing DWH Implementations?
Whether it is a Data Warehouse (DWH) or a Big Data storage system, the basic component of interest to us, the testers, is the data. At the fundamental level, data validation in both of these storage systems involves validating the data against the source systems, for the defined business rules.
Let us look at the differences between DWH testing and Big Data testing from the following three perspectives:
· Data
· Infrastructure
· Validation tools
| Perspective | DWH | Big Data |
| Data | A DWH tester has the advantage of working with structured data. | A Big Data tester has to work with unstructured or semi-structured data (data with a dynamic schema). |
| Infrastructure | RDBMS-based databases (Oracle, SQL Server, etc.) are installed on an ordinary file system, so testing DWH systems does not require any special test environment; it can be done from within the file system. | For testing Big Data in HDFS, the tester requires a test environment that is itself based on HDFS. Testers need to learn how to work with HDFS, as it is different from working with an ordinary file system. |
| Validation tools | A DWH tester works with structured query language (SQL), so querying and validating the data is easy. | NoSQL databases are document stores, key-value stores, graph databases, or wide-column stores, so to test Big Data with NoSQL the tester has to write JUnit or Unix scripts with validation logic. |
5. Big Data Testing Approach
Big Data involves processing huge volumes of data across multiple nodes, so there is a high chance of bad data arising at each stage of the process. Testing should be performed at each of the following phases to ensure that data is processed without errors:
· Pre-Hadoop processing
· MapReduce process
· Data extract and load into the EDW
· Reports
5.1. Pre-Hadoop Processing
1. Comparing the input data file against the source system data to ensure the data is extracted correctly (see the sketch after this list).
2. Validating the data requirements and ensuring the right data is extracted.
3. Validating that the files are loaded into HDFS correctly.
4. Validating that the input files are split, moved, and replicated across the different data nodes.
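As a minimal sketch of checks 1 and 3 above, the Python script below compares record counts and a checksum of the record keys between the source extract and the copy landed in HDFS, shelling out to the standard `hdfs dfs -cat` command. The file paths and the assumption that the first column is a unique key are hypothetical.

```python
"""Sketch: pre-Hadoop ingestion checks (steps 1 and 3 above).

Assumptions (hypothetical): the source extract is a local CSV at
SOURCE_FILE, the ingested copy lives in HDFS at HDFS_FILE, and the
first column of each record is a unique key.
"""
import csv
import hashlib
import subprocess

SOURCE_FILE = "/data/source/orders_20240101.csv"   # hypothetical path
HDFS_FILE = "/landing/orders/orders_20240101.csv"  # hypothetical path


def fingerprint(lines):
    """Return (record_count, checksum of the sorted record keys)."""
    keys = sorted(row[0] for row in csv.reader(lines) if row)
    digest = hashlib.md5("\n".join(keys).encode("utf-8")).hexdigest()
    return len(keys), digest


def local_lines(path):
    with open(path, newline="") as handle:
        yield from handle


def hdfs_lines(path):
    # 'hdfs dfs -cat' streams the file content back from HDFS.
    out = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, text=True, check=True)
    yield from out.stdout.splitlines()


if __name__ == "__main__":
    src_count, src_sum = fingerprint(local_lines(SOURCE_FILE))
    tgt_count, tgt_sum = fingerprint(hdfs_lines(HDFS_FILE))
    assert src_count == tgt_count, f"record count mismatch: {src_count} vs {tgt_count}"
    assert src_sum == tgt_sum, "record keys differ between source and HDFS copy"
    print(f"OK: {src_count} records match between source and HDFS")
```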
5.2. MapReduce Process:
1. Validating that data processing is completed and the output file is generated.
2. Validating the business logic on a standalone node and then validating it after running against multiple nodes (see the sketch after this list).
3. Validating the MapReduce process to verify that key-value pairs are generated correctly.
4. Validating the aggregation of data after the reduce process.
5. Validating the output data and format against the source file and ensuring that data processing is complete.
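A minimal sketch of points 2-4 above: before the job runs on the cluster, the map and reduce logic can be exercised standalone on a tiny, hand-checked sample and the generated key-value pairs and aggregates asserted against expected values. The word-count mapper and reducer here are stand-ins for the project's actual job logic.

```python
"""Sketch: standalone validation of map/reduce logic (points 2-4 above).

The word-count mapper/reducer are stand-ins for the real job; the idea
is to run the same logic on one node with a small, hand-checked sample
before validating the job on multiple nodes.
"""
from collections import defaultdict


def mapper(line):
    # Emit (key, value) pairs exactly as the MapReduce job would.
    for word in line.strip().lower().split():
        yield word, 1


def reducer(key, values):
    # Aggregate all values seen for one key.
    return key, sum(values)


def run_local(lines):
    grouped = defaultdict(list)
    for line in lines:                          # map phase
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())   # reduce phase


if __name__ == "__main__":
    sample = ["big data testing", "big data velocity"]
    expected = {"big": 2, "data": 2, "testing": 1, "velocity": 1}
    actual = run_local(sample)
    assert actual == expected, f"key/value aggregation wrong: {actual}"
    print("map/reduce logic validated on a standalone sample")
```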
5.3. Data Extract and Load into EDW:
1. Validating that transformation rules are applied correctly (see the sketch after this list).
2. Validating the data load into the target system.
3. Validating the aggregation of data.
4. Validating the data integrity in the target system.
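As an illustration of the checks above, the sketch below re-applies one transformation rule to the source records independently of the ETL code and reconciles the aggregated result against the rows loaded into the target warehouse. The rule (cents to dollars, summed per customer) is hypothetical, and an in-memory SQLite table stands in for the real EDW.

```python
"""Sketch: validating a transformation rule and the aggregated load in
the target EDW (points 1-4 above).

Hypothetical rule: source amounts in cents are converted to dollars and
summed per customer. SQLite stands in for the real warehouse.
"""
import sqlite3
from collections import defaultdict

# Source records as extracted from the HDFS output (customer_id, amount_cents).
source_rows = [("C1", 1250), ("C1", 750), ("C2", 500)]

# Target warehouse with the already-loaded aggregate table.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE fact_sales (customer_id TEXT, total_usd REAL)")
edw.executemany("INSERT INTO fact_sales VALUES (?, ?)", [("C1", 20.0), ("C2", 5.0)])

# Re-apply the transformation rule independently of the ETL code.
expected = defaultdict(float)
for customer_id, amount_cents in source_rows:
    expected[customer_id] += amount_cents / 100.0

# Reconcile against what was actually loaded into the target.
loaded = dict(edw.execute("SELECT customer_id, total_usd FROM fact_sales"))
assert loaded == dict(expected), f"EDW load mismatch: {loaded} vs {dict(expected)}"
print("transformation and aggregation validated against the target")
```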
5.4. Reports:
1.
Validate the data
coming in the reports is as expected
2.
Validated the cube
to t verify the pre –aggregated values are calculated correctly
3.
Validate the
Dashboards to ensure that all objects are
rendered properly
4. Validate the
reports to ensure that data fetched from various web parts is validated against
data base.
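A small sketch of points 1 and 4: the figures shown in a report export are reconciled against the database the report is built on. The export path, column names, query, and reporting database are all assumptions; SQLite again stands in for the real reporting database.

```python
"""Sketch: reconciling report figures against the underlying database
(points 1 and 4 above). The CSV export path, its columns and the query
are hypothetical; SQLite stands in for the reporting database.
"""
import csv
import sqlite3

REPORT_EXPORT = "/reports/monthly_sales.csv"   # hypothetical report export


def report_totals(path):
    """Read (region, total) rows out of the report export."""
    with open(path, newline="") as handle:
        return {row["region"]: float(row["total"]) for row in csv.DictReader(handle)}


def database_totals(conn):
    """Recompute the same totals directly from the base table."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return {region: float(total) for region, total in conn.execute(query)}


if __name__ == "__main__":
    conn = sqlite3.connect("reporting.db")      # hypothetical reporting database
    shown = report_totals(REPORT_EXPORT)
    actual = database_totals(conn)
    mismatches = {r: (shown.get(r), actual.get(r))
                  for r in set(shown) | set(actual) if shown.get(r) != actual.get(r)}
    assert not mismatches, f"report does not match the database: {mismatches}"
    print("report figures reconciled with the database")
```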
6. Big Data Testing Types
In this section we discuss the types of testing that can be performed on big data:
1. Functional testing
2. Non-functional testing
6.1. Functional Testing
Testing big data is essentially testing its three dimensions, or characteristics, i.e. volume, velocity, and variety, to ensure there are no data quality defects.
6.1.1. Testing velocity
a) Capturing the performance of Pig/Hive jobs
b) Job completion time, validated against the benchmark (see the sketch after this list)
c) Throughput of the jobs
d) Impact of background processes on the performance of the system
e) Memory and CPU details of the task tracker
f) Availability of the name node and data nodes
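A hedged sketch of points (a)-(c): time a Hive job, derive throughput from the processed record count, and compare both against agreed benchmarks. The query, the record count (e.g. taken from job counters), and the thresholds are assumptions; `hive -e` is the standard Hive CLI invocation.

```python
"""Sketch: measuring Hive job completion time and throughput against a
benchmark (points a-c above). The query, record count and thresholds
are hypothetical.
"""
import subprocess
import time

HIVE_QUERY = "INSERT OVERWRITE TABLE daily_agg SELECT ... FROM raw_events"  # hypothetical
RECORDS_PROCESSED = 50_000_000      # assumed, e.g. taken from job counters
MAX_RUNTIME_SECONDS = 1800          # agreed benchmark
MIN_THROUGHPUT_PER_SEC = 20_000     # agreed benchmark

start = time.monotonic()
subprocess.run(["hive", "-e", HIVE_QUERY], check=True)   # run the job
elapsed = time.monotonic() - start
throughput = RECORDS_PROCESSED / elapsed

print(f"completed in {elapsed:.0f}s, throughput {throughput:.0f} records/s")
assert elapsed <= MAX_RUNTIME_SECONDS, "job completion time exceeds the benchmark"
assert throughput >= MIN_THROUGHPUT_PER_SEC, "throughput is below the benchmark"
```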
6.1.2. Testing volume
a. Use a sampling strategy (see the sketch after this list)
b. Convert raw data into the expected result format to compare with the actual output data
c. Prepare 'compare scripts' to compare the data present in HDFS file storage
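One possible sampling strategy for point (a) is sketched below: reservoir sampling picks a fixed-size, uniformly random subset of records from a file too large to compare in full, so the compare scripts in (c) only have to reconcile the sample. The input path is hypothetical.

```python
"""Sketch: sampling strategy for volume testing (points a-c above).
Reservoir sampling selects k records uniformly at random from an
arbitrarily large file; the sampled records are then fed to the
compare scripts. The input path is hypothetical.
"""
import random


def reservoir_sample(lines, k, seed=42):
    """Return k lines chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            j = rng.randint(0, i)      # keep each line with decreasing probability
            if j < k:
                sample[j] = line
    return sample


if __name__ == "__main__":
    with open("/data/landing/huge_extract.csv") as handle:   # hypothetical path
        sampled = reservoir_sample(handle, k=1000)
    print(f"selected {len(sampled)} records for detailed comparison")
```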
6.1.3. Testing variety
6.1.3.1. Structured data:
ü Compare data using compare tools and identify the discrepancies
6.1.3.2. Semi-structured data:
ü Convert semi-structured data into a structured format
ü Format the converted raw data into expected results
ü Compare the expected result data with the actual results
6.1.3.3. Unstructured data:
ü Parse unstructured text data into data blocks and aggregate the computed data blocks (see the sketch after this list)
ü Validate the aggregated data against the data output
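A sketch of the unstructured-data steps above: parse free-text log lines into structured blocks with a regular expression, aggregate them, and validate the aggregate against the pipeline's output. The log line format and the expected pipeline output are assumptions.

```python
"""Sketch: parsing unstructured text into data blocks and validating the
aggregate (unstructured-data steps above). The log line format and the
expected pipeline output are hypothetical.
"""
import re
from collections import Counter

# Raw, unstructured log lines as they might land in HDFS.
raw_lines = [
    "2024-01-01 10:02:11 ERROR payment-service timeout on txn 991",
    "2024-01-01 10:02:15 INFO  payment-service txn 992 ok",
    "2024-01-01 10:03:02 ERROR auth-service bad credentials",
]

# Parse each line into a structured block (timestamp, level, service, message).
PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+)\s+(?P<service>[\w-]+)\s+(?P<message>.*)"
)
blocks = [PATTERN.match(line).groupdict() for line in raw_lines]

# Aggregate the computed data blocks: error count per service.
actual = Counter(b["service"] for b in blocks if b["level"] == "ERROR")

# Validate against the data output produced by the pipeline (assumed values).
pipeline_output = {"payment-service": 1, "auth-service": 1}
assert dict(actual) == pipeline_output, f"aggregation mismatch: {dict(actual)}"
print("unstructured parse and aggregation validated")
```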
6.2. Non-Functional Testing
6.2.1. Performance testing
A big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, performance degrades because of poor design and architecture. Some of the areas where performance issues can occur are an imbalance in input splits, redundant shuffles and sorts, and moving most of the aggregation computations to the reduce phase. Performance testing is conducted by setting up a huge volume of data in an environment.
NoSQL solutions are very different from the usual RDBMS, but they are still bound by the usual performance constraints: CPU, I/O and, most importantly, how they are used.
We test big data to identify bottlenecks. We conduct performance testing by setting up a huge volume of data and infrastructure identical to production and checking performance metrics like job completion time and throughput, along with system metrics like memory utilization.
The process starts with setting up the big data cluster that is to be tested for performance.
Some tools for performance testing on big data are:
1) YCSB: a cloud service testing client that performs reads, writes, and updates according to specified workloads.
2) SandStorm: an automated performance testing tool that supports big data performance testing.
3) JMeter: provides plugins to apply load to Cassandra; the plugin acts as a Cassandra client and can send requests over Thrift.
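Besides dedicated tools, a lightweight script like the sketch below can capture the metrics named above (job completion time plus CPU and memory utilization) on the node under observation while a job runs. The job command is a placeholder and the `psutil` package is assumed to be installed.

```python
"""Sketch: capturing job completion time plus CPU and memory utilization
on the node under observation while a job runs. The job command is a
placeholder; 'psutil' is assumed to be installed.
"""
import subprocess
import time

import psutil

JOB_COMMAND = ["hive", "-e", "SELECT COUNT(*) FROM raw_events"]  # placeholder job

start = time.monotonic()
proc = subprocess.Popen(JOB_COMMAND)

cpu_samples, mem_samples = [], []
while proc.poll() is None:                       # sample until the job finishes
    cpu_samples.append(psutil.cpu_percent(interval=1))
    mem_samples.append(psutil.virtual_memory().percent)

elapsed = time.monotonic() - start
print(f"completion time: {elapsed:.0f}s")
print(f"avg CPU: {sum(cpu_samples) / max(len(cpu_samples), 1):.1f}%")
print(f"peak memory: {max(mem_samples, default=0):.1f}%")
```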
6.2.2. Failover testing
Failover testing is an important focus area in a big data implementation, with the objective of validating the recovery process and ensuring that data processing continues seamlessly when it is switched over to other data nodes.
Some validations to be performed during failover testing are verifying the checkpointing of edit logs and verifying recovery when data nodes fail or become corrupt (see the sketch below).
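One hedged way to script the recovery check: after a data node has been taken down by the test harness, parse the standard `hdfs fsck` report to confirm there are no missing or corrupt blocks and that under-replicated blocks drain back to zero as HDFS re-replicates. The exact report wording can vary by Hadoop version, so the patterns below may need adjusting.

```python
"""Sketch: failover validation after a data node is stopped. It runs the
standard 'hdfs fsck' report and checks block health; the report wording
can vary by Hadoop version, so the patterns may need adjusting.
"""
import re
import subprocess
import time


def fsck_metric(report, label):
    """Pull an integer such as 'Corrupt blocks: 0' out of the fsck report."""
    match = re.search(rf"{label}\s*:\s*(\d+)", report, re.IGNORECASE)
    return int(match.group(1)) if match else None


def hdfs_healthy():
    report = subprocess.run(["hdfs", "fsck", "/"], capture_output=True,
                            text=True, check=True).stdout
    corrupt = fsck_metric(report, "Corrupt blocks")
    missing = fsck_metric(report, "Missing blocks")
    under_replicated = fsck_metric(report, "Under-replicated blocks")
    print(f"corrupt={corrupt} missing={missing} under-replicated={under_replicated}")
    return corrupt == 0 and missing == 0 and under_replicated == 0


if __name__ == "__main__":
    # A data node is assumed to have been stopped by the test harness here.
    for _ in range(30):                 # wait up to ~30 minutes for re-replication
        if hdfs_healthy():
            print("cluster recovered: no data loss after the data node failure")
            break
        time.sleep(60)
    else:
        raise SystemExit("HDFS did not recover within the expected window")
```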
6.2.3. Security Testing
As data grows in variety, coming from different channels and sources, the associated risk also grows; big data carries big risk with it, so security testing plays an important role here.
a) Authentication and authorization: validating the roles and privileges used to collect the data; the ideal way is to maintain a list of users, roles, and authorizations against each source/system (see the sketch after this list).
b) Network security: data received from or transferred to other systems is sometimes encrypted to maintain confidentiality. We need to test all aspects of security, such as firewalls, network policies, anti-virus software, and intrusion detection, to ensure that the data format and degree of confidentiality remain intact.
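For point (a), a simple hedged approach is to keep the expected users, roles, and authorizations per source/system in a matrix and diff it against what is actually granted. In the sketch below the users, roles, and the fetch_granted() function are all hypothetical; the fetch is a stub to be replaced with the cluster's real mechanism (e.g. Apache Ranger policies, Sentry roles, or HDFS ACL listings).

```python
"""Sketch: authentication/authorization check (point a above). The
expected access matrix is maintained per source/system; fetch_granted()
is a stub standing in for the cluster's real mechanism (e.g. Ranger,
Sentry, or HDFS ACL listings). All names are hypothetical.
"""

# Expected matrix: user -> role and systems granted (maintained by the team).
EXPECTED = {
    "etl_batch":  {"role": "data_loader", "systems": {"hdfs_landing", "hive_staging"}},
    "bi_analyst": {"role": "report_reader", "systems": {"edw_reports"}},
}


def fetch_granted():
    """Stub: return the access actually configured on the cluster."""
    return {
        "etl_batch":  {"role": "data_loader", "systems": {"hdfs_landing", "hive_staging"}},
        "bi_analyst": {"role": "report_reader", "systems": {"edw_reports", "hive_staging"}},
    }


if __name__ == "__main__":
    granted = fetch_granted()
    issues = []
    for user, expected in EXPECTED.items():
        actual = granted.get(user)
        if actual is None:
            issues.append(f"{user}: missing on the cluster")
        elif actual != expected:
            issues.append(f"{user}: expected {expected}, found {actual}")
    extra_users = set(granted) - set(EXPECTED)
    issues += [f"{user}: not in the approved access matrix" for user in extra_users]
    print("\n".join(issues) or "access matrix matches the approved list")
```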
7. Automation Testing on Big Data
Manual testing does not scale for testing big data implementations, as the environment is distributed and the data size is very large. There are some tools available in the market which support big data test automation, such as QuerySurge, Zettaset, and Informatica DVO.