The term big data originated in 1997, when it was introduced by two NASA researchers, Michael Cox and David Ellsworth [28]. Big data refers to large-scale, voluminous, multi-format data collected from multiple heterogeneous data streams [49]. One of the best-known applications of big data is the Smart City [3], [27], [41], [2]: a city that collects different types of data about its facilities using numerous sensors in order to manage its resources efficiently (e.g., using water sensors in the mains to detect leaks, or using video sensors to manage crowds or spot crimes).
In such systems the data volume keeps growing rapidly, because data is continuously gathered from numerous streams (e.g., mobile devices, cameras, microphones, and wireless sensor networks). There is no fixed volume threshold beyond which a data set is considered big; rather, data is said to be big in terms of volume if the underlying computing systems cannot easily process it. For instance, a cloud system may be able to process multiple terabytes of data, while a mobile device may not be able to process even a few gigabytes [36]. Due to the volume of big data, traditional data processing software is inadequate, and analysis and processing require huge effort [7].
Big data poses many challenges, among them: capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy [40], [33], [35]. Moreover, big data inherits the curse of dimensionality: the data has a massive set of dimensions, which makes analysis and processing even harder [52], [15]. One way to overcome such issues is big data reduction, where "reduce" can refer to either the complexity or the volume of the data. By reducing the complexity or the volume of the data, it becomes more manageable and hence easier to analyze.
Big data reduction can be performed using many methods, among them:
Dimension reduction: the dimensions of the data are its attributes (e.g., the id and name of a student, or the color and speed of a car). Dimensionality can be reduced by feature selection or feature extraction. Feature selection keeps only the important dimensions of the data, since not all dimensions will be needed. Feature extraction merges multiple sets of dimensions to derive new ones [46], [22], [44], [17].
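To illustrate feature selection, the following is a minimal sketch that keeps only the dimensions (columns) of a numeric data matrix whose variance exceeds a threshold. The variance criterion is one illustrative choice of "importance", not necessarily the criterion used in the cited works.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative feature selection: keep only columns (dimensions) whose
// variance exceeds a threshold, discarding near-constant attributes.
public class FeatureSelection {

    // Returns the indices of columns in `data` whose variance > threshold.
    static List<Integer> selectByVariance(double[][] data, double threshold) {
        List<Integer> kept = new ArrayList<>();
        int rows = data.length, cols = data[0].length;
        for (int c = 0; c < cols; c++) {
            double mean = 0;
            for (double[] row : data) mean += row[c];
            mean /= rows;
            double var = 0;
            for (double[] row : data) var += (row[c] - mean) * (row[c] - mean);
            var /= rows;
            if (var > threshold) kept.add(c);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 5.0, 0.1},
            {2.0, 5.0, 0.1},
            {3.0, 5.0, 0.1}
        };
        // Columns 1 and 2 are constant, so only column 0 survives.
        System.out.println(selectByVariance(data, 0.01)); // prints [0]
    }
}
```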
Deduplication: the collected data may contain redundancies. A redundancy is not necessarily a duplicate row in the database; it can also be a sequence of bits or a block of memory that is exactly identical to another one. In that case, the original is kept and the copy is replaced with a pointer to the original, reducing the volume in use [18], [54], [13], [50].
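The pointer-replacement idea above can be sketched as follows, assuming blocks are compared by a SHA-256 content hash (a common but here merely illustrative choice): the first occurrence of a block is stored physically, and every later identical block keeps only a reference to it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative block-level deduplication: identical blocks are stored once;
// later copies are replaced by a reference (here, the index of the original).
public class Dedup {

    static String sha256(byte[] block) throws Exception {
        byte[] h = MessageDigest.getInstance("SHA-256").digest(block);
        StringBuilder sb = new StringBuilder();
        for (byte b : h) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns, for each input block, the index of the stored (original) block.
    static int[] deduplicate(List<byte[]> blocks, List<byte[]> store) throws Exception {
        Map<String, Integer> seen = new HashMap<>();
        int[] refs = new int[blocks.size()];
        for (int i = 0; i < blocks.size(); i++) {
            String h = sha256(blocks.get(i));
            Integer idx = seen.get(h);
            if (idx == null) {             // first occurrence: keep the block
                idx = store.size();
                store.add(blocks.get(i));
                seen.put(h, idx);
            }                              // duplicate: only a pointer is kept
            refs[i] = idx;
        }
        return refs;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> blocks = List.of(
            "alpha".getBytes(StandardCharsets.UTF_8),
            "beta".getBytes(StandardCharsets.UTF_8),
            "alpha".getBytes(StandardCharsets.UTF_8));
        List<byte[]> store = new ArrayList<>();
        int[] refs = deduplicate(blocks, store);
        // Three logical blocks, but only two are physically stored.
        System.out.println(store.size()); // prints 2
        System.out.println(refs[2]);      // prints 0 (points at the first "alpha")
    }
}
```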
Graph theory: to reduce the complexity of the data, topological and dynamical network properties are extracted; topological networks are constructed by establishing relationships between data points [8], [43], [16], [47].
It has been shown that none of these methods alone can address all the V properties of big data [8], which motivates the need for more reliable data reduction approaches that combine multiple methods.
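As a minimal illustration of the graph-theoretic approach, the sketch below turns pairwise relationships between data points into an adjacency structure and extracts one simple topological property, the vertex degree; the data points and relations are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative construction of a topological network: data points become
// vertices, pairwise relationships become undirected edges, and a simple
// topological property (vertex degree) is then extracted.
public class TopologyDemo {

    static Map<String, Set<String>> buildGraph(String[][] relations) {
        Map<String, Set<String>> adj = new HashMap<>();
        for (String[] r : relations) {
            adj.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
            adj.computeIfAbsent(r[1], k -> new HashSet<>()).add(r[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        // Relationships established between data points A-D.
        String[][] relations = {{"A", "B"}, {"A", "C"}, {"B", "C"}, {"C", "D"}};
        Map<String, Set<String>> g = buildGraph(relations);
        System.out.println(g.get("C").size()); // degree of C: prints 3
    }
}
```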
A related line of work represents data as a knowledge graph [39], [25], [53], [10]. This paper presents a tool and an approach to reduce big data complexity using graphs. In a graph, data is represented using nodes and edges instead of an ordinary relational database (see fig. 1): each row is represented by a node (object), and relationships between entities are replaced with edges between the nodes. Representing the database as a graph makes the act of unlocking knowledge patterns easier [8].
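The row-to-node mapping described above can be sketched in plain Java as follows. This is a schematic model, not GraphJ's actual implementation; the STUDENT/COURSE tables and the ENROLLED_IN relationship are hypothetical.

```java
import java.util.Map;

// Schematic mapping from relational rows to a property graph: each row
// becomes a node carrying the row's columns as properties, and each
// foreign-key reference becomes an edge between the two nodes.
public class RowToGraph {

    record Node(String label, Map<String, Object> props) {}
    record Edge(Node from, Node to, String type) {}

    public static void main(String[] args) {
        // Two rows from hypothetical STUDENT and COURSE tables.
        Node student = new Node("Student", Map.of("id", 1, "name", "Ada"));
        Node course  = new Node("Course",  Map.of("id", 7, "title", "Databases"));

        // The foreign key student.course_id = 7 becomes an edge.
        Edge enrolled = new Edge(student, course, "ENROLLED_IN");

        System.out.println(enrolled.from().props().get("name") +
            " -[" + enrolled.type() + "]-> " +
            enrolled.to().props().get("title")); // prints Ada -[ENROLLED_IN]-> Databases
    }
}
```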
The tool uses a MySQL database [31] and converts its schema and data into a Neo4J graph database [12]. MySQL was chosen because it is well known and widely used, and Neo4J because of its high performance [45], [20], [30], [23]. The tool is built in Java, because Neo4J is well supported for connections from Java.

Fig. 1. Graph database sample
The rest of the paper is organized as follows: Section II describes related work; Section III describes the tool and the approach used; Section IV presents an experimental evaluation; Section V presents the conclusion.
II. RELATED WORK
Many big data reduction methods exist; some are based on network theory [42], some on compression [24], [51], [1], and others on dimension reduction [46], [21], [17]. However, based on our research, there are currently no tools or approaches to reduce big data complexity using a graph database or knowledge graph.
III. GRAPHJ TOOL
GraphJ is a standalone GUI-based application written in JavaFX [9] and based on the Spring Framework [26] for resource management. It is built to convert a MySQL database [31] into a Neo4J graph database [12]. To convert the database, GraphJ performs the following activities (see fig. 2):
A. MySQL host
GraphJ requires a live MySQL host to read the database schema from. There is no need to specify a database on that host, because GraphJ inspects all the databases. The connection is established using the MySQL Java Database Connectivity (JDBC) driver [19], which may not work on remote hosts due to remote direct access restrictions, as most database servers do not allow direct remote access. The connection is made using an interface called DBConnection that contains abstract methods to get: host, port, username, password, and database. The database is used only to define the default database for the connection. To obtain all the required data, GraphJ reads these properties from environment variables (see table I).
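A DBConnection-style implementation along these lines might look as follows. The environment variable names (GRAPHJ_HOST, etc.) and defaults are hypothetical illustrations; the actual names are the ones listed in table I.

```java
import java.util.Map;

// Sketch of a DBConnection-style class that reads its settings from
// environment variables and assembles a MySQL JDBC URL. The variable
// names used here are hypothetical.
public class EnvDbConnection {

    final String host, port, user, password, database;

    EnvDbConnection(Map<String, String> env) {
        host = env.getOrDefault("GRAPHJ_HOST", "localhost");
        port = env.getOrDefault("GRAPHJ_PORT", "3306");
        user = env.getOrDefault("GRAPHJ_USER", "root");
        password = env.getOrDefault("GRAPHJ_PASSWORD", "");
        database = env.getOrDefault("GRAPHJ_DATABASE", ""); // default DB only
    }

    // JDBC URL in the form understood by the MySQL Connector/J driver.
    String jdbcUrl() {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Reads the real process environment; output depends on what is set.
        EnvDbConnection c = new EnvDbConnection(System.getenv());
        System.out.println(c.jdbcUrl());
    }
}
```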