NTT : Leading Development of Scalable Distributed Computing Framework for Real-Time Analysis of Big Data -- Released as Open-Source on October 27 --(NTT)
Leading Development of Scalable Distributed
Computing Framework for Real-Time Analysis of Big
Data
-- Released as Open-Source on October 27 --
Nippon Telegraph & Telephone Corporation
(NTT, of Chiyoda Ward, Tokyo, CEO: Satoshi
Miura), and Preferred Infrastructure
Corporation (PFI, of Bunkyo Ward, Tokyo, CEO:
Toru Nishikawa) have developed an
infrastructure technology called
"Jubatus" (1st Edition)*2, which is
capable of high-speed, real-time analysis of
large-scale data, referred to as "Big
Data" *1.
Conventional batch processing methods
periodically process data in batches and put
newly arrived data on hold until the next
batch execution. These methods are inadequate
for Big Data applications such as real-time
trend analysis, for which the timeliness of
data is a critical requirement.
By analyzing the latest data in real time,
Jubatus can help create value-added services
in a wide range of areas, such as fraud
detection; forecasting of markets, economic
conditions, and stock prices; natural
disaster prediction; parts and materials
procurement estimation for manufacturing;
health-risk assessment; and predictive
techniques in life and natural sciences.
This development is a result of open
innovation between NTT Information Sharing
Platform Laboratories and PFI Corporation. It
will be released as open source on October 27
on the Jubatus OSS Web site, http://jubat.us/, as
public domain software that contributes to the
utilization of Big Data.
[Technical Background]
In recent years, the term "information
explosion" has been used in various fields to
describe the rapid increase in the amount of
published data available. It is important for
companies to generate business intelligence
by utilizing this "Big Data"
proactively and effectively.
Currently, Big Data is analyzed by
temporarily storing it in a cloud environment
composed of a server farm, and periodically
processing it in high-speed batches.
Hadoop*3 is one such system that is gaining
recognition and popularity.
However, the world is changing extremely
quickly, and there is a growing need for
technology able to perform sophisticated,
real-time analysis of large volumes of data,
arriving in time sequence, without storing
it. Applications such as SNS analysis and
detection of abnormal traffic or unauthorized
access need such technology in order to make
high-speed decisions based on
sophisticated analysis and forecasting.
[Technical Overview]
Jubatus is a large-scale, distributed,
real-time analysis framework with the
objective of continuous, high-speed, deep
analysis of high-volume data (Figure 1).
Jubatus achieves continuous, high-speed
processing of high-volume data by dividing
the data among multiple servers, each of
which processes its share sequentially while
the servers run in parallel. Deep analysis requires use of
sophisticated statistical processing and
machine learning, and implementation in a
distributed environment requires a framework
allowing multiple servers to share
intermediate results. Such sharing requires
frequent communication between servers, and
this can become a bottleneck to overall
performance if a suitable communication
method is not devised.
Accordingly, Jubatus has servers exchange
intermediate results in a loose manner, which
reduces the communication overhead between
servers. This preserves the real-time nature
and accuracy of data analysis and also
increases robustness.
[Technical Features]
The main features of Jubatus are as follows
(Figure 2).
(1) MIX Processing system
This processing system has the following
three functions.
<1>
<2> MIX Protocol control: Determines how
data is aggregated and redistributed
when checking intermediate analysis
results among the servers.
<3> Membership management: Performs tasks
such as recovery from server faults and
adding more servers in order to ensure
continuous data processing, before data
overflow can occur.
Even with simultaneous parallel
analysis, having all servers wait for
each other to compare intermediate
results at every iteration would
create a bottleneck. Instead, servers
exchange and mix intermediate results
with other servers at suitable time
intervals rather than at every
iteration, so that each server can run
autonomously without slowing down.
Real-time performance and scalability
are balanced by relaxing the precision
and strictness (overall consistency) of
the aggregate results within the range
the application allows (Figure 3).
(2) Pluggable architecture
Analysis engines, analysis modules, and data
storage methods (local, distributed) can be
combined and rearranged flexibly (plugged-in,
out) due to the definition of shared
interfaces.
(3) Workflow definition
Execution paths and parallel execution
between process components, from data input
through applied analysis to the analysis
engines and others, can be defined and
controlled easily and flexibly.
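The loose exchange of intermediate results described under feature (1) can be illustrated with a minimal sketch. It is not the Jubatus implementation or its API: the names Worker, mix, run and mix_interval, the synthetic data, and the use of simple weight averaging as the mixing rule are all assumptions made for illustration. Each worker updates its own linear model on its own share of the data, and the workers synchronize only once every mix_interval examples.

```python
import random

def make_stream(n, seed):
    """Synthetic 2-D points labeled by the sign of x0 + x1 (hypothetical data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 1 if x[0] + x[1] > 0 else -1
        data.append((x, y))
    return data

class Worker:
    """One server holding a local linear model, updated one example at a time."""
    def __init__(self, dim):
        self.w = [0.0] * dim

    def update(self, x, y):
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        if y * score <= 0:  # perceptron rule: change the model only on a mistake
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]

def mix(workers):
    """Average the local weights and hand the result back to every worker."""
    dim = len(workers[0].w)
    avg = [sum(wk.w[i] for wk in workers) / len(workers) for i in range(dim)]
    for wk in workers:
        wk.w = list(avg)
    return avg

def run(num_workers=4, per_worker=400, mix_interval=50):
    """Train workers on separate shards, mixing only every mix_interval steps."""
    workers = [Worker(2) for _ in range(num_workers)]
    shards = [make_stream(per_worker, seed=s) for s in range(num_workers)]
    mixed = None
    for t in range(per_worker):
        for wk, shard in zip(workers, shards):
            wk.update(*shard[t])          # no communication on this path
        if (t + 1) % mix_interval == 0:
            mixed = mix(workers)          # the only synchronization point
    return mixed

def accuracy(w, data):
    """Fraction of examples whose label matches the sign of the model score."""
    correct = sum(1 for x, y in data
                  if y * sum(wi * xi for wi, xi in zip(w, x)) > 0)
    return correct / len(data)

if __name__ == "__main__":
    model = run()
    print(accuracy(model, make_stream(500, seed=99)))
```

A larger mix_interval means less communication but lets the local models drift further apart between mixes, which is the trade-off between scalability and consistency that the text describes.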
At this time, we have implemented and
evaluated a multi-value classifier for online
machine learning as the first instance of
analysis engines for Jubatus.
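As an illustration of what an online multi-value classifier does, the following sketch learns from one example at a time and never stores past data. The release does not say which learning algorithm Jubatus uses, so the perceptron-style update here, the class name OnlineMulticlassPerceptron, and the sample features and labels are all assumptions.

```python
class OnlineMulticlassPerceptron:
    """Keeps one sparse weight vector per label and updates it per example."""

    def __init__(self):
        self.weights = {}  # label -> {feature name: weight}

    def score(self, label, features):
        w = self.weights.get(label, {})
        return sum(w.get(f, 0.0) * v for f, v in features.items())

    def classify(self, features):
        """Return the highest-scoring label, or None before any training."""
        if not self.weights:
            return None
        return max(self.weights, key=lambda lb: self.score(lb, features))

    def train(self, features, label):
        """Process a single example, then discard it."""
        self.weights.setdefault(label, {})
        rivals = [lb for lb in self.weights if lb != label]
        rival = (max(rivals, key=lambda lb: self.score(lb, features))
                 if rivals else None)
        # Update when the true label does not strictly beat its best rival.
        if rival is None or self.score(label, features) <= self.score(rival, features):
            for f, v in features.items():
                own = self.weights[label]
                own[f] = own.get(f, 0.0) + v
                if rival is not None:
                    other = self.weights[rival]
                    other[f] = other.get(f, 0.0) - v

if __name__ == "__main__":
    clf = OnlineMulticlassPerceptron()
    clf.train({"goal": 1.0, "match": 1.0}, "sports")
    clf.train({"stock": 1.0, "market": 1.0}, "finance")
    clf.train({"rain": 1.0, "storm": 1.0}, "weather")
    print(clf.classify({"stock": 1.0}))
```

Because the model is just a set of weight vectors, it is the kind of intermediate result that servers could exchange and mix at intervals.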
[Future Developments]
In order to further advance R&D and
contribute to the development of information
processing technology for Big Data, NTT and
PFI Corp. are working to promote the spread
of real-time large-scale data analysis
infrastructure and related business by
expanding the Jubatus community and
businesses built on it. We are considering an
"SNS analysis application" service
in particular. This application will perform
sophisticated analysis, such as
categorization, fuzzy search, real-time
filtering, and relevancy ranking, of the
large volumes of real-time SNS data generated
every day, so that it can be used for
marketing and other applications. illustrates
the concept of SNS analysis applications
using Jubatus.
Other applications include:
"Sensor data analysis"
"POS data analysis"
"Log data analysis"
"Financial data analysis"
"Behavioral analysis"
[References]
See the Preferred Infrastructure Web site: http://www.preferred.jp
PFI Corp. is a venture company with excellent
research and development staff in the fields
of natural language processing and machine
learning. The high-performance analysis
engine in Jubatus has been produced from PFI
technical expertise.
*1 Big Data
Refers to data sets that are very large and
have complex structure, so they are difficult
to manage and process using conventional
technologies. Although not clearly defined,
these data sets usually range from several
hundred terabytes to petabytes in size, do
not have a fixed form, and are
real-time in nature. They can include, for
example, data from RFID tags or other types
of sensors, or text from blogs or other new
communications tools.
*3 Hadoop
An open-source clone of Google's
infrastructure system (MapReduce, BigTable,
GFS, etc.). It is the representative example
of a batch-processing large-scale distributed
processing infrastructure. The Hadoop
community is quite mature.