Thursday, May 12, 2016


BIG DATA - HADOOP

ATS_BDH16_001 - Self-Healing in Mobile Networks with Big Data
          Mobile networks have rapidly evolved in recent years due to the increase in multimedia traffic and offered services. This has led to a growth in the volume of control data and measurements that are used by self-healing systems. To maintain a certain quality of service, self-healing systems must complete their tasks in a reasonable time. The conjunction of a big volume of data and the limitation of time requires a big data approach to the problem of self-healing. This article reviews the data that self-healing uses as input and justifies its classification as big data. Big data techniques applied to mobile networks are examined, and some use cases along with their big data solutions are surveyed.

ATS_BDH16_002 - A Parallel Patient Treatment Time Prediction Algorithm and Its Applications in Hospital Queuing-Recommendation in a Big Data Environment
          Effective patient queue management to minimize patient waiting times and overcrowding is one of the major challenges faced by hospitals. Unnecessary, annoyingly long waits cause substantial wastage of human resources and time and increase the frustration endured by patients. For each patient in the queue, the time he must wait is the total treatment time of all the patients ahead of him. It would be convenient and preferable if patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real time. Therefore, we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time of each treatment task for a patient. We use realistic patient data from various hospitals to obtain a patient treatment time model for each task. Based on this large-scale, realistic dataset, the treatment time for each patient in the current queue of each task is predicted. Based on the predicted waiting time, a Hospital Queuing-Recommendation (HQR) system is developed. HQR calculates and predicts an efficient and convenient treatment plan recommended for the patient. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system mandate efficiency and low-latency response. We use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha to achieve the aforementioned goals. Extensive experimentation and simulation results demonstrate the effectiveness and applicability of our proposed model in recommending an effective treatment plan for patients to minimize their wait times in hospitals.
In this paper, we propose a Patient Treatment Time Prediction (PTTP) algorithm and a Hospital Queuing-Recommendation (HQR) system. Considering the real-time requirements, enormous data, and complexity of the system, we employ big data and cloud computing models for efficiency and scalability. The PTTP algorithm is trained based on an improved Random Forest (RF) algorithm for each treatment task, and the waiting time of each task is predicted based on the trained PTTP model. Then, HQR recommends an efficient and convenient treatment plan for each patient. Patients can see the recommended plan and predicted waiting time in real time using a mobile application. Extensive experimentation and application results show that the PTTP algorithm achieves high precision and performance.
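
As a concrete illustration of the per-task prediction idea, here is a minimal single-machine sketch in Python. It uses scikit-learn's RandomForestRegressor as a stand-in for the paper's improved RF trained on Spark; the features (age, hour of day, day of week) and the synthetic data are hypothetical:

```python
# Sketch: per-task waiting-time models in the spirit of PTTP.
# scikit-learn's RandomForestRegressor stands in for the paper's
# improved RF trained on Spark; feature columns are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_synthetic_records(n=1000):
    # hypothetical features: patient age, hour of day, day of week
    X = np.column_stack([
        rng.integers(1, 90, n),      # age
        rng.integers(0, 24, n),      # hour of day
        rng.integers(0, 7, n),       # day of week
    ])
    minutes = 5 + 0.1 * X[:, 0] + 2.0 * (X[:, 1] > 8) + rng.normal(0, 1, n)
    return X, minutes

# one model per treatment task (e.g., "blood_test", "ct_scan")
models = {}
for task in ["blood_test", "ct_scan"]:
    X, y = make_synthetic_records()
    models[task] = RandomForestRegressor(n_estimators=50).fit(X, y)

# predicted wait for a queue = sum of predicted treatment times ahead
queue = [(30, 9, 2), (62, 9, 2), (45, 10, 2)]  # patients before us
wait = sum(models["blood_test"].predict([p])[0] for p in queue)
print(f"predicted wait for blood_test: {wait:.1f} minutes")
```

The predicted wait for a patient is simply the sum of the predicted treatment times of the patients ahead in the queue, which is the quantity the HQR system builds its recommendations on.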

ATS_BDH16_003 - Protection of Big Data Privacy
          In recent years, big data have become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data require high computational power and large storage, distributed systems are used. As multiple parties are involved in these systems, the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of the privacy preservation mechanisms in big data and present the challenges for existing mechanisms. In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges and future research directions related to privacy preservation in big data.

ATS_BDH16_004 - Big Data Analytics in Mobile Cellular Networks
          Mobile cellular networks have become both the generators and carriers of massive data. Big data analytics can improve the performance of mobile cellular networks and maximize the revenue of operators. In this paper, we introduce a unified data model based on random matrix theory and machine learning. Then, we present an architectural framework for applying big data analytics in mobile cellular networks. Moreover, we describe several illustrative examples, including big signaling data, big traffic data, big location data, big radio waveform data, and big heterogeneous data in mobile cellular networks. Finally, we discuss a number of open research challenges of big data analytics in mobile cellular networks.

ATS_BDH16_005 - Spark-based Large-scale Matrix Inversion for Big Data Processing
          Matrix inversion is a fundamental operation for solving linear equations in many computational applications, especially various emerging big data applications. However, it is a challenging task to invert large-scale matrices of extremely high order (several thousands or millions), which are common in most web-scale systems such as social networks and recommendation systems. In this paper, we present an LU decomposition-based block-recursive algorithm for large-scale matrix inversion, together with a carefully designed implementation featuring an optimized data structure, reduced space complexity, and effective matrix multiplication on the Spark parallel computing platform. The experimental evaluation shows that the proposed algorithm efficiently inverts large-scale matrices on a cluster of commodity servers and scales to even larger matrices. The proposed algorithm and implementation can serve as a solid foundation for a high-performance linear algebra library on Spark for big data processing and applications.
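
The LU/Schur-complement identity that drives such a block-recursive inversion can be sketched in a few lines of Python. This is a single-machine NumPy illustration only; the paper's contribution lies in distributing the block multiplications over Spark with optimized data structures:

```python
# Sketch: the block-recursive inversion identity behind LU-based
# large-scale inversion, shown on a single machine with NumPy.
import numpy as np

def block_inverse(M, leaf=64):
    n = M.shape[0]
    if n <= leaf:
        return np.linalg.inv(M)           # small base case
    k = n // 2
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    Ainv = block_inverse(A, leaf)
    S = D - C @ Ainv @ B                  # Schur complement of A
    Sinv = block_inverse(S, leaf)
    TL = Ainv + Ainv @ B @ Sinv @ C @ Ainv
    TR = -Ainv @ B @ Sinv
    BL = -Sinv @ C @ Ainv
    return np.block([[TL, TR], [BL, Sinv]])

M = np.random.rand(500, 500) + 500 * np.eye(500)  # well-conditioned
assert np.allclose(block_inverse(M) @ M, np.eye(500), atol=1e-6)
```

Each recursive call needs only matrix multiplications and two smaller inversions, which is what makes the scheme amenable to a distributed matrix-multiplication backend.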

ATS_BDH16_006 - A Tutorial on Secure Outsourcing of Large-scale Computations for Big Data
          Today's society is collecting a massive and exponentially growing amount of data that can potentially revolutionize scientific and engineering fields and promote business innovation. With the advent of cloud computing, users can analyze data in a cost-effective and practical way by outsourcing their computing tasks to the cloud, which offers access to vast computing resources on an on-demand, pay-per-use basis. However, since users' data contain sensitive information that needs to be kept secret for ethical, security, or legal reasons, many users are reluctant to adopt cloud computing. To this end, researchers have proposed techniques that enable users to offload computations to the cloud while protecting their data privacy. In this paper, we review the recent advances in the secure outsourcing of large-scale computations for big data analysis. We first introduce the two most fundamental and common classes of computational problems, linear algebra and optimization, and then provide an extensive review of data privacy preserving techniques. After that, we explain how researchers have exploited these techniques to construct secure outsourcing algorithms for large-scale computations.
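
To make the disguise-compute-recover pattern concrete, here is a toy Python sketch for outsourced matrix multiplication using random permutation matrices. It is deliberately simplified: pure permutations alone leak matrix entries, and the surveyed schemes add random scaling and noise on top of this skeleton:

```python
# Toy sketch of the disguise-compute-recover pattern in secure
# outsourcing: the client hides A and B under random permutations,
# the cloud multiplies the disguised matrices, and the client
# unmasks the product. Real schemes strengthen this with scaling
# and noise; plain permutations alone leak the entries.
import numpy as np

rng = np.random.default_rng(1)

def perm(n):
    return np.eye(n)[rng.permutation(n)]

n = 4
A, B = rng.random((n, n)), rng.random((n, n))
P1, P2, P3 = perm(n), perm(n), perm(n)

# client-side disguise (cheap: permutations only)
A_masked = P1 @ A @ P2.T
B_masked = P2 @ B @ P3.T

# cloud-side heavy computation on disguised data
C_masked = A_masked @ B_masked            # = P1 (A B) P3.T

# client-side recovery (permutations are orthogonal: P^-1 = P^T)
C = P1.T @ C_masked @ P3
assert np.allclose(C, A @ B)
```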

ATS_BDH16_007 - A Cloud Service Architecture for Analyzing Big Monitoring Data
          Cloud monitoring is a source of big data that is constantly produced from traces of infrastructures, platforms, and applications. Analysis of monitoring data delivers insights into the system's workload and usage patterns and ensures workloads are operating at optimum levels. The analysis process involves data query and extraction, data analysis, and result visualization. Since the volume of monitoring data is large, these operations require a scalable and reliable architecture to extract, aggregate, and analyze data at an arbitrary range of granularity. Ultimately, the results of the analysis become knowledge of the system and should be shared and communicated. This paper presents our cloud service architecture, which exploits a search cluster for data indexing and query. We develop REST APIs through which the data can be accessed by different analysis modules. This architecture enables extensions that integrate with software frameworks for both batch processing (such as Hadoop) and stream processing (such as Spark) of big data. The analysis results are structured in Semantic Media Wiki pages in the context of the monitoring data source and the analysis process. This cloud architecture is empirically assessed to evaluate its responsiveness when processing a large set of data records under node failures.

ATS_BDH16_008 - Data and Energy Integrated Communication Networks for Wireless Big Data
          This paper describes a new type of communication network called data and energy integrated communication networks (DEINs), which integrates two traditionally separate processes, wireless information transfer (WIT) and wireless energy transfer (WET), fulfilling the co-transmission of data and energy. In particular, the energy transmission using radio frequency serves the purpose of energy harvesting (EH) rather than information decoding. One driving force behind the advent of DEINs is wireless big data, which comes from wireless sensors that produce large amounts of small pieces of data. These sensors are typically powered by batteries that sooner or later drain and must be replaced or recharged. EH has emerged as a technology for charging batteries wirelessly, in a contactless way. Recent research has attempted to combine WET with WIT, typically under the label of simultaneous wireless information and power transfer. Such work largely focuses on the communication side of the whole wireless network, with particular emphasis on power allocation. The DEIN communication network proposed in this paper regards the convergence of WIT and WET as a full system that considers not only the physical layer but also the higher layers, such as media access control and information routing. After describing the DEIN concept and its high-level architecture/protocol stack, this paper presents two use cases focusing on the lower layer and the higher layer of a DEIN network, respectively. The lower-layer use case concerns a fair resource allocation algorithm, whereas the higher-layer use case introduces an efficient data forwarding scheme in combination with EH. The two case studies aim to give a better explanation of the DEIN concept. Some future research directions and challenges are also pointed out.
Figure: DEIN Overall Protocol Stack.

ATS_BDH16_009 - Wide Area Analytics for Geographically Distributed Datacenters
          Big data analytics, the process of organizing and analyzing data to get useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As the volume of data grows at a tremendous rate, it is less efficient for only one datacenter to handle such large volumes of data from a performance point of view. Large cloud service providers are deploying datacenters geographically around the world for better performance and availability. A widely used approach for analytics of geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters to a central datacenter. However, it has been observed that this approach consumes a significant amount of bandwidth, leading to worse performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we present a survey on the representative mechanisms proposed in the literature for wide area analytics. We discuss basic ideas, present proposed architectures and mechanisms, and discuss several examples to illustrate existing work. We point out the limitations of these mechanisms, give comparisons, and conclude with our thoughts on future research directions.

ATS_BDH16_0010 - Privacy Preserving Deep Computation Model on Cloud for Big Data Feature Learning
          To improve the efficiency of big data feature learning, this paper proposes a privacy-preserving deep computation model that offloads the expensive operations to the cloud. Privacy concerns arise because a large amount of private data is collected by various applications in the smart city, such as sensitive data of governments or proprietary information of enterprises. To protect the private data, the proposed model uses the BGV encryption scheme to encrypt the private data and employs cloud servers to perform the high-order back-propagation algorithm on the encrypted data efficiently for deep computation model training. Furthermore, the proposed scheme approximates the Sigmoid function by a polynomial function to support the secure computation of the activation function under BGV encryption. In our scheme, only the encryption and decryption operations are performed by the client, while all the computation tasks are performed on the cloud. Experimental results show that our scheme improves training efficiency by approximately 2.5 times compared to the conventional deep computation model on a cloud of ten nodes, without disclosing the private data. More importantly, our scheme is highly scalable by employing more cloud servers, which makes it particularly suitable for big data.
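
The polynomial substitution is the key trick: BGV-style homomorphic encryption evaluates only additions and multiplications, so the activation must be a polynomial. A minimal sketch comparing the true Sigmoid with its degree-3 Taylor approximation (the paper's exact polynomial may differ) follows:

```python
# Sketch: why a polynomial stand-in for Sigmoid matters under BGV.
# BGV evaluates only additions and multiplications, so the activation
# is replaced by a low-degree polynomial. A degree-3 Taylor expansion
# around 0 is shown; the paper's exact polynomial may differ.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_poly(x):
    # Taylor series of sigmoid at 0: 1/2 + x/4 - x^3/48 + ...
    return 0.5 + x / 4.0 - x**3 / 48.0

xs = np.linspace(-2, 2, 5)
for x, s, p in zip(xs, sigmoid(xs), sigmoid_poly(xs)):
    print(f"x={x:+.1f}  sigmoid={s:.4f}  poly={p:.4f}")
```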


ATS_BDH16_0011 - Assessing Big Data SQL Frameworks for Analyzing Event Logs
Performing Process Mining by analyzing event logs generated by various systems is a highly computation- and I/O-intensive task. Distributed computing and Big Data processing frameworks make it possible to distribute all kinds of computation tasks across multiple computers instead of performing the whole task on a single computer. This paper assesses whether contemporary Big Data processing frameworks that support structured query language (SQL) are mature enough to efficiently distribute the computation of two central Process Mining tasks to two dissimilar clusters of computers providing BPM as a service in the cloud. Tests are performed using a novel automatic testing framework detailed in this paper and its supporting materials. As a result, an assessment is made of how well selected Big Data processing frameworks manage to process and parallelize the analysis work required by Process Mining tasks.
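
As a flavor of what such SQL-on-Big-Data tests compute, the sketch below expresses one central Process Mining primitive, the directly-follows relation, in Spark's SQL layer via PySpark. The event-log schema (case_id, activity, ts) is a hypothetical minimal one, not the paper's test data:

```python
# Sketch: the directly-follows relation of an event log, a core
# Process Mining computation, expressed with Spark's SQL layer.
# The (case_id, activity, ts) schema is a hypothetical minimum.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dfg").getOrCreate()

log = spark.createDataFrame(
    [("c1", "register", 1), ("c1", "triage", 2), ("c1", "treat", 3),
     ("c2", "register", 1), ("c2", "treat", 2)],
    ["case_id", "activity", "ts"])

w = Window.partitionBy("case_id").orderBy("ts")
dfg = (log.withColumn("next_activity", F.lead("activity").over(w))
          .where(F.col("next_activity").isNotNull())
          .groupBy("activity", "next_activity").count())
dfg.show()  # edge counts of the directly-follows graph
```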

ATS_BDH16_0012 - A comparative study of various clustering techniques on big data sets using Apache Mahout
Clustering algorithms have emerged as a powerful tool for precisely examining the immense volume of data produced by present-day applications. Their main objective is to partition data into clusters such that objects in the same cluster are similar according to particular metrics and dissimilar to objects in other clusters. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts. Hadoop provides a distributed file system and an open-source implementation of MapReduce for dealing with big data, and Apache Mahout's clustering algorithms are implemented on top of Hadoop using the MapReduce paradigm. In this paper, three clustering algorithms implemented with Apache Mahout are described and compared: K-means, Fuzzy K-Means (FKM), and Canopy clustering. In addition, we highlight the clustering algorithms that perform best for big data.
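
Of the three, Canopy clustering is the least widely documented, so a single-machine sketch may help. Mahout runs the same idea as MapReduce jobs over HDFS; the thresholds T1 > T2 are user-chosen, and the values below are arbitrary:

```python
# Sketch: one Canopy clustering pass, single-machine with NumPy.
# Points within T2 of a canopy center are consumed by that canopy;
# points within T1 merely join it and stay available as centers.
import numpy as np

def canopy(points, t1=3.0, t2=1.0):
    assert t1 > t2                      # T1 loose, T2 tight
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)
        if remaining:
            d = np.linalg.norm(points[remaining] - points[center], axis=1)
        else:
            d = np.array([])
        members = [center] + [p for p, dist in zip(remaining, d) if dist < t1]
        canopies.append(members)
        remaining = [p for p, dist in zip(remaining, d) if dist >= t2]
    return canopies

pts = np.array([[0, 0], [0.5, 0], [5, 2], [5.2, 2.1], [10, 0]])
print(canopy(pts))   # -> [[0, 1], [2, 3], [4]]
```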

ATS_BDH16_0013 - Data analysis for chronic disease -diabetes using map reduce technique
Chronic diseases endure for a long period of time; they can only be controlled, not cured completely. Most people in the world are affected by chronic disease, and in countries such as the U.S. most deaths are due to chronic diseases such as allergy, cancer, asthma, heart disease, glaucoma, obesity, and viral diseases such as Hepatitis C and HIV/AIDS. Of all these, diabetes is the most hazardous. Diabetes means that blood glucose (blood sugar) is too high. It is categorized into two divisions: type 1 diabetes and type 2 diabetes. In type 1, the human body does not make insulin, so people with type 1 need to take insulin every day. In type 2, the glucose level in the blood is very high; it is one of the most common forms of diabetes, and people with type 2 diabetes need regular physical activity and a proper diet [8]. The analysis of the data is performed using the Big Data analytics framework Hadoop, which is designed to process large data sets; the analysis itself is done using the MapReduce algorithm.
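
A MapReduce analysis of this kind can be sketched as a Hadoop Streaming-style mapper/reducer pair in Python. The input layout (CSV with the diabetes category in the third column) is hypothetical:

```python
# Sketch: a Hadoop Streaming-style map/reduce pair counting patient
# records per diabetes category. Streaming pipes lines through
# stdin/stdout exactly as below; the CSV layout is hypothetical.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        fields = line.rstrip("\n").split(",")
        category = fields[2]              # e.g. "type1" or "type2"
        print(f"{category}\t1")

def reducer(lines):
    # Hadoop delivers mapper output sorted by key
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{key}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    role = sys.argv[1]                    # "map" or "reduce"
    (mapper if role == "map" else reducer)(sys.stdin)
```

With Hadoop Streaming, the same script would be wired in via `-mapper "python job.py map" -reducer "python job.py reduce"`; Hadoop sorts the mapper output by key before the reducer sees it.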

ATS_BDH16_0014 - Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments
Cloud computing has become an increasingly popular model for delivering applications hosted in large data centers as subscription-oriented services. Hadoop is a popular system supporting MapReduce, which plays a crucial role in cloud computing. The resources required for executing jobs in a large data center vary according to the job type. In Hadoop, jobs are scheduled by default on a first-come-first-served basis, which may unbalance resource utilization. This paper proposes a job scheduler called the job allocation scheduler (JAS), designed to balance resource utilization. For various job workloads, the JAS categorizes jobs and then assigns tasks to a CPU-bound queue or an I/O-bound queue. However, the JAS exhibited a locality problem, which was addressed by developing a modified JAS called the job allocation scheduler with locality (JASL). The JASL improved the use of nodes and the performance of Hadoop in heterogeneous computing environments. Finally, two parameters were added to the JASL to detect inaccurate slot settings, creating a dynamic job allocation scheduler with locality (DJASL). The DJASL exhibited superior performance to the JAS and data locality similar to that of the JASL.

ATS_BDH16_0015 - Evaluate H2Hadoop and Amazon EMR performances by processing MR jobs in text data sets
Text data consists of sequences of characters with no specific format and can grow into big data that must be processed with frameworks such as Hadoop. Amazon Web Services (AWS) provides virtual cloud computing services, including data storage through the S3 service and big data processing through the EMR service. Amazon Elastic MapReduce (EMR) uses the original Hadoop as the processing environment for its cloud computing services. H2Hadoop, in turn, is an enhanced version of Hadoop that uses the metadata of related jobs to improve Hadoop's performance. In this paper, we run a sequence-finding job on text data using both Amazon EMR and H2Hadoop and present a comparison showing that H2Hadoop performs more efficiently than Amazon EMR in some cases under different considerations.

ATS_BDH16_0016 - An improved HDFS for small file
Hadoop is an open-source distributed computing platform, and HDFS is the Hadoop distributed file system. HDFS has a powerful data storage capacity and is therefore well suited to cloud storage systems. However, HDFS was originally developed for streaming access to large files, so it has low storage efficiency for massive numbers of small files. To solve this problem, the HDFS file storage process is improved: files are judged before being uploaded to the HDFS cluster, and if a file is small, it is merged and the index information of the small file is stored in an index file in the form of key-value pairs. Simulations show that the improved HDFS has lower NameNode memory consumption than both the original HDFS and Hadoop Archives (HAR files), and can thus improve access efficiency.
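
The merge-and-index idea can be sketched independently of HDFS: pack many small files into one large file and keep a key-value index of (offset, length) per original name. The paper applies this at HDFS upload time; the sketch below uses the local filesystem and a JSON index for clarity:

```python
# Sketch of the merge-and-index idea for small files: pack many
# small files into one blob and keep a key-value index of
# (offset, length) per original name. Shown on the local
# filesystem; the paper does this at HDFS upload time.
import json

def merge(small_files, blob_path, index_path):
    index, offset = {}, 0
    with open(blob_path, "wb") as blob:
        for name, data in small_files.items():
            blob.write(data)
            index[name] = {"offset": offset, "length": len(data)}
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f)               # key-value index file

def read_small_file(name, blob_path, index_path):
    with open(index_path) as f:
        entry = json.load(f)[name]
    with open(blob_path, "rb") as blob:
        blob.seek(entry["offset"])
        return blob.read(entry["length"])

merge({"a.txt": b"hello", "b.txt": b"world"}, "merged.bin", "index.json")
print(read_small_file("b.txt", "merged.bin", "index.json"))  # b'world'
```

One merged blob plus a small index keeps the NameNode's per-file metadata cost constant no matter how many small files are packed inside.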

ATS_BDH16_0017 - Profiling apache HIVE query from run time logs
Apache Hive is a widely used data warehousing and analysis tool. Developers write SQL-like Hive queries, which are converted into MapReduce programs that run on a cluster. Despite Hive's popularity, there is little research on performance comparison and diagnosis, partly because the instrumentation techniques used to monitor execution cannot be applied to the intermediate MapReduce code generated from a Hive query. Because the generated MapReduce code is hidden from developers, run-time logs are the only place a developer can get a glimpse of the actual execution, so an automatic tool that extracts information and generates reports from logs is essential to understanding query execution behavior. We designed a tool that builds the execution profile of individual Hive queries by extracting information from Hive and Hadoop logs. The profile consists of detailed information about the MapReduce jobs, tasks, and attempts belonging to a query. It is stored as a JSON document in MongoDB and can be retrieved to generate reports as charts or tables. We have run several experiments on AWS with TPC-H data sets and queries to demonstrate that our profiling tool can assist developers in comparing Hive queries written in different formats, running on different data sets, and configured with different parameters. It can also compare tasks and attempts within the same job to diagnose performance issues.
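
A stripped-down version of such a profiler can be sketched in Python: parse timing lines out of the logs with a regular expression and store the resulting profile as a JSON document via pymongo. The log-line format below is a hypothetical simplification (real Hive/Hadoop logs need many more extraction rules), and a local MongoDB instance is assumed:

```python
# Sketch: building a per-query profile from run-time logs and storing
# it as a JSON document in MongoDB. The log-line pattern below is a
# hypothetical simplification of real Hive/Hadoop log formats.
import re
from pymongo import MongoClient

LINE = re.compile(
    r"job_(?P<job>\w+)\s+task_(?P<task>\w+)\s+(?P<phase>MAP|REDUCE)"
    r"\s+took\s+(?P<ms>\d+)ms")

def build_profile(query_id, log_lines):
    profile = {"query_id": query_id, "jobs": {}}
    for line in log_lines:
        m = LINE.search(line)
        if m:
            job = profile["jobs"].setdefault(m["job"], {"tasks": []})
            job["tasks"].append({"task": m["task"],
                                 "phase": m["phase"],
                                 "millis": int(m["ms"])})
    return profile

logs = ["job_001 task_01 MAP took 420ms",
        "job_001 task_02 REDUCE took 910ms"]
doc = build_profile("q42", logs)
MongoClient()["hive_profiles"]["queries"].insert_one(doc)  # local mongod
```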

ATS_BDH16_0018 - An efficient key partitioning scheme for heterogeneous MapReduce clusters
Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. Hadoop's MapReduce system suffers from variance in both the intermediate keys' frequencies and their distribution among data nodes throughout the cluster. This variance causes network overhead and leads to unfairness in the reduce input among the data nodes, so applications experience performance degradation during the shuffle phase. We develop a novel algorithm that, unlike previous systems, uses a node's capabilities as heuristics to find a better trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen algorithm, our approach achieves average performance gains of 29% and 17%, respectively.
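
The fairness side of such a capability-aware partitioner can be sketched as a greedy assignment: heaviest keys first, each to the node with the lowest load relative to its capability. This is an illustrative simplification; the paper's heuristic also weighs data locality:

```python
# Sketch: capability-aware key partitioning. Keys (with their
# intermediate-record frequencies) are assigned greedily so that
# faster nodes receive proportionally more reduce input. The
# locality term of the paper's heuristic is omitted here.
def partition_keys(key_freq, node_capability):
    load = {n: 0.0 for n in node_capability}
    assignment = {}
    # heaviest keys first, classic greedy balancing
    for key, freq in sorted(key_freq.items(), key=lambda kv: -kv[1]):
        # pick node with lowest load relative to its capability
        node = min(load, key=lambda n: load[n] / node_capability[n])
        assignment[key] = node
        load[node] += freq
    return assignment, load

keys = {"k1": 900, "k2": 500, "k3": 400, "k4": 300, "k5": 100}
caps = {"fast_node": 2.0, "slow_node": 1.0}
assignment, load = partition_keys(keys, caps)
print(assignment)   # fast_node ends up with the larger share
print(load)
```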

ATS_BDH16_0019 - FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms: given a large dataset, the data partitioning strategies in the existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by virtue of eliminating redundant transactions on Hadoop nodes, and improves the performance of the existing parallel frequent-pattern scheme by up to 31%, with an average of 18%.
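
The MinHash/LSH idea at the heart of similarity-aware partitioning is easy to demonstrate: two transactions agree on a random min-hash with probability equal to their Jaccard similarity, so signature bands can route similar transactions to the same partition. The sketch below shows only the MinHash estimate, not the Voronoi step:

```python
# Sketch: MinHash, the locality-sensitive core of similarity-aware
# partitioning. Two item sets agree on a random min-hash with
# probability equal to their Jaccard similarity. FiDoop-DP combines
# this with Voronoi-based partitioning; only MinHash is shown.
import random

random.seed(7)
N_HASH = 128
P = 2_147_483_647
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(N_HASH)]

def signature(items):
    # hash() stands in for a proper item-id hash
    return [min((a * hash(x) + b) % P for x in items) for a, b in HASHES]

def estimated_jaccard(s1, s2):
    m1, m2 = signature(s1), signature(s2)
    return sum(u == v for u, v in zip(m1, m2)) / N_HASH

t1 = {"milk", "bread", "eggs"}
t2 = {"milk", "bread", "butter"}
true_j = len(t1 & t2) / len(t1 | t2)
print(f"true Jaccard={true_j:.2f}  estimate={estimated_jaccard(t1, t2):.2f}")
```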

ATS_BDH16_0020 - Defining Human Behaviors Using Big Data Analytics in Social Internet of Things
          As we delve into the Internet of Things (IoT), we are witnessing intensive interaction and heterogeneous communication among different devices over the Internet. Consequently, these devices generate a massive volume of Big Data. The potential of these data has been analyzed by complex network theory, which describes a specialized branch known as 'Human Dynamics.' In this extension, the goal is to describe human behavior in the social area in real time. These objectives are becoming practicable thanks to the quantity of data provided by smartphones, social networks, and smart cities, which make the environment more intelligent and offer an intelligent space for sensing our activities and actions and the evolution of the ecosystem. To address the aforementioned needs, this paper presents the concept of 'defining human behavior' using Big Data in the Social Internet of Things (SIoT) by proposing a system architecture that processes and analyzes big data in real time. The proposed architecture consists of three operational domains: the object, SIoT server, and application domains. Data from the object domain are aggregated at the SIoT server domain, where the data are efficiently stored and processed and the system responds intelligently to outer stimuli. The proposed system architecture focuses on analyzing the ecosystem provided by smart cities, wearable devices (e.g., body area networks), and Big Data to determine human behaviors as well as human dynamics. Furthermore, the feasibility and efficiency of the proposed system are demonstrated on a single-node Hadoop setup on an Ubuntu 14.04 LTS Core i5 machine with a 3.2 GHz processor and 4 GB of memory.

ATS_BDH16_0021 - A Big Data Clustering Algorithm for Mitigating the Risk of Customer Churn
As market competition intensifies, customer churn management is increasingly becoming an important means of competitive advantage for companies. However, when dealing with big data in the industry, existing churn prediction models cannot work very well. In addition, decision makers are always faced with imprecise operations management. In response to these difficulties, a new clustering algorithm called semantic-driven subtractive clustering method (SDSCM) is proposed. Experimental results indicate that SDSCM has stronger clustering semantic strength than subtractive clustering method (SCM) and fuzzy c-means (FCM). Then, a parallel SDSCM algorithm is implemented through a Hadoop MapReduce framework. In the case study, the proposed parallel SDSCM algorithm enjoys a fast running speed when compared with the other methods. Furthermore, we provide some marketing strategies in accordance with the clustering results and a simplified marketing activity is simulated to ensure profit maximization.
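
For orientation, here is the baseline subtractive clustering method (SCM) that SDSCM extends, in single-machine NumPy form. SDSCM's semantic-driven weighting and the Hadoop MapReduce parallelization are not shown:

```python
# Sketch: baseline subtractive clustering (SCM). Each point's
# potential is the summed Gaussian influence of all other points;
# the highest-potential point becomes a center, then nearby
# potential is subtracted and the step repeats.
import numpy as np

def scm(points, ra=1.0, n_centers=2):
    rb = 1.5 * ra                          # standard revision radius
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    potential = np.exp(-4.0 * d2 / ra**2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        c = int(np.argmax(potential))
        centers.append(points[c])
        # subtract the chosen center's influence around itself
        potential -= potential[c] * np.exp(-4.0 * d2[c] / rb**2)
    return np.array(centers)

pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])
print(scm(pts, ra=2.0))   # one center near each blob
```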

ATS_BDH16_0022 - A Big Data Scale Algorithm for Optimal Scheduling of Integrated Microgrids
The capability of switching into the islanded operation mode of microgrids has been advocated as a viable solution to achieve high system reliability. This paper proposes a new model for the microgrids optimal scheduling and load curtailment problem. The proposed problem determines the optimal schedule for local generators of microgrids to minimize the generation cost of the associated distribution system in the normal operation. Moreover, when microgrids have to switch into the islanded operation mode due to reliability considerations, the optimal generation solution still guarantees for the minimal amount of load curtailment. Due to the large number of constraints in both normal and islanded operations, the formulated problem becomes a large-scale optimization problem and is very challenging to solve using the centralized computational method. Therefore, we propose a decomposition algorithm using the alternating direction method of multipliers (ADMM) that provides a parallel computational framework. The simulation results demonstrate the efficiency of our proposed model in reducing generation cost as well as guaranteeing the reliable operation of microgrids in the islanded mode. We finally describe the detailed implementation of parallel computation for our proposed algorithm to run on a computer cluster using the Hadoop MapReduce software framework.
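
The computational pattern is the standard consensus-ADMM loop, which maps naturally onto MapReduce: local subproblems are solved in parallel (map), then the consensus and dual updates aggregate the results (reduce). A minimal sketch for separable quadratic costs, with all grid constraints omitted, looks like this:

```python
# Sketch: the consensus-ADMM pattern behind the decomposition.
# Each local objective f_i(x) = a_i (x - c_i)^2 is minimized in
# parallel (the "map" step); the averaging and dual updates form
# the "reduce" step. All grid constraints are omitted here.
import numpy as np

a = np.array([1.0, 2.0, 4.0])    # local cost curvatures
c = np.array([3.0, 1.0, -2.0])   # local cost minimizers
rho, n = 1.0, len(a)
x, u, z = np.zeros(n), np.zeros(n), 0.0

for _ in range(100):
    # x-update: closed form of argmin a_i(x-c_i)^2 + rho/2 (x - z + u_i)^2
    x = (2 * a * c + rho * (z - u)) / (2 * a + rho)
    z = np.mean(x + u)           # consensus (global) variable
    u = u + x - z                # dual ascent

# consensus minimizer of sum a_i (x - c_i)^2 is sum(a*c)/sum(a)
print(z, (a * c).sum() / a.sum())   # both ~ -0.43
```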

ATS_BDH16_0023 - Secure inter cloud data migration
Cloud computing services are becoming more and more popular. However, the high concentration of data and services on the clouds make them attractive targets for various security attacks, including DoS, data theft, and privacy attacks. Additionally, cloud providers may fail to comply with service level agreement in terms of performance, availability, and security guarantees. Therefore, it is of paramount importance to have secure and efficient mechanisms that enable users to transparently copy and move their data from one provider to another. In this paper, we explore the state-of-the-art inter-cloud migration techniques and identify the potential security threats in the scope of Hadoop Distributed File System HDFS. We propose an inter-cloud data migration mechanism that offers better security guarantees and faster response time for migrating large scale data files in cloud database management systems. The performance of the proposed approach is validated by measuring its impact on response time and throughput, and comparing the performance to that of other techniques in the literature. The results show that our approach significantly improves the performance of HDFS and outperforms its counterparts.

ATS_BDH16_0024 - Dynamic Resource Allocation for MapReduce with Partitioning Skew
MapReduce has become a prevalent programming model for building data processing applications in the cloud. While being widely used, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. Existing solutions follow a similar principle that repartitions workload among reduce tasks. However, those approaches often incur high performance overhead due to the partition size prediction and repartitioning. In this paper, we present DREAMS, a framework that provides run-time partitioning skew mitigation. Instead of repartitioning workload among reduce tasks, we cope with the partitioning skew problem by controlling the amount of resources allocated to each reduce task. Our approach completely eliminates the repartitioning overhead, yet is simple to implement. Experiments using both real and synthetic workloads running on a 21-node Hadoop cluster demonstrate that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving the job completion time by up to a factor of 2.29 over the native Hadoop YARN. Compared to the state-of-the-art solution, DREAMS can improve the job completion time by a factor of 1.65.
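
The core idea, scaling each reduce task's container to its predicted partition size rather than repartitioning, can be sketched as below. The YARN integration is omitted and all sizes and bounds are hypothetical:

```python
# Sketch: resource-proportional containers instead of repartitioning.
# Each reduce task gets memory/cores scaled to its predicted
# partition size. YARN integration omitted; numbers hypothetical.
def allocate_containers(partition_mb, base_mem_mb=1024, base_vcores=1,
                        max_mem_mb=8192):
    mean = sum(partition_mb) / len(partition_mb)
    plans = []
    for size in partition_mb:
        skew = size / mean                      # >1 means a "hot" partition
        mem = min(int(base_mem_mb * skew), max_mem_mb)
        plans.append({"mem_mb": mem,
                      "vcores": max(base_vcores, round(base_vcores * skew))})
    return plans

# one skewed partition among four
for p in allocate_containers([200, 220, 210, 900]):
    print(p)
```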

ATS_BDH16_0025 - Fuzzy Based Clustering Algorithms to Handle Big Data with Implementation on Apache Spark
With the advancement of technology, a huge amount of data containing useful information, called Big Data, is generated on a daily basis. Processing such a tremendous volume of data requires Big Data frameworks such as Hadoop MapReduce and Apache Spark; among these, Apache Spark performs up to 100 times faster than conventional frameworks like Hadoop MapReduce. For the effective analysis and interpretation of this data, scalable Machine Learning methods are required to overcome space and time bottlenecks. Partitional clustering algorithms are widely adopted by researchers for clustering large datasets due to their low computational requirements. Thus, we focus on the design of a partitional clustering algorithm and its implementation on Apache Spark. In this paper, we propose a partitional clustering algorithm called the Scalable Random Sampling with Iterative Optimization Fuzzy c-Means algorithm (SRSIO-FCM), implemented on Apache Spark to handle the challenges associated with Big Data clustering. Experiments are performed on several big datasets to show the effectiveness of SRSIO-FCM in comparison with a proposed scalable version of the Literal Fuzzy c-Means (LFCM) called SLFCM, also implemented on Apache Spark. Comparative results are reported in terms of the F-measure, ARI, objective function value, run time, and scalability. The reported results show the great potential of SRSIO-FCM for Big Data clustering.
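
The two Fuzzy c-Means updates that SRSIO-FCM and SLFCM distribute over Spark are shown below in single-machine NumPy form; the random-sampling and Spark-partitioning machinery of the paper is not included:

```python
# Sketch: the two Fuzzy c-Means updates (centers and memberships)
# at the core of SRSIO-FCM/SLFCM, on one machine with NumPy.
# m is the fuzzifier (m=2 is the usual choice).
import numpy as np

def fcm(X, c=2, m=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)      # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        ratio = d[:, :, None] / d[:, None, :]
        U = 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)
    return centers, U

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
centers, _ = fcm(X)
print(centers)       # one center near (0, 0), one near (5, 5)
```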
