Sunday, May 15, 2016

DATA ENGINEERING

ATS_DE16_001 - Incremental and Decremental Max-flow for Online Semi-supervised Learning
            In classification, if a small number of instances is added or removed, incremental and decremental techniques can be applied to quickly update the model. However, the design of incremental and decremental algorithms involves many considerations. In this paper, we focus on linear classifiers including logistic regression and linear SVM because of their simplicity over kernel or other methods. By applying a warm start strategy, we investigate issues such as using primal or dual formulation, choosing optimization methods, and creating practical implementations. Through theoretical analysis and practical experiments, we conclude that a warm start setting on a high-order optimization method for primal formulations is more suitable than others for incremental and decremental learning of linear classification.
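
As a rough illustration of the warm start idea described above, the following Python sketch trains an L2-regularized logistic regression on a small data set, adds one instance, and then retrains twice: once from zeros (cold start) and once from the previous solution (warm start). The toy data, the function names (`gradient`, `train`), the step size, and the tolerance are assumptions made for illustration. Plain gradient descent on the primal problem is used for brevity; it is not the high-order method the abstract recommends.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(w, X, y, lam):
    # Gradient of sum_i log(1 + exp(-y_i * w.x_i)) + (lam / 2) * ||w||^2.
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        coeff = -yi * sigmoid(-margin)
        for j, xj in enumerate(xi):
            g[j] += coeff * xj
    return g

def train(X, y, w0, lam=1.0, step=0.05, tol=1e-6, max_iter=100000):
    # Gradient descent from the starting point w0.
    # Returns the weights and the number of update steps taken.
    w = list(w0)
    for it in range(max_iter):
        g = gradient(w, X, y, lam)
        if math.sqrt(sum(gj * gj for gj in g)) < tol:
            return w, it
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w, max_iter

# Hypothetical toy data: first column is a constant bias feature.
X = [[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 1.5, 1.5], [1, -1.0, -1.5],
     [1, -2.0, -1.0], [1, -1.5, -2.0], [1, 0.5, -0.5], [1, -0.5, 0.5]]
y = [1, 1, 1, -1, -1, -1, 1, -1]

# Model trained on the original data.
w_old, _ = train(X, y, [0.0, 0.0, 0.0])

# Incremental update: one new instance arrives.
X_new = X + [[1, 1.2, 0.8]]
y_new = y + [1]

w_cold, cold_iters = train(X_new, y_new, [0.0, 0.0, 0.0])  # start from scratch
w_warm, warm_iters = train(X_new, y_new, w_old)            # start from old model

print("cold start iterations:", cold_iters)
print("warm start iterations:", warm_iters)
```

On this toy data the warm-started run should stop after fewer iterations while reaching the same solution. A decremental update works the same way: drop the removed row and restart from the previous weights.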


ATS_DE16_002 - Personalized Influential Topic Search via Social Network Summarization

          Social networks are a vital mechanism for disseminating information to friends and colleagues. In this work, we investigate an important problem, the personalized influential topic search (PIT-Search) in a social network: given a keyword query q issued by a user u in a social network, a PIT-Search finds the top-k q-related topics that are most influential for the query user u. The influence of a topic on a query user depends on the social connection between the query user and the social users containing the topic in the social network. To measure the topics’ influence at a similar granularity scale, we need to extract the social summarization of the social network regarding topics. To make effective topic-aware social summarization, we propose two random-walk based approaches: random clustering and an L-length random walk. Based on the proposed approaches, we can find a small set of representative users with assigned influential scores to simulate the influence of the large number of topic users in the social network with regard to the topic. The selected representative users are denoted as the social summarization of topic-aware influence spread over the social network. We then verify the usefulness of the social summarization by applying it to the problem of personalized influential topic search. Finally, we evaluate the performance of our algorithms using real-world datasets and show that the approach is efficient and effective in practice.


ATS_DE16_003 - Survey on Aspect-Level Sentiment Analysis

           The field of sentiment analysis, in which sentiment is gathered, analyzed, and aggregated from text, has seen a lot of attention in the last few years. The corresponding growth of the field has resulted in the emergence of various subareas, each addressing a different level of analysis or research question. This survey focuses on aspect-level sentiment analysis, where the goal is to find and aggregate sentiment on entities mentioned within documents or aspects of them. An in-depth overview of the current state of the art is given, showing the tremendous progress that has already been made in finding both the target, which can be an entity as such or some aspect of it, and the corresponding sentiment. Aspect-level sentiment analysis yields very fine-grained sentiment information which can be useful for applications in various domains. Current solutions are categorized based on whether they provide a method for aspect detection, sentiment analysis, or both. Furthermore, a breakdown based on the type of algorithm used is provided. For each discussed study, the reported performance is included. To facilitate the quantitative evaluation of the various proposed methods, a call is made for the standardization of the evaluation methodology that includes the use of shared data sets. Semantically rich concept-centric aspect-level sentiment analysis is discussed and identified as one of the most promising future research directions.

ATS_DE16_004 - Multilabel Classification via Co-evolutionary Multilabel Hypernetwork
           Multilabel classification is prevalent in many real-world applications where data instances may be associated with multiple labels simultaneously. In multilabel classification, exploiting label correlations is an essential but nontrivial task. Most of the existing multilabel learning algorithms are either ineffective or computationally demanding and less scalable in exploiting label correlations. In this paper, we propose a co-evolutionary multilabel hypernetwork (Co-MLHN) as an attempt to exploit label correlations in an effective and efficient way. To this end, we first convert the traditional hypernetwork into a multilabel hypernetwork (MLHN) where label correlations are explicitly represented. We then propose a co-evolutionary learning algorithm to learn an integrated classification model for all labels. The proposed Co-MLHN exploits arbitrary order label correlations and has linear computational complexity with respect to the number of labels. Empirical studies on a broad range of multilabel data sets demonstrate that Co-MLHN achieves competitive results against state-of-the-art multilabel learning algorithms, in terms of both classification performance and scalability with respect to the number of labels.

ATS_DE16_005 - Answering Pattern Queries Using Views
          Answering queries using views has proven effective for querying relational and semistructured data. This paper investigates this issue for graph pattern queries based on graph simulation. We propose a notion of pattern containment to characterize graph pattern matching using graph pattern views. We show that a pattern query can be answered using a set of views if and only if it is contained in the views. Based on this characterization, we develop efficient algorithms to answer graph pattern queries. We also study problems for determining (minimal, minimum) containment of pattern queries. We establish their complexity (from cubic time to NP-complete) and provide efficient checking algorithms (approximation when the problem is intractable). In addition, when a pattern query is not contained in the views, we study maximally contained rewriting to find approximate answers; we show that such a rewriting can be computed in cubic time, and present a rewriting algorithm. We experimentally verify that these methods are able to efficiently answer pattern queries on large real-world graphs.
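
For readers unfamiliar with graph simulation, the matching semantics the abstract builds on, here is a minimal Python sketch of the standard fixpoint computation of the maximum simulation relation between a pattern and a data graph. It does not implement the view-based answering or the containment checks from the paper. The function name `graph_simulation`, the node labels, and the tiny example graph are all hypothetical.

```python
def graph_simulation(p_labels, p_edges, g_labels, g_edges):
    """Return, for each pattern node, the set of data-graph nodes matching it
    under graph simulation (all sets empty if the pattern has no match)."""
    # Successor sets of the data graph.
    succ = {v: set() for v in g_labels}
    for a, b in g_edges:
        succ[a].add(b)

    # Start with every data node that carries the same label.
    sim = {u: {v for v, lab in g_labels.items() if lab == p_labels[u]}
           for u in p_labels}

    # Refine until nothing changes: for each pattern edge (u, u2), a match v
    # of u must have at least one successor that matches u2.
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True

    # The pattern matches only if every pattern node has a match.
    if any(not s for s in sim.values()):
        return {u: set() for u in p_labels}
    return sim

# Hypothetical pattern: a project manager (PM) who supervises a developer (DEV).
pattern_labels = {"pm": "PM", "dev": "DEV"}
pattern_edges = [("pm", "dev")]

# Hypothetical data graph.
graph_labels = {1: "PM", 2: "DEV", 3: "PM", 4: "DEV", 5: "QA"}
graph_edges = [(1, 2), (3, 5), (4, 1)]

result = graph_simulation(pattern_labels, pattern_edges,
                          graph_labels, graph_edges)
print(result)
```

In this example node 3 is dropped from the matches of `pm` because its only successor is a QA node, while node 1 is kept because it points to a DEV node.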

ATS_DE16_006 - Similarity Measure Selection for Clustering Time Series Databases
          In the past few years, clustering has become a popular task associated with time series. The choice of a suitable distance measure is crucial to the clustering process and, given the vast number of distance measures for time series available in the literature and their diverse characteristics, this selection is not straightforward. With the objective of simplifying this task, we propose a multi-label classification framework that provides the means to automatically select the most suitable distance measures for clustering a time series database. This classifier is based on a novel collection of characteristics that describe the main features of the time series databases and provide the predictive information necessary to discriminate between a set of distance measures. In order to test the validity of this classifier, we conduct a complete set of experiments using both synthetic and real time series databases and a set of five common distance measures. The positive results obtained by the designed classification framework for various performance measures indicate that the proposed methodology is useful to simplify the process of distance selection in time series clustering tasks.
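
To show why the choice of distance measure matters, the Python sketch below compares two distances on a pair of series with the same shape but shifted in time. Euclidean distance and dynamic time warping (DTW) are chosen here as assumed examples; the abstract does not name its five measures. The function names and the sample series are hypothetical.

```python
import math

def euclidean(a, b):
    # Point-by-point distance; requires series of equal length.
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    # Classic dynamic-programming DTW with absolute difference as local cost
    # and no warping-window constraint.
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # repeat b[j-1]
                                 D[i][j - 1],      # repeat a[i-1]
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

# Same peak shape, shifted by one time step.
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]

print("Euclidean:", euclidean(a, b))
print("DTW:", dtw(a, b))
```

Euclidean distance penalizes the shift, while DTW aligns the two peaks and reports no difference. A clustering built on one measure can therefore group series quite differently from a clustering built on the other.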

ATS_DE16_007 - Incremental Consolidation of Data-Intensive Multi-Flows
         Business intelligence (BI) systems depend on efficient integration of disparate and often heterogeneous data. The integration of data is governed by data-intensive flows and is driven by a set of information requirements. Designing such flows is in general a complex process, which, due to the complexity of business environments, is hard to do manually. In this paper, we deal with the challenge of efficient design and maintenance of data-intensive flows and propose an incremental approach, namely CoAl, for semi-automatically consolidating data-intensive flows satisfying a given set of information requirements. CoAl works at the logical level and consolidates data flows from either high-level information requirements or platform-specific programs. As CoAl integrates a new data flow, it opts for maximal reuse of existing flows and applies a customizable cost model tuned for minimizing the overall cost of a unified solution. We demonstrate the efficiency and effectiveness of our approach through an experimental evaluation using our implemented prototype.
