The First ScAi Invitational Symposium

May 30-31, 2014
Boelter Hall 3551P, UCLA

The Rise of Data Science: From Information Management to Big Data Analytics

Foreword by Carlo Zaniolo (UCLA)
This symposium is being held to celebrate our new Scalable Analytics Institute and the scientific advances that led to it. Indeed, it is not without some pride and emotion that I recollect the rise of data science from the humble origins I witnessed when I began my graduate studies. Back then, data was viewed as mere raw material for production tasks such as payroll, and research focused mostly on physical structures and efficient processing. A big leap forward came with the relational data model, which introduced logic-based query languages and logical schemata designed to capture the relationships and dependencies known to hold in the real world. An interesting role reversal then occurred as databases grew and were used to discover useful, previously unknown statistical relationships and dependencies that hold in the real world. This discovery process is made possible by the design of powerful analytics and their application to the massive datasets that now permeate many areas of human activity, from which major benefits to science and the economy are expected. The objective of ScAI is therefore the development of more advanced analytics, and of techniques that ensure their high-performance, scalable support on parallel systems.

A Unified Big Data Runtime Stack

Tyson Condie (UCLA)    slides
In this talk, I will present ongoing work on building a unified Big Data runtime stack that supports a broad range of tasks, e.g., ETL, machine learning, and big graph analytics. A key enabler is the REEF framework, which provides a centralized control plane for implementing a decentralized data plane. This work is in collaboration with the Cloud and Information Services Lab (CISL) at Microsoft and the ScAi Lab.

Recursive Queries with Monotonic Aggregates in DeAL

Alexander Shkapsky (UCLA)
Recent theoretical work has identified a class of aggregates that are monotonic w.r.t. set containment and can therefore be used freely in recursive rules while preserving the least-fixpoint and model-theoretic semantics of Datalog programs. In this talk, I will present the design and implementation of monotonic aggregate functions based on this theoretical work. We will review the syntax and semantics of the monotonic aggregates implemented in the Deductive Application Language (DeAL), and present example programs demonstrating DeAL's capacity to express and efficiently support a wide range of advanced applications, including graph queries, graph analytics, and KDD algorithms. Lastly, we will discuss optimization techniques enabled by the monotonicity of the aggregates and review experimental results showing the effectiveness of these logical extensions in applications.
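To give a flavor of the idea (this is a plain Python sketch, not DeAL syntax): an aggregate such as min over path costs is monotonic, because each fixpoint iteration can only improve (lower) the aggregate value, so the naive bottom-up computation below converges to the least fixpoint exactly as for ordinary recursive Datalog rules.

```python
def shortest_paths(edges):
    """edges: iterable of (src, dst, cost) with non-negative costs.
    dist plays the role of a recursive predicate with a monotonic
    min aggregate over path costs."""
    dist = {}
    for s, d, c in edges:
        if (s, d) not in dist or c < dist[(s, d)]:
            dist[(s, d)] = c
    changed = True
    while changed:                      # iterate to the least fixpoint
        changed = False
        for (s, m), c1 in list(dist.items()):
            for (m2, d), c2 in list(dist.items()):
                if m == m2:
                    new = c1 + c2
                    if (s, d) not in dist or new < dist[(s, d)]:
                        dist[(s, d)] = new   # monotone improvement only
                        changed = True
    return dist

paths = shortest_paths([("a", "b", 1), ("b", "c", 2), ("a", "c", 9)])
# paths[("a", "c")] == 3: the direct edge of cost 9 is improved to 1 + 2
```

Because updates only ever lower costs, the order of rule application does not affect the final result, which is what licenses the use of such aggregates inside recursion.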

Parallel and Distributed Recursive Query Evaluation

Mohan Yang (UCLA)    slides
There is a resurgence of interest in using Datalog for big data analytics, and the emergence of new computational environments for big data analytics has prompted a re-evaluation of classical Datalog evaluation algorithms in these environments. In this talk, I will present the results of our experimental study on finding efficient recursive query evaluation algorithms in two widely used computational environments: a multi-core shared-memory environment and a MapReduce environment. We review semi-naive evaluation, the Smart algorithm, a pair of single-source closure (SSC) algorithms, and the Floyd-Warshall algorithm. We also propose a new hybrid SSC algorithm, SSC12, which combines the two previously known SSC algorithms. We compare the memory requirements, scalability, and execution time of these algorithms on synthetic and real-life graphs. Our experimental results on transitive closure and all-pairs shortest path queries show that the surprisingly simple SSC12 algorithm is the only evaluated algorithm that consistently performs well on all test graphs that fit in the memory of a single machine, while the Smart algorithm is the best choice when the graph does not fit in the memory of a single machine.
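For readers unfamiliar with semi-naive evaluation, here is a minimal Python sketch of it for transitive closure: each round joins only the newly derived facts (the delta) with the base relation, rather than re-deriving everything from scratch.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of transitive closure: in each round,
    join only the delta (newly derived pairs) with the base edges."""
    edges = set(edges)
    closure = set(edges)
    delta = set(edges)
    while delta:
        new = set()
        for (a, b) in delta:
            for (c, d) in edges:
                if b == c and (a, d) not in closure:
                    new.add((a, d))
        closure |= new
        delta = new                      # next round starts from new facts only
    return closure

tc = transitive_closure([(1, 2), (2, 3), (3, 4)])
# adds (1, 3), (2, 4), (1, 4) to the three base edges
```

The SSC and Smart algorithms discussed in the talk restructure this basic loop to reduce memory traffic and exploit parallelism; the sketch above only illustrates the baseline they are compared against.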

Scalable Approximate Query Processing through Scalable Error Estimation

Kai Zeng (UCLA)    slides
Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP), an area of research now made more critical by the need for timely and cost-effective analytics over “Big Data”. Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from high computational overhead. In this talk, I will introduce my previous projects on exploiting and improving the bootstrap to develop scalable error estimation systems, focusing on the most recent project: the analytical bootstrap.
In the analytical bootstrap, we propose a probabilistic relational model of the bootstrap process, with rigorous semantics and a unified error model, which bridges the gap between the analytical approach and the bootstrap approach. Based on this probabilistic framework, we develop efficient algorithms to predict the error distribution of the approximate results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that the analytical bootstrap has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than the bootstrap.
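As background, the simulation-based bootstrap that the analytical approach seeks to avoid looks like the following toy sketch: the error of an approximate answer (here, a sample mean) is estimated by resampling the sample with replacement many times and measuring the spread of the recomputed answers.

```python
import random
import statistics

def bootstrap_stderr(sample, n_resamples=1000, seed=0):
    """Classic simulation-based bootstrap: estimate the standard error
    of the sample mean by resampling with replacement. The analytical
    bootstrap aims to predict this error distribution in a single pass,
    without running this expensive resampling loop."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return statistics.stdev(means)

rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(200)]
err = bootstrap_stderr(data)
# for 200 draws from N(0, 1), the true standard error is about 0.07
```

The cost of the loop grows with the number of resamples and the query's complexity, which is exactly the overhead that motivates predicting the error distribution analytically instead.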

Impact of New Hardware Trends on the Design of Next-Generation Data Management and Analytics Systems

Reza Sadri (Levyx)
Processing large amounts of data requires complex distributed systems. These systems, most famously Hadoop, have two major drawbacks: 1) they are costly and complex; and 2) they follow a batch processing model with very high latencies. We discuss a new software architecture that leverages trends in non-volatile memory technologies and multi-core processors to build systems that can process very large amounts of data with low latency and remain economically feasible.

Popularity-Aware Topic Model for Social Graphs

John Cho (UCLA)    slides
Probabilistic topic models are highly effective at automatically “grouping” the words and documents of a text dataset by their “topics.” Unfortunately, when a few words appear very frequently in the dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear even in “irrelevant” topic groups. This may not be a serious issue in a text dataset, because the frequent words (e.g., “the,” “and,” “is”) carry little meaning and can safely be removed before a topic model analysis. In a non-textual dataset, however, such as a social network dataset or a product review and rating dataset, these “popular” items correspond to items of real interest (e.g., Barack Obama and Justin Bieber in a Twitter dataset) and cannot simply be removed, because most people are interested in them. In this talk, I will explain how we can address this “popularity problem.” I first show the popularity problem with LDA (one of the most successful topic models) and then describe novel extensions to LDA that deal with it. In our experimental evaluation, the proposed models achieve significantly lower perplexity (i.e., better predictive power) and improved human-perceived result quality compared to traditional LDA.

Toward a Universal Unsupervised Clustering Method

Massimo Mazzeo (UCLA)    slides
A simple hierarchical clustering algorithm called CLUBS (for CLustering Using Binary Splitting) is presented. CLUBS is faster and more accurate than existing algorithms, including k-means and its recently proposed refinements. The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion whose unique analytical properties we exploit to achieve very fast computation. CLUBS derives good clusters without requiring input from users; it can also be adopted to support clustering algorithms that require input parameters, such as k-means, and to identify clusters with non-globular shapes.
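The divisive step can be illustrated with a deliberately simplified one-dimensional sketch: try every binary split of the sorted data and keep the one that most reduces the total sum of squared distances to the cluster means. (The actual algorithm works in multiple dimensions and exploits analytical properties of the quadratic criterion to avoid this exhaustive recomputation; this sketch only conveys the criterion.)

```python
def sse(xs):
    """Sum of squared distances to the mean (the quadratic criterion)."""
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_binary_split(xs):
    """One divisive step: pick the split of the sorted data that
    minimizes the combined SSE of the two resulting parts."""
    xs = sorted(xs)
    best_cost, best_i = sse(xs), None
    for i in range(1, len(xs)):
        cost = sse(xs[:i]) + sse(xs[i:])
        if cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return xs, []
    return xs[:best_i], xs[best_i:]

left, right = best_binary_split([1.0, 1.1, 0.9, 10.0, 10.2, 9.8])
# left == [0.9, 1.0, 1.1], right == [9.8, 10.0, 10.2]
```

Recursing this split on each part yields the divisive phase; the agglomerative phase then re-merges over-split pieces under the same criterion.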

SWIM System: From Structured Summaries to Integrated Knowledge Base

Shi Gao (UCLA)    slides
The Web provides a vast amount of high-quality information through structured, semi-structured, and unstructured sources. Fast-progressing projects such as Cyc, DBpedia, Freebase, and Probase generate a huge number of structured summaries. However, the coverage provided by each individual knowledge base (KB) remains limited. Since the process of generating structured summaries is usually manual and a standard ontology is often not used, current KBs are also prone to issues such as inconsistency and incorrectness.
In our SWIM system, we propose the IKBstore technique to integrate currently existing KBs into a more complete and consistent KB. IKBstore first merges existing KBs using their interlinks. Then, it employs IBminer to improve the coverage of the initially integrated KB. Finally, IKBstore utilizes CS3 to extract context-aware attribute and entity synonyms and uses them to reconcile the different terminologies used by different KBs. In this way, IKBstore creates an integrated KB that outperforms existing KBs in terms of coverage, accuracy, and consistency.

High Performance Spatial Queries and Analytics for Spatial Big Data

Fusheng Wang (Emory University)    slides
Support for high-performance queries on large volumes of spatial data is becoming increasingly important in many application domains, including geospatial problems in numerous fields, location-based services, and emerging scientific applications that are increasingly data- and compute-intensive. There are two major challenges in managing and querying massive spatial data: the explosion of spatial data, and the high computational complexity of spatial queries due to their multi-dimensional nature. Our goal is to develop a general framework to support high-performance spatial queries and analytics for spatial big data on MapReduce and CPU-GPU hybrid platforms. In this talk, I will present Hadoop-GIS, a scalable and high-performance spatial data warehousing system for running large-scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, multi-level indexing, the customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results by handling boundary objects. Hadoop-GIS is integrated into Apache Hive to support declarative spatial queries within an integrated architecture. Our experiments demonstrate that Hadoop-GIS is highly efficient and scalable, and outperforms parallel spatial DBMSs for compute-intensive spatial queries.
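The partitioning and boundary-object ideas can be conveyed with a small Python sketch (a generic illustration of grid-based spatial partitioning, not Hadoop-GIS's actual implementation): each object is assigned to every grid tile it overlaps, so objects crossing tile boundaries are duplicated and query results must be deduplicated afterwards.

```python
def assign_to_tiles(rect, tile_size):
    """Assign an axis-aligned rectangle (xmin, ymin, xmax, ymax) to
    every grid tile it overlaps. Objects that cross a tile boundary
    are duplicated into each overlapping tile; a later step must
    amend (deduplicate) query results for these boundary objects."""
    xmin, ymin, xmax, ymax = rect
    tx0, ty0 = int(xmin // tile_size), int(ymin // tile_size)
    tx1, ty1 = int(xmax // tile_size), int(ymax // tile_size)
    return {(tx, ty)
            for tx in range(tx0, tx1 + 1)
            for ty in range(ty0, ty1 + 1)}

# a rectangle crossing the tile boundary at x = 10 lands in two tiles
tiles = assign_to_tiles((8, 2, 12, 4), tile_size=10)
# tiles == {(0, 0), (1, 0)}
```

In a MapReduce setting, each tile becomes a unit of parallel work: mappers route objects to tiles, and reducers run the per-tile spatial query independently.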

Stock Trade Volume Prediction with Yahoo Finance User Browsing Behavior

Web traffic represents a powerful mirror for various real-world phenomena. For example, volumes of web searches have been shown to have a positive correlation with stock trading volumes and with the sentiment of investors. Our hypothesis is that user browsing behavior on a domain-specific portal is a better predictor of user intent than web searches.
We focus on the financial domain and analyze the web browsing and trading data of more than 2,600 stocks traded on NYSE, Nasdaq, and SNP. The web browsing data consists of user page views related to a stock S on Yahoo! Finance, while the trading data includes the trading volume of S. We study the correlation and causality between web browsing and trading data while varying the time granularity (hourly, daily) and the financial segmentation (individual tickers, industries, sectors).
We find that web browsing on Yahoo! Finance can anticipate stock trading volumes by two or three days, resulting in a higher predictive power than that of previous work that used web searches to predict trading volume. We also observe that grouping stocks into industries or sectors decreases the predictive power, whereas moving from hourly to daily time series granularity improves predictive power. We corroborate our findings with a theoretical intuition and extensive statistical and causality tests.
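The kind of lead-lag analysis described above can be sketched with lagged Pearson correlation: correlate browsing volume at day t with trading volume at day t + k, and see which lag k fits best. The series below are invented for illustration; the study's actual tests are far more extensive.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def lagged_corr(browsing, trading, lag):
    """Correlate browsing at day t with trading volume at day t + lag."""
    return pearson(browsing[:-lag] if lag else browsing, trading[lag:])

# hypothetical series where trading volume echoes browsing two days later
browsing = [5, 9, 4, 8, 3, 7, 6, 2, 9, 5]
trading  = [1, 1, 5, 9, 4, 8, 3, 7, 6, 2]
best_lag = max(range(4), key=lambda k: lagged_corr(browsing, trading, k))
# best_lag == 2: browsing anticipates trading by two days in this toy data
```

A real analysis would add significance and Granger-style causality tests, as the abstract notes, rather than relying on raw correlation alone.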

Query Petabytes of Data in a Blink Time!

Barzan Mozafari (University of Michigan)
For the past few decades, databases have been a successful abstraction for accessing and managing data. However, the rapid growth of data and the demand for more complex analytics have significantly hindered the scalability and applicability of these systems beyond traditional business data processing scenarios. In this talk, I will show how we can marry advanced statistical techniques with database systems to build robust and scalable query systems. In particular, I will focus on interactive analytics and visualization in the presence of massive volumes of data.

Query Optimization in Cloud Environment

Cindy Chen (University of Massachusetts Lowell)    slides
In this research, we present two storage structures, PK-map and Tuple-index-map, to improve the performance of query execution and inter-node communication in Cloud Data Warehouses. Cloud Data Warehouses require read-optimized databases, because large amounts of historical data are integrated on a regular basis to facilitate analytical applications for report generation, future analysis, and decision-making. This frequent data integration can grow the data size rapidly, creating a need to allocate resources dynamically on demand. As resources are scaled out in the cloud environment, the number of nodes involved in the execution of a query increases, which in turn increases the number of inter-node communications. Join operations between tables are among the most common in queries, and performing a join in the cloud environment requires transferring data among different nodes. This becomes critical when huge amounts of data (terabytes or petabytes) are stored across a large number of nodes: as the number of nodes and the amount of data increase, so does the size of the communication messages, resulting in increased bandwidth usage and degraded performance. Through extensive experiments on the PlanetLab Cloud, we show that our proposed storage structures, PK-map and Tuple-index-map, together with our query execution algorithms, improve the performance of join queries and decrease inter-node communication and workload in Cloud Data Warehouses.
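The general idea of such key-to-node maps can be conveyed with a toy sketch (the names and layout below are hypothetical illustrations, not the paper's actual structures): if every node keeps a compact map from join key to the node holding the matching tuple, a join can ship small routing messages instead of entire rows.

```python
# Hypothetical example data: full order tuples spread over nodes,
# plus a compact map from customer key to the node that stores it.
orders_on_node = {
    "n1": [("o1", "c1", 250), ("o2", "c2", 90)],
    "n2": [("o3", "c1", 40)],
}
pk_map = {"c1": "nA", "c2": "nB"}

def plan_join(orders_on_node, pk_map):
    """For each order, look up which node holds the matching customer
    row; only small (source, destination, order_id) routing messages
    cross the network instead of full tuples."""
    messages = []
    for node, rows in sorted(orders_on_node.items()):
        for (oid, cust, _amount) in rows:
            messages.append((node, pk_map[cust], oid))
    return messages

msgs = plan_join(orders_on_node, pk_map)
# three small routing messages instead of shipping three full tuples
```

Shrinking each message from a full tuple to a key/identifier pair is what reduces the bandwidth usage that the abstract identifies as the bottleneck at scale.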

Reservation-based scheduling: If you are late, don’t blame us!

Carlo Curino (Microsoft)
In recent years, organizations have increasingly shifted toward a data-driven approach to business; this has led to an explosion in the complexity and variety of big-data applications. Furthermore, we are observing a progressive shift from clusters dedicated to a single purpose (e.g., production MapReduce jobs) to large, consolidated clusters running a mixture of production and testing applications. In this talk, we tackle the problem of running a rich mix of applications, including job pipelines with gang-scheduling requirements and completion deadlines, by carefully separating two concerns: (1) determining the resource requirements of a job, and (2) ensuring predictable allocation of the requested resources. We propose a resource description language that allows users to specify their resource needs abstractly, exposing many alternative ways of satisfying a job’s resource needs. This gives the system flexibility in allocating resources across several jobs, while also allowing it to plan ahead and determine whether it can satisfy any given request. This allows large production pipelines with deadlines to share a cluster with small, ad-hoc, latency-critical jobs. We show the power of this approach by presenting a scheduling framework that uses this rich language to ensure predictable resource allocation for production jobs while minimizing latency for best-effort jobs. Our framework relies on admission control, work-preserving preemption, and quick adaptation to changing cluster conditions to achieve all of this without sacrificing utilization. We demonstrate these techniques by building Rayon as an extension to YARN (Hadoop 2.x), which allows us to validate our work in a real context and against a popular scheduler. We present an extensive experimental evaluation, running our system on a 256-node cluster using ten workloads derived from real-world traces from clusters of Cloudera customers, Facebook, Microsoft, and Yahoo!
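The admission-control idea can be sketched with a toy capacity timeline (a deliberate simplification of reservation planning, not Rayon's actual placement algorithm): a job asking for a fixed amount of resources over a window is admitted only if some feasible window exists before its deadline, and is otherwise rejected up front rather than missing its deadline later.

```python
def try_admit(timeline, capacity, demand, duration, deadline):
    """Greedy admission control: find the earliest window of `duration`
    consecutive slots ending by `deadline` in which `demand` units fit
    under `capacity`; book it and return the start slot, or return
    None to reject the reservation outright."""
    for start in range(0, deadline - duration + 1):
        window = range(start, start + duration)
        if all(timeline[t] + demand <= capacity for t in window):
            for t in window:
                timeline[t] += demand    # commit the reservation
            return start
    return None

capacity = 10
timeline = [0] * 8                       # committed load per time slot
a = try_admit(timeline, capacity, demand=6, duration=3, deadline=8)
b = try_admit(timeline, capacity, demand=6, duration=3, deadline=8)
c = try_admit(timeline, capacity, demand=6, duration=6, deadline=5)
# a starts at slot 0; b cannot overlap a and starts at slot 3;
# c is rejected because no 6-slot window fits before its deadline
```

Rejecting c at submission time, instead of letting it run and miss its deadline, is the "if you are late, don't blame us" contract the title alludes to.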

Flexible and Robust Co-Regularized Multi-Domain Graph Clustering

Wei Wang (UCLA)    slides
Multi-view graph clustering aims to enhance clustering performance by integrating heterogeneous information collected in different domains, each of which provides a different view of the data instances. Leveraging cross-domain information has been demonstrated to be an effective way to achieve better clustering results. Existing multi-view graph clustering methods usually assume that different views are available for the same set of instances. In reality, however, data instances in one domain may correspond to multiple instances in another domain. Moreover, relationships between instances in different domains may carry weights based on prior (partial) knowledge. I will present a flexible and robust framework, CGC (Co-regularized Graph Clustering), based on non-negative matrix factorization (NMF), to tackle these challenges. CGC has several advantages over existing methods. First, it supports many-to-many cross-domain instance relationships. Second, it incorporates weights on cross-domain relationships. Third, it allows partial cross-domain mappings, so that graphs in different domains may have different sizes. Finally, it reports the extent to which the cross-domain instance relationships violate the in-domain clustering structure, enabling users to re-evaluate the consistency of those relationships.
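The NMF building block that CGC extends can be sketched in plain Python: factor a non-negative matrix V into non-negative W and H via the classic multiplicative updates, so that rows with similar column profiles share a dominant factor. This is plain single-domain NMF (Lee-Seung style updates), not CGC itself, which adds cross-domain co-regularization terms to the objective.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=300, seed=0):
    """Plain NMF via multiplicative updates: V (n x m) ~ W (n x k) H (k x m),
    all entries kept non-negative. Each update is non-increasing in the
    squared reconstruction error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        WT = transpose(W)
        num, den = matmul(WT, V), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        HT = transpose(H)
        num, den = matmul(V, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# two-block matrix: rows 0-1 and rows 2-3 form two natural clusters
V = [[5, 5, 0, 0], [4, 4, 0, 0], [0, 0, 3, 3], [0, 0, 2, 2]]
W, H = nmf(V, k=2)
# when the factorization succeeds, each row's largest W entry
# indicates its cluster, recovering the two blocks
```

In the multi-domain setting, CGC couples one such factorization per domain through penalty terms on the (weighted, possibly many-to-many) cross-domain relationships.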

Short Text Understanding

Haixun Wang (Google Research)
Representing and reasoning over human knowledge is a computational grand challenge for the 21st century. This talk introduces our approach of using web-scale, data-driven, probabilistic knowledge bases for text processing. Probase has a large concept space that enables it to “understand” most concepts (about worldly facts) used by human beings. Furthermore, knowledge in Probase is not black and white: we develop a set of scores to quantify its uncertainty. These quantifications serve as the probabilistic priors and likelihoods that form the foundation of Probase’s concept learning mechanism for understanding short text. Finally, we address one problem of knowledge-based approaches: they are mostly one-way, forward-propagation approaches that cannot adapt themselves to improve performance on a given task. We bring text processing into the DNN framework by using knowledge as both an input and an output. Early results on real-life search experiments show that our approach is promising.
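The flavor of probabilistic conceptualization can be shown with a toy sketch. The isA scores below are invented for illustration and do not reflect Probase's actual data or scoring model; the point is only that per-term concept probabilities can be combined so that a short text picks the concept its terms jointly support.

```python
# Hypothetical isA scores: term -> {concept: P(concept | term)}.
# These numbers are made up for illustration.
isa = {
    "python": {"programming language": 0.7, "snake": 0.3},
    "java":   {"programming language": 0.8, "island": 0.2},
    "pandas": {"software library": 0.6, "animal": 0.4},
}

def conceptualize(terms):
    """Combine per-term concept scores additively and rank concepts
    for the whole short text; a concept supported by several terms
    outranks one supported by a single ambiguous term."""
    scores = {}
    for t in terms:
        for concept, p in isa.get(t, {}).items():
            scores[concept] = scores.get(concept, 0.0) + p
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = conceptualize(["python", "java"])
# "programming language" comes out on top because both terms support it,
# disambiguating "python" away from "snake"
```

This is the sense in which uncertainty scores act as priors and likelihoods: ambiguous terms are resolved by the concepts their neighbors make likely.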

Harvesting Wikipedia and Large Text Corpora

Hamid Mousavi (Apple)    slides
Structured summaries, such as Wikipedia’s InfoBoxes, are becoming a popular feature in (semi-)curated document corpora, enabling powerful structured searches and supporting question answering systems. These structured summaries have recently made automatic personal assistants (e.g., Siri, Cortana, and Google Now) more popular than traditional keyword-based search in several domains. Unfortunately, existing summaries are not yet comprehensive enough to fully support these applications. To close this gap, we propose an automatic approach to extract structured summaries from textual documents, which are abundantly available on the Web. To this end, we perform the following main tasks:
1) We first convert the text into a weighted graph, called a TextGraph, which captures the grammatical and semantic relations between the words and terms in the text. TextGraphs are generated using our new text mining framework, referred to as SemScape. SemScape uses a statistical parser to generate a few of the most probable parse trees for each sentence and employs a novel two-step pattern-based technique to extract candidate terms and their grammatical relations from the parse trees.
2) Then, we generate semantic links between words and terms in TextGraphs by using a set of predefined SPARQL-like patterns.
3) Finally, by learning from the current InfoBoxes in Wikipedia, and relying on a large body of categorical information, we convert the semantic links generated in the previous step into the final InfoBox triples, and also infer new attribute synonyms.
Our approach, called IBminer, is able to create more complete structured summaries for Wikipedia and similar well-written documents, and significantly improves the recall for structured queries on Wikipedia as shown in our experiments.
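The flavor of pattern-based triple extraction can be conveyed with a grossly simplified sketch. The real pipeline matches patterns over parse trees and TextGraphs; the single surface-level regex pattern below is invented purely for illustration.

```python
import re

# One invented surface pattern ("X was born in Y") standing in for the
# tree-based patterns the real system uses.
PATTERN = re.compile(r"(?P<subj>[A-Z][\w ]+?) was born in (?P<val>[A-Z][\w ]+)")

def extract_triples(text):
    """Turn 'X was born in Y' sentences into (X, birthPlace, Y) triples,
    the shape of InfoBox entries the pipeline ultimately produces."""
    return [(m.group("subj").strip(), "birthPlace", m.group("val").strip())
            for m in PATTERN.finditer(text)]

triples = extract_triples("Frank Zappa was born in Baltimore. "
                          "Ada Lovelace was born in London.")
# [('Frank Zappa', 'birthPlace', 'Baltimore'),
#  ('Ada Lovelace', 'birthPlace', 'London')]
```

The steps above add what this sketch omits: graph-based patterns robust to phrasing variation, semantic links between terms, and the use of existing InfoBoxes and categorical knowledge to name attributes and infer synonyms.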
