Core Areas of Research
The goal of ScAi is (i) to promote the development of new computer systems and algorithms for big data management and mining and (ii) to foster the deployment and use of these systems and algorithms in big-data science projects. Toward this goal, ScAi builds on the expertise and collaborative projects of its faculty in the following core research areas, and is reaching out to research laboratories in other disciplines and to industrial partners.
Big Data Systems
Big Data applications have grown extraordinarily in recent years, fueled largely by the democratization of data and the collection of ever more massive data sets. Systems for large-scale analytics are being built to extract insights that enhance services and optimize operations. However, the current suite of platforms can support only a small fraction of machine learning algorithms at scale. Our research aims to provide data management toolkits that let data scientists operate at scale on cloud computing platforms.
Graphs and networks are widely used in many disciplines for modeling structures and relationships between entities. Mining an ensemble of graphs enables the discovery of structural patterns or subgraphs of interest. Our research addresses four basic questions concerning graph-based analytics: (1) how to model subgraph patterns that allow the desired structural flexibility, (2) how to discover discriminative subgraph patterns that capture salient characteristics of the graphs, (3) how to organize, summarize, and use subgraph patterns in building predictive models, and (4) to what extent these innovations affect important problems faced by practitioners.
Language Design for Big Data and Data Streams
New high-level programming languages are needed to facilitate (i) the development of advanced analytics, (ii) their efficient execution over diverse platforms, and (iii) their scalability via parallelization. The compilation and optimization techniques that achieved all these objectives for the simple logic-based languages of relational databases must be extended to handle very complex analytical queries and very diverse computing platforms.
Mining High Dimensional Data
Analysis procedures commonly take as input a large number of varied attributes that are monitored and collected. This growth in dimensionality is a major barrier to developing efficient analytics methods. One focus of our research has been on discovering coherent patterns embedded in the subspaces of high-dimensional data. Detecting these patterns aids in the removal of irrelevant or redundant information, improves the performance of subsequent tasks, and makes the data easier to comprehend.
User and Quality Modeling for Big Data
The large volume of data generated and shared by end users presents important challenges in preserving data quality. While some user-generated data is of very high quality, the vast majority is noisy, and some is created for malicious purposes. It is therefore essential to develop new ways to judge the intrinsic quality of the available data by estimating the characteristics and reliability of the users who generated it, the time and location at which it was generated, and the general consensus and independent support for the shared information. Our research aims to develop novel models that represent users' core data-generation characteristics and combine multiple sources of evidence to estimate the reliability of data. Ultimately, these new models will allow users to make sense of often noisy data and put it to the best use.
Biomedicine: Big Data is becoming healthcare's biggest challenge. The UCLA school of medicine and hospital offer unique opportunities for ScAi faculty to develop and deploy new data models, analytic tools, and system architectures that directly address the challenges of releasing, accessing, managing, analyzing, and integrating datasets of diverse types, including imaging, phenotypic, molecular, clinical, behavioral, environmental, and many other kinds of biological and biomedical data.
Social Media: The massive, densely connected, and constantly updating web of social sites offers a wealth of significant and compelling information, but creates problems at every step of its analysis and use. We plan to initiate partnerships with major companies to develop smart social media analytics tools for capturing, modeling, and predicting customer sentiments and behaviors.
World Wide Web: The WWW hosts more than 800 million web pages. It is by far the biggest data repository and poses many of the greatest challenges to data scientists. We will leverage our current strength in web information management and analysis to build a next-generation data analytics paradigm.