Research Projects
An Adaptive Graph Sampling Framework for Graph Analytics, May 2020 - Oct 2023
University of Connecticut
Designed and implemented AdapES, an adaptive edge sampling algorithm built on an adaptive graph sampling framework.
We propose an adaptive graph sampling framework and design AdapES, an adaptive edge sampling algorithm based on this framework. Unlike non-adaptive sampling methods, our approach continually monitors the difference between the current sampled subgraph and the original graph, and dynamically adjusts the edge sampling probability based on this observed sampling difference. Guided by a preset sampling goal, the algorithm automatically adapts to fluctuations in the random sampling process. The experimental evaluation on 11 datasets demonstrates that AdapES outperforms other algorithms at preserving various graph properties and statistics.
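The feedback loop described above can be illustrated with a minimal sketch. It is not the AdapES implementation: it assumes the monitored "sampling difference" is simply the gap between a target edge fraction and the fraction kept so far, and the function name, the `gain` parameter, and the toy path graph are all hypothetical.

```python
import random


def adaptive_edge_sample(edges, target_fraction, gain=0.5, seed=0):
    """Stream edges once, nudging the keep-probability toward a target fraction.

    Assumption: the sampling difference is the gap between the target
    fraction and the fraction of edges kept so far. The graph statistic
    that AdapES actually monitors is not specified here.
    """
    rng = random.Random(seed)
    sampled = []
    prob = target_fraction
    for seen, edge in enumerate(edges, start=1):
        if rng.random() < prob:
            sampled.append(edge)
        # Observed difference between the sampling goal and the current sample.
        difference = target_fraction - len(sampled) / seen
        # Raise the probability when under-sampling, lower it when over-sampling.
        prob = min(1.0, max(0.0, target_fraction + gain * difference))
    return sampled


# Toy input (hypothetical): a path graph with 10,000 edges.
edges = [(u, u + 1) for u in range(10_000)]
sample = adaptive_edge_sample(edges, target_fraction=0.1)
print(len(sample) / len(edges))
```

A non-adaptive sampler would keep `prob` fixed; here the correction term pulls the sample back toward the goal whenever random fluctuation pushes it away.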
Dynamic Resource Allocation for Apache Spark Applications, Aug 2018 - May 2019
University of Connecticut
Designed and implemented a middleware to dynamically allocate computing resources for Apache Spark applications to improve resource utilization.
We design and implement a middleware service for dynamically allocating computing resources to Apache Spark applications on cloud platforms. It leverages current and prior resource usage to predict future demand, and uses either an a priori calculated or a dynamically determined threshold value to adjust the amount of resources allocated for future time intervals. Our experiments using six different Apache Spark applications on both physical and virtual clusters demonstrate that our approaches can improve application performance while significantly reducing resource requirements in most cases compared to static resource allocation strategies.
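As a rough illustration of threshold-based adjustment, the sketch below predicts demand from a usage history and resizes the allocation only when the gap exceeds a threshold. The exponentially weighted predictor, the `headroom` factor, the executor-count unit, and all names are assumptions for illustration, not the middleware's actual design.

```python
def predict_demand(usage_history, weight=0.6):
    """Blend current and prior usage (exponential weighting is an assumption)."""
    estimate = usage_history[0]
    for usage in usage_history[1:]:
        estimate = weight * usage + (1 - weight) * estimate
    return estimate


def next_allocation(current_alloc, usage_history, threshold=0.2, headroom=1.1):
    """Resize only when predicted demand drifts past the threshold."""
    demand = predict_demand(usage_history)
    target = demand * headroom
    # Relative gap between what is allocated and what is predicted to be needed.
    gap = abs(target - current_alloc) / current_alloc
    if gap > threshold:
        return max(1, round(target))
    return current_alloc


# Hypothetical executor counts observed over three past intervals.
print(next_allocation(10, [4, 4, 4]))   # over-provisioned: shrink
print(next_allocation(10, [9, 9, 9]))   # within threshold: keep
print(next_allocation(4, [4, 8, 8]))    # demand rising: grow
```

Here `threshold` is fixed in advance; a dynamic variant would recompute it from recent prediction error before each interval.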
Improving Performance of Apache Spark Jobs, Feb 2017 - Mar 2019
University of Connecticut
Implemented a Spark optimizer in Java to predict and mitigate potential task stragglers and skewed task distribution problems on the Apache Spark platform, improving job performance by up to 71%.
We present an analytical model-driven approach that predicts the likelihood of task straggler and skewed task distribution problems by executing an application with a limited amount of input data, and recommends ways to address the identified problems: repartitioning the input data (for task stragglers) and/or changing the locality configuration setting (for skewed task distribution). The novelty of our approach lies in automatically predicting potential problems a priori from limited execution data and recommending the locality setting and partition number. Our experimental results using 9 Apache Spark applications on two different clusters show that our model-driven approach can predict these problems with high accuracy and improve performance by up to 71%.
Automatic Tuning of Apache Spark Configuration, Nov 2016 - May 2018
University of Connecticut
Constructed application-specific performance influence models to tune the performance of applications running on the Apache Spark platform.
We present a framework for tuning the performance of Apache Spark by optimizing configuration settings. The framework identifies the best performance model for an application from among multiple machine learning algorithms, and uses recursive random search to find the best configuration combination based on the runtime predicted by this model. The framework is evaluated with different machine learning algorithms as well as different techniques for selecting the best performance model. Using a representative set of open source applications, we demonstrate that the framework can significantly improve the performance of these applications.
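The search step can be sketched as follows. This is a simplified form of recursive random search (sample, re-center on the best point, shrink the region) without the restart logic of the full algorithm. The `toy_predictor` is a stand-in for a learned performance model, and the parameter names `executor_cores` and `executor_memory_gb` are hypothetical and treated as continuous for brevity.

```python
import random


def recursive_random_search(predict_runtime, bounds, samples=30, rounds=8,
                            shrink=0.5, seed=0):
    """Sample configurations, then repeatedly shrink the region around the best."""
    rng = random.Random(seed)
    space = dict(bounds)
    best_cfg, best_time = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples):
            cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
            runtime = predict_runtime(cfg)
            if runtime < best_time:
                best_cfg, best_time = cfg, runtime
        # Re-center on the best point found so far and shrink each dimension,
        # clipping to the original bounds.
        new_space = {}
        for k, (lo, hi) in space.items():
            half = (hi - lo) * shrink / 2
            g_lo, g_hi = bounds[k]
            new_space[k] = (max(g_lo, best_cfg[k] - half),
                            min(g_hi, best_cfg[k] + half))
        space = new_space
    return best_cfg, best_time


def toy_predictor(cfg):
    """Stand-in for a trained model: predicted runtime in seconds (made up)."""
    return (30.0 + (cfg["executor_cores"] - 8) ** 2
            + (cfg["executor_memory_gb"] - 16) ** 2 / 4)


bounds = {"executor_cores": (1, 16), "executor_memory_gb": (2, 64)}
best_cfg, best_time = recursive_random_search(toy_predictor, bounds)
print(best_cfg, best_time)
```

Because every candidate is scored by the model rather than by running the job, the search costs only model evaluations; the quality of the result depends on how well the selected model predicts runtime.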
Interference Modeling of Apache Spark Jobs, Aug 2015 - May 2017
University of Connecticut
Developed a dynamic job predictor in Java that predicts the execution time of multiple Spark jobs running in Xen virtual machines by integrating their resource consumption and task event profiles, and implemented a job scheduler in Java and Bash to reduce the total execution time.
We develop analytical models to estimate the effect of interference among multiple concurrently running Apache Spark jobs on job execution time in a virtualized cloud environment. We evaluated the accuracy of our models using four real-life applications (PageRank, K-Means, Logistic Regression, and WordCount) on a 6-node cluster while running up to four jobs concurrently. Our experimental results show that the model achieves high prediction accuracy: between 86% and 99% when four concurrent jobs start simultaneously, and between 71% and 99% when four concurrent jobs start at different times. Moreover, our scheduler reduces the average execution time of individual jobs by 26% to 47%, and the total execution time by 2% to 13%.
Performance Prediction for Apache Spark Jobs, Aug 2014 - May 2015
University of Connecticut
Developed a Spark analytics system in Java to parse JSON logs of Apache Spark events and predict execution time, I/O overhead, and memory consumption using analytical and machine learning approaches.
We present a simulation-driven prediction model that can predict job performance with high accuracy on the Apache Spark platform. Specifically, as Apache Spark jobs consist of multiple sequential stages, the prediction model simulates the execution of the actual job using only a fraction of the input data, and collects execution traces (execution time, memory consumption, and I/O overhead) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications (WordCount, Logistic Regression, K-Means clustering, and PageRank) on a 13-node cluster. Experimental results show that the model achieves high prediction accuracy: up to 98% for time prediction, 99% for memory prediction, and 97% for I/O prediction.
Configuration Evolution Analysis of Cloud Software, Oct 2014 - Feb 2016
University of Connecticut
Implemented CSMiner, an automated tool based on source code call graphs, to analyze configuration changes in cloud software such as Apache Cassandra and Apache Hadoop.
We design an automated tool (CSMiner) that leverages static program analysis techniques to help users understand how and where a particular setting is used in a program and how settings have evolved across different versions of a software system. CSMiner was applied to four open source software packages, namely Apache Cassandra, ElasticSearch, Apache Hadoop, and Apache HBase, and identified 109 (out of 109), 109 (out of 113), 811 (out of 847), and 160 (out of 167) settings for these packages, respectively. In each case, CSMiner identified the changes in configuration settings across multiple versions with high accuracy.
Learning Environment for Smart Grid Security, Aug 2013 - Feb 2014
Georgia State University
Implemented an online system using JSP, jQuery, MySQL, and Bash to schedule a Smart Grid emulator for course design. Implemented the course information display and scheduling calendar using JSP and jQuery, developed Java Servlets and Bash scripts to launch and control the Smart Grid emulator, and loaded user profile and course schedule information into a MySQL database.
We design and implement an integrated learning environment for smart grid security education. The core components of this environment are a smart grid simulator and a learning website. Based on this learning environment, we design course projects and teaching materials so that students can better grasp smart grid security concepts.
Optimizing Hadoop MapReduce, Nov 2011 - Dec 2012
Beihang University
Applied BTrace to trace MapReduce job functions and monitored resource consumption using Ganglia. Implemented a MapReduce optimizer in Java by constructing a Hadoop performance model for execution time prediction and designing a heuristic search algorithm to find near-optimal configurations for MapReduce jobs.
We implement Predator, an experience-guided configuration optimizer that does not treat optimization as a pure black-box problem but uses experience learned from Hadoop MapReduce configuration practice to assist the optimization process. The optimizer uses the job execution time estimated by a practical MapReduce cost model as its objective function, and classifies Hadoop MapReduce parameters into groups by their tunable levels to shrink the search space. Furthermore, its optimization algorithm uses subspace division to avoid getting trapped in local optima, and reduces search time by cutting the cost of visiting unpromising points in the search space. In experiments on Hadoop clusters, it improves MapReduce job execution time by up to 88%.