Working on the use of Machine Learning in Compilers (thesis). Developed several tools and techniques that deploy Reinforcement Learning, Large Language Models, and advanced search techniques in compilers. Additionally, contributing to the development of HPCToolkit, a large-scale profiling tool used widely across US national labs. Implemented infrastructure for scalable GPU tracing, node-level metric tracing, performance counters, and various performance analyses.
Fine-tuning a Llama2-based model on LLVM-IR programs to solve the phase-ordering problem. Developing Priority Sampling, a method that outperforms the performance of the training labels within 30 samples.
Developing a deep reinforcement learning compiler for optimizing tensor operations.
Profiling and analysis of power consumption in multi-node GPU applications using the Nvidia NVML library.
Hands-on tutorials on cutting-edge supercomputing at ATPESC 2021.
Profiling and analysis of an FFT implementation on the Xtensa platform in C, along with theoretical analysis of window functions.
Measurement and characterization of materials on a micro-indenter device.
Thesis | Optimizing Compiler Heuristics with Machine Learning |
Advisors | John Mellor-Crummey, Aleksandar Zlateski and Chris Cummins |
The thesis focuses on the use of Machine Learning in Compilers. First, we developed LoopTune, a reinforcement-learning-based framework for optimizing tensor computations, a core component of ML workloads. Second, we pioneered the use of Large Language Models (LLMs) in compiler optimization by predicting the sequence of LLVM optimization flags directly from LLVM-IR in text form. Finally, we developed Priority Sampling, a simple deterministic sampling technique for LLMs that produces unique samples ordered by the model's confidence and outperforms the labels' performance within 30 samples. Additionally, we developed infrastructure for scalable GPU profiling across many GPU nodes, and added support in HPCToolkit for measuring performance counters and node-level metrics, as well as a GPU-idleness analysis that points to the cause of serialization in GPU code.
Grade | 10/10 |
Thesis | Finding Shortest Path in Dynamic Large-scale Graph, based on Lambda Architecture |
Advisors | Vladimir Dimitrieski |
Developed a system for computing shortest paths from multiple sources in a large-scale dynamic graph, based on the Lambda Architecture. Technologies used: Spark, HDFS, Kafka, Python Dash, Docker, Python
Grade | 9.96/10 |
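The batch layer of such a Lambda Architecture system periodically recomputes distances from every source over a full graph snapshot. A minimal in-memory sketch of that computation, using multi-source BFS on an unweighted graph (a hypothetical stand-in for the actual Spark job over the HDFS-resident graph):

```python
from collections import deque

def multi_source_shortest_paths(adj, sources):
    """Distances (in hops) from the nearest of several sources.

    `adj` maps a node to its list of neighbors; `sources` is the set of
    start nodes. All sources are seeded at distance 0, so one BFS pass
    yields, for each reachable node, the distance to its closest source.
    """
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit = shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

Seeding all sources at once avoids one BFS per source; in the real system the speed layer would then patch these batch results as edge updates stream in from Kafka.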
Thesis | Hardware acceleration of chess engine |
Advisors | Vuk Vrankovic |
FPGA implementation of chess-board evaluation following the RTL methodology. Technologies used: C, SystemC, VHDL, SystemVerilog
Authors | Dejan Grubisic, Chris Cummins, Volker Seeker, Hugh Leather |
Publication | Available on Arxiv |
We present Priority Sampling, a simple and deterministic sampling technique that produces unique samples ordered by the model’s confidence. Priority Sampling outperforms Nucleus Sampling for any number of samples, boosting the performance of the original model from 2.87% to 5% improvement over -Oz.
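In spirit, Priority Sampling replaces stochastic decoding with a deterministic best-first expansion of the model's token tree, so each completed sequence is emitted at most once, in decreasing order of probability. The following is a simplified sketch of that core idea with a toy next-token distribution standing in for LLM logits; it is not the exact algorithm from the paper:

```python
import heapq
import itertools
import math

def priority_sampling(next_token_probs, eos, k, max_len=10):
    """Deterministically enumerate the k most probable completions,
    unique and ordered by decreasing model confidence.

    `next_token_probs(prefix)` returns {token: probability} for the
    next token given a prefix (a toy stand-in for an LLM's softmax).
    """
    tiebreak = itertools.count()  # break ties between equal-probability prefixes
    # Heap entries: (negative log-probability, tie-breaker, token prefix).
    heap = [(0.0, next(tiebreak), ())]
    samples = []
    while heap and len(samples) < k:
        neg_logp, _, prefix = heapq.heappop(heap)
        if (prefix and prefix[-1] == eos) or len(prefix) >= max_len:
            samples.append((prefix, math.exp(-neg_logp)))  # finished sequence
            continue
        for tok, p in next_token_probs(prefix).items():
            if p > 0.0:
                heapq.heappush(
                    heap, (neg_logp - math.log(p), next(tiebreak), prefix + (tok,))
                )
    return samples

def toy_model(prefix):
    # Hypothetical two-token vocabulary plus an end-of-sequence marker.
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    return {"a": 0.2, "b": 0.2, "<eos>": 0.6}

samples = priority_sampling(toy_model, "<eos>", k=3)
```

Because the frontier is always expanded at its most probable prefix, the returned samples are unique and their sequence probabilities are non-increasing, mirroring the "ordered by the model's confidence" property; unlike Nucleus Sampling, re-running it yields the same set.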
Authors | Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, Hugh Leather |
Publication | Available on Arxiv |
We present a 7B-parameter LLaMa2-based model trained from scratch to optimize LLVM assembly for code size. Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler with zero compilations, outperforming two state-of-the-art baselines that require thousands of compilations.
Authors | Dejan Grubisic, Bram Wasti, Chris Cummins, Aleksandar Zlateski |
Publication | International Conference on Compiler Construction ’24 (under submission) |
We present LoopTune, a deep reinforcement learning framework for optimizing tensor computations in deep learning models. LoopTune consistently exceeds the performance of traditional search-based algorithms and TVM, performing at the level of the hand-tuned library NumPy.
Authors | Bram Wasti, Dejan Grubisic, Benoit Steiner, Aleksandar Zlateski |
Publication | Neural Information Processing Systems ’22 |
We present LoopStack, a domain-specific compiler stack for tensor operations. LoopStack is orders of magnitude faster than LLVM, delivers equal or better runtime performance, and defines a predictable optimization space suitable for tuning with reinforcement learning.
Authors | Keren Zhou, Laksono Adhianto, Jonathon Anderson, Aaron Cherian, Dejan Grubisic, Mark Krentel, Yumeng Liu, Xiaozhu Meng, John Mellor-Crummey |
Publication | Parallel Computing Journal |
To address the challenge of performance analysis on the US DOE's forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications.
Authors | Aaron Thomas Cherian, Keren Zhou, Dejan Grubisic, Xiaozhu Meng, John Mellor-Crummey |
Publication | ProTools |
In this paper, we describe extensions to Rice University’s HPCToolkit performance tools that support measurement and analysis of Intel’s DPC++ programming model for GPU-accelerated systems atop an implementation of the industry-standard OpenCL framework for heterogeneous parallelism on Intel GPUs.
Authors | Keren Zhou, Xiaozhu Meng, Ryuichi Sai, Dejan Grubisic, John Mellor-Crummey |
Publication | TPDS |
In this paper, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01x to 3.58x, with a geometric mean of 1.22x.
Authors | Dejan Grubisic |
Publication | Zbornik radova Fakulteta tehničkih nauka u Novom Sadu |
Finding the shortest path in a dynamic multi-source large-scale graph, based on the Lambda Architecture
Duration | 1 hour 13 minutes |
Demonstrating the use of machine learning in compilers using reinforcement learning, large language models, and advanced sampling techniques.
Duration | 13 minutes |
Presentation of Priority Sampling, a simple deterministic sampling technique for LLMs that achieves surprisingly high performance.
Duration | 45 minutes |
Implementing a system for finding shortest paths in a dynamic graph, suitable for big data.
Duration | 45 minutes |
Implementing an FPGA component for accelerating board evaluation in a chess engine.
Duration | 5 minutes |
Here we present newly added features for monitoring power, temperature, and utilization of Nvidia GPUs in HPCToolkit.
C/C++ | |
Python | |
CUDA C | |
GNU / Linux | |
Bash | |
OpenMP/MPI | |
VHDL | |
Docker | |
Java | |
Spark | |
Hadoop | |