Sumon Biswas

Assistant Professor, Department of Computer and Data Sciences, Case School of Engineering

Sumon Biswas is a tenure-track Assistant Professor in the Department of Computer and Data Sciences at Case Western Reserve University. His research lies at the intersection of software engineering and AI, with a focus on responsible AI, fairness, safety, and large language models.

Teaching Information

Responsible AI Engineering; Software Engineering

Education

Postdoc, Carnegie Mellon University, 2024
Ph.D. in Computer Science, Iowa State University, 2022
M.S. in Computer Science, Iowa State University, 2021

Sumon Biswas is an Assistant Professor of Computer and Data Sciences at Case Western Reserve University. He received his Ph.D. in Computer Science from Iowa State University and completed a postdoc at Carnegie Mellon University. His research interests lie at the intersection of Software Engineering (SE) and Artificial Intelligence (AI), focusing on responsible AI engineering. He combines formal and empirical approaches to design, verify, and analyze responsible AI systems. His work spans algorithmic fairness, safety, and long-term risks in ML-enabled systems. Recent work focuses on large language models and coding agents, with emphasis on formal verification and neuro-symbolic design.

My research lies at the intersection of Software Engineering and AI, with a focus on responsible AI engineering. I combine formal verification and empirical methods to design, verify, and analyze AI-enabled systems, spanning algorithmic fairness, safety, and sustainability. Recent work investigates large language models and coding agents, with an emphasis on their safety, reliability, and trustworthiness.

Current Projects

Long-Term Risks in ML Systems

This project investigates how feedback loops and deployment dynamics in machine learning systems can create harmful long-term effects, and develops methods to analyze and mitigate those risks before and after deployment.

Design and Architecture of Data Science Pipelines

This project studies the structure, design, and real-world evolution of data science pipelines, including how preprocessing, modeling, training, and evaluation stages interact in practice.

Neural Network Verification

This project develops formal methods for verifying fairness properties of neural networks, with emphasis on modular reasoning and analysis for specific subpopulations.

Causal Fairness in Machine Learning Pipelines

This project investigates how fairness properties compose across pipeline components and uses causal reasoning to understand, diagnose, and improve fairness in end-to-end machine learning workflows.

Fairness Engineering in ML Models

This project examines fairness issues in real-world machine learning systems and develops practical techniques for fairness analysis, debugging, and performance-aware repair.

AI-Assisted Software Quality and Technical Debt

This project studies technical debt and quality risks in AI-assisted software engineering, including self-admitted technical debt in machine learning software and the effects of prompt engineering on code-generating assistants.

FairSense: Long-Term Fairness Analysis of ML-Enabled Systems

2025

Yining She , Sumon Biswas , Christian Kästner , Eunsuk Kang

47th International Conference on Software Engineering (ICSE)

Algorithmic fairness of machine learning (ML) models has raised significant concern in the recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are model-centric and designed to detect fairness issues under static settings. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system impact the environment, which in turn affects future decision-making. Such a self-reinforcing feedback loop can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. Given a fairness requirement, FairSense performs Monte-Carlo simulation to enumerate evolution traces for each system configuration. Then, FairSense performs sensitivity analysis on the space of possible configurations to understand the impact of design options and environmental factors on the long-term fairness of the system. We demonstrate FairSense's potential utility through three real-world case studies: Loan lending, opioids risk scoring, and predictive policing.

Trustworthy AI, Artificial Intelligence

Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot

2024

David O'Brien , Sumon Biswas , Sayem Imtiaz , Rabe Abdalkareem , Emad Shihab , Hridesh Rajan

46th International Conference on Software Engineering (ICSE)

Code intelligence tools such as GitHub Copilot have begun to bridge the gap between natural language and programming language. A frequent software development task is the management of technical debts, which are suboptimal solutions or unaddressed issues which hinder future software development. Developers have been found to “self-admit” technical debts (SATD) in software artifacts such as source code comments. Thus, is it possible that the information present in these comments can enhance code generative prompts to repay the described SATD? Or, does the inclusion of such comments instead cause code generative tools to reproduce the harmful symptoms of described technical debt? Does the modification of SATD impact this reaction? Despite the heavy maintenance costs caused by technical debt and the recent improvements of code intelligence tools, no prior works have sought to incorporate SATD towards prompt engineering. Inspired by this, this paper contributes and analyzes a dataset consisting of 36,381 TODO comments in the latest available revisions of their respective 102,424 repositories, from which we sample and manually generate 1,140 code bodies using GitHub Copilot. Our experiments show that GitHub Copilot can generate code with the symptoms of SATD, both prompted and unprompted. Moreover, we demonstrate the tool's ability to automatically repay SATD under different circumstances and qualitatively investigate the characteristics of successful and unsuccessful comments. Finally, we discuss gaps in which GitHub Copilot's successors and future researchers can improve upon code intelligence tasks to facilitate AI-assisted software maintenance.

Artificial Intelligence

Fairify: Fairness Verification of Neural Networks

2023

Sumon Biswas , Hridesh Rajan

45th International Conference on Software Engineering (ICSE)

Fairness of machine learning (ML) software has become a major concern in the recent past. Although recent research on testing and improving fairness has demonstrated impact on real-world software, providing fairness guarantees in practice is still lacking. Certification of ML models is challenging because of the complex decision-making process of the models. In this paper, we proposed Fairify, an SMT-based approach to verify the individual fairness property in neural network (NN) models. Individual fairness ensures that any two similar individuals get similar treatment irrespective of their protected attributes, e.g., race, sex, age. Verifying this fairness property is hard because of the global checking and non-linear computation nodes in NNs. We proposed a sound approach to make individual fairness verification tractable for developers. The key idea is that many neurons in the NN always remain inactive when a smaller part of the input domain is considered. So, Fairify leverages white-box access to the models in production and then applies formal-analysis-based pruning. Our approach adopts input partitioning and then prunes the NN for each partition to provide a fairness certification or a counterexample. We leveraged interval arithmetic and the activation heuristic of the neurons to perform the pruning as necessary. We evaluated Fairify on 25 real-world neural networks collected from four different sources, and demonstrated its effectiveness, scalability, and performance over baseline and closely related work. Fairify is also configurable based on the domain and size of the NN. Our novel formulation of the problem can answer targeted verification queries with relaxations and counterexamples, which have practical implications.

Trustworthy AI, Artificial Intelligence

Towards Understanding Fairness and its Composition in Ensemble Machine Learning

2023

Usman Gohar , Sumon Biswas , Hridesh Rajan

45th International Conference on Software Engineering (ICSE)

Machine Learning (ML) software has been widely adopted in modern society, with reported fairness implications for minority groups based on race, sex, age, etc. Many recent works have proposed methods to measure and mitigate algorithmic bias in ML models. The existing approaches focus on single classifier-based ML models. However, real-world ML models are often composed of multiple independent or dependent learners in an ensemble (e.g., Random Forest), where the fairness composes in a non-trivial way. How does fairness compose in ensembles? What are the fairness impacts of the learners on the ultimate fairness of the ensemble? Can fair learners result in an unfair ensemble? Furthermore, studies have shown that hyperparameters influence the fairness of ML models. Ensemble hyperparameters are more complex since they affect how learners are combined in different categories of ensembles. Understanding the impact of ensemble hyperparameters on fairness will help programmers design fair ensembles. Today, we do not understand these fully for different ensemble algorithms. In this paper, we comprehensively study popular real-world ensembles: bagging, boosting, stacking and voting. We have developed a benchmark of 168 ensemble models collected from Kaggle on four popular fairness datasets. We use existing fairness metrics to understand the composition of fairness. Our results show that ensembles can be designed to be fairer without using mitigation techniques. We also identify the interplay between fairness composition and data characteristics to guide fair ensemble design. Finally, our benchmark can be leveraged for further research on fair ensembles. To the best of our knowledge, this is one of the first and largest studies on fairness composition in ensembles yet presented in the literature.

Trustworthy AI, Artificial Intelligence

Fix Fairness, Don't Ruin Accuracy: Performance Aware Fairness Repair using AutoML

2023

Giang Nguyen , Sumon Biswas , Hridesh Rajan

31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

Machine learning (ML) is increasingly being used in critical decision-making software, but incidents have raised questions about the fairness of ML predictions. To address this issue, new tools and methods are needed to mitigate bias in ML-based software. Previous studies have proposed bias mitigation algorithms that only work in specific situations and often result in a loss of accuracy. Our proposed solution is a novel approach that utilizes automated machine learning (AutoML) techniques to mitigate bias. Our approach includes two key innovations: a novel optimization function and a fairness-aware search space. By improving the default optimization function of AutoML and incorporating fairness objectives, we are able to mitigate bias with little to no loss of accuracy. Additionally, we propose a fairness-aware search space pruning method for AutoML to reduce computational cost and repair time. Our approach, built on the state-of-the-art Auto-Sklearn tool, is designed to reduce bias in real-world scenarios. In order to demonstrate the effectiveness of our approach, we evaluated our approach on four fairness problems and 16 different ML models, and our results show a significant improvement over the baseline and existing bias mitigation techniques. Our approach, Fair-AutoML, successfully repaired 60 out of 64 buggy cases, while existing bias mitigation techniques only repaired up to 44 out of 64 cases.

Trustworthy AI, Artificial Intelligence

Towards Safe ML-Based Systems in Presence of Feedback Loops

2023

Sumon Biswas , Yining She , Eunsuk Kang

International Workshop on Dependability and Trustworthiness of Safety-Critical Systems with Machine Learned Components (SE4SafeML @ ESEC/FSE)

Machine learning (ML) based software is increasingly being deployed in a myriad of socio-technical systems, such as drug monitoring, loan lending, and predictive policing. Although not commonly considered safety-critical, these systems have a potential to cause serious, long-lasting harm to users and the environment due to their close proximity and effect on the society. One type of emerging problem in these systems is unintended side effects from a feedback loop; the decision of ML-based system induces certain changes in the environment, which, in turn, generates observations that are fed back into the system for further decision-making. When this cyclic interaction between the system and the environment repeats over time, its effect may be amplified and ultimately result in an undesirable outcome. In this position paper, we bring attention to the safety risks that are introduced by feedback loops in ML-based systems, and the challenges of identifying and addressing them. In particular, due to their gradual and long-term impact, we argue that feedback loops are difficult to detect and diagnose using existing techniques in software engineering. We propose a set of research problems in modeling, analyzing, and testing ML-based systems to identify, monitor, and mitigate the effects of an undesirable feedback loop.

Trustworthy AI, Artificial Intelligence

The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

2022

Sumon Biswas , Mohammad Wardat , Hridesh Rajan

44th International Conference on Software Engineering (ICSE)

An increasingly large number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling, and so on is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do the pipelines differ between their theoretical representations and their use in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state-of-the-art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.

Artificial Intelligence

23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software

2022

David O'Brien , Sumon Biswas , Sayem Imtiaz , Rabe Abdalkareem , Emad Shihab , Hridesh Rajan

30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

In software development, the term "technical debt" (TD) is used to characterize short-term solutions and workarounds implemented in source code that may incur a long-term cost. Technical debt has a variety of forms and can thus affect multiple qualities of software including but not limited to its legibility, performance, and structure. In this paper, we have conducted a comprehensive study on the technical debt in machine learning (ML) based software. Technical debt can appear differently in ML software by infecting the data that ML models are trained on, thus affecting the functional performance of ML systems. The growing inclusion of ML components in modern software systems is introducing a new set of TDs. Does ML software have similar TDs to traditional software? If not, what are the new types of machine learning specific technical debts? In which ML pipeline stages do those debts appear? Do these debts differ between ML tools and applications, and when do they get removed? Currently, we do not know the state of the ML TDs in the wild. To address these questions, we mined 68,821 self-admitted technical debts (SATD) from all the revisions of a curated dataset consisting of 2,686 mature ML repositories from GitHub, along with their introduction and removal. By applying an open-coding scheme and following upon prior works, we provided a comprehensive taxonomy of ML SATDs. Our study analyzes ML SATD type organizations, their frequencies within stages of ML software, the differences between ML SATDs in applications and tools, and the effort of ML SATD removals. The findings suggest implications for ML developers and researchers to create maintainable ML systems.

Artificial Intelligence

Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline

2021

Sumon Biswas , Hridesh Rajan

29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in a machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in an ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes in the global fairness of the pipeline. We used the fairness composition to choose an appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.

Trustworthy AI, Artificial Intelligence

Do the Machine Learning Models on a Crowd Sourced Platform Exhibit Bias? An Empirical Study on Model Fairness

2020

Sumon Biswas , Hridesh Rajan

28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

Machine learning models are increasingly being used in important decision-making software such as approving bank loans, recommending criminal sentencing, hiring employees, and so on. It is important to ensure the fairness of these models so that no discrimination is made based on protected attributes (e.g., race, sex, age) during decision making. Algorithms have been developed to measure unfairness and mitigate it to a certain extent. In this paper, we have focused on the empirical evaluation of fairness and mitigations on real-world machine learning models. We have created a benchmark of 40 top-rated models from Kaggle used for 5 different tasks, and then using a comprehensive set of fairness metrics, evaluated their fairness. Then, we have applied 7 mitigation techniques on these models and analyzed the fairness, mitigation results, and impacts on performance. We have found that some model optimization techniques result in inducing unfairness in the models. On the other hand, although there are some fairness control mechanisms in machine learning libraries, they are not documented. The mitigation algorithms also exhibit common patterns: mitigation in the post-processing stage is often costly (in terms of performance), and mitigation in the pre-processing stage is preferred in most cases. We have also presented different trade-off choices of fairness mitigation decisions. Our study suggests future research directions to reduce the gap between theoretical fairness-aware algorithms and the software engineering methods to leverage them in practice.

Trustworthy AI, Artificial Intelligence
