All Seminars

Title: Disease Risk annotation of Genomic and Epigenomic Variants using Machine Learning Approaches
Defense: Computer Science
Speaker: Yanting Huang, Emory University
Contact: Zhaohui Qin, zhaohui.qin@emory.edu
Date: 2022-07-15 at 12:00PM
Venue: https://zoom.us/j/5398613286?pwd=bWltZWxQVi9GZmU5WWRyUDlqamdTdz09
Abstract:
Understanding the impact of genomic variations and epigenomic modifications is important for discovering the mechanisms of complex diseases. Over the last two decades, thousands of genome-wide association studies (GWASs) and epigenome-wide association studies (EWASs) have identified tens of thousands of disease-susceptibility loci. In addition to these association studies, many machine learning approaches have been applied to predict the pathogenicity of genetic variants and epigenetic modifications. For example, CADD used logistic regression to prioritize functional, deleterious, and pathogenic variants. GWAVA used random forests to distinguish disease-implicated variants from benign variants. BioMM used a hybrid two-stage model with support vector machines, random forests, logistic regression, the Lasso, and elastic net to identify epigenetic signatures of schizophrenia.

In my thesis, I proposed several machine learning predictive models, each with a different focus on the annotation of genomic and epigenomic variants: 1) EWASplus, an ensemble-learning-based framework for the risk prediction of DNA methylation loci associated with Alzheimer’s Disease; 2) CASAVA (Disease Category-specific Annotation of Variants), a disease-category risk annotation for genome-wide SNPs (single nucleotide polymorphisms); and 3) DRAFT (Disease Risk Annotation with Few shoTs learning), an end-to-end deep-learning-based approach that incorporates contrastive learning to tackle the scarcity of known risk variants, which hinders the application of traditional deep learning models in this research field. The raw training data is obtained from the ENCODE and REMC projects and processed with our own pipeline.
Title: Practical Coded Computation Mechanisms for Distributed Computing
Seminar: Computer Science
Speaker: Dr. Hulya Seferoglu, University of Illinois at Chicago
Contact: Vaidy Sunderam, vss@emory.edu
Date: 2021-12-10 at 1:00PM
Venue: https://emory.zoom.us/j/98352727203
Abstract:
A massive amount of data is generated by the emerging Internet of Things (IoT), including self-driving cars, drones, robots, smartphones, wireless sensors, smart meters, and health monitoring devices. These data are expected to be processed in real time in many time-sensitive IoT applications, which is extremely challenging, if not impossible, with the existing centralized cloud. For example, self-driving cars generate around 10GB of data per mile. Transmitting such massive data from end devices (such as self-driving cars) to the centralized cloud and expecting timely processing is not realistic given the limited bandwidth between the end users and the centralized cloud. A distributed computing system, where computationally intensive tasks are processed securely and in a distributed manner at the end devices with possible help from edge servers and the cloud, might be a better approach to solving this problem. In this context, this talk will present our recent efforts on practical distributed computing mechanisms to securely harvest heterogeneous resources, including computing power, storage, battery, and networking resources, scattered across end devices, edge servers, and the cloud. The first part of the talk will focus on secure coded computation algorithms and protocols that adapt to the heterogeneous and dynamic nature of edge computing systems and resources, while the second part will explore distributed learning at the edge.
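
As a rough illustration of the idea behind coded computation, the sketch below splits a matrix into two row blocks, adds a parity block, and recovers a matrix-vector product from any two of three workers. It is a textbook-style (3, 2) example, not the speaker's algorithm: it only tolerates one slow or missing worker, it does not address the security aspects mentioned in the abstract, and every function and worker name is hypothetical.

```python
# Illustrative (3, 2) coded matrix-vector multiplication.
# Assumption: A has an even number of rows; names w1, w2, w3 are hypothetical workers.

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def encode(A):
    """Split A into two row blocks and add a parity block A1 + A2."""
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return {"w1": A1, "w2": A2, "w3": parity}

def decode(results):
    """Recover A @ x from the results of any two of the three workers."""
    if "w1" in results and "w2" in results:
        y1, y2 = results["w1"], results["w2"]
    elif "w1" in results:
        y1 = results["w1"]
        y2 = [p - a for p, a in zip(results["w3"], y1)]
    else:
        y2 = results["w2"]
        y1 = [p - b for p, b in zip(results["w3"], y2)]
    return y1 + y2

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
tasks = encode(A)
# Pretend worker w1 is a straggler and never responds.
results = {name: matvec(block, x) for name, block in tasks.items() if name != "w1"}
recovered = decode(results)
```

Here `recovered` equals the product computed directly with `matvec(A, x)`, even though one worker's result was never used.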

Biography: Hulya Seferoglu is an Associate Professor in the Electrical and Computer Engineering Department of the University of Illinois at Chicago. Before joining the University of Illinois at Chicago, she was a Postdoctoral Associate at the Massachusetts Institute of Technology. She received her Ph.D. degree in Electrical and Computer Engineering from the University of California, Irvine, her M.S. degree in Electrical Engineering and Computer Science from Sabanci University, and her B.S. degree in Electrical Engineering from Istanbul University. She worked as a summer intern at AT&T Labs Research, Docomo USA Labs, and Microsoft Research, Cambridge. She serves as an associate editor for IEEE/ACM Transactions on Networking. She received the NSF CAREER award in 2020.

Title: Making Life Accessible through Machine Learning Applications
Seminar: Computer Science
Speaker: Dr. Emily Hand, University of Nevada Reno
Contact: Vaidy Sunderam, vss@emory.edu
Date: 2021-12-07 at 1:00PM
Venue: https://emory.zoom.us/j/98352727203
Abstract:
Dr. Hand’s long-term research vision is to build a wearable device that can provide individuals with feedback that improves their social interactions. The ideal users of this system are individuals on the Autism spectrum or those with visual impairments. There is a significant body of work on navigation for the visually impaired, but most of life exists outside of getting from point A to point B. Dr. Hand’s research group, the Machine Perception Lab, seeks to improve the quality of life for these individuals by focusing on social interactions. This involves understanding faces, body language, and spoken words. Her group has several active research projects towards this goal, including recognizing first impressions from faces, describing faces, recognizing micro-expressions, summarizing text and identifying sarcasm.
Title: Mining User Generated Content: Addressing Data Scarcity in Filtering Tasks
Defense: Computer Science
Speaker: Payam Karisani, Emory University
Contact: Li Xiong, lxiong@emory.edu
Date: 2021-12-03 at 10:00AM
Venue: https://emory.zoom.us/j/7390068295
Abstract:
Filtering tasks have a broad range of applications in mining user-generated data. Examples include public health monitoring, product monitoring, user satisfaction analysis, crisis management, and hate speech detection. This dissertation proposes methods and techniques to overcome one of the primary challenges of these tasks, namely the lack of sufficient training data. It makes four main contributions.

First, it employs semi-supervised learning and proposes a novel method based on self-training and pseudo labeling to use unlabeled data. Our model uses the pretraining-finetuning paradigm in a semi-supervised setting to use unlabeled data for model initialization. It also employs a novel learning rate schedule to exploit noisy pseudo-labels as a means to explore the loss surface. We empirically demonstrate the efficacy of these strategies.
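
For readers unfamiliar with self-training, the loop can be sketched as follows. This is a generic illustration on one-dimensional toy data with a nearest-centroid classifier, not the dissertation's model: it omits the pretraining-finetuning step and the learning rate schedule described above, and the function names, the margin-based confidence, and the `threshold` parameter are all assumptions made for the example.

```python
# Generic self-training with pseudo-labels on 1-D toy data.
# All names and the confidence measure are hypothetical illustrations.

def centroids(points, labels):
    """Fit the 'model': the mean of the points in each class."""
    out = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        out[c] = sum(members) / len(members)
    return out

def predict(model, p):
    """Return (label, margin): nearest centroid and its lead over the runner-up."""
    ranked = sorted(model, key=lambda c: abs(p - model[c]))
    margin = abs(p - model[ranked[1]]) - abs(p - model[ranked[0]])
    return ranked[0], margin

def self_train(points, labels, unlabeled, threshold=2.0, rounds=5):
    """Repeatedly add confidently pseudo-labeled points to the training set."""
    points, labels, pool = list(points), list(labels), list(unlabeled)
    for _ in range(rounds):
        model = centroids(points, labels)
        scored = [(p, *predict(model, p)) for p in pool]
        accepted = [(p, lab) for p, lab, margin in scored if margin >= threshold]
        if not accepted:
            break  # nothing confident enough is left in the pool
        for p, lab in accepted:
            points.append(p)
            labels.append(lab)
            pool.remove(p)
    return centroids(points, labels)

model = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 4.5, 8.0, 9.0])
```

In this toy run, the ambiguous point 4.5 never clears the confidence threshold and is left unlabeled, while the other four points are absorbed into the training set.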

Second, it proposes a novel active learning model when additional labels can be obtained for a range of tasks. Specifically, we use a multi-view model to extract two views from documents, and then, we propose a novel acquisition function to aggregate the informativeness and the representativeness metrics for querying additional labels. We analytically argue that our acquisition function incorporates document contexts into the active learning query process. We also treat the highly informal language of users in social media as a factor that manifests itself in the output of learners and causes a high variance. Therefore, we employ a query-by-committee model as a variance reduction technique to combat this undesired effect. Our experiments show that our model significantly outperforms existing models.
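
As a hedged illustration of query-by-committee in general, the sketch below scores each document by the entropy of the committee members' votes and queries the one with the most disagreement. It does not reproduce the acquisition function proposed in the dissertation, which also aggregates representativeness; the document ids and labels are made up for the example.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes for one document."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def select_query(committee_votes):
    """Return the id of the document the committee disagrees on most."""
    return max(committee_votes, key=lambda doc: vote_entropy(committee_votes[doc]))

# Hypothetical votes from a three-member committee.
committee_votes = {
    "doc_a": ["relevant", "relevant", "relevant"],
    "doc_b": ["relevant", "irrelevant", "relevant"],
    "doc_c": ["irrelevant", "irrelevant", "irrelevant"],
}
query = select_query(committee_votes)
```

Only `doc_b` receives split votes, so it is the document selected for labeling.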

Third, it exploits unlabeled documents in a multi-view model. We propose a novel algorithm for one of the most challenging filtering tasks in social media, i.e., the adverse drug reaction monitoring task. Here, we propose a pair of loss functions to pretrain and then finetune the classifier in each view using the pseudo-labels obtained in the other view. In this way, we effectively transfer the knowledge obtained in one view to the classifier in the other view. We empirically demonstrate that this model is the first known algorithm that outperforms the multi-layer transformer models pretrained on domain-specific data.

Finally, we observe that although in many cases labeled data is not available, annotated data for semantically similar tasks is available. Motivated by this, we formulate a new problem and propose an algorithm for single-source domain adaptation. We assume that in addition to the source and target data, we can access a set of unlabeled auxiliary domains. We empirically show that existing state-of-the-art models are unable to effectively use this type of data. We then propose a novel algorithm based on the uncertainty in output predictions to decompose the target data into two sets. Then, we show that training on the set of confidently labeled target documents along with the auxiliary unlabeled data yields a classifier that is highly effective in the regions close to the classification decision boundaries. Our experiments show that our algorithm outperforms the state-of-the-art in this new problem setting.
Title: “Welcome Aboard!”: Remote Onboarding of Software Developers
Seminar: Computer Science
Speaker: Dr. Paige Rodeghero, Clemson University
Contact: Vaidy Sunderam, vss@emory.edu
Date: 2021-12-03 at 1:00PM
Venue: https://emory.zoom.us/j/98352727203
Abstract:
The onboarding process of software developers is a “necessary evil” for a productive software development team. The process is expensive and time consuming. New hires onboarding to a remote or hybrid team need to quickly learn their project, understand their tasks, form connections with their team, and learn the company’s culture. Hybrid and remote work can make these tasks challenging due to asynchronous communication and collaboration.

Join Dr. Paige Rodeghero, Assistant Professor at Clemson University, and former Visiting Researcher at Microsoft Research, to explore the onboarding process, productivity, and how we can prepare the future of software developers for the new hybrid work model.

In this talk, we will explore the unique challenges faced by new hires and managers during remote onboarding and the importance of strong social connectedness within teams. Finally, we will cover recommendations for a smooth onboarding process.

Biography: Dr. Paige Rodeghero is an Assistant Professor of Computer Science at Clemson University. She obtained her Ph.D. in Computer Science and Engineering from the University of Notre Dame under the direction of Collin McMillan. Her main research interest is in software engineering, focused on productivity, remote work, onboarding, source code comprehension, computer science education, and software engineering for autism. In 2020, she was a visiting researcher at Microsoft Research. She has won multiple best paper awards, including two ACM SIGSOFT Distinguished Paper Awards. Prior to her research career, she worked in industry as a lead software engineer for a startup company and as a software engineer at multiple medium-sized companies.

Title: Temporal Irregular Tensor Factorization and Prediction for Health Data Analysis
Defense: Computer Science
Speaker: Yifei Ren, Emory University
Contact: Dr. Li Xiong, li.xiong@emory.edu
Date: 2021-12-02 at 12:00PM
Venue: https://emory.zoom.us/j/2878070037
Abstract:
Tensors are a popular algebraic structure for a wide range of applications, due to their exceptional capability to model multidimensional relationships in data. Among them, regular tensors with aligned dimensions for all modes have been extensively studied, and various tensor factorization structures have been proposed for them depending on the application. However, regular tensor decomposition cannot handle many real-world cases involving time, because such data are irregular. Electronic health records (EHRs) are often generated and collected across a large number of patients featuring distinctive medical conditions and clinical progress over a long period of time, which results in unaligned records along the time dimension. Recently, PARAFAC2 has been re-popularized for successfully extracting meaningful medical concepts (phenotypes) from such temporal EHRs by irregular tensor factorization.
However, existing PARAFAC2 methods suffer from three major limitations that impact their applicability to practical temporal EHR data analysis: 1) they are not robust to missing and erroneous elements in the data; 2) they fail to model the non-linear temporal dependency of patients' disease states, and are designed only for a single data type -- numeric or binary; 3) they are completely unsupervised, i.e., they attempt to learn the latent factors to best recover the original observations without considering downstream predictive tasks. While there are models that use extracted phenotypes for predictive tasks, they are trained separately and only consider a single prediction task, which ignores auxiliary information from other predictive tasks.
To address these limitations, we make three main contributions in this dissertation. We first propose a robust PARAFAC2 tensor factorization method for irregular tensors with a new low-rank regularization function to handle potentially missing and erroneous entries in the input data. We then propose a generalized, low-rank Recurrent Neural Network (RNN) regularized robust irregular tensor factorization for more accurate temporal modeling, with the flexibility to choose from different losses to best suit different types of data in practice. Finally, we propose a supervised irregular tensor factorization framework with multi-task learning for joint optimization of phenotype extraction and predictive learning, which can yield not only more meaningful phenotypes but also better predictive accuracy.
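
For orientation, the classical PARAFAC2 model that these contributions build on can be written as below. This is the standard textbook formulation, not the regularized or supervised objectives proposed in the dissertation. The symbols are assumptions introduced here: $X_k$ is the visits-by-features matrix of patient $k$ with $I_k$ visits and $J$ features, and $R$ is the number of phenotypes.

```latex
\min_{\{Q_k\},\, H,\, \{S_k\},\, V} \;
  \sum_{k=1}^{K} \left\lVert X_k - U_k S_k V^{\top} \right\rVert_F^2
\quad \text{subject to} \quad
  U_k = Q_k H, \qquad Q_k^{\top} Q_k = I_R,
% X_k \in \mathbb{R}^{I_k \times J}: records of patient k (I_k differs across patients)
% Q_k \in \mathbb{R}^{I_k \times R}, \; H \in \mathbb{R}^{R \times R}
% S_k \in \mathbb{R}^{R \times R}: diagonal, patient-specific phenotype weights
% V \in \mathbb{R}^{J \times R}: phenotype definitions shared by all patients
```

The constraint makes $U_k^{\top} U_k = H^{\top} H$ identical for every patient, which is what allows the number of rows $I_k$ to differ across patients while the factorization remains identifiable.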
Title: Computational discovery of interpretable histopathologic prognostic biomarkers in invasive carcinomas of the breast
Defense: Computer Science
Speaker: Mohamed Amgad, Emory University
Contact: Vaidy Sunderam, VSS@emory.edu
Date: 2021-11-30 at 1:00PM
Venue: https://northwestern.zoom.us/j/95813522155
Abstract:
While microscopic examination of tumor resections and biopsies has been a cornerstone in breast cancer grading for decades, it suffers from considerable inter-rater variability due to perceptual limitations and high clinical caseloads. Computational analysis of whole-slide image scans using convolutional neural networks (CNN) can help address this challenge. Unfortunately, CNNs can be difficult to interpret, which motivates our adoption of an approach called concept bottlenecking, where models first detect various tissue structures then use them to make their prediction. Concept bottleneck models require a large set of manual annotation data to train. Unfortunately, manual delineation of histopathologic structures is very demanding and impractical given pathologists’ time constraints. This dissertation describes contributions that fall under the themes of scalable data collection, deep learning-based tissue detection, and the discovery of novel histopathologic biomarkers and associations.

First, we examine crowdsourcing approaches that engage medical students to collect manual annotation data. Our results show that a structured, collaborative approach with pathologist supervision is scalable; the resultant publicly-released BCSS and NuCLS datasets contain 20,000 and 200,000 annotations of tissue regions and nuclei, respectively. We show that medical students produce accurate annotations for predominant, visually distinctive structures and that algorithmic suggestions help scale and improve the accuracy of annotations.

Second, we describe a set of CNN modeling approaches for the accurate delineation of histopathologic structures. We describe various improvements to enhance the performance of nucleus detection CNN models and introduce a technique called Decision Tree Approximation of Learned Embeddings, which helps explain CNN nucleus classifications without compromising prediction accuracy. Additionally, we offer consensus recommendations from the International Immuno-Oncology Working Group surrounding the computational detection of tumor-infiltrating lymphocytes, a critical emerging biomarker. Following these recommendations, we develop and validate a multi-scale CNN model that jointly detects tissue regions and nuclei, employing pre-defined biological constraints to improve accuracy.

Finally, we describe the development of a morphologic signature based on quantitative features extracted from computationally-delineated histopathologic regions and cells. This morphologic signature relies partly on a set of stromal features not captured by clinical guidelines for breast cancer grading, and has a stronger independent prognostic value.
Title: Knowledge-Aware User Intent Inference for Web Search and Conversational Agents
Seminar: Computer Science
Speaker: Ali Ahmadvand, Emory University
Contact: Eugene Agichtein, eugene.agichtein@emory.edu
Date: 2021-11-24 at 2:00PM
Venue: https://zoom.us/j/9912158487?pwd=aURCWjVpY1BmVzBaSDB6QktmZ2xvZz09
Abstract:
User intent inference is a critical step in designing intelligent information systems (e.g., conversational agents and e-commerce search engines). Accurate user intent inference improves user experience and satisfaction, but is a challenging task since user utterances or queries can be short, ambiguous, and contextually dependent. Moreover, in an e-commerce setting, the collected datasets are often labeled by weak supervision (e.g., click-through data), resulting in an imbalanced and sparse dataset. To address these problems, my dissertation proposes integrating entity knowledge-bases, conversation context, and user profile information to improve user intent inference for conversational agents. Additionally, I investigate joint learning, product taxonomies, and unlabeled domain-specific corpora (e.g., catalog) to improve query intent inference in e-commerce search.

To evaluate the proposed models, I examine the user intent inference for two main settings: 1) open-domain conversational agents and 2) e-commerce search engines. The conversational agent research is evaluated on conversations collected from real users as part of Amazon Alexa Prize competitions, and the e-commerce efforts use real query logs collected from The Home Depot's search engine. My dissertation shows that leveraging entity knowledge-base, conversation context, and user profile information accounts for most improvements for the conversational setting. The results demonstrate that the proposed models significantly enhance topic classification accuracy by 15% and dialogue act accuracy by 8% for conversational agents. For e-commerce search, the dissertation shows that joint-learning, product taxonomies, and unlabeled domain-specific corpora can significantly improve intent inference accuracy. The proposed models improve the performance of the top-1 retrieved documents by 6%-8% on standard metrics for e-commerce search. The results in both settings offer a significant improvement over state-of-the-art deep learning methods. The insights and findings in this dissertation suggest a promising direction for developing the user intent inference in both open-domain conversational agents and e-commerce search.
Title: Fairness in Social Networks
Seminar: Computer Science
Speaker: Sucheta Soundarajan, Syracuse University
Contact: Joyce Ho, joyce.ho@emory.edu
Date: 2021-11-19 at 1:00PM
Venue: https://emory.zoom.us/j/98352727203
Abstract:
Social networks play a vital role in the spread of information through a population, and individuals in networks make important life decisions on the basis of the information to which they have access. In many cases, it is important to evaluate whether information is spreading fairly to all groups in a network. For instance, are male and female students equally likely to hear about a new scholarship? In this talk, I present the novel "information unfairness" criterion, which measures whether information spreads fairly to all groups in a network. I then discuss the results of a case study on the DBLP computer science co-authorship network with respect to gender, with several surprising results.
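
The abstract does not define the "information unfairness" criterion, so the sketch below is only a simplified proxy for the question it raises: starting from one seed node, what fraction of each group hears the information within a fixed number of hops? The toy graph, the group labels, and the function names are all hypothetical, and the actual criterion presented in the talk may differ substantially.

```python
from collections import deque

def reached_within(graph, seed, max_hops):
    """Breadth-first search: nodes reachable from seed in at most max_hops steps."""
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in graph[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return set(seen)

def reach_by_group(graph, groups, seed, max_hops):
    """Fraction of each group that receives information started at seed."""
    reached = reached_within(graph, seed, max_hops)
    out = {}
    for g in set(groups.values()):
        members = [n for n in groups if groups[n] == g]
        out[g] = sum(n in reached for n in members) / len(members)
    return out

# Hypothetical undirected network with two groups, X and Y.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}
groups = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y", "f": "Y"}
rates = reach_by_group(graph, groups, seed="a", max_hops=2)
```

In this toy network, all of group X but only one third of group Y is reached within two hops of node `a`, which is the kind of gap a fairness measure would need to capture.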

Biography:

Sucheta Soundarajan is an Associate Professor in the Electrical Engineering & Computer Science Department at Syracuse University. Her areas of interest include social network analysis and data mining, and her research covers topics such as network clustering, sampling, information flow, and centrality. She is a recipient of the NSF CAREER award, Army Research Office Young Investigator Award, and the SIAM Science Policy Fellowship. She received her PhD from Cornell University in 2013.
Title: Predicting Rare Clinical Events in Complex and Dynamic Environments
Defense: Computer Science
Speaker: Azade Tabaie, Emory University
Contact: Dr. Rishikesan Kamaleswaran, rkamaleswaran@emory.edu
Date: 2021-11-16 at 3:00PM
Venue: https://us02web.zoom.us/j/3978183382?pwd=anFkN2V3d2VBWEErd2VCOEdWL2xiUT09
Abstract:
Traditional machine learning classification algorithms assume a balanced proportion of classes in the data. However, class-imbalanced data is a challenge for training predictive models in many fields such as the medical domain. Although adverse patient outcomes occur rarely, they are worth predicting to improve the quality of care that patients receive; therefore, monitoring systems are needed in the hospital setting to capture rare adverse events and improve patient health outcomes.

To that end, machine learning and natural language processing (NLP) techniques were used along with clinical expert knowledge to address the issue of rare event classification in a complex environment such as a hospital setting. In particular, two different patient cohorts with distinct characteristics and objectives were investigated.

First, strategies were proposed to predict a rare type of infection among hospitalized children with central venous lines (CVLs). This cohort of pediatric patients is at high risk of morbidity and mortality from hospital-acquired infections. Many serious infections in hospitalized children are likely preventable through interventions that prevent the infection or identify it early to initiate antimicrobial therapy. Besides bloodstream infection being a rare clinical event, the definitions that have been proposed for it commonly have inadequate sensitivity for clinically important infections and may be difficult to generalize across electronic health record (EHR) platforms. To infer the onset of the infection from EHRs and eliminate the need for extensive chart reviews, a surrogate definition for bloodstream infection was proposed and validated. Then, two study designs were tested to improve the prediction accuracy of the onset of the infection during hospitalization. Finally, a data fusion approach was undertaken to integrate structured and unstructured information from EHRs to boost the prediction performance. Incremental but meaningful improvements in the predictions were observed after each step.

Second, an algorithm was proposed to monitor visits to an emergency department to detect intimate partner violence (IPV). IPV is a pervasive social challenge with severe health and demographic consequences. People experiencing IPV may seek care in emergency settings. Despite the urgency of this critical public health issue, IPV continues to be profoundly underdiagnosed and is considered a persistent hidden epidemic. IPV is frequently undercoded, undetected without appropriate screening tools, and underreported, rendering it a rare encounter in EHRs. The early and appropriate detection of and response to such cases is critical in disrupting the cycle of abuse, including IPV-related morbidity and mortality. Our proposed algorithm benefits from NLP techniques and domain expert knowledge. It can identify victims of IPV with high sensitivity by analyzing the recorded provider notes and patient narratives.

We argue that all the techniques incorporated in this thesis are transferable to identify other rare clinical events with the ultimate goal of improving the level of care.