Where Does LDA Sit for GitHub? (SEI 2019)

Sun 10 - Fri 15 November 2019 San Diego, California, United States

Track

SEI 2019

When

Fri 15 Nov 2019 12:00 - 12:30 at Cortez 3 - Software Engineering Intelligence via NLP

Abstract

A lot of research in Software Engineering (SE) automatically extracts topics from text data and uses the results directly or as features for a machine learning method. Research has shown that the majority of SE studies use Latent Dirichlet Allocation (LDA) as the topic modeling approach. Similarly, a lot of work applies LDA to GitHub data. However, no study explores whether LDA is a good choice compared to other algorithms, nor does any investigate the effects of specific pre-processing steps on its performance. In this paper, we explore a large dataset of GitHub repositories and apply two main topic modeling algorithms, LDA (3 variants) and Non-Negative Matrix Factorization (NMF), in several experiments with different experimental settings. The results show that LDA achieves a higher coherence score than NMF. However, care should be taken in the choice of LDA algorithm, the setting of its parameters, and the text pre-processing steps. The results of this paper benefit SE researchers who apply intelligent techniques using LDA.
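
The abstract compares algorithms by coherence score but does not say which coherence measure was used. As an illustration only, the sketch below computes the UMass coherence of a topic's top words over a tiny, made-up tokenized corpus. The choice of UMass, the function name `umass_coherence`, and the toy documents are all assumptions and are not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (an assumed measure, not the paper's).

    top_words: topic words ordered from most to least probable.
    documents: list of tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() preserves input order, so `higher` always ranks
    # above `lower` in the topic's word list.
    for higher, lower in combinations(top_words, 2):
        denom = doc_freq(higher)
        if denom == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(higher, lower) + 1) / denom)
    return score


# Hypothetical README-like documents, already tokenized.
docs = [
    ["python", "machine", "learning", "model"],
    ["python", "web", "framework", "server"],
    ["machine", "learning", "model", "training"],
    ["web", "server", "http", "framework"],
]

coherent_score = umass_coherence(["machine", "learning", "model"], docs)
mixed_score = umass_coherence(["machine", "server", "http"], docs)
print(coherent_score, mixed_score)
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated words. In a comparison like the one described, such a score would be averaged over all topics produced by each algorithm.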

Session Program

Fri 15 Nov

11:00 - 12:30: SEI 2019 - Software Engineering Intelligence via NLP at Cortez 3

11:00 - 11:30
Talk

Mining Text in Incident Repositories: Experiences and Perspectives on Adopting Machine Learning Solutions in Practice.

11:30 - 12:00
Talk

Predicting Defects with Latent and Semantic Features from Commit Logs in an Industrial Setting.

12:00 - 12:30
Talk

Where Does LDA Sit for GitHub?
