GitHub Pull Request Time Estimation Algorithm

Updated over 3 months ago

Overview

This algorithm estimates the total time (in hours) a developer spent working on a pull request by analyzing commit timestamps. The key concept is grouping commits into "coding sessions." A coding session is a run of commits made close together in time, on the assumption that the developer was working continuously between them.

The algorithm uses two key parameters:

  • max_commit_diff_in_minutes: This defines the threshold for determining if two consecutive commits belong to the same coding session. If the time difference between two commits exceeds this threshold, they are considered to belong to different sessions.

  • first_commit_addition_in_minutes: This estimates how long the developer was working before making the first commit in each coding session.

Default Values

  • max_commit_diff_in_minutes = 60 minutes

  • first_commit_addition_in_minutes = 15 minutes
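
To make the session concept concrete, here is a minimal Python sketch of how commits could be grouped with the default threshold. The helper name split_into_sessions and the sample timestamps are illustrative and not part of the product. The sketch assumes that a gap exactly equal to the threshold stays in the same session, since the article only says a session ends when the gap exceeds it.

```python
from datetime import datetime, timedelta

# Default value from this article; the constant name is illustrative.
MAX_COMMIT_DIFF_IN_MINUTES = 60


def split_into_sessions(commit_times, max_commit_diff_in_minutes=MAX_COMMIT_DIFF_IN_MINUTES):
    """Group commit times into coding sessions (hypothetical helper)."""
    ordered = sorted(commit_times)
    if not ordered:
        return []
    threshold = timedelta(minutes=max_commit_diff_in_minutes)
    sessions = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        # Assumption: only a gap strictly greater than the threshold
        # starts a new session.
        if current - previous > threshold:
            sessions.append([current])
        else:
            sessions[-1].append(current)
    return sessions


# Made-up commit times: three commits 30 minutes apart, then one 3 hours later.
commits = [
    datetime(2024, 1, 8, 10, 0),
    datetime(2024, 1, 8, 10, 30),
    datetime(2024, 1, 8, 11, 0),
    datetime(2024, 1, 8, 14, 0),
]
sessions = split_into_sessions(commits)
print([len(session) for session in sessions])  # [3, 1]
```

With the default threshold, the first three commits form one session and the 14:00 commit starts a second one.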

Algorithm Steps

  1. Input: The algorithm takes an array of commit dates (dates), which are expected to be in string format.

  2. Preprocessing:

    • If there is only one commit in the list, the algorithm assumes the developer was working for the duration specified by first_commit_addition_in_minutes and returns this value divided by 60 to convert it into hours.

    • If there are multiple commits, it parses the commit dates into Time objects and sorts them chronologically (oldest to newest).

  3. Calculating Total Time:

    • The algorithm starts by adding first_commit_addition_in_minutes (converted to hours) to account for the work done before the first commit of the initial coding session.

    • For each pair of consecutive commits, it calculates the time difference between them.

      • If the time difference is smaller than max_commit_diff_in_minutes, the time between commits is added to the total working time.

      • If the time difference exceeds max_commit_diff_in_minutes, the algorithm assumes the developer took a break, and a new coding session starts. In this case, first_commit_addition_in_minutes is added again to account for the new session.

  4. Final Calculation:

    • The total estimated time (in hours) is rounded to one decimal place and returned as the output.
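
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the production implementation: the function name estimate_pr_hours is hypothetical, the dates are assumed to be ISO 8601 strings in one consistent format, an empty list is assumed to return 0.0, and a gap exactly equal to the threshold is assumed to stay in the same session. Python's round() may also treat exact ties differently from other languages.

```python
from datetime import datetime


def estimate_pr_hours(dates, max_commit_diff_in_minutes=60,
                      first_commit_addition_in_minutes=15):
    """Estimate hours spent on a pull request from commit date strings."""
    # Assumption: the article does not cover an empty list; return 0.0 here.
    if not dates:
        return 0.0

    # Step 2: a single commit counts as one pre-commit allowance, in hours.
    if len(dates) == 1:
        return first_commit_addition_in_minutes / 60

    # Assumption: ISO 8601 strings, all with (or all without) a UTC offset.
    times = sorted(datetime.fromisoformat(d) for d in dates)

    # Step 3: accumulate minutes, starting with the first session's allowance.
    total_minutes = first_commit_addition_in_minutes
    for previous, current in zip(times, times[1:]):
        gap_minutes = (current - previous).total_seconds() / 60
        if gap_minutes > max_commit_diff_in_minutes:
            # A break: a new session starts and gets its own allowance.
            total_minutes += first_commit_addition_in_minutes
        else:
            # Same session: the time between the two commits counts as work.
            total_minutes += gap_minutes

    # Step 4: convert to hours and round to one decimal place.
    return round(total_minutes / 60, 1)


# Made-up example: three commits 30 minutes apart, then one 3 hours later.
example_dates = [
    "2024-01-08T10:00:00+00:00",
    "2024-01-08T10:30:00+00:00",
    "2024-01-08T11:00:00+00:00",
    "2024-01-08T14:00:00+00:00",
]
print(estimate_pr_hours(example_dates))  # 1.5
```

In this example the total is 15 + 30 + 30 + 15 = 90 minutes: the initial allowance, two in-session gaps, and a second allowance for the session that starts at 14:00.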

Limitations

  1. Time to first commit:

    • The variable first_commit_addition_in_minutes is an estimate of how long a developer worked before making the first commit in a coding session. It is a static value, added once per session to account for time spent coding before the first recorded commit. Because it is an approximation, it may not reflect the actual pre-commit work time, which varies between developers and can differ significantly from this estimate.

  2. Coding session threshold:

    • The coding session threshold is controlled by the max_commit_diff_in_minutes variable. It helps distinguish continuous work from breaks, but it does not always reflect a developer's real workflow. A break or interruption shorter than the threshold is counted as working time, while a stretch of uninterrupted work longer than the threshold without a commit is treated as a break and replaced by the fixed first-commit addition. As a result, the algorithm can overestimate or underestimate both the number of coding sessions and the total time.

  3. Impact of Squashed, Re-written, and Force-pushed Commits:

    • Git allows commits to be squashed, re-written, or force-pushed, which can significantly alter the commit history of a pull request. Squashing combines multiple commits into one, which removes the intermediate timestamps the algorithm relies on; a pull request squashed down to a single commit is estimated at first_commit_addition_in_minutes only. If commits are re-written or force-pushed, the original commit timestamps may be lost or modified. In these cases the estimate may not accurately reflect the time spent on the pull request.
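
The limitations above can be illustrated with a small Python sketch. It restates the estimate in compact form (the function name estimate_hours and the commit times are made up for this example, and a gap equal to the threshold is assumed to stay in the same session), then runs the same commits with two thresholds and with the history squashed to a single commit.

```python
from datetime import datetime


def estimate_hours(times, max_diff=60, first_addition=15):
    """Compact, hypothetical restatement of the estimate for datetime inputs."""
    times = sorted(times)
    if not times:
        return 0.0  # Assumption: an empty history yields zero hours.
    if len(times) == 1:
        return first_addition / 60
    total = first_addition
    for previous, current in zip(times, times[1:]):
        gap = (current - previous).total_seconds() / 60
        # Gaps above the threshold are replaced by the fixed addition.
        total += first_addition if gap > max_diff else gap
    return round(total / 60, 1)


# Made-up history: four commits, 45 minutes apart.
commits = [
    datetime(2024, 1, 8, 9, 0),
    datetime(2024, 1, 8, 9, 45),
    datetime(2024, 1, 8, 10, 30),
    datetime(2024, 1, 8, 11, 15),
]

# Same history, different thresholds: every 45-minute gap is either
# counted as work (threshold 60) or treated as a break (threshold 30).
print(estimate_hours(commits, max_diff=60))  # 2.5
print(estimate_hours(commits, max_diff=30))  # 1.0

# Squashed history: only the final commit survives, so the estimate
# falls back to the first-commit addition alone.
squashed = [commits[-1]]
print(estimate_hours(squashed))  # 0.25
```

The same two and a quarter hours of commit activity is estimated at 2.5 hours, 1.0 hour, or 0.25 hours depending on the threshold and on whether the history was squashed.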
