Git Repository Data Extraction¶
The tidyextractors.tidygit submodule lets you extract Git log data from a local Git repository. This page will guide you through the process.
A Minimal Code Example¶
from tidyextractors.tidygit import GitExtractor
# Extract data from a local Git repo
gx = GitExtractor('./your/repo/dir/')
# Commit data in a Pandas DataFrame.
commits_df = gx.commits(drop_collections=True)
# Commit/file keyed change data in a Pandas DataFrame
changes_df = gx.changes()
Step 1: Prepare Your Git Repo¶
All you need to get started is the path to a local Git repository. If you want to extract data from a repository hosted on GitHub, download or clone the repository to your computer.
Step 2: Extract Data¶
You can extract data from any local Git repository using the GitExtractor:
from tidyextractors.tidygit import GitExtractor
gx = GitExtractor('./your/repo/dir/')
You may need to wait while the data is being extracted, but all the data is now stored inside the extractor object. You just need a bit more code to get it in your preferred format.
Step 3: Get Pandas Data¶
Now, you can call a GitExtractor method to return data in a Pandas DataFrame.
# Commit data in a Pandas DataFrame.
commits_df = gx.commits(drop_collections=True)
# Commit/file keyed change data in a Pandas DataFrame
changes_df = gx.changes()
Note
GitExtractor.commits() drops columns with collections of data in cells (i.e. list, set, and dicts) because “tidy data” requires only atomic values in cells.
If you don’t want data dropped, change the optional drop_collections argument to false.