While researching samples-filter, where we try to detect sample repositories (repositories that mostly contain educational or demonstration materials meant to be copied rather than reused as a dependency), we got stuck at the dataset-collection stage: the GitHub Search API returns at most 1000 results per query. To work around this limit, we developed ghminer, a command-line tool that aggregates an effectively unlimited number of repositories from GitHub using the GitHub PATs you provide.
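The general idea behind going past the 1000-result cap is to split the search into narrower created-date slices, each of which stays under the limit, and then merge the slices. The sketch below only illustrates that idea; it is not ghminer's actual implementation. It calls the public GitHub Search API directly, and the month-sized slicing is an assumption made for the example:

```typescript
// Illustration only (not ghminer's code): work around the Search API's
// 1000-result cap by slicing the created-date range and merging the slices.
// Requires Node 18+ for the built-in fetch.

type Repo = { full_name: string; stargazers_count: number };

async function searchSlice(query: string, from: string, to: string, token: string): Promise<Repo[]> {
  const repos: Repo[] = [];
  // The Search API exposes at most 1000 results per query: 10 pages of 100.
  for (let page = 1; page <= 10; page++) {
    const q = encodeURIComponent(`${query} created:${from}..${to}`);
    const res = await fetch(
      `https://api.github.com/search/repositories?q=${q}&per_page=100&page=${page}`,
      { headers: { Authorization: `Bearer ${token}`, Accept: 'application/vnd.github+json' } },
    );
    const body = await res.json();
    repos.push(...body.items);
    if (body.items.length < 100) break; // last page of this slice
  }
  return repos;
}

async function searchAll(query: string, start: string, end: string, token: string): Promise<Repo[]> {
  const all: Repo[] = [];
  let from = new Date(start);
  const stop = new Date(end);
  // Walk the interval one month at a time; a real tool would shrink a slice
  // further if it still hits the cap, and deduplicate repos on boundary days.
  while (from < stop) {
    const to = new Date(from);
    to.setMonth(to.getMonth() + 1);
    const sliceEnd = to < stop ? to : stop;
    const slice = await searchSlice(
      query,
      from.toISOString().slice(0, 10),
      sliceEnd.toISOString().slice(0, 10),
      token,
    );
    all.push(...slice);
    from = sliceEnd;
  }
  return all;
}
```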
In order to use ghminer, let's install it first (you will need npm installed):
npm install -g ghminer
Then, execute:
ghminer --query "stars:2..100000 size:>=20 mirror:false template:false topic:ruby" \
--start "2019-01-01" \
--end "2024-05-01" \
--tokens pats.txt \
--json
Here, stars:2..100000 size:>=20 mirror:false template:false topic:ruby is the search query for the GitHub API, 2019-01-01 is the start date (only repositories created on or after this date are searched), and 2024-05-01 is the end date (only repositories created on or before this date are searched). pats.txt is a file that contains your GitHub PATs, one per line, and should look like this:
ghp_pAAAAAA......VCL7wgw
ghp_oBBBBB......XE1ySTKq
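Several tokens are presumably there to spread GitHub's search rate limit across them. The snippet below is a hypothetical sketch of one way such rotation could work (ghminer's real logic may differ): it simply cycles through the tokens read from pats.txt.

```typescript
import { readFileSync } from 'node:fs';

// Hypothetical sketch: read the PATs from pats.txt and hand them out
// round-robin, spreading consecutive search requests across the tokens.
const tokens = readFileSync('pats.txt', 'utf8')
  .split('\n')
  .map((t) => t.trim())
  .filter((t) => t.length > 0);

let next = 0;
function nextToken(): string {
  const token = tokens[next];
  next = (next + 1) % tokens.length;
  return token;
}
```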
Depending on how many repositories the query captures, the collection process can take a while. When it's done, you should have two files: result.csv and result.json.
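For downstream analysis you can load either file. The sketch below reads result.json with Node, assuming it is a JSON array of repository records; the exact field names depend on what ghminer collects, so the name field here is only an assumption:

```typescript
import { readFileSync } from 'node:fs';

// Minimal sketch: load the collected dataset for further filtering.
// The record shape is an assumption; adjust it to the fields ghminer emits.
type RepoRecord = { name?: string; [key: string]: unknown };

const repos: RepoRecord[] = JSON.parse(readFileSync('result.json', 'utf8'));
console.log(`Collected ${repos.length} repositories`);
```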
That’s it, now you can use it in your research!
PS. Here is an example of 14.4k collected JavaScript repositories (it took me ~57 minutes to collect them) that were created during 2023-01-01..2024-05-01.