While researching samples-filter, where we try to detect sample repositories (repositories that mostly contain educational or demonstration materials meant to be copied rather than reused as a dependency), we got stuck at the dataset-collection stage: the GitHub Search API returns at most 1000 results per query. To work around this limit, we developed ghminer, a command-line tool that aggregates an effectively unlimited number of repositories from GitHub using the GitHub PATs you provide.
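The general idea behind going past the 1000-result cap is to split the search into narrower created-date slices, each of which stays under the limit, and then merge the slices. The sketch below only illustrates that idea; it is not ghminer's actual implementation. It calls the public GitHub Search API directly, and the month-sized slicing is an assumption made for the example:

```typescript
// Illustration only (not ghminer's code): work around the Search API's
// 1000-result cap by slicing the created-date range and merging the slices.
// Requires Node 18+ for the built-in fetch.

type Repo = { full_name: string; stargazers_count: number };

async function searchSlice(query: string, from: string, to: string, token: string): Promise<Repo[]> {
  const repos: Repo[] = [];
  // The Search API exposes at most 1000 results per query: 10 pages of 100.
  for (let page = 1; page <= 10; page++) {
    const q = encodeURIComponent(`${query} created:${from}..${to}`);
    const res = await fetch(
      `https://api.github.com/search/repositories?q=${q}&per_page=100&page=${page}`,
      { headers: { Authorization: `Bearer ${token}`, Accept: 'application/vnd.github+json' } },
    );
    const body = await res.json();
    repos.push(...body.items);
    if (body.items.length < 100) break; // last page of this slice
  }
  return repos;
}

async function searchAll(query: string, start: string, end: string, token: string): Promise<Repo[]> {
  const all: Repo[] = [];
  let from = new Date(start);
  const stop = new Date(end);
  // Walk the interval one month at a time; a real tool would shrink a slice
  // further if it still hits the cap, and deduplicate repos on boundary days.
  while (from < stop) {
    const to = new Date(from);
    to.setMonth(to.getMonth() + 1);
    const sliceEnd = to < stop ? to : stop;
    const slice = await searchSlice(
      query,
      from.toISOString().slice(0, 10),
      sliceEnd.toISOString().slice(0, 10),
      token,
    );
    all.push(...slice);
    from = sliceEnd;
  }
  return all;
}
```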
In order to use ghminer, let's install it first (you will need npm installed):
npm install -g ghminer
Then, execute:
ghminer --query "stars:2..100000 size:>=20 mirror:false template:false topic:ruby" \
--start "2019-01-01" \
--end "2024-05-01" \
--tokens pats.txt \
--json
Here, stars:2..100000 size:>=20 mirror:false template:false topic:ruby is the search query for the GitHub API, 2019-01-01 is the start date (only repositories created on or after this date are searched), and 2024-05-01 is the end date (only repositories created on or before this date are searched). pats.txt is a file that contains your GitHub PATs, one per line, and should look like this:
ghp_pAAAAAA......VCL7wgw
ghp_oBBBBB......XE1ySTKq
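Several tokens are presumably there to spread GitHub's search rate limit across them. The snippet below is a hypothetical sketch of one way such rotation could work (ghminer's real logic may differ): it simply cycles through the tokens read from pats.txt.

```typescript
import { readFileSync } from 'node:fs';

// Hypothetical sketch: read the PATs from pats.txt and hand them out
// round-robin, spreading consecutive search requests across the tokens.
const tokens = readFileSync('pats.txt', 'utf8')
  .split('\n')
  .map((t) => t.trim())
  .filter((t) => t.length > 0);

let next = 0;
function nextToken(): string {
  const token = tokens[next];
  next = (next + 1) % tokens.length;
  return token;
}
```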
Depending on how many repositories the query captures, the collection process can take a while. When it's done, you should have two files: result.csv and result.json.
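For downstream analysis you can load either file. The sketch below reads result.json with Node, assuming it is a JSON array of repository records; the exact field names depend on what ghminer collects, so the name field here is only an assumption:

```typescript
import { readFileSync } from 'node:fs';

// Minimal sketch: load the collected dataset for further filtering.
// The record shape is an assumption; adjust it to the fields ghminer emits.
type RepoRecord = { name?: string; [key: string]: unknown };

const repos: RepoRecord[] = JSON.parse(readFileSync('result.json', 'utf8'));
console.log(`Collected ${repos.length} repositories`);
```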
That’s it, now you can use it in your research!
PS. Here is an example of 14.4k collected JavaScript repositories (it took me ~57 minutes to collect them) that were created during 2023-01-01..2024-05-01.