Scraping Top Repositories for Topics on GitHub
- Introduction about GitHub and the problem statement
GitHub is an increasingly popular programming resource used for code sharing. It's a social networking site for programmers that many companies and organizations use to facilitate project management and collaboration. According to statistics collected in August 2021, it was the most prominent source code host, with over 60 million new repositories created in 2020 and boasting over 67 million total developers.
All the projects on GitHub are stored as repositories. Repositories can be upvoted with stars, and the number of stars a repository has gives us a gauge of how popular it is. We can also filter the repositories on GitHub by the topic they belong to; the list of topics is available at https://github.com/topics.
Thus, we'll scrape GitHub for the top repositories on each topic and save them to CSV files for future use. To do this, we'll use the following tools:
- Python as the programming language
- Requests library for downloading the webpage contents
- BeautifulSoup library for finding and extracting the relevant information from the downloaded webpage
- Pandas library for saving the extracted information to a CSV file
Here are the steps we'll follow:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
- Setup
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
Set up URLs and the user-agent.
topics_url = "https://github.com/topics"
base_url = 'https://github.com'
header = {"User-Agent": "Mozilla/5.0"}
Create variables to store scraped information.
topic_titles = []
topic_desc = []
topic_URL = []
def get_soup(url):
    '''
    Download the webpage for the URL supplied as argument and return the
    BeautifulSoup object for the webpage, which can be used to grab the
    required information from the page.
    '''
    response = requests.get(url, headers=header)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
# Example
soup = get_soup(topics_url)
type(soup)
To get the topic titles, we can pick the p tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary".
# finding all topic titles
def get_topic_titles(soup):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = soup.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
# Example
titles = get_topic_titles(soup)
len(titles)
titles[:5]
This is the list of topics on page 1. For now, we'll scrape information only for the topics on this page; later, we can scrape other pages as well by changing the page number in the URL. Next, we'll find the topic descriptions in a similar way.
# finding all topic descriptions
def get_topic_descs(soup):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = soup.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
# Example
topics_descs = get_topic_descs(soup)
len(topics_descs)
topics_descs[:5]
Similarly, we'll find the topic URLs.
def get_topic_urls(soup):
    topic_link_tags = soup.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
# Example
topic_urls = get_topic_urls(soup)
len(topic_urls)
topic_urls[:5]
def scrape_topics():
    topics_url = 'https://github.com/topics'
    soup = get_soup(topics_url)
    topics_dict = {
        'Title': get_topic_titles(soup),
        'Description': get_topic_descs(soup),
        'URL': get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)
topics_df = scrape_topics()
topics_df.head()
Now that we have the topics with their titles, descriptions and URLs, we can visit each topic's URL to grab information about its top 30 repositories and then save the scraped information for each topic as a separate CSV file.
From each topic page, we'll grab information about the top 30 repositories, ranked by popularity as measured by the number of stars. The repositories are already sorted by popularity by default, so we can grab the first 30 from each topic page. We'll begin by writing a function to download a topic page and create its BeautifulSoup object.
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url, headers=header)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_soup = BeautifulSoup(response.text, 'html.parser')
    return topic_soup
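Since get_topic_page does the same work as the get_soup helper defined earlier, we could, if we prefer, simply delegate to it; a minimal sketch:
def get_topic_page(topic_url):
    # Reuse the generic get_soup helper to download and parse the topic page
    return get_soup(topic_url)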
page = get_topic_page('https://github.com/topics/3d')
Get all the required information about a repository
All the information that we need about a repository is given under a div tag with the class "d-flex flex-justify-between my-3". So we'll write a function that takes the content of each repository from these tags as its argument, then grabs and returns the required information from that content.
def get_repo_info(repo):
    # returns all the required info about a repository
    info = repo.find('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'}).find_all('a')
    username = info[0].text.strip()
    repo_name = info[1].text.strip()
    # the second anchor tag links to the repository itself
    repo_url = base_url + info[1]['href'].strip()
    stars = repo.find('span', {'id': 'repo-stars-counter-star'}).text.strip()
    return username, repo_name, stars, repo_url
# Example
repo_contents = page.find_all('div', {'class': 'd-flex flex-justify-between my-3'})
get_repo_info(repo_contents[0])
Here we can see that the function returns the information about the first repository from the topic page. The top repository in this case is 'three.js' with 80.6k stars.
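Note that the star count comes back as a display string such as '80.6k' rather than a number. If we wanted numeric star counts in our CSV files, we could add a small helper like the hypothetical parse_star_count below and apply it to the scraped value; this is just a sketch, not part of the scraper above.
def parse_star_count(stars_str):
    # Convert GitHub's abbreviated star counts (e.g. '80.6k') to integers.
    stars_str = stars_str.strip().lower()
    if stars_str.endswith('k'):
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str.replace(',', ''))

# Example: parse_star_count('80.6k') returns 80600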
def get_topic_repos(topic_soup):
    div_selection_class = 'd-flex flex-justify-between my-3'
    repo_tags = topic_soup.find_all('div', {'class': div_selection_class})
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    # Get repo info
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_info(repo_tags[i])
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    return pd.DataFrame(topic_repos_dict)
# Example
get_topic_repos(page)
As we can see, the function has returned a pandas DataFrame of the top 30 repos for the topic '3d'. Now, we'll write a function to save this DataFrame as a CSV file, skipping the topic if we have already created a file for it.
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
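As a quick usage example (the topic URL and file name here are just illustrative), we could scrape a single topic into a 'data' folder like this:
# Example usage: scrape one topic into the 'data' folder
os.makedirs('data', exist_ok=True)
scrape_topic('https://github.com/topics/3d', 'data/3D.csv')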
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['Title']}")
        scrape_topic(row['URL'], f"data/{row['Title']}.csv")
Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.
scrape_topics_repos()
As we can see, we have successfully scraped the top 30 repositories for each of the 30 topics and saved the information for each topic as a separate CSV file. For each repository, we have scraped its name, owner's username, star count and URL.
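To double-check the output, we could load one of the saved files back with pandas (the file name here is illustrative):
# Quick check: load one of the saved CSV files back into a DataFrame
pd.read_csv('data/3D.csv').head()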
We have scraped repositories for only 30 topics today, which is the number of topics available on page 1 of https://github.com/topics. But it is easy to scrape more topics: we just need to change the page number in the URL https://github.com/topics?page={i}, where 'i' is the page number. This way, we can scrape the top repos for every topic on GitHub.
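As a rough sketch of that idea (the scrape_topics_page helper and the range of pages below are just illustrative), we could reuse our existing functions like this:
# A minimal sketch: scrape the topics listed on the first few pages
def scrape_topics_page(page_num):
    soup = get_soup('https://github.com/topics?page={}'.format(page_num))
    return pd.DataFrame({
        'Title': get_topic_titles(soup),
        'Description': get_topic_descs(soup),
        'URL': get_topic_urls(soup)
    })

# Combine, for example, the first three pages of topics into one DataFrame
all_topics_df = pd.concat([scrape_topics_page(i) for i in range(1, 4)], ignore_index=True)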