GitHub is an increasingly popular resource for sharing code. It's a social networking site for programmers that many companies and organizations use to facilitate project management and collaboration. As of August 2021, it was the most prominent source-code host, with over 60 million new repositories created in 2020 and more than 67 million total developers.

All projects on GitHub are stored as repositories. Users can upvote a repository by giving it a star, so the number of stars a repository has gives us a gauge of how popular it is. We can also filter the repositories on GitHub by the topics they are tagged with; the list of topics is available at https://github.com/topics.

We'll therefore scrape GitHub for the top repositories in each topic and save the results to CSV files for future use. To do this, we'll use the following tools:

  • Python as the programming language
  • Requests library for downloading the webpage contents
  • BeautifulSoup library for finding and accessing the relevant information from the downloaded webpage.
  • Pandas library for saving the accessed information to a csv file.
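
If any of these libraries are missing from your environment, they can typically be installed from PyPI (a hedged one-liner; the package names are the standard distributions of the libraries listed above):

pip install requests beautifulsoup4 pandas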

Here are the steps we'll follow:

  • We're going to scrape https://github.com/topics
  • We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
  • For each topic, we'll get the top 30 repositories in the topic from the topic page
  • For each repository, we'll grab the repo name, username, stars and repo URL
  • For each topic we'll create a CSV file in the following format:
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
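
Once a topic's CSV has been written, it can be loaded back with pandas whenever we need it (a minimal sketch; the path data/3D.csv simply follows the naming scheme used later in this walkthrough):

import pandas as pd

three_d_repos = pd.read_csv('data/3D.csv')
print(three_d_repos.head())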

Setup

Import the required libraries

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

Set up URLs and the user-agent.

topics_url = "https://github.com/topics"
base_url = 'https://github.com'

header = {"User-Agent": "Mozilla/5.0"}

Create variables to store the scraped information. (These lists are optional here; the helper functions below build and return their own lists.)

topic_titles = []
topic_desc = []
topic_URL = []

Scrape the list of topics.

Download the topics webpage and create a BeautifulSoup object

Let's write a function to download the page.

def get_soup(url):
    '''
    Download the webpage at the given URL and return a BeautifulSoup
    object that can be used to grab the required information
    from the page.
    '''
    response = requests.get(url, headers=header)

    # Raise an error if the page could not be downloaded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))

    soup = BeautifulSoup(response.text, 'html.parser')

    return soup
# Example
soup = get_soup(topics_url)
type(soup)
bs4.BeautifulSoup
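
GitHub may occasionally rate-limit or fail a request, so it can help to retry with a short pause between attempts. Below is a minimal sketch of such a wrapper (the function name get_soup_with_retries and the retry/delay values are assumptions, not part of the original walkthrough):

import time

def get_soup_with_retries(url, retries=3, delay=5):
    # Try a few times before giving up, pausing between attempts
    # to stay polite towards the server
    for attempt in range(retries):
        try:
            return get_soup(url)
        except Exception as error:
            if attempt == retries - 1:
                raise
            print(f'Attempt {attempt + 1} failed ({error}), retrying in {delay}s...')
            time.sleep(delay)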

Create a transform function

Let's create some helper functions to parse information from the page.

Get topic titles

To get the topic titles, we can pick the p tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary", as shown below.

![Topic title p tags in the page source](https://i.imgur.com/OnzIdyP.png)
# finding all topic titles
def get_topic_titles(soup):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = soup.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
# Example
titles = get_topic_titles(soup)

len(titles)
30
titles[:5]
['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

These are the topics on page 1. Today we'll scrape only the topics on this page; in the future, we can scrape other pages as well by changing the page number in the URL. Next, we'll find the topic descriptions in the same way.

Get topic descriptions

# finding all topics descriptions
def get_topic_descs(soup):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = soup.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
# Example
topics_descs = get_topic_descs(soup)

len(topics_descs)
30
topics_descs[:5]
['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

Similarly, we'll find the topic URLs.

Get topic URLs

def get_topic_urls(soup):
    topic_link_tags = soup.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
# Example
topic_urls = get_topic_urls(soup)
len(topic_urls)
30
topic_urls[:5]
['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

Save all information

We'll put together all this information into a single function and then save the scraped information into a pandas DataFrame.

def scrape_topics():
    topics_url = 'https://github.com/topics'
    soup = get_soup(topics_url)
    topics_dict = {
        'Title': get_topic_titles(soup),
        'Description': get_topic_descs(soup),
        'URL': get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)
topics_df = scrape_topics()

topics_df.head()
Title Description URL
0 3D 3D modeling is the process of virtually develo... https://github.com/topics/3d
1 Ajax Ajax is a technique for creating interactive w... https://github.com/topics/ajax
2 Algorithm Algorithms are self-contained sequences that c... https://github.com/topics/algorithm
3 Amp Amp is a non-blocking concurrency library for ... https://github.com/topics/amphp
4 Android Android is an operating system built by Google... https://github.com/topics/android
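
The topics list itself can also be kept on disk for later reference. This isn't part of the walkthrough below, but a one-line save would look like this (the file name topics.csv is an assumption):

# Optional: persist the list of topics scraped above
topics_df.to_csv('topics.csv', index=False)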

Scraping the top 30 repos for each topic

Now that we have each topic's title, description and URL, we can visit each topic page, grab information about its top 30 repositories, and save the scraped information for each topic as a separate CSV file.

Each topic page lists the repositories tagged with that topic.

From this page, we'll grab information about the top 30 repositories, with popularity measured by the number of stars. The repositories are sorted by popularity by default, so we can take the first 30 straight from the first page of each topic. We'll begin by writing a function to download a topic page and create its BeautifulSoup object.

Download each topic page and create a BeautifulSoup object

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url, headers=header)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_soup = BeautifulSoup(response.text, 'html.parser')
    return topic_soup
page = get_topic_page('https://github.com/topics/3d')

Transform the topic's BeautifulSoup object

Get all the required information about a repository

All the information we need about a repository sits inside a div tag with the class d-flex flex-justify-between my-3. We'll write a function that takes the content of one such repository tag as an argument, then grabs and returns the required information from it.

def get_repo_info(repo):
    # Returns all the required info about a repository
    info = repo.find('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'}).find_all('a')
    username = info[0].text.strip()
    repo_name = info[1].text.strip()
    repo_url = base_url + info[1]['href'].strip()
    stars = repo.find('span', {'id': 'repo-stars-counter-star'}).text.strip()
    return username, repo_name, stars, repo_url
# Example
repo_contents = page.find_all('div', {'class': 'd-flex flex-justify-between my-3'})

get_repo_info(repo_contents[0])
('mrdoob', 'three.js', '80.6k', 'https://github.com/mrdoob/three.js')

Here we can see that the function returns the information about the first repository from the topic page. The top repository in this case is 'three.js' with 80.6k stars.
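
Note that the star counts come back as abbreviated strings such as '80.6k', while the target CSV format shown at the start uses plain integers. If numeric values are needed, a small helper along these lines could convert them (a minimal sketch; the function name parse_star_count is an assumption and isn't used in the code below):

def parse_star_count(stars_str):
    # Convert GitHub's abbreviated star counts, e.g. '80.6k', to integers
    stars_str = stars_str.strip().lower()
    if stars_str.endswith('k'):
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str.replace(',', ''))
# Example
parse_star_count('80.6k')
80600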

Grab the information for the top 30 repos under a topic

Now, we'll write a function to grab information about the repositories within a topic. It will take in a topic's soup object and return a pandas DataFrame of the top 30 repos in that topic.

def get_topic_repos(topic_soup):

    div_selection_class = 'd-flex flex-justify-between my-3'
    repo_tags = topic_soup.find_all('div', {'class': div_selection_class})

    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for repo_tag in repo_tags:
        username, repo_name, stars, repo_url = get_repo_info(repo_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
        
    return pd.DataFrame(topic_repos_dict)
# Example
get_topic_repos(page)
username repo_name stars repo_url
0 mrdoob three.js 80.6k https://github.com/mrdoob/three.js
1 libgdx libgdx 19.8k https://github.com/libgdx/libgdx
2 pmndrs react-three-fiber 17.4k https://github.com/pmndrs/react-three-fiber
3 BabylonJS Babylon.js 16.2k https://github.com/BabylonJS/Babylon.js
4 aframevr aframe 14k https://github.com/aframevr/aframe
5 ssloy tinyrenderer 13.3k https://github.com/ssloy/tinyrenderer
6 lettier 3d-game-shaders-for-beginners 12.5k https://github.com/lettier/3d-game-shaders-for-beginners
7 FreeCAD FreeCAD 11k https://github.com/FreeCAD/FreeCAD
8 metafizzy zdog 9.1k https://github.com/metafizzy/zdog
9 CesiumGS cesium 8.5k https://github.com/CesiumGS/cesium
10 timzhang642 3D-Machine-Learning 7.8k https://github.com/timzhang642/3D-Machine-Learning
11 a1studmuffin SpaceshipGenerator 7.1k https://github.com/a1studmuffin/SpaceshipGenerator
12 isl-org Open3D 6.4k https://github.com/isl-org/Open3D
13 blender blender 5.2k https://github.com/blender/blender
14 domlysz BlenderGIS 5k https://github.com/domlysz/BlenderGIS
15 spritejs spritejs 4.8k https://github.com/spritejs/spritejs
16 openscad openscad 4.7k https://github.com/openscad/openscad
17 tensorspace-team tensorspace 4.6k https://github.com/tensorspace-team/tensorspace
18 jagenjo webglstudio.js 4.6k https://github.com/jagenjo/webglstudio.js
19 YadiraF PRNet 4.6k https://github.com/YadiraF/PRNet
20 AaronJackson vrn 4.4k https://github.com/AaronJackson/vrn
21 google model-viewer 4.1k https://github.com/google/model-viewer
22 ssloy tinyraytracer 4.1k https://github.com/ssloy/tinyraytracer
23 mosra magnum 3.9k https://github.com/mosra/magnum
24 FyroxEngine Fyrox 3.5k https://github.com/FyroxEngine/Fyrox
25 gfxfundamentals webgl-fundamentals 3.5k https://github.com/gfxfundamentals/webgl-fundamentals
26 tengbao vanta 3.3k https://github.com/tengbao/vanta
27 cleardusk 3DDFA 3.2k https://github.com/cleardusk/3DDFA
28 jasonlong isometric-contributions 3.1k https://github.com/jasonlong/isometric-contributions
29 cnr-isti-vclab meshlab 2.9k https://github.com/cnr-isti-vclab/meshlab

As we can see, the function returns a pandas DataFrame of the top 30 repos for the topic '3d'. Now we'll write a function that saves this DataFrame as a CSV file, skipping topics for which a file already exists.

Save topic file

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
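
For example, a single topic could be scraped into its own CSV like this (a hedged illustration; the data directory and file name follow the naming scheme used by the combined function below):

# Example
os.makedirs('data', exist_ok=True)
scrape_topic('https://github.com/topics/3d', 'data/3D.csv')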

Putting it all together

  • We have a function to get the list of topics
  • We have a function to create a CSV file for the scraped repos from a topic page
  • Let's create a function to put them together
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['Title']}")
        scrape_topic(row['URL'], f"data/{row['Title']}.csv")

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.

scrape_topics_repos()
Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top repositories for Code review
Scraping top repositories for Compiler
Scraping top repositories for Continuous integration
Scraping top repositories for COVID-19
Scraping top repositories for C++

Summary and Conclusion

As we can see, we have successfully scraped the top 30 repositories for each of the 30 topics and saved them as a separate CSV file per topic. For each repository, we scraped its name, owner username, star count and URL.

We have scraped repositories for only 30 topics today, which is the number of topics available on page 1 of https://github.com/topics. But it is easy to scrape more topics: we just need to change the page number in the URL https://github.com/topics?page={i}, where 'i' is the page number. This way, we can scrape the top repos for every topic on GitHub, as sketched below.
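
As a rough sketch of that extension (not part of the run above; the function name scrape_all_topic_pages and the default page count are assumptions), we could loop over the paginated topics URL and reuse the helpers defined earlier:

def scrape_all_topic_pages(num_pages=5):
    # Hypothetical extension: walk through several pages of https://github.com/topics
    os.makedirs('data', exist_ok=True)
    for i in range(1, num_pages + 1):
        soup = get_soup(f'https://github.com/topics?page={i}')
        page_df = pd.DataFrame({
            'Title': get_topic_titles(soup),
            'Description': get_topic_descs(soup),
            'URL': get_topic_urls(soup)
        })
        for index, row in page_df.iterrows():
            print(f"Scraping top repositories for {row['Title']}")
            scrape_topic(row['URL'], f"data/{row['Title']}.csv")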