Scraping Top Repositories for Topics on GitHub
- Introduction about GitHub and the problem statement
GitHub is an increasingly popular programming resource used for code sharing. It's a social networking site for programmers that many companies and organizations use to facilitate project management and collaboration. According to statistics collected in August 2021, it was the most prominent source code host, with over 60 million new repositories created in 2020 and boasting over 67 million total developers.
All the projects on GitHub are stored as repositories. Repositories can be upvoted with stars, and the number of stars a repository has gives us a gauge of how popular it is. We can also filter the repositories on GitHub by the topic they belong to; the list of topics is available at https://github.com/topics.
Thus, we'll scrape GitHub for the top repositories on each topic and save them to CSV files for future use. To do this, we'll use the following tools:
- Python as the programming language
- Requests library for downloading the webpage contents
- BeautifulSoup library for finding and extracting the relevant information from the downloaded webpage
- Pandas library for saving the extracted information to a CSV file
Here are the steps we'll follow:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
- Setup
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
Set up URLs and the user-agent.
topics_url = "https://github.com/topics"
base_url = 'https://github.com'
header = {"User-Agent": "Mozilla/5.0"}
Create variables to store scraped information.
topic_titles = []
topic_desc = []
topic_URL = []
def get_soup(url):
    '''
    Download the webpage for the URL supplied as argument and return the
    BeautifulSoup object for the webpage, which can be used to grab the
    required information from the page.
    '''
    response = requests.get(url, headers=header)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
# Example
soup = get_soup(topics_url)
type(soup)
To get the topic titles, we can pick the p tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary".
# finding all topic titles
def get_topic_titles(soup):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = soup.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
# Example
titles = get_topic_titles(soup)
len(titles)
titles[:5]
This is the list of topics on page 1. For now, we'll scrape information only for the topics on this page; later, we can scrape other pages as well by changing the page number in the URL. Next, we'll find the topic descriptions in a similar way.
# finding all topic descriptions
def get_topic_descs(soup):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = soup.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
# Example
topics_descs = get_topic_descs(soup)
len(topics_descs)
topics_descs[:5]
Similarly, we'll find the topic URLs.
def get_topic_urls(soup):
    topic_link_tags = soup.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
# Example
topic_urls = get_topic_urls(soup)
len(topic_urls)
topic_urls[:5]
def scrape_topics():
    topics_url = 'https://github.com/topics'
    soup = get_soup(topics_url)
    topics_dict = {
        'Title': get_topic_titles(soup),
        'Description': get_topic_descs(soup),
        'URL': get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)
topics_df = scrape_topics()
topics_df.head()
Now that we have the topics with their titles, descriptions and URLs, we can visit each topic's URL to grab information about its top 30 repositories and then save the scraped information for each topic as a separate CSV file.
From each topic page, we'll grab information about the top 30 repositories, ranked by popularity as measured by the number of stars. The repositories are already sorted by popularity by default, so we can grab the first 30 from each topic page. We'll begin by writing a function to download a topic page and create its BeautifulSoup object.
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url, headers=header)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_soup = BeautifulSoup(response.text, 'html.parser')
    return topic_soup
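Since get_topic_page does the same work as the get_soup helper defined earlier, we could, if we prefer, simply delegate to it; a minimal sketch:
def get_topic_page(topic_url):
    # Reuse the generic get_soup helper to download and parse the topic page
    return get_soup(topic_url)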
page = get_topic_page('https://github.com/topics/3d')
Get all the required information about a repository
All the information that we need about a repository is given under a div tag with the class "d-flex flex-justify-between my-3". So we'll write a function that takes the content of each repository from these tags as its argument, then grabs and returns the required information from that content.
def get_repo_info(repo):
    # returns all the required info about a repository
    info = repo.find('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'}).find_all('a')
    username = info[0].text.strip()
    repo_name = info[1].text.strip()
    # the second anchor tag links to the repository itself
    repo_url = base_url + info[1]['href'].strip()
    stars = repo.find('span', {'id': 'repo-stars-counter-star'}).text.strip()
    return username, repo_name, stars, repo_url
# Example
repo_contents = page.find_all('div', {'class': 'd-flex flex-justify-between my-3'})
get_repo_info(repo_contents[0])
Here we can see that the function returns the information about the first repository from the topic page. The top repository in this case is 'three.js' with 80.6k stars.
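Note that the star count comes back as a display string such as '80.6k' rather than a number. If we wanted numeric star counts in our CSV files, we could add a small helper like the hypothetical parse_star_count below and apply it to the scraped value; this is just a sketch, not part of the scraper above.
def parse_star_count(stars_str):
    # Convert GitHub's abbreviated star counts (e.g. '80.6k') to integers.
    stars_str = stars_str.strip().lower()
    if stars_str.endswith('k'):
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str.replace(',', ''))

# Example: parse_star_count('80.6k') returns 80600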
def get_topic_repos(topic_soup):
    div_selection_class = 'd-flex flex-justify-between my-3'
    repo_tags = topic_soup.find_all('div', {'class': div_selection_class})
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    # Get repo info
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_info(repo_tags[i])
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    return pd.DataFrame(topic_repos_dict)
# Example
get_topic_repos(page)
As we can see, the function has returned a pandas DataFrame of the top 30 repos for the topic '3d'. Now, we'll write a function to save this DataFrame as a CSV file, skipping the topic if we have already created a file for it.
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
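As a quick usage example (the topic URL and file name here are just illustrative), we could scrape a single topic into a 'data' folder like this:
# Example usage: scrape one topic into the 'data' folder
os.makedirs('data', exist_ok=True)
scrape_topic('https://github.com/topics/3d', 'data/3D.csv')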
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['Title']}")
        scrape_topic(row['URL'], f"data/{row['Title']}.csv")
Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.
scrape_topics_repos()
As we can see, we have successfully scraped the top 30 repositories for each of the 30 topics and saved the information for each topic as a separate CSV file. For each repository, we have scraped its name, owner's username, star count and URL.
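To double-check the output, we could load one of the saved files back with pandas (the file name here is illustrative):
# Quick check: load one of the saved CSV files back into a DataFrame
pd.read_csv('data/3D.csv').head()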
We have scraped repositories for only 30 topics today, which is the number of topics available on page 1 of https://github.com/topics. But it is easy to scrape more topics: we just need to change the page number in the URL https://github.com/topics?page={i}, where 'i' is the page number. This way, we can scrape the top repos for every topic on GitHub.
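As a rough sketch of that idea (the scrape_topics_page helper and the range of pages below are just illustrative), we could reuse our existing functions like this:
# A minimal sketch: scrape the topics listed on the first few pages
def scrape_topics_page(page_num):
    soup = get_soup('https://github.com/topics?page={}'.format(page_num))
    return pd.DataFrame({
        'Title': get_topic_titles(soup),
        'Description': get_topic_descs(soup),
        'URL': get_topic_urls(soup)
    })

# Combine, for example, the first three pages of topics into one DataFrame
all_topics_df = pd.concat([scrape_topics_page(i) for i in range(1, 4)], ignore_index=True)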