Module 2: Extracting code – Build an AI documentation assistant

Welcome to Module 2 of the tutorial series: “Build an AI-powered documentation assistant with Flask & DeepSeek”. In this module, you’ll learn how to extract code from GitHub repositories, parse Python files, and prepare them for automated documentation generation.

Prerequisites

Before starting Module 2, ensure you’ve completed the following steps from Module 1:

Set up the development environment and installed dependencies.
Configured API keys (DeepSeek and GitHub) in the .env file.
Ran the Flask app and tested the /generate-docstring endpoint.
Reviewed the folder structure and key files.

Lesson 3: Fetching code from GitHub

Objective

In this lesson, you’ll use the GitHub API to retrieve repositories, extract Python files, and handle API rate limits and authentication.

Step 1: Install the `requests` Library

Install the requests library to interact with the GitHub API:

pip install requests

Step 2: Update `github_api.py`

Add logic to fetch repository contents, filter Python files, and handle rate limits.

Update: app/utils/github_api.py to contain the following code:

import os
import requests
import time
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")

def fetch_repo_contents(owner, repo):
    """
    Fetch the contents of a GitHub repository.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/contents"
    headers = {"Authorization": f"token {GITHUB_ACCESS_TOKEN}"}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to fetch repository contents: {response.status_code}")

def filter_python_files(contents):
    """
    Filter out Python files from the repository contents.
    """
    return [file for file in contents if file["name"].endswith(".py")]

def download_file_contents(download_url):
    """
    Download the raw content of a file from GitHub.
    """
    headers = {"Authorization": f"token {GITHUB_ACCESS_TOKEN}"}
    response = requests.get(download_url, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        raise Exception(f"Failed to download file: {response.status_code}")

def make_github_request(url):
    """
    Make a GitHub API request with rate limit handling.
    """
    headers = {"Authorization": f"token {GITHUB_ACCESS_TOKEN}"}
    while True:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 403 and "rate limit" in response.text:
            reset_time = int(response.headers["X-RateLimit-Reset"])
            sleep_time = max(reset_time - time.time(), 0) + 1  # Add 1 second buffer
            print(f"Rate limit exceeded. Sleeping for {sleep_time} seconds.")
            time.sleep(sleep_time)
        else:
            raise Exception(f"Failed to make request: {response.status_code}")

This script interacts with the GitHub API to fetch, filter, and download Python files from a given repository while handling rate limits.

Load Environment Variables
The script imports the modules it needs (os, requests, time, and load_dotenv from dotenv). It then calls load_dotenv() to load environment variables from a .env file and reads the GitHub access token with os.getenv("GITHUB_ACCESS_TOKEN"). If the variable is not set, os.getenv returns None and GitHub rejects the requests as unauthorized.
Fetch Repository Contents
The fetch_repo_contents(owner, repo) function builds the GitHub API URL for the repository’s contents endpoint, which lists the files and directories at the repository root; it does not descend into subdirectories. The request includes an authorization header with the access token. If GitHub returns a successful response (200), the function parses and returns the JSON response. Otherwise, it raises an exception with the status code.
Filter Python Files
The filter_python_files(contents) function iterates through that listing and keeps only the entries whose name ends in .py, returning them as a list.
Download File Contents
The download_file_contents(download_url) function makes an authenticated request to download a file’s raw content from GitHub. If the request succeeds (200), it returns the file’s text content. Otherwise, it raises an exception. Because the access token is sent to whatever URL is passed in, only call it with URLs that point to GitHub, such as the download_url values returned by the contents API.
Handle API Rate Limits
The make_github_request(url) function makes a request to the GitHub API while handling rate limits. If the request succeeds (200), it returns the JSON response. If the request fails due to rate limiting (403), it calculates the wait time using the X-RateLimit-Reset header and pauses execution before retrying. If the request fails for other reasons, it raises an exception.
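
The wait-time calculation is easier to check when it is pulled out of the retry loop. The sketch below is one possible way to do that, not part of github_api.py: the helper name seconds_until_reset is hypothetical, and falling back to the buffer alone when the header is missing is an assumption made for illustration.

```python
import time


def seconds_until_reset(headers, now=None, buffer=1.0):
    """Return how long to sleep before retrying, based on X-RateLimit-Reset."""
    now = time.time() if now is None else now
    reset = headers.get("X-RateLimit-Reset")
    if reset is None:
        # Assumption: with no reset header, wait only for the buffer
        return buffer
    # The header holds a Unix timestamp; never return a negative wait
    return max(int(reset) - now, 0) + buffer


# Reset is 60 seconds in the future, plus the 1 second buffer
print(seconds_until_reset({"X-RateLimit-Reset": "1700000060"}, now=1700000000))  # 61.0
# Reset time already passed: only the buffer remains
print(seconds_until_reset({"X-RateLimit-Reset": "1699999990"}, now=1700000000))  # 1.0
```

Passing now explicitly keeps the function deterministic, which makes it easy to test without waiting for a real rate limit.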

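Together, these functions authenticate each request with the token from .env, keep only the .py entries, and wait out GitHub’s rate limit when a request goes through make_github_request. Note that fetch_repo_contents and download_file_contents call requests.get directly, so they do not get this retry behavior unless you route them through make_github_request.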
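
To see the filtering step in isolation, you can run it on a hard-coded listing instead of a live API response. The sketch below is a slightly stricter variant of filter_python_files, offered as an assumption rather than the tutorial’s code: it also checks the "type" field so that a directory whose name happens to end in .py is skipped. The sample entries are made up.

```python
def filter_python_files(contents):
    """Keep entries that are regular files whose name ends in .py."""
    return [
        entry
        for entry in contents
        if entry.get("type") == "file" and entry["name"].endswith(".py")
    ]


# Made-up listing shaped like the entries returned by the contents endpoint
listing = [
    {"name": "manage.py", "type": "file"},
    {"name": "cart", "type": "dir"},
    {"name": "README.md", "type": "file"},
    {"name": "plugins.py", "type": "dir"},  # hypothetical directory with a .py-like name
]

print([entry["name"] for entry in filter_python_files(listing)])  # ['manage.py']
```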

Step 3: Update `routes.py`

Add a new route to fetch and display repository contents.

Update: app/routes.py to contain the following code:

from flask import Blueprint, jsonify, request
from .utils.docstring_generator import generate_docstring
from app.utils.github_api import fetch_repo_contents, filter_python_files, download_file_contents

main_bp = Blueprint('main', __name__)

@main_bp.route('/')
def home():
    return "Welcome to the AI-Powered Documentation Assistant!"

@main_bp.route('/generate-docstring', methods=['POST'])
def generate_docstring_route():
    code = request.json.get('code')
    if not code:
        return jsonify({"error": "No code provided"}), 400

    try:
        docstring = generate_docstring(code)
        return jsonify({"docstring": docstring})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    
@main_bp.route("/fetch-repo", methods=["POST"])
def fetch_repo():
    """
    Fetch and display Python files from a GitHub repository.
    """
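    data = request.json or {}
    owner = data.get("owner")
    repo = data.get("repo")
    if not owner or not repo:
        # Reject incomplete requests before calling the GitHub API
        return jsonify({"error": "Both owner and repo are required"}), 400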

    try:
        contents = fetch_repo_contents(owner, repo)
        python_files = filter_python_files(contents)
        return jsonify({"python_files": python_files})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

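The fetch_repo() function looks up the Python files at the root of a specified GitHub repository and returns GitHub’s metadata for each one (name, path, download_url, and so on) as a JSON response. It does not return the file contents themselves.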

Receive Request Data
The function extracts the repository owner and name from the incoming JSON request using request.json.get("owner") and request.json.get("repo").
Fetch Repository Contents
It calls fetch_repo_contents(owner, repo) to retrieve the contents of the specified GitHub repository.
Filter Python Files
It processes the fetched repository contents using filter_python_files(contents), which selects only files ending in .py.
Return Response
If successful, the function returns a JSON response containing the list of Python files. If an error occurs, it catches the exception and returns an error message with a 500 status code.

This function enables the Flask app to interact with the GitHub API and extract Python files from repositories dynamically.

Lesson 4: Parsing Code for Documentation

Objective

In this lesson, you’ll use Python’s Abstract Syntax Tree (AST) to analyze code, extract metadata, and handle common parsing errors.

Step 1: Update `code_parser.py`

Add logic to parse Python code and extract functions, classes, and metadata.

Update: app/utils/code_parser.py to contain the following code:

import ast

def parse_code(code):
    tree = ast.parse(code)
    functions = [node for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]
    classes = [node for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]
    return {"functions": functions, "classes": classes}

def extract_functions_and_classes(code):
    """
    Extract functions and classes from Python code using AST.
    """
    tree = ast.parse(code)
    functions = [node for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]
    classes = [node for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]
    return functions, classes

def extract_function_signature(func_node):
    """
    Extract function signature (name, args, returns).
    """
    args = [arg.arg for arg in func_node.args.args]
    returns = ast.unparse(func_node.returns) if func_node.returns else None
    return {
        "name": func_node.name,
        "args": args,
        "returns": returns
    }

def extract_class_metadata(class_node):
    """
    Extract class metadata (name, methods, docstring).
    """
    methods = [node.name for node in ast.walk(class_node) if isinstance(node, ast.FunctionDef)]
    docstring = ast.get_docstring(class_node)
    return {
        "name": class_node.name,
        "methods": methods,
        "docstring": docstring
    }

This script analyzes Python code using the Abstract Syntax Tree (AST) module to extract functions, classes, and their metadata.

Parse Code and Identify Functions and Classes
- The parse_code(code) function parses the given Python code into an AST.
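- It walks through the AST tree to collect function definitions (ast.FunctionDef) and class definitions (ast.ClassDef). Because ast.walk visits every node, methods and nested functions are collected as well; async functions (ast.AsyncFunctionDef) are not.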
- It returns a dictionary containing lists of functions and classes.
Extract Functions and Classes
- The extract_functions_and_classes(code) function also parses the given Python code into an AST.
- It extracts function and class definitions separately and returns them as two lists.
Extract Function Signature
- The extract_function_signature(func_node) function retrieves the function name, its arguments, and return type.
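- It extracts argument names from func_node.args.args, which covers regular positional-or-keyword parameters but not *args, **kwargs, or keyword-only parameters.
- If the function has a return type annotation, it converts it back to source text using ast.unparse(func_node.returns), which requires Python 3.9 or later.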
- It returns a dictionary containing the function name, argument list, and return type.
Extract Class Metadata
- The extract_class_metadata(class_node) function retrieves the class name, its methods, and its docstring.
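- It collects method names by walking through the class node and identifying FunctionDef nodes. The walk is recursive, so functions nested inside methods, and methods of nested classes, are listed too.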
- It extracts the class docstring using ast.get_docstring(class_node).
- It returns a dictionary containing the class name, a list of methods, and the docstring.
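
To see what these helpers return, you can run the same AST calls on a small in-memory snippet. The sketch below repeats the logic inline so that it runs on its own; the sample source is made up for illustration, and ast.unparse assumes Python 3.9 or later.

```python
import ast

# Made-up sample source: one class with a method, one module-level function
SAMPLE = '''
class Cart:
    """A shopping cart."""

    def add(self, item, quantity=1) -> None:
        pass

def total(cart) -> float:
    return 0.0
'''

tree = ast.parse(SAMPLE)
functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
classes = [n for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]

# Same fields as extract_function_signature
signatures = [
    {
        "name": f.name,
        "args": [a.arg for a in f.args.args],
        "returns": ast.unparse(f.returns) if f.returns else None,
    }
    for f in functions
]

# Same fields as extract_class_metadata
class_info = [
    {
        "name": c.name,
        "methods": [n.name for n in ast.walk(c) if isinstance(n, ast.FunctionDef)],
        "docstring": ast.get_docstring(c),
    }
    for c in classes
]

# The method add shows up among the functions as well as under its class
print(sorted(s["name"] for s in signatures))  # ['add', 'total']
print(class_info)
```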

This script enables Python code analysis by extracting structural information about functions and classes.
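
If you need stricter results than a recursive ast.walk gives, one option is to look only at a class’s direct children and to read every kind of parameter. The sketch below is a possible variant, not the code in code_parser.py: the helper names direct_methods and all_parameter_names are hypothetical, and the sample class is made up.

```python
import ast


def direct_methods(class_node):
    """Names of functions defined directly in the class body (sync or async)."""
    return [
        node.name
        for node in class_node.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def all_parameter_names(func_node):
    """Parameter names including *args, keyword-only names, and **kwargs."""
    a = func_node.args
    names = [p.arg for p in a.posonlyargs + a.args]
    if a.vararg:
        names.append("*" + a.vararg.arg)
    names += [p.arg for p in a.kwonlyargs]
    if a.kwarg:
        names.append("**" + a.kwarg.arg)
    return names


# Made-up sample: a nested helper and an async method
SOURCE = '''
class Worker:
    def run(self, *jobs, retries=3, **options):
        def helper():
            pass
    async def stop(self):
        pass
'''

cls = ast.parse(SOURCE).body[0]
print(direct_methods(cls))  # ['run', 'stop']
print(all_parameter_names(cls.body[0]))  # ['self', '*jobs', 'retries', '**options']
```

The nested helper function is left out because it is not a direct child of the class body, while the async method stop is included.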

Step 2: Update `routes.py`

Add a new route to parse and display metadata from a Python file.

Update: app/routes.py to contain the following code:

from flask import Blueprint, jsonify, request
from .utils.docstring_generator import generate_docstring
from app.utils.github_api import fetch_repo_contents, filter_python_files, download_file_contents
from app.utils.code_parser import extract_functions_and_classes, extract_function_signature, extract_class_metadata


main_bp = Blueprint('main', __name__)

@main_bp.route('/')
def home():
    return "Welcome to the AI-Powered Documentation Assistant!"

@main_bp.route('/generate-docstring', methods=['POST'])
def generate_docstring_route():
    code = request.json.get('code')
    if not code:
        return jsonify({"error": "No code provided"}), 400

    try:
        docstring = generate_docstring(code)
        return jsonify({"docstring": docstring})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    
@main_bp.route("/fetch-repo", methods=["POST"])
def fetch_repo():
    """
    Fetch and display Python files from a GitHub repository.
    """
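    data = request.json or {}
    owner = data.get("owner")
    repo = data.get("repo")
    if not owner or not repo:
        # Reject incomplete requests before calling the GitHub API
        return jsonify({"error": "Both owner and repo are required"}), 400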

    try:
        contents = fetch_repo_contents(owner, repo)
        python_files = filter_python_files(contents)
        return jsonify({"python_files": python_files})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@main_bp.route("/parse-file", methods=["POST"])
def parse_file():
    """
    Parse a Python file and extract metadata.
    """
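    data = request.json or {}
    download_url = data.get("download_url")
    if not download_url:
        # Reject requests that do not say which file to parse
        return jsonify({"error": "No download_url provided"}), 400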

    try:
        code = download_file_contents(download_url)
        functions, classes = extract_functions_and_classes(code)
        function_metadata = [extract_function_signature(func) for func in functions]
        class_metadata = [extract_class_metadata(cls) for cls in classes]
        return jsonify({
            "functions": function_metadata,
            "classes": class_metadata
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

The parse_file() function retrieves a Python file from a given URL, analyzes its contents, and extracts metadata about its functions and classes.

Receive Request Data
The function reads the JSON request body and extracts the download_url parameter.
Download the File
It calls download_file_contents(download_url) to fetch the raw content of the Python file.
Extract Functions and Classes
It processes the file content using extract_functions_and_classes(code), which returns lists of function and class definitions.
Generate Function Metadata
It iterates over the extracted functions and calls extract_function_signature(func) to retrieve each function’s name, arguments, and return type.
Generate Class Metadata
It iterates over the extracted classes and calls extract_class_metadata(cls) to retrieve each class’s name, methods, and docstring.
Return the Metadata
The function returns a JSON response containing the extracted function and class metadata. If an error occurs, it catches the exception and returns an error message with a 500 status code.

This function automates the process of analyzing Python files and extracting useful metadata for documentation or code analysis.
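
One common parsing error is a file that is not valid Python, in which case ast.parse raises SyntaxError. In parse_file() that exception ends up in the generic 500 response. The sketch below shows one way to turn it into a readable message instead; the helper name parse_or_report is hypothetical and the message format is only an illustration.

```python
import ast


def parse_or_report(code):
    """Return (tree, None) on success or (None, message) if the source is invalid."""
    try:
        return ast.parse(code), None
    except SyntaxError as exc:
        # lineno and msg are attributes of SyntaxError
        return None, f"Syntax error on line {exc.lineno}: {exc.msg}"


tree, error = parse_or_report("def ok():\n    return 1\n")
print(type(tree).__name__, error)  # Module None

tree, error = parse_or_report("def broken(:\n    pass\n")
print(tree, error)  # tree is None; the exact message depends on the Python version
```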

Step 3: Functional Testing

Test the `/fetch-repo` Endpoint

Start the Flask app:
```
python run.py
```

Call the /fetch-repo endpoint:
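
The request body for this call has two fields, owner and repo. As a sketch, the snippet below builds such a request with Python’s standard library. The owner and repo values are assumptions read off the URLs in the expected output, and the host and port are assumed to match the /parse-file example later in this step.

```python
import json
import urllib.request

# Assumed values, taken from the URLs shown in the expected output
payload = {
    "owner": "henrymbuguak",
    "repo": "Shopping-Cart-Using-Django-2.0-and-Python-3.6",
}

req = urllib.request.Request(
    "http://127.0.0.1:5000/fetch-repo",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method(), req.full_url)

# With the Flask app running, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```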

Expected output:

{
  "python_files": [
    {
      "_links": {
        "git": "https://api.github.com/repos/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/git/blobs/a4aada7faa6c7ff51f6d1ca34947525097b62c1d",
        "html": "https://github.com/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/blob/master/manage.py",
        "self": "https://api.github.com/repos/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/contents/manage.py?ref=master"
      },
      "download_url": "https://raw.githubusercontent.com/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/master/manage.py",
      "git_url": "https://api.github.com/repos/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/git/blobs/a4aada7faa6c7ff51f6d1ca34947525097b62c1d",
      "html_url": "https://github.com/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/blob/master/manage.py",
      "name": "manage.py",
      "path": "manage.py",
      "sha": "a4aada7faa6c7ff51f6d1ca34947525097b62c1d",
      "size": 538,
      "type": "file",
      "url": "https://api.github.com/repos/henrymbuguak/Shopping-Cart-Using-Django-2.0-and-Python-3.6/contents/manage.py?ref=master"
    }
  ]
}

This output represents the extracted Python files from a GitHub repository. It contains metadata about a single Python file, manage.py, found in the repository.

File Identified
The "python_files" key holds a list of detected Python files. In this case, the list contains one file: "manage.py".
File Metadata
- "name": The file’s name is "manage.py", which is typically a Django project management script.
- "path": The file is located at the repository’s root directory.
- "size": The file is 538 bytes in size.
- "sha": The SHA hash (a4aada7faa6c7ff51f6d1ca34947525097b62c1d) uniquely identifies this file’s content in Git.
URLs for Accessing the File
- "download_url": The direct URL to download the raw file content.
- "html_url": The GitHub web interface link to view the file.
- "git_url": The GitHub API link to retrieve the file’s Git object.
- "url": The API URL to fetch file details; the same value appears as "self" inside "_links".
- "_links": A dictionary containing various reference links, including git, html, and self.

This output confirms that the script successfully fetched Python files from the repository and returned their details.

Call the /parse-file endpoint:
```
curl -X POST http://127.0.0.1:5000/parse-file \
     -H "Content-Type: application/json" \
     -d '{"download_url": "https://raw.githubusercontent.com/muvatech/Shopping-Cart-Using-Django-2.0-and-Python-3.6/refs/heads/master/cart/views.py"}'
```
Expected output:
```
{
  "classes": [],
  "functions": [
    {
      "args": [
        "request",
        "product_id"
      ],
      "name": "cart_add",
      "returns": null
    },
    {
      "args": [
        "request",
        "product_id"
      ],
      "name": "cart_remove",
      "returns": null
    },
    {
      "args": [
        "request"
      ],
      "name": "cart_detail",
      "returns": null
    }
  ]
}
```
This output represents the extracted function metadata from a Python file, showing details about functions but no classes.
1. No Classes Found
  The "classes" key contains an empty list ([]), indicating that the analyzed Python file does not define any classes.
2. Extracted Functions
  The "functions" key contains a list of dictionaries, each describing a function in the file.
3. Function Details
  Each function entry includes:
  - "name": The function’s name (e.g., "cart_add", "cart_remove", "cart_detail").
  - "args": A list of function parameters (e.g., "request", "product_id").
  - "returns": The function’s return type, which is null (indicating no explicit return annotation).
This output shows that the parsed file contains three functions related to managing a shopping cart, likely in a web application.
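
Metadata in this shape is easy to turn back into readable signatures, for example when preparing input for docstring generation. The sketch below is only an illustration: format_signature is a hypothetical helper, and the data is the expected output above with JSON null written as Python None.

```python
# The /parse-file response from above, as a Python dictionary
metadata = {
    "classes": [],
    "functions": [
        {"args": ["request", "product_id"], "name": "cart_add", "returns": None},
        {"args": ["request", "product_id"], "name": "cart_remove", "returns": None},
        {"args": ["request"], "name": "cart_detail", "returns": None},
    ],
}


def format_signature(func):
    """Rebuild a one-line signature from a function metadata entry."""
    returns = f" -> {func['returns']}" if func["returns"] else ""
    return f"def {func['name']}({', '.join(func['args'])}){returns}"


for func in metadata["functions"]:
    print(format_signature(func))
# def cart_add(request, product_id)
# def cart_remove(request, product_id)
# def cart_detail(request)
```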

What You’ve Achieved

You fetched and parsed a real-world repository.
You built API endpoints to extract code and metadata.
You tested the functionality to ensure it works as expected.

Full code for module 2

You can find the complete code for this tutorial in the GitHub repository.

Next Steps

Proceed to Module 3: Learn to generate and improve docstrings using DeepSeek.
Experiment: Fetch and parse more repositories to prepare for docstring generation.
Join the Community: Share your progress and get feedback from other learners!
