Print Your Entire Website into PDFs!
How to crawl your website and print out a PDF of each article on your website, with a little Python code
Author: Jeremy Morgan
Published: October 12, 2024
I wrote a book! Check out A Quick Guide to Coding with AI.
Become a super programmer!
Learn how to use Generative AI coding tools as a force multiplier for your career.
Hey there, Python enthusiasts! Today, we’re going to dive into building a Python app that does something pretty cool: extract URLs from an XML sitemap and convert them into PDFs. Sounds useful, right? This is perfect for things like archiving web pages or generating offline documentation.
We’ll be using Python’s standard library modules xml.etree.ElementTree, os, and urllib.parse, along with the excellent WeasyPrint library for the HTML-to-PDF conversion. By the end of this, you’ll have a functional app that can take a sitemap, grab the URLs, and output their contents as PDFs. Let’s jump in!
What You’ll Need (Prerequisites)
Before we roll up our sleeves, make sure you have these tools:
- Python installed. Grab it from python.org.
- pip, Python’s package installer (usually bundled with Python).
- WeasyPrint, which is our tool for converting HTML to PDFs. Install it using:
pip install WeasyPrint
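If you want to confirm the install worked before moving on, a quick one-liner does the trick. (WeasyPrint relies on a few system libraries, such as Pango, on some platforms, so it’s worth checking that the import succeeds; see the WeasyPrint install docs if it fails.)
python -c "import weasyprint; print(weasyprint.__version__)"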
Setting Up Your Project
Let’s organize things a bit by creating a directory where all the magic will happen:
mkdir url_to_pdf_converter
cd url_to_pdf_converter
Inside, you’ll be working with these files:
- sitemap.xml – the XML sitemap that contains your URLs.
- urls.txt – where we’ll store the extracted URLs.
- pdf_output/ – a folder where our PDFs will go.
(The script creates the last two automatically, so you only need to supply the sitemap.)
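If you don’t have a sitemap handy yet, here’s a minimal one in the standard format (the example.com URLs are placeholders; swap in your own):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/first-post/</loc>
  </url>
  <url>
    <loc>https://example.com/blog/second-post/</loc>
  </url>
</urlset>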
Pretty simple setup, right? Now, let’s get to coding!
Understanding the Code
We’ll break the app into three core functions:
- Extracting URLs from the sitemap
- Converting those URLs into PDFs
- A main function to tie everything together
Step 1: Extract URLs from the Sitemap
The first function parses the XML sitemap and pulls out the URLs. Here’s how that looks in Python:
import xml.etree.ElementTree as ET

def extract_urls_from_sitemap(sitemap_path, output_path):
    # Parse the XML file
    tree = ET.parse(sitemap_path)
    root = tree.getroot()

    # Extract URLs from <loc> tags (note the sitemap namespace)
    urls = []
    for url in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
        urls.append(url.text.strip())

    # Write URLs to the text file, one per line
    with open(output_path, 'w') as f:
        for url in urls:
            f.write(f"{url}\n")

    print(f"Extracted {len(urls)} URLs and saved to {output_path}")
This function does a few things:
- Parses the XML sitemap.
- Finds URLs in <loc> tags.
- Writes them to a text file.
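One gotcha worth calling out: the curly-brace prefix in findall() is required because standard sitemaps declare a default XML namespace, so a plain './/loc' query would silently return nothing. If you prefer, ElementTree also accepts a namespace mapping, which reads a little cleaner:

# Equivalent lookup using a namespace mapping instead of the inline prefix
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url in root.findall('.//sm:loc', ns):
    print(url.text.strip())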
Step 2: Convert URLs to PDFs
Now that we have our URLs, we’ll need to convert the content of each URL into a PDF. This is where WeasyPrint comes in handy.
import os
from weasyprint import HTML
from urllib.parse import urlparse

def urls_to_pdfs(input_file, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Read URLs from the input file
    with open(input_file, 'r') as f:
        urls = f.read().splitlines()

    # Process each URL
    for url in urls:
        try:
            # Extract the last part of the URL path for the filename
            parsed_url = urlparse(url)
            path_parts = parsed_url.path.rstrip('/').split('/')
            # Fall back to 'index' for root URLs, where the last part is empty
            last_folder = path_parts[-1] or 'index'

            # Generate a filename for the PDF
            filename = f"{last_folder}.pdf"
            pdf_path = os.path.join(output_folder, filename)

            # Convert HTML to PDF using WeasyPrint
            HTML(url=url).write_pdf(pdf_path)
            print(f"Converted {url} to {pdf_path}")
        except Exception as e:
            print(f"Error processing {url}: {e}")
Here’s what it does:
- Reads URLs from the file.
- Converts each URL’s content into a PDF.
- Names the PDFs based on the last part of the URL path.
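To see how the naming logic behaves, here’s a quick standalone illustration (the URL is a made-up example):

from urllib.parse import urlparse

url = "https://example.com/blog/my-first-post/"
path_parts = urlparse(url).path.rstrip('/').split('/')
print(path_parts[-1] or 'index')  # prints: my-first-post

One caveat: two URLs that end in the same segment will overwrite each other’s PDFs. See the FAQ below for a way to customize the naming.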
Step 3: The Main Function
Finally, we’ll create a main function that ties everything together. It’ll call the URL extraction and PDF conversion functions in sequence.
def main():
    sitemap_path = 'sitemap.xml'
    urls_file = 'urls.txt'
    pdf_folder = 'pdf_output'

    # Extract URLs from sitemap
    extract_urls_from_sitemap(sitemap_path, urls_file)

    # Convert URLs to PDFs
    urls_to_pdfs(urls_file, pdf_folder)

if __name__ == "__main__":
    main()
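The paths are hard-coded to keep the walkthrough simple. If you’d rather pass them on the command line, here’s a minimal sketch using argparse (the flag names are my own suggestion, not part of the original script):

import argparse

def main():
    # Defaults match the hard-coded paths above
    parser = argparse.ArgumentParser(description="Convert sitemap URLs to PDFs")
    parser.add_argument('--sitemap', default='sitemap.xml')
    parser.add_argument('--urls-file', default='urls.txt')
    parser.add_argument('--output', default='pdf_output')
    args = parser.parse_args()

    extract_urls_from_sitemap(args.sitemap, args.urls_file)
    urls_to_pdfs(args.urls_file, args.output)

if __name__ == "__main__":
    main()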
Running the App
To get everything working, just make sure you have your sitemap.xml ready. Then, run the Python script like this:
python your_script_name.py
Once it’s done, you should see the URLs saved in urls.txt and PDFs nicely stored in the pdf_output folder.
Frequently Asked Questions (FAQs)
1. What does WeasyPrint do?
It converts HTML documents into PDFs, which is exactly what we’re using it for.
2. Can I use this for any XML sitemap?
Yep! As long as it’s a standard urlset sitemap, this app will handle it. A sitemap index file (one that lists other sitemaps rather than pages) would need an extra pass to fetch each child sitemap first.
3. Can I customize PDF filenames?
Absolutely! Just tweak the logic in urls_to_pdfs for more creative naming schemes.
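For example, here’s a sketch (my own variation, not from the script above) that builds the filename from the full URL path, so /blog/my-post/ becomes blog-my-post.pdf and collisions between pages with the same final segment go away:

# Inside urls_to_pdfs, replace the filename logic with:
slug = parsed_url.path.strip('/').replace('/', '-') or 'index'
filename = f"{slug}.pdf"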
4. How do I handle errors?
The urls_to_pdfs function already catches errors, so you’ll get a message if something goes wrong during the PDF conversion.
5. How do I install Python libraries again?
Just run this:
pip install WeasyPrint
Conclusion
And that’s it! You’ve just built an app that extracts URLs from a sitemap and converts them to PDFs. Not too shabby, huh? This can be a super handy tool for archiving web pages, or even for generating offline documentation.
Happy coding!