Print Your Entire Website into PDFs!
How to crawl your website and print out a PDF of each article on your website, with a little Python code
Author: Jeremy Morgan
Published: October 12, 2024
I wrote a book! Check out A Quick Guide to Coding with AI.
Become a super programmer!
Learn how to use Generative AI coding tools as a force multiplier for your career.
Hey there, Python enthusiasts! Today, we’re going to dive into building a Python app that does something pretty cool: extract URLs from an XML sitemap and convert them into PDFs. Sounds useful, right? This is perfect for things like archiving web pages or generating offline documentation.
We’ll be using Python’s standard library modules xml.etree.ElementTree, os, and urllib.parse, along with the excellent WeasyPrint library for the HTML-to-PDF conversion. By the end of this, you’ll have a functional app that can take a sitemap, grab the URLs, and output their contents as PDFs. Let’s jump in!
What You’ll Need (Prerequisites)
Before we roll up our sleeves, make sure you have these tools:
- Python installed. Grab it from python.org.
- pip, Python’s package installer (usually bundled with Python).
- WeasyPrint, which is our tool for converting HTML to PDFs. Install it using:
pip install WeasyPrint
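If you want to confirm the install worked before moving on, a quick one-liner does the trick. (WeasyPrint relies on a few system libraries, such as Pango, on some platforms, so it’s worth checking that the import succeeds; see the WeasyPrint install docs if it fails.)
python -c "import weasyprint; print(weasyprint.__version__)"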
Setting Up Your Project
Let’s organize things a bit by creating a directory where all the magic will happen:
mkdir url_to_pdf_converter
cd url_to_pdf_converter
Inside, you’ll be working with these files:
- sitemap.xml – the XML sitemap that contains your URLs.
- urls.txt – where we’ll store the extracted URLs.
- pdf_output/ – a folder where our PDFs will go.
(The script creates the last two automatically, so you only need to supply the sitemap.)
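If you don’t have a sitemap handy yet, here’s a minimal one in the standard format (the example.com URLs are placeholders; swap in your own):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/first-post/</loc>
  </url>
  <url>
    <loc>https://example.com/blog/second-post/</loc>
  </url>
</urlset>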
Pretty simple setup, right? Now, let’s get to coding!
Understanding the Code
We’ll break the app into three core functions:
- Extracting URLs from the sitemap
- Converting those URLs into PDFs
- A main function to tie everything together
Step 1: Extract URLs from the Sitemap
The first function parses the XML sitemap and pulls out the URLs. Here’s how that looks in Python:
import xml.etree.ElementTree as ET

def extract_urls_from_sitemap(sitemap_path, output_path):
    # Parse the XML file
    tree = ET.parse(sitemap_path)
    root = tree.getroot()

    # Extract URLs from <loc> tags (note the sitemap namespace)
    urls = []
    for url in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
        urls.append(url.text.strip())

    # Write URLs to the text file, one per line
    with open(output_path, 'w') as f:
        for url in urls:
            f.write(f"{url}\n")

    print(f"Extracted {len(urls)} URLs and saved to {output_path}")
This function does a few things:
- Parses the XML sitemap.
- Finds URLs in <loc> tags.
- Writes them to a text file.
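One gotcha worth calling out: the curly-brace prefix in findall() is required because standard sitemaps declare a default XML namespace, so a plain './/loc' query would silently return nothing. If you prefer, ElementTree also accepts a namespace mapping, which reads a little cleaner:

# Equivalent lookup using a namespace mapping instead of the inline prefix
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url in root.findall('.//sm:loc', ns):
    print(url.text.strip())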
Step 2: Convert URLs to PDFs
Now that we have our URLs, we’ll need to convert the content of each URL into a PDF. This is where WeasyPrint comes in handy.
import os
from weasyprint import HTML
from urllib.parse import urlparse

def urls_to_pdfs(input_file, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Read URLs from the input file
    with open(input_file, 'r') as f:
        urls = f.read().splitlines()

    # Process each URL
    for url in urls:
        try:
            # Extract the last part of the URL path for the filename
            parsed_url = urlparse(url)
            path_parts = parsed_url.path.rstrip('/').split('/')
            # Fall back to 'index' for root URLs, where the last part is empty
            last_folder = path_parts[-1] or 'index'

            # Generate a filename for the PDF
            filename = f"{last_folder}.pdf"
            pdf_path = os.path.join(output_folder, filename)

            # Convert HTML to PDF using WeasyPrint
            HTML(url=url).write_pdf(pdf_path)
            print(f"Converted {url} to {pdf_path}")
        except Exception as e:
            print(f"Error processing {url}: {e}")
Here’s what it does:
- Reads URLs from the file.
- Converts each URL’s content into a PDF.
- Names the PDFs based on the last part of the URL path.
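To see how the naming logic behaves, here’s a quick standalone illustration (the URL is a made-up example):

from urllib.parse import urlparse

url = "https://example.com/blog/my-first-post/"
path_parts = urlparse(url).path.rstrip('/').split('/')
print(path_parts[-1] or 'index')  # prints: my-first-post

One caveat: two URLs that end in the same segment will overwrite each other’s PDFs. See the FAQ below for a way to customize the naming.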
Step 3: The Main Function
Finally, we’ll create a main function that ties everything together. It’ll call the URL extraction and PDF conversion functions in sequence.
def main():
    sitemap_path = 'sitemap.xml'
    urls_file = 'urls.txt'
    pdf_folder = 'pdf_output'

    # Extract URLs from sitemap
    extract_urls_from_sitemap(sitemap_path, urls_file)

    # Convert URLs to PDFs
    urls_to_pdfs(urls_file, pdf_folder)

if __name__ == "__main__":
    main()
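The paths are hard-coded to keep the walkthrough simple. If you’d rather pass them on the command line, here’s a minimal sketch using argparse (the flag names are my own suggestion, not part of the original script):

import argparse

def main():
    # Defaults match the hard-coded paths above
    parser = argparse.ArgumentParser(description="Convert sitemap URLs to PDFs")
    parser.add_argument('--sitemap', default='sitemap.xml')
    parser.add_argument('--urls-file', default='urls.txt')
    parser.add_argument('--output', default='pdf_output')
    args = parser.parse_args()

    extract_urls_from_sitemap(args.sitemap, args.urls_file)
    urls_to_pdfs(args.urls_file, args.output)

if __name__ == "__main__":
    main()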
Running the App
To get everything working, just make sure you have your sitemap.xml ready. Then, run the Python script like this:
python your_script_name.py
Once it’s done, you should see the URLs saved in urls.txt and PDFs nicely stored in the pdf_output folder.
Frequently Asked Questions (FAQs)
1. What does WeasyPrint do?
It converts HTML documents into PDFs, which is exactly what we’re using it for.
2. Can I use this for any XML sitemap?
Yep! As long as it’s a standard urlset sitemap, this app will handle it. A sitemap index file (one that lists other sitemaps rather than pages) would need an extra pass to fetch each child sitemap first.
3. Can I customize PDF filenames?
Absolutely! Just tweak the logic in urls_to_pdfs for more creative naming schemes.
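For example, here’s a sketch (my own variation, not from the script above) that builds the filename from the full URL path, so /blog/my-post/ becomes blog-my-post.pdf and collisions between pages with the same final segment go away:

# Inside urls_to_pdfs, replace the filename logic with:
slug = parsed_url.path.strip('/').replace('/', '-') or 'index'
filename = f"{slug}.pdf"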
4. How do I handle errors?
The urls_to_pdfs function already catches errors, so you’ll get a message if something goes wrong during the PDF conversion.
5. How do I install Python libraries again?
Just run this:
pip install WeasyPrint
Conclusion
And that’s it! You’ve just built an app that extracts URLs from a sitemap and converts them to PDFs. Not too shabby, huh? This can be a super handy tool for archiving web pages, or even for generating offline documentation.
Happy coding!