
Backup and Restore Files in Python
Let's talk about something every developer eventually needs: backing up and restoring files. Whether you're safeguarding important documents, preserving application data, or creating snapshots of your work, having a reliable backup system is crucial. In this article, we'll explore how to implement backup and restore functionality using Python's built-in modules.
Understanding the Basics
Before we dive into code, it's important to understand what we mean by backup and restore. A backup involves copying files from their original location to a backup destination, while restore means copying them back from the backup to their original (or alternative) location.
Python provides several modules that make file operations straightforward. The main ones we'll use are:
- shutil for high-level file operations
- os for path manipulation and directory operations
- pathlib for modern path handling (Python 3.4+)
Creating a Simple Backup Script
Let's start with a basic backup function that copies files from a source directory to a backup directory:
import shutil
import os
from pathlib import Path
from datetime import datetime

def create_backup(source_dir, backup_dir):
    """
    Create a backup of source_dir in backup_dir with timestamp
    """
    # Create timestamp for backup folder
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = Path(backup_dir) / f"backup_{timestamp}"

    # Ensure the parent backup directory exists; copytree() creates
    # backup_path itself and raises an error if it already exists
    Path(backup_dir).mkdir(parents=True, exist_ok=True)

    # Copy all files and subdirectories
    shutil.copytree(source_dir, backup_path)
    return str(backup_path)

# Usage example
source = "/path/to/important/files"
backup_location = "/path/to/backups"

backup_path = create_backup(source, backup_location)
print(f"Backup created at: {backup_path}")
This simple script creates a timestamped backup folder and copies everything from your source directory. The shutil.copytree() function handles recursive copying of directories.
Restoring from Backup
Now let's create a restore function:
def restore_backup(backup_path, restore_location):
    """
    Restore files from backup_path to restore_location
    """
    # Ensure restore location exists
    Path(restore_location).mkdir(parents=True, exist_ok=True)

    # Clear restore location first if you want an exact copy (be careful!)
    # shutil.rmtree(restore_location)

    # Copy backup to restore location
    shutil.copytree(backup_path, restore_location, dirs_exist_ok=True)
    return True

# Usage example
restore_from = "/path/to/backups/backup_20231015_143022"
restore_to = "/path/to/restored/files"

restore_backup(restore_from, restore_to)
print("Restore completed successfully")
The dirs_exist_ok=True parameter in shutil.copytree() (available since Python 3.8) allows us to copy into an existing directory, which is useful for incremental restores.
Handling Different Backup Scenarios
Not all backups are created equal. You might want different types of backups depending on your needs:
- Incremental backups only copy files that have changed since the last backup
- Full backups copy everything every time
- Compressed backups save space by compressing the backup files
Here's a comparison of common backup strategies:
Strategy | Speed | Storage Usage | Complexity | Best For
---|---|---|---|---
Full Backup | Slow | High | Low | Small datasets
Incremental Backup | Fast | Low | Medium | Frequent backups
Differential Backup | Medium | Medium | High | Balanced approach
Compressed Backup | Slow | Very Low | Medium | Limited storage
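The table lists incremental backups, but the examples so far are full copies. Here is a minimal sketch of the incremental idea, under the simplifying assumption that a single last_backup_time timestamp is enough state (real incremental tools track per-file state or checksums):

```python
import os
import shutil
from pathlib import Path

def incremental_backup(source_dir, backup_dir, last_backup_time=0.0):
    """Copy only files modified after last_backup_time (a POSIX timestamp).

    A minimal sketch: production tools track state per file rather than
    relying on one global timestamp, but this shows the core idea.
    """
    backup_path = Path(backup_dir)
    copied = []
    for root, _, files in os.walk(source_dir):
        for name in files:
            src = Path(root) / name
            # st_mtime is the file's last-modification time
            if src.stat().st_mtime > last_backup_time:
                dst = backup_path / src.relative_to(source_dir)
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)
                copied.append(str(src.relative_to(source_dir)))
    return copied
```

Calling this with the timestamp of the previous run copies only what changed since then; calling it with 0.0 behaves like a full backup.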
Let's implement a compressed backup using Python's zipfile module:
import zipfile

def create_compressed_backup(source_dir, backup_dir):
    """
    Create a compressed backup as a zip file
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    zip_path = Path(backup_dir) / f"backup_{timestamp}.zip"

    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(source_dir):
            for file in files:
                file_path = Path(root) / file
                # Add file to zip with relative path
                zipf.write(file_path, arcname=file_path.relative_to(source_dir))
    return str(zip_path)

# Restore from compressed backup
def restore_compressed_backup(zip_path, restore_location):
    """
    Restore from a compressed backup
    """
    with zipfile.ZipFile(zip_path, 'r') as zipf:
        zipf.extractall(restore_location)
    return True
Advanced Backup Features
For a more robust backup solution, you might want to include:
- Progress tracking during backup/restore operations
- Error handling for file permission issues
- Logging of backup activities
- Verification of backed up files
- Scheduling automatic backups
Here's an enhanced version with better error handling:
def safe_backup(source_dir, backup_dir):
    """
    Backup with error handling and logging
    """
    try:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_path = Path(backup_dir) / f"backup_{timestamp}"

        if not Path(source_dir).exists():
            raise FileNotFoundError(f"Source directory {source_dir} not found")

        backup_path.mkdir(parents=True, exist_ok=True)

        # Count files for progress tracking
        total_files = sum(len(files) for _, _, files in os.walk(source_dir))
        copied_files = 0

        for root, dirs, files in os.walk(source_dir):
            for file in files:
                src_path = Path(root) / file
                dst_path = backup_path / src_path.relative_to(source_dir)

                # Create parent directories if needed
                dst_path.parent.mkdir(parents=True, exist_ok=True)

                # Copy file
                shutil.copy2(src_path, dst_path)
                copied_files += 1

                # Print progress (optional)
                if copied_files % 100 == 0:
                    print(f"Progress: {copied_files}/{total_files} files copied")

        print(f"Backup completed: {copied_files} files copied")
        return str(backup_path)

    except Exception as e:
        print(f"Backup failed: {str(e)}")
        # Clean up partial backup
        if 'backup_path' in locals() and backup_path.exists():
            shutil.rmtree(backup_path)
        return None
Scheduling Automatic Backups
You can combine your backup functions with Python's scheduling capabilities (the schedule package is third-party: pip install schedule):
import schedule
import time

def scheduled_backup():
    source = "/path/to/important/files"
    backup_dir = "/path/to/backups"

    # safe_backup() returns None on failure instead of raising
    result = safe_backup(source, backup_dir)
    if result:
        print(f"Scheduled backup completed: {result}")
    else:
        print("Scheduled backup failed")

# Schedule daily backup at 2:00 AM
schedule.every().day.at("02:00").do(scheduled_backup)

# Keep the script running
while True:
    schedule.run_pending()
    time.sleep(60)
Best Practices for File Backups
When implementing backup solutions, consider these important practices:
- Always verify that your backups actually work by testing restoration
- Keep multiple backup versions to protect against corrupted backups
- Store backups in different locations to protect against physical damage
- Automate the process to ensure consistency
- Monitor backup success/failure with notifications
- Encrypt sensitive data before backing up
Here's a simple verification function:
def verify_backup(original_dir, backup_dir):
    """
    Verify that backup matches original
    """
    original_files = set()
    backup_files = set()

    # Collect all file paths (relative to their roots)
    for root, _, files in os.walk(original_dir):
        for file in files:
            rel_path = Path(root).relative_to(original_dir) / file
            original_files.add(str(rel_path))

    for root, _, files in os.walk(backup_dir):
        for file in files:
            rel_path = Path(root).relative_to(backup_dir) / file
            backup_files.add(str(rel_path))

    missing_files = original_files - backup_files
    extra_files = backup_files - original_files

    return {
        'verified': len(missing_files) == 0 and len(extra_files) == 0,
        'missing_files': list(missing_files),
        'extra_files': list(extra_files)
    }
Handling Large Files and Performance
When working with large files or many files, performance becomes important. Here are some optimization techniques:
- Use shutil.copy2() instead of shutil.copy() to preserve metadata
- Consider using rsync-like algorithms for incremental backups
- Use multi-threading for parallel file operations
- Implement chunked reading/writing for very large files
from concurrent.futures import ThreadPoolExecutor

def parallel_backup(source_dir, backup_dir, max_workers=4):
    """
    Backup using multiple threads for better performance
    """
    backup_path = Path(backup_dir) / f"backup_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    backup_path.mkdir(parents=True, exist_ok=True)

    # Build the full list of (source, destination) pairs up front
    file_list = []
    for root, _, files in os.walk(source_dir):
        for file in files:
            src_file = Path(root) / file
            rel_path = src_file.relative_to(source_dir)
            dst_file = backup_path / rel_path
            file_list.append((src_file, dst_file))

    def copy_file(args):
        src, dst = args
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)

    # map() consumes the iterator, so wrap it in list() to wait for completion
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(copy_file, file_list))

    return str(backup_path)
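The optimization list above also mentions chunked reading and writing for very large files. A sketch of that idea, with a hypothetical progress callback (the callback parameter is illustrative, not part of any standard API):

```python
def chunked_copy(src_path, dst_path, chunk_size=1024 * 1024, progress=None):
    """Copy a file in chunk_size pieces, returning the total bytes copied.

    progress, if given, is called with the running byte count after each
    chunk -- a hypothetical hook for progress bars or cancellation checks.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
            if progress:
                progress(copied)
    return copied
```

For plain copying, shutil.copy2() is usually the better choice; chunked copying earns its keep when you want to hash, throttle, or report progress while the bytes move.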
Common Backup Challenges and Solutions
Even with well-written code, you might encounter challenges:
Permission errors often occur when trying to access system files or files owned by other users. Handle these with try/except blocks and consider running your script with appropriate permissions.
Network timeouts can interrupt backups over network drives. Implement retry logic for network operations.
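One way to sketch that retry logic is a small wrapper with exponential backoff, assuming here that OSError covers the transient failures you care about:

```python
import time

def with_retries(operation, attempts=3, delay=1.0, backoff=2.0):
    """Call operation(), retrying on OSError with exponential backoff.

    Raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= backoff
```

You could then wrap a copy as with_retries(lambda: shutil.copy2(src, dst)); tune the exception type and delays to the failure modes of your network storage.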
Storage space limitations might cause backups to fail. Always check available space before starting large backups.
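A space check before a large backup can use shutil.disk_usage. A minimal sketch, with an assumed safety margin for metadata and concurrent writes:

```python
import os
import shutil

def has_enough_space(source_dir, backup_dir, safety_margin=1.1):
    """Return True if backup_dir's filesystem can hold source_dir's files.

    safety_margin (an assumed 10% here) leaves headroom for filesystem
    metadata and anything else writing to the same disk.
    """
    needed = 0
    for root, _, files in os.walk(source_dir):
        for name in files:
            needed += os.path.getsize(os.path.join(root, name))
    free = shutil.disk_usage(backup_dir).free
    return free >= needed * safety_margin
```

Run this before starting the copy and abort (or prune old backups) when it returns False.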
File locking issues can occur when you try to back up files that are in use. Consider using the Volume Shadow Copy Service on Windows or LVM snapshots on Linux for consistent backups of open files.
Creating a Complete Backup Application
Let's put everything together into a more comprehensive backup class:
class BackupManager:
    def __init__(self, config_file=None):
        self.config = self.load_config(config_file) if config_file else {}
        self.setup_logging()

    def load_config(self, config_file):
        # Load configuration from a JSON file
        import json
        with open(config_file, 'r') as f:
            return json.load(f)

    def setup_logging(self):
        import logging
        logging.basicConfig(
            filename='backup.log',
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)

    def create_backup(self, source, destination, backup_type='full'):
        self.logger.info(f"Starting {backup_type} backup from {source} to {destination}")
        try:
            if backup_type == 'full':
                result = create_backup(source, destination)
            elif backup_type == 'compressed':
                result = create_compressed_backup(source, destination)
            else:
                raise ValueError(f"Unknown backup type: {backup_type}")
            self.logger.info(f"Backup completed successfully: {result}")
            return result
        except Exception as e:
            self.logger.error(f"Backup failed: {str(e)}")
            raise

    def restore_backup(self, backup_path, restore_location):
        self.logger.info(f"Restoring from {backup_path} to {restore_location}")
        try:
            if backup_path.endswith('.zip'):
                restore_compressed_backup(backup_path, restore_location)
            else:
                # Calls the module-level restore_backup function, not this method
                restore_backup(backup_path, restore_location)
            self.logger.info("Restore completed successfully")
            return True
        except Exception as e:
            self.logger.error(f"Restore failed: {str(e)}")
            raise

# Usage
manager = BackupManager('backup_config.json')
manager.create_backup('/data', '/backups', 'compressed')
This class provides a foundation for a more sophisticated backup solution that includes configuration management, logging, and multiple backup strategies.
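The load_config method just parses JSON, so the configuration layout is up to you. One hypothetical layout, written out from Python (these keys are illustrative; the class as shown doesn't require any particular ones):

```python
import json

# Hypothetical configuration layout for BackupManager -- the class only
# parses the JSON, so these keys are illustrative, not required.
config = {
    "source": "/data",
    "destination": "/backups",
    "backup_type": "compressed",
}

with open("backup_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

You could then extend create_backup to read defaults from self.config instead of requiring every argument at the call site.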
Remember that testing your backups is just as important as creating them. Always verify that you can successfully restore from your backups before you actually need them. Regular testing ensures that your backup strategy is effective and reliable when you need it most.
The techniques we've covered provide a solid foundation for implementing file backup and restore functionality in Python. As you develop your own backup solutions, consider your specific requirements for performance, reliability, and features to create the optimal solution for your needs.