Combining Multiple Files into One

Have you ever found yourself with a bunch of small files and wished you could merge them into a single, more manageable file? Maybe you’re processing logs, consolidating CSV exports, or just organizing your project’s source code. Whatever the reason, combining multiple files is a common task, and Python makes it both straightforward and flexible.

In this article, we’ll walk through several ways to merge files using Python. Whether you’re a beginner or looking to refine your skills, you’ll find methods that suit your needs.

Reading and Writing Files in Python

Before we dive into combining files, let’s quickly review how to read from and write to files in Python. The built-in open() function is your go-to tool for file operations. You can open a file in read mode ('r'), write mode ('w'), append mode ('a'), and others.

Here’s a simple example of reading a file and writing its content to another:

with open('source.txt', 'r') as source:
    content = source.read()

with open('destination.txt', 'w') as dest:
    dest.write(content)

This code opens source.txt, reads its entire content, and writes it to destination.txt. If destination.txt doesn’t exist, it will be created. If it does exist, its previous content will be overwritten.
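
If you want to add to an existing file without erasing it, open the destination in append mode instead. A minimal sketch, reusing the same placeholder filenames:

with open('source.txt', 'r') as source:
    content = source.read()

# Append mode ('a') keeps the existing content and writes after it
with open('destination.txt', 'a') as dest:
    dest.write(content)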

Combining Text Files

Let’s start with the most common scenario: merging multiple text files. Suppose you have several .txt files in a folder and want to combine them into one.

Using a Simple Loop

You can use a loop to iterate through the files, read each one, and append its content to a master file.

files_to_combine = ['file1.txt', 'file2.txt', 'file3.txt']

with open('combined.txt', 'w') as outfile:
    for filename in files_to_combine:
        with open(filename, 'r') as infile:
            outfile.write(infile.read())
            outfile.write('\n')  # Add a newline between files

This method works well if you have a small, known set of files. But what if you have many files or don’t want to list them manually?

Using glob to Select Files

The glob module helps you find files matching a specific pattern. For example, to combine all .txt files in the current directory:

import glob

txt_files = glob.glob('*.txt')

with open('combined.txt', 'w') as outfile:
    for filename in txt_files:
        with open(filename, 'r') as infile:
            outfile.write(infile.read())
            outfile.write('\n')

Now you don’t need to list each file individually; glob finds the matches for you. One caveat: if combined.txt is left over from a previous run, it also matches *.txt and will be folded into the new output, so give the output file a different extension or delete it first.

Preserving Original Structure

Sometimes you may want to include the original filename or separate each file’s content with a header. Here’s an enhanced version:

import glob

txt_files = glob.glob('*.txt')

with open('combined_with_headers.txt', 'w') as outfile:
    for filename in txt_files:
        with open(filename, 'r') as infile:
            outfile.write(f'--- {filename} ---\n')
            outfile.write(infile.read())
            outfile.write('\n\n')

This adds a header with the source filename before each file’s content, making it easier to identify where each section came from.

Combining CSV Files

CSV files are ubiquitous in data processing. Combining them requires a bit more care to handle headers correctly.

When All CSVs Have the Same Structure

If all your CSV files have the same columns, you can combine them while writing the header only once.

import csv
import glob

csv_files = glob.glob('*.csv')

with open('combined.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    header_written = False

    for filename in csv_files:
        with open(filename, 'r', newline='') as infile:
            reader = csv.reader(infile)
            header = next(reader)
            if not header_written:
                writer.writerow(header)
                header_written = True
            for row in reader:
                writer.writerow(row)

This script reads each CSV, skips the header after the first file, and appends the rows to combined.csv.
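
If your files share the same columns but not necessarily in the same order, one option is to match rows by column name with csv.DictReader and csv.DictWriter. Here’s a sketch, assuming every file has a header row and the same set of columns (combined_by_name.csv is just a placeholder name):

import csv
import glob

csv_files = glob.glob('*.csv')

with open('combined_by_name.csv', 'w', newline='') as outfile:
    writer = None
    for filename in csv_files:
        with open(filename, 'r', newline='') as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                # Use the first file's header as the output column order
                writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
                writer.writeheader()
            for row in reader:
                writer.writerow(row)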

Using pandas for CSV Merging

For larger datasets or more complex operations, the pandas library is incredibly useful. If you have it installed, you can combine CSVs with just a few lines.

import pandas as pd
import glob

csv_files = glob.glob('*.csv')
df_list = [pd.read_csv(file) for file in csv_files]
combined_df = pd.concat(df_list, ignore_index=True)
combined_df.to_csv('combined.csv', index=False)

This approach is efficient and handles varying structures more gracefully, though it requires an external library.
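
For instance, if the files don’t share exactly the same columns, pd.concat lines up matching columns by name and fills the gaps with NaN by default. A minimal sketch with made-up column names:

import pandas as pd

# Two frames with partially overlapping columns (hypothetical data)
df_a = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
df_b = pd.DataFrame({'id': [3], 'score': [9.5]})

# Result has columns id, name, score; cells with no source value become NaN
combined = pd.concat([df_a, df_b], ignore_index=True)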

Combining Binary Files

Text and CSV files are common, but you might also need to merge binary files, for example the numbered parts of a file that was split for transfer. The process is similar but requires working in binary mode. Keep in mind that this is a raw byte-level concatenation: formats such as PDFs or images have their own internal structure, so producing a valid merged document from them requires a format-aware library rather than simple concatenation.

Appending Binary Data

To combine binary files, open them in binary read ('rb') and write ('wb' or 'ab') modes.

binary_files = ['part1.bin', 'part2.bin', 'part3.bin']

with open('combined.bin', 'wb') as outfile:
    for filename in binary_files:
        with open(filename, 'rb') as infile:
            outfile.write(infile.read())

This copies each file’s raw bytes in order, but be cautious with very large files, as reading them entirely into memory may not be efficient.

Handling Large Files

When dealing with large files, reading the entire content at once can consume a lot of memory. Instead, read and write in chunks.

Chunked Reading and Writing

Here’s how you can process files in chunks to reduce memory usage:

chunk_size = 8192  # read up to 8,192 characters at a time

with open('large_combined.txt', 'w') as outfile:
    for filename in ['large1.txt', 'large2.txt']:
        with open(filename, 'r') as infile:
            while True:
                chunk = infile.read(chunk_size)
                if not chunk:
                    break
                outfile.write(chunk)
            outfile.write('\n')

This method reads a small piece of the file at a time, writes it to the output, and repeats until the entire file is processed.
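
For binary files, the standard library’s shutil.copyfileobj handles the chunked copy for you. Here’s a variant of the earlier binary example (same placeholder filenames):

import shutil

binary_files = ['part1.bin', 'part2.bin', 'part3.bin']

with open('combined.bin', 'wb') as outfile:
    for filename in binary_files:
        with open(filename, 'rb') as infile:
            # copyfileobj reads and writes in fixed-size chunks,
            # so the whole file never sits in memory at once
            shutil.copyfileobj(infile, outfile)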

Merging Files with Different Encodings

Occasionally, you might encounter files with different character encodings. Python allows you to specify the encoding when opening a file.

Specifying Encoding

To handle files with, say, UTF-8 and Latin-1 encodings:

files_with_encoding = [
    ('file1.txt', 'utf-8'),
    ('file2.txt', 'latin-1')
]

with open('combined_encoded.txt', 'w', encoding='utf-8') as outfile:
    for filename, encoding in files_with_encoding:
        with open(filename, 'r', encoding=encoding) as infile:
            outfile.write(infile.read())
        outfile.write('\n')

This ensures that each file is read with its correct encoding and written in a consistent encoding (UTF-8 in this case).
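
If you don’t know a file’s encoding up front, you can also tell Python how to handle bytes that won’t decode. A small sketch, assuming you’d rather get replacement characters than a crash (mystery.txt is a placeholder name):

# errors='replace' swaps undecodable bytes for the Unicode replacement
# character instead of raising UnicodeDecodeError
with open('mystery.txt', 'r', encoding='utf-8', errors='replace') as infile:
    text = infile.read()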

Error Handling and Robustness

In real-world scenarios, files might be missing, corrupted, or unreadable. Adding error handling makes your script more robust.

Using Try-Except Blocks

Wrap file operations in try-except blocks to gracefully handle errors.

import glob

txt_files = glob.glob('*.txt')

with open('combined_robust.txt', 'w') as outfile:
    for filename in txt_files:
        try:
            with open(filename, 'r') as infile:
                outfile.write(infile.read())
                outfile.write('\n')
        except IOError as e:
            print(f"Error reading {filename}: {e}")

This way, if one file fails, the script continues processing the others and logs the error.

Advanced: Using pathlib for File Paths

The pathlib module provides an object-oriented approach to handling filesystem paths. It’s modern and often more readable.

Combining Files with pathlib

Here’s how you can use pathlib to combine text files:

from pathlib import Path

txt_files = Path('.').glob('*.txt')

with open('combined_pathlib.txt', 'w') as outfile:
    for filepath in txt_files:
        outfile.write(filepath.read_text())
        outfile.write('\n')

This achieves the same result as earlier examples but with cleaner path handling.

Summary of Methods

Below is a table summarizing the different methods we’ve covered for combining files:

Method          | Use Case                   | Pros                      | Cons
Simple Loop     | Few, known files           | Easy to implement         | Manual file listing
glob            | Multiple files by pattern  | Automatic file selection  | Pattern must be known
Chunked Reading | Large files                | Memory efficient          | Slightly more complex
pandas          | CSV files                  | Handles structure well    | Requires external library
Binary Mode     | Non-text files             | Works with any file type  | No content interpretation

Key Considerations

When combining files, keep these points in mind:

  • File Order: Ensure files are combined in the correct order, especially if sequence matters (see the sorting sketch just after this list).
  • Memory Usage: For large files, use chunked reading to avoid high memory consumption.
  • Encoding: Be aware of character encodings to prevent garbled text.
  • Headers: For structured files like CSV, handle headers appropriately.
  • Error Handling: Make your script robust to handle missing or corrupt files.
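
On the file-order point: glob makes no guarantee about the order of its matches, so sort the list when the sequence matters. A small sketch, reusing the earlier pattern (combined_sorted.txt is just a placeholder name):

import glob

# Sorting the matches gives a predictable, repeatable order
files = sorted(glob.glob('*.txt'))

with open('combined_sorted.txt', 'w') as outfile:
    for filename in files:
        with open(filename, 'r') as infile:
            outfile.write(infile.read())
            outfile.write('\n')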

Putting It All Together

Let’s create a reusable function that combines text files with several options:

import glob

def combine_text_files(pattern, output_file, add_headers=False, chunk_size=None):
    """Combine all files matching pattern into output_file.

    If add_headers is True, a '--- filename ---' line precedes each file's
    content. If chunk_size is given, each file is copied in chunks of that
    many characters to keep memory usage low.
    """
    files = glob.glob(pattern)

    with open(output_file, 'w') as outfile:
        for filename in files:
            if add_headers:
                outfile.write(f'--- {filename} ---\n')

            if chunk_size:
                with open(filename, 'r') as infile:
                    while True:
                        chunk = infile.read(chunk_size)
                        if not chunk:
                            break
                        outfile.write(chunk)
            else:
                with open(filename, 'r') as infile:
                    outfile.write(infile.read())

            outfile.write('\n')

You can call this function with different parameters based on your needs:

# Combine all txt files without headers
combine_text_files('*.txt', 'combined.txt')

# Combine with headers and in chunks
combine_text_files('*.log', 'combined.log', add_headers=True, chunk_size=4096)

This function provides flexibility for various scenarios.

Final Thoughts

Combining files is a practical skill that can save you time and effort in data processing, log analysis, or general file management. Python’s built-in features, along with modules like glob and pandas, offer powerful and efficient ways to merge files of different types and sizes.

Remember to always test your scripts on sample data before running them on important files, and back up your original files to avoid accidental data loss. With the techniques covered here, you’re well-equipped to tackle file combining tasks in your projects.

Happy coding!