Reading Multiple Files in Python

As you progress in your Python journey, you’ll often find yourself working with multiple files—whether you're processing logs, analyzing datasets, or automating tasks across documents. Knowing how to efficiently read multiple files is a must-have skill, and Python provides several powerful ways to do it. Let’s explore the main approaches.

Using os.listdir() to List Files

One of the simplest ways to get started is by using the os.listdir() function. It returns a list of the names of all files and directories in a specified path. It's straightforward but doesn't differentiate between files and folders, so you'll need to check each item.

Here’s a quick example:

import os

folder_path = './data'
for item in os.listdir(folder_path):
    full_path = os.path.join(folder_path, item)
    if os.path.isfile(full_path):
        with open(full_path, 'r') as file:
            content = file.read()
            print(f"Contents of {item}:\n{content}\n")

This script lists everything in the ./data directory, checks if each item is a file, and prints its content.

Method                 Returns            Recursive?
os.listdir()           List of names      No
os.walk()              Directory tree     Yes
glob.glob()            List of paths      Yes (with ** and recursive=True)
pathlib.Path.glob()    Path objects       Yes (with ** pattern)

Keep in mind: os.listdir() doesn’t traverse subdirectories by itself. If your files are nested, you’ll need a recursive approach.
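
If you only have os.listdir() to work with, you can build the recursion yourself. Here's a minimal sketch (os.walk(), covered later, does this for you):

import os

def list_files_recursively(folder_path):
    # Yield every file path under folder_path, descending into subfolders
    for item in os.listdir(folder_path):
        full_path = os.path.join(folder_path, item)
        if os.path.isfile(full_path):
            yield full_path
        elif os.path.isdir(full_path):
            yield from list_files_recursively(full_path)

for path in list_files_recursively('./data'):
    print(path)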

Using glob for Pattern Matching

The glob module is incredibly useful when you want to read files that match a specific pattern—like all .txt files or files following a naming convention. It supports wildcards like * and ?, making it easy to filter what you need.

Try this:

import glob

txt_files = glob.glob('./data/*.txt')
for file_path in txt_files:
    with open(file_path, 'r') as file:
        print(file.read())

This code reads every .txt file at the top level of the ./data folder. You can also use ** for recursive matching:

all_text_files = glob.glob('./data/**/*.txt', recursive=True)

Pro tip: glob returns paths that already include the directory portion of your pattern, so you don't need to manually join directory and file names like with os.listdir().
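
The ? wildcard is also handy for numbered files, since it matches exactly one character. A quick sketch, assuming hypothetical files named report_1.txt through report_9.txt:

import glob

# Matches report_1.txt through report_9.txt, but not report_10.txt
for file_path in glob.glob('./data/report_?.txt'):
    with open(file_path, 'r') as file:
        print(file.read())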

Using pathlib for Modern File Handling

If you're using Python 3.4 or above, pathlib offers an object-oriented approach to file system paths. Many developers prefer it for its readability and convenience.

Here’s how you can use it:

from pathlib import Path

folder = Path('./data')
for file_path in folder.glob('*.txt'):
    content = file_path.read_text()
    print(content)

You can also search subdirectories recursively with rglob():

all_files = folder.rglob('*.txt')

This method is not only clean but also cross-platform, meaning it works the same on Windows, macOS, and Linux.
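
Part of that convenience is the / operator, which joins path components with the correct separator for the current platform. A small sketch, assuming a hypothetical ./data/reports/summary.txt:

from pathlib import Path

folder = Path('./data')
# The / operator builds the path; pathlib picks the right separator
report = folder / 'reports' / 'summary.txt'
if report.is_file():
    print(report.read_text(encoding='utf-8'))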

Recursive Reading with os.walk()

When your files are organized in multiple nested directories, os.walk() is your best friend. It walks the directory tree top-down, yielding a (root, dirs, files) tuple for every directory it visits.

Example:

import os

for root, dirs, files in os.walk('./data'):
    for file_name in files:
        if file_name.endswith('.txt'):
            file_path = os.path.join(root, file_name)
            with open(file_path, 'r') as file:
                print(file.read())

This approach gives you full control over directory traversal and is very efficient for deeply nested structures.
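
That control includes pruning: os.walk() yields dirs as a mutable list, and editing it in place tells the walk which subdirectories to skip. A minimal sketch that avoids .git and __pycache__ folders:

import os

for root, dirs, files in os.walk('./data'):
    # Removing entries from dirs in place stops os.walk()
    # from descending into those folders
    dirs[:] = [d for d in dirs if d not in ('.git', '__pycache__')]
    for file_name in files:
        if file_name.endswith('.txt'):
            print(os.path.join(root, file_name))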

To help you choose the best method, here’s a quick comparison:

  • Use os.listdir() for simple, flat directory listings.
  • Choose glob when you need pattern matching without recursion.
  • Opt for pathlib for a modern, readable syntax.
  • Rely on os.walk() for full recursive directory traversal.

Handling Different File Types

So far, we've focused on text files, but you might need to read other formats. Let’s look at a few common ones.

For CSV files, you can use the csv module:

import csv
import glob

for csv_file in glob.glob('*.csv'):
    # newline='' lets the csv module manage line endings itself
    with open(csv_file, 'r', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            print(row)

JSON files are also common:

import json

for json_file in glob.glob('*.json'):
    with open(json_file, 'r') as file:
        data = json.load(file)
        print(data)

Remember: Always handle exceptions when reading files—use try-except blocks to manage missing files or permission errors.

Reading Files in Parallel

If you're dealing with a large number of files, reading them sequentially can be slow. Python's concurrent.futures module allows you to read files in parallel, and because file reading is I/O-bound, threads provide a real speedup despite the GIL.

Here’s a basic example using ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor
import glob

def read_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

file_paths = glob.glob('*.txt')
with ThreadPoolExecutor() as executor:
    contents = list(executor.map(read_file, file_paths))

for content in contents:
    print(content)

Note: Be cautious with system resource limits: opening too many files at once can exhaust your operating system's limit on open file descriptors.
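
One way to stay inside those limits is to cap the pool size, so only a fixed number of files are open at once. A minimal sketch, reusing the read_file helper and file_paths list from above:

# max_workers=8 means at most 8 files are open at the same time
with ThreadPoolExecutor(max_workers=8) as executor:
    contents = list(executor.map(read_file, file_paths))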

Best Practices for Reading Multiple Files

When working with multiple files, it's important to follow some best practices to keep your code efficient and error-free.

  • Always use context managers (with open(...)) to ensure files are closed properly.
  • Handle exceptions for missing files, permission errors, or encoding issues.
  • Use absolute paths for clarity, especially in larger projects.
  • Consider memory usage: for very large files, read line by line or in chunks, as sketched below.
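
Here's a minimal sketch of both memory-friendly styles, assuming a hypothetical big_log.txt (process() and handle() stand in for your own logic, and the := chunk loop needs Python 3.8+):

# Line by line: a file object is an iterator, so only one
# line is held in memory at a time
with open('big_log.txt', 'r', encoding='utf-8') as file:
    for line in file:
        process(line)  # placeholder for your own processing

# Fixed-size chunks: useful when the file has no line structure
with open('big_log.txt', 'rb') as file:
    while chunk := file.read(64 * 1024):  # read 64 KB at a time
        handle(chunk)  # placeholder as well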

Here’s an example with error handling:

import glob

for file_path in glob.glob('*.txt'):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            # process content
    except FileNotFoundError:
        print(f"{file_path} not found.")
    except PermissionError:
        print(f"Permission denied for {file_path}.")
    except UnicodeDecodeError:
        print(f"Encoding issue in {file_path}.")

By anticipating errors, you make your scripts more robust and user-friendly.

Scenario                  Recommended Approach
Flat directory            glob or pathlib
Nested directories        os.walk() or pathlib.rglob()
Pattern-based selection   glob
Large number of files     concurrent.futures
Cross-platform needs      pathlib

Putting It All Together: A Complete Example

Let’s write a script that reads all .txt files in a directory and its subdirectories, processes each one, and handles errors gracefully.

from pathlib import Path
import sys

def process_file(file_path):
    try:
        content = file_path.read_text(encoding='utf-8')
        # Example: count lines
        line_count = len(content.splitlines())
        print(f"{file_path}: {line_count} lines")
    except Exception as e:
        print(f"Error reading {file_path}: {e}")

def main(directory):
    folder = Path(directory)
    txt_files = folder.rglob('*.txt')
    for file_path in txt_files:
        process_file(file_path)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        main('.')

You can run this from the command line and pass a directory path, or it defaults to the current directory.
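
For example, assuming you saved the script as count_lines.py:

python count_lines.py ./data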

Summary of Key Points

  • Use os.listdir() for simple cases but remember to filter files from directories.
  • glob is great for pattern matching and returns full paths.
  • pathlib provides a modern, object-oriented interface.
  • os.walk() is ideal for recursive directory traversal.
  • Handle exceptions to make your code robust.
  • For many files, consider parallel reading with concurrent.futures.

Experiment with these methods to see which fits your use case best. Each has its strengths, and often the best choice depends on your specific needs and personal preference.

Now you're equipped to handle multiple files in Python like a pro!