
Reading Multiple Files in Python
As you progress in your Python journey, you’ll often find yourself working with multiple files—whether you're processing logs, analyzing datasets, or automating tasks across documents. Knowing how to efficiently read multiple files is a must-have skill, and Python provides several powerful ways to do it. Let’s explore the main approaches.
Using os.listdir() to List Files
One of the simplest ways to get started is with os.listdir(). This function returns a list of the names of all files and directories in a specified path. It's straightforward, but it doesn't differentiate between files and folders, so you'll need to check each item yourself.
Here’s a quick example:
```python
import os

folder_path = './data'

for item in os.listdir(folder_path):
    full_path = os.path.join(folder_path, item)
    if os.path.isfile(full_path):  # skip subdirectories
        with open(full_path, 'r') as file:
            content = file.read()
            print(f"Contents of {item}:\n{content}\n")
```
This script lists everything in the ./data directory, checks whether each item is a file, and prints its contents.
| Method | Returns | Recursive? |
|---|---|---|
| os.listdir() | List of names | No |
| os.walk() | Directory tree | Yes |
| glob.glob() | List of paths | Yes (with **) |
| pathlib.Path.glob() | Path objects | Yes (with **) |
Keep in mind: os.listdir() doesn't traverse subdirectories by itself. If your files are nested, you'll need a recursive approach.
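If you want to stay with os.listdir(), a minimal recursive sketch might look like this (list_files_recursively is a hypothetical helper, not part of the standard library):

```python
import os

def list_files_recursively(folder_path):
    """Yield the paths of all files under folder_path, including nested ones."""
    for item in os.listdir(folder_path):
        full_path = os.path.join(folder_path, item)
        if os.path.isdir(full_path):
            # Recurse into each subdirectory
            yield from list_files_recursively(full_path)
        else:
            yield full_path

for path in list_files_recursively('./data'):
    print(path)
```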
Using glob for Pattern Matching
The glob module is incredibly useful when you want to read files that match a specific pattern, such as all .txt files or files following a naming convention. It supports wildcards like * (any sequence of characters) and ? (exactly one character), making it easy to filter what you need.
Try this:
```python
import glob

txt_files = glob.glob('./data/*.txt')

for file_path in txt_files:
    with open(file_path, 'r') as file:
        print(file.read())
```
This code reads every text file in the ./data folder. You can also use ** for recursive matching:
```python
all_text_files = glob.glob('./data/**/*.txt', recursive=True)
```
Pro tip: glob returns paths that already include the directory portion of your pattern, so you don't need to manually join directory and file names as you do with os.listdir().
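Two more details worth knowing: glob makes no ordering guarantee, so sort the results if order matters, and ? matches exactly one character. A small sketch (the report_?.txt naming scheme is just an assumed example):

```python
import glob

# Sort for a deterministic order; glob's result order depends on the file system
for file_path in sorted(glob.glob('./data/*.txt')):
    print(file_path)

# '?' matches a single character: report_1.txt matches, report_10.txt does not
single_digit_reports = glob.glob('./data/report_?.txt')
```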
Using pathlib for Modern File Handling
If you're using Python 3.4 or above, pathlib offers an object-oriented approach to file system paths. Many developers prefer it for its readability and convenience.
Here’s how you can use it:
```python
from pathlib import Path

folder = Path('./data')

for file_path in folder.glob('*.txt'):
    content = file_path.read_text()
    print(content)
```
You can also handle subdirectories easily:
```python
all_files = folder.rglob('*.txt')
```
This method is not only clean but also cross-platform, meaning it works the same on Windows, macOS, and Linux.
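Because glob() and rglob() yield Path objects rather than plain strings, you also get useful attributes without any string parsing. A quick sketch:

```python
from pathlib import Path

folder = Path('./data')

for file_path in folder.rglob('*.txt'):
    print(file_path.name)    # e.g. 'notes.txt'
    print(file_path.stem)    # e.g. 'notes'
    print(file_path.suffix)  # e.g. '.txt'
    print(file_path.parent)  # the containing directory
```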
Recursive Reading with os.walk()
When your files are organized in multiple nested directories, os.walk() is your best friend. It walks the directory tree for you, yielding a (root, dirs, files) tuple for every directory it visits.
Example:
```python
import os

for root, dirs, files in os.walk('./data'):
    for file_name in files:
        if file_name.endswith('.txt'):
            file_path = os.path.join(root, file_name)
            with open(file_path, 'r') as file:
                print(file.read())
```
This approach gives you full control over directory traversal and is very efficient for deeply nested structures.
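Part of that control: because os.walk() yields dirs before descending, you can prune the list in place to skip whole subtrees. A minimal sketch, where the skipped directory names are just assumptions:

```python
import os

SKIP_DIRS = {'.git', '__pycache__'}  # assumed names of directories to skip

for root, dirs, files in os.walk('./data'):
    # Modifying dirs in place tells os.walk() not to descend into these
    dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
    for file_name in files:
        print(os.path.join(root, file_name))
```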
To help you choose the best method, here’s a quick comparison:
- Use os.listdir() for simple, flat directory listings.
- Choose glob when you need pattern matching (add recursive=True with ** for nested files).
- Opt for pathlib for a modern, readable syntax.
- Rely on os.walk() for full recursive directory traversal.
Handling Different File Types
So far, we've focused on text files, but you might need to read other formats. Let’s look at a few common ones.
For CSV files, you can use the csv module:
```python
import csv
import glob

for csv_file in glob.glob('*.csv'):
    with open(csv_file, 'r', newline='') as file:  # newline='' is recommended by the csv docs
        reader = csv.reader(file)
        for row in reader:
            print(row)
```
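If your CSV files have a header row, csv.DictReader maps each row to a dictionary keyed by column name, which is often easier to work with:

```python
import csv
import glob

for csv_file in glob.glob('*.csv'):
    with open(csv_file, 'r', newline='') as file:
        reader = csv.DictReader(file)  # uses the first row as field names
        for row in reader:
            print(row)  # each row is a dict keyed by the header columns
```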
JSON files are also common:
```python
import json
import glob

for json_file in glob.glob('*.json'):
    with open(json_file, 'r') as file:
        data = json.load(file)
        print(data)
```
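If each file holds a JSON array of records (an assumption about your data layout), you might aggregate them into a single list:

```python
import json
import glob

all_records = []

for json_file in glob.glob('*.json'):
    with open(json_file, 'r', encoding='utf-8') as file:
        # Assumes each file contains a JSON array; adjust for other layouts
        all_records.extend(json.load(file))

print(f"Loaded {len(all_records)} records")
```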
Remember: Always handle exceptions when reading files—use try-except blocks to manage missing files or permission errors.
Reading Files in Parallel
If you're dealing with a large number of files, reading them sequentially can be slow. Because file reading is I/O-bound, Python's concurrent.futures module can overlap those waits by reading files in parallel, significantly speeding up the process.
Here's a basic example using ThreadPoolExecutor:
```python
from concurrent.futures import ThreadPoolExecutor
import glob

def read_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

file_paths = glob.glob('*.txt')

with ThreadPoolExecutor() as executor:
    contents = list(executor.map(read_file, file_paths))

for content in contents:
    print(content)
```
Note: Be cautious with system resource limits—opening too many files at once can cause issues.
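One way to stay within those limits is to cap the pool size; the worker count below is an arbitrary choice, not a recommendation:

```python
from concurrent.futures import ThreadPoolExecutor
import glob

def read_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

# Capping max_workers bounds how many files are open at once
with ThreadPoolExecutor(max_workers=8) as executor:
    contents = list(executor.map(read_file, glob.glob('*.txt')))
```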
Best Practices for Reading Multiple Files
When working with multiple files, it's important to follow some best practices to keep your code efficient and error-free.
- Always use context managers (with open(...)) to ensure files are closed properly.
- Handle exceptions for missing files, permission errors, or encoding issues.
- Use absolute paths for clarity, especially in larger projects.
- Consider memory usage: for very large files, read line by line or in chunks (see the sketch after this list).
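Here is a minimal sketch of both approaches; big_file.txt is a placeholder name, the 64 KB chunk size is an arbitrary choice, and the := syntax requires Python 3.8+:

```python
# Line by line: the file object is an iterator, so only one line is in memory
line_count = 0
with open('big_file.txt', 'r', encoding='utf-8') as file:
    for line in file:
        line_count += 1
print(f"{line_count} lines")

# Fixed-size chunks: useful when the file has no meaningful line breaks
CHUNK_SIZE = 64 * 1024  # 64 KB, an arbitrary choice
char_count = 0
with open('big_file.txt', 'r', encoding='utf-8') as file:
    while chunk := file.read(CHUNK_SIZE):
        char_count += len(chunk)
print(f"{char_count} characters")
```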
Here’s an example with error handling:
```python
import glob

for file_path in glob.glob('*.txt'):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            # process content
    except FileNotFoundError:
        print(f"{file_path} not found.")
    except PermissionError:
        print(f"Permission denied for {file_path}.")
    except UnicodeDecodeError:
        print(f"Encoding issue in {file_path}.")
```
By anticipating errors, you make your scripts more robust and user-friendly.
| Scenario | Recommended Approach |
|---|---|
| Flat directory | glob or pathlib |
| Nested directories | os.walk or pathlib.rglob |
| Pattern-based selection | glob |
| Large number of files | concurrent.futures |
| Cross-platform needs | pathlib |
Putting It All Together: A Complete Example
Let's write a script that reads all .txt files in a directory and its subdirectories, processes each one, and handles errors gracefully.
```python
from pathlib import Path
import sys

def process_file(file_path):
    try:
        content = file_path.read_text(encoding='utf-8')
        # Example: count lines
        line_count = len(content.splitlines())
        print(f"{file_path}: {line_count} lines")
    except Exception as e:
        print(f"Error reading {file_path}: {e}")

def main(directory):
    folder = Path(directory)
    txt_files = folder.rglob('*.txt')
    for file_path in txt_files:
        process_file(file_path)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        main('.')
```
You can run this from the command line and pass a directory path, or it defaults to the current directory.
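For instance, assuming you saved the script as read_txt.py (an assumed filename), you could run:

```
python read_txt.py ./data
```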
Summary of Key Points
- Use os.listdir() for simple cases, but remember to filter files from directories.
- glob is great for pattern matching and returns paths you can open directly.
- pathlib provides a modern, object-oriented interface.
- os.walk() is ideal for recursive directory traversal.
- Handle exceptions to make your code robust.
- For many files, consider parallel reading with concurrent.futures.
Experiment with these methods to see which fits your use case best. Each has its strengths, and often the best choice depends on your specific needs and personal preference.
Now you're equipped to handle multiple files in Python like a pro!