Example output
Here is a Python script that processes CSV files of customer data using `pandas`. The code is modular, handles common file and parsing errors, and stays efficient for typical file sizes.
```python
import pandas as pd
import os
def load_csv(filepath):
    """
    Load a CSV file into a pandas DataFrame, handling common file-related errors.

    :param filepath: Path to the CSV file
    :return: DataFrame containing the CSV data, or None if an error occurs
    """
    try:
        if not os.path.exists(filepath):
            raise FileNotFoundError(f"File {filepath} does not exist.")
        # low_memory=False reads the file in a single pass, which avoids
        # DtypeWarning on columns with mixed types.
        df = pd.read_csv(filepath, low_memory=False)
        print(f"File '{filepath}' loaded successfully.")
        return df
    except FileNotFoundError as e:
        print(f"Error: {e}")
    except pd.errors.EmptyDataError:
        print("Error: File is empty.")
    except pd.errors.ParserError:
        print("Error: Parsing the file failed. Ensure it's a well-formatted CSV.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    return None


def clean_data(df):
    """
    Clean the DataFrame by handling missing values and standardizing column names.

    :param df: Input DataFrame
    :return: Cleaned DataFrame
    """
    if df is None:
        return None
    # Strip leading/trailing whitespace from column names before any column lookups
    df.columns = df.columns.str.strip()
    # Drop rows where every value is missing
    df.dropna(how='all', inplace=True)
    # Fill missing values in 'customer_id' with a placeholder
    if 'customer_id' in df.columns:
        df['customer_id'] = df['customer_id'].fillna('Unknown')
    print("Data cleaning complete.")
    return df


def process_customer_data(df):
    """
    Process customer data to extract insights or perform transformations.

    :param df: Cleaned DataFrame
    :return: Processed DataFrame
    """
    if df is None:
        return None
    # Example: filter out rows where 'customer_id' is missing.
    # .copy() avoids SettingWithCopyWarning when new columns are added below.
    if 'customer_id' in df.columns:
        df = df[df['customer_id'] != 'Unknown'].copy()
    # Example: convert date columns to datetime; unparseable values become NaT
    if 'signup_date' in df.columns:
        df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
        # Example: add a column for customer tenure in days
        df['customer_tenure_days'] = (pd.Timestamp.now() - df['signup_date']).dt.days
    print("Customer data processing complete.")
    return df


def save_to_csv(df, output_filepath):
    """
    Save the processed DataFrame to a CSV file.

    :param df: DataFrame to be saved
    :param output_filepath: Output path for the CSV file
    """
    try:
        df.to_csv(output_filepath, index=False)
        print(f"Data saved to {output_filepath}")
    except Exception as e:
        print(f"Error saving the file: {e}")


def main(input_filepath, output_filepath):
    """
    Orchestrate the data processing steps.

    :param input_filepath: Path to the input CSV file
    :param output_filepath: Path to the output CSV file
    """
    # Step 1: Load CSV data
    df = load_csv(input_filepath)
    # Step 2: Clean the data
    df_cleaned = clean_data(df)
    # Step 3: Process the data
    df_processed = process_customer_data(df_cleaned)
    # Step 4: Save the processed data
    if df_processed is not None:
        save_to_csv(df_processed, output_filepath)
    else:
        print("No data to save.")


if __name__ == "__main__":
    # Example usage
    input_file = 'customers.csv'              # Replace with the actual input CSV path
    output_file = 'processed_customers.csv'   # Replace with the desired output CSV path
    main(input_file, output_file)
```
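As a quick sanity check, the snippet below builds a tiny input file and runs the pipeline end to end. It is a minimal sketch: the module name `customer_pipeline`, the file names, and the sample values are all placeholders chosen to match the columns the functions above expect, not part of any real dataset.

```python
import pandas as pd
from customer_pipeline import main  # hypothetical module name for the script above

# Build a tiny input file with the columns the pipeline expects.
sample = pd.DataFrame({
    'customer_id': ['C001', 'C002', None],
    'signup_date': ['2023-01-15', 'not-a-date', '2023-06-30'],
})
sample.to_csv('sample_customers.csv', index=False)

# Run the full load -> clean -> process -> save pipeline.
main('sample_customers.csv', 'sample_processed.csv')

# The row with the missing customer_id should have been filtered out,
# and the unparseable signup_date should yield an empty tenure value.
print(pd.read_csv('sample_processed.csv'))
```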
### Key Features:
1. **Modular Functions**:
   - Each function handles a specific task: loading, cleaning, processing, and saving data. This keeps the code reusable and easy to maintain.
2. **Error Handling**:
   - File-related errors (e.g., missing or empty files) and parsing failures are caught with `try`/`except` blocks, and the pipeline degrades gracefully by returning `None`.
3. **Efficient Memory Usage**:
   - The `low_memory=False` parameter in `pd.read_csv()` reads the file in a single pass, which avoids `DtypeWarning` on columns with mixed types; for files too large for a single pass, explicit dtypes or chunked reading are better options (see the sketch after the edge-case list below).
   - Cleaning steps reuse the same DataFrame rather than building a new copy at every stage.
4. **Edge Case Handling**:
   - Missing data is handled gracefully (`fillna`, `dropna`).
   - The code checks that columns such as `customer_id` and `signup_date` exist before operating on them.
5. **Clear Output**:
   - Status messages (`print()`) report when each stage of processing is complete.
   - Errors are reported clearly to aid debugging.
6. **Adaptability**:
   - The structure makes it easy to adapt the pipeline to different datasets: additional transformations can be added to `process_customer_data()` without touching the loading or saving logic, as shown in the sketch after this list.
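For example, one hypothetical extension is a helper that segments customers by tenure. The function name, tenure bands, and labels below are illustrative assumptions, not part of the original script:

```python
import pandas as pd

def add_tenure_segment(df):
    """Illustrative extension: bucket customers by tenure (bands are arbitrary)."""
    if df is None or 'customer_tenure_days' not in df.columns:
        return df
    df['tenure_segment'] = pd.cut(
        df['customer_tenure_days'],
        bins=[-1, 90, 365, float('inf')],
        labels=['new', 'established', 'long-term'],
    )
    return df
```

A helper like this could be called from `main()` between the processing and saving steps, keeping each transformation in its own function.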
### Edge Cases Handled:
- Missing columns like `customer_id` and `signup_date`.
- Empty or poorly formatted files.
- Mixed data types in the CSV (see the dtype/chunking sketch below).
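For files too large to load comfortably with `low_memory=False`, a common alternative is to declare dtypes up front or read the file in chunks. The sketch below assumes the same hypothetical column names used throughout; adjust the dtypes and chunk size to the actual data:

```python
import pandas as pd

# Option 1: declare dtypes explicitly so pandas never has to guess.
df = pd.read_csv(
    'customers.csv',
    dtype={'customer_id': 'string'},
    parse_dates=['signup_date'],
)

# Option 2: stream the file in chunks to bound peak memory usage.
chunks = []
for chunk in pd.read_csv('customers.csv', chunksize=100_000):
    chunks.append(chunk.dropna(how='all'))
df = pd.concat(chunks, ignore_index=True)
```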
This structure keeps the code maintainable and easy to extend for future data processing tasks.