Example input
[Dataset Type]: Tabular Datasets, Time Series Datasets, Spatial Datasets, Text Datasets, Image Datasets, Audio Datasets, Video Datasets, Graph Datasets
Example output
# Import necessary libraries
import pandas as pd
# Step 1: Load the dataset
# Replace 'your_dataset.csv' with the actual file path or URL
dataset_path = 'your_dataset.csv'
df = pd.read_csv(dataset_path)
# Step 2: Explore the dataset
# Check basic information about the dataset, such as columns, data types, and missing values
print("Dataset Overview:")
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed
# Step 3: Handling Missing Values
# Identify and handle missing values in the dataset
# Check for missing values in each column
missing_values = df.isnull().sum()
# Display columns with missing values
print("\nColumns with Missing Values:")
print(missing_values[missing_values > 0])
# Option 1: Remove rows with missing values
df_cleaned_rows = df.dropna()
# Option 2: Impute missing values
# Replace missing values with the mean, median, or a specific value
# Example: Fill missing values in the 'column_name' with the mean
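# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())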
# Step 4: Handling Duplicate Values
# Identify and handle duplicate values in the dataset
# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
# Display duplicate rows
print("\nDuplicate Rows:")
print(duplicate_rows)
# Remove duplicate rows (applied to df, which carries the imputed values from Option 2)
df_cleaned_duplicates = df.drop_duplicates()
# Step 5: Save the cleaned dataset
# Save the cleaned dataset to a new file
df_cleaned_duplicates.to_csv('cleaned_dataset.csv', index=False)
# Summary
print("\nCleaning Summary:")
print(f"Original dataset shape: {df.shape}")
print(f"Dataset shape after removing rows with missing values: {df_cleaned_rows.shape}")
print(f"Dataset shape after removing duplicate rows: {df_cleaned_duplicates.shape}")
# End of the cleaning process
print("\nCleaning process completed.")
Explanation of the Steps:
Load the Dataset: Use the pd.read_csv() function from the pandas library to load the dataset into a DataFrame.
Explore the Dataset: Use df.info() to get an overview of the dataset, including column names, data types, and the non-null count per column, which shows where values are missing.
Handling Missing Values:
Identify columns with missing values using df.isnull().sum().
Choose between removing rows with missing values (df.dropna()) or imputing missing values using methods like mean, median, or a specific value.
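The imputation option can be sketched a little further. The example below is only an illustration under assumptions: the helper name impute_missing, the sample columns "age" and "city", and the rule "median for numeric columns, most frequent value for everything else" are hypothetical choices, not part of the original script.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where numeric gaps are filled with the column median
    and non-numeric gaps with the most frequent value (hypothetical rule)."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # an all-missing column has no mode; leave it as-is
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Made-up sample data, for illustration only
sample = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Oslo", "Oslo", None, "Bergen"],
})
filled = impute_missing(sample)
print(filled)
```

The median is often preferred over the mean when a column contains outliers, since a few extreme values shift it far less. Working on a copy keeps the original DataFrame available for comparison.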
Handling Duplicate Values:
Identify and display duplicate rows using df[df.duplicated()].
Remove duplicate rows with df.drop_duplicates().
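Duplicates are not always exact row copies; sometimes only a key column repeats. The sketch below assumes a made-up orders table whose order_id column should be unique, and uses the subset and keep parameters of duplicated() and drop_duplicates() to handle that case. The column names and the "keep the last entry" policy are assumptions for illustration.

```python
import pandas as pd

# Made-up data: order 2 appears twice with different amounts
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ann", "bob", "bob", "cy"],
    "amount": [10.0, 20.0, 25.0, 30.0],
})

# Full-row comparison finds nothing, because the amounts differ
exact_dupes = orders[orders.duplicated()]

# Comparing on the key only; keep=False marks every row in a duplicated group
key_dupes = orders[orders.duplicated(subset=["order_id"], keep=False)]

# Keep the last row per order_id (assumed policy) and rebuild a clean index
deduped = orders.drop_duplicates(subset=["order_id"], keep="last").reset_index(drop=True)

print(key_dupes)
print(deduped)
```

Whether to keep the first or last occurrence depends on what the rows mean; inspecting key_dupes before dropping anything helps make that choice deliberately.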
Save the Cleaned Dataset: Save the cleaned dataset to a new CSV file using to_csv() with index=False, so the row index is not written as an extra column.
Cleaning Summary: Display the shapes of the dataset at different stages to summarize the cleaning process.
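Shapes alone do not show how many missing values or duplicates were actually resolved. As a minimal sketch, the hypothetical helper cleaning_report below compares a DataFrame before and after cleaning; its name, the returned keys, and the sample data are assumptions made for this example.

```python
import pandas as pd


def cleaning_report(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Summarize row counts, missing cells, and duplicate rows before and after cleaning."""
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_removed": len(before) - len(after),
        "missing_before": int(before.isnull().sum().sum()),
        "missing_after": int(after.isnull().sum().sum()),
        "duplicates_before": int(before.duplicated().sum()),
        "duplicates_after": int(after.duplicated().sum()),
    }


# Made-up data: one exact duplicate row and one missing value
raw = pd.DataFrame({"a": [1.0, 1.0, None, 4.0], "b": ["x", "x", "y", "z"]})
cleaned = raw.dropna().drop_duplicates()

report = cleaning_report(raw, cleaned)
for key, value in report.items():
    print(f"{key}: {value}")
```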