Data Cleaning

Name: Data Cleaning GPT Prompt
Brand: PromptBase
Price: 4.99 USD
Availability: InStock
Author: cognifyai

GPT-3.5-TURBO

Data Preprocessor

Over 1 month ago

Prompt Details

Model

Chat - GPT-3.5 (gpt-3.5-turbo)

Token size

88 ($0.00010 / call)

Example input

[Dataset Type]: Tabular Datasets, Time Series Datasets, Spatial Datasets, Text Datasets, Image Datasets, Audio Datasets, Video Datasets, Graph Datasets

Example output

    # Import necessary libraries
    import pandas as pd

    # Step 1: Load the dataset
    # Replace 'your_dataset.csv' with the actual file path or URL
    dataset_path = 'your_dataset.csv'
    df = pd.read_csv(dataset_path)

    # Step 2: Explore the dataset
    # df.info() prints columns, data types, and non-null counts itself,
    # so it does not need to be wrapped in print()
    print("Dataset Overview:")
    df.info()

    # Step 3: Handling Missing Values
    # Check for missing values in each column
    missing_values = df.isnull().sum()

    # Display columns with missing values
    print("\nColumns with Missing Values:")
    print(missing_values[missing_values > 0])

    # Option 1: Remove rows with missing values
    df_cleaned_rows = df.dropna()

    # Option 2: Impute missing values with the mean, median, or a specific value
    # Example: fill missing values in 'column_name' with the mean
    # (replace 'column_name' with a real numeric column in your dataset)
    df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

    # Step 4: Handling Duplicate Values
    # Check for duplicate rows
    duplicate_rows = df[df.duplicated()]

    # Display duplicate rows
    print("\nDuplicate Rows:")
    print(duplicate_rows)

    # Remove duplicate rows (applied to df after imputation, not to df_cleaned_rows)
    df_cleaned_duplicates = df.drop_duplicates()

    # Step 5: Save the cleaned dataset to a new file
    df_cleaned_duplicates.to_csv('cleaned_dataset.csv', index=False)

    # Summary
    print("\nCleaning Summary:")
    print(f"Original dataset shape: {df.shape}")
    print(f"Dataset shape after removing rows with missing values: {df_cleaned_rows.shape}")
    print(f"Dataset shape after removing duplicate rows: {df_cleaned_duplicates.shape}")

    # End of the cleaning process
    print("\nCleaning process completed.")

Explanation of the Steps:

Load the Dataset: Use the pd.read_csv() function from the pandas library to load the dataset into a DataFrame.

Explore the Dataset: Use df.info() to get an overview of the dataset, including column names, data types, and non-null counts, which reveal where values are missing.

Handling Missing Values: Identify columns with missing values using df.isnull().sum(). Then choose between removing rows with missing values (df.dropna()) and imputing them with the mean, median, or a specific value. The two options are alternatives: the script computes both, but only the imputed DataFrame is carried forward to the next step.

Handling Duplicate Values: Identify and display duplicate rows using df[df.duplicated()]. Remove duplicate rows with df.drop_duplicates().

Save the Cleaned Dataset: Save the cleaned dataset to a new CSV file using to_csv() with index=False so the row index is not written as an extra column.

Cleaning Summary: Display the shapes of the dataset at different stages to summarize the cleaning process.
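
The explanation leaves open how to choose an imputation method per column. The following is a minimal sketch of one possible approach, not part of the original prompt output: numeric columns are filled with the median, other columns with the most frequent value, and duplicate rows are dropped afterwards. The function name clean_frame, the sample table, and its column names (city, temp_c) are all hypothetical and invented for illustration.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values column by column, then drop duplicate rows.

    Hypothetical helper: median for numeric columns, mode for the rest.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Non-numeric column: fill gaps with the most frequent value.
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    # Imputation can create new duplicates, so deduplicate last.
    return out.drop_duplicates().reset_index(drop=True)


# Hypothetical sample data, made up for this example only.
sample = pd.DataFrame({
    "city": ["Oslo", "Oslo", None, "Bergen", "Oslo"],
    "temp_c": [4.0, 4.0, 6.0, None, 4.0],
})
cleaned = clean_frame(sample)
print(cleaned)
```

Deduplicating after imputation is a design choice: two rows that differed only by a missing value can become identical once the gap is filled, so running drop_duplicates() first would miss them.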