Example input
**[Dataset Type]:** Tabular Datasets
**[Details]:**
- File: [your_dataset.csv]
- Columns: Numeric, Date, Text, ...
**[Expectations]:**
- Python script for data cleaning.
- Handle type conversions, categorical variables, text encoding, and time-related features using pandas.
Example output
Certainly! Below is an extended Python script for data cleaning, with an emphasis on converting data types. Comments explain each step, including handling categorical variables, encoding text data, and transforming time-related features. The example uses pandas for data manipulation and scikit-learn's `LabelEncoder` for label encoding.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load your dataset (replace 'your_dataset.csv' with your actual file)
df = pd.read_csv('your_dataset.csv')
# Display the initial state of the dataset
print("Initial Dataset:")
print(df.head())
# Explore the data types of each column
print("\nData Types:")
print(df.dtypes)
# -----------------------------------------------------------------------------
# Data Type Conversion
# -----------------------------------------------------------------------------
# Convert specific columns to desired data types
df['NumericColumn'] = pd.to_numeric(df['NumericColumn'], errors='coerce')
df['DateColumn'] = pd.to_datetime(df['DateColumn'], errors='coerce')
# -----------------------------------------------------------------------------
# Handling Categorical Variables
# -----------------------------------------------------------------------------
# Identify categorical (object-typed) columns, leaving 'TextColumn' for one-hot encoding below
categorical_columns = df.select_dtypes(include=['object']).columns.drop('TextColumn', errors='ignore')
# Use Label Encoding for categorical columns (the encoder is refit for each column)
label_encoder = LabelEncoder()
for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])
# -----------------------------------------------------------------------------
# Encoding Text Data
# -----------------------------------------------------------------------------
# Example: One-Hot Encoding for a specific text column
text_column = 'TextColumn'
df = pd.get_dummies(df, columns=[text_column], prefix=text_column)
# -----------------------------------------------------------------------------
# Transforming Time-Related Features
# -----------------------------------------------------------------------------
# Extract features from DateColumn
df['Year'] = df['DateColumn'].dt.year
df['Month'] = df['DateColumn'].dt.month
df['Day'] = df['DateColumn'].dt.day
df['Weekday'] = df['DateColumn'].dt.weekday
# Drop the original DateColumn if needed
df = df.drop('DateColumn', axis=1)
# -----------------------------------------------------------------------------
# Display the cleaned dataset
# -----------------------------------------------------------------------------
print("\nCleaned Dataset:")
print(df.head())
# Save the cleaned dataset to a new CSV file
df.to_csv('cleaned_dataset.csv', index=False)
print("\nCleaned dataset saved to 'cleaned_dataset.csv'")
```
Make sure to replace `'your_dataset.csv'` with the actual path or URL of your dataset file. Additionally, adjust column names and data types according to your dataset.
This script covers basic data type conversions, handling categorical variables using Label Encoding, one-hot encoding text data, and transforming time-related features. Feel free to customize it further based on your specific dataset and cleaning requirements.
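Because `errors='coerce'` silently replaces unparseable values with `NaN` or `NaT`, it can be worth checking how many values were lost during conversion. The following is a minimal sketch of one way to do that; the sample data and the `count_coerced` helper are hypothetical and not part of the script above.

```python
import pandas as pd

# Hypothetical sample data standing in for 'your_dataset.csv'
df = pd.DataFrame({
    "NumericColumn": ["1", "2.5", "abc", None],
    "DateColumn": ["2023-01-15", "not a date", "2023-03-01", None],
})


def count_coerced(original: pd.Series, converted: pd.Series) -> int:
    """Count values that were present before conversion but missing after it."""
    return int((original.notna() & converted.isna()).sum())


# Same conversions as in the main script, kept in separate variables
# so the original values remain available for comparison
numeric = pd.to_numeric(df["NumericColumn"], errors="coerce")
dates = pd.to_datetime(df["DateColumn"], errors="coerce")

bad_numeric = count_coerced(df["NumericColumn"], numeric)
bad_dates = count_coerced(df["DateColumn"], dates)

print(f"Unparseable numeric values: {bad_numeric}")
print(f"Unparseable date values: {bad_dates}")
```

If either count is unexpectedly high, inspecting the offending rows before overwriting the original column may be safer than coercing them away.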