Certainly! Below is a SQL query that performs the required cleaning operations on your table. It targets SQLite, per the parameters below, and relies on common table expressions (CTEs) and window functions, which require SQLite 3.25 or later. The cleaning steps run in this order: retain entities that meet the data completeness ratio, remove duplicates, replace outliers using Tukey's method, and fill in any remaining missing values.
database_engine: sqlite
database_name: bi_records
table_name: fact_customer_revenue_sales
time_component: day
entity_identifier: customer_id
metric_values: 'sales', 'revenue'
data_completeness_ratio: 80%
-- Step 1: Filter entities based on data completeness
WITH entity_counts AS (
SELECT
customer_id,
COUNT(*) AS total_count,
COUNT(sales) AS non_null_sales_count,
COUNT(revenue) AS non_null_revenue_count
FROM
bi_records.fact_customer_revenue_sales
GROUP BY
customer_id
),
valid_entities AS (
SELECT
customer_id
FROM
entity_counts
WHERE
non_null_sales_count >= total_count * 0.80
AND non_null_revenue_count >= total_count * 0.80
),
-- Step 2: Remove duplicates
deduplicated_table AS (
SELECT
day,
customer_id,
sales,
revenue
FROM
bi_records.fact_customer_revenue_sales
WHERE
customer_id IN (SELECT customer_id FROM valid_entities)
GROUP BY
customer_id, day
),
-- Step 3: Calculate quartiles and IQR for each customer
sales_ranks AS (
SELECT
customer_id,
sales,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sales) AS rank,
COUNT(*) OVER (PARTITION BY customer_id) AS total_sales
FROM
deduplicated_table
WHERE
sales IS NOT NULL -- NULLs sort first in SQLite and would shift the quartile ranks
),
revenue_ranks AS (
SELECT
customer_id,
revenue,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY revenue) AS rank,
COUNT(*) OVER (PARTITION BY customer_id) AS total_revenue
FROM
deduplicated_table
WHERE
revenue IS NOT NULL -- NULLs sort first in SQLite and would shift the quartile ranks
),
sales_quartiles AS (
SELECT
customer_id,
MAX(CASE WHEN rank = (total_sales + 3) / 4 THEN sales END) AS Q1_sales,
MAX(CASE WHEN rank = (total_sales * 3 + 3) / 4 THEN sales END) AS Q3_sales
FROM
sales_ranks
GROUP BY
customer_id
),
revenue_quartiles AS (
SELECT
customer_id,
MAX(CASE WHEN rank = (total_revenue + 3) / 4 THEN revenue END) AS Q1_revenue,
MAX(CASE WHEN rank = (total_revenue * 3 + 3) / 4 THEN revenue END) AS Q3_revenue
FROM
revenue_ranks
GROUP BY
customer_id
),
iqr_values AS (
SELECT
sq.customer_id,
Q1_sales,
Q3_sales,
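(Q3_sales - Q1_sales) * 1.5 AS IQR_sales, -- 1.5 * IQR (Tukey fence width), not the raw IQR
Q1_revenue,
Q3_revenue,
(Q3_revenue - Q1_revenue) * 1.5 AS IQR_revenue -- 1.5 * IQR (Tukey fence width), not the raw IQR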
FROM
sales_quartiles sq
JOIN
revenue_quartiles rq ON sq.customer_id = rq.customer_id
),
-- Step 4: Replace outliers with quartile values
outlier_replaced AS (
SELECT
dt.day,
dt.customer_id,
CASE
WHEN dt.sales < iq.Q1_sales - iq.IQR_sales THEN iq.Q1_sales
WHEN dt.sales > iq.Q3_sales + iq.IQR_sales THEN iq.Q3_sales
ELSE dt.sales
END AS sales,
CASE
WHEN dt.revenue < iq.Q1_revenue - iq.IQR_revenue THEN iq.Q1_revenue
WHEN dt.revenue > iq.Q3_revenue + iq.IQR_revenue THEN iq.Q3_revenue
ELSE dt.revenue
END AS revenue
FROM
deduplicated_table dt
JOIN
iqr_values iq ON dt.customer_id = iq.customer_id
),
-- Step 5: Fill missing values
filled_missing_values AS (
SELECT
day,
customer_id,
COALESCE(sales, (SELECT AVG(sales) FROM outlier_replaced o WHERE o.customer_id = dt.customer_id)) AS sales,
COALESCE(revenue, (SELECT AVG(revenue) FROM outlier_replaced o WHERE o.customer_id = dt.customer_id)) AS revenue
FROM
outlier_replaced dt
)
-- Step 6: Final output sorted by customer_id and day
SELECT
day,
customer_id,
sales,
revenue
FROM
filled_missing_values
ORDER BY
customer_id, day;
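If you need the first occurrence of each (customer_id, day) pair rather than whichever row SQLite picks under GROUP BY, one option is to number the rows within each pair and keep row 1. The sketch below is a minimal illustration run through Python's sqlite3 module. The table name fact, the sample rows, and the use of rowid as a stand-in for insertion order are assumptions for the example, not part of your schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for fact_customer_revenue_sales.
conn.execute(
    "CREATE TABLE fact (day TEXT, customer_id INTEGER, sales REAL, revenue REAL)"
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", 1, 10.0, 100.0),
        ("2024-01-01", 1, 99.0, 990.0),  # later duplicate of the same customer/day
        ("2024-01-02", 1, 12.0, 120.0),
    ],
)

# Number rows within each (customer_id, day) by rowid and keep the first one.
deduped = conn.execute(
    """
    WITH numbered AS (
        SELECT day, customer_id, sales, revenue,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, day ORDER BY rowid
               ) AS rn
        FROM fact
    )
    SELECT day, customer_id, sales, revenue
    FROM numbered
    WHERE rn = 1
    ORDER BY customer_id, day
    """
).fetchall()

print(deduped)
```

Ordering by rowid assumes rows were inserted in the order you consider "first"; if the table has a load timestamp, ordering by that column would be the safer choice.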
Explanation of the SQL Query:
entity_counts: Calculates the total count and non-NULL count of metric values for each entity.
valid_entities: Filters out entities that do not meet the data completeness ratio.
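deduplicated_table: Restricts the main table to the valid entities and collapses duplicate rows per entity_identifier and time_component. SQLite allows the bare sales and revenue columns under GROUP BY and takes them from an arbitrary row in each group, so the surviving row is not guaranteed to be the first occurrence.
sales_ranks, revenue_ranks: Rank each entity's metric values in ascending order and count them.
sales_quartiles, revenue_quartiles: Pick the first and third quartiles (Q1 and Q3) for each entity by nearest rank, at positions ceil(n/4) and ceil(3n/4).
iqr_values: Computes 1.5 times the interquartile range (IQR) for each entity, which is the width of the Tukey fences. The columns named IQR_sales and IQR_revenue hold this scaled value, not the raw IQR.
outlier_replaced: Replaces values below Q1 - 1.5 * IQR with Q1 and values above Q3 + 1.5 * IQR with Q3; all other values pass through unchanged.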
filled_missing_values: Fills any remaining missing values with the average metric value for the respective entity.
Final SELECT: Retrieves the cleaned data, ordered by entity_identifier and time_component.
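To make the quartile and outlier steps concrete, here is a small worked example run through Python's sqlite3 module. It mirrors the rank and quartile CTEs on a single metric; the table t, its eight sample values, and the short column names (rn, n, q1, q3) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id INTEGER, sales REAL)")
# Hypothetical values for one customer; 100 is the intended outlier.
conn.executemany(
    "INSERT INTO t VALUES (1, ?)",
    [(v,) for v in (10, 12, 11, 13, 12, 100, 11, 12)],
)

# Same nearest-rank logic as sales_ranks and sales_quartiles, NULLs excluded.
CTES = """
WITH ranks AS (
    SELECT customer_id, sales,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sales) AS rn,
           COUNT(*) OVER (PARTITION BY customer_id) AS n
    FROM t
    WHERE sales IS NOT NULL
),
quartiles AS (
    SELECT customer_id,
           MAX(CASE WHEN rn = (n + 3) / 4 THEN sales END) AS q1,
           MAX(CASE WHEN rn = (n * 3 + 3) / 4 THEN sales END) AS q3
    FROM ranks
    GROUP BY customer_id
)
"""

# Q1, Q3, and the fence width (1.5 * IQR).
bounds = conn.execute(
    CTES + "SELECT q1, q3, (q3 - q1) * 1.5 FROM quartiles"
).fetchone()

# Replace values outside the fences with Q1 or Q3, in insertion order.
cleaned = [
    row[0]
    for row in conn.execute(
        CTES
        + """
        SELECT CASE
                 WHEN t.sales < q.q1 - (q.q3 - q.q1) * 1.5 THEN q.q1
                 WHEN t.sales > q.q3 + (q.q3 - q.q1) * 1.5 THEN q.q3
                 ELSE t.sales
               END
        FROM t JOIN quartiles q ON q.customer_id = t.customer_id
        ORDER BY t.rowid
        """
    )
]

print(bounds)
print(cleaned)
```

With eight values, Q1 sits at rank 2 (11) and Q3 at rank 6 (12), so the fence width is 1.5 and the bounds are 9.5 and 13.5. Only 100 falls outside them, and it is replaced by Q3.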
Additional Notes:
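The query already has database_name, table_name, time_component, entity_identifier, and metric_values substituted from the parameters above; change them if your actual table and column names differ. In SQLite, the bi_records prefix resolves only if the database is attached under that schema name; otherwise drop the prefix or use main.
The query computes the quartiles by nearest rank with ROW_NUMBER rather than PERCENTILE_CONT, which is not a core SQLite function. On PostgreSQL or another engine that supports it, PERCENTILE_CONT can replace the rank CTEs, though it interpolates between values and can give slightly different quartiles.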
The method used to fill missing values is the average of the metric values for each entity.
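As a quick illustration of the fill step, the sketch below runs the same COALESCE pattern on one metric through Python's sqlite3 module. The sample rows are invented, and the table is simply named outlier_replaced to match the CTE it stands in for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the outlier_replaced CTE, with one metric only.
conn.execute(
    "CREATE TABLE outlier_replaced (day TEXT, customer_id INTEGER, sales REAL)"
)
conn.executemany(
    "INSERT INTO outlier_replaced VALUES (?, ?, ?)",
    [
        ("2024-01-01", 1, 10.0),
        ("2024-01-02", 1, None),  # missing value for customer 1
        ("2024-01-03", 1, 20.0),
        ("2024-01-01", 2, 5.0),
        ("2024-01-02", 2, None),  # missing value for customer 2
    ],
)

# AVG ignores NULLs, so each gap gets the mean of that customer's known values.
filled = conn.execute(
    """
    SELECT day, customer_id,
           COALESCE(
               sales,
               (SELECT AVG(o.sales)
                FROM outlier_replaced o
                WHERE o.customer_id = dt.customer_id)
           ) AS sales
    FROM outlier_replaced dt
    ORDER BY customer_id, day
    """
).fetchall()

print(filled)
```

Customer 1's gap is filled with 15.0 (the mean of 10 and 20) and customer 2's with 5.0. If an entity has no non-NULL values at all, AVG returns NULL and the gap stays unfilled.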
Transparency and Validation:
You can run additional queries to check the number of rows removed or retained:
-- Total rows before cleaning
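SELECT COUNT(*) FROM bi_records.fact_customer_revenue_sales;
-- Total rows after cleaning: use this in place of the final SELECT of the main query,
-- because the CTEs are not visible to a separate statement
SELECT COUNT(*) FROM filled_missing_values;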
These queries will give you the number of rows before and after the cleaning process.
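Because a CTE only exists inside the statement that defines it, one way to get both counts at once is to select them as scalar subqueries after the WITH chain. The sketch below does this through Python's sqlite3 module with a shortened chain (completeness filter and deduplication only); the sample rows and the shortened aliases are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_customer_revenue_sales "
    "(day TEXT, customer_id INTEGER, sales REAL, revenue REAL)"
)
conn.executemany(
    "INSERT INTO fact_customer_revenue_sales VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", 1, 10.0, 100.0),
        ("2024-01-01", 1, 10.0, 100.0),  # duplicate row
        ("2024-01-02", 1, 12.0, 120.0),
        ("2024-01-01", 2, None, None),   # customer 2 is only 50% complete
        ("2024-01-02", 2, 5.0, 50.0),
    ],
)

# Count rows before and after within one statement, so the CTEs are in scope.
before_rows, after_rows = conn.execute(
    """
    WITH entity_counts AS (
        SELECT customer_id,
               COUNT(*) AS total_count,
               COUNT(sales) AS non_null_sales_count,
               COUNT(revenue) AS non_null_revenue_count
        FROM fact_customer_revenue_sales
        GROUP BY customer_id
    ),
    valid_entities AS (
        SELECT customer_id
        FROM entity_counts
        WHERE non_null_sales_count >= total_count * 0.80
          AND non_null_revenue_count >= total_count * 0.80
    ),
    deduplicated_table AS (
        SELECT day, customer_id
        FROM fact_customer_revenue_sales
        WHERE customer_id IN (SELECT customer_id FROM valid_entities)
        GROUP BY customer_id, day
    )
    SELECT (SELECT COUNT(*) FROM fact_customer_revenue_sales),
           (SELECT COUNT(*) FROM deduplicated_table)
    """
).fetchone()

print(before_rows, after_rows)
```

Here five rows go in and two come out: customer 2 is dropped by the 80% completeness rule, and customer 1's duplicate day collapses to one row. The same pattern works with the full chain by counting filled_missing_values instead.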