PromptBase
SQL Query for Data Cleaning

This prompt is designed to assist users in generating SQL queries that clean and prepare time series data for analysis. It guides the user through specifying their database table structure and the desired cleaning steps. By providing this information, users can automatically generate SQL queries that perform the following actions: Data Completeness Filtering, Outlier Removal, Handling Duplicates, Data Validation, Missing Values Handling, Sorting and Organizing.
2 months ago
Click an example to load it into the app
Certainly! Below is a SQL query that performs the required cleaning operations on your table. This query assumes a database that supports common table expressions (CTEs) and window functions. Because the target engine is SQLite, which lacks PERCENTILE_CONT(), quartiles are approximated with a ROW_NUMBER()-based nearest-rank method. The cleaning steps include retaining entities based on data completeness, handling outliers using Tukey's method, removing duplicates, and filling in any remaining missing values.

database_engine: sqlite
database_name: bi_records
table_name: fact_customer_revenue_sales
time_component: day
entity_identifier: customer_id
metric_values: 'sales', 'revenue'
data_completeness_ratio: 80%

-- Step 1: Filter entities based on data completeness
WITH entity_counts AS (
    SELECT
        customer_id,
        COUNT(*) AS total_count,
        COUNT(sales) AS non_null_sales_count,
        COUNT(revenue) AS non_null_revenue_count
    FROM bi_records.fact_customer_revenue_sales
    GROUP BY customer_id
),
valid_entities AS (
    SELECT customer_id
    FROM entity_counts
    WHERE non_null_sales_count >= total_count * 0.80
      AND non_null_revenue_count >= total_count * 0.80
),
-- Step 2: Remove duplicates (keep one row per customer_id/day pair)
deduplicated_table AS (
    SELECT day, customer_id, sales, revenue
    FROM (
        SELECT
            day, customer_id, sales, revenue,
            ROW_NUMBER() OVER (PARTITION BY customer_id, day) AS row_num
        FROM bi_records.fact_customer_revenue_sales
        WHERE customer_id IN (SELECT customer_id FROM valid_entities)
    ) AS ranked_data
    WHERE row_num = 1
),
-- Step 3: Calculate quartiles and IQR for each customer
sales_ranks AS (
    SELECT
        customer_id, sales,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sales) AS rank,
        COUNT(sales) OVER (PARTITION BY customer_id) AS total_sales
    FROM deduplicated_table
    WHERE sales IS NOT NULL  -- NULLs would distort the quartile ranks
),
revenue_ranks AS (
    SELECT
        customer_id, revenue,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY revenue) AS rank,
        COUNT(revenue) OVER (PARTITION BY customer_id) AS total_revenue
    FROM deduplicated_table
    WHERE revenue IS NOT NULL
),
sales_quartiles AS (
    SELECT
        customer_id,
        MAX(CASE WHEN rank = (total_sales + 3) / 4 THEN sales END) AS Q1_sales,
        MAX(CASE WHEN rank = (total_sales * 3 + 3) / 4 THEN sales END) AS Q3_sales
    FROM sales_ranks
    GROUP BY customer_id
),
revenue_quartiles AS (
    SELECT
        customer_id,
        MAX(CASE WHEN rank = (total_revenue + 3) / 4 THEN revenue END) AS Q1_revenue,
        MAX(CASE WHEN rank = (total_revenue * 3 + 3) / 4 THEN revenue END) AS Q3_revenue
    FROM revenue_ranks
    GROUP BY customer_id
),
iqr_values AS (
    SELECT
        sq.customer_id,
        Q1_sales, Q3_sales, (Q3_sales - Q1_sales) * 1.5 AS IQR_sales,
        Q1_revenue, Q3_revenue, (Q3_revenue - Q1_revenue) * 1.5 AS IQR_revenue
    FROM sales_quartiles sq
    JOIN revenue_quartiles rq ON sq.customer_id = rq.customer_id
),
-- Step 4: Replace outliers with quartile values (Tukey's fences)
outlier_replaced AS (
    SELECT
        dt.day,
        dt.customer_id,
        CASE
            WHEN dt.sales < iq.Q1_sales - iq.IQR_sales THEN iq.Q1_sales
            WHEN dt.sales > iq.Q3_sales + iq.IQR_sales THEN iq.Q3_sales
            ELSE dt.sales
        END AS sales,
        CASE
            WHEN dt.revenue < iq.Q1_revenue - iq.IQR_revenue THEN iq.Q1_revenue
            WHEN dt.revenue > iq.Q3_revenue + iq.IQR_revenue THEN iq.Q3_revenue
            ELSE dt.revenue
        END AS revenue
    FROM deduplicated_table dt
    JOIN iqr_values iq ON dt.customer_id = iq.customer_id
),
-- Step 5: Fill missing values with the per-customer average
filled_missing_values AS (
    SELECT
        day,
        customer_id,
        COALESCE(sales, (SELECT AVG(sales) FROM outlier_replaced o
                         WHERE o.customer_id = dt.customer_id)) AS sales,
        COALESCE(revenue, (SELECT AVG(revenue) FROM outlier_replaced o
                           WHERE o.customer_id = dt.customer_id)) AS revenue
    FROM outlier_replaced dt
)
-- Step 6: Final output sorted by customer_id and day
SELECT day, customer_id, sales, revenue
FROM filled_missing_values
ORDER BY customer_id, day;

Explanation of the SQL Query:
- entity_counts: calculates the total row count and the non-NULL count of each metric for every customer.
- valid_entities: keeps only customers whose non-NULL counts meet the 80% data-completeness ratio.
- deduplicated_table: restricts the table to valid customers and keeps a single row per (customer_id, day) pair using ROW_NUMBER().
- sales_ranks / revenue_ranks: rank each customer's non-NULL metric values so quartile positions can be located.
- sales_quartiles / revenue_quartiles: pick the first and third quartiles (Q1 and Q3) using a nearest-rank approximation.
- iqr_values: computes 1.5 times the interquartile range (IQR) for each customer.
- outlier_replaced: clamps values outside Tukey's fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) to Q1 or Q3.
- filled_missing_values: fills any remaining missing values with the average metric value for the respective customer.
- Final SELECT: retrieves the cleaned data, ordered by customer_id and day.

Additional Notes:
- The provided values (bi_records, fact_customer_revenue_sales, day, customer_id, sales, revenue, 80%) have already been substituted; adjust them if your schema differs. In SQLite, the schema-qualified name bi_records.fact_customer_revenue_sales resolves only if that database has been ATTACHed under that name.
- SQLite does not provide PERCENTILE_CONT(), so quartiles are approximated via ranked rows; on engines that support it (e.g. PostgreSQL), PERCENTILE_CONT() is the more precise choice.
- The method used to fill missing values is the average of the metric values for each customer.

Transparency and Validation:
You can run additional queries to check the number of rows removed or retained:

-- Total rows before cleaning
SELECT COUNT(*) FROM bi_records.fact_customer_revenue_sales;

-- Total rows after cleaning
SELECT COUNT(*) FROM filled_missing_values;

Note that filled_missing_values is a CTE, so the second count must be run as the final SELECT inside the same WITH clause. Together these give you the number of rows before and after the cleaning process.
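Since this first example targets SQLite, the completeness-filtering step (Step 1) can be sanity-checked from Python with the built-in sqlite3 module. This is a minimal sketch on an in-memory toy table; the table and column names mirror the example, but the data is made up:

```python
import sqlite3

# In-memory database with a toy version of fact_customer_revenue_sales
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_customer_revenue_sales (
        day TEXT, customer_id TEXT, sales REAL, revenue REAL
    )
""")
rows = [
    # customer 'A': 5 days, all metrics present -> 100% complete, kept
    *[("2024-01-0%d" % d, "A", 10.0 * d, 100.0 * d) for d in range(1, 6)],
    # customer 'B': 5 days, 3 NULL sales -> 40% complete, dropped at 80%
    ("2024-01-01", "B", None, 50.0),
    ("2024-01-02", "B", None, 60.0),
    ("2024-01-03", "B", None, 70.0),
    ("2024-01-04", "B", 5.0, 80.0),
    ("2024-01-05", "B", 6.0, 90.0),
]
conn.executemany("INSERT INTO fact_customer_revenue_sales VALUES (?, ?, ?, ?)", rows)

# Step 1 of the cleaning query: keep entities whose non-NULL metric
# counts reach the 80% data-completeness ratio
valid = conn.execute("""
    SELECT customer_id
    FROM (
        SELECT customer_id,
               COUNT(*) AS total_count,
               COUNT(sales) AS non_null_sales_count,
               COUNT(revenue) AS non_null_revenue_count
        FROM fact_customer_revenue_sales
        GROUP BY customer_id
    )
    WHERE non_null_sales_count >= total_count * 0.80
      AND non_null_revenue_count >= total_count * 0.80
""").fetchall()
print(valid)  # [('A',)] -- customer 'B' is filtered out
```

COUNT(column) skips NULLs while COUNT(*) does not, which is exactly what makes this completeness check work.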
Certainly! Here's the query with the provided entry values:

-- Step 1: Filter entities based on data completeness
WITH valid_entities AS (
    SELECT customer_id
    FROM (
        SELECT
            customer_id,
            COUNT(*) AS total_count,
            COUNT(sales) AS non_null_sales_count,
            COUNT(revenue_euro) AS non_null_revenue_count
        FROM bi_records.fact_customer_revenue_sales
        GROUP BY customer_id
    ) AS entity_counts
    WHERE non_null_sales_count >= total_count * 0.80
      AND non_null_revenue_count >= total_count * 0.80
),
-- Step 2: Remove duplicates
deduplicated_table AS (
    SELECT day, customer_id, sales, revenue_euro
    FROM (
        SELECT
            day, customer_id, sales, revenue_euro,
            ROW_NUMBER() OVER (PARTITION BY customer_id, day ORDER BY RANDOM()) AS row_num
        FROM bi_records.fact_customer_revenue_sales
        WHERE customer_id IN (SELECT customer_id FROM valid_entities)
    ) AS ranked_data
    WHERE row_num = 1
),
-- Step 3: Calculate quartiles for each customer
sales_quartiles AS (
    SELECT
        customer_id,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales) AS Q1_sales,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sales) AS Q3_sales
    FROM deduplicated_table
    GROUP BY customer_id
),
revenue_quartiles AS (
    SELECT
        customer_id,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue_euro) AS Q1_revenue,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue_euro) AS Q3_revenue
    FROM deduplicated_table
    GROUP BY customer_id
),
-- Step 4: Replace outliers with quartile values (Tukey's fences)
outlier_replaced AS (
    SELECT
        dt.day,
        dt.customer_id,
        CASE
            WHEN dt.sales < sq.Q1_sales - (sq.Q3_sales - sq.Q1_sales) * 1.5 THEN sq.Q1_sales
            WHEN dt.sales > sq.Q3_sales + (sq.Q3_sales - sq.Q1_sales) * 1.5 THEN sq.Q3_sales
            ELSE dt.sales
        END AS sales,
        CASE
            WHEN dt.revenue_euro < rq.Q1_revenue - (rq.Q3_revenue - rq.Q1_revenue) * 1.5 THEN rq.Q1_revenue
            WHEN dt.revenue_euro > rq.Q3_revenue + (rq.Q3_revenue - rq.Q1_revenue) * 1.5 THEN rq.Q3_revenue
            ELSE dt.revenue_euro
        END AS revenue_euro
    FROM deduplicated_table dt
    JOIN sales_quartiles sq ON dt.customer_id = sq.customer_id
    JOIN revenue_quartiles rq ON dt.customer_id = rq.customer_id
),
-- Step 5: Fill missing values
filled_missing_values AS (
    SELECT
        day,
        customer_id,
        COALESCE(sales, AVG(sales) OVER (PARTITION BY customer_id)) AS sales,
        COALESCE(revenue_euro, AVG(revenue_euro) OVER (PARTITION BY customer_id)) AS revenue_euro
    FROM outlier_replaced
)
-- Step 6: Final output sorted by customer_id and day
SELECT day, customer_id, sales, revenue_euro
FROM filled_missing_values
ORDER BY customer_id, day;

Explanation:
- Valid Entities: the completeness counts are computed in a subquery nested inside the valid_entities CTE, so no separate counting CTE is needed.
- Deduplication: ROW_NUMBER() keeps a single row per (customer_id, day) pair; ORDER BY RANDOM() picks the surviving duplicate at random, so order by a real column instead if you need deterministic results.
- Quartile Calculation: Q1 and Q3 are computed per customer with the PERCENTILE_CONT() ordered-set aggregate, grouped by customer_id.
- Outlier Replacement: values outside Tukey's fences are clamped to Q1 or Q3 in the outlier_replaced CTE.
- Missing Value Handling: remaining NULLs are filled using the AVG() window function over partitions of customer_id.
- Final Output: the cleaned rows are selected and ordered by customer_id and day.
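The outlier-replacement rule in Step 4 (Tukey's fences) is easy to verify outside the database. A minimal Python sketch of the same clamping logic, using made-up quartile values:

```python
def tukey_clamp(value, q1, q3):
    """Clamp a value to Tukey's fences: anything below Q1 - 1.5*IQR is
    replaced by Q1, anything above Q3 + 1.5*IQR is replaced by Q3,
    mirroring the CASE expression in the outlier_replaced CTE."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    if value < lower:
        return q1
    if value > upper:
        return q3
    return value

# With Q1=100 and Q3=200, the fences are -50 and 350
print(tukey_clamp(500.0, 100.0, 200.0))  # 200.0 (high outlier -> Q3)
print(tukey_clamp(-80.0, 100.0, 200.0))  # 100.0 (low outlier -> Q1)
print(tukey_clamp(150.0, 100.0, 200.0))  # 150.0 (within fences, unchanged)
```

Note that clamping replaces outliers with the quartile itself rather than the fence, which matches the SQL above but is a design choice worth confirming against your requirements.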
Sure! Here's the query for Vertica with the provided entry values:

database_engine: vertica
database_name: bi_car_brand_stats
table_name: fact_car_sales
time_component: day
entity_identifier: car_saloon_store
metric_values: 'sales'
data_completeness_ratio: 50%

-- Step 1: Filter entities based on data completeness
WITH entity_counts AS (
    SELECT
        car_saloon_store,
        COUNT(*) AS total_count,
        COUNT(sales) AS non_null_sales_count
    FROM bi_car_brand_stats.fact_car_sales
    GROUP BY car_saloon_store
),
valid_entities AS (
    SELECT car_saloon_store
    FROM entity_counts
    WHERE non_null_sales_count >= total_count * 0.50
),
-- Step 2: Remove duplicates
deduplicated_table AS (
    SELECT day, car_saloon_store, sales
    FROM (
        SELECT
            day, car_saloon_store, sales,
            ROW_NUMBER() OVER (PARTITION BY car_saloon_store, day ORDER BY RANDOM()) AS row_num
        FROM bi_car_brand_stats.fact_car_sales
        WHERE car_saloon_store IN (SELECT car_saloon_store FROM valid_entities)
    ) AS ranked_data
    WHERE row_num = 1
),
-- Step 3: Calculate quartiles for each store
-- (Vertica implements PERCENTILE_CONT as an analytic function, so
-- DISTINCT collapses the per-row results instead of GROUP BY)
sales_quartiles AS (
    SELECT DISTINCT
        car_saloon_store,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales)
            OVER (PARTITION BY car_saloon_store) AS Q1_sales,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sales)
            OVER (PARTITION BY car_saloon_store) AS Q3_sales
    FROM deduplicated_table
),
-- Step 4: Replace outliers with quartile values (Tukey's fences)
outlier_replaced AS (
    SELECT
        dt.day,
        dt.car_saloon_store,
        CASE
            WHEN dt.sales < sq.Q1_sales - (sq.Q3_sales - sq.Q1_sales) * 1.5 THEN sq.Q1_sales
            WHEN dt.sales > sq.Q3_sales + (sq.Q3_sales - sq.Q1_sales) * 1.5 THEN sq.Q3_sales
            ELSE dt.sales
        END AS sales
    FROM deduplicated_table dt
    JOIN sales_quartiles sq ON dt.car_saloon_store = sq.car_saloon_store
),
-- Step 5: Fill missing values
filled_missing_values AS (
    SELECT
        day,
        car_saloon_store,
        COALESCE(sales, AVG(sales) OVER (PARTITION BY car_saloon_store)) AS sales
    FROM outlier_replaced
)
-- Step 6: Final output sorted by car_saloon_store and day
SELECT day, car_saloon_store, sales
FROM filled_missing_values
ORDER BY car_saloon_store, day;

Explanation:
- Entity Filtering: entity_counts and valid_entities keep only stores whose non-NULL sales meet the 50% completeness ratio.
- Deduplication: ROW_NUMBER() keeps a single row per (car_saloon_store, day) pair.
- Quartile Calculation: Vertica's analytic PERCENTILE_CONT() computes Q1 and Q3 per store, with DISTINCT reducing the result to one row per store.
- Outlier Replacement: sales values outside Tukey's fences are clamped to Q1 or Q3.
- Missing Value Handling: remaining NULLs are filled using the AVG() window function over partitions of car_saloon_store.
- Final Output: the cleaned rows are ordered by car_saloon_store and day.
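Step 5's missing-value handling (COALESCE over a per-entity average) is the same in every variant of the query. A plain-Python sketch of that rule, with illustrative numbers:

```python
from collections import defaultdict

def fill_missing_with_entity_avg(rows):
    """rows: list of (entity, value_or_None). Replace each None with the
    average of that entity's non-missing values, mirroring
    COALESCE(sales, AVG(sales) OVER (PARTITION BY entity))."""
    sums, counts = defaultdict(float), defaultdict(int)
    for entity, value in rows:
        if value is not None:
            sums[entity] += value
            counts[entity] += 1
    avgs = {e: sums[e] / counts[e] for e in counts}
    return [(e, v if v is not None else avgs[e]) for e, v in rows]

data = [("store_1", 10.0), ("store_1", None), ("store_1", 20.0),
        ("store_2", 4.0), ("store_2", 6.0)]
print(fill_missing_with_entity_avg(data))
# store_1's None becomes (10 + 20) / 2 = 15.0
```

Like AVG() in SQL, the Python average here ignores missing values rather than counting them as zero, so the imputed value reflects only observed data.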
Certainly! Here's an SQL query that meets your requirements, with these variable values:

database_engine: postgresql
database_name: bi_raw
table_name: fact_country_gdp
time_component: week
entity_identifier: country_code
metric_values: 'gdp'
data_completeness_ratio: 90%

WITH
-- Step 1: Calculate the completeness ratio for each country
entity_completeness AS (
    SELECT
        country_code,
        COUNT(*) AS total_points,
        COUNT(gdp) AS non_null_points,
        COUNT(gdp) * 1.0 / COUNT(*) AS completeness_ratio
    FROM bi_raw.fact_country_gdp
    GROUP BY country_code
),
-- Step 2: Calculate Tukey's limits for outlier removal
tukey_limits AS (
    SELECT
        country_code,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY gdp) AS first_quartile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY gdp) AS third_quartile
    FROM bi_raw.fact_country_gdp
    GROUP BY country_code
),
-- Step 3: Apply the completeness and outlier rules, then remove duplicates
cleaned_data AS (
    SELECT country_code, week, gdp
    FROM (
        SELECT
            t.*,
            ROW_NUMBER() OVER (PARTITION BY t.country_code, t.week) AS row_num
        FROM bi_raw.fact_country_gdp t
        JOIN tukey_limits tl ON t.country_code = tl.country_code
        JOIN entity_completeness ec ON t.country_code = ec.country_code
        WHERE ec.completeness_ratio >= 0.90
          AND (t.gdp IS NULL
               OR (t.gdp >= tl.first_quartile - 1.5 * (tl.third_quartile - tl.first_quartile)
                   AND t.gdp <= tl.third_quartile + 1.5 * (tl.third_quartile - tl.first_quartile)))
    ) AS ranked
    WHERE row_num = 1
),
-- Step 4: Impute missing values with the per-country average
imputed_data AS (
    SELECT
        country_code,
        week,
        COALESCE(gdp, AVG(gdp) OVER (PARTITION BY country_code)) AS gdp
    FROM cleaned_data
)
-- Step 5: Final output with sorting
SELECT *
FROM imputed_data
ORDER BY country_code, week;

This query is tailored for PostgreSQL and specific to the provided database, table, column names, and parameters. In PostgreSQL, PERCENTILE_CONT() is an ordered-set aggregate (WITHIN GROUP combined with GROUP BY, not an OVER clause), and there is no QUALIFY keyword, so duplicate removal uses ROW_NUMBER() in a subquery. Unlike the earlier examples, this variant drops outliers rather than clamping them. It's ready for execution in a PostgreSQL environment.
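PERCENTILE_CONT() interpolates linearly between sorted values, which matters when porting to engines that only offer nearest-rank percentiles (such as the SQLite approximation in the first example). A small Python sketch of the interpolation, with illustrative data:

```python
def percentile_cont(values, fraction):
    """Linear-interpolation percentile, matching SQL's PERCENTILE_CONT:
    the target position is fraction * (n - 1) in the sorted list, and a
    fractional position interpolates between the two neighbouring values."""
    xs = sorted(values)
    pos = fraction * (len(xs) - 1)
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return float(xs[lo])
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

data = [1, 2, 3, 4]
print(percentile_cont(data, 0.25))  # 1.75: position 0.75, between 1 and 2
print(percentile_cont(data, 0.75))  # 3.25: position 2.25, between 3 and 4
```

A nearest-rank method would return 2 and 3 here instead of 1.75 and 3.25, so the Tukey fences (and therefore which rows count as outliers) can differ slightly between engines.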
Database Name
Enter the name of the database
Table Name
Enter the name of the table
Database Engine
Choose a database engine
Time Component
Choose a time component
Entity Identifier
Enter an appropriate entity identifier
Metric Value
Select a relevant metric value
Data Completeness Ratio
Enter the percentage of data completeness
This app costs 78 credits to run.