Query to Get Rows Based on Dates from Two Tables in Athena
Overview
In this article, we'll explore how to query data from two tables in Amazon Athena and relate them based on specific conditions. The goal is to retrieve the rows from master_tbl that have a corresponding row in anom_tbl with a non-zero value dated within one day before or after.
Prerequisites
Before we dive into the code, make sure you’re familiar with SQL and Amazon Athena’s query syntax.
Background
Amazon Athena is a serverless, interactive query service from AWS that runs standard SQL directly against data stored in Amazon S3. Its query engine is based on Presto/Trino, so queries follow that SQL dialect. Because Athena bills by the amount of data scanned, it pays to write efficient queries.
In this example, we'll use the master_tbl and anom_tbl tables provided in the question. Judging from the queries below, master_tbl has date, id, country, and value columns, and anom_tbl has date, id, country, and anoms columns.
Querying the Data
To solve the problem, we can use a correlated subquery whose conditions filter the data based on the specified criteria.
Here’s the original query provided by the user:
```sql
FROM
  (SELECT t2.value,
          t1.id,
          t1.country AS country,
          cast(t1.date AS DATE) AS orig_date
   FROM
     (SELECT id,
             country,
             date
      FROM anom_tbl) t1
   JOIN master_tbl t2
     ON t2.id=t1.id
     AND t2.country= t1.country
     AND t2.date=t1.date) t3
JOIN master_tbl t2
  ON t3.id=t2.id
  AND t3.country=t2.country
where t2.date IN(GETDATE()-1)
```
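However, this query is not correct. As quoted, it has no outer SELECT list, it never checks anom_tbl for non-zero values, and GETDATE() is a SQL Server function that Athena does not provide (Athena offers current_date and now() instead). Let's break down the main error and correct it.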
Corrected Query
The main issue with the original query is its date filter. IN compares a value against a list of candidates, and with a single expression such as GETDATE()-1 it is just an equality test against yesterday's date. What we want instead is any master_tbl date that falls within one day before or after the date of a matching anom_tbl row.
To fix this, we’ll use EXISTS instead, which allows us to check for the existence of rows that satisfy specific conditions.
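To see why a single-value IN falls short, here is a minimal Python sketch that uses the standard-library sqlite3 module as a local stand-in for Athena. That substitution is an assumption made so the example can run anywhere: SQLite's date(x, '-1 day') takes the place of Athena's interval arithmetic, and the four dates and the anchor date are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Four hypothetical dates, stored as ISO-8601 text so they compare correctly.
DAYS_CTE = """
    WITH d(day) AS (
        VALUES ('2017-01-02'), ('2017-01-03'), ('2017-01-04'), ('2017-01-05')
    )
"""


def matching_days(predicate, anchor):
    """Return the days that satisfy the given WHERE predicate, in date order."""
    sql = DAYS_CTE + "SELECT day FROM d WHERE " + predicate + " ORDER BY day"
    return [row[0] for row in conn.execute(sql, {"anchor": anchor})]


# IN with one expression is an equality test: only the day before matches.
in_days = matching_days("day IN (date(:anchor, '-1 day'))", "2017-01-03")

# A range predicate matches the whole one-day window around the anchor.
range_days = matching_days(
    "day BETWEEN date(:anchor, '-1 day') AND date(:anchor, '+1 day')",
    "2017-01-03",
)

print(in_days)
print(range_days)
```

The IN form returns a single day, while the range form returns the anchor date together with its two neighbours.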
Here’s the corrected query:
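```sql
SELECT m.*
FROM master_tbl m
WHERE EXISTS (
    SELECT 1
    FROM anom_tbl a
    WHERE
        a.anoms <> 0                              -- only rows flagged as anomalies
        AND a.id = m.id                           -- same id ...
        AND a.country = m.country                 -- ... and same country
        AND m.date >= a.date - INTERVAL '1' DAY   -- from one day before the anomaly
        AND m.date <= a.date + INTERVAL '1' DAY   -- through one day after it
)
```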
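The query can be tried locally with the following sketch, which again uses Python's built-in sqlite3 module rather than Athena. Everything about the setup is an assumption: the helper names build_demo_db and rows_near_anomalies are hypothetical, the sample rows are invented (anomalies are placed on 2017-01-03 and 2017-01-09 so that the output matches the table in the Result section), and SQLite's date() modifiers replace INTERVAL.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE master_tbl (date TEXT, id INTEGER, country TEXT, value INTEGER)"
    )
    conn.execute(
        "CREATE TABLE anom_tbl (date TEXT, id INTEGER, country TEXT, anoms INTEGER)"
    )
    values = [5, 2, 9, 4, 7, 1, 6, 3, 100, 4, 8]  # 2017-01-01 .. 2017-01-11
    for day, value in enumerate(values, start=1):
        date = f"2017-01-{day:02d}"
        # Hypothetical: anomalies are flagged only on 2017-01-03 and 2017-01-09.
        anoms = 1 if day in (3, 9) else 0
        conn.execute("INSERT INTO master_tbl VALUES (?, 26, 'US', ?)", (date, value))
        conn.execute("INSERT INTO anom_tbl VALUES (?, 26, 'US', ?)", (date, anoms))
    return conn


def rows_near_anomalies(conn):
    """Return master_tbl rows within one day of a non-zero anom_tbl row."""
    query = """
        SELECT m.date, m.id, m.country, m.value
        FROM master_tbl m
        WHERE EXISTS (
            SELECT 1
            FROM anom_tbl a
            WHERE a.anoms <> 0
              AND a.id = m.id
              AND a.country = m.country
              AND m.date BETWEEN date(a.date, '-1 day') AND date(a.date, '+1 day')
        )
        ORDER BY m.date
    """
    return conn.execute(query).fetchall()


for row in rows_near_anomalies(build_demo_db()):
    print(row)
```

Dates are stored as ISO-8601 text here, which compares correctly as strings. In Athena, the same comparison needs DATE-typed columns or an explicit cast.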
Let’s break down the changes:
- We replaced `t2.value` with `m.*` to include all columns from `master_tbl`.
- We used `EXISTS` instead of `IN`.
- We specified the conditions for the subquery in its `WHERE` clause and combined them with `AND`.
- We introduced `INTERVAL` to define the date range.
Explanation
The corrected query works as follows:
- It selects all columns (`m.*`) from the `master_tbl` table.
- For each row in `master_tbl`, it checks whether there exists a row in `anom_tbl` with the same `id` and `country` that satisfies two conditions:
  - The value in `anoms` is not zero.
  - The date in `m.date` falls within one day before or after the date in `a.date`.
- If such a row exists, it includes all columns from `master_tbl` (`m.*`) in the result set.
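One consequence of the existence check is that each master_tbl row appears at most once, even when several anomalies fall inside its window. The sketch below contrasts this with a plain JOIN on the same conditions. It is an illustration only: it runs on SQLite through Python's sqlite3 module rather than on Athena, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_tbl (date TEXT, id INTEGER, country TEXT, value INTEGER);
    CREATE TABLE anom_tbl (date TEXT, id INTEGER, country TEXT, anoms INTEGER);
    INSERT INTO master_tbl VALUES
        ('2017-01-02', 1, 'US', 10), ('2017-01-03', 1, 'US', 20),
        ('2017-01-04', 1, 'US', 30), ('2017-01-05', 1, 'US', 40);
    -- Two anomalies on adjacent days, so their one-day windows overlap.
    INSERT INTO anom_tbl VALUES
        ('2017-01-03', 1, 'US', 1), ('2017-01-04', 1, 'US', 1);
""")

# The same matching conditions, shared by both query shapes.
WINDOW = """
    a.anoms <> 0 AND a.id = m.id AND a.country = m.country
    AND m.date BETWEEN date(a.date, '-1 day') AND date(a.date, '+1 day')
"""

# JOIN emits one output row per (master row, matching anomaly) pair.
join_rows = conn.execute(
    "SELECT m.date FROM master_tbl m JOIN anom_tbl a ON " + WINDOW + " ORDER BY m.date"
).fetchall()

# EXISTS emits each master row once, however many anomalies match it.
exists_rows = conn.execute(
    "SELECT m.date FROM master_tbl m WHERE EXISTS (SELECT 1 FROM anom_tbl a WHERE "
    + WINDOW
    + ") ORDER BY m.date"
).fetchall()

print(len(join_rows), len(exists_rows))
```

With this data, the rows for 2017-01-03 and 2017-01-04 each match both anomalies, so the JOIN returns them twice while EXISTS returns them once.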
Result
The final query produces the desired output:
| Date | Id | Country | Value |
|---|---|---|---|
| 2017-01-02 | 26 | US | 2 |
| 2017-01-03 | 26 | US | 9 |
| 2017-01-04 | 26 | US | 4 |
| 2017-01-08 | 26 | US | 3 |
| 2017-01-09 | 26 | US | 100 |
| 2017-01-10 | 26 | US | 4 |
The output includes only the rows from master_tbl where a corresponding row in anom_tbl exists with non-zero values within one day.
Conclusion
In this article, we explored how to query data from two tables in Amazon Athena and join them based on specific conditions. We corrected an original query that didn’t work as expected and replaced it with a new query using EXISTS. The new query produces the desired output by checking for the existence of corresponding rows in anom_tbl within one day.
Last modified on 2025-03-18