How to Query Data from Two Tables in Amazon Athena Based on Dates

Query to Get Rows Based on Dates from Two Tables in Athena

Overview

In this article, we’ll explore how to query data from two tables in Amazon Athena and match them on specific conditions. The goal is to retrieve the rows from master_tbl that have a corresponding row in anom_tbl with a non-zero anoms value and a date within one day (before or after) of the master_tbl row’s date.

Prerequisites

Before we dive into the code, make sure you’re familiar with SQL and Amazon Athena’s query syntax.

Background

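Amazon Athena is a serverless, interactive query service from AWS that runs standard SQL directly against data stored in Amazon S3. Its query engine is based on Trino (formerly Presto), so date functions and interval syntax follow that dialect rather than SQL Server or MySQL. Because Athena typically bills by the amount of data scanned, it’s worth writing queries that are both correct and efficient.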

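In this example, we’ll use the master_tbl and anom_tbl tables from the original question. Both tables have date, id, and country columns; master_tbl also has a value column, and anom_tbl has an anoms column, where a non-zero entry marks the rows we want to match against.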

Querying the Data

To solve the problem, we need to match rows across the two tables on id and country, then keep only the master_tbl rows whose date lies within one day of a non-zero anomaly.

Here’s the original query provided by the user:

FROM 
    (SELECT t2.value,
         t1.id,
         t1.country AS country,
         cast(t1.date AS DATE) AS orig_date
    FROM 
        (SELECT id,
         country,
         date
        FROM anom_tbl) t1
        JOIN master_tbl t2
            ON t2.id=t1.id
                AND t2.country= t1.country
                AND t2.date=t1.date) t3
    JOIN master_tbl t2
    ON t3.id=t2.id
        AND t3.country=t2.country 
        where t2.date IN(GETDATE()-1)

However, this query is not correct. Let’s break down the errors and correct them.

Corrected Query

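The original query has several problems. GETDATE() is not an Athena function (Athena provides current_date and now() instead), so the filter fails as written. Even with a valid function, IN with a single value compares t2.date against exactly one date, yesterday, whereas we want any master_tbl date that falls within one day before or after an anomaly’s date. The query also never checks that the anomaly value is non-zero, and it joins master_tbl twice without needing to.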

To fix this, we’ll use EXISTS instead, which allows us to check for the existence of rows that satisfy specific conditions.
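As an aside on the date expression itself: the "yesterday" value that GETDATE()-1 appears to be aiming for can be written in Athena with current_date and an interval. This is a minimal sketch, assuming Athena’s Trino/Presto-based SQL dialect; the column aliases are only illustrative.

```sql
-- Date arithmetic in Athena (Trino/Presto dialect).
-- GETDATE() belongs to other dialects such as SQL Server; current_date is the Athena equivalent.
SELECT
    current_date                      AS today,
    current_date - INTERVAL '1' DAY   AS yesterday,
    date_add('day', -1, current_date) AS yesterday_alt  -- same result via date_add
```

The task here does not depend on today’s date at all, though: the range is anchored on each anomaly’s date, not on the current date.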

Here’s the corrected query:

SELECT m.*
FROM master_tbl m
WHERE EXISTS (
    SELECT 1
    FROM anom_tbl a
    WHERE 
        a.anoms <> 0
        AND a.id = m.id 
        AND a.country = m.country
        AND m.date >= a.date - INTERVAL '1' DAY
        AND m.date <= a.date + INTERVAL '1' DAY
)

Let’s break down the changes:

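  • We select m.* so that every column from master_tbl is returned, and we query master_tbl only once.
  • We used EXISTS with a correlated subquery instead of the nested joins and the IN filter.
  • The subquery’s WHERE clause combines all of the matching conditions (non-zero anoms, same id, same country, date range) with AND.
  • We introduced INTERVAL '1' DAY to define the date range around each anomaly’s date.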
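The original query wraps the date column in cast(... AS DATE), which suggests the column may be stored as a string. The table schema is not shown, so this is an assumption; if it holds, the interval arithmetic needs a cast on both sides. A hedged variant of the same query, assuming ISO-formatted (YYYY-MM-DD) strings:

```sql
-- Variant for string-typed date columns (assumption: values look like '2017-01-03').
SELECT m.*
FROM master_tbl m
WHERE EXISTS (
    SELECT 1
    FROM anom_tbl a
    WHERE a.anoms <> 0
      AND a.id = m.id
      AND a.country = m.country
      -- BETWEEN is inclusive on both ends, matching the >= / <= pair
      AND CAST(m.date AS DATE)
          BETWEEN CAST(a.date AS DATE) - INTERVAL '1' DAY
              AND CAST(a.date AS DATE) + INTERVAL '1' DAY
)
```

If the columns are already of type DATE, the casts are unnecessary and the query without them is preferable.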

Explanation

The corrected query works as follows:

  1. It selects all columns (m.*) from the master_tbl table.
  2. For each row in master_tbl, it checks whether anom_tbl contains at least one row that satisfies all of the following:
    • The id and country match those of the master_tbl row.
    • The value in anoms is not zero.
    • The date in m.date falls within one day before or after a.date.
  3. If such a row exists, it includes all columns from master_tbl (m.*) in the result set.
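To see why EXISTS suits these steps, it helps to compare it with a join-based formulation. With a plain JOIN, a master_tbl row that lies within one day of two different anomalies matches twice and would appear twice in the output, so the sketch below needs DISTINCT to collapse the repeats. It uses the same tables and columns as the query above.

```sql
-- JOIN-based alternative, shown for comparison only.
-- Each master_tbl row is returned once per matching anomaly, hence DISTINCT.
SELECT DISTINCT m.*
FROM master_tbl m
JOIN anom_tbl a
    ON  a.id = m.id
    AND a.country = m.country
    AND m.date BETWEEN a.date - INTERVAL '1' DAY
                   AND a.date + INTERVAL '1' DAY
WHERE a.anoms <> 0
```

Note that DISTINCT would also merge master_tbl rows that are genuinely identical, which EXISTS does not do; EXISTS returns each qualifying master_tbl row exactly as it appears in the table.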

Result

The final query produces the desired output:

Date        Id  Country  Value
2017-01-02  26  US       2
2017-01-03  26  US       9
2017-01-04  26  US       4
2017-01-08  26  US       3
2017-01-09  26  US       100
2017-01-10  26  US       4

The output includes only the rows from master_tbl where a corresponding row in anom_tbl exists with non-zero values within one day.
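The query can be tried out without the real tables by defining small inline datasets with VALUES. In the sketch below, the rows for 2017-01-02 through 2017-01-04 reuse values from the result above; every other row, including both anom_tbl rows, is made up purely for illustration and is not taken from the original data.

```sql
-- Self-contained check with inline sample data (partly hypothetical, see above).
WITH master_tbl AS (
    SELECT * FROM (VALUES
        (DATE '2017-01-01', 26, 'US', 5),   -- hypothetical value
        (DATE '2017-01-02', 26, 'US', 2),
        (DATE '2017-01-03', 26, 'US', 9),
        (DATE '2017-01-04', 26, 'US', 4),
        (DATE '2017-01-05', 26, 'US', 7),   -- hypothetical value
        (DATE '2017-01-06', 26, 'US', 6)    -- hypothetical value
    ) AS t (date, id, country, value)
),
anom_tbl AS (
    SELECT * FROM (VALUES
        (DATE '2017-01-03', 26, 'US', 1),   -- non-zero: should pull in Jan 2 to Jan 4
        (DATE '2017-01-06', 26, 'US', 0)    -- zero: should be ignored
    ) AS t (date, id, country, anoms)
)
SELECT m.*
FROM master_tbl m
WHERE EXISTS (
    SELECT 1
    FROM anom_tbl a
    WHERE a.anoms <> 0
      AND a.id = m.id
      AND a.country = m.country
      AND m.date >= a.date - INTERVAL '1' DAY
      AND m.date <= a.date + INTERVAL '1' DAY
)
ORDER BY m.date
```

With this sample data, only the rows dated 2017-01-02, 2017-01-03, and 2017-01-04 come back: they are within one day of the non-zero anomaly on 2017-01-03, while the rows near 2017-01-06 are excluded because that anomaly’s anoms value is zero.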

Conclusion

In this article, we explored how to query data from two tables in Amazon Athena and join them based on specific conditions. We corrected an original query that didn’t work as expected and replaced it with a new query using EXISTS. The new query produces the desired output by checking for the existence of corresponding rows in anom_tbl within one day.


Last modified on 2025-03-18