Menu

Session-Level Attribution in GA4 with BigQuery: A Practical Guide
Session-Level Attribution in GA4 with BigQuery: A Practical Guide

Session-Level Attribution in GA4 with BigQuery: A Practical Guide

This guide from Napkyn explores how to recreate session-level attribution in Google Analytics 4 (GA4) using BigQuery. It outlines challenges like building session-level metrics from event-level data, selecting the correct traffic source fields, and replicating GA4’s last-click attribution model. Practical SQL examples are provided for solving these issues, and the guide also addresses the complexities of handling gclid/dclid parameters in GA4 exports. This empowers analysts to gain accurate insights from their GA4 data.

Shreya Banker

Data Scientist

Data Analyst enthusiast. More than 7 years of exposure in Data Analysis and Software programming. I am a highly motivated, versatile IT professional with experience in Data Analysis, Visualization and Database Management. I look for the hardest problem to solve and where I can learn and develop the most. I love a challenge and never run from a difficult task. I'm determined to succeed, and I look forward to what life has to offer.

For marketers and digital analysts, gaining deep insights into user behavior across multiple sessions is pivotal for making data-driven decisions. Google Analytics 4 (GA4) has introduced a fresh perspective on data collection and analysis, yet it brings new challenges, particularly when dealing with BigQuery (BQ) exports. A common hurdle is recreating session-level attribution in BQ that aligns with the metrics observed in the GA4 interface. In this guide, we’ll break down this complex process and provide practical examples to help you master session-level attribution in GA4 with BigQuery.

Challenges and Solutions

Rebuilding Session-Level Data

GA4’s BigQuery export primarily captures data at the event and user level, making session-level metrics unavailable by default. Unlike Universal Analytics, where session-level data was readily accessible, GA4 requires you to build this data manually.

Solution: You can aggregate event-level data to simulate session-level metrics. For instance, using SQL queries, you can group events by session identifiers and timestamps.

SELECT
  user_pseudo_id,event_timestamp,
  MIN(event_timestamp) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_start_time
FROM
  `your_project.your_dataset.ga4_events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20240101' AND '20240101'

This query helps to identify the start time of each session for a given user, which can be the foundation for session-level analysis.

Using the Correct Traffic Source

GA4 provides three fields related to traffic sources in BigQuery: traffic_source,  session_traffic_source_last_click and collected_traffic_source. It’s crucial to reference the correct field to ensure accurate session-level attribution.

Solution: Use the collected_traffic_source field, which captures traffic source details for each event. This enables you to build accurate session-level attribution data.

User Scope

Based on first session per user:

  • traffic_source.name(campaign)

  • traffic_source.medium

  • traffic_source.source

Event-scoped fields are preferred for understanding multiple ad or source interactions within a session, or for computing session-level values for dates before the addition of them on July 17.

  • collected_traffic_source.manual_campaign_id

  • collected_traffic_source.manual_campaign_name

  • collected_traffic_source.manual_source

  • collected_traffic_source.manual_medium

  • collected_traffic_source.manual_term

  • collected_traffic_source.manual_content

  • collected_traffic_source.gclid

  • collected_traffic_source.dclid

  • Collected_traffic_source.srsltid

Session-scoped data is available after July 17, 2024 and should be the preferred method for handling sessions that occurred after that date. One issue to consider here is that rows can have "google / organic" as the source / medium, but still also include a Google Ads campaign.

  • session_traffic_source_last_click.manual_campaign.campaign_id

  • session_traffic_source_last_click.manual_campaign.campaign_name

  • session_traffic_source_last_click.manual_campaign.source

  • session_traffic_source_last_click.manual_campaign.medium

  • session_traffic_source_last_click.manual_campaign.term

  • session_traffic_source_last_click.manual_campaign.content

  • session_traffic_source_last_click.manual_campaign.source_platform

  • session_traffic_source_last_click.manual_campaign.creative_format

  • session_traffic_source_last_click.manual_campaign.marketing_tactic

  • session_traffic_source_last_click.google_ads_campaign

  • session_traffic_source_last_click.google_ads_campaign.customer_id

  • session_traffic_source_last_click.google_ads_campaign.account_name

  • session_traffic_source_last_click.google_ads_campaign.campaign_id

  • session_traffic_source_last_click.google_ads_campaign.campaign_name

  • session_traffic_source_last_click.google_ads_campaign.ad_group_id

  • session_traffic_source_last_click.google_ads_campaign.ad_group_name

Leveraging the Last-Click Attribution Model

GA4 defaults to the last non-direct click attribution model, assigning 100% of conversion credit to the last non-direct touchpoint before a conversion occurs.

Solution: To replicate this model in BigQuery, trace user journeys backward from the conversion event, ignoring direct traffic unless it is the only interaction. Note that this solution will only work  for dates after July 2024, for earlier dates a valid comparison for source breakdowns is required.

Example Scenario: Imagine a user interacts with your website as follows:

  • Day 1: Clicks on a display ad.

  • Day 2: Arrives via organic search.

  • Day 3: Clicks on a paid search ad.

  • Day 4: Arrives directly on the site and completes a purchase.

In this case, the last non-direct click model would attribute 100% of the conversion credit to the paid search ad on Day 4.

SELECT
  user_pseudo_id,event_name,source,medium,
  IF(source IS NOT NULL, source, LAG(source) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp)) AS last_non_direct_source
FROM
  `your_project.your_dataset.ga4_events_*`
WHERE event_name = 'purchase'

Adjusting for the Key Event Lookback Window

The lookback window for key events in GA4 defines how far back in time interactions are considered for attribution credit. The default is 30 days, but it can be extended.

Solution: Ensure your BigQuery calculations align with the selected lookback window. For example, if you’ve chosen a 60-day lookback window, your queries should account for user interactions within this period when assigning attribution.

Considering Reporting Identity

The reporting identity (Blended, Observed or Device-based) impacts how user-based metrics are calculated. Incorrect use of identity in BigQuery can lead to discrepancies between BQ data and the GA4 interface.

Solution: Adjust your queries based on the reporting identity set in GA4. For Blended identities, combine user_pseudo_id and user_id to create unique device identifiers. Note: This solution will not work  for properties where modeling is applied.

Handling Issues with gclid/dclid in GA4 Exports

When working with Google Analytics 4 (GA4) data exports in BigQuery, there is a known issue that can cause incorrect attribution of traffic source data, particularly when gclid (Google Ads click identifier) or dclid (Display & Video 360 click identifier) parameters are involved. This problem predominantly affects sessions coming from Google Ads, Display & Video 360, and Campaign Manager 360 traffic. The result is that traffic that should be attributed to Google/cpc (Google Ads) or related sources is sometimes mislabeled as organic or direct traffic.

This issue arises in the BigQuery exports and is related to how the collected_traffic_source field is processed. Although the GA4 UI and API correctly handle and display traffic source attribution, the BigQuery export schema presents challenges when analyzing this data at the session level.

Why This Happens:

The issue stems from the processing of the gclid and dclid parameters in GA4 BigQuery exports. The collected_traffic_source field contains valuable traffic source information for each event, but if this data is not correctly associated with the session, it can result in incorrect or missing attribution for certain traffic sources, especially those from Google Ads.

Until Google fully resolves this issue in the BigQuery export schema, a practical SQL-based workaround is available:

Steps to Address the Issue:

  1. Aggregate Traffic Source Data at the Session Level: Start by compiling all traffic source information at the session level. This gives you a broader view of the traffic sources associated with each session.

  2. Identify Sessions with gclid Parameters: Next, filter sessions that contain gclid parameters in the page_location field. These parameters indicate that the session originated from Google Ads.

  3. Overwrite Source and Medium Values: For sessions identified in the previous step, overwrite the source and medium fields with fixed values such as “google/cpc”. This adjustment helps correct the attribution for these specific sessions.

While this method resolves source and medium attribution issues, it may not fully correct campaign data, especially for newer ad formats like Performance Max campaigns. To achieve more accurate campaign attribution, consider implementing Google Ads Data Transfer.

Enhancing Campaign Data Accuracy:

  • Extract gclid data: Start by extracting the gclid parameter for each session where it’s present.

  • Join with Google Ads data: Use the Google Ads Data Transfer in BigQuery to join this gclid data with detailed campaign information.

  • Retrieve campaign details: This process will allow you to retrieve accurate campaign names and other relevant details on a session-by-session basis.

Important Prerequisite: Before proceeding with this solution, ensure that your Google Ads Data Transfer to BigQuery is properly set up.

Conclusion

Navigating the complexities of session-level attribution in Google Analytics 4 using BigQuery may seem daunting, but with a clear understanding of the challenges and the right tools at your disposal, it becomes a manageable and rewarding task. By rebuilding session-level data, correctly referencing traffic sources, and leveraging GA4's last-click attribution model, you can replicate the metrics seen in the GA4 interface with remarkable accuracy.

Moreover, addressing common issues, such as those related to gclid/dclid parameters, further enhances your ability to derive meaningful insights from your data. While GA4’s shift in data handling requires a new approach, the flexibility and power of BigQuery allow you to customize and optimize your analysis to meet your specific needs.

By mastering these techniques, you'll be well-equipped to extract valuable insights from your GA4 data, enabling more informed, data-driven decisions that can significantly impact your business outcomes.

About Napkyn

Napkyn specializes in empowering businesses with tailored analytics solutions, offering expertise in GA4 and BigQuery to transform data into actionable insights. Their custom BigQuery reports are designed to meet unique business needs, providing deep insights into user behavior, campaign performance, and e-commerce trends. By combining advanced analytics with robust reporting, Napkyn helps organizations make data-driven decisions to optimize marketing strategies and drive growth.


Looking to harness the full potential of GA4 and BigQuery for your business? Let Napkyn guide you with custom reporting solutions that deliver real results. Get in touch with us today to start transforming your data into your most valuable business asset!

References:

Session-Level Attribution in GA4 with BigQuery: A Practical Guide

Session-Level Attribution in GA4 with BigQuery: A Practical Guide

This guide from Napkyn explores how to recreate session-level attribution in Google Analytics 4 (GA4) using BigQuery. It outlines challenges like building session-level metrics from event-level data, selecting the correct traffic source fields, and replicating GA4’s last-click attribution model. Practical SQL examples are provided for solving these issues, and the guide also addresses the complexities of handling gclid/dclid parameters in GA4 exports. This empowers analysts to gain accurate insights from their GA4 data.

Shreya Banker

Data Scientist

Data Analyst enthusiast. More than 7 years of exposure in Data Analysis and Software programming. I am a highly motivated, versatile IT professional with experience in Data Analysis, Visualization and Database Management. I look for the hardest problem to solve and where I can learn and develop the most. I love a challenge and never run from a difficult task. I'm determined to succeed, and I look forward to what life has to offer.

For marketers and digital analysts, gaining deep insights into user behavior across multiple sessions is pivotal for making data-driven decisions. Google Analytics 4 (GA4) has introduced a fresh perspective on data collection and analysis, yet it brings new challenges, particularly when dealing with BigQuery (BQ) exports. A common hurdle is recreating session-level attribution in BQ that aligns with the metrics observed in the GA4 interface. In this guide, we’ll break down this complex process and provide practical examples to help you master session-level attribution in GA4 with BigQuery.

Challenges and Solutions

Rebuilding Session-Level Data

GA4’s BigQuery export primarily captures data at the event and user level, making session-level metrics unavailable by default. Unlike Universal Analytics, where session-level data was readily accessible, GA4 requires you to build this data manually.

Solution: You can aggregate event-level data to simulate session-level metrics. For instance, using SQL queries, you can group events by session identifiers and timestamps.

SELECT
  user_pseudo_id,event_timestamp,
  MIN(event_timestamp) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_start_time
FROM
  `your_project.your_dataset.ga4_events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20240101' AND '20240101'

This query helps to identify the start time of each session for a given user, which can be the foundation for session-level analysis.

Using the Correct Traffic Source

GA4 provides three fields related to traffic sources in BigQuery: traffic_source,  session_traffic_source_last_click and collected_traffic_source. It’s crucial to reference the correct field to ensure accurate session-level attribution.

Solution: Use the collected_traffic_source field, which captures traffic source details for each event. This enables you to build accurate session-level attribution data.

User Scope

Based on first session per user:

  • traffic_source.name(campaign)

  • traffic_source.medium

  • traffic_source.source

Event-scoped fields are preferred for understanding multiple ad or source interactions within a session, or for computing session-level values for dates before the addition of them on July 17.

  • collected_traffic_source.manual_campaign_id

  • collected_traffic_source.manual_campaign_name

  • collected_traffic_source.manual_source

  • collected_traffic_source.manual_medium

  • collected_traffic_source.manual_term

  • collected_traffic_source.manual_content

  • collected_traffic_source.gclid

  • collected_traffic_source.dclid

  • Collected_traffic_source.srsltid

Session-scoped data is available after July 17, 2024 and should be the preferred method for handling sessions that occurred after that date. One issue to consider here is that rows can have "google / organic" as the source / medium, but still also include a Google Ads campaign.

  • session_traffic_source_last_click.manual_campaign.campaign_id

  • session_traffic_source_last_click.manual_campaign.campaign_name

  • session_traffic_source_last_click.manual_campaign.source

  • session_traffic_source_last_click.manual_campaign.medium

  • session_traffic_source_last_click.manual_campaign.term

  • session_traffic_source_last_click.manual_campaign.content

  • session_traffic_source_last_click.manual_campaign.source_platform

  • session_traffic_source_last_click.manual_campaign.creative_format

  • session_traffic_source_last_click.manual_campaign.marketing_tactic

  • session_traffic_source_last_click.google_ads_campaign

  • session_traffic_source_last_click.google_ads_campaign.customer_id

  • session_traffic_source_last_click.google_ads_campaign.account_name

  • session_traffic_source_last_click.google_ads_campaign.campaign_id

  • session_traffic_source_last_click.google_ads_campaign.campaign_name

  • session_traffic_source_last_click.google_ads_campaign.ad_group_id

  • session_traffic_source_last_click.google_ads_campaign.ad_group_name

Leveraging the Last-Click Attribution Model

GA4 defaults to the last non-direct click attribution model, assigning 100% of conversion credit to the last non-direct touchpoint before a conversion occurs.

Solution: To replicate this model in BigQuery, trace user journeys backward from the conversion event, ignoring direct traffic unless it is the only interaction. Note that this solution will only work  for dates after July 2024, for earlier dates a valid comparison for source breakdowns is required.

Example Scenario: Imagine a user interacts with your website as follows:

  • Day 1: Clicks on a display ad.

  • Day 2: Arrives via organic search.

  • Day 3: Clicks on a paid search ad.

  • Day 4: Arrives directly on the site and completes a purchase.

In this case, the last non-direct click model would attribute 100% of the conversion credit to the paid search ad on Day 4.

SELECT
  user_pseudo_id,event_name,source,medium,
  IF(source IS NOT NULL, source, LAG(source) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp)) AS last_non_direct_source
FROM
  `your_project.your_dataset.ga4_events_*`
WHERE event_name = 'purchase'

Adjusting for the Key Event Lookback Window

The lookback window for key events in GA4 defines how far back in time interactions are considered for attribution credit. The default is 30 days, but it can be extended.

Solution: Ensure your BigQuery calculations align with the selected lookback window. For example, if you’ve chosen a 60-day lookback window, your queries should account for user interactions within this period when assigning attribution.

Considering Reporting Identity

The reporting identity (Blended, Observed or Device-based) impacts how user-based metrics are calculated. Incorrect use of identity in BigQuery can lead to discrepancies between BQ data and the GA4 interface.

Solution: Adjust your queries based on the reporting identity set in GA4. For Blended identities, combine user_pseudo_id and user_id to create unique device identifiers. Note: This solution will not work  for properties where modeling is applied.

Handling Issues with gclid/dclid in GA4 Exports

When working with Google Analytics 4 (GA4) data exports in BigQuery, there is a known issue that can cause incorrect attribution of traffic source data, particularly when gclid (Google Ads click identifier) or dclid (Display & Video 360 click identifier) parameters are involved. This problem predominantly affects sessions coming from Google Ads, Display & Video 360, and Campaign Manager 360 traffic. The result is that traffic that should be attributed to Google/cpc (Google Ads) or related sources is sometimes mislabeled as organic or direct traffic.

This issue arises in the BigQuery exports and is related to how the collected_traffic_source field is processed. Although the GA4 UI and API correctly handle and display traffic source attribution, the BigQuery export schema presents challenges when analyzing this data at the session level.

Why This Happens:

The issue stems from the processing of the gclid and dclid parameters in GA4 BigQuery exports. The collected_traffic_source field contains valuable traffic source information for each event, but if this data is not correctly associated with the session, it can result in incorrect or missing attribution for certain traffic sources, especially those from Google Ads.

Until Google fully resolves this issue in the BigQuery export schema, a practical SQL-based workaround is available:

Steps to Address the Issue:

  1. Aggregate Traffic Source Data at the Session Level: Start by compiling all traffic source information at the session level. This gives you a broader view of the traffic sources associated with each session.

  2. Identify Sessions with gclid Parameters: Next, filter sessions that contain gclid parameters in the page_location field. These parameters indicate that the session originated from Google Ads.

  3. Overwrite Source and Medium Values: For sessions identified in the previous step, overwrite the source and medium fields with fixed values such as “google/cpc”. This adjustment helps correct the attribution for these specific sessions.

While this method resolves source and medium attribution issues, it may not fully correct campaign data, especially for newer ad formats like Performance Max campaigns. To achieve more accurate campaign attribution, consider implementing Google Ads Data Transfer.

Enhancing Campaign Data Accuracy:

  • Extract gclid data: Start by extracting the gclid parameter for each session where it’s present.

  • Join with Google Ads data: Use the Google Ads Data Transfer in BigQuery to join this gclid data with detailed campaign information.

  • Retrieve campaign details: This process will allow you to retrieve accurate campaign names and other relevant details on a session-by-session basis.

Important Prerequisite: Before proceeding with this solution, ensure that your Google Ads Data Transfer to BigQuery is properly set up.

Conclusion

Navigating the complexities of session-level attribution in Google Analytics 4 using BigQuery may seem daunting, but with a clear understanding of the challenges and the right tools at your disposal, it becomes a manageable and rewarding task. By rebuilding session-level data, correctly referencing traffic sources, and leveraging GA4's last-click attribution model, you can replicate the metrics seen in the GA4 interface with remarkable accuracy.

Moreover, addressing common issues, such as those related to gclid/dclid parameters, further enhances your ability to derive meaningful insights from your data. While GA4’s shift in data handling requires a new approach, the flexibility and power of BigQuery allow you to customize and optimize your analysis to meet your specific needs.

By mastering these techniques, you'll be well-equipped to extract valuable insights from your GA4 data, enabling more informed, data-driven decisions that can significantly impact your business outcomes.

About Napkyn

Napkyn specializes in empowering businesses with tailored analytics solutions, offering expertise in GA4 and BigQuery to transform data into actionable insights. Their custom BigQuery reports are designed to meet unique business needs, providing deep insights into user behavior, campaign performance, and e-commerce trends. By combining advanced analytics with robust reporting, Napkyn helps organizations make data-driven decisions to optimize marketing strategies and drive growth.


Looking to harness the full potential of GA4 and BigQuery for your business? Let Napkyn guide you with custom reporting solutions that deliver real results. Get in touch with us today to start transforming your data into your most valuable business asset!

References:

Sign Up For Our Newsletter

Napkyn Inc.
204-78 George Street, Ottawa, Ontario, K1N 5W1, Canada

Napkyn US
6 East 32nd Street, 9th Floor, New York, NY 10016, USA

212-247-0800 | info@napkyn.com