Session-Level Attribution in GA4 with BigQuery: A Practical Guide
Shreya Banker
Data Scientist
Data Analyst enthusiast. More than 7 years of exposure in Data Analysis and Software programming. I am a highly motivated, versatile IT professional with experience in Data Analysis, Visualization and Database Management. I look for the hardest problem to solve and where I can learn and develop the most. I love a challenge and never run from a difficult task. I'm determined to succeed, and I look forward to what life has to offer.
For marketers and digital analysts, gaining deep insights into user behavior across multiple sessions is pivotal for making data-driven decisions. Google Analytics 4 (GA4) has introduced a fresh perspective on data collection and analysis, yet it brings new challenges, particularly when dealing with BigQuery (BQ) exports. A common hurdle is recreating session-level attribution in BQ that aligns with the metrics observed in the GA4 interface. In this guide, we’ll break down this complex process and provide practical examples to help you master session-level attribution in GA4 with BigQuery.
Challenges and Solutions
Rebuilding Session-Level Data
GA4’s BigQuery export primarily captures data at the event and user level, making session-level metrics unavailable by default. Unlike Universal Analytics, where session-level data was readily accessible, GA4 requires you to build this data manually.
Solution: You can aggregate event-level data to simulate session-level metrics. For instance, using SQL queries, you can group events by session identifiers and timestamps.
This query helps to identify the start time of each session for a given user, which can be the foundation for session-level analysis.
Using the Correct Traffic Source
GA4 provides three fields related to traffic sources in BigQuery: traffic_source, session_traffic_source_last_click and collected_traffic_source. It’s crucial to reference the correct field to ensure accurate session-level attribution.
Solution: Use the collected_traffic_source field, which captures traffic source details for each event. This enables you to build accurate session-level attribution data.
User Scope
Based on first session per user:
traffic_source.name(campaign)
traffic_source.medium
traffic_source.source
Event-scoped fields are preferred for understanding multiple ad or source interactions within a session, or for computing session-level values for dates before the addition of them on July 17.
collected_traffic_source.manual_campaign_id
collected_traffic_source.manual_campaign_name
collected_traffic_source.manual_source
collected_traffic_source.manual_medium
collected_traffic_source.manual_term
collected_traffic_source.manual_content
collected_traffic_source.gclid
collected_traffic_source.dclid
Collected_traffic_source.srsltid
Session-scoped data is available after July 17, 2024 and should be the preferred method for handling sessions that occurred after that date. One issue to consider here is that rows can have "google / organic" as the source / medium, but still also include a Google Ads campaign.
session_traffic_source_last_click.manual_campaign.campaign_id
session_traffic_source_last_click.manual_campaign.campaign_name
session_traffic_source_last_click.manual_campaign.source
session_traffic_source_last_click.manual_campaign.medium
session_traffic_source_last_click.manual_campaign.term
session_traffic_source_last_click.manual_campaign.content
session_traffic_source_last_click.manual_campaign.source_platform
session_traffic_source_last_click.manual_campaign.creative_format
session_traffic_source_last_click.manual_campaign.marketing_tactic
session_traffic_source_last_click.google_ads_campaign
session_traffic_source_last_click.google_ads_campaign.customer_id
session_traffic_source_last_click.google_ads_campaign.account_name
session_traffic_source_last_click.google_ads_campaign.campaign_id
session_traffic_source_last_click.google_ads_campaign.campaign_name
session_traffic_source_last_click.google_ads_campaign.ad_group_id
session_traffic_source_last_click.google_ads_campaign.ad_group_name
Leveraging the Last-Click Attribution Model
GA4 defaults to the last non-direct click attribution model, assigning 100% of conversion credit to the last non-direct touchpoint before a conversion occurs.
Solution: To replicate this model in BigQuery, trace user journeys backward from the conversion event, ignoring direct traffic unless it is the only interaction. Note that this solution will only work for dates after July 2024, for earlier dates a valid comparison for source breakdowns is required.
Example Scenario: Imagine a user interacts with your website as follows:
Day 1: Clicks on a display ad.
Day 2: Arrives via organic search.
Day 3: Clicks on a paid search ad.
Day 4: Arrives directly on the site and completes a purchase.
In this case, the last non-direct click model would attribute 100% of the conversion credit to the paid search ad on Day 4.
Adjusting for the Key Event Lookback Window
The lookback window for key events in GA4 defines how far back in time interactions are considered for attribution credit. The default is 30 days, but it can be extended.
Solution: Ensure your BigQuery calculations align with the selected lookback window. For example, if you’ve chosen a 60-day lookback window, your queries should account for user interactions within this period when assigning attribution.
Considering Reporting Identity
The reporting identity (Blended, Observed or Device-based) impacts how user-based metrics are calculated. Incorrect use of identity in BigQuery can lead to discrepancies between BQ data and the GA4 interface.
Solution: Adjust your queries based on the reporting identity set in GA4. For Blended identities, combine user_pseudo_id and user_id to create unique device identifiers. Note: This solution will not work for properties where modeling is applied.
Handling Issues with gclid/dclid in GA4 Exports
When working with Google Analytics 4 (GA4) data exports in BigQuery, there is a known issue that can cause incorrect attribution of traffic source data, particularly when gclid (Google Ads click identifier) or dclid (Display & Video 360 click identifier) parameters are involved. This problem predominantly affects sessions coming from Google Ads, Display & Video 360, and Campaign Manager 360 traffic. The result is that traffic that should be attributed to Google/cpc (Google Ads) or related sources is sometimes mislabeled as organic or direct traffic.
This issue arises in the BigQuery exports and is related to how the collected_traffic_source field is processed. Although the GA4 UI and API correctly handle and display traffic source attribution, the BigQuery export schema presents challenges when analyzing this data at the session level.
Why This Happens:
The issue stems from the processing of the gclid and dclid parameters in GA4 BigQuery exports. The collected_traffic_source field contains valuable traffic source information for each event, but if this data is not correctly associated with the session, it can result in incorrect or missing attribution for certain traffic sources, especially those from Google Ads.
Until Google fully resolves this issue in the BigQuery export schema, a practical SQL-based workaround is available:
Steps to Address the Issue:
Aggregate Traffic Source Data at the Session Level: Start by compiling all traffic source information at the session level. This gives you a broader view of the traffic sources associated with each session.
Identify Sessions with gclid Parameters: Next, filter sessions that contain gclid parameters in the page_location field. These parameters indicate that the session originated from Google Ads.
Overwrite Source and Medium Values: For sessions identified in the previous step, overwrite the source and medium fields with fixed values such as “google/cpc”. This adjustment helps correct the attribution for these specific sessions.
While this method resolves source and medium attribution issues, it may not fully correct campaign data, especially for newer ad formats like Performance Max campaigns. To achieve more accurate campaign attribution, consider implementing Google Ads Data Transfer.
Enhancing Campaign Data Accuracy:
Extract gclid data: Start by extracting the gclid parameter for each session where it’s present.
Join with Google Ads data: Use the Google Ads Data Transfer in BigQuery to join this gclid data with detailed campaign information.
Retrieve campaign details: This process will allow you to retrieve accurate campaign names and other relevant details on a session-by-session basis.
Important Prerequisite: Before proceeding with this solution, ensure that your Google Ads Data Transfer to BigQuery is properly set up.
Conclusion
Navigating the complexities of session-level attribution in Google Analytics 4 using BigQuery may seem daunting, but with a clear understanding of the challenges and the right tools at your disposal, it becomes a manageable and rewarding task. By rebuilding session-level data, correctly referencing traffic sources, and leveraging GA4's last-click attribution model, you can replicate the metrics seen in the GA4 interface with remarkable accuracy.
Moreover, addressing common issues, such as those related to gclid/dclid parameters, further enhances your ability to derive meaningful insights from your data. While GA4’s shift in data handling requires a new approach, the flexibility and power of BigQuery allow you to customize and optimize your analysis to meet your specific needs.
By mastering these techniques, you'll be well-equipped to extract valuable insights from your GA4 data, enabling more informed, data-driven decisions that can significantly impact your business outcomes.