---
stage: Analytics
group: Product Intelligence
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Snowplow

Snowplow is an enterprise-grade marketing and Product Intelligence platform that tracks how users engage with our website and application.

[Snowplow](https://snowplowanalytics.com) consists of several loosely coupled sub-systems:

- **Trackers** fire Snowplow events. Snowplow has twelve trackers that cover web, mobile, desktop, server, and IoT.
- **Collectors** receive Snowplow events from trackers. We use different event collectors that synchronize events to Amazon S3, Apache Kafka, or Amazon Kinesis.
- **Enrich** cleans raw Snowplow events, enriches them, and puts them into storage. There is a Hadoop-based enrichment process, and a Kinesis-based or Kafka-based process.
- **Storage** stores Snowplow events. We store the Snowplow events in a flat file structure on S3, and in the Redshift and PostgreSQL databases.
- **Data modeling** joins event-level data with other data sets, aggregates them into smaller data sets, and applies business logic. This produces a clean set of tables for data analysis. We use data models for Redshift and Looker.
- **Analytics** are performed on Snowplow events or on aggregate tables.

![snowplow_flow](../img/snowplow_flow.png)

## Enable Snowplow tracking

Tracking can be enabled at:

- The instance level, which enables tracking on both the frontend and backend layers.
- The user level. User tracking can be disabled on a per-user basis.
  GitLab respects the [Do Not Track](https://www.eff.org/issues/do-not-track) standard, so any user who has enabled the Do Not Track option in their browser is not tracked at a user level.

Snowplow tracking is enabled on GitLab.com, and we use it for most of our tracking strategy.

To enable Snowplow tracking on a self-managed instance:

1. On the top bar, select **Menu > Admin**, then select **Settings > General**.
   Alternatively, go to `admin/application_settings/general` in your browser.

1. Expand **Snowplow**.

1. Select **Enable Snowplow tracking** and enter your Snowplow configuration information. For example:

   | Name               | Value                         |
   |--------------------|-------------------------------|
   | Collector hostname | `your-snowplow-collector.net` |
   | App ID             | `gitlab`                      |
   | Cookie domain      | `.your-gitlab-instance.com`   |

1. Select **Save changes**.

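
The two levels described in this section combine into one decision: an event is sent only when the instance has tracking enabled and the user's browser has not opted out. The following Python sketch illustrates that logic under stated assumptions. The function name `tracking_allowed` and the `instance_enabled` flag are hypothetical and do not describe GitLab's implementation. The sketch relies only on the `DNT: 1` request header that browsers send when Do Not Track is turned on.

```python
from typing import Mapping


def tracking_allowed(instance_enabled: bool, headers: Mapping[str, str]) -> bool:
    """Return True only if the instance tracks and the browser has not set DNT.

    Hypothetical helper for illustration; not part of GitLab.
    """
    if not instance_enabled:
        return False
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    # Browsers with Do Not Track enabled send "DNT: 1".
    return normalized.get("dnt") != "1"


print(tracking_allowed(True, {"DNT": "1"}))
print(tracking_allowed(True, {}))
```
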
## Snowplow request flow

The following example shows a basic request/response flow between these components:

- Snowplow JS / Ruby Trackers on GitLab.com
- [GitLab.com Snowplow Collector](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/snowplow/index.md)
- The GitLab S3 Bucket
- The GitLab Snowflake Data Warehouse
- Sisense

```mermaid
sequenceDiagram
    participant Snowplow JS (Frontend)
    participant Snowplow Ruby (Backend)
    participant GitLab.com Snowplow Collector
    participant S3 Bucket
    participant Snowflake DW
    participant Sisense Dashboards
    Snowplow JS (Frontend) ->> GitLab.com Snowplow Collector: FE Tracking event
    Snowplow Ruby (Backend) ->> GitLab.com Snowplow Collector: BE Tracking event
    loop Process using Kinesis Stream
      GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Log raw events
      GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Enrich events
      GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Write to disk
    end
    GitLab.com Snowplow Collector ->> S3 Bucket: Kinesis Firehose
    Note over GitLab.com Snowplow Collector, S3 Bucket: Pseudonymization
    S3 Bucket->>Snowflake DW: Import data
    Snowflake DW->>Snowflake DW: Transform data using dbt
    Snowflake DW->>Sisense Dashboards: Data available for querying
```

For more details about the architecture, see [Snowplow infrastructure](infrastructure.md).

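
The diagram places a pseudonymization step between the collector and the S3 bucket, but this page does not say how that step works. As an illustration only, one common approach is a keyed hash (HMAC-SHA256) of the identifier, which keeps values joinable across events without exposing the raw ID. The function names, the `user_id` field, and the choice of HMAC are all assumptions in this sketch and not a description of the GitLab pipeline.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a stable keyed hash of value (illustrative only)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_event(event: dict, secret: bytes, fields=("user_id",)) -> dict:
    """Copy the event, replacing the listed fields with their keyed hash."""
    cleaned = dict(event)
    for field in fields:
        # Leave missing or empty identifiers untouched.
        if cleaned.get(field) is not None:
            cleaned[field] = pseudonymize(str(cleaned[field]), secret)
    return cleaned


# Hypothetical raw event; the secret here is a placeholder.
raw = {"user_id": 42, "se_action": "click_button"}
safe = pseudonymize_event(raw, secret=b"example-secret")
print(safe["se_action"], len(safe["user_id"]))
```

The same input and secret always produce the same token, so counts per user still work on the pseudonymized data.
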
## Structured event taxonomy

Click events must be consistent. If each feature captures events differently, it can be difficult
to perform analysis.

Each click event provides attributes that describe the event.

| Attribute | Type    | Required | Description |
| --------- | ------- | -------- | ----------- |
| category  | text    | true     | The page or backend section of the application. Unless infeasible, use the Rails page attribute by default in the frontend, and namespace + class name on the backend. |
| action    | text    | true     | The action the user takes, or aspect that's being instrumented. The first word must describe the action or aspect. For example, clicks must be `click`, activations must be `activate`, creations must be `create`. Use underscores to describe what was acted on. For example, activating a form field is `activate_form_input`, an interface action like clicking on a dropdown is `click_dropdown`, a behavior like creating a project record from the backend is `create_project`. |
| label     | text    | false    | The specific element or object to act on. This can be one of the following: the label of the element, for example, a tab labeled 'Create from template' for `create_from_template`; a unique identifier if no text is available, for example, `groups_dropdown_close` for closing the Groups dropdown in the top bar; or the name or title attribute of a record being created. |
| property  | text    | false    | Any additional property of the element, or object being acted on. |
| value     | decimal | false    | Describes a numeric value (decimal) directly related to the event. This could be the value of an input. For example, `10` when clicking `internal` visibility. |

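
The attribute rules above can be checked mechanically before an event is fired. The following Python sketch shows one way to do that. The `StructuredEvent` class and `ACTION_PATTERN` are hypothetical names introduced for illustration, and the snake_case check is an assumed reading of "use underscores to describe what was acted on", not a GitLab API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Lowercase words joined by underscores, for example "click_dropdown".
ACTION_PATTERN = re.compile(r"^[a-z]+(_[a-z0-9]+)*$")


@dataclass(frozen=True)
class StructuredEvent:
    """Hypothetical container mirroring the attribute table."""

    category: str                      # required
    action: str                        # required
    label: Optional[str] = None
    property_: Optional[str] = None    # trailing underscore avoids the builtin name
    value: Optional[float] = None

    def __post_init__(self) -> None:
        if not self.category:
            raise ValueError("category is required")
        if not ACTION_PATTERN.match(self.action):
            raise ValueError(f"action must be snake_case, got {self.action!r}")
        if self.value is not None and not isinstance(self.value, (int, float)):
            raise ValueError("value must be numeric")


event = StructuredEvent(
    category="projects:registry:index",
    action="click_button",
    label="registry_delete",
)
# The first word of the action names what the user did.
print(event.action.split("_")[0])
```
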
### Examples

| Category*                  | Label                        | Action                   | Property**       | Value         |
|----------------------------|------------------------------|--------------------------|------------------|:-------------:|
| `[root:index]`             | `main_navigation`            | `click_navigation_link`  | `[link_label]`   | -             |
| `[groups:boards:show]`     | `toggle_swimlanes`           | `click_toggle_button`    | -                | `[is_active]` |
| `[projects:registry:index]` | `registry_delete`           | `click_button`           | -                | -             |
| `[projects:registry:index]` | `registry_delete`           | `confirm_deletion`       | -                | -             |
| `[projects:blob:show]`     | `congratulate_first_pipeline` | `click_button`          | `[human_access]` | -             |
| `[projects:clusters:new]`  | `chart_options`              | `generate_link`          | `[chart_link]`   | -             |
| `[projects:clusters:new]`  | `chart_options`              | `click_add_label_button` | `[label_id]`     | -             |

_* If you choose to omit the category, you can use the default._<br>
_** Use property for variable strings._

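
To show how a row from the table travels to a collector, the following Python sketch maps the taxonomy attributes to the structured event fields of the Snowplow tracker protocol (`e=se`, `se_ca`, `se_ac`, `se_la`, `se_pr`, `se_va`, and `aid` for the app ID). The function name `structured_event_params` is hypothetical, and the real trackers add further fields that are omitted here. The default app ID `gitlab` is taken from the configuration example on this page.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def structured_event_params(
    category: str,
    action: str,
    label: Optional[str] = None,
    property_: Optional[str] = None,
    value: Optional[float] = None,
    app_id: str = "gitlab",
) -> Dict[str, str]:
    """Map taxonomy attributes to tracker protocol fields (illustrative)."""
    params = {"e": "se", "aid": app_id, "se_ca": category, "se_ac": action}
    optional = {"se_la": label, "se_pr": property_, "se_va": value}
    # Omit attributes that were not provided.
    params.update({k: str(v) for k, v in optional.items() if v is not None})
    return params


# Second row of the table, with a placeholder value of 1 for [is_active].
params = structured_event_params(
    category="groups:boards:show",
    action="click_toggle_button",
    label="toggle_swimlanes",
    value=1,
)
print(urlencode(params))
```
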
### Reference SQL

#### Last 20 `reply_comment_button` events

```sql
SELECT
  session_id,
  event_id,
  event_label,
  event_action,
  event_property,
  event_value,
  event_category,
  contexts
FROM legacy.snowplow_structured_events_all
WHERE
  event_label = 'reply_comment_button'
  AND event_action = 'click_button'
  -- AND event_category = 'projects:issues:show'
  -- AND event_value = 1
ORDER BY collector_tstamp DESC
LIMIT 20
```

#### Last 100 page view events

```sql
SELECT
  -- page_url,
  -- page_title,
  -- referer_url,
  -- marketing_medium,
  -- marketing_source,
  -- marketing_campaign,
  -- browser_window_width,
  -- device_is_mobile
  *
FROM legacy.snowplow_page_views_30
ORDER BY page_view_start DESC
LIMIT 100
```

#### Top 20 users who fired `reply_comment_button` in the last 30 days

```sql
SELECT
  COUNT(*) AS hits,
  se_action,
  se_category,
  gsc_pseudonymized_user_id
FROM legacy.snowplow_gitlab_events_30
WHERE
  se_label = 'reply_comment_button'
  AND gsc_pseudonymized_user_id IS NOT NULL
GROUP BY gsc_pseudonymized_user_id, se_category, se_action
ORDER BY COUNT(*) DESC
LIMIT 20
```

#### Query JSON formatted data

```sql
SELECT
  derived_tstamp,
  contexts:data[0]:data:extra:old_format AS CURRENT_FORMAT,
  contexts:data[0]:data:extra:value AS UPDATED_FORMAT
FROM legacy.snowplow_structured_events_all
WHERE event_action IN ('wiki_format_updated')
ORDER BY derived_tstamp DESC
LIMIT 100
```

### Web-specific parameters

Snowplow JavaScript adds [web-specific parameters](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/snowplow-tracker-protocol/#Web-specific_parameters) to all web events by default.

## Related topics

- [Snowplow data structure](https://docs.snowplowanalytics.com/docs/understanding-your-pipeline/canonical-event/)
- [Our Iglu schema registry](https://gitlab.com/gitlab-org/iglu)
- [List of events used in our codebase (Event Dictionary)](https://metrics.gitlab.com/snowplow/)
- [Product Intelligence Guide](https://about.gitlab.com/handbook/product/product-intelligence-guide/)
- [Service Ping Guide](../service_ping/index.md)
- [Product Intelligence Direction](https://about.gitlab.com/direction/product-intelligence/)
- [Data Analysis Process](https://about.gitlab.com/handbook/business-technology/data-team/#data-analysis-process/)
- [Data for Product Managers](https://about.gitlab.com/handbook/business-technology/data-team/programs/data-for-product-managers/)
- [Data Infrastructure](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/)