debian-mirror-gitlab/doc/development/snowplow/index.md
2022-08-13 15:12:31 +05:30

199 lines
9.1 KiB
Markdown

---
stage: Analytics
group: Product Intelligence
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Snowplow
Snowplow is an enterprise-grade marketing and Product Intelligence platform that tracks how users engage with our website and application.
[Snowplow](https://snowplowanalytics.com) consists of several loosely-coupled sub-systems:
- **Trackers** fire Snowplow events. Snowplow has twelve trackers that cover web, mobile, desktop, server, and IoT.
- **Collectors** receive Snowplow events from trackers. We use different event collectors that synchronize events to Amazon S3, Apache Kafka, or Amazon Kinesis.
- **Enrich** cleans raw Snowplow events, enriches them, and puts them into storage. There is a Hadoop-based enrichment process, and a Kinesis-based or Kafka-based process.
- **Storage** stores Snowplow events. We store the Snowplow events in a flat file structure on S3, and in the Redshift and PostgreSQL databases.
- **Data modeling** joins event-level data with other data sets, aggregates them into smaller data sets, and applies business logic. This produces a clean set of tables for data analysis. We use data models for Redshift and Looker.
- **Analytics** are performed on Snowplow events or on aggregate tables.
![snowplow_flow](../img/snowplow_flow.png)
## Enable Snowplow tracking
Tracking can be enabled at:
- The instance level, which enables tracking on both the frontend and backend layers.
- The user level. User tracking can be disabled on a per user basis.
GitLab respects the [Do Not Track](https://www.eff.org/issues/do-not-track) standard, so any user who has enabled the Do Not Track option in their browser is not tracked at a user level.
Snowplow tracking is enabled on GitLab.com, and we use it for most of our tracking strategy.
To enable Snowplow tracking on a self-managed instance:
1. On the top bar, select **Menu > Admin**, then select **Settings > General**.
Alternatively, go to `admin/application_settings/general` in your browser.
1. Expand **Snowplow**.
1. Select **Enable Snowplow tracking** and enter your Snowplow configuration information. For example:
| Name | Value |
|--------------------|-------------------------------|
| Collector hostname | `your-snowplow-collector.net` |
| App ID | `gitlab` |
| Cookie domain | `.your-gitlab-instance.com` |
1. Select **Save changes**.
## Snowplow request flow
The following example shows a basic request/response flow between the following components:
- Snowplow JS / Ruby Trackers on GitLab.com
- [GitLab.com Snowplow Collector](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/snowplow/index.md)
- The GitLab S3 Bucket
- The GitLab Snowflake Data Warehouse
- Sisense:
```mermaid
sequenceDiagram
participant Snowplow JS (Frontend)
participant Snowplow Ruby (Backend)
participant GitLab.com Snowplow Collector
participant S3 Bucket
participant Snowflake DW
participant Sisense Dashboards
Snowplow JS (Frontend) ->> GitLab.com Snowplow Collector: FE Tracking event
Snowplow Ruby (Backend) ->> GitLab.com Snowplow Collector: BE Tracking event
loop Process using Kinesis Stream
GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Log raw events
GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Enrich events
GitLab.com Snowplow Collector ->> GitLab.com Snowplow Collector: Write to disk
end
GitLab.com Snowplow Collector ->> S3 Bucket: Kinesis Firehose
Note over GitLab.com Snowplow Collector, S3 Bucket: Pseudonymization
S3 Bucket->>Snowflake DW: Import data
Snowflake DW->>Snowflake DW: Transform data using dbt
Snowflake DW->>Sisense Dashboards: Data available for querying
```
For more details about the architecture, see [Snowplow infrastructure](infrastructure.md).
## Structured event taxonomy
Click events must be consistent. If each feature captures events differently, it can be difficult
to perform analysis.
Each click event provides attributes that describe the event.
| Attribute | Type | Required | Description |
| --------- | ------- | -------- | ----------- |
| category | text | true | The page or backend section of the application. Unless infeasible, use the Rails page attribute by default in the frontend, and namespace + class name on the backend. |
| action | text | true | The action the user takes, or aspect that's being instrumented. The first word must describe the action or aspect. For example, clicks must be `click`, activations must be `activate`, creations must be `create`. Use underscores to describe what was acted on. For example, activating a form field is `activate_form_input`, an interface action like clicking on a dropdown is `click_dropdown`, a behavior like creating a project record from the backend is `create_project`. |
| label | text | false | The specific element or object to act on. This can be one of the following: the label of the element, for example, a tab labeled 'Create from template' for `create_from_template`; a unique identifier if no text is available, for example, `groups_dropdown_close` for closing the Groups dropdown in the top bar; or the name or title attribute of a record being created. |
| property | text | false | Any additional property of the element, or object being acted on. |
| value | decimal | false | Describes a numeric value (decimal) directly related to the event. This could be the value of an input. For example, `10` when clicking `internal` visibility. |
### Examples
| Category* | Label | Action | Property** | Value |
|-------------|------------------|-----------------------|----------|:-----:|
| `[root:index]` | `main_navigation` | `click_navigation_link` | `[link_label]` | - |
| `[groups:boards:show]` | `toggle_swimlanes` | `click_toggle_button` | - | `[is_active]` |
| `[projects:registry:index]` | `registry_delete` | `click_button` | - | - |
| `[projects:registry:index]` | `registry_delete` | `confirm_deletion` | - | - |
| `[projects:blob:show]` | `congratulate_first_pipeline` | `click_button` | `[human_access]` | - |
| `[projects:clusters:new]` | `chart_options` | `generate_link` | `[chart_link]` | - |
| `[projects:clusters:new]` | `chart_options` | `click_add_label_button` | `[label_id]` | - |
_* If you choose to omit the category you can use the default._<br>
_** Use property for variable strings._
### Reference SQL
#### Last 20 `reply_comment_button` events
```sql
SELECT
session_id,
event_id,
event_label,
event_action,
event_property,
event_value,
event_category,
contexts
FROM legacy.snowplow_structured_events_all
WHERE
event_label = 'reply_comment_button'
AND event_action = 'click_button'
-- AND event_category = 'projects:issues:show'
-- AND event_value = 1
ORDER BY collector_tstamp DESC
LIMIT 20
```
#### Last 100 page view events
```sql
SELECT
-- page_url,
-- page_title,
-- referer_url,
-- marketing_medium,
-- marketing_source,
-- marketing_campaign,
-- browser_window_width,
-- device_is_mobile
*
FROM legacy.snowplow_page_views_30
ORDER BY page_view_start DESC
LIMIT 100
```
#### Top 20 users who fired `reply_comment_button` in the last 30 days
```sql
SELECT
count(*) as hits,
se_action,
se_category,
gsc_pseudonymized_user_id
FROM legacy.snowplow_gitlab_events_30
WHERE
se_label = 'reply_comment_button'
AND gsc_pseudonymized_user_id IS NOT NULL
GROUP BY gsc_pseudonymized_user_id, se_category, se_action
ORDER BY count(*) DESC
LIMIT 20
```
#### Query JSON formatted data
```sql
SELECT
derived_tstamp,
contexts:data[0]:data:extra:old_format as CURRENT_FORMAT,
contexts:data[0]:data:extra:value as UPDATED_FORMAT
FROM legacy.snowplow_structured_events_all
WHERE event_action in ('wiki_format_updated')
ORDER BY derived_tstamp DESC
LIMIT 100
```
### Web-specific parameters
Snowplow JavaScript adds [web-specific parameters](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/snowplow-tracker-protocol/#Web-specific_parameters) to all web events by default.
## Related topics
- [Snowplow data structure](https://docs.snowplowanalytics.com/docs/understanding-your-pipeline/canonical-event/)
- [Our Iglu schema registry](https://gitlab.com/gitlab-org/iglu)
- [List of events used in our codebase (Event Dictionary)](https://metrics.gitlab.com/snowplow/)
- [Product Intelligence Guide](https://about.gitlab.com/handbook/product/product-intelligence-guide/)
- [Service Ping Guide](../service_ping/index.md)
- [Product Intelligence Direction](https://about.gitlab.com/direction/product-intelligence/)
- [Data Analysis Process](https://about.gitlab.com/handbook/business-technology/data-team/#data-analysis-process/)
- [Data for Product Managers](https://about.gitlab.com/handbook/business-technology/data-team/programs/data-for-product-managers/)
- [Data Infrastructure](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/)