An example of Robust AI
The Platform Event Tracking Pipeline at GetYourGuide
Data engineering deployed.
By Bora Kaplan, Data Engineer, and Thiago Rigo, Engineering Manager, GetYourGuide
GetYourGuide is a startup unicorn for travel experiences, with over 45 million tickets sold for tours and activities in 150+ countries. Users book and enjoy incredible experiences. For that to happen, we have to know what an incredible experience means to our users. We learn about our users through a dedicated pipeline and services that track events of user behavior on the platform.
Dr. Fateme Kamali, Data Scientist and AI Guild member
“Thiago and Bora demonstrate the evolution of the event tracking pipeline by providing snapshots of the architecture in 2016, 2018, and 2021. Tracking user events is essential to enable data analytics and machine learning.”
This article outlines the evolution of user event tracking, that is, knowing the actions users take on the web and app versions of the GetYourGuide platform: from humble beginnings with schemaless JSON, through strongly typed tracking with Thrift, and back to JSON with OpenAPI and AsyncAPI. Read about each approach’s pain points and benefits and see the tech stack and architecture. Finally, read what GetYourGuide is planning next.
2016: Logs parsing and Ping API
Starting with a straightforward approach, our infrastructure team set up streaming to our data lake for all the webserver logs. The server logs were in the format of TSV files, and by parsing them, we already had enough data to capture user behavior on the website.
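As a rough illustration of this kind of log parsing, here is a minimal Python sketch. The column layout in `LOG_COLUMNS` and the sample lines are assumptions made for the example; the article does not describe the actual log format.

```python
import csv
import io

# Hypothetical column layout; the real log format is not described in the article.
LOG_COLUMNS = ["timestamp", "method", "path", "status", "visitor_id"]


def parse_access_log(tsv_text):
    """Turn tab-separated access-log lines into dictionaries keyed by column name."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        if len(row) != len(LOG_COLUMNS):
            # Log formats can change; skip lines that do not match the expected layout.
            continue
        record = dict(zip(LOG_COLUMNS, row))
        record["status"] = int(record["status"])
        records.append(record)
    return records


sample = (
    "2016-05-01T10:00:00Z\tGET\t/activity/123\t200\tv-1\n"
    "2016-05-01T10:00:05Z\tGET\t/checkout\t200\tv-1\n"
    "malformed line without tabs\n"
)
page_views = parse_access_log(sample)
```

Silently skipping lines that do not match the layout keeps the parser running when the format drifts, but it also hides the drift, which is one reason logs are a fragile source for behavioral data.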
For the mobile app, we used an API called Ping that received event data and wrote it as JSON to the data lake. As an example of what we call an event: when you open the app, it triggers an event called “app open.” With this approach, we did not have to maintain much infrastructure, and it allowed us to capture richer data than the database data alone.
However, this approach had some shortcomings: server log data isn’t rich enough, and its format can change. Also, the app data was not of high quality. The Ping service simply received the event and wrote it directly to the data lake without further checking or verification. Hence, the data wouldn’t always conform to a schema. Lastly, the Ping service was no longer maintained, which prevented us from extending it to add extra data, like geolocation.
Picture: JSON & Access logs architecture
2018: Focus on events and Thrift
Leaving behind the Ping API, we built a v2 architecture focused on the events only. We introduced three components:
- Thrift Schemas: Apache Thrift is a robust open-source software framework that defines data types and service interfaces in a simple definition file.
- Collector: An API we created, replacing Ping. It receives events data from the app and validates the data against the Thrift definition, checking the data types. Only the correct data flows downstream.
- Analytics Quickline: Reads the Thrift binary data from Kafka and writes it out as Parquet. Parquet is a more efficient data lake format than TSV or JSON.
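The Collector's type check can be sketched in plain Python as follows. This is not Thrift: the dictionary `APP_OPEN_SCHEMA` merely stands in for a Thrift struct definition, and the field names and the `collect` function are hypothetical. For simplicity, the sketch treats every field as required.

```python
# Hypothetical schema for an "app open" event: field name -> expected Python type.
# It stands in for a Thrift struct definition; it is not Thrift itself.
APP_OPEN_SCHEMA = {"event_name": str, "visitor_id": str, "timestamp_ms": int}


def type_errors(event, schema):
    """Return a list of problems; an empty list means the event matches the schema."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif type(event[field]) is not expected_type:
            errors.append(f"wrong type for {field}")
    return errors


def collect(events, schema):
    """Split incoming events into those sent downstream and those rejected."""
    accepted, rejected = [], []
    for event in events:
        if type_errors(event, schema):
            rejected.append(event)
        else:
            accepted.append(event)
    return accepted, rejected


incoming = [
    {"event_name": "app open", "visitor_id": "v-1", "timestamp_ms": 1525160400000},
    {"event_name": "app open", "visitor_id": "v-2", "timestamp_ms": "yesterday"},
]
accepted, rejected = collect(incoming, APP_OPEN_SCHEMA)
```

Here only the first event would flow downstream; the second is rejected because `timestamp_ms` is a string rather than an integer.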
This new pipeline benefits from strong schema definitions, whose strict typing ensures data quality. And since we own the API, we can do basic data enrichment, for example with geolocation, which is essential for us. It is also a more stable source than logs.
Some shortcomings: There was an explosion of event types. We went from 30 to 250 types, so our team became a bottleneck controlling the schema definitions. We observed low ownership among the producer teams regarding the quality of their data, since our engineering team was usually the first to be contacted when something went wrong. Moreover, in the way we used Thrift, all the properties were marked as optional. Additional properties were often missing from the payload, and we did not have a good way to monitor this.
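To make the monitoring gap concrete, the following Python sketch computes how often each expected property is missing from a batch of events. The function `missing_rates`, the event name, and the property names are hypothetical; the article does not describe how such monitoring was or would be built.

```python
from collections import Counter


def missing_rates(events, expected_properties):
    """Share of events in which each expected property is absent or None."""
    missing = Counter()
    for event in events:
        for prop in expected_properties:
            if event.get(prop) is None:
                missing[prop] += 1
    total = len(events)
    return {
        prop: (missing[prop] / total if total else 0.0)
        for prop in expected_properties
    }


batch = [
    {"event_name": "booking", "tour_id": 1, "currency": "EUR"},
    {"event_name": "booking", "tour_id": 2},
    {"event_name": "booking", "tour_id": 3, "currency": None},
    {"event_name": "booking", "tour_id": 4, "currency": "USD"},
]
rates = missing_rates(batch, ["tour_id", "currency"])
```

With every property optional, a payload without `currency` still passes type validation, so a report like this is one way the problem could be surfaced.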
Bora Kaplan, Data Engineer, GetYourGuide
“To be able to provide people incredible experiences we have to learn about them. And to do that, we cannot just rely on our services’ databases because it’s not going to track every single action a user takes. That is why it is paramount to build specific pipelines and services for the purpose of users’ events tracking.”
2021: Real-time enrichment and decentralized schemas
We took all we learned throughout the years and developed a new approach using OpenAPI & AsyncAPI. This allowed each team to own its schema and allowed us to do real-time enrichment. We kept the Collector and Quickline from the previous architecture and added two new components:
- Schemapi Registry: Contains all the OpenAPI & AsyncAPI definitions
- Enrichment Pipeline: Multiple streaming applications
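One way to picture an enrichment pipeline made of several applications is as a chain of small steps, each adding metadata to the events passing through. The Python sketch below models each step as a generator. The lookup table, the user-agent marker, and all function names are assumptions for illustration; the real pipeline consists of streaming applications whose internals the article does not show.

```python
# Hypothetical lookup table; a real pipeline would likely query a geolocation service.
IP_TO_COUNTRY = {"203.0.113.7": "DE", "198.51.100.9": "US"}


def enrich_geolocation(events):
    """Add a country code derived from the client IP address."""
    for event in events:
        enriched = dict(event)
        enriched["country"] = IP_TO_COUNTRY.get(event.get("ip"), "unknown")
        yield enriched


def enrich_platform(events):
    """Add a platform label based on a hypothetical user-agent marker."""
    for event in events:
        enriched = dict(event)
        user_agent = event.get("user_agent", "")
        enriched["platform"] = "app" if "MobileApp" in user_agent else "web"
        yield enriched


def run_pipeline(events, steps):
    """Pass the event stream through each enrichment step in order."""
    stream = iter(events)
    for step in steps:
        stream = step(stream)
    return list(stream)


raw_events = [
    {"event_name": "app open", "ip": "203.0.113.7", "user_agent": "MobileApp/5.1"},
    {"event_name": "page view", "ip": "192.0.2.1", "user_agent": "Mozilla/5.0"},
]
enriched_events = run_pipeline(raw_events, [enrich_geolocation, enrich_platform])
```

Each step copies the event before adding fields, so the raw events stay untouched and steps can be added or reordered independently.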
With this approach, we have better data quality and more metadata available. Using OpenAPI and AsyncAPI from the get-go, we already have richer property validation: syntactic and semantic validation. Through the Analytics Enrichment Pipeline, we provide real-time enrichment of the data. And we are using JSON instead of Thrift because it is a company standard, besides being human-readable.
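The difference between the two kinds of validation can be sketched in Python: the first checks structure and types, the second checks that well-typed values also make sense. The event shape and the rules below are hypothetical. In OpenAPI-style schemas, rules like these are typically expressed declaratively with keywords such as `minimum` and `enum` rather than hand-written code.

```python
REQUIRED_TYPES = {"event_name": str, "tour_id": int, "price": float, "currency": str}
KNOWN_CURRENCIES = {"EUR", "USD", "GBP"}  # hypothetical allow-list


def syntactic_errors(event):
    """Structure checks: required properties are present and have the right type."""
    errors = []
    for name, expected_type in REQUIRED_TYPES.items():
        if name not in event:
            errors.append(f"{name}: missing")
        elif type(event[name]) is not expected_type:
            errors.append(f"{name}: expected {expected_type.__name__}")
    return errors


def semantic_errors(event):
    """Meaning checks: values have the right type but must also be plausible."""
    errors = []
    if event["price"] < 0:
        errors.append("price: must not be negative")
    if event["currency"] not in KNOWN_CURRENCIES:
        errors.append("currency: unknown code")
    return errors


def validate(event):
    errors = syntactic_errors(event)
    # Only run semantic checks when the structure is sound.
    return errors if errors else semantic_errors(event)


valid_event = {"event_name": "booking", "tour_id": 12, "price": 49.0, "currency": "EUR"}
bad_type_event = {"event_name": "booking", "tour_id": "12", "price": 49.0, "currency": "EUR"}
bad_value_event = {"event_name": "booking", "tour_id": 12, "price": -5.0, "currency": "XXX"}
```

A strictly typed schema alone would accept `bad_value_event`, since every field has the right type; only the semantic rules catch the negative price and the unknown currency.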
Because everything is decentralized, teams are more independent: each can create its schema, publish it, and set up alerts, which leads to better monitoring and control.
The shortcomings are different: This approach is more complex since we added more components and more code to maintain. Each team needs to publish and validate their schemas, and we need to introduce new tools around that.
Analytics and ML use cases
These events fuel many different use cases at GetYourGuide. One relevant metric that can be calculated with user events is the click-through rate of each tour. It lets us understand which tours perform well for each user profile, which supports the recommendation algorithm.
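As a small worked example, click-through rate per tour is the number of clicks divided by the number of impressions. The Python sketch below computes it from a flat list of events. The event names `tour_impression` and `tour_click` are hypothetical, not the platform's actual event types.

```python
from collections import defaultdict


def click_through_rates(events):
    """CTR per tour = clicks / impressions, for tours with at least one impression."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for event in events:
        if event["event_name"] == "tour_impression":
            impressions[event["tour_id"]] += 1
        elif event["event_name"] == "tour_click":
            clicks[event["tour_id"]] += 1
    return {tour: clicks[tour] / shown for tour, shown in impressions.items()}


events = (
    [{"event_name": "tour_impression", "tour_id": 1}] * 4
    + [{"event_name": "tour_click", "tour_id": 1}]
    + [{"event_name": "tour_impression", "tour_id": 2}] * 2
)
ctr = click_through_rates(events)
```

Grouping additionally by a user-profile attribute would give the per-profile view mentioned above; that step is omitted here to keep the sketch short.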
Another big internal use case is marketing attribution. The attribution tracking event is triggered every time a user arrives at GetYourGuide. It contains essential marketing campaign information such as the referral channel.
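A minimal Python sketch of how such an attribution event might be assembled on arrival is shown below. It assumes the campaign information travels in UTM query parameters (`utm_medium`, `utm_campaign`), which is a common convention but is not confirmed by the article. The function name and event fields are likewise hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def build_attribution_event(landing_url, referrer):
    """Extract campaign information from the landing URL and the referrer."""
    query = parse_qs(urlparse(landing_url).query)
    # parse_qs maps each parameter to a list of values; take the first one.
    channel = query.get("utm_medium", ["direct"])[0]
    campaign = query.get("utm_campaign", [None])[0]
    return {
        "event_name": "attribution_tracking",
        "referral_channel": channel,
        "campaign": campaign,
        "referrer_host": urlparse(referrer).netloc or None,
    }


from_email = build_attribution_event(
    "https://www.example.com/?utm_medium=email&utm_campaign=summer",
    "https://mail.example.org/inbox",
)
direct_visit = build_attribution_event("https://www.example.com/", "")
```

Falling back to a `direct` channel when no parameters are present is a design choice of this sketch, so that every arrival still yields one attribution event.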
And based on the events, we have built an experimentation platform. It is an in-house tool for A/B experiments to understand better which variations are working.
What is next
This engineering effort is an ongoing project; thus, we want to ensure a smooth migration phase and provide documentation for producers and consumers.
We will reduce the complexity and grey areas of some tools. OpenAPI and AsyncAPI, for example, are great tools, but they can sometimes be a black box.
We will provide tooling for data discoverability, because we collect many events for people to use and create value from. We want to make sure they know these data exist and know how to use them correctly.
And after the migration phase is complete, we need to ensure data quality and automate anomaly detection.
This data engineering use case in production is part of the AI Guild #datalift series advancing AI adoption for startups and corporates. The AI Guild works with companies and practitioners on best practices for deploying data analytics and machine learning.