

Data has always been a pivotal aspect of Halodoc's decision making. As the number of services on our platform grows, mapping a user's journey across the platform has become one of our most critical initiatives in the Data Engineering space. We started with understanding users' buying behaviour and bucketing them into cohorts. The next step, however, was to gain a better idea of the behavioural aspects of their sessions.

At Halodoc, we use CleverTap to retain user engagement on our platform. To get insights into a user's journey in its entirety, we need to merge CleverTap events data with our backend transactional data. Currently, all our transactional data resides in Redshift, which consolidates data from different microservices (TeleConsultation, Pharmacy Delivery, Appointments, Insurance, and so on). CleverTap captures and stores clickstream data, producing huge volumes of data on a daily basis. Loading this data directly into our Redshift cluster was very expensive, as it required significant scaling of the existing cluster to accommodate it. So, instead, we planned to build a data lake where CleverTap events data is stored in S3, and to leverage Redshift Spectrum or Amazon Athena to query it. Having Redshift Spectrum in place removed the need to load the data into Redshift while still letting us join the S3 data with the transactional data already in Redshift.

How did we start to build a scalable end-to-end pipeline

We did a PoC on which data formats would be best suited for efficient processing and cost saving in terms of storage. We decided on the Parquet file format with partitioning based on date (i.e. year or month) for efficient reads while accessing the data. The reason for choosing the Parquet format was its columnar layout and compression, which reduce both the data scanned per query and the storage footprint.
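
As an illustration of that layout decision (not code from our pipeline), the snippet below writes a day's worth of events to S3 as compressed Parquet partitioned by year and month. The bucket, prefix and column names are hypothetical, and Hive-style year=/month= folders are used as a close variant of the path layout described later.

```python
import pandas as pd

# Hypothetical sample of exported CleverTap events for one day.
events = pd.DataFrame(
    {
        "event_name": ["app_open", "search", "app_open"],
        "user_id": ["u1", "u2", "u3"],
        "event_ts": pd.to_datetime(
            ["2021-03-01 10:00", "2021-03-01 10:05", "2021-03-01 11:20"]
        ),
    }
)

# Derive the date-based partition columns used in the S3 layout.
events["year"] = events["event_ts"].dt.year
events["month"] = events["event_ts"].dt.month

# Write columnar, compressed Parquet partitioned by year and month.
# Requires the pyarrow and s3fs packages; the bucket/prefix is a placeholder.
events.to_parquet(
    "s3://bucket_name/clevertap/project_name/app_open/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["year", "month"],
)
```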

The overall architecture covers the end-to-end pipeline for exporting CleverTap data to S3 and making it queryable via Athena/Redshift Spectrum. It consists of four steps:

1. Scheduled data export from the CleverTap dashboard to S3.
2. Airflow DAG deployed to organise the files in the destination bucket.
3. AWS Glue crawler created to infer the schema and create/update the data catalog.
4. Query the data using either Athena or Redshift Spectrum.

CleverTap

CleverTap enables you to integrate app analytics and marketing to increase user engagement. CleverTap provides a feature to export the data directly to S3; the detailed documentation is available here - Clevertap Data Export to S3.

Airflow DAG

Airflow is a platform to programmatically author, schedule and monitor workflows. All event files were exported to a single S3 folder by connecting to the CleverTap dashboard. It is not possible to crawl the data using AWS Glue if files with different schemas sit in the same S3 folder, so we decided to write an Airflow DAG that copies these files and organises them in a layout that can easily be crawled into a data catalog in AWS Glue:

bucket_name/clevertap/project_name/event_name/year/month/yyyy-mm-dd.parquet

Since we need to deal with multiple CleverTap projects comprising hundreds of events, we wrote a custom Airflow plugin that organises the files based on the event name passed as a parameter to a custom operator (a sketch follows the list):

- We fetched the S3 object for the particular event file.
- Once we had the S3 object, the next step was to pick the particular S3 file (organised by event_name for the DAG run day) and organise it in the destination bucket.
- We added a check and logged info to deal with DAG file-load failures.
- Once the source S3 key was identified, the final step was to load the file to the destination bucket.
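
A minimal sketch of what such a custom operator could look like, assuming hypothetical bucket names, a hypothetical key naming scheme for the CleverTap export, and made-up operator parameters (the post does not show the plugin's actual code):

```python
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class OrganiseClevertapFileOperator(BaseOperator):
    """Copy one event's export file for the DAG run day into the partitioned layout."""

    def __init__(self, source_bucket, dest_bucket, project_name, event_name, **kwargs):
        super().__init__(**kwargs)
        self.source_bucket = source_bucket
        self.dest_bucket = dest_bucket
        self.project_name = project_name
        self.event_name = event_name

    def execute(self, context):
        run_date = context["ds"]          # e.g. "2021-03-01"
        year, month, _ = run_date.split("-")
        hook = S3Hook(aws_conn_id="aws_default")

        # Source key as exported by CleverTap (hypothetical naming scheme).
        source_key = f"clevertap-export/{self.event_name}/{run_date}.parquet"
        # Destination key following bucket_name/clevertap/project/event/year/month/date.parquet.
        dest_key = (
            f"clevertap/{self.project_name}/{self.event_name}/"
            f"{year}/{month}/{run_date}.parquet"
        )

        # Check for the file and log it instead of failing the whole DAG run.
        if not hook.check_for_key(source_key, bucket_name=self.source_bucket):
            self.log.info("No export found for %s on %s, skipping", self.event_name, run_date)
            return

        # Load the file to the destination bucket in the crawl-friendly layout.
        hook.copy_object(
            source_bucket_key=source_key,
            dest_bucket_key=dest_key,
            source_bucket_name=self.source_bucket,
            dest_bucket_name=self.dest_bucket,
        )
```

In the DAG, one such task can be created per event (or generated dynamically for all events of a project), so the hundreds of event files fan out into the partitioned layout on each run day.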

AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. Once the files are structured in S3, we schedule a crawler that infers the schema on a periodic basis and writes/updates the metadata in the data catalog: every time new files are dropped in the corresponding S3 bucket, the crawler infers the schema of the S3 data and writes the metadata to the data catalog. The data catalog stores the metadata and the schema details of the S3 files, and it provides the flexibility to edit the existing schema created by the crawler. To control the amount of data scanned, we started creating data catalog entries on request, only for the events end users need.
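
For illustration, a crawler with this behaviour can be created and scheduled with boto3; the crawler name, IAM role, database, S3 path and schedule below are placeholders rather than our actual configuration:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that infers the schema of the organised Parquet files
# and writes/updates tables in the Glue Data Catalog. All names are placeholders.
glue.create_crawler(
    Name="clevertap_events_crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="clevertap_events",
    Targets={
        "S3Targets": [
            {"Path": "s3://bucket_name/clevertap/project_name/app_open/"}
        ]
    },
    # Run daily, after the Airflow DAG has organised the new files.
    Schedule="cron(0 2 * * ? *)",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

# The crawler can also be triggered on demand.
glue.start_crawler(Name="clevertap_events_crawler")
```

Scheduling the crawler just after the daily Airflow run keeps the catalog in sync with the newly dropped files.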

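With the data catalog in place, the final step from the list above is to query the data directly on S3. As a hedged example with placeholder database, table and column names, an Athena query can be issued via boto3 as follows (Redshift Spectrum would instead read the same catalog through an external schema):

```python
import boto3

athena = boto3.client("athena")

# Query a catalogued CleverTap event table directly on S3.
# Database, table, columns and output location are placeholders.
response = athena.start_query_execution(
    QueryString="""
        SELECT user_id, COUNT(*) AS opens
        FROM app_open
        WHERE year = 2021 AND month = 3
        GROUP BY user_id
        ORDER BY opens DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "clevertap_events"},
    ResultConfiguration={"OutputLocation": "s3://bucket_name/athena-results/"},
)
print(response["QueryExecutionId"])
```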