Setting up Data Collection

Newsletter Issue 5: Pick a data collection setup that suits your organization’s capabilities and budget.

Cliché: Without data, there can be no data science.

But it is true.

While learning data science, we mostly use public data sets or scrape data off the web. But in ML-assisted products, most of the data is generated and collected through business applications.

The first step in any data pipeline is instrumenting your application to:

  • Capture needed data when an interesting event happens in the application

  • Ingest the captured data into your data storage (typically an event queue like Kafka)

This sequence of events is commonly known as an event stream or click-stream. Data quality depends on the accuracy and completeness of what you capture and ingest.
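To make the two steps concrete, here is a minimal sketch in Python. The event schema (`name`, `user_id`, `properties`, `event_id`, `timestamp`) and the function names `capture` and `ingest` are my own assumptions for illustration, and a standard-library `Queue` stands in for a real event queue like Kafka.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from queue import Queue


@dataclass
class Event:
    """One captured application event (hypothetical schema)."""
    name: str                      # what happened, e.g. "add_to_cart"
    user_id: str                   # who triggered it
    properties: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


# In-memory stand-in for an event queue such as Kafka (assumption).
event_queue: Queue = Queue()


def capture(name: str, user_id: str, **properties) -> Event:
    """Capture step: build an event record at the moment something happens."""
    return Event(name=name, user_id=user_id, properties=properties)


def ingest(event: Event) -> None:
    """Ingest step: serialize the event to JSON and append it to the queue."""
    event_queue.put(json.dumps(asdict(event)))


# Called from the application wherever an interesting event occurs.
ingest(capture("add_to_cart", user_id="u-123", sku="A-42", quantity=2))
```

Giving every event an ID and a timestamp at capture time makes it easier to check completeness and remove duplicates later in the pipeline.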

There are three alternatives for capturing and ingesting a click-stream.

Do It Yourself (DIY)

Write a small library in the language of your application that captures each event and sends it to a microservice endpoint or a cloud function, which processes and stores it on your platform of choice (such as AWS, Google Cloud, Azure, Snowflake, or Databricks).

This is the most flexible alternative: you can tailor it to the needs of your application and your data requirements.

It also takes the most development effort, because you need to write the code that processes the data and stores it in a data lake or data warehouse. If you use third-party analytics or ML applications, most of them can consume data from a popular lake or warehouse.
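As a rough idea of how small such a library can be, here is a sketch of a batching client in Python. The class name `EventClient`, its parameters, the payload shape, and the endpoint URL are all hypothetical. The transport is passed in as a function so that the example can run without a network; `post_json` shows what a real HTTP transport could look like using only the standard library.

```python
import json
import urllib.request
from typing import Callable, List


def post_json(url: str, payload: bytes) -> None:
    """Default transport: POST a JSON body to the collection endpoint."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


class EventClient:
    """Buffers events and sends them in batches (hypothetical design)."""

    def __init__(self, endpoint: str, batch_size: int = 20,
                 send: Callable[[str, bytes], None] = post_json) -> None:
        self.endpoint = endpoint
        self.batch_size = batch_size
        self._send = send
        self._buffer: List[dict] = []

    def track(self, name: str, **properties) -> None:
        """Record one event and flush once the buffer is full."""
        self._buffer.append({"event": name, "properties": properties})
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send all buffered events as one JSON array."""
        if not self._buffer:
            return
        payload = json.dumps(self._buffer).encode("utf-8")
        self._send(self.endpoint, payload)
        self._buffer = []  # cleared only if the send did not raise


# Usage with a fake transport, so nothing leaves the process.
sent = []
client = EventClient("https://collector.example.com/events", batch_size=2,
                     send=lambda url, body: sent.append((url, body)))
client.track("page_view", path="/pricing")
client.track("signup_click", plan="pro")  # second event triggers a flush
```

A production version would also need retries, a background flush on a timer, and a flush on shutdown. That is part of the development effort mentioned above.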

Fully Outsource It

If you are doing analytics or business intelligence, you may use a tool like Google Analytics, MixPanel, Amplitude, or Heap.

This is the quickest and easiest way to get started. These tools provide an SDK with simple APIs for sending data, and they can also compute and display common analytics charts.

This approach is also the least flexible. I recommend it for analytics, but not for collecting data for data science or machine learning.

Examine the pricing slabs for data volume carefully to decide whether this option is cost-effective for your data load.
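A quick back-of-the-envelope calculation helps here. The sketch below assumes graduated pricing, where each slab's rate applies only to the events that fall inside that slab. The slab boundaries and prices are made up for illustration and do not reflect any vendor's actual pricing, so substitute the numbers from the pricing page you are evaluating.

```python
from typing import List, Optional, Tuple

# (upper bound of the slab in events, price in USD per million events).
# None marks the open-ended last slab. All numbers are made up.
HYPOTHETICAL_SLABS: List[Tuple[Optional[int], float]] = [
    (1_000_000, 0.0),      # first 1M events per month are free
    (10_000_000, 200.0),   # events from 1M up to 10M
    (None, 100.0),         # everything above 10M
]


def monthly_cost(events: int, slabs=HYPOTHETICAL_SLABS) -> float:
    """Graduated pricing: each slab's rate applies only to events inside it."""
    cost = 0.0
    lower = 0
    for upper, usd_per_million in slabs:
        if events <= lower:
            break  # no events reach this slab
        in_slab = (events if upper is None else min(events, upper)) - lower
        cost += in_slab / 1_000_000 * usd_per_million
        if upper is None:
            break
        lower = upper
    return cost


cost_at_25m = monthly_cost(25_000_000)
```

With these made-up slabs, 25 million events a month cost 0 + 9 × 200 + 15 × 100 = 3,300 USD. Running the same function over your projected growth shows when convenience starts to get expensive.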

The Middle Path

Several tools provide a library to send events, along with a rich list of connectors to filter, lightly process, and route that data to multiple destinations (e.g., data lakes, warehouses, and popular third-party tools).
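Conceptually, these tools sit between your application and your destinations. The sketch below illustrates the filter, process, and route idea in plain Python. It is not the API of any particular tool: `Router`, the transform functions, and the list-based destinations are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

EventDict = Dict[str, object]
Transform = Callable[[EventDict], Optional[EventDict]]
Destination = Callable[[EventDict], None]


class Router:
    """Applies transforms to each event, then fans it out to destinations."""

    def __init__(self) -> None:
        self._transforms: List[Transform] = []
        self._destinations: Dict[str, Destination] = {}

    def add_transform(self, transform: Transform) -> None:
        self._transforms.append(transform)

    def add_destination(self, name: str, destination: Destination) -> None:
        self._destinations[name] = destination

    def send(self, event: EventDict) -> bool:
        """Return True if the event was delivered, False if it was filtered out."""
        current: Optional[EventDict] = dict(event)  # leave the caller's dict alone
        for transform in self._transforms:
            current = transform(current)
            if current is None:  # a transform returning None drops the event
                return False
        for destination in self._destinations.values():
            destination(dict(current))  # each destination gets its own copy
        return True


def drop_test_users(event: EventDict) -> Optional[EventDict]:
    """Filter: discard events from test accounts (hypothetical rule)."""
    return None if str(event.get("user_id", "")).startswith("test-") else event


def strip_email(event: EventDict) -> EventDict:
    """Light processing: remove a sensitive field before routing."""
    return {key: value for key, value in event.items() if key != "email"}


# Lists stand in for a warehouse and a third-party analytics tool.
warehouse: List[EventDict] = []
analytics_tool: List[EventDict] = []

router = Router()
router.add_transform(drop_test_users)
router.add_transform(strip_email)
router.add_destination("warehouse", warehouse.append)
router.add_destination("analytics", analytics_tool.append)

delivered = router.send({"event": "signup", "user_id": "u-1", "email": "a@example.com"})
filtered = router.send({"event": "signup", "user_id": "test-9", "email": "qa@example.com"})
```

What you pay these tools for is the maintained connectors behind each destination, not the routing logic itself.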

What is the best solution for you?

The convenience and rich connectors offered by tools like Fivetran and RudderStack are valuable. Whether they are worth it depends on how diverse your needs are and how deep your pockets are.

I recommend Do It Yourself if:

  • you have a high volume of events/data (convenience will most likely be expensive), or

  • your data processing is limited to a single cloud provider.

Consider Fully Outsource It only if you are collecting a moderate amount of data with a typical schema and are mostly doing analytics.

For the remaining use cases, the tradeoffs depend on your in-house data engineering expertise and on the diversity of your data sources and processors. I suggest checking out the Snowplow and RudderStack GitHub repositories.

ML4Devs Newsletter - Issue 05, published on 15 Mar 2022.