It is a cliché, but it is true: without data, there can be no data science.
While learning data science, we mostly work with public data sets or data scraped off the web. In ML-assisted products, however, most of the data is generated and collected through the business application itself.
The first step in any data pipeline is instrumenting your application to:

- Capture the needed data when an interesting event happens in the application
- Ingest the captured data into your data storage (typically an event queue like Kafka), as the sketch below illustrates
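To make these two steps concrete, here is a minimal sketch of an instrumented code path in Python. It assumes the kafka-python client and a broker on localhost; the topic name "clickstream" and the event fields are illustrative choices, not requirements — any event queue and schema will do.

```python
import json
import time
import uuid

from kafka import KafkaProducer  # assumes the kafka-python package

# Producer that serializes each event as JSON before sending it to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def capture_event(user_id: str, event_type: str, properties: dict) -> None:
    """Capture an application event and ingest it into the event queue."""
    event = {
        "event_id": str(uuid.uuid4()),   # unique id, useful for de-duplication
        "user_id": user_id,
        "event_type": event_type,        # e.g. "page_view", "add_to_cart"
        "timestamp": time.time(),        # when the event happened
        "properties": properties,        # event-specific payload
    }
    producer.send("clickstream", value=event)

# Example: an interesting event happens in the application.
capture_event("user-123", "add_to_cart", {"sku": "ABC-1", "price": 19.99})
producer.flush()  # ensure buffered events actually reach the broker
```

In a real product the producer would also be configured with retries, acknowledgements, and batching, but the capture-then-ingest shape stays the same.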
This sequence of events is commonly known as an event stream or clickstream. Its quality depends on the accuracy and completeness of the data you capture and ingest.
There is more than one way to capture and ingest a clickstream.