This talk was born from some of our greatest victories and worst losses while designing and implementing data lakes, with a focus on real-time processing and machine learning pipeline integration. We will go through the design problems that arise from specific integrations and the solutions we have used: caching to avert the Slowly Changing Dimension problem, separating operational and analytical clusters, and a fully fledged MLOps process. Using real examples, we will show how those use cases are reflected in the data lake architecture, both when building from scratch and when evolving an existing solution.
For the data architect, this session will provide a greater understanding of the available design patterns. For the data scientist, it will provide a better understanding of the environment they will soon be working in.