A: Data lake architecture is the system imposed on a data lake to organize and structure the data.
The first component you need for a data lake is a place to store all your data, whether it's relational data coming from a line of business or nonrelational data coming from mobile apps, IoT devices, or social media. However, not all data repositories are built the same. One best practice is not to simply use the cheapest storage option available. A strong option will be durable, scale to petabytes of storage, encrypt data at rest, be fault tolerant, and have enough redundancy to protect the data from loss. For example, when customers use a repository like Dropbox or iCloud, there's a sense of trust that the service won't lose the data or allow it to be compromised. Any good data lake should offer the same level of trust.
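To make the durability and redundancy requirements concrete, here is a toy sketch, not how any particular storage product works, of writing an object to multiple replicas and verifying each copy against a checksum so silent corruption is caught at write time. The function and directory names are illustrative.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def replicated_write(data: bytes, replica_dirs, name):
    """Write the same object to every replica directory, verifying
    each copy's checksum so a corrupted replica is detected immediately."""
    expected = hashlib.sha256(data).hexdigest()
    paths = []
    for d in replica_dirs:
        path = os.path.join(d, name)
        with open(path, "wb") as f:
            f.write(data)
        if sha256_of(path) != expected:
            raise IOError(f"checksum mismatch on replica {path}")
        paths.append(path)
    return expected, paths

# Usage: two temporary directories stand in for independent storage nodes.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    digest, copies = replicated_write(b"line-of-business export", [a, b], "orders.csv")
```

Real object stores do this (and far more) transparently; the point is that redundancy plus verification, not price alone, is what earns the trust described above.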
A catalog, or a way to organize and find the data, is another important feature. If you keep adding data to a repository that lacks a usable architecture, your data lake turns into a data swamp. A catalog can prevent a data lake from becoming a disorganized collection of information. It allows you to quickly discover the contents of a data lake and learn what information the data provides, where it came from, when it was last refreshed, and any other necessary metrics. It also creates a system of governance controls, specifying who can use the data and for what purpose.
Cataloging can be done either manually or with machine learning. Some companies write their own program or service to catalog the data. Machine learning-based solutions, by contrast, continually learn about the data to catalog it and provide better insights.
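A minimal sketch of what a hand-rolled catalog might track is shown below. The entry fields (source, last refreshed, tags) are assumptions drawn from the metadata described above, not any specific product's schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """Minimal metadata a catalog might track per dataset (illustrative fields)."""
    name: str
    source: str            # where the data came from
    last_refreshed: date   # when it was last updated
    description: str       # what information the data provides
    tags: set = field(default_factory=set)

class Catalog:
    """A registry that lets users discover datasets without scanning raw storage."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry):
        self._entries[entry.name] = entry

    def find(self, tag: str):
        """Discover datasets by tag rather than by digging through the lake."""
        return [e for e in self._entries.values() if tag in e.tags]

catalog = Catalog()
catalog.register(CatalogEntry(
    name="web_clickstream",
    source="mobile app events",
    last_refreshed=date(2024, 1, 15),
    description="Page and tap events from the mobile app",
    tags={"behavioral", "mobile"},
))
print([e.name for e in catalog.find("mobile")])  # → ['web_clickstream']
```

Even this toy version shows why a catalog prevents a swamp: discovery happens against curated metadata, not against the raw files themselves.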
Having a governance framework is critical. A marketer, for example, will often bring in data from a third party for a specific purpose. A data scientist might then try to use that data for a somewhat different purpose. Just by looking at the data, they can't tell whether they're permitted to use it. With a governance framework in place, you can quickly determine how the data may or may not be used.
Adobe, for example, uses the Data Usage Labeling and Enforcement (DULE) framework for data governance in Experience Platform. The DULE framework simplifies the process of organizing and categorizing data, and it provides information about how the data may be used and what restrictions may exist.
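Label-based enforcement of this kind can be sketched as follows. The label names and policy rules here are hypothetical, not Adobe's actual DULE vocabulary; the point is that a use is checked against the labels attached to a dataset rather than against the data itself.

```python
# Hypothetical usage labels and policy rules -- illustrative only,
# not Adobe's actual DULE label set.
RESTRICTED_USES = {
    "third_party_contract": {"export", "cross_site_targeting"},
    "sensitive": {"export", "analytics_sharing"},
}

def allowed(labels, intended_use):
    """Return True if no label attached to the dataset forbids the intended use."""
    return not any(intended_use in RESTRICTED_USES.get(l, set()) for l in labels)

# The marketer's third-party dataset carries a contract label.
dataset_labels = {"third_party_contract"}
print(allowed(dataset_labels, "internal_modeling"))     # → True
print(allowed(dataset_labels, "cross_site_targeting"))  # → False
```

This is why the data scientist in the earlier example doesn't need to guess: the labels travel with the data, and the framework answers the question mechanically.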
It’s also critical to have a data access layer on top of the lake so users can easily consume the data. One solution is a single application programming interface (API) or software development kit (SDK), which makes it simple for anyone to access the data at any time. Layering a SQL interface on top of the API or SDK also lets users query and analyze the data.
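A minimal sketch of that pattern, assuming an in-memory SQLite database as a stand-in for the lake's actual query engine, might look like this. The class and method names are invented for illustration.

```python
import sqlite3

class LakeAccess:
    """A single access layer: one entry point for ingest and reads, with a
    SQL surface on top (sqlite3 stands in for the lake's query engine)."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")

    def ingest(self, table, columns, rows):
        """Load rows behind the API so consumers never touch raw storage."""
        placeholders = ", ".join("?" for _ in columns)
        self._db.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        self._db.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

    def query(self, sql, params=()):
        """The one SQL interface every consumer uses, wherever the data lives."""
        return self._db.execute(sql, params).fetchall()

lake = LakeAccess()
lake.ingest("events", ["event", "total"], [("signup", 3), ("purchase", 1)])
print(lake.query("SELECT event FROM events WHERE total > ?", (2,)))  # → [('signup',)]
```

Funneling every consumer through one interface like this is what makes the access layer governable: the catalog and usage checks described earlier can all be enforced at this single chokepoint.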