Data Lake

Quick definition: A data lake is a central repository for all of the raw customer data a company collects, as well as any relevant third-party data.

The following information was provided during an interview with Anurag Dodeja, group product manager for Adobe Experience Platform.

What is a data lake?
What is the purpose of a data lake?
Why are data lakes important?
How did data lakes originate?
What is data lake architecture?
How do companies build data lake architecture?
What best practices can help companies more successfully use a data lake?
Does every company need to use a data lake?
What are the limitations of data lakes?
How will data lakes continue to evolve over time?

What is a data lake?

A data lake is a centralized repository where you can store all kinds of data, both structured and unstructured. You can store the data as-is, then run analytics on it, visualize patterns and sequences, and use machine learning to make better decisions.

What is the purpose of a data lake?

A data lake serves as a storage repository that holds all of an enterprise's structured, semi-structured, and unstructured data in a single place.

Once the data is in the lake, a company can use it for a variety of functions, including machine learning, analytics, and activation. And with the rise of big data, a data lake provides a way to store and manage massive amounts of data.

Why are data lakes important?

Data is the new oil. It’s an essential and valuable resource. Most organizations are making more and more business decisions, and they use insights pulled from data to make those decisions.

Companies expect that if they make data-driven decisions, they will outperform their peers. The movement to make choices based on data has fueled a need to not only have more data, but to bring data together from different silos into one central location.

With the data in a single location that everyone in the organization can access, all decision makers have the information they need to move forward, and there aren’t lapses in communication.

How did data lakes originate?

Traditionally, data has been stored in databases and data warehouses, but over the years, the amount of data companies work with has increased.

The data that companies collect and work with increases each year, and many IT departments and practitioners need a place where they can store all that data and use it to gain better knowledge and insights.

That's why companies have moved from databases and data warehouses, which were optimized for analysis, to data lakes, which provide more general storage for all your structured, semi-structured, and unstructured data.

What is data lake architecture?

Data lake architecture is the system imposed on a data lake to organize and structure the data.

The first component you need for a data lake is a place to store all your data, whether it's relational data coming from a line of business or nonrelational data coming from mobile apps, Internet of Things (IoT) devices, or social media.

However, not all data repositories are built the same. One best practice is to not simply choose the cheapest data storage option available. A strong option will be durable, scale to petabytes of storage, encrypt the data it stores, be fault-tolerant, and have redundancy to protect the data from being lost.

For example, when customers use a data repository like Dropbox or iCloud, there is a sense of trust that the service won’t lose the data or allow it to be compromised. Any good data lake should offer the same level of trust.

A catalog, or a way to organize and find the data, is another important feature. If you keep adding data to a repository that lacks a usable architecture, your data lake turns into a data swamp. A catalog can prevent a data lake from becoming a disorganized collection of information.

A catalog allows you to quickly discover the contents of a data lake and figure out what information the data provides, where the data came from, when it was last refreshed, and any other necessary metrics. It also creates a system of governance controls, specifying who can use the data and for what purpose.

Cataloging can be done either manually or with machine learning. Some companies write their own program or service to catalog the data. Other solutions constantly learn about the data to catalog it and provide better insights.
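
As a rough illustration of what a hand-written catalog might track, here is a minimal Python sketch. The CatalogEntry fields, the Catalog class, and the sample datasets are all hypothetical names chosen for this example; a real catalog service would persist this metadata and offer far more.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Metadata describing one dataset in the lake (hypothetical schema)."""
    name: str
    source: str            # where the data came from
    last_refreshed: date   # when the data was last refreshed
    description: str = ""
    tags: set[str] = field(default_factory=set)


class Catalog:
    """In-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Names of datasets whose name, description, or tags mention the keyword."""
        needle = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        )

    def stale(self, as_of: date, max_age_days: int) -> list[str]:
        """Names of datasets not refreshed within max_age_days of as_of."""
        return sorted(
            e.name
            for e in self._entries.values()
            if (as_of - e.last_refreshed).days > max_age_days
        )


# Sample entries (invented for illustration).
catalog = Catalog()
catalog.register(CatalogEntry("web_clicks", "mobile app", date(2024, 1, 10),
                              "Raw click events", {"behavioral"}))
catalog.register(CatalogEntry("crm_accounts", "line of business", date(2023, 6, 1),
                              "Customer accounts", {"relational"}))
```

With entries like these, catalog.search("click") finds the click data by keyword, and catalog.stale(date(2024, 2, 1), 90) flags datasets that have gone more than 90 days without a refresh.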

Having a governance framework is critical. Many times, a marketer, for example, will bring in data from a third party for a specific purpose. A data scientist might then try to use that data for a different purpose.

If they just look at the data, they don't know whether they can use it or not. But if you have a governance framework, you can quickly figure out how the data may or may not be used.

Adobe, for example, uses the Data Usage Labeling and Enforcement (DULE) framework for data governance in Adobe Experience Platform.

The DULE framework simplifies the process of organizing and categorizing data, and it provides information about how the data may be used and what restrictions may exist.
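
To make the idea of usage labels concrete, here is a small Python sketch of a label-based check. It is not the DULE framework or any Adobe API; the label names, the purposes, and the allowed function are assumptions invented for this example.

```python
# Hypothetical usage labels mapped to the purposes each label forbids.
# These names are illustrative only and are not actual DULE labels.
RESTRICTIONS: dict[str, set[str]] = {
    "third_party_contract": {"data_science", "export"},
    "contains_identity": {"export"},
}


def allowed(labels: set[str], purpose: str) -> bool:
    """Return True if none of the dataset's labels forbids the purpose."""
    return all(purpose not in RESTRICTIONS.get(label, set()) for label in labels)


# A dataset a marketer brought in from a third party for a specific campaign.
vendor_audience_labels = {"third_party_contract"}

marketing_ok = allowed(vendor_audience_labels, "marketing")
data_science_ok = allowed(vendor_audience_labels, "data_science")
```

Here the marketer's original use passes the check, while the data scientist's new use is blocked, which is the question a governance framework is meant to answer without anyone having to guess.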

It’s also critical to have a data access layer on top of the lake to ensure users can easily consume the data.

One solution is to have a single application programming interface (API) or software development kit (SDK) that people can use, which makes it simple for anyone to access the data at any time. Having a SQL interface on top of the API or SDK also allows a user to query and analyze the data.
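
A single access point with SQL on top might look something like the following Python sketch. The LakeClient class and the sample orders table are hypothetical, and an in-memory SQLite database stands in for whatever query engine actually sits over the lake.

```python
import sqlite3


class LakeClient:
    """Hypothetical single entry point for reading data from the lake."""

    def __init__(self) -> None:
        # Assumption: in-memory SQLite stands in for the lake's query engine.
        self._conn = sqlite3.connect(":memory:")

    def load(self, table: str, columns: list[str], rows: list[tuple]) -> None:
        """Load rows into a table. Table and column names must be trusted identifiers."""
        cols = ", ".join(columns)
        marks = ", ".join("?" for _ in columns)
        self._conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
        self._conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)

    def query(self, sql: str, params: tuple = ()) -> list[tuple]:
        """Run a SQL query and return every result row."""
        return self._conn.execute(sql, params).fetchall()


# Sample data (invented for illustration).
lake = LakeClient()
lake.load("orders", ["channel", "amount"],
          [("etsy", 20.0), ("facebook", 35.0), ("etsy", 15.0)])
totals = lake.query(
    "SELECT channel, SUM(amount) FROM orders GROUP BY channel ORDER BY channel"
)
```

Everyone goes through the same client, so access rules and logging can live in one place, and analysts can still ask their own questions in SQL.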

How do companies build data lake architecture?

Some solutions, like Adobe Experience Platform, provide a storage repository with architecture already built in. Other customers might prefer point solutions that they put together themselves.

For example, Apache provides open-source solutions for cataloging, and other commercially available products provide governance solutions.

Many companies get a data lake from a source like Amazon, Google Cloud, or Azure Storage and then manually bring in separate solutions for the cataloging, governance, and data access components.

While this is a viable option, it usually requires a lot of IT investment to make all those pieces fit together and work.

What best practices can help companies more successfully use a data lake?

First, make sure that your company aligns on what the data means and how it will be used. The purpose of a data lake is to bring in data from different sources and make it easily usable by different parts of the organization.

If there's not a single philosophy, or if different teams across the company don't speak the same language, it's difficult for different people accessing the data to communicate with one another. Accessing the same data lake doesn't automatically ensure all members of the company speak the same language.

Modeling the language and producing a single definition is a critical part of data lake management that a company needs to spend time on. They need to ask, “How do we structure it in a way that anybody can make sense of it?”

Companies should also pay attention to the controls that need to be put into place with the data. There should be a consensus within the company on who can use the data in a lake and for what purposes, in addition to the built-in contractual restrictions that come with the data.

Most data comes with limits on how long you can keep it before it must be deleted. The data lake should account for that, and when a company adds data to the lake, it needs to make sure all the contractual information is brought over as well.

And, rather than entering that information manually, the best data lakes will have a software layer that automatically manages the contracts.
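
As a simplified picture of what such a software layer could do, the Python sketch below flags datasets whose retention period has run out. The Dataset fields, the expired function, and the sample values are hypothetical; real contracts carry many more terms than a single number of days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Dataset:
    """A dataset plus the contractual retention term brought in with it (hypothetical)."""
    name: str
    ingested_on: date
    retention_days: Optional[int]  # assumption: None means no contractual limit


def expired(datasets: list[Dataset], today: date) -> list[str]:
    """Names of datasets whose retention period ended before today."""
    return [
        d.name
        for d in datasets
        if d.retention_days is not None
        and d.ingested_on + timedelta(days=d.retention_days) < today
    ]


# Sample datasets (invented for illustration).
datasets = [
    Dataset("vendor_audience", date(2024, 1, 1), 30),
    Dataset("web_clicks", date(2024, 2, 15), 90),
    Dataset("crm_accounts", date(2020, 5, 1), None),
]
to_delete = expired(datasets, date(2024, 3, 1))
```

Because the retention term travels with the dataset, a scheduled job can run a check like this and delete or quarantine what has expired instead of relying on someone to remember.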

Lastly, companies need to be proactive about making sure the data lake doesn’t become a swamp.

When you don’t have good lineage to understand where the data came from, who plugged it in, or when it came in, and people make copies of the data and manipulate it, the data can easily get lost and become impossible to use.

You need to make sure enough metadata is in place and the right controls are established so the data can continue to be discovered and used as necessary.
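
One way to picture lineage metadata is a record that stays attached to every dataset and every copy made from it. The Python sketch below is a minimal example under that assumption; the LineageRecord fields, the trace function, and the sample datasets are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    """Where a dataset came from, who added it, and when (hypothetical schema)."""
    dataset: str
    derived_from: Optional[str]  # None for data loaded from an outside source
    added_by: str
    added_on: date


def trace(records: dict[str, LineageRecord], dataset: str) -> list[str]:
    """Follow derived_from links from a dataset back to its original source."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = dataset
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        chain.append(current)
        record = records.get(current)
        current = record.derived_from if record else None
    return chain


# Sample records (invented for illustration).
records = {
    r.dataset: r
    for r in [
        LineageRecord("web_clicks", None, "ingest_job", date(2024, 1, 10)),
        LineageRecord("clicks_clean", "web_clicks", "analyst_a", date(2024, 1, 12)),
        LineageRecord("clicks_sample", "clicks_clean", "analyst_b", date(2024, 1, 20)),
    ]
}
origin_chain = trace(records, "clicks_sample")
```

When a copy of a copy turns up months later, a lookup like this shows which original dataset it descends from, who created each step, and when.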

Does every company need to use a data lake?

When deciding whether it needs to invest in a data lake, an organization should consider its business objectives and goals. A very small business owner likely won’t need a data lake to manage inventory.

However, if that small business is impacted by COVID-19 and suddenly needs to sell goods on multiple platforms like Etsy and Facebook while running marketing campaigns on Facebook, Twitter, and Google, it will need a place to bring all the necessary data together.

By using a data lake, the business owner can figure out how and where to invest their money and who their customers are. The only way to do any advanced analytics on all of that data coming from different places in different formats is to collect it all in a single place.

Generally, if a company doesn’t need to use a data lake, it doesn’t make sense to invest in one. However, data lakes usually aren’t cost prohibitive, especially if you work with a provider that offers the data lake as a complete package.

If you choose to build your own data lake architecture from multiple different solutions, the costs can start to stack up, especially if you need to invest in additional IT talent to put everything together.

What are the limitations of data lakes?

Data lakes are purpose-built for analytical workloads. They have to be able to run machine learning on the data, learn from the data, and derive insights from the data.

However, a data lake is not meant to be your transaction system. It's not meant to be the system where you store bank transactions. It’s also not general-purpose storage for any kind of data, especially data that belongs to a mission-critical operational system.

How will data lakes continue to evolve over time?

In the future, more vendors will offer a complete data lake solution, including storage, cataloging, governance, and data access systems, instead of selling each piece individually.

Data lakes may also change to allow the option of being a transaction system.

The number of digital transactions is increasing, along with the amount of data, so data lakes may evolve to become more operational over time.
