Big Data

Quick definition: The term “big data” refers to large data sets, usually measured in terabytes or petabytes, that are analyzed to provide business insights.

The following questions were addressed during an interview with Anshuman Nangia, product manager at Adobe:

What is big data?
What are the three Vs of big data?
Is big data open source?
What is the big data life cycle?
What technologies are required for big data?
Why is big data important?
How long should companies store data?
What problems do companies face when working with big data?
How will the use of big data continue to evolve?

What is big data?

The definition of big data is fluid, but it generally refers to a data set that is too big to be housed on a single machine. If your team is working with data within the confines of a single box, either physical or virtual, then that's not big data.

The conversation shifts toward big data when talking about dozens of machines acting in concert. They hold on to the data and process it locally within those machines, then converge the output into something that can lead to better decision making.

Big data can include structured data, unstructured data, and semi-structured data, although fully structured data is rare when dealing with big data.

What is the difference between structured, unstructured, and semi-structured data?

Generally, there are three common types of data:

Structured data is organized according to a fixed schema, such as rows and columns in a relational database table.
Unstructured data has no predefined model; examples include free text, images, video, and audio.
Semi-structured data falls in between: it carries self-describing tags or keys, as in JSON or XML records, but individual records do not all follow the same rigid schema.
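For illustration only, here is a minimal Python sketch with made-up records showing how the three types differ in practice:

```python
import json

# Structured: fixed schema, like a row in a relational table.
structured_row = {"customer_id": 1042, "country": "US", "lifetime_value": 318.50}

# Semi-structured: self-describing keys, but fields can vary from record to record (e.g. JSON).
semi_structured = json.loads('{"event": "page_view", "device": {"os": "iOS"}, "tags": ["promo"]}')

# Unstructured: no predefined model, such as free text, images, video, or audio.
unstructured = "Loved the checkout experience, but shipping took too long."

print(structured_row["lifetime_value"])        # direct field access
print(semi_structured["device"].get("os"))     # nested, optional fields
print(len(unstructured.split()))               # must be parsed or analyzed to extract meaning
```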

Another important aspect of big data is the technology developed over the last 10 to 15 years to process and analyze large data sets. When a flood of information started coming in, companies needed to create tools to store the data successfully and to find value in it.

Many organizations in the IT space, especially those in the San Francisco Bay Area, have focused on creating frameworks primarily to deal with big data. These frameworks, like Hadoop, were created for scenarios where there is so much data that it can't possibly be processed by a small number of machines.

What are the three Vs of big data?

The three Vs of big data are variety, velocity, and volume. Data comes in a variety of different formats, including images, videos, emails, text messages, social media posts, and SQL tables.

Variety refers to the varied composition of data sets. Structured, unstructured, and semi-structured data are examples of variety within data.

Velocity describes how quickly data becomes available to the organization collecting it. Adobe, for example, collects over 250 trillion transactions a year, which comes out to around 475 million transactions a minute.

Volume refers to the sheer amount of data collected. If YouTube subscribers upload 380,000 hours of video an hour, that is a high volume of data. If an organization is dealing with 380,000 emails an hour, the volume of the data is significantly less, but the velocity is still high.
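As a quick sanity check, the per-minute velocity figure quoted above follows from simple arithmetic:

```python
# 250 trillion transactions per year expressed as a per-minute rate.
transactions_per_year = 250e12
minutes_per_year = 365 * 24 * 60           # 525,600 minutes in a year

per_minute = transactions_per_year / minutes_per_year
print(f"{per_minute:,.0f} transactions per minute")   # roughly 475 million
```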

What is the big data life cycle?

The data life cycle starts with information collection from data sources and ends with pulling insights from the collected data. The first step, data collection, involves creating an infrastructure that is responsible for collecting all the data points coming in. The infrastructure will depend on the type of data, but the raw data always persists somewhere so that further analysis can happen as needed.

The next step after data collection is determining where the data should persist and how to catalog it so other systems know it exists. Data is only as useful as the metadata that describes it. If an organization has large volumes of data, but no way of discovering that data or informing someone what that data is about, the data has no benefit.

After data is stored and managed, it can then be analyzed for insights and patterns. The insights derived from big data analytics can then be visualized to inform stakeholders of the findings and make recommendations for the organization's next steps.
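As a purely illustrative sketch of that life cycle (the function names and records below are hypothetical, not any particular product's API), the stages can be pictured like this:

```python
from collections import Counter

def collect_events():
    # 1. Collection: raw events arrive from websites, apps, and other sources.
    return [
        {"user": "u1", "action": "page_view"},
        {"user": "u2", "action": "purchase"},
        {"user": "u1", "action": "purchase"},
    ]

def store_and_catalog(events):
    # 2. Storage and cataloging: persist the raw data and record metadata
    #    so other systems can discover it and know what it contains.
    catalog_entry = {"dataset": "web_events_raw", "rows": len(events), "schema": ["user", "action"]}
    return events, catalog_entry

def analyze(events):
    # 3. Analysis: derive insights, here a simple count of actions.
    return Counter(e["action"] for e in events)

raw = collect_events()
stored, metadata = store_and_catalog(raw)
insights = analyze(stored)

# 4. Reporting: surface the findings to stakeholders.
print(metadata)
print(insights)   # Counter({'purchase': 2, 'page_view': 1})
```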

What technologies are required for big data?

Specific technologies have been purpose-built to deal with big data — analytics is one example. The whole world of analytics is essentially governed by big data. And dealing with large volumes of data at high velocity with some fairly tight time tolerances requires purpose-built machinery. You can't pull together disparate machines using consumer software and expect to adequately work with big data.

This is where the technology landscape of big data processing comes into the picture. The available technology includes analytics engines like Apache Spark or Databricks, which make it easier to manage large amounts of stored data, as well as big data technologies built around messaging, like Kafka, which specializes in processing streaming data that is continuously generated. An organization may also choose to build and manage their own custom framework.
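To give a sense of what working with such an engine looks like, here is a minimal PySpark sketch; the storage path and column names are hypothetical, and it assumes a Spark installation is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Read a (hypothetical) large set of transaction records; Spark distributes the
# work across however many machines the cluster provides.
transactions = spark.read.json("s3://example-bucket/transactions/")  # path is illustrative

# Aggregate revenue per country, the kind of query that would overwhelm a single
# machine at big data scale but that an analytics engine parallelizes for you.
revenue_by_country = (
    transactions
    .groupBy("country")
    .agg(F.sum("amount").alias("total_revenue"))
    .orderBy(F.desc("total_revenue"))
)

revenue_by_country.show()
spark.stop()
```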

Why is big data important?

Essentially, the more data an organization has, the better decisions they can make, and big data processing expands on that concept. Companies want to understand how their customers are interacting with their brand, and for organizations with enormous global audiences, that requires large volumes of data.

One increasingly important use of big data is to better understand and meet customer needs. To provide a premier customer experience and to continue to evolve to meet the needs of the customers, organizations need to understand where their customers come from, what they do on the website, how much time they spend on the website, and how often they complete a transaction or convert.

Behavioral data is collected from customer behavior on a website and other channels such as mobile, email, etc. Transactional and personal information may also be collected. Understanding this data can give you important insights about how to improve sales velocity and how to optimize different digital interactions.
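As a small, hypothetical illustration of turning behavioral data into an insight, the sketch below computes a per-channel conversion rate from a handful of made-up events:

```python
events = [
    {"visitor": "a", "channel": "web",    "converted": True},
    {"visitor": "b", "channel": "web",    "converted": False},
    {"visitor": "c", "channel": "mobile", "converted": True},
    {"visitor": "d", "channel": "email",  "converted": False},
]

# Tally visits and conversions per channel.
by_channel = {}
for e in events:
    visits, conversions = by_channel.get(e["channel"], (0, 0))
    by_channel[e["channel"]] = (visits + 1, conversions + int(e["converted"]))

for channel, (visits, conversions) in by_channel.items():
    print(f"{channel}: {conversions / visits:.0%} conversion rate")
```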

Many decisions around optimization boil down to the amount of data available and the insights that can be pulled from that data.

Is big data open source?

Some data sets are open source, but these data sets have been made open source as the result of an explicit decision by a government agency or a private company. Data sets are made publicly available if it’s believed they will provide some good for humanity.

How long should companies store data?

Companies generally have contractual obligations that specify how long they can hold on to data. These contractual obligations will vary quite a bit and are governed by regulations in different geographic locations and across various vertical industries.

Some data a company doesn't ever want to lose — a banking website will never want to lose transaction data, for example. But for a digital marketer, certain regulations might say customer data can't be held for longer than 36 months, at which point it must be deleted.

Under new privacy regulations in certain areas, customers may also ask for their data to be purged. When these requests come in, companies have to make sure that they scrub their system of any data that might be related to that customer so that they can stay in compliance with the privacy regulations.

In addition to contractual or privacy obligations, data can be deleted after it becomes irrelevant. Most data becomes less useful over time. At some point, most companies will delete outdated data to save money, or they will extract the signals they care about most from the data, hold those signals in a less data-intensive format, and delete the original information.
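A minimal sketch of that pattern, with hypothetical records and a retention window borrowed from the 36-month example above, might look like this:

```python
from datetime import date, timedelta

RETENTION_DAYS = 36 * 30  # hypothetical 36-month retention limit

raw_transactions = [
    {"customer": "c1", "amount": 40.0, "date": date(2020, 3, 1)},
    {"customer": "c1", "amount": 25.0, "date": date.today() - timedelta(days=10)},
]

cutoff = date.today() - timedelta(days=RETENTION_DAYS)
expired = [t for t in raw_transactions if t["date"] < cutoff]

# Extract only the signals the business cares about from the expired records...
summary = {"expired_count": len(expired), "expired_revenue": sum(t["amount"] for t in expired)}

# ...then delete the original, more data-intensive records.
raw_transactions = [t for t in raw_transactions if t["date"] >= cutoff]

print(summary)
print(raw_transactions)
```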

What problems do companies face when working with big data?

Collecting data is not enough. Organizations must be able to access, analyze, and shape the data. Unstructured and semi-structured data is often difficult to work with. Without proper management, the data can eat into costs without really providing any value. The right set of technologies can help companies make sense of their data and can help confirm or refute initial instincts about taking a course of action.

The ability to use big data has also become more democratized in recent years, and more individuals are able to work with big data, even if they aren't data scientists. While this has many advantages, there is also more scope for mistakes to be made. The number of people getting involved with the data and the speed at which it can be processed can lead to decisions based on incomplete information and result in suboptimal outcomes.

Better big data management comes with maturity. If an organization is starting to explore data for the first time, they may want to slow down and make sure they are asking the right questions. There can also be biases or anomalies in the data, which may not be apparent when first using big data.

Companies also have to be careful about how they use the data they collect. For example, they may collect personally identifiable information (PII) like credit card numbers or email addresses, but they may not want, or be allowed, to use that information for certain marketing actions or make it available via insecure locations.

A proper data governance framework helps prevent improper data access and use, and helps maintain compliance with regulations, by ensuring that data is properly labeled for its intended use.
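One way to picture such labeling (the data sets, labels, and check below are hypothetical, not a specific governance product) is a simple lookup that runs before any marketing action:

```python
DATASET_LABELS = {
    "email_addresses":  {"pii": True,  "approved_uses": {"transactional_email"}},
    "page_view_counts": {"pii": False, "approved_uses": {"marketing", "analytics"}},
}

def can_use(dataset: str, purpose: str) -> bool:
    labels = DATASET_LABELS.get(dataset)
    if labels is None:
        return False  # unlabeled data is treated as off-limits
    return purpose in labels["approved_uses"]

print(can_use("email_addresses", "marketing"))   # False: PII not approved for this use
print(can_use("page_view_counts", "analytics"))  # True
```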

How will the use of big data continue to evolve?

One trend in the use of big data is the increasing speed at which insights become available, decisions are made, and action is taken.

Companies need to react to customer behavior in real time. Years ago, organizations may have been able to sit on the data they collected for 24 or 48 hours, but they must now respond in an instant, which requires technology that can run queries against large amounts of data as it becomes available.

There is a significant shift happening in the world of big data as it moves from a batch-oriented way of thinking about things to something that takes place in real time. Machine learning and artificial intelligence (AI) are instrumental in increasing the speed of big data analysis.
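As an illustration of that shift from batch to real time, here is a minimal PySpark Structured Streaming sketch; the Kafka topic and broker address are hypothetical, and it assumes the Spark Kafka connector package is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Continuously read events from a (hypothetical) Kafka topic as they are produced,
# rather than waiting for a nightly batch dump.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # illustrative address
    .option("subscribe", "customer-events")             # illustrative topic
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Keep a running count of events per key, updated as new data arrives.
counts = events.groupBy("key").count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```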
