The term “big data” refers to large data sets, usually measured in terabytes or petabytes, that are analyzed to provide business insights.
Big data is defined by its variety (the different types or formats of data), velocity (the speed at which data becomes available), and volume (the amount of data collected).
Specific technologies have emerged to support the collection and analysis of big data sets, like Hadoop and Apache Spark, as traditional data processing software solutions are incapable of handling massive amounts of data.
After data is collected, it must be appropriately stored and categorized so organizations can derive insight from it. If a company can’t understand the data or find the information it needs, the data has no value.
Big data is shifting to become more available in real time, as companies can no longer afford to collect data without immediately gaining insights for timely and relevant action.
Anshuman Nangia is a product manager with 8+ years of engineering and product experience across the enterprise and consumer startup space. At Adobe, Anshuman drives the strategy and execution of streaming services for the digital experiences business to deliver billions of digital experiences in as close to real-time as possible.
Q: What is big data?
A: The definition of big data is fluid, but it generally refers to a data set that is too big to be housed on a single machine. If your team is working with data within the confines of a single box, either physical or virtual, then that's not big data.
The conversation shifts toward big data when talking about dozens of machines acting in concert. They hold on to the data and process it locally within those machines, then try to converge the output into something that can lead to better decision making.
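The idea of processing data locally and then converging the output can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three lists stand in for separate machines, and the page names are made up. Real frameworks distribute this work across a cluster, but the shape of the computation is similar.

```python
from collections import Counter

# Hypothetical page-view events, already split across three "machines".
partitions = [
    ["home", "pricing", "home"],
    ["home", "docs"],
    ["pricing", "docs", "docs"],
]


def count_locally(events):
    """Work done on one machine: count page views in its own partition."""
    return Counter(events)


def merge_counts(partial_counts):
    """Converge step: combine the per-machine counts into one result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


partial_results = [count_locally(p) for p in partitions]
page_views = merge_counts(partial_results)
print(dict(page_views))
```

No single partition sees all of the data, yet the merged result is the same as counting everything in one place.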
Big data can include structured data, unstructured data, and semi-structured data, although fully structured data is rare when dealing with big data.
Structured data refers to data displayed in a well-defined table.
Unstructured data, which includes data points like logins, website clicks, page views, or video views, is data that is not organized in a pre-defined model. Formats like video, log files, and emails are generally unstructured data. They can be arbitrarily long or short or have words that can’t be predicted beforehand.
Semi-structured data includes data that contains a mix of structured and unstructured information. The logs generated by a machine are a good example of semi-structured data. Every time a person goes to a website, in addition to generating data based on content viewed in the browser, the server is generating data in the background. It is keeping track of when a request came, which IP address the request came from, which browser was used, etc. Some of this information is logged in a structured way, but within that, there is information which is unstructured, like the request parameters and the response payload.
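To make the mix of structured and unstructured information concrete, here is a small sketch that parses one server log line. The log layout, the sample line, and the function name `parse_log_line` are all hypothetical; real servers use their own formats. The IP address, timestamp, and browser sit in predictable positions, while the request parameters can contain anything.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: ip [timestamp] "METHOD url" "browser"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)" "(?P<browser>[^"]*)"'
)


def parse_log_line(line):
    """Split a log line into fixed fields plus free-form request parameters."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout
    record = match.groupdict()
    parts = urlsplit(record.pop("url"))
    record["path"] = parts.path
    # The query string is the unstructured part: its keys are not known ahead of time.
    record["params"] = parse_qs(parts.query)
    return record


sample = '203.0.113.7 [2024-01-15T10:32:00Z] "GET /search?q=big+data&page=2" "Mozilla/5.0"'
record = parse_log_line(sample)
print(record)
```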
Another important aspect of big data is the technology developed in the last 10-15 years to process and analyze large data sets. When a flood of information started coming in, companies needed to create tools to ensure successful data storage and to find value in the data. Many organizations in the IT space, especially those in the San Francisco Bay Area, have focused on creating frameworks primarily to deal with big data. These frameworks, like Hadoop, were created to deal with scenarios where there is so much data it can’t possibly be processed by a small number of machines.
Q: Why is big data important?
A: Essentially, the more data an organization has, the better decisions they can make, and big data processing expands on that concept. Companies want to understand how their customers are interacting with their brand, and for organizations with enormous global audiences, that requires large volumes of data.
One increasingly important use of big data is to better understand and meet customer needs. To provide a premier customer experience and to continue to evolve to meet the needs of the customers, organizations need to understand where their customers come from, what they do on the website, how much time they spend on the website, and how often they complete a transaction or convert.
Behavioral data is collected from customer behavior on a website and other channels such as mobile, email, etc. Transactional and personal information may also be collected. Understanding this data can give a company important insights about how to improve sales velocity and how to optimize different digital interactions.
Many decisions around optimization boil down to the amount of data available and the insights that can be pulled from that data.
Q: What are the three Vs of big data?
A: The three Vs of big data are variety, velocity, and volume. Data comes in a variety of different formats, including images, videos, emails, text messages, social media posts, and SQL tables. Structured, unstructured, and semi-structured data are examples of variety within data.
The second V, velocity, describes how quickly data becomes available to the organization collecting it. Adobe, for example, collects over 250 trillion transactions a year, which comes out to around 475 million transactions a minute.
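The per-minute figure follows from simple division, which the short sketch below reproduces. The only input is the yearly total quoted above; the constant names are illustrative, and the calculation ignores leap years.

```python
TRANSACTIONS_PER_YEAR = 250e12      # yearly figure quoted in the text
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes, ignoring leap years

per_minute = TRANSACTIONS_PER_YEAR / MINUTES_PER_YEAR
per_second = per_minute / 60

print(f"{per_minute / 1e6:.1f} million transactions per minute")
print(f"{per_second / 1e6:.1f} million transactions per second")
```

The result is a little over 475 million transactions a minute, or close to 8 million a second.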
The third V, volume, refers to the sheer amount of data collected. If YouTube users upload 380,000 hours of video an hour, that is a high volume of data. If an organization is dealing with 380,000 emails an hour, the volume of the data is significantly lower, but the velocity is still high.
Q: Is big data open source?
A: Some data sets are open source, but these data sets have been made open source as the result of an explicit decision by a government agency or a private company. Data sets are made publicly available if it’s believed they will provide some good for humanity.
Q: What is the big data life cycle?
A: The data life cycle starts with information collection from data sources and ends with pulling insights from the collected data. The first step, data collection, involves creating an infrastructure that is responsible for collecting all the data points coming in. The infrastructure will depend on the type of data, but the raw data always persists somewhere so that further analysis can happen as needed.
The next step after data collection is determining where the data should persist and how to catalog it so other systems know it exists. Data is only as useful as the metadata that describes it. If an organization has large volumes of data, but no way of discovering that data or informing someone what that data is about, the data has no benefit. After data is stored and managed, it can then be analyzed for insights and patterns. The insights derived from big data analytics can then be visualized to inform stakeholders of the findings and make recommendations for the organization’s next steps.
Q: What technologies are required for big data?
A: Specific technologies have been purpose-built to deal with big data—analytics is one example. The whole world of analytics is essentially governed by big data. And dealing with large volumes of data at high velocity with some fairly tight time tolerances requires purpose-built machinery. You can't pull together disparate machines using consumer software and expect to adequately work with big data.
This is where the technology landscape of big data processing comes into the picture. The available technology includes analytics engines like Apache Spark or Databricks, which make it easier to manage large amounts of stored data, as well as big data technologies built around messaging, like Kafka, which specializes in processing streaming data that is continuously generated. An organization may also choose to build and manage their own custom framework.
Q: How long should companies store data?
A: Companies generally have contractual obligations that specify how long they can hold on to data. These contractual obligations vary quite a bit and are governed by regulations in different geographic locations and across various vertical industries. Some data a company never wants to lose; a banking website will never want to lose transaction data, for example. A digital marketer, on the other hand, might be subject to regulations that say customer data can’t be held for longer than 36 months and must be deleted at that point.
Under new privacy regulations in certain areas, customers may also ask for their data to be purged. When these requests come in, companies have to make sure that they scrub their system of any data that might be related to that customer so that they can stay in compliance with the privacy regulations.
In addition to contractual or privacy obligations, data can be deleted after it becomes irrelevant. Most data does become less useful over time. At some point, most companies will delete outdated data to save money, or they will extract the signals they care about most from the data, hold those signals in a less data-intensive format, and delete the original information.
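The "extract the signals, then delete the original" approach can be sketched as follows. The event records, the one-year cutoff in `MAX_AGE_DAYS`, and the choice of signal (a count per action) are assumptions for illustration only; what counts as outdated and which signals matter will differ by organization.

```python
from collections import Counter
from datetime import date

MAX_AGE_DAYS = 365  # hypothetical cutoff after which raw events count as outdated

events = [
    {"customer": "a", "action": "purchase", "day": date(2020, 1, 10)},
    {"customer": "b", "action": "view", "day": date(2020, 2, 3)},
    {"customer": "a", "action": "view", "day": date(2023, 11, 20)},
]


def split_by_age(events, today):
    """Separate events that are still fresh from those past the cutoff."""
    fresh, outdated = [], []
    for event in events:
        age_in_days = (today - event["day"]).days
        (outdated if age_in_days > MAX_AGE_DAYS else fresh).append(event)
    return fresh, outdated


def summarize(outdated):
    """Keep only a compact signal: how often each action occurred."""
    return Counter(event["action"] for event in outdated)


fresh, outdated = split_by_age(events, today=date(2024, 1, 1))
signals = summarize(outdated)  # small summary that is retained
events = fresh                 # the raw outdated events are dropped
print(dict(signals), len(events))
```

The summary takes far less space than the raw events, at the cost of no longer being able to answer questions the summary was not designed for.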
Q: What problems do companies face when working with big data?
A: Collecting data is not enough. Organizations must be able to access, analyze, and shape the data. Unstructured and semi-structured data is often difficult to work with. Without proper management, the data can eat into costs without really providing any value. The right set of technologies can help companies make sense of their data and can help confirm or refute initial instincts about taking a course of action.
The ability to use big data has also become more democratized in recent years, and more individuals are able to work with big data, even if they aren't data scientists. While this has many advantages, there is also more scope for mistakes to be made. The number of people getting involved with the data and the speed at which it can be processed can lead to decisions based on incomplete information and result in suboptimal outcomes.
Better big data management comes with maturity. If an organization is starting to explore data for the first time, they may want to slow down and make sure they are asking the right questions. There can also be biases or skews in the data, which may not be apparent when first using big data.
Companies also have to be careful about how they use the data they collect. For example, they may collect personally identifiable information (PII) like credit card numbers or email addresses, but they may not want or be allowed to use that information for certain marketing actions or make it available via unsecured locations. Having a proper framework for data governance will help prevent mistakes of improper data access and use, and maintain compliance with regulations, by ensuring that data is properly labeled for its intended use.
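One simple way to picture labeling data for its intended use is a lookup that checks each field's labels against what a given purpose is allowed to touch. This is only a sketch: the field names, labels, purposes, and the `usable_fields` helper are hypothetical, and a real governance framework would involve much more than a dictionary.

```python
# Hypothetical labels describing what each collected field contains.
FIELD_LABELS = {
    "email": {"pii"},
    "card_number": {"pii", "payment"},
    "page_views": set(),
}

# Hypothetical policy: which labels each purpose is allowed to use.
ALLOWED_LABELS = {
    "marketing": set(),
    "billing": {"pii", "payment"},
}


def usable_fields(purpose):
    """Return the fields whose labels are all permitted for this purpose."""
    allowed = ALLOWED_LABELS[purpose]
    return sorted(
        field for field, labels in FIELD_LABELS.items() if labels <= allowed
    )


print(usable_fields("marketing"))
print(usable_fields("billing"))
```

Under this made-up policy, marketing can only use the unlabeled page-view data, while billing can also use the fields labeled as PII.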
Q: How will the use of big data continue to evolve?
A: One trend for the use of big data is the increased speed at which insights can be available, decisions made, and action taken.
Companies need to react to customer behavior in real time. Five years ago, organizations may have been able to sit on the data they collected for 24 or 48 hours, but organizations must now respond in an instant, which requires the technology to be able to run queries against large amounts of data as it becomes available. There is a significant shift happening in the world of big data as it moves from a batch-oriented way of thinking about things to something that takes place in real time. Machine learning and artificial intelligence (AI) are instrumental in increasing the speed of big data analysis.
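The difference between batch-oriented and real-time thinking can be shown with a small sketch. The order values and the `RunningStats` class are made up for illustration. The batch function needs all the data before it can answer, while the streaming version keeps an up-to-date answer after every event.

```python
class RunningStats:
    """Keeps a count and a mean that are updated one event at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: adjust the current mean by the new value's share.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    """Batch style: wait until all values are collected, then compute once."""
    return sum(values) / len(values)


order_values = [20.0, 35.0, 50.0, 15.0]  # hypothetical order amounts

stats = RunningStats()
for value in order_values:
    current = stats.update(value)  # an answer is available after every event
    print(f"after {stats.count} events: mean = {current:.2f}")

print(f"batch result: {batch_mean(order_values):.2f}")
```

Both approaches arrive at the same final number; the streaming version simply makes it available as the data arrives rather than 24 or 48 hours later.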