A: The definition of big data is fluid, but it generally refers to a data set that is too big to be housed on a single machine. If your team is working with data within the confines of a single box, either physical or virtual, then that's not big data.
The conversation shifts toward big data when dozens of machines are acting in concert. Each machine holds part of the data and processes it locally, and the outputs are then combined into something that can lead to better decision making.
Big data can include structured, unstructured, and semi-structured data, although fully structured data is rare at that scale.
Structured data refers to data organized in a well-defined table, where every record has the same fixed set of fields.
Unstructured data, which includes data points like logins, website clicks, page views, or video views, is data that is not organized in a pre-defined model. Formats like video, log files, and email are generally unstructured: they can be arbitrarily long or short and contain words that can't be predicted beforehand.
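As a rough illustration of the contrast, here is a minimal sketch in Python; the field names and the email text are made up for the example:

```python
# Structured data: every record has the same well-defined columns.
orders = [
    {"order_id": 1001, "customer": "alice", "amount": 25.00, "date": "2024-01-15"},
    {"order_id": 1002, "customer": "bob",   "amount": 12.50, "date": "2024-01-16"},
]

# Unstructured data: free-form text with no pre-defined fields or length.
support_email = """Hi team,
The checkout page froze after I clicked 'Pay' twice -- not sure if the
order went through. Can someone check? Thanks, Alice"""
```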
Semi-structured data contains a mix of structured and unstructured information. The logs generated by a machine are a good example. Every time a person goes to a website, in addition to the data generated by the content viewed in the browser, the server is generating data in the background: it keeps track of when a request came in, which IP address it came from, which browser was used, and so on. Some of this information is logged in a structured way, but embedded within it is unstructured information, such as the request parameters and the response payload.
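A small sketch of what that looks like, using a hypothetical access-log line: the leading fields (IP address, timestamp, status code) follow a known pattern, while the query string and the user agent are essentially free-form text supplied by the client.

```python
import re

# A hypothetical web-server access log line: part structured, part not.
log_line = ('203.0.113.7 - - [15/Jan/2024:10:32:01 +0000] '
            '"GET /search?q=winter+boots&ref=email HTTP/1.1" 200 5321 '
            '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"')

# The structured part follows a known layout and parses into fixed fields.
pattern = (r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
           r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+) '
           r'"(?P<user_agent>[^"]*)"')
fields = re.match(pattern, log_line).groupdict()

print(fields["ip"], fields["status"])  # 203.0.113.7 200
print(fields["path"])                  # the query string inside is free-form
print(fields["user_agent"])            # arbitrary text chosen by the client
```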
Another important aspect of big data is the technology developed over the last 10-15 years to process and analyze large data sets. When the flood of information started coming in, companies needed tools to store the data reliably and to find value in it. Many organizations in the IT space, especially those in the San Francisco Bay Area, have focused on creating frameworks to deal with big data. These frameworks, like Hadoop, were created for scenarios where there is so much data it can't possibly be processed by a small number of machines.
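The core idea behind those frameworks is the map/reduce pattern: each machine processes its own chunk of the data locally, and the partial results are merged at the end. Here is a toy sketch of that pattern in plain Python; it is not the actual Hadoop API (which is Java-based and also handles distribution, scheduling, and fault tolerance), just the shape of the computation:

```python
from collections import Counter

# Pretend each element is the slice of a huge log held on a different machine.
chunks = [
    "error timeout error",
    "login error success",
    "timeout login login",
]

def map_phase(chunk):
    # Each machine counts words in its own chunk, locally.
    return Counter(chunk.split())

def reduce_phase(partial_counts):
    # The partial results are merged into a single, global count.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

print(reduce_phase(map_phase(c) for c in chunks))
# -> error: 3, login: 3, timeout: 2, success: 1
```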