Big Unstructured Data v/s Structured Relational Data

Structured Data: Any data that is organized physically and logically so that it can be loaded into a relational database without issues and searched readily by simple, straightforward search algorithms or other search operations. Despite its simplicity, most experts in today’s data industry estimate that structured data accounts for only about 20% of the data available. It is clean, analytical and usually stored in databases. Structured data is information, usually held in text files, displayed in titled columns and rows that can easily be ordered and processed by data mining tools. It can be pictured as a perfectly organized filing cabinet where everything is identified, labeled and easy to access.

Some examples of structured data:

Machine Generated

Sensor Data - GPS data, manufacturing sensors, medical devices
Point-of-Sale Data - Credit card information, location of sale, product information
Call Detail Records - Time of call, caller and recipient information
Web Server Logs - Page requests, other server activity

Human Generated

Input Data - Any data entered into a computer: age, zip code, gender, etc.
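As a minimal sketch of why structured data is "readily searchable", the example below loads a few point-of-sale style rows into a relational table and runs a simple lookup. The table name, column names and sample rows are made up for illustration, and SQLite stands in for whatever relational database an organization actually uses.

```python
import sqlite3

# Hypothetical point-of-sale records: every row has the same titled columns.
SALES = [
    ("2016-03-01", "Store A", "notebook", 3.50),
    ("2016-03-01", "Store B", "pen", 1.20),
    ("2016-03-02", "Store A", "pen", 1.20),
]


def load_sales(rows):
    """Load rows into an in-memory relational table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (sale_date TEXT, location TEXT, product TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def products_sold_at(conn, location):
    """A simple search: which products were sold at a given location?"""
    cur = conn.execute(
        "SELECT product FROM sales WHERE location = ? ORDER BY sale_date",
        (location,),
    )
    return [row[0] for row in cur.fetchall()]


conn = load_sales(SALES)
print(products_sold_at(conn, "Store A"))  # ['notebook', 'pen']
```

Because every record fits the same columns, loading needs no interpretation step and the search is a single declarative query.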

Unstructured Data:

Unstructured Data - Social Media

Unstructured data is raw and unorganized, and organizations store all of it. Ideally, all of this information would be converted into structured data; however, this would be costly and time consuming. Also, not all types of unstructured data can easily be converted into a structured model. For example, an email holds information such as the time sent, subject, and sender (all uniform fields), but the content of the message is not so easily broken down and categorized. This can introduce compatibility issues with the structure of a relational database system. For that reason, email is treated here as an example of unstructured data.
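The email case can be sketched with Python's standard `email` module. The message text and the `split_email` helper are invented for illustration; the point is only that the headers drop neatly into columns while the body stays a single block of free text.

```python
from email import message_from_string

# A made-up message used only for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Delivery problem
Date: Thu, 03 Mar 2016 10:15:00 +0000

Hi Bob, the order arrived late again and two boxes were damaged.
Can we talk about switching carriers?
"""


def split_email(raw):
    """Separate the uniform header fields from the free-text body."""
    msg = message_from_string(raw)
    # These fields are the same for every email, so they map onto columns.
    fields = {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "sent": msg["Date"],
    }
    # The body has no fixed fields; it is just text for a person to read.
    body = msg.get_payload()
    return fields, body


fields, body = split_email(RAW_EMAIL)
print(fields["sender"], "|", fields["subject"])
print(len(body.split()), "words of free text in the body")
```

Turning the body into columns (complaint type, product, urgency and so on) would need text analysis rather than a simple parse, which is where the cost mentioned above comes from.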

Social media data is also generally considered to be unstructured.

In addition to social media there are many other common forms of unstructured data:

Word Documents, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
Audio Files - Customer service recordings, voicemails, 911 phone calls
Presentations - PowerPoints, SlideShares
Videos - Police dash cam, personal video, YouTube uploads
Images - Pictures, illustrations, memes
Messaging - Instant messages, text messages

Data Warehouse – Analyzing different data types

A data warehouse is a database designed to enable business intelligence activities: it exists to help users understand and enhance their organization's performance. It is designed for query and analysis rather than for transaction processing, and usually contains historical data derived from transaction data, but can include data from other sources.

In addition to a relational database, a data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution, statistical analysis, reporting, data mining capabilities, client analysis tools, and other applications that manage the process of gathering data, transforming it into useful, actionable information, and delivering it to business users.

To achieve the goal of enhanced business intelligence, the data warehouse works with data collected from multiple sources. The source data may come from internally developed systems, purchased applications, third-party data syndicators and other sources. It may involve transactions, production, marketing, human resources and more. In today's world of big data, the data may be many billions of individual clicks on web sites or the massive data streams from sensors built into complex machinery. A data warehouse usually stores many months or years of data to support historical analysis. The data in a data warehouse is typically loaded through an extraction, transformation, and loading (ETL) process from multiple data sources.
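The extract, transform and load steps can be sketched in a few lines. Everything here is an assumption made for illustration: the two source systems, their field names, the `fact_sales` table, and the use of SQLite as the warehouse. A real ETL pipeline would read from live systems and handle errors, duplicates and scheduling.

```python
import sqlite3

# Hypothetical extracts from two source systems that describe sales
# with different field names and different units.
WEB_ORDERS = [
    {"day": "2016-03-01", "amount_cents": 1250},
    {"day": "2016-03-02", "amount_cents": 400},
]
STORE_ORDERS = [{"date": "2016-03-01", "amount_dollars": 7.5}]


def extract():
    """Extract: gather raw records from each source, tagged by origin."""
    return [("web", r) for r in WEB_ORDERS] + [("store", r) for r in STORE_ORDERS]


def transform(tagged_records):
    """Transform: map every record onto one (date, source, cents) shape."""
    rows = []
    for source, rec in tagged_records:
        if source == "web":
            rows.append((rec["day"], source, rec["amount_cents"]))
        else:
            rows.append((rec["date"], source, round(rec["amount_dollars"] * 100)))
    return rows


def load(conn, rows):
    """Load: append the unified rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sale_date TEXT, source TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

# Once loaded, an analysis query spans both sources.
daily = warehouse.execute(
    "SELECT sale_date, SUM(amount_cents) FROM fact_sales "
    "GROUP BY sale_date ORDER BY sale_date"
).fetchall()
print(daily)  # [('2016-03-01', 2000), ('2016-03-02', 400)]
```

The transform step is where the sources are reconciled onto one schema, so it tends to grow with every new source added.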

A data warehouse generally stores the following types of data, each of which is discussed individually below.

Data Warehousing Architecture

1) Historical Data

A data warehouse typically contains several years of historical data. The amount of data that you decide to make available depends on available disk space and the types of analysis that you want to support. This data can come from your transactional database archives or other sources.

Some applications might perform analyses that require data at lower levels than users typically view it. You will need to check with the application builder or the application's documentation for those types of data requirements.

2) Derived Data

Derived data is generated from existing data using a mathematical operation or a data transformation. It can be created as part of a database maintenance operation or generated at run-time in response to a query.
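Both ways of producing derived data can be sketched with SQLite. The `sales` table, the view `revenue_live` and the summary table `revenue_summary` are hypothetical names chosen for this example: the view is computed at run-time on every query, while the summary table is built once, as a maintenance job would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, unit_price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", 10, 1.5), ("notebook", 2, 4.0), ("pen", 4, 1.5)],
)

# Run-time derivation: the view recomputes revenue each time it is queried.
conn.execute(
    "CREATE VIEW revenue_live AS "
    "SELECT product, SUM(quantity * unit_price) AS revenue "
    "FROM sales GROUP BY product"
)

# Maintenance-time derivation: the result is computed once and stored.
conn.execute(
    "CREATE TABLE revenue_summary AS SELECT product, revenue FROM revenue_live"
)


def revenue(relation):
    """Return {product: revenue} from one of the two derived relations."""
    if relation not in ("revenue_live", "revenue_summary"):
        raise ValueError("unknown relation: " + relation)
    query = f"SELECT product, revenue FROM {relation}"
    return dict(conn.execute(query).fetchall())


# A new sale arrives after the summary table was built.
conn.execute("INSERT INTO sales VALUES ('pen', 2, 1.5)")

live = revenue("revenue_live")
stored = revenue("revenue_summary")
print(live["pen"], stored["pen"])  # 24.0 21.0
```

The stored summary is cheaper to read but goes stale until it is rebuilt, whereas the view is always current but pays the computation cost on each query.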

3) Metadata

Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.
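As a small illustration, the sketch below asks SQLite for the column metadata of a table and then uses it to convert raw input to the declared types. The `customers` table and the `describe` and `coerce` helpers are hypothetical, and other databases expose their catalogs through different mechanisms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, zip_code TEXT, age INTEGER)"
)


def describe(conn, table):
    """Return {column_name: declared_type} from the database's own catalog."""
    # PRAGMA statements cannot take bound parameters, so only pass
    # trusted table names here.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {name: col_type for _cid, name, col_type, _notnull, _default, _pk in rows}


def coerce(record, schema):
    """Use the declared column types to convert raw string input."""
    casts = {"INTEGER": int, "TEXT": str}
    return {col: casts[schema[col]](value) for col, value in record.items()}


schema = describe(conn, "customers")
row = coerce({"id": "7", "zip_code": "02139", "age": "34"}, schema)
print(schema)
print(row)
```

Here the metadata is what tells the application that `zip_code` must stay text (keeping its leading zero) while `age` should be handled as a number.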

Limitations of Data warehousing in analyzing different types of data

1) Data drill-down and processing time with unstructured data: Data warehouses tend to have static data sets with minimal ability to "drill down" to specific solutions. The data is imported and filtered through a schema, and it is often days or weeks old by the time it is actually used. In addition, data warehouses are usually subject to ad hoc queries and are thus notoriously difficult to tune for processing and query speed. Although the queries are often ad hoc, they are limited by the data relations that were set when the aggregation was assembled.
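The last point can be made concrete with a toy example. The transactions and dimension names below are invented, and a plain dictionary stands in for an aggregate table. Once the data has been rolled up by month and region, a question about products can no longer be answered from the aggregate alone.

```python
from collections import defaultdict

# Hypothetical detailed transactions in a source system.
TRANSACTIONS = [
    {"month": "2016-01", "region": "East", "product": "pen", "amount": 10},
    {"month": "2016-01", "region": "East", "product": "notebook", "amount": 30},
    {"month": "2016-01", "region": "West", "product": "pen", "amount": 5},
]


def aggregate(transactions, dimensions):
    """Roll transactions up to the chosen dimensions, as done at load time."""
    totals = defaultdict(int)
    for t in transactions:
        key = tuple(t[d] for d in dimensions)
        totals[key] += t["amount"]
    return dict(totals)


def can_answer(stored_dimensions, question_dimensions):
    """A question is answerable only if its dimensions were kept."""
    return set(question_dimensions) <= set(stored_dimensions)


STORED = ("month", "region")
warehouse_table = aggregate(TRANSACTIONS, STORED)

print(warehouse_table)
print(can_answer(STORED, ("region",)))   # True
print(can_answer(STORED, ("product",)))  # False
```

The East total of 40 is available, but how it splits between pens and notebooks was discarded when the aggregate was built, so drilling down would mean going back to the source data.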

2) Frequently Changing Data: It is difficult to accommodate changes in data types and ranges, data source schemas, indexes and queries. Flexibility of use and of user types is also limited, so multiple separate data marts are required for multiple uses and types of users.

3) Efforts required to maintain the data changes: Major schema transforms are needed from each of the data sources to the single schema in the data warehouse, and this work can represent more than 50% of the total data warehouse effort. Data owners lose control over their data, raising ownership (responsibility and accountability), security and privacy issues. The initial implementation takes a long time and carries a high cost. Adding new data sources is similarly slow and costly.

Future of Data Warehousing

The big data revolution has brought profound changes to how companies collect, store, manage, and analyze their data. Advances in data warehousing have empowered companies to take millions of rows of disparate bits of information and generate on-demand, real-time insights to help make smarter, data-driven decisions. But the next frontier for data warehousing is more advanced predictive analysis rather than purely informative analytics. Enterprise Data Warehouses (EDWs) will continue to adjust their position because of Hadoop. EDWs will face strong competition from the rising “data lake” architecture based on Hadoop. Data lakes provide cost savings on software and storage, and newer organizations will adopt this strategy for economic reasons. Cloudera, MapR and, to an extent, Hortonworks are embracing this approach.

Enterprises will build “operational data warehouses” to combine data from multiple sources in real time and go beyond dashboards and reports to actually use their data in day-to-day operations. There are three trends driving the move to a more agile model. First is the trend towards wanting to move faster and accommodate more data quickly. Waiting months to develop a schema and build the required ETL is no longer acceptable. Second is the trend towards discovery-based analytics, driven by the consumer experience with search technologies. Business analysts today want a search-based paradigm that allows them to formulate new questions to ask the data based on the results of the question they asked a few seconds ago, and they want the results in real time so they can figure out the question they want to ask next. Third is the trend towards operationalizing the data from the data warehouse. This means building data services that can combine data from multiple sources and deliver that data securely and quickly enough to an operational process that the process can complete in real time. Fraud detection, eligibility for benefits, and customer onboarding are all examples of use cases that used to be performed offline but now need to be performed online in real time.

Because newer generations of data warehouses are designed to federate structured and unstructured data, they may provide enterprises with a 360-degree view of their operations and, with that broader perspective, the ability to make better decisions about the future. Companies running legacy data warehouses don’t have to junk their infrastructure and start anew. They can add capabilities to their existing data warehouse infrastructure that can allow it to grow into an “analytics warehouse.”

Fundamentally, the analytics warehouse functions as a central repository for an enterprise’s structured and unstructured data. In a traditional data warehousing architecture, structured data from ERP systems, CRM systems, file shares, and line of business applications is batch processed into the enterprise data warehouse using ETL (extract, transform, load) database processes. Software for running ad hoc queries and business intelligence systems take data from the warehouse environment, which may include operational data stores and data marts, to generate reports for users.

VamsiVoletiBI

Thursday, March 3, 2016
