Wednesday, March 30, 2016

Presentation and Visualization Methods


Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.

Demand for visuals and infographics has grown quickly, and many visualization tools are entering the market. So why exactly is visualization important?
Our minds can process far more data when it is presented visually than when it is in raw form.


  •        Patterns are easier to identify and compare when data is represented visually.
  •        Trends are easier to identify.
  •        Pie charts, bar graphs, and line graphs are some of the common visualizations used.
Let's look at some interesting visualizations that can be built to understand a wide variety of data across the following industries:

  •        Healthcare
  •        E-commerce
  •        Insurance


Healthcare


Pictures can provide a large amount of information in a short amount of time. Most individuals are moved to action by strong emotions. When they have a visceral response to a stimulus they are often more willing to take action. Healthcare visualizations are similar to today’s popular infographics except the data they show is specific to healthcare. With their easy-to-understand graphical representation of complex data, healthcare visualizations can be used by anyone involved in healthcare improvements.

Figure 1:

This infographic shows a pictorial representation of clinical decision support system (CDSS) adoption across the United States.

Data points presented in this graphic are:

1)      CDSS adoption by state in 2007 and 2010
2)      Percentage changes in adoption between 2007 and 2010
3)      Adoption rate by hospital size in number of beds
4)      Total change in CDSS adoption









Figure 2: 

This infographic shows a dashboard with multiple visualizations covering the blood bank inventory management system of a Laboratory Information System.








Data points presented in this graphic are:

1)      Units wasted due to expiration, by type and month
2)      Successfully transferred units
3)      Status of present units by type and quantity
4)      Units ordered by type and month


E-commerce


Once ecommerce businesses gather data from a customer’s activity, account information, geolocation, or social media accounts, they can use data visualization to make comparisons, identify patterns, or show relationships. If businesses can see who is visiting and engaging, they can optimize their conversion process to not only increase sales or leads, but to create comprehensive strategies.

Figure 1:


This infographic shows the sales performance of an e-commerce company over a period of time, with customer demographics across India.

Data points presented here are:

1)      Gross Sales by State
2)      Share of male and female customers, subdivided by age group
3)      Share of male and female customers, subdivided by payment method
4)      Share of male and female customers, subdivided by device/platform used
5)      Percentage changes over a period of time in metrics such as number of customers, number of orders, items per order, and return rate

Figure 2:

This infographic shows the daily brand mentions of the brands sold on e-commerce websites all over India.

Data points presented here are:

1)      Brand mentions over a period of time with a trend dotted line
2)      Peaks in mentions, each explained by the exact tweet about a sale or contest run by an e-commerce website

Insurance


Policies. Premiums. Claims. Payouts. Every transaction is a data point ripe for analysis and action. Realizing how quickly we understand and internalize what we see is at the foundation of what makes data visualization such an important aspect of how we analyze information and make better decisions.

Figure 1:


This infographic shows an insurance manager's dashboard with claims analysis for the month of January.








Data points presented here are:

1)      Claims by type of insurance
2)      Claimant information summary with details such as ID, age, gender, and status
3)      Claims by date of claim
4)      Claims by status

Figure 2:

This infographic shows a dashboard for a competitive analysis of the insurance industry by state in the United States.

Data points presented here are:

1)      Largest insurers' market share
2)      Insurers with >5% market share
3)      HHI (Herfindahl-Hirschman Index)
4)      Individual HHI by state
5)      Small group and large group HHI by state
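
For readers unfamiliar with HHI, the Herfindahl-Hirschman Index is a standard measure of market concentration: the sum of the squared market shares (in percent) of the firms in a market, so a monopoly scores 10,000. The sketch below shows the calculation and the ">5% market share" filter; the insurer names and share values are made up for illustration and are not taken from the figure.

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman Index: sum of squared market shares (in percent)."""
    return sum(share ** 2 for share in shares_percent)


def insurers_above(shares_by_insurer, threshold=5.0):
    """Return the insurers whose market share exceeds the threshold (percent)."""
    return [name for name, share in shares_by_insurer.items() if share > threshold]


# Hypothetical market shares (percent) for one state; values are illustrative only.
state_shares = {"Insurer A": 50, "Insurer B": 30, "Insurer C": 16, "Insurer D": 4}

state_hhi = hhi(state_shares.values())          # 2500 + 900 + 256 + 16
major_insurers = insurers_above(state_shares)   # Insurer D falls below 5%

print(state_hhi, major_insurers)
```

A higher HHI means a more concentrated market, which is why a per-state view is useful for competitive analysis.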










Thursday, March 3, 2016




Big Unstructured Data vs. Structured Relational Data


Structured Data: Data that is organized physically and logically so that it can be loaded into a relational database without issues and is readily searchable by simple, straightforward search algorithms or other search operations. Despite its simplicity, most experts in today’s data industry estimate that structured data accounts for only 20% of the data available. It is clean, analytical and usually stored in databases. Structured data is information, usually text, arranged in titled columns and rows that can easily be ordered and processed by data mining tools. It can be visualized as a perfectly organized filing cabinet where everything is identified, labeled and easy to access.

Some examples of structured data:

Machine Generated
  •        Sensor Data - GPS data, manufacturing sensors, medical devices
  •        Point-of-Sale Data - Credit card information, location of sale, product information
  •        Call Detail Records - Time of call, caller and recipient information
  •        Web Server Logs - Page requests, other server activity

Human Generated
  •        Input Data - Any data entered into a computer: age, zip code, gender, etc.

Unstructured Data:

Unstructured Data - Social Media
Unstructured data is raw and unorganized, and organizations store all of it. Ideally, all of this information would be converted into structured data; however, this would be costly and time consuming. Also, not all types of unstructured data can easily be converted into a structured model. For example, an email holds information such as the time sent, subject, and sender (all uniform fields), but the content of the message is not so easily broken down and categorized. This can introduce compatibility issues with the structure of a relational database system, which is why email is treated as an example of unstructured data.
Social media data is also considered to be unstructured.
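
The email example can be made concrete with a small sketch using Python's standard `email` module. The message below is invented for illustration: the header fields map cleanly onto columns, while the body remains free text that needs further processing before it fits a relational model.

```python
from email import message_from_string

# A made-up email used only for illustration.
raw_email = """\
From: alice@example.com
To: support@example.com
Subject: Order arrived damaged
Date: Thu, 03 Mar 2016 10:15:00 +0530

Hi, the phone case I ordered came cracked. Can I get a replacement?
"""

msg = message_from_string(raw_email)

# Uniform header fields: these fit naturally into table columns.
structured_fields = {
    "sender": msg["From"],
    "subject": msg["Subject"],
    "sent": msg["Date"],
}

# The message content: free text with no fixed schema.
body = msg.get_payload()

print(structured_fields)
print(body)
```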


In addition to social media there are many other common forms of unstructured data:
  •        Word documents, PDFs and other text files - Books, letters, other written documents, audio and video transcripts
  •        Audio Files - Customer service recordings, voicemails, 911 phone calls
  •        Presentations - PowerPoints, SlideShares
  •        Videos - Police dash cam, personal video, YouTube uploads
  •        Images - Pictures, illustrations, memes
  •        Messaging - Instant messages, text messages


Data Warehouse – Analyzing different data types

A data warehouse is a database designed to enable business intelligence activities: it exists to help users understand and enhance their organization's performance. It is designed for query and analysis rather than for transaction processing, and usually contains historical data derived from transaction data, but can include data from other sources.

In addition to a relational database, a data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution, statistical analysis, reporting, data mining capabilities, client analysis tools, and other applications that manage the process of gathering data, transforming it into useful, actionable information, and delivering it to business users.

To achieve the goal of enhanced business intelligence, the data warehouse works with data collected from multiple sources. The source data may come from internally developed systems, purchased applications, third-party data syndicators and other sources. It may involve transactions, production, marketing, human resources and more. In today's world of big data, the data may be many billions of individual clicks on web sites or the massive data streams from sensors built into complex machinery. A data warehouse usually stores many months or years of data to support historical analysis. The data in a data warehouse is typically loaded through an extraction, transformation, and loading (ETL) process from multiple data sources.
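
As a rough illustration of the ETL flow described above, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the warehouse. The CSV content, the `orders_history` table and the cleaning rule (skip rows with a missing amount) are all assumptions made for this example, not part of any particular product.

```python
import csv
import io
import sqlite3

# Extract source: a hypothetical CSV export from an order system.
source_csv = io.StringIO(
    "order_id,order_date,amount\n"
    "1001,2016-03-01, 499.00\n"
    "1002,2016-03-01,\n"
    "1003,2016-03-02,1250.50\n"
)


def extract(handle):
    """Read raw rows from the source as dictionaries."""
    return list(csv.DictReader(handle))


def transform(rows):
    """Clean and type the rows; rows with a missing amount are skipped."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((int(row["order_id"]), row["order_date"], float(amount)))
    return cleaned


def load(conn, rows):
    """Write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_history "
        "(order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders_history VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source_csv)))

# An analytical query over the loaded history.
daily_totals = warehouse.execute(
    "SELECT order_date, SUM(amount) FROM orders_history "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily_totals)
```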

A data warehouse generally stores the types of data listed below, each of which is discussed individually.

Data Warehousing Architecture

1) Historical Data
A data warehouse typically contains several years of historical data. The amount of data that you decide to make available depends on available disk space and the types of analysis that you want to support. This data can come from your transactional database archives or other sources.
Some applications might perform analyses that require data at a lower level of detail than users typically view. You will need to check with the application builder or the application's documentation for those types of data requirements.
2) Derived Data
Derived data is generated from existing data using a mathematical operation or a data transformation. It can be created as part of a database maintenance operation or generated at run-time in response to a query.
3) Metadata
Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.
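
To illustrate derived data, the sketch below computes a margin and a margin percentage from stored revenue and cost values at run-time. The field names and figures are hypothetical.

```python
# Stored (base) data: hypothetical order records.
orders = [
    {"order_id": 1, "revenue": 500.0, "cost": 350.0},
    {"order_id": 2, "revenue": 1200.0, "cost": 900.0},
]


def with_margin(order):
    """Return a copy of the order with derived margin fields added."""
    margin = order["revenue"] - order["cost"]
    margin_pct = round(100 * margin / order["revenue"], 1)
    return {**order, "margin": margin, "margin_pct": margin_pct}


# Derived data: generated from existing data by a mathematical operation.
derived = [with_margin(order) for order in orders]
print(derived)
```

The same values could instead be precomputed and stored during a maintenance job, trading storage for query speed.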

Limitations of Data warehousing in analyzing different types of data


    1) Data drill-down and processing time with unstructured data: Data warehouses tend to have static data sets with minimal ability to "drill down" to specific solutions. The data is imported and filtered through a schema, and it is often days or weeks old by the time it's actually used. In addition, data warehouses are usually subject to ad hoc queries and are thus notoriously difficult to tune for processing and query speed. Although the queries are often ad hoc, they are limited by the data relations that were set when the aggregation was assembled.
    2) Frequently changing data: It is difficult to accommodate changes in data types and ranges, data source schemas, indexes and queries. Flexibility of use and of user types is also limited, since multiple separate data marts are required for multiple uses and types of users.
    3) Effort required to maintain data changes: Major schema transformations are needed from each of the data sources to the single schema of the data warehouse, which can represent more than 50% of the total data warehouse effort. Data owners lose control over their data, raising ownership (responsibility and accountability), security and privacy issues. The initial implementation takes a long time and carries a high cost, and adding new data sources later is similarly slow and expensive.

Future of Data Warehousing


The big data revolution has brought profound changes to how companies collect, store, manage, and analyze their data. Advances in data warehousing have empowered companies to take millions of rows of disparate bits of information and generate on-demand, real-time insights to help make smarter, data-driven decisions. But the next frontier for data warehousing lies in more advanced predictive analysis rather than purely informative analytics. Enterprise Data Warehouses (EDWs) will continue to adjust their standing because of Hadoop, and they will face strong competition from the rising “data lake” architecture based on it. Data lakes provide cost savings on software and storage, so newer organizations will adopt this strategy for economic reasons. Cloudera, MapR and to an extent Hortonworks are embracing this approach.

Enterprises will build “operational data warehouses” to combine data from multiple sources in real time and go beyond dashboards and reports to actually use their data in day-to-day operations. There are three trends driving the move to a more agile model. First is the trend towards wanting to move faster and accommodate more data quickly. Waiting months to develop a schema and build the required ETL is no longer acceptable. Second is the trend towards discovery-based analytics, driven by the consumer experience with search technologies. Business analysts today want a search-based paradigm that allows them to formulate new questions based on the results of the question they asked a few seconds ago, and they want the results in real time so they can figure out the question they want to ask next. Third is the trend towards operationalizing the data from the data warehouse. This means building data services that can combine data from multiple sources and deliver that data securely and efficiently to an operational process so that the process can complete in real time. Fraud detection, eligibility for benefits, and customer onboarding are all examples of use cases that used to be performed offline but now need to be performed online in real time.

Because newer generations of data warehouses are designed to federate structured and unstructured data, they may provide enterprises with a 360-degree view of their operations and, with that broader perspective, the ability to make better decisions about the future. Companies running legacy data warehouses don’t have to junk their infrastructure and start anew. They can add capabilities to their existing data warehouse infrastructure that can allow it to grow into an “analytics warehouse.”

Fundamentally, the analytics warehouse functions as a central repository for an enterprise’s structured and unstructured data. In a traditional data warehousing architecture, structured data from ERP systems, CRM systems, file shares, and line of business applications is batch processed into the enterprise data warehouse using ETL (extract, transform, load) database processes. Software for running ad hoc queries and business intelligence systems take data from the warehouse environment, which may include operational data stores and data marts, to generate reports for users.




Wednesday, February 17, 2016

Business Intelligence for eCommerce



Company/Industry


Since I have worked in the eCommerce industry as a business user, I will analyze my company, latestone.com, and the eCommerce industry in general for the purposes of this blog.

Description

Latestone.com is an eCommerce website founded in 2013 by Palred Technologies Limited. It specializes in the niche market of procuring, designing and selling electronic accessories all over India. It fulfills around 3000 orders per day across India and has two warehouses, one in Delhi and one in Hyderabad, to store its inventory. Sourcing happens from various cities within India such as Mumbai, Delhi and Hyderabad, and also from outside India, specifically China. The technology that runs the website and the backend admin panel was developed in-house by the software team, whereas the database and ERP are licensed products.

Performance Metrics

Teams at latestone.com are divided by function: Procurement, Operations (shipping etc.), Warehousing, Technology, Sales & Marketing, HR, Content Management and so on. Each team reports to the CEO, CTO, COO or CFO. I'll pick up the metrics for two departments here: Marketing and Warehousing. The Sales & Marketing team uses various marketing avenues such as online ads (Facebook and Google), newspaper ads and radio ads.

Sales and Marketing Metrics:

  • Number of unique visitors per day
  • Visitors through different online ads
  • Return on investment (ROI) through Facebook vs Google ads
  • ROI for online vs offline advertising
  • Bounce rate per day/hour
  • What type of payment option is most common? By size of purchase? By socioeconomic level?
  • In multiple-product orders, is there any correlation between the purchases of particular products? Can we make combo products out of this?
  • Most searched/ordered products and products with the most time spent
  • Since our last price schedule adjustment, which products have improved and which have deteriorated?
  • What are the average order amounts from different kinds of marketing across days?
  • What are the top 5 most profitable products by product category and demographic location?
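
Two of the metrics above can be sketched in a few lines: ROI per advertising channel, using the common definition (revenue minus spend, divided by spend), and a simple co-occurrence count of products bought together as a first step towards combo products. All channel figures and orders below are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def roi(revenue, ad_spend):
    """Return on investment as a fraction: (revenue - spend) / spend."""
    return (revenue - ad_spend) / ad_spend


# Hypothetical spend and attributed revenue per channel.
channels = {
    "facebook": {"spend": 10000.0, "revenue": 14000.0},
    "google": {"spend": 8000.0, "revenue": 10000.0},
}
roi_by_channel = {
    name: roi(c["revenue"], c["spend"]) for name, c in channels.items()
}


def pair_counts(orders):
    """Count how often each pair of products appears in the same order."""
    counts = Counter()
    for items in orders:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Hypothetical multi-product orders.
orders = [
    ["phone case", "screen guard"],
    ["phone case", "screen guard", "charger"],
    ["charger", "usb cable"],
]
top_pair, top_count = pair_counts(orders).most_common(1)[0]

print(roi_by_channel)
print(top_pair, top_count)
```

Raw pair counts favor popular products, so a real analysis would likely normalize them (for example with support and lift) before proposing combos.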

Warehousing metrics:


  • Number of products received/picked/packed/invoiced/delivered per day
  • Turnaround time for picking/packing/invoicing/printing
  • Comparison of proof-of-delivery time reported by different carriers
  • Most sold/unsold products
  • Inventory space/aisle movements per day
  • What is the average time from ordering date to shipping date? Does this vary by product?
  • Replenishment levels in different bins from bulk locations/pick locations
  • Is there a change in delivery type at different times of the year, i.e. preceding major holidays?
  • Re-ordering analysis/alerts for different products every day
  • Do we have adequate inventory for a particular product to meet anticipated demand?
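
As a sketch of how two of these warehousing metrics might be computed, the example below averages order-to-ship time per product and raises a reorder alert using the common reorder-point rule (expected demand during lead time plus safety stock). The shipment records, demand rate and lead time are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical shipment records.
shipments = [
    {"product": "power bank", "ordered": date(2016, 2, 1), "shipped": date(2016, 2, 3)},
    {"product": "power bank", "ordered": date(2016, 2, 5), "shipped": date(2016, 2, 9)},
    {"product": "earphones", "ordered": date(2016, 2, 2), "shipped": date(2016, 2, 3)},
]


def avg_days_to_ship(rows):
    """Average number of days from order date to ship date, per product."""
    days = defaultdict(list)
    for row in rows:
        days[row["product"]].append((row["shipped"] - row["ordered"]).days)
    return {product: sum(d) / len(d) for product, d in days.items()}


def needs_reorder(on_hand, daily_demand, lead_time_days, safety_stock=0):
    """True when stock on hand has fallen to or below the reorder point."""
    reorder_point = daily_demand * lead_time_days + safety_stock
    return on_hand <= reorder_point


ship_times = avg_days_to_ship(shipments)
alert = needs_reorder(on_hand=100, daily_demand=20, lead_time_days=4, safety_stock=30)

print(ship_times, alert)
```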


Dimensional modelling


To analyze the above queries, I first wanted to build a bus architecture to get a clear understanding of which dimensions are suitable for which business process/team/context.

Bus Matrix:

Dimensions (columns): Date, Time, Product, Customer, Promotion, Vendor, Carrier, Payment mode, Employee

Business processes (rows), with the number of dimensions marked "x" for each; the flattened table does not preserve which columns the marks fall under:

  • Sales: 7
  • Advertising: 6
  • Inventory: 7
  • Procurement: 5
  • Warehousing: 5
  • Shipping: 7
  • Finance: 6
  • Technology: 4


After the business processes are identified, the focus should be on declaring the grain for each individual context. After the grain is identified, the facts and dimensions should be established for those processes.

Some of the dimensions will be as shown in figures below:


Advertising dimension:

Customer Dimension:


So the fact table for Sales process will be something like:




The above model is a transaction-type model, as it is based on the individual order/transaction. In the eCommerce scenario, we can also have accumulating and periodic snapshot models, as the CXOs might be interested in knowing metrics every month or the status of something to date.
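
To make the difference between a transaction-grain fact table and a periodic snapshot concrete, here is a minimal star-schema sketch in an in-memory SQLite database. The table names, columns and the chosen grain (one row per order line) are assumptions for illustration and are not the actual latestone.com model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: two dimensions and one transaction-grain fact table.
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    order_id INTEGER,
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL
);
""")

conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20160201, "2016-02-01", "2016-02"), (20160315, "2016-03-15", "2016-03")],
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "phone case", "cases"), (2, "charger", "power")],
)
# Transaction grain: one row per order line.
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 20160201, 1, 2, 398.0),
        (5002, 20160201, 2, 1, 250.0),
        (5003, 20160315, 1, 1, 199.0),
    ],
)

# Periodic snapshot view: orders and revenue rolled up per month.
monthly_snapshot = conn.execute("""
    SELECT d.month, COUNT(DISTINCT f.order_id), SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(monthly_snapshot)
```

In practice a periodic snapshot would usually be stored as its own fact table rather than computed on demand, but the query shows how the monthly view relates to the transaction rows.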

