Big Data: Definition, Technologies, Analysis, and Utility

Definition

The meaning of data in our lives can be described simply as follows.

Data → Information → Knowledge → Intelligence

Without respect and care for data, we would only have the following.

Incorrect Data → False Information → Insanity → Ruin

“Big Data” seems to be defined by a growing list of Vs. It helps to refer to this list and determine which of the Vs apply in a project deemed to involve “Big Data”.
  • Volume – a large amount of data
  • Variety – many different types and storage technologies
  • Velocity – how fast data arrives, from static to real-time/streaming data
  • Veracity – data correctness and cleanliness
  • Variability (Random Variability) – data values can change, including over time
  • Visualization – helps to understand the meaning of data
  • Value – data can be useless; ensure it has value

Technologies

The best technologies depend on the applications and business needs, and require technical administrators and designers to understand both those needs and the data very well.

File Storage

NAS (Network Attached Storage)

Block Storage

SCSI (Small Computer System Interface)
Disk Storage vs. SSD (Solid State Drives)

Object Storage

RESTful API

Disk Databases

  • SQL: MS SQL Server, Oracle, MySQL, etc. (based on Codd's relational model, 1969)
  • NoSQL (Not Only SQL): key-value, document, graph, columnar, and geospatial databases such as MongoDB, Flare, Cassandra, etc. (unstructured/semi-structured data, e.g. JSON)
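The contrast between the two families can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for a server database such as those named above, and a plain dictionary of JSON strings stands in for a document store. The table, field names, and records are made up for the example.

```python
import json
import sqlite3

# Relational (SQL): a fixed schema is declared before any row is stored.
conn = sqlite3.connect(":memory:")  # SQLite is a stand-in for a server database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    (1, "Ada", "London"),
)
sql_rows = conn.execute(
    "SELECT name, city FROM customers WHERE id = ?", (1,)
).fetchall()

# Document (NoSQL style): each record is a self-describing JSON document,
# so two records in the same collection may carry different fields.
collection = {}  # hypothetical stand-in for a document store, keyed by id
collection["1"] = json.dumps({"name": "Ada", "city": "London"})
collection["2"] = json.dumps(
    {"name": "Lin", "tags": ["vip"], "geo": {"lat": 1.3, "lon": 103.8}}
)
doc = json.loads(collection["2"])

print(sql_rows)           # rows come back as tuples matching the schema
print(doc["geo"]["lat"])  # nested fields are reached by key, no schema needed
```

The second document has no city field and adds nested geospatial data, which a relational table would only accept after a schema change.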

IMDB/MMDB (In-Memory/Main Memory Databases)

  • Spark

Hybrid Databases

  • WebDNA

Map and Reduce

  • Hadoop
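The map and reduce idea can be sketched as a word count running in a single process. This is a minimal illustration, not Hadoop's API: Hadoop distributes these phases across many machines, which is not shown here, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big value", "data value data"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across workers, which is what makes the pattern suitable for large volumes.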

Analysis

Statistics
  • hypothesis testing
  • inference
  • regression
  • categorical
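As one illustration of the regression item above, a simple linear regression can be fitted by ordinary least squares in plain Python. This is a minimal sketch; the function name and the sample data are made up, and in practice a package such as R or SciPy would normally be used.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```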
Data Mining
  • C4.5
  • K-Means
  • SVM (Support Vector Machines)
  • Apriori
  • EM (Expectation Maximization)
  • PageRank
  • AdaBoost
  • K-Nearest Neighbors
  • Naive Bayes
  • CART
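To make one of the algorithms above concrete, K-Means can be sketched in plain Python. This is a minimal illustration under assumptions: the first k points are used as initial centroids so that the result is deterministic (real implementations usually choose them randomly or with a smarter seeding), and the sample points are made up.

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups by repeated assign-and-update steps."""
    centroids = list(points[:k])  # assumption: first k points seed the clusters
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```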
Analytical Programming Languages and Packages
Proprietary
  • MATLAB
  • S
Open Source
  • Octave (similar to MATLAB)
  • R (similar to S)
  • NumPy (Python)
  • SciPy (Python)
  • Weka (Java)

Utilization

The Vs become clear when considering how “Big Data” can be, and is being, used.

  • Volume is a natural result of the data to which businesses are exposed, first over the global Internet, then through mobility, and soon through IoT (Internet of Things). This is a trove.
  • Variety is due to the many new applications reliant on new types of data and technologies for optimization. NoSQL databases and JSON (JavaScript Object Notation) are just two examples. There are certain to be more in the future.
  • Velocity of data should be the first thought when considering video/audio applications, such as YouTube and Spotify. Video/audio can even be mission critical over CCTV in security applications, such as those required at airports and utility centers to recognize movement in restricted areas and alert staff.
  • Veracity is required to ensure correct analytical results and ease of analysis. Often data might be reliable yet buried deep inside other useless, noisy data. A phrase often heard is ETL (extract, transform, load). Mission-critical system administration may not be able to recognize the very gradual drifting of systems into danger before it's too late. Extracting data from dirty log files and analyzing it could spot drift well before it becomes critical.
  • Variability was addressed by statistics long before “Big Data”. Data can change in value; when the change occurs over time, the data is referred to as temporal.
  • Visualization with tools such as Tableau, and plotting in languages like those listed under Analysis, helps with analysis and reporting to executives; however, visualization beyond 3 dimensions isn't very useful.
  • Value of “Big Data” isn't guaranteed, and a large volume doesn't mean important information. Be careful not to collect and store useless data at great expense. Define the needed information and analytics first. For example, “Supercalifragilisticexpialidocious” is an adjective meaning great, yet it requires more storage and bandwidth than most words. The indefinite and definite articles (i.e. a and the) have the shortest and nearly shortest word lengths in the English language, yet they carry much meaning, take almost no storage space or bandwidth, and are extremely efficient at conveying information.
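The Veracity point above, extracting readings from dirty log files to spot gradual drift, can be sketched as follows. This is a minimal illustration under assumptions: the log format, the `temp=` field, the baseline, and the tolerance are all hypothetical, and the sample log lines are made up.

```python
import re

# Hypothetical log format: the reading is embedded in otherwise noisy lines.
READING = re.compile(r"temp=(-?\d+(?:\.\d+)?)")


def extract(lines):
    """Extract: pull numeric readings out of dirty log lines, skipping noise."""
    values = []
    for line in lines:
        match = READING.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def first_drift(values, window, baseline, tolerance):
    """Return the index of the reading at which the moving average of the
    last `window` readings first differs from `baseline` by more than
    `tolerance`, or None if it never does."""
    for end in range(window, len(values) + 1):
        average = sum(values[end - window:end]) / window
        if abs(average - baseline) > tolerance:
            return end - 1
    return None


# Made-up sample log: each reading rises only slightly over the previous one.
log = [
    "2024-01-01 INFO temp=70.0 ok",
    "#### corrupted line ####",
    "2024-01-01 INFO temp=70.5 ok",
    "DEBUG heartbeat",
    "2024-01-01 INFO temp=71.0 ok",
    "2024-01-01 INFO temp=71.5 ok",
    "2024-01-01 INFO temp=72.0 ok",
    "2024-01-01 INFO temp=72.5 ok",
]
readings = extract(log)
drift_index = first_drift(readings, window=3, baseline=70.0, tolerance=1.0)
print(readings)
print(drift_index)
```

No single step between readings looks alarming here, but the moving average leaves the tolerated band around the baseline, which is the kind of slow drift the text describes.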

*This article comes from and is a digest of a recent lecture given to engineering and marketing staff.