The role of data in our lives can be summarized simply as follows.
Data → Information → Knowledge → Intelligence
Without respect and care for data, we would be left with only the following.
Incorrect Data → False Information → Insanity → Ruin
“Big Data” seems to be defined by an ever-growing list of Vs. It helps to review that list and determine which Vs apply in a project deemed to involve “Big Data”.
- Volume – a large amount of data
- Variety – many different types and storage technologies
- Velocity – static or real time/stream data
- Veracity – data correctness and cleanliness
- Variability (random variability) – how data values change, e.g. over time
- Visualization – helps to understand the meaning of data
- Value – data can be useless, ensure value
The best technology depends on the application and business needs; technical administrators and designers must understand both the needs and the data very well.
NAS (Network Attached Storage)
SCSI (Small Computer System Interface)
Disk Storage vs. SSD (Solid State Drives)
- SQL – MS SQL Server, Oracle, MySQL, etc. (based on Codd's relational theory, 1969)
- NoSQL (Not Only SQL: key-value, document, graph, columnar, geospatial) – MongoDB, Flare, Cassandra, etc. (unstructured/semi-structured data, JSON)
IMDB/MMDB (In-Memory/Main Memory Databases)
Spark Hybrid Databases
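The contrast between relational (SQL) and document (NoSQL-style) storage can be sketched in a few lines of Python, using the standard library's sqlite3 and json modules as stand-ins for a SQL database and a JSON document store. The table and field names here are purely illustrative:

```python
import sqlite3
import json

# Relational (SQL) storage: fixed schema, rows and columns (Codd's model).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensors (id INTEGER PRIMARY KEY, name TEXT, value REAL)")
conn.execute("INSERT INTO sensors (name, value) VALUES (?, ?)", ("temp", 21.5))
row = conn.execute("SELECT name, value FROM sensors").fetchone()

# Document (NoSQL-style) storage: schema-free JSON, fields can vary per record.
doc = {"name": "temp", "value": 21.5, "tags": ["rooftop", "celsius"]}
restored = json.loads(json.dumps(doc))

print(row)               # ('temp', 21.5)
print(restored["tags"])  # ['rooftop', 'celsius']
```

The relational row must fit the declared schema; the JSON document can carry extra fields (like `tags`) that other documents omit, which is what makes it a fit for semi-structured data.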
Map and Reduce
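The map-and-reduce pattern can be sketched in plain Python with a word count, the classic teaching example. The chunk contents below are made up; in a real cluster the map phase runs on many chunks in parallel and the reduce phase merges the partial results:

```python
from functools import reduce
from collections import Counter

def map_phase(chunk):
    # Map: emit a partial word count for one chunk of text.
    return Counter(chunk.split())

def reduce_phase(a, b):
    # Reduce: merge two partial counts into a single count.
    return a + b

chunks = ["big data big value", "data velocity data"]
total = reduce(reduce_phase, map(map_phase, chunks))
print(total["data"])  # 3
```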
- hypothesis testing
- C4.5 (decision trees)
- SVM (Support Vector Machines)
- EM (Expectation Maximization)
- K-Nearest Neighbors
- Naive Bayes
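Of the algorithms above, K-Nearest Neighbors is simple enough to sketch in a few lines of standard-library Python. The training points and labels below are made up for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest training points.
    `train` is a list of ((x, y), label) pairs."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # prints: a
```

In practice the distance computation is the expensive part, which is why production implementations use spatial indexes rather than a full sort.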
Analytical Programming Languages and Packages
- Octave (similar to MATLAB)
- R (similar to S)
- NumPy (Python)
- SciPy (Python)
- Weka (Java)
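The hypothesis testing mentioned above is a one-liner in packages like SciPy; as a self-contained sketch using only the standard library, here is Welch's t-statistic for two independent samples (the sample values are made up):

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    # Welch's t-statistic: difference of means scaled by the
    # combined standard error, allowing unequal variances.
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return mean_diff / math.sqrt(va / na + vb / nb)

before = [10.1, 9.8, 10.3, 10.0, 9.9]
after = [10.9, 11.2, 10.8, 11.0, 11.1]
print(round(welch_t(before, after), 2))
```

A t-statistic far from zero (here, strongly negative) suggests the two samples do not share a mean; a full test would convert it to a p-value using the appropriate degrees of freedom.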
The Vs become clearer when considering how “Big Data” can be, and is being, used.
- Volume is a natural result of the data to which businesses are exposed: first over the global Internet, then mobility, and soon IoT (Internet of Things). This is a trove.
- Velocity should be the first thought when considering video/audio applications such as YouTube and Spotify. Video/audio can even be mission critical, as with CCTV in security applications such as airports and utility centers, where systems must recognize movement in restricted areas and alert staff.
- Veracity is required to ensure correct analytical results and ease of analysis. Reliable data is often buried deep inside useless, noisy data; hence the phrase often heard, ETL (extraction, transformation, and loading). Administrators of mission-critical systems may not recognize a very gradual drift toward danger until it is too late, but extracting data from dirty log files and analyzing it can spot drift well before criticality.
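Spotting gradual drift in noisy log data can be sketched with a simple rolling mean against an expected baseline. The window size, baseline, tolerance, and simulated readings below are all made up for illustration; real monitoring would tune these to the system:

```python
from collections import deque

def detect_drift(readings, window=5, baseline=50.0, tolerance=2.0):
    """Return the index where the rolling mean of noisy readings drifts
    beyond `tolerance` of the expected `baseline`, or None if stable."""
    buf = deque(maxlen=window)
    for i, value in enumerate(readings):
        buf.append(value)
        # Averaging over the window smooths out noise so the
        # underlying trend, not a single spike, triggers the alarm.
        if len(buf) == window and abs(sum(buf) / window - baseline) > tolerance:
            return i
    return None

# Simulated values extracted from a log: a slow upward drift buried in noise.
readings = [50 + 0.3 * i + (-1) ** i * 0.4 for i in range(30)]
print(detect_drift(readings))  # index where the smoothed drift exceeds tolerance
```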
- Variability was addressed by statistics long before “Big Data”. Data can change in value; when it changes over time, it is referred to as temporal.
- Visualization with tools such as Tableau, and plotting in languages like R, helps with analysis and reporting to executives; beyond three dimensions, however, it isn't very useful.
- Value in “Big Data” isn't guaranteed; more data doesn't mean important information. Be careful not to expensively collect and store useless data; define the needed information and analytics first. For example, “Supercalifragilisticexpialidocious” is an adjective meaning great, yet requires considerable storage and bandwidth. The indefinite and definite articles (a and the) have nearly the shortest word lengths in the English language, yet carry much meaning, take almost no storage space or bandwidth, and are extremely efficient at conveying information.
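The storage-cost point is easy to make concrete: counting UTF-8 bytes per token shows how cheap the short, high-meaning words are. A toy illustration:

```python
# Compare the storage cost of long vs. short high-meaning tokens (UTF-8 bytes).
words = ["Supercalifragilisticexpialidocious", "a", "the"]
sizes = {w: len(w.encode("utf-8")) for w in words}
print(sizes)  # {'Supercalifragilisticexpialidocious': 34, 'a': 1, 'the': 3}
```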
*This article is a digest of a recent lecture given to engineering and marketing staff.