We know Big Data. We know Hadoop. We know Apache Spark. Some of us are using it. A lot of us have learned it. Most of us understand bits and pieces of it. This slice of reading is for those of you who want a short history and a quick refresher!
How big does data have to be to qualify as big data? The general industry convention is to call anything at petabyte scale and above big data (a petabyte is 1000 raised to the power 5 bytes). From peta, we go on to exa, zetta, yotta and so on.
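To put those prefixes in perspective, here is a quick back-of-the-envelope sketch in Python, using the decimal convention of powers of 1000 mentioned above:

```python
# Back-of-the-envelope byte scales (decimal SI prefixes, powers of 1000).
UNITS = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

for power, name in enumerate(UNITS, start=1):
    print(f"1 {name}byte = 1000^{power} bytes = {1000 ** power:,} bytes")

# 1 petabyte = 1000^5 bytes = 1,000,000,000,000,000 bytes -- the rough
# threshold the industry uses for "big data".
```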
Like most things in nature, the way data is structured helps us differentiate it. There are three such types of data (a short sketch of each follows the list).
Structured Data – Data that can be organized into rows and columns (RDBMS tables).
Unstructured Data – Data that cannot be organized into rows and columns (e.g. files containing free text).
Semi-Structured Data – Data that is unstructured but can be made to sit in a schema; it falls between structured and unstructured data (e.g. XML files).
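To make the three shapes concrete, here is a tiny Python sketch; the records, the free-text note, and the XML snippet are made-up examples, not real datasets:

```python
import xml.etree.ElementTree as ET

# Structured: rows and columns, every record shares the same fields (RDBMS-style).
structured = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": "Chennai"},
]

# Unstructured: free text with no rows, columns, or schema to speak of.
unstructured = "Met the vendor today; the sensor batch looks fine, ship on Friday."

# Semi-structured: no fixed tabular layout, but tags/keys let us impose a
# schema when we read it (XML, JSON, etc.).
semi_structured = "<order><id>17</id><item qty='3'>sensor</item></order>"

root = ET.fromstring(semi_structured)
print("order id:", root.findtext("id"))           # -> order id: 17
print("item qty:", root.find("item").get("qty"))  # -> item qty: 3
```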
The structure of data has a lot to do with how it is collected or generated. Data collected from sources like sensors, images and videos often has no structure but comes in massive volumes. Hence around 80% to 90% of the data that qualifies as big data is unstructured.
Big data does not just mean a huge volume of data. Apart from volume, the other characteristics theoretically attributed to big data are velocity, variety, veracity, variability, and value.
Before Hadoop, we used RDBMS, parallel processing, and supercomputers to manage data at such volumes. However, tabular RDBMS were difficult to scale, parallel processing had overhead-time issues, and supercomputers lacked a general-purpose OS to manage their processes and locked users into a single hardware vendor, among many other issues.
Hadoop overcomes the challenges of supercomputers. Its open-source framework gives the feel of a general-purpose OS. And because it runs on commodity hardware (hardware from multiple vendors that is not too expensive and is easy to procure and maintain), even mid-sized organizations can afford to use it.
Other Features of Hadoop in a Capsule
1. It sits on top of another OS – typically a version of Linux (only Hortonworks supports Windows).
2. Schema on read (a short sketch follows this list).
3. Write once, read many times.
4. Batch processing.
5. Can process unstructured and semi-structured data.
6. It is data intensive and can handle petabytes of data.
7. Linear scaling.
8. It moves code to data, thereby reducing data-transfer overheads (see the word-count sketch after this list).
9. It uses a master-slave model – all nodes are connected by Ethernet cables, optical fiber or, rarely, Wi-Fi.
If there are 1000 machines in a cluster, 997 will be data nodes (hard-drive intensive) and 3 will be the master nodes: the NameNode, the Secondary NameNode, and the JobTracker (a tiny sketch of this split also follows the list).
10. Native host operating systems – a Linux-based OS runs on all nodes in the cluster, and Hadoop is installed on every machine.
11. Hadoop vendors – for a large organization, it is difficult to maintain such an open-source framework without a large team dedicated to it. Hence vendors like Cloudera, MS Azure, IBM, MapR, Hortonworks etc. provide packaged solutions on top of Hadoop.
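Feature 2, schema on read, means data lands in storage as raw files and a structure is imposed only when the data is read. Below is a minimal Python sketch of the idea, with a made-up field layout; tools in the Hadoop ecosystem such as Hive apply the same principle at much larger scale:

```python
import csv
import io

# Raw file content as it might sit in HDFS: just delimited text, no declared schema.
raw = "1,Asha,Pune\n2,Ravi,Chennai\n"

# Schema on read: column names and types are applied only at query time.
schema = [("id", int), ("name", str), ("city", str)]

def read_with_schema(raw_text, schema):
    """Yield dicts by imposing the schema on the raw text as it is read."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

for record in read_with_schema(raw, schema):
    print(record)   # {'id': 1, 'name': 'Asha', 'city': 'Pune'} ...
```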
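Features 4 and 8 (batch processing, moving code to data) are easiest to see in the classic word-count job. With Hadoop Streaming, which ships with Hadoop, a small script like the sketch below is shipped to the nodes that hold the data and run as a batch over it. This is only an illustrative version; the "map"/"reduce" command-line switch is our own convention:

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: run as 'wordcount.py map' or 'wordcount.py reduce'."""
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You would typically submit this with the hadoop-streaming JAR bundled with your distribution, passing the script as both the mapper and the reducer and pointing the input and output options at HDFS paths; the exact JAR location varies by installation.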
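Feature 9's master-slave split can be pictured like this: the NameNode keeps only metadata about which blocks make up a file and which data nodes hold the replicas, while the data nodes store the blocks themselves. The file path, block names, and node names in this sketch are entirely hypothetical:

```python
# Hypothetical view of the NameNode's metadata: file -> blocks -> replica locations.
namenode_metadata = {
    "/logs/2017/clicks.log": {
        "block_0": ["datanode-042", "datanode-311", "datanode-877"],  # 3 replicas
        "block_1": ["datanode-015", "datanode-233", "datanode-640"],
    }
}

def locate(path):
    """Ask the 'NameNode' where each block of a file lives; data nodes serve the bytes."""
    return namenode_metadata[path]

print(locate("/logs/2017/clicks.log"))
```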
Big data, and how we use it, is one of the most important fields of science shaping this century. It is not just interesting to understand; it is imperative to do so, given the role it is going to play in our lives going forward.
E-mail us at she@shedrivesdata.com to inspire our readers with your story – be it your success story or a lesson learned, share what you learned or send some love to a friend. We would love to hear from you!