Big Data Tutorials

Big data environments can be built with solutions from commercial data processing and storage providers, or we can create our own environment.

In my classes at the Instituto Tecnológico de Aeronáutica (ITA), I teach big data infrastructure and analysis using free and open-source tools from the Apache Software Foundation.

Figure 1: Data Pipeline Phases and Related Apache Tools.

The phases presented in Figure 1 are related to the phases of our data pipeline framework:

Figure 2: Data Pipeline Framework.

___________________________________________________________________________________

For each tool of the Hadoop ecosystem presented in Figure 1, an installation and configuration tutorial was created to simulate a big data environment on a desktop running the Windows operating system. This simulated environment can implement a complete data pipeline and is used in my classes. The tutorials are written in Portuguese, which is not a problem nowadays: just translate them to your preferred language.

Apache Tools Tutorials:

  • Hadoop - the distributed file system and data processing environment
  • HBase - the NoSQL column-family database for Hadoop
  • Hive - the data warehouse storage for Hadoop
  • Spark - the interactive data processing and analysis tool for Hadoop
  • Kafka - the distributed event streaming tool used for high-performance data streams
  • Flink - the tool for stateful computations over unbounded and bounded data streams
  • Airflow - the tool used to programmatically author, schedule, and monitor data workflows

Note: Airflow is needed to schedule and monitor the data processing jobs of the pipeline.
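
As a quick sanity check of the simulated environment, the sketch below uses Hadoop's Java FileSystem API to create a directory and list the HDFS root. It is a minimal illustration, assuming a single-node setup where fs.defaultFS is hdfs://localhost:9000 (the usual single-node default); adjust the address to match your core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSmokeTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumption: single-node HDFS on the default address; this must
            // match the fs.defaultFS value configured in core-site.xml.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            try (FileSystem fs = FileSystem.get(conf)) {
                // Create a scratch directory, then list everything under "/".
                fs.mkdirs(new Path("/tmp/smoke-test"));
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }

Compile it against the Hadoop client libraries installed by the tutorial and run it on the same machine; if the listing prints, HDFS is up and reachable.
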
___________________________________________________________________________________


MapReduce Programs:
  • Word Count - a complete MapReduce tutorial, Java program, and dataset (see the sketch after this list)
  • Sales - MapReduce DIY: instructions, Java programs, and dataset
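
For reference, the sketch below shows the classic Word Count pattern with the Hadoop MapReduce Java API: the mapper emits (word, 1) pairs and the reducer sums the counts per word. It follows the canonical Hadoop example; the complete program in the tutorial above may differ in detail.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Package it as a jar and run it with hadoop jar, passing an HDFS input directory and an output directory; the output directory must not exist beforehand.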
