Big Data Tutorials

Big data environments can be built with solutions from commercial data processing and storage providers, or we can create our own environment.

In my classes at the Instituto Tecnológico de Aeronáutica (ITA), I teach big data infrastructure and analysis using free and open-source tools from the Apache Software Foundation.

Figure 1: Data Pipeline Phases and Related Apache Tools.

The phases presented in Figure 1 are related to the phases of our data pipeline framework:

Figure 2: Data Pipeline Framework.

___________________________________________________________________________________

For each tool of the Hadoop ecosystem presented in Figure 1, an installation and configuration tutorial was created to simulate a big data environment on a desktop running the Windows operating system. This simulated environment can implement a complete data pipeline and is used in my classes. The tutorials are written in Portuguese, which is not a problem nowadays: just translate them to your preferred language.

Apache Tools Tutorials:

  • Hadoop - the distributed file system and data processing environment
  • HBase - the NoSQL column-family database for Hadoop
  • Hive - the data warehouse storage for Hadoop
  • Spark - the interactive data processing and analysis tool for Hadoop
  • Kafka - the distributed event streaming tool used for high-performance data streams
  • Flink - the tool for stateful computations over unbounded and bounded data streams
  • Airflow - the tool used to programmatically author, schedule, and monitor data workflows

Note: Airflow is needed to schedule and monitor the data processing jobs of the pipeline.
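
As a quick sanity check of the simulated environment, the sketch below uses Hadoop's Java FileSystem API to create a directory and list the HDFS root. It is a minimal illustration, assuming a single-node setup where fs.defaultFS is hdfs://localhost:9000 (the usual single-node default); adjust the address to match your core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSmokeTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumption: single-node HDFS on the default address; this must
            // match the fs.defaultFS value configured in core-site.xml.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            try (FileSystem fs = FileSystem.get(conf)) {
                // Create a scratch directory, then list everything under "/".
                fs.mkdirs(new Path("/tmp/smoke-test"));
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }

Compile it against the Hadoop client libraries installed by the tutorial and run it on the same machine; if the listing prints, HDFS is up and reachable.
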
___________________________________________________________________________________


MapReduce Programs:
  • Word Count - a complete MapReduce tutorial, Java program, and dataset (see the sketch after this list)
  • Sales - MapReduce DIY: instructions, Java programs, and dataset
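
For reference, the sketch below shows the classic Word Count pattern with the Hadoop MapReduce Java API: the mapper emits (word, 1) pairs and the reducer sums the counts per word. It follows the canonical Hadoop example; the complete program in the tutorial above may differ in detail.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Package it as a jar and run it with hadoop jar, passing an HDFS input directory and an output directory; the output directory must not exist beforehand.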
