Spark Installation Tutorial on Windows - WSL
Prof. Emilia Colonese
1. Install Python
Type at the Linux/WSL prompt:
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt install python3-pip
PySpark requires Python 3 (Spark 3.5 requires Python 3.8 or newer; earlier Python versions do not work).
$ python3 --version
NOTE: To check the Ubuntu version, type: lsb_release -a. If the command is not available, install the package that provides it: sudo apt-get install lsb-release.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
2. Install Jupyter Notebook
2.1. Create an alias to start Jupyter without a browser in WSL
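If Jupyter is not installed yet, a minimal sketch, assuming a user-level pip install and that the exports below go into ~/.bashrc (on Ubuntu 24.04, pip may require --break-system-packages or a virtual environment):
$ pip3 install --user jupyter
$ nano ~/.bashrc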
Add the following at the end of the file:
###################################################################
export PATH=$PATH:~/.local/bin
export JRE_HOME=$JAVA_HOME/jre
alias jupyter-notebook="~/.local/bin/jupyter-notebook --no-browser"
Reload the configuration file:
$ cd
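Assuming the lines above were added to ~/.bashrc, reloading it would look like:
$ source ~/.bashrc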
2.2. Start the Jupyter server:
The terminal window will be blocked; to exit, press Ctrl+C.
To keep working without blocking the terminal, append '&' to the end of the command, as below:
$ jupyter-notebook &
[1] 3588
[I 2024-03-15 11:13:05.738 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-03-15 11:13:05.743 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-03-15 11:13:05.746 ServerApp] notebook | extension was successfully linked.
[I 2024-03-15 11:13:05.857 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-03-15 11:13:05.865 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-03-15 11:13:05.866 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-03-15 11:13:05.866 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-03-15 11:13:05.867 LabApp] JupyterLab extension loaded from /home/emilia/.local/lib/python3.10/site-packages/jupyterlab
[I 2024-03-15 11:13:05.867 LabApp] JupyterLab application directory is /home/emilia/.local/share/jupyter/lab
[I 2024-03-15 11:13:05.867 LabApp] Extension Manager is 'pypi'.
[I 2024-03-15 11:13:05.910 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-03-15 11:13:05.912 ServerApp] notebook | extension was successfully loaded.
[I 2024-03-15 11:13:05.912 ServerApp] Serving notebooks from local directory: /home/emilia
[I 2024-03-15 11:13:05.912 ServerApp] Jupyter Server 2.13.0 is running at:
[I 2024-03-15 11:13:05.912 ServerApp] http://localhost:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
[I 2024-03-15 11:13:05.912 ServerApp] http://127.0.0.1:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
[I 2024-03-15 11:13:05.912 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-03-15 11:13:05.914 ServerApp]
file:///home/emilia/.local/share/jupyter/runtime/jpserver-3588-open.html
Or copy and paste one of these URLs:
http://localhost:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
http://127.0.0.1:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
Press <Enter>
WSL makes the Jupyter server available at the localhost URL on port 8888. To access the service from the Windows browser, follow the access instructions listed in the command output.
In the example above, the command set up the following URL for accessing the Jupyter Notebook:
http://localhost:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
Check the directories used by Jupyter with the command:
$ jupyter --path
3. Install R
Java version 11 must be correctly installed.
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt update -qq
$ sudo apt install --no-install-recommends software-properties-common dirmngr
# To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9
$ wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
$ sudo apt install --no-install-recommends r-base
$ sudo apt install --no-install-recommends r-base-dev
Configure the Renviron file:
$ R_LIBS_SITE=${R_LIBS_SITE-'/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library'}
$ sudo chmod 777 -R /usr/lib/R/library
Start R:
$ R
Update the packages:
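A minimal sketch of the update step, assuming the standard update.packages() call inside the R session:
> update.packages(ask = FALSE)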
> q()
4. Install RStudio Server
Configure Java for RStudio Server:
$ sudo ln -s /usr/lib/jvm/java-11-openjdk-amd64 /usr/lib/jvm/default-java
$ sudo R CMD javareconf
Install the required system packages:
$ sudo apt-get install libcurl4-openssl-dev
$ sudo apt-get install libxml2-dev
$ sudo apt install libfreetype6-dev
$ sudo apt install libpng-dev
$ sudo apt install libtiff5-dev
$ sudo apt install libjpeg-dev
$ sudo apt install libwebp-dev
$ sudo apt install libfontconfig1-dev
Download and install:
$ sudo apt-get install gdebi-core
$ wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2025.09.2-418-amd64.deb
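To install the downloaded package, a sketch assuming gdebi (installed above) is used:
$ sudo gdebi rstudio-server-2025.09.2-418-amd64.deb
$ sudo rstudio-server status   # optional: check that the service is running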
5. Install and Configure ODBC for MySQL
Check the current version of your Ubuntu Linux and always use your own version for the downloads below. In this tutorial the version is 24.04.
Go to the URL: https://dev.mysql.com/downloads/mysql/5.5.html?os=31&version=5.1
In the operating system box, choose Ubuntu Linux. In the version box, choose 24.04 (choose your own version).
Choose for download: mysql-community-client-plugins_9.5.0-1ubuntu24.04_amd64.deb (according to your version).
Run:
$ sudo gdebi mysql-community-client-plugins_9.5.0-1ubuntu24.04_amd64.deb
Go to the URL: https://dev.mysql.com/downloads/connector/odbc/
Choose for download: mysql-connector-odbc_9.5.0-1ubuntu24.04_amd64.deb (according to your version).
Run:
$ sudo gdebi mysql-connector-odbc_9.5.0-1ubuntu24.04_amd64.deb
$ sudo cp /usr/lib/x86_64-linux-gnu/odbc/*.so /usr/local/lib/
Modify the file /etc/odbc.ini (sudo nano /etc/odbc.ini) to register the MySQL connection, adding the following lines:
[MySqlconn]
Description = MySQL connection to database
Driver = MYSQL
Database = mysql
Server = localhost
User = emilia
Password = emilia
Port = 3306
Socket = /var/run/mysqld/mysqld.sock
Modify the file /etc/odbcinst.ini (sudo nano /etc/odbcinst.ini) to register the driver for MySQL:
[MYSQL]
Description=ODBC for MySQL
Driver=/usr/local/lib/libmyodbc9a.so
Setup=/usr/local/lib/libmyodbc9w.so
UsageCount=1
Restart the terminal!!
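Optionally, the DSN can be tested from the shell. A sketch assuming the unixODBC isql tool is available, using the user and password configured in odbc.ini:
$ isql -v MySqlconn emilia emilia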
6. Log in to RStudio Server from the browser
http://localhost:8787/
Enter your Ubuntu username and password.
In the R console tab, install the required packages:
> install.packages("rJava")
> install.packages("DBI")
> install.packages("RJDBC")
> install.packages("RODBC")
> install.packages("sparklyr")
> install.packages("dplyr")
> install.packages("data.table")
> install.packages("tidyr")
Load the installed packages:
library(DBI)
library(RJDBC)
library(sparklyr)
library(dplyr)
library(dbplyr)
library(data.table)
library(tidyr)
7. Access HIVE from RStudio Server
> options(java.parameters = '-Xmx8g')
> hadoop_jar_dirs <- c('/opt/hadoop/share/hadoop/common/lib', '/opt/hadoop/share/hadoop/common', '/opt/hadoop/hive/lib', '/opt/hadoop/hive/jdbc', '/opt/hadoop/hbase/lib/')
> clpath <- c()
> for (d in hadoop_jar_dirs) {
clpath <- c(clpath,
list.files(d, pattern = 'jar', full.names = TRUE))
}
> .jinit(classpath = clpath)
> .jaddClassPath(clpath)
> hive_jdbc_jar <- '/opt/hadoop/hive/lib/hive-jdbc-4.0.1.jar'
> hive_driver <- 'org.apache.hive.jdbc.HiveDriver'
> hive_url <- 'jdbc:hive2://localhost:10000'
> drv <- JDBC(hive_driver, hive_jdbc_jar)
> conn <- dbConnect(drv, hive_url)
> showdb <- dbGetQuery(conn, "show databases")
> showdb
> dbGetQuery(conn, "show tables")
All queries can be run with the dbGetQuery command on the tables that exist in HIVE.
> dbGetQuery(conn, "<query>")
Creating tables and inserting data:
> emp <- read.csv("xx.csv", header = TRUE, sep = ";")
> dbCreateTable(conn, "employees", emp)   # creates the table structure in HIVE
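dbCreateTable only creates an empty table. As a hedged sketch for adding rows, assuming the Hive server accepts plain INSERT statements over JDBC (the column values below are purely illustrative), RJDBC's dbSendUpdate can be used:
> dbSendUpdate(conn, "INSERT INTO employees VALUES (1, 'Ana', 'IT')")
> dbGetQuery(conn, "SELECT * FROM employees")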
8. Install Scala
$ sudo apt-get install scala
Check the Scala installation:
$ scala -version
9. Install py4j for Python-Java interaction
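A minimal sketch, assuming a user-level pip install (py4j also ships inside the Spark distribution installed in the next step):
$ pip3 install --user py4j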
10. Install Apache Spark
Go to the Spark download page and choose the latest version (this tutorial installs version 3.5.7).
Since Hadoop is already installed, choose the build without Hadoop (pre-built with user-provided Apache Hadoop), copy the download link and use it with wget:
$ cd
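A sketch of the download, assuming the package is fetched from the Apache archive; the actual link copied from the download page (or a closer mirror) may differ:
$ wget https://archive.apache.org/dist/spark/spark-3.5.7/spark-3.5.7-bin-without-hadoop.tgz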
Unpack the downloaded file:
$ cd /opt/hadoop
$ sudo tar -xvzf /home/emilia/spark-3.5.7-bin-without-hadoop.tgz
$ sudo mv spark-3.5.7-bin-without-hadoop spark
$ sudo chmod 777 -R spark
The environment variables for PySpark with Python 3 and Jupyter Notebook must be configured.
$ nano .bashrc
# SPARK & PYSPARK IN JUPYTER NOTEBOOK
############################################################
export SPARK_HOME=/opt/hadoop/spark
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$SPARK_HOME/jars/*
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export HADOOP_CLASSPATH=$CLASSPATH
export JAVA_LIBRARY_PATH=$CLASSPATH
Reload .bashrc:
$ exec bash
Check the Spark installation:
$ spark-shell --version
11. Configure Spark
Create the spark-env.sh file to configure Spark.
$ cd $SPARK_HOME/conf
Create the workers file if it does not exist.
Create the spark-defaults.conf file if it does not exist (see the sketch below).
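A minimal sketch for creating these files, assuming they are copied from the templates shipped in $SPARK_HOME/conf:
$ cp spark-env.sh.template spark-env.sh
$ cp workers.template workers
$ cp spark-defaults.conf.template spark-defaults.conf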
Insert the following lines at the end of the spark-defaults.conf file:
spark.sql.catalogImplementation hive
spark.sql.hive.metastore.version 4.0.0 # Replace with your Hive version
spark.sql.hive.metastore.jars $HIVE_HOME/lib/* # Path to the Hive client JARs
Open the spark-env.sh file:
$ nano spark-env.sh
Insert the following lines at the end of this file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
# export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$HIVE_HOME/lib/*:$SPARK_HOME/jars/*
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_MASTER_HOST=localhost
export HIVE_HOME=/opt/hadoop/hive
export SPARK_HOME=/opt/hadoop/spark
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export SPARK_LOG_DIR=${SPARK_HOME}/logs
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export CLASSPATH=$CLASSPATH:/opt/hadoop/spark/jars
export HADOOP_CLASSPATH=$CLASSPATH
Create the necessary file links in the Spark conf directory:
$ sudo ln -s $HADOOP_HOME/etc/hadoop/core-site.xml /opt/hadoop/spark/conf/core-site.xml
$ sudo ln -s $HIVE_HOME/conf/hive-site.xml /opt/hadoop/spark/conf/hive-site.xml
$ sudo ln -s $HBASE_HOME/conf/hbase-site.xml /opt/hadoop/spark/conf/hbase-site.xml
Download the spark-hive jar required for SPARK connections to HIVE:
$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.12/3.5.7/spark-hive_2.12-3.5.7.jar
Copy the .jar files required for connections with HBASE and HIVE:
$ cd $HBASE_HOME/lib/
- from HBASE to HIVE (check your installed HBase version!)
cp $HBASE_HOME/lib/hbase-client-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-common-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-hadoop-compat-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-hadoop2-compat-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-http-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-mapreduce-2.5.12-tests.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-mapreduce-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-metrics-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-metrics-api-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-procedure-2.5.12-tests.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-procedure-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-protocol-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-protocol-shaded-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-replication-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-gson-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-jackson-jaxrs-json-provider-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-jersey-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-jetty-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-miscellaneous-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-netty-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-protobuf-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/zookeeper-3.8.4.jar $HIVE_HOME/lib/
- from HBASE to SPARK (check your installed HBase version!)
cp $HBASE_HOME/lib/hbase-common-2.5.12.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-gson-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-jackson-jaxrs-json-provider-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-jersey-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-jetty-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-miscellaneous-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-netty-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-protobuf-4.1.11.jar $SPARK_HOME/jars/
$ cd $HIVE_HOME/lib/
- from HIVE to SPARK
cp $HIVE_HOME/lib/hive-cli-4.0.1.jar $SPARK_HOME/jars/
cp $HIVE_HOME/lib/hive-common-4.0.1.jar $SPARK_HOME/jars/
cp $HIVE_HOME/lib/hive-storage-api-4.0.1.jar $SPARK_HOME/jars/
cp $HIVE_HOME/lib/mysql-connector-j-9.5.0.jar $SPARK_HOME/jars/
Rename incompatible .jar files (check the installed version!):
$ cd $SPARK_HOME/jars
mv orc-core-1.9.7-shaded-protobuf.jar orc-core-1.9.7-shaded-protobuf.jarold
mv orc-mapreduce-1.9.7-shaded-protobuf.jar orc-mapreduce-1.9.7-shaded-protobuf.jarold
mv orc-shims-1.9.7.jar orc-shims-1.9.7.jarold
Create log4j2.properties:
$ cd $SPARK_HOME/conf
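If the file does not exist yet, a sketch assuming it is created from the template shipped with Spark:
$ cp log4j2.properties.template log4j2.properties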
Set the logging level in log4j2.properties to warn or error:
$ nano log4j2.properties
Replace wherever it reads info:
rootLogger.level = info
with warn:
rootLogger.level = warn
12. Start Spark
Start Hadoop and YARN if they are not already running:
$ start-dfs.sh
$ start-yarn.sh
Start Spark:
$ start-master.sh
$ start-workers.sh
Check the active processes:
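Assuming the JDK's jps tool is used to list the Java processes:
$ jps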
13809 Master
13060 SecondaryNameNode
13446 NodeManager
12663 NameNode
13303 ResourceManager
12843 DataNode
13948 Worker
14031 Jps
When Spark is running, it can be viewed on the web at the URL: http://localhost:8080
Launch Spark through the Python API in standalone mode:
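A sketch of the call, assuming the standalone master started above is listening on the default port 7077 (with the .bashrc variables set, this opens a Jupyter Notebook as the PySpark driver):
$ pyspark --master spark://localhost:7077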
Now check the URL http://localhost:8080 again:
The operation, configuration, and status of a Spark application can be viewed on the web at the URL: http://localhost:4040