Spark Installation Tutorial for Windows - WSL


Prof. Emilia Colonese


1.    Install Python

Type at the Linux/WSL prompt:

$ sudo apt-get update && sudo apt-get upgrade

$ sudo apt install python3 ipython3
$ sudo apt install python3-pip
$ sudo apt install python3-full

Python 3.8+ is required by PySpark 3.5 (earlier Python versions do not work).

Check the Python version:

$ python3 --version

Note: to check the Ubuntu version, type: lsb_release -a. If the command does not work, install the package that provides it: sudo apt-get install lsb-release.


$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble

2.    Install Jupyter Notebook


pip install --break-system-packages jupyter

2.1. Create an alias to start Jupyter without a browser on WSL

Open the configuration file:

 $ nano ~/.bashrc


Add the following at the end of the file:


###################################################################
# Jupyter-notebook
###################################################################
export PATH=$PATH:~/.local/bin
export JRE_HOME=$JAVA_HOME/jre    
alias jupyter-notebook="~/.local/bin/jupyter-notebook --no-browser"

Reload the configuration file:

$ cd

$ exec bash

2.2. Start the Jupyter server:

 $ jupyter-notebook 

The terminal window will remain blocked. To exit, press Ctrl+C.

To continue without blocking the terminal, append '&' to the end of the command above.

$ jupyter-notebook &

[1] 3588

[I 2024-03-15 11:13:05.733 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-03-15 11:13:05.738 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-03-15 11:13:05.743 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-03-15 11:13:05.746 ServerApp] notebook | extension was successfully linked.
[I 2024-03-15 11:13:05.857 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-03-15 11:13:05.865 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-03-15 11:13:05.866 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-03-15 11:13:05.866 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-03-15 11:13:05.867 LabApp] JupyterLab extension loaded from /home/emilia/.local/lib/python3.10/site-packages/jupyterlab
[I 2024-03-15 11:13:05.867 LabApp] JupyterLab application directory is /home/emilia/.local/share/jupyter/lab
[I 2024-03-15 11:13:05.867 LabApp] Extension Manager is 'pypi'.
[I 2024-03-15 11:13:05.910 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-03-15 11:13:05.912 ServerApp] notebook | extension was successfully loaded.
[I 2024-03-15 11:13:05.912 ServerApp] Serving notebooks from local directory: /home/emilia
[I 2024-03-15 11:13:05.912 ServerApp] Jupyter Server 2.13.0 is running at:
[I 2024-03-15 11:13:05.912 ServerApp] http://localhost:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
[I 2024-03-15 11:13:05.912 ServerApp]     http://127.0.0.1:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
[I 2024-03-15 11:13:05.912 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-03-15 11:13:05.914 ServerApp]
 
    To access the server, open this file in a browser:
        file:///home/emilia/.local/share/jupyter/runtime/jpserver-3588-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
        http://127.0.0.1:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182
 
$ [I 2024-03-15 11:13:07.413 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-server-nodejs, javascript-typescript-langserver, jedi-language-server, julia-language-server, pyright, python-language-server, python-lsp-server, r-languageserver, sql-language-server, texlab, typescript-language-server, unified-language-server, vscode-css-languageserver-bin, vscode-html-languageserver-bin, vscode-json-languageserver-bin, yaml-language-server

Press <Enter>

WSL makes the Jupyter server available at the localhost URL on port 8888. To access the service from the Windows browser, follow the access instructions listed in the command output.

In the example above, the command set up the following URL for access to the Jupyter notebook:

http://localhost:8888/tree?token=d39240812f616aa0f81115c925c02bea92421956b667a182

Check the directories used by Jupyter with the command:

$ jupyter --paths


3.     Install R

Java version 11 must already be correctly installed.

$ sudo apt-get update && sudo apt-get upgrade


# update indices
$ sudo apt update -qq

# install two helper packages we need
$ sudo apt install --no-install-recommends software-properties-common dirmngr

# add the signing key (by Michael Rutter) for these repos
# To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9
$ wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc

# add the repo from CRAN -- lsb_release adjusts to 'noble' or 'jammy' or ... as needed
$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"

# install R itself
$ sudo apt install --no-install-recommends r-base
$ sudo apt install --no-install-recommends r-base-dev

Configure the Renviron file:

$ R_LIBS_SITE=${R_LIBS_SITE-'/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library'}

 
$ sudo chmod 1777 -R /usr/local/lib/R/
$ sudo chmod 777 -R /usr/lib/R/library
$ sudo chmod 777 -R /usr/share/R

Start R:

$ R


Update the packages:


> update.packages()
> q()

4.    Install RStudio Server

Configure Java for RStudio Server:

$ sudo ln -s /usr/lib/jvm/java-11-openjdk-amd64 /usr/lib/jvm/default-java
$ sudo R CMD javareconf

Install the required packages:


$ sudo apt-get install unixodbc-dev
$ sudo apt-get install libcurl4-openssl-dev
$ sudo apt-get install libxml2-dev 
$ sudo apt install libharfbuzz-dev
$ sudo apt install libfribidi-dev
$ sudo apt install libfreetype6-dev
$ sudo apt install libpng-dev
$ sudo apt install libtiff5-dev
$ sudo apt install libjpeg-dev
$ sudo apt install libwebp-dev
$ sudo apt install libfontconfig1-dev 
$ sudo apt install fonts-noto-cjk 
$ sudo apt install fonts-liberation

$ pip install --break-system-packages gap-stat --only-binary :all:

Download and install:

$ sudo apt-get install gdebi-core 
$ wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2025.09.2-418-amd64.deb

$ sudo gdebi rstudio-server-2025.09.2-418-amd64.deb

 

5.    Install and Configure ODBC for MySQL

Check the current version of your Ubuntu Linux and always use your version for the downloads below. In this tutorial the version is 24.04.

Go to the URL: https://dev.mysql.com/downloads/mysql/5.5.html?os=31&version=5.1

In the operating system box, choose Ubuntu Linux. In the version box, choose 24.04 (choose your version).

Choose for download: mysql-community-client-plugins_9.5.0-1ubuntu24.04_amd64.deb (according to your version).

Run:

$ sudo gdebi mysql-community-client-plugins_9.5.0-1ubuntu24.04_amd64.deb

Go to the URL: https://dev.mysql.com/downloads/connector/odbc/

In the operating system box, choose Ubuntu Linux. In the version box, choose 24.04 (choose your version).

Choose for download: mysql-connector-odbc_9.5.0-1ubuntu24.04_amd64.deb (according to your version).

Run:

$ sudo gdebi mysql-connector-odbc_9.5.0-1ubuntu24.04_amd64.deb
$ sudo cp /usr/lib/x86_64-linux-gnu/odbc/*.so /usr/local/lib/

Edit the /etc/odbc.ini file (sudo nano /etc/odbc.ini) to define the MySQL data source, adding the following lines:

[MySqlconn]
Description = MySQL connection to database
Driver      = MYSQL
Database    = mysql
Server      = localhost
User        = emilia
Password    = emilia
Port        = 3306
Socket      = /var/run/mysqld/mysqld.sock

Replace emilia with <your_user> and <your_password>.

Edit the /etc/odbcinst.ini file (sudo nano /etc/odbcinst.ini) to register the MySQL driver:

[MYSQL]
Description=ODBC for MySQL
Driver=/usr/local/lib/libmyodbc9a.so
Setup=/usr/local/lib/libmyodbc9w.so
UsageCount=1

Restart the terminal!
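
Optionally, the new DSN can be verified before moving on. The minimal sketch below assumes the pyodbc package is available (it is not installed by this tutorial; e.g. pip install --break-system-packages pyodbc) and uses the illustrative credentials from odbc.ini:

import pyodbc

# Open a connection through the DSN defined in /etc/odbc.ini
conn = pyodbc.connect("DSN=MySqlconn;UID=emilia;PWD=emilia")
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone()[0])   # prints the MySQL server version
conn.close()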


6.    Log in to RStudio Server through the browser

http://localhost:8787/

Enter your Ubuntu username and password.

In the R console tab, install the required packages:

> install.packages("rJava")
> install.packages("DBI")
> install.packages("RJDBC")
> install.packages("RODBC")
> install.packages("sparklyr")
> install.packages("dplyr")
> install.packages("data.table")
> install.packages("tidyr")

> install.packages("tidyverse")

 

Load the installed packages:

library(rJava)
library(DBI)
library(RJDBC)
library(RODBC)
library(sparklyr)
library(dplyr)
library(dbplyr)
library(data.table)
library(tidyr)
library(tidyverse)

 

7.    Access HIVE from RStudio Server

> options(java.parameters = '-Xmx8g')
> hadoop_jar_dirs <- c('/opt/hadoop/share/hadoop/common/lib', '/opt/hadoop/share/hadoop/common', '/opt/hadoop/hive/lib', '/opt/hadoop/hive/jdbc', '/opt/hadoop/hbase/lib/')
> clpath <- c()
> for (d in hadoop_jar_dirs) {
      clpath <- c(clpath, list.files(d, pattern = 'jar', full.names = TRUE))
  }
> .jinit(classpath = clpath)
> .jaddClassPath(clpath)
> hive_jdbc_jar <- '/opt/hadoop/hive/lib/hive-jdbc-4.0.1.jar'
> hive_driver <- 'org.apache.hive.jdbc.HiveDriver'
> hive_url <- 'jdbc:hive2://localhost:10000'
> drv <- JDBC(hive_driver, hive_jdbc_jar)
> conn <- dbConnect(drv, hive_url)
> showdb <- dbGetQuery(conn, "show databases")
> showdb
> dbGetQuery(conn, "show tables")

All queries can be executed with the dbGetQuery command against the tables existing in HIVE.

> dbGetQuery(conn, "<query>")

Creating tables and inserting data:

> emp <- read.csv("xx.csv", header = TRUE, sep = ";")
> dbCreateTable(conn, "employees", emp)   # creates the table structure in HIVE

 

8.    Install Scala

$ sudo apt-get install scala

Check the Scala installation:

$ scala -version

 

9.    Install py4j for Python-Java interaction

$ pip install --break-system-packages py4j
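
A quick optional sanity check that the module is importable by the Python 3 interpreter that PySpark will use (the one-liner only prints where py4j was installed):

$ python3 -c "import py4j; print(py4j.__file__)"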


10.         Install Apache Spark

Go to the Spark download page and choose the latest version (this tutorial installs version 3.5.7).

Since Hadoop is already installed, choose the build without Hadoop (pre-built with user-provided Apache Hadoop), copy the download link, and use it with wget:

$ cd

$ wget https://dlcdn.apache.org/spark/spark-3.5.7/spark-3.5.7-bin-without-hadoop.tgz

Unpack the downloaded file:

$ cd /opt/hadoop 
$ sudo tar -xvzf  /home/emilia/spark-3.5.7-bin-without-hadoop.tgz
$ sudo mv spark-3.5.7-bin-without-hadoop spark
$ sudo chmod 777 -R spark


The environment variables for PySpark with Python 3 and Jupyter Notebook must now be configured.


$ cd
$ nano .bashrc

Add the following lines to the end of the .bashrc file:

############################################################
# SPARK & PYSPARK IN JUPYTER NOTEBOOK
############################################################

export SPARK_HOME=/opt/hadoop/spark
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$SPARK_HOME/jars/*
 
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
 
export PATH=$PATH:$SPARK_HOME:$JAVA_HOME/bin:$JRE_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
 
export CLASSPATH=$CLASSPATH:$SPARK_HOME/jars
export HADOOP_CLASSPATH=$CLASSPATH
export JAVA_LIBRARY_PATH=$CLASSPATH

Reload the .bashrc:

$ exec bash

Check the Spark installation:

$ spark-shell --version 

It is not necessary to install PySpark separately, since it is bundled with Spark.
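
Optionally, confirm that the Python bindings resolve through the PYTHONPATH exported above; the version printed should match the downloaded Spark release:

$ python3 -c "import pyspark; print(pyspark.__version__)"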

 

11.         Configure Spark

Create the spark-env.sh file to configure Spark.

$ cd $SPARK_HOME

$ cd conf

Create the workers file, if it does not yet exist:

 $ mv workers.template workers

Create the spark-defaults.conf file, if it does not yet exist:

 $ mv spark-defaults.conf.template spark-defaults.conf

Insert the following settings at the end of the spark-defaults.conf file:


spark.sql.warehouse.dir            hdfs://localhost:8020/user/hive/warehouse
spark.yarn.preserve.staging.files  true
 
spark.sql.catalogImplementation hive
# replace with your Hive version
spark.sql.hive.metastore.version 4.0.0
# path to the Hive client JARs
spark.sql.hive.metastore.jars $HIVE_HOME/lib/*
# your metastore URI
spark.hadoop.hive.metastore.uris thrift://localhost:9083

Open (or create) the spark-env.sh file:

$ nano spark-env.sh

Insert the following lines at the end of this file:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
# export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$HIVE_HOME/lib/*:$SPARK_HOME/jars/*
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
 
export SPARK_MASTER_HOST=localhost
export HIVE_HOME=/opt/hadoop/hive
 
export SPARK_HOME=/opt/hadoop/spark
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export SPARK_LOG_DIR=${SPARK_HOME}/logs
 
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export CLASSPATH=$CLASSPATH:/opt/hadoop/spark/jars
export HADOOP_CLASSPATH=$CLASSPATH


Create the required symbolic links in Spark's conf directory:

$ sudo ln -s $HADOOP_HOME/etc/hadoop/core-site.xml /opt/hadoop/spark/conf/core-site.xml
$ sudo ln -s $HIVE_HOME/conf/hive-site.xml /opt/hadoop/spark/conf/hive-site.xml
$ sudo ln -s $HBASE_HOME/conf/hbase-site.xml /opt/hadoop/spark/conf/hbase-site.xml


Download the spark-hive jar required for SPARK-to-HIVE connections:

$ cd $SPARK_HOME/jars
$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.12/3.5.7/spark-hive_2.12-3.5.7.jar

Copy the .jar files required for connections to HBASE and HIVE:

$ cd $HBASE_HOME/lib/

- from HBASE to HIVE (check your installed HBase version!)

cp $HBASE_HOME/lib/hbase-client-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-common-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-hadoop-compat-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-hadoop2-compat-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-http-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-mapreduce-2.5.12-tests.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-mapreduce-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-metrics-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-metrics-api-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-procedure-2.5.12-tests.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-procedure-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-protocol-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-protocol-shaded-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-replication-2.5.12.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-gson-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-jackson-jaxrs-json-provider-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-jersey-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-jetty-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-miscellaneous-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-netty-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/hbase-shaded-protobuf-4.1.11.jar $HIVE_HOME/lib/
cp $HBASE_HOME/lib/zookeeper-3.8.4.jar $HIVE_HOME/lib/

- from HBASE to SPARK (check your installed HBase version!)

cp $HBASE_HOME/lib/hbase-common-2.5.12.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-gson-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-jackson-jaxrs-json-provider-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-jersey-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-jetty-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-miscellaneous-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-netty-4.1.11.jar $SPARK_HOME/jars/
cp $HBASE_HOME/lib/hbase-shaded-protobuf-4.1.11.jar $SPARK_HOME/jars/


$ cd $HIVE_HOME/lib/

- from HIVE to SPARK (check your installed Hive version!)

cp $HIVE_HOME/lib/hive-cli-4.0.1.jar $SPARK_HOME/jars/
cp $HIVE_HOME/lib/hive-common-4.0.1.jar $SPARK_HOME/jars/
cp $HIVE_HOME/lib/hive-storage-api-4.0.1.jar $SPARK_HOME/jars/
cp $HIVE_HOME/lib/mysql-connector-j-9.5.0.jar $SPARK_HOME/jars/


Rename incompatible .jar files (check the installed version!):

$ cd $SPARK_HOME/jars
mv orc-core-1.9.7-shaded-protobuf.jar orc-core-1.9.7-shaded-protobuf.jarold
mv orc-mapreduce-1.9.7-shaded-protobuf.jar orc-mapreduce-1.9.7-shaded-protobuf.jarold
mv orc-shims-1.9.7.jar orc-shims-1.9.7.jarold


Create the log4j2.properties file:

cd $SPARK_HOME/conf

$ cp log4j2.properties.template log4j2.properties

Set the logging level in log4j2.properties to warn or error:

$ nano log4j2.properties

Replace the line rootLogger.level = info
with:
rootLogger.level = warn

 

12.         Start Spark

Start Hadoop and Yarn if they are not already running:

$ start-dfs.sh
$ start-yarn.sh

Start Spark:

$ start-master.sh
$ start-workers.sh

Check the active processes:

$ jps
13809 Master
13060 SecondaryNameNode
13446 NodeManager
12663 NameNode
13303 ResourceManager
12843 DataNode
13948 Worker
14031 Jps

When Spark is running, its web UI can be viewed at the URL: http://localhost:8080


Start Spark through the Python API in standalone mode:

$ pyspark --master="spark://localhost:7077"

Then open a new notebook and run a few commands, for example the sketch below.
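
A minimal sketch to confirm that the session is attached to the standalone master and, given the Hive settings from section 11, can reach the metastore (the sample DataFrame is purely illustrative):

from pyspark.sql import SparkSession

# When launched through the pyspark command a SparkSession is normally
# already available as `spark`; building it explicitly also works.
spark = SparkSession.builder.appName("wsl-test").getOrCreate()

print(spark.sparkContext.master)   # expected: spark://localhost:7077

df = spark.createDataFrame([(1, "Ana"), (2, "Bruno")], ["id", "name"])
df.show()

# With spark.sql.catalogImplementation=hive in spark-defaults.conf,
# the Hive metastore databases should be listed:
spark.sql("SHOW DATABASES").show()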

Now check the URL http://localhost:8080 again:


The behavior, configuration, and status of a Spark application can be viewed on the web at the URL: http://localhost:4040