Hadoop Installation On Windows



This introduction to Hadoop explains how to install and configure it on Windows. The prerequisites, the associated software and other requirements are all listed here, and there’s advice on the dos and don’ts for a successful installation.


Hadoop is an Apache open source software library written completely in Java, designed to deliver a distributed file system (HDFS) and a method for distributed computation called MapReduce. It can scale up from single servers to thousands of machines.
It provides a framework that allows distributed processing of large sets of data across clusters of computers using simple programming models.

Hadoop implements MapReduce, which runs in two phases: the map phase and the reduce phase. The input to these computations is a data set of key/value pairs. In short, the application is divided into many small fragments, each of which may be assigned to a map task. These map tasks are distributed and executed on any node in the cluster. If a node fails during the computation, the tasks assigned to it are redistributed among the remaining nodes.
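
The two phases can be illustrated with a small, single-process sketch in Python. This is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and word counting is simply an assumed example job.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map task: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values seen for one key."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is mapped independently of the others; on a cluster
    # these calls would run as separate map tasks on different nodes.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data cluster"]))
```

Because no map call depends on another, a failed fragment could be re-run on its own, which is the property that lets Hadoop reassign tasks when a node fails.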

The Apache Hadoop project includes the following modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Hadoop deployment methods
Here is a list of the methods used to deploy Hadoop.
Standalone: In this mode, no daemons run; everything runs in a single JVM. This mode is suitable for running MapReduce programs during development, as it is easy to test and debug.
Pseudo-distributed: The Hadoop daemons run on a single local machine, simulating a cluster on a small scale.
Fully distributed: Here, Hadoop runs on a cluster of machines providing a production environment.

This article focuses on introducing Hadoop, and deploying single-node pseudo-distributed Hadoop on a Windows platform.

Prerequisite software or tools for running Hadoop on Windows
You will need the following software to run Hadoop on Windows.

Supported Windows OSs: Hadoop supports Windows Server 2008 and Windows Server 2008 R2, Windows Vista and Windows 7. For installation purposes we are going to use Windows 7.

JDK: As Hadoop is written in Java, we will need to install Oracle JDK 1.6 or higher. Download Java from the link given in References [1] at the end of this article, and install it to the default location.

Windows SDK: Download and install Windows SDK 7 from the link given in References [2]. Windows SDK provides the tools, compilers, headers and libraries that are necessary to run Hadoop.

Cygwin: Hadoop requires UNIX command line tools like Cygwin or GnuWin32. Download and install Cygwin from the link given in References [3] to its default location, C:\cygwin64, and make sure to select the openssh package and its associated prerequisites from the Select Packages tab.

Maven and Protocol Buffers: Install Maven 3.0 or later and Protocol Buffers 2.5.0 into the C:\maven and C:\protobuf directories, respectively.

Setting environment variables
Navigate to System Properties -> Advanced -> Environment Variables. Add the environment variables JAVA_HOME, M2_HOME (for Maven) and Platform (x64 or Win32, depending on the system architecture).
Note that the variable name, Platform, is case sensitive, and its value must be x64 for 64-bit systems or Win32 for 32-bit systems. Edit the Path variable under System variables to add the following: C:\cygwin64\bin;C:\cygwin64\usr\sbin;C:\maven\bin;C:\protobuf
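
A quick way to confirm these settings is a short script such as the Python sketch below. It is only an illustration: the function name check_environment is hypothetical, and the directory list assumes the install locations used in this article (C:\cygwin64, C:\maven and C:\protobuf).

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "M2_HOME", "Platform")
# Assumed install locations, taken from the steps in this article.
REQUIRED_PATH_DIRS = (
    r"C:\cygwin64\bin",
    r"C:\cygwin64\usr\sbin",
    r"C:\maven\bin",
    r"C:\protobuf",
)


def check_environment(env):
    """Return a list of human-readable problems found in the mapping `env`."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    platform = env.get("Platform")
    if platform and platform not in ("x64", "Win32"):
        problems.append(f"Platform should be x64 or Win32, got {platform!r}")
    # Windows separates PATH entries with ';' and ignores case in paths.
    entries = {
        entry.strip().rstrip("\\").lower()
        for entry in env.get("PATH", "").split(";")
    }
    for directory in REQUIRED_PATH_DIRS:
        if directory.lower() not in entries:
            problems.append(f"{directory} is missing from PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ) or ["environment looks complete"]:
        print(problem)
```

The function takes a plain mapping rather than reading os.environ directly, so the same check can be tried against a hand-written dictionary before touching the real system settings.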

The dos and don’ts
Given below are some of the issues usually faced while installing Hadoop on Windows.
If the JAVA_HOME environment variable is set improperly, Hadoop will not run. Set the environment variables properly for the JDK, Maven, Cygwin and Protocol Buffers. If you still get a 'JAVA_HOME not set properly' error, edit the C:\hadoop\bin\hadoop-env.cmd file, locate the line beginning with set JAVA_HOME= and provide the JDK path (with no spaces).
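
The 'no spaces' rule is easy to overlook, since the default JDK location is usually under a folder whose name contains a space. The following Python sketch checks a candidate value before it goes into hadoop-env.cmd. The function name java_home_problem is hypothetical, and the checks cover only the points mentioned above, not everything Hadoop itself validates.

```python
import os
from pathlib import PureWindowsPath


def java_home_problem(java_home):
    """Return a description of what is wrong with `java_home`, or None."""
    if not java_home:
        return "JAVA_HOME is empty or not set"
    if java_home != java_home.strip():
        return "JAVA_HOME has leading or trailing whitespace"
    if " " in java_home:
        return "JAVA_HOME contains a space"
    # PureWindowsPath parses Windows paths on any operating system.
    if not PureWindowsPath(java_home).is_absolute():
        return "JAVA_HOME is not an absolute Windows path"
    return None


if __name__ == "__main__":
    problem = java_home_problem(os.environ.get("JAVA_HOME", ""))
    print(problem or "JAVA_HOME looks usable")
```

If the JDK is installed under a path with spaces, one option is to reinstall it to a path without spaces (for example, a hypothetical C:\Java\jdk1.7.0) and point JAVA_HOME there.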


Do not use the prebuilt Hadoop binary distribution, as it lacks winutils.exe and hadoop.dll. Native IO is mandatory on Windows, and without these files the Hadoop installation will not work. Instead, build from the source code using Maven, which will download all the required components.

Download Hadoop from the link in References [4]. Building from the source code requires an active Internet connection the first time.

Building and configuring Hadoop on Windows
Select Start -> All Programs -> Microsoft Windows SDK v7.1 and open the Windows SDK 7 command prompt as the administrator. Change the directory to C:\hadoop (if it doesn’t exist, create it). Then run the Maven build from this prompt, as described in BUILDING.txt (References [7]).

Note: The above step requires an active Internet connection because Maven will download all the dependencies. A successful build will generate a native binary package, hadoop-2.2.0.tar.gz, in the C:\hadoop\hadoop-dist\target directory. Extract hadoop-2.2.0.tar.gz under C:\hdp.

Starting a single node installation: If the build is successful, check the Hadoop version by running the following command:

Startup scripts: The C:\hdp\bin directory contains the scripts used to launch the Hadoop DFS and MapReduce daemons.
start-dfs.cmd: Starts the Hadoop DFS daemons, the namenode and datanode. Use this before start-mapred.cmd.
start-mapred.cmd: Starts the Hadoop MapReduce daemons, the jobtracker and tasktrackers.
start-all.cmd: Starts all Hadoop daemons, the namenode, datanode, the jobtracker and tasktrackers.

The following section contains details on how to configure Hadoop on Windows. Assuming the install directory is C:\hdp, run the command C:\hadoop> cd C:\hdp\etc\hadoop

Edit the file hadoop-env.cmd (which contains the environment variable settings used by Hadoop) in Notepad++ and add the following lines at the end of the file:

Before we get started with setting Hadoop environment variables and running Hadoop daemons, we need to configure the following files: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml and the slaves file, all located in C:\hdp\etc\hadoop. The minimum configuration settings are given below.
Edit or create the file C:\hdp\etc\hadoop\core-site.xml (all Hadoop services and clients use this file to locate the namenode; it contains the name of the default file system) and make sure it has the following configuration key:
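
All of the *-site.xml files share the same layout: a configuration element containing property elements, each with a name and a value. The Python sketch below renders that layout from a dictionary. The helper name build_site_xml is hypothetical, and the value hdfs://localhost:9000 is only an assumed example for a namenode on the local machine; fs.defaultFS is the Hadoop 2.x property that names the default file system.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a {name: value} mapping as a Hadoop *-site.xml document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # encoding="unicode" makes tostring() return str rather than bytes.
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed value: a namenode on this machine, listening on port 9000.
    print(build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"}))
```

The same helper could produce the other site files by passing different property names, which keeps the XML well-formed without hand editing.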


Edit or create the file C:\hdp\etc\hadoop\hdfs-site.xml (HDFS services use this file, and it contains the HTTP addresses for the namenode and datanode) and add the following configuration key:

Note that if the namenode and datanode directories in the path C:\hdp\data\dfs are not present, you must create them.
Edit or create the file C:\hdp\etc\hadoop\slaves. (This file lists the hosts, one per line, where the Hadoop slave daemons, the datanode and tasktrackers, will run. By default, it contains the single entry localhost.) Make sure it has the following entry:

Edit or create C:\hdp\etc\hadoop\mapred-site.xml (which names the framework used to execute MapReduce jobs) and add the following entries:

Finally, edit or create yarn-site.xml and add the following entries:


Initialising environment variables
Once all four files are configured, run hadoop-env.cmd and format the namenode file system. Run the following commands at the command prompt:

The output should be something like the message above, along with: \hadoop\data\dfs\namenode has been successfully formatted

Starting HDFS daemons
Start the namenode and datanode on the localhost by running the following command:

Two command prompts named namenode and datanode will open (Figure 2).
Start MapReduce or YARN with the following command:


Two command prompts will open, named yarn nodemanager and yarn resourcemanager.

Verifying the installation
If the entire configuration and installation is successful, open a web browser such as Internet Explorer to view the Resource Manager at http://localhost:8088, the Node Manager at http://localhost:8042 and the Namenode at http://localhost:50070.
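
The same check can be made without a browser by testing whether anything is listening on those ports. The Python sketch below is only a reachability probe: the function name is_listening is hypothetical, the ports are the ones from the URLs above, and an open port does not by itself prove that the daemon behind it is healthy.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Ports taken from the URLs in the text.
    for name, port in (("YARN web UI", 8042), ("Namenode web UI", 50070)):
        state = "up" if is_listening("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```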

What Hadoop is not!
Hadoop is not a replacement for a database or a SAN file system.
It is not a substitute for a database: Hadoop does not index files but, instead, it stores data in files. In order to search for something, we have to run a MapReduce job, which goes through all data. Hadoop is suitable in a scenario where the data is too vast to be handled by a database.

MapReduce is not always the best algorithm: MapReduce is a simple functional programming operation and it can be applied, in parallel, to gigabytes or terabytes of data. For that parallelism, you need to have each MR operation independent from all the others.

For more information, refer to the Apache Hadoop link in References [5].

References
[1] http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
[2] http://www.microsoft.com/en-in/download/details.aspx?id=8442
[3] http://cygwin.com/setup-x86_64.exe
[4] http://www.apache.org/dist/hadoop/core/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz
[5] http://wiki.apache.org/hadoop/HadoopIsNot
[6] http://hadoop.apache.org/
[7] https://svn.apache.org/viewvc/hadoop/common/branches/branch-2/BUILDING.txt?view=markup
[8] https://wiki.apache.org/hadoop/Hadoop2OnWindows
[9] http://en.wikipedia.org/wiki/Apache_Hadoop
