Open source software is created and maintained by a network of developers from around the world, and it is available for download free of charge: a quick web search for "Hadoop download" turns up the free distributions. Anyone can use the framework and contribute to its development. However, more and more commercial versions of the framework (often referred to as "distros") are also available.
Distributed by software vendors, these paid versions offer a customized Hadoop framework. Buyers also benefit from additional features related to security, governance, SQL support and management/administration consoles, as well as training, documentation and other services. Some of the most popular distributions include Cloudera, Hortonworks, MapR, IBM BigInsights and Pivotal HD.
There are several ways to ingest data into Hadoop. You can use connectors from third-party vendors. Sqoop imports structured data from relational databases into HDFS, Hive and HBase, while Flume continuously streams data from logs into Hadoop. Files can also be loaded through the HDFS Java API, as sketched below, and HDFS can even be mounted as a file system into which files are simply copied. These are just a few of the many options available.
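As an illustration of the Java route, here is a minimal sketch that copies a local file into HDFS using the FileSystem API. The namenode address and both paths are placeholders to adapt to your own cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; adjust for your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS (example paths).
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/data/raw/events.log"));
        fs.close();
    }
}
```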
According to a recent study by Zion Research, the global Hadoop market was worth $5 billion in 2015 and could reach $59 billion by 2021, a compound annual growth rate of 51% between 2016 and 2021.
The increase in the volume of structured and unstructured data within large enterprises, and the willingness of those enterprises to exploit this data, are the main factors driving the growth of the distributed computing platform market. The healthcare, finance, manufacturing, biotech and defense industries all need fast, efficient solutions for monitoring data. Further development and updates of the framework could open up new opportunities for the market; however, security and distribution issues could limit adoption of the platform.
Software, hardware and services are the three main segments of the Hadoop market. Of these, the services segment dominated in 2015, generating around 40% of total revenue, and is expected to continue to dominate through 2020. The software segment is also expected to grow significantly as large enterprises adopt Hadoop on a massive scale.
The IT industry accounted for 32% of Hadoop's total revenue in 2015, followed by telecommunications, then government and retail. These sectors are expected to generate strong growth as companies increasingly adopt Hadoop solutions.
North America was the largest Hadoop market in 2015, accounting for 50% of the overall revenue generated by the framework, and this trend is expected to continue in the years to come. Asia-Pacific is the fastest-growing region, thanks to the rise of the telecommunications and IT industries in China and India. Europe is also expected to post strong growth.
Some of the main companies in the market include Amazon Web Services, Teradata Corporation, Cisco Systems, IBM Corporation, Cloudera, Inc., Datameer, Inc., Oracle Corporation, Hortonworks, Inc., VMware and OpenX.
Hadoop 2 vs. Hadoop 3
Available since 2012, Hadoop 2 has been steadily expanded over the years; its latest release, 2.9.0, arrived on November 17, 2017. In parallel, Hadoop 3.0.0 was released on December 13, 2017. This new version brings a number of new features worth presenting.
The first difference concerns container management. Hadoop 3 provides more agility through Docker container isolation, which makes it possible to build applications quickly and deploy them in minutes. Time to market is therefore shorter.
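As a rough sketch of what this looks like at the API level, assuming a cluster where the administrator has enabled YARN's Docker runtime, an application can ask for its containers to run inside a Docker image by setting two environment variables in the ContainerLaunchContext. The image name here is just an example:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public class DockerLaunchSketch {
    // Builds a launch context whose process runs inside a Docker container.
    public static ContainerLaunchContext dockerContext(List<String> commands) {
        Map<String, String> env = new HashMap<>();
        // Tell YARN's container runtime to use Docker instead of the default.
        env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
        // The Docker image to run in; "library/openjdk:8" is an example.
        env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "library/openjdk:8");

        return ContainerLaunchContext.newInstance(
                null,     // local resources (none needed for this sketch)
                env,      // environment selecting the Docker runtime
                commands, // command to execute inside the container
                null, null, null);
    }
}
```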
The cost of using Hadoop 3 is also lower, because Hadoop 2 requires more storage space. Hadoop 2 protects data through replication: each block is stored three times, so six blocks of data occupy 18 blocks in total, copies included. Hadoop 3 introduces erasure coding, which stores the same six data blocks plus three parity blocks, i.e. nine blocks in total, for a comparable level of protection. The latest release therefore reduces the storage load while maintaining the same quality of data backup, and less space occupied means lower costs.
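The nine-versus-eighteen figure comes from the default Reed-Solomon RS(6,3) policy: six data blocks are protected by three parity blocks (9 blocks, a 50% overhead), whereas triple replication stores 6 × 3 = 18 blocks (a 200% overhead). Below is a minimal sketch of enabling that policy on a directory, assuming a Hadoop 3 cluster where the RS-6-3-1024k policy is enabled and using a placeholder namenode address:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EnableErasureCoding {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        // Erasure coding is an HDFS-specific feature, hence the cast.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // RS-6-3-1024k: 6 data blocks + 3 parity blocks = 9 blocks stored,
        // versus 18 blocks (6 blocks x 3 replicas) under plain replication.
        dfs.setErasureCodingPolicy(new Path("/data/cold"), "RS-6-3-1024k");
        dfs.close();
    }
}
```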
Another major difference concerns the namenode, the component that manages the file system tree, file metadata and directories. Hadoop 2 allows at most two namenodes (one active, one standby), while Hadoop 3 can run several, which lets the infrastructure scale much further. More namenodes also mean more resilience: if one of these managers goes down, another can take over.
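On the client side, such a deployment is addressed through a logical nameservice rather than a single host. Here is a hedged sketch of the standard HDFS high-availability client configuration, where the nameservice "mycluster" and the host names are invented for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One logical nameservice backed by three namenodes (example names).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn3", "host3.example.com:8020");

        // The failover proxy provider routes every call to the active namenode.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.defaultFS", "hdfs://mycluster");

        // Clients address the nameservice; failover is transparent to them.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}
```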
With its intra-datanode balancing system, Hadoop 3 also eliminates uneven usage across a node's hard drives. Dividing up the work is easier too: unlike with Hadoop 2, it is possible to prioritize applications and users, as sketched below.
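For illustration, here is a small sketch of raising a running application's priority through the YarnClient API, available in recent Hadoop releases; the application id is passed in as an argument, and priorities must be permitted by the scheduler configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class RaisePriority {
    public static void main(String[] args) throws Exception {
        // Id of an already submitted job, e.g. application_1700000000000_0001.
        ApplicationId appId = ApplicationId.fromString(args[0]);

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // Raise the application's priority within its queue; higher runs sooner.
        yarn.updateApplicationPriority(appId, Priority.newInstance(10));
        yarn.stop();
    }
}
```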
Finally, this latest version of the HDFS system opens up new perspectives for designers of machine learning and deep learning algorithms. It supports not only more hard disks per node but, above all, graphics cards. Analyses thus benefit from the computing power of these processors, which is particularly useful for developing artificial intelligence applications.
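As a sketch, assuming a Hadoop 3.1+ cluster where the administrator has enabled GPU scheduling, a container request can ask for GPUs through YARN's resource-types mechanism; "yarn.io/gpu" is the resource name used by YARN's GPU plugin:

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class GpuResourceSketch {
    public static void main(String[] args) {
        // Request 4 GB of memory, 2 vcores and 1 GPU for a container.
        Resource resource = Resource.newInstance(4096, 2);
        resource.setResourceValue("yarn.io/gpu", 1);

        // This Resource would then go into a ContainerRequest sent to the
        // ResourceManager by an application master.
        System.out.println("Container request: " + resource);
    }
}
```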