Databricks: Simplifying Big Data
One of the biggest challenges in building a big data solution is creating and maintaining a development environment that supports an Agile methodology. Traditionally, IT and DevOps teams have to provision physical clusters, which is challenging and time-consuming.
The Databricks cloud offers an elegant solution to this problem, allowing users to create a cluster and scale it up to hundreds of nodes.
All of these clusters are created and maintained by Databricks on the AWS cloud, and the user pays only for actual usage. A cluster can run any release of Spark, including the latest beta versions.
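For illustration, here is a minimal sketch of creating such an autoscaling cluster through the Clusters endpoint of the REST API (described further below); the workspace URL, access token, Spark version, and instance type are all placeholders:

import requests

# Hypothetical workspace URL and personal access token
HOST = "https://<your-shard>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

cluster_spec = {
    "cluster_name": "dev-cluster",
    "spark_version": "2.1.x-scala2.11",   # any available release, including betas
    "node_type_id": "r3.xlarge",          # AWS instance type
    "autoscale": {"min_workers": 2, "max_workers": 100},
}

resp = requests.post(HOST + "/api/2.0/clusters/create",
                     headers=HEADERS, json=cluster_spec)
print(resp.json())  # e.g. {"cluster_id": "..."}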
Once we have the cluster running, the next challenge is to create either a prototype or a production version of the Spark job. For this, Databricks provides the Notebook, a Spark shell with several user-friendly features.
We can run SQL queries and shell scripts, and create notes, directly from within the shell. Currently, the Databricks Notebook supports Scala (Spark), Python, SQL, and R. With its collaborative features, it allows multiple users to contribute and share simultaneously.
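A minimal sketch of this language mixing, assuming a Databricks notebook where spark is predefined; the view name and data are made up:

# Cell 1 (Python): build a small DataFrame and expose it to other languages
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# Cell 2 (SQL, via the %sql magic command):
# %sql
# SELECT name, age FROM people WHERE age > 30

# Cell 3 (shell script on the driver node, via the %sh magic command):
# %sh
# ls /tmp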
The Databricks environment provides connectivity to Cassandra and all other major data sources and sinks, along with a built-in file system for storing files.
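As a sketch, reading from Cassandra and from the built-in file system might look like the following; this assumes the spark-cassandra-connector library is attached to the cluster, and the keyspace, table, and file names are hypothetical:

# Read a Cassandra table (assumes spark.cassandra.connection.host is set
# in the cluster's Spark configuration)
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="analytics", table="events")
          .load())

# Read a CSV file previously uploaded to the built-in file system
uploads = spark.read.csv("/FileStore/tables/uploads.csv", header=True)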
After a project has been developed in Notebooks, it can be moved to production using the Databricks Jobs functionality.
Jobs can be scheduled to run according to user-defined use cases, with options to notify users about run results via email.
Projects developed outside the Databricks environment can also be automated by scheduling the project JAR. This way, Java JARs can be run in addition to code written in Python, Scala, and R.
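A minimal sketch of registering such a JAR as a scheduled job with email notifications, using the Jobs endpoint of the REST API described next; the job name, cluster settings, JAR path, and main class are all hypothetical:

import requests

HOST = "https://<your-shard>.cloud.databricks.com"   # hypothetical workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

job_spec = {
    "name": "nightly-etl",
    "new_cluster": {
        "spark_version": "2.1.x-scala2.11",
        "node_type_id": "r3.xlarge",
        "num_workers": 4,
    },
    "libraries": [{"jar": "dbfs:/FileStore/jars/etl.jar"}],       # hypothetical JAR
    "spark_jar_task": {"main_class_name": "com.example.etl.Main"},
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # every day at 02:00
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["team@example.com"]},
}

resp = requests.post(HOST + "/api/2.0/jobs/create", headers=HEADERS, json=job_spec)
print(resp.json())  # e.g. {"job_id": 42}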
Databricks also provides a REST API, which can be used to create clusters, trigger jobs, and check their status. As a bonus, this allows users to drive the whole process from external automation tools such as Jenkins.
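Continuing the sketch above, a hypothetical Jenkins step could trigger the job and poll its status like this; the job ID and workspace URL are placeholders:

import time
import requests

HOST = "https://<your-shard>.cloud.databricks.com"   # hypothetical workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Trigger the job created earlier, then poll until the run finishes
run_id = requests.post(HOST + "/api/2.0/jobs/run-now",
                       headers=HEADERS, json={"job_id": 42}).json()["run_id"]

while True:
    state = requests.get(HOST + "/api/2.0/jobs/runs/get",
                         headers=HEADERS, params={"run_id": run_id}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Result:", state.get("result_state"))
        break
    time.sleep(30)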
Databricks also gives us a built-in visualization feature called Dashboards for viewing output, or we can use any other visualization tool, such as Tableau.
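Inside a Notebook, for example, output can be rendered with the built-in display function; df here is the hypothetical DataFrame from the earlier sketch:

display(df)   # the notebook UI then offers table and chart views of the result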
But no tool is foolproof, and Databricks is no exception. One issue is building large projects inside a Notebook without proper IDE-like support; big Notebooks can become messy and difficult to maintain.
Databricks mitigates this by allowing one Notebook to call another, so that large projects can be broken into more manageable chunks.
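A minimal sketch of both chaining mechanisms, with hypothetical notebook paths and parameters:

# Option 1: inline another notebook's definitions into the current one
# %run ./etl_helpers

# Option 2: run a notebook as a separate invocation, with a timeout (in seconds)
# and parameters, capturing whatever it returns via dbutils.notebook.exit
result = dbutils.notebook.run("/Projects/etl/load_events", 3600, {"date": "2017-01-01"})
print(result)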
In conclusion, Databricks is a great cloud-based tool if you are looking for an easy, scalable big data solution. The Databricks Community Edition lets users explore the Databricks environment free of cost, which should increase adoption and encourage more user involvement in further enhancing its feature set.
Article contribution – Prashant Verma:
Prashant Verma is a Lead Data Engineer with the Data Sciences and Analytics Team at QA InfoTech.