In our fast-paced, data-driven world, the ability to quickly access and analyse huge amounts of data is key to success. Databricks, a leading platform for big data analytics, machine learning, and data engineering, makes this possible. Its preloaded datasets are a game-changer for anyone looking to boost their data journey.
So, what are Databricks Datasets? They are ready-to-use, real-world datasets that come built into the Databricks environment. Think of them as your secret weapon for rapidly building, testing, and refining your data science and analytics projects without the hassle of sourcing, cleaning, or managing huge datasets.
Whether you're experimenting with machine learning algorithms, scaling up big data projects, or sharpening your SQL and data warehousing skills, the built-in datasets are a one-stop solution to fast-track your results. And with Databricks' seamless integration, you can dive into them instantly with just a few clicks.
How to Access the Datasets
Accessing the datasets is as simple as running a line of code within a Databricks Notebook. It is important that you have your cluster running. In my case, I've got my cluster fired up, as seen below.
Then click the New button and create a Notebook, as seen below.
The running cluster is automatically attached to the Notebook. You can then run the code in a cell, as seen in the screenshot below.
At the time of writing, there are a total of 56 datasets that users can play with.
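The listing step described above can be sketched roughly as follows. This is a minimal sketch, assuming the datasets are mounted at `/databricks-datasets` and that the notebook-provided `dbutils` object is available. The helper name `list_dataset_folders` is hypothetical and used only for illustration.

```python
def list_dataset_folders(dbutils, root="/databricks-datasets"):
    """Return the sorted names of the entries directly under `root`.

    `dbutils` is the utility object that Databricks notebooks provide;
    it is passed in explicitly so the helper has no hidden globals.
    The default `root` is an assumption about where the datasets live.
    """
    # dbutils.fs.ls returns one entry per file or folder, each exposing
    # attributes such as `path`, `name` and `size`.
    entries = dbutils.fs.ls(root)
    return sorted(entry.name for entry in entries)


# In a Databricks notebook cell (assumed usage):
#   folders = list_dataset_folders(dbutils)
#   print(len(folders), "entries found")
#   display(dbutils.fs.ls("/databricks-datasets"))
```

The number of entries you see may differ from the count quoted above, since the folder contents can change over time and the listing may include files that are not dataset folders.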
Read Dataset into DataFrame
In this demo, I read the sfo_customer_survey dataset from databricks-datasets into a Spark DataFrame and displayed it using the display method, as seen below.
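The read step described above might look like the sketch below. It assumes the notebook-provided `spark` session, and the CSV file name under `sfo_customer_survey` is an assumption that should be checked against the folder listing before use. `read_csv_dataset` is a hypothetical helper name.

```python
# Assumed location of the survey file; verify it with dbutils.fs.ls first.
SFO_SURVEY_PATH = (
    "/databricks-datasets/sfo_customer_survey/2013_SFO_Customer_Survey.csv"
)


def read_csv_dataset(spark, path=SFO_SURVEY_PATH):
    """Load a CSV file into a Spark DataFrame with a header row and inferred types."""
    return (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark infer the column types
        .load(path)
    )


# In a Databricks notebook cell (assumed usage):
#   df = read_csv_dataset(spark)
#   display(df)
```

Passing `spark` in as an argument keeps the helper easy to reuse with a different path or session. In a notebook, the `spark` object is already defined for you.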
Once the dataset is loaded, we can start performing transformations, queries, or any other analysis. See you in the next tutorial.