Getting Started with Databricks Datasets

Abiola David

In our fast-paced, data-driven world, being able to quickly access and analyse huge amounts of data is key to success. Databricks, a leading platform for big data analytics, machine learning, and data engineering, makes this possible. Its preloaded datasets are a game-changer for anyone looking to speed up their data journey.

So, what are Databricks Datasets? They are ready-to-use, real-world datasets that come built into the Databricks environment. Think of them as your secret weapon for rapidly building, testing, and refining your data science and analytics projects without the hassle of sourcing, cleaning, or managing huge datasets yourself.

Whether you're experimenting with machine learning algorithms, scaling up big data projects, or sharpening your SQL and data warehousing skills, the built-in datasets are a one-stop solution to fast-track your results. And with Databricks' seamless integration, you can dive into them instantly with just a few clicks.

How to Access the Datasets

Accessing the datasets is as simple as running a line of code within a Databricks notebook. Make sure your cluster is running first. In my case, I've got my cluster fired up, as seen below.

[Screenshot: Compute]

Next, click the New button and create a notebook, as seen below.

[Screenshot: Databricks]

The running cluster is automatically attached to the notebook. You can then run code in a notebook cell to list the available datasets, as seen in the screenshot below.
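The code in the screenshot is not reproduced in the text. A minimal sketch of one way to list the built-in datasets, assuming they are available under the /databricks-datasets path and that the notebook provides the usual `dbutils` and `display` helpers, might look like this. The helper name `list_dataset_folders` is hypothetical and is not part of any Databricks API.

```python
def list_dataset_folders(ls, root="/databricks-datasets"):
    """Return the sorted names of the entries directly under `root`.

    `ls` is any callable that behaves like `dbutils.fs.ls`: it takes a
    path and returns objects that have a `name` attribute. Directory
    names come back with a trailing slash, which is stripped here.
    """
    return sorted(entry.name.rstrip("/") for entry in ls(root))


# `dbutils` and `display` are predefined only inside a Databricks notebook,
# so this part is skipped when the code runs anywhere else.
if "dbutils" in globals():
    folders = list_dataset_folders(dbutils.fs.ls)
    print(len(folders), "entries under /databricks-datasets")
    # Render the raw listing as a table in the notebook.
    display(dbutils.fs.ls("/databricks-datasets"))
```

Passing the listing function in as an argument keeps the helper usable outside a notebook. Inside Databricks, a single call to `display(dbutils.fs.ls("/databricks-datasets"))` is enough.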

[Screenshot: Datasets table]

At the time of writing, there are 56 datasets in total that users can play with.

Read the Dataset into a DataFrame

In this demo, I read the sfo_customer_survey dataset from databricks-datasets into a Spark DataFrame and displayed it using the display method, as seen below.
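The exact code used in the demo appears only in the screenshot. As a hedged sketch, reading a CSV file from that folder into a Spark DataFrame could look like the following. The file name `2013_SFO_Customer_Survey.csv` is an assumption, so list the folder first to confirm what it contains. The helper `read_csv_dataset` is a hypothetical name used for illustration.

```python
def read_csv_dataset(spark, path):
    """Read a CSV file into a Spark DataFrame.

    The first row is treated as the header, and Spark infers the
    column types from the data.
    """
    return spark.read.csv(path, header=True, inferSchema=True)


# `spark` and `display` are predefined only inside a Databricks notebook,
# so this part is skipped when the code runs anywhere else.
if "spark" in globals():
    # Assumed file name; check it with dbutils.fs.ls on the folder.
    survey_path = "/databricks-datasets/sfo_customer_survey/2013_SFO_Customer_Survey.csv"
    df = read_csv_dataset(spark, survey_path)
    display(df)
```

Schema inference makes Spark take an extra pass over the file. That is fine for a small sample dataset, but an explicit schema is usually a better choice for large files.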

[Screenshot: Datasets table]

Once the dataset is loaded, we can start performing transformations, queries, or any analysis. See you in the next tutorial.