EDA and Data Visualization of Zomato Bangalore Restaurants Dataset
- Introduction
- Setup
- Data Preparation and Cleaning
- Exploratory Analysis and Visualization
- Q1. What locations are most popular for restaurants in Bengaluru?
- Q2. Which locations have the best rated restaurants?
- Q3. What relation does the rating and number of votes that a restaurant receives have? What about table booking and online order facility?
- Q4. Is a restaurant which offers online order facility rated better?
- Q5. Are restaurants offering expensive food rated better? Does a table booking facility make a difference?
- Q6. Does the number of cuisines that a restaurant provides have a relation to the rating it receives?
- Q7. How do Casual Dining and Fine Dining restaurants differ in their cost_for_two_people?
- Q8. How do Casual Dining and Fine Dining restaurants differ in their rating?
- Q9. What are the five most common types of restaurants?
- Q10. What are the top 5 rated restaurants in type and no_of_cuisines combined?
- Q11. What is the relationship between the type and cost_for_two_people?
- Summary and Conclusion
Zomato is an Indian multinational restaurant aggregator and food delivery company. In this project, we're going to study and analyze the Zomato Dataset shared by Himanshu Podder on Kaggle. This dataset contains information on restaurants in the city of Bengaluru, India. We can use this dataset to get an idea of the different factors affecting restaurants in different parts of the city and to answer questions such as: Which type of food is most popular in the city? How does the location of a restaurant affect its rating on the Zomato platform? What relation is there between a restaurant's rating and the number of cuisines it offers?
We will use Exploratory Data Analysis (EDA) to study this data. EDA is an approach for analysing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It can help us see what the data can tell us beyond formal modeling or hypothesis testing.
The dataset can be used to answer many questions, but for this project we limit ourselves to the 11 questions listed above.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Data Visualization
import seaborn as sns # Data Visualization
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
file_path = "/kaggle/input/zomato-bangalore-restaurants/zomato.csv"
!ls -lh {file_path}
The file size of the dataset is 548MB and it is safe to import the whole dataset at once.
# Read the csv file into a pandas DataFrame
df = pd.read_csv(file_path, thousands = ',')
df.head()
df.info()
df.drop(["url", "name", "phone", "reviews_list", "address", "menu_item"], axis = 1, inplace = True)
df.columns
We'll also rename some of the columns.
df.rename(mapper = {"listed_in(type)": "type", "approx_cost(for two people)": "cost_for_two_people", "rate": "rating"}, axis = 1, inplace = True)
df.columns
df.dtypes
We need to change the rating column to numeric dtype.
# All distinct values in the `rating` column
df.rating.unique()
# Remove the non-desired values from the rating column
df = df.loc[df.rating != "NEW"]
df = df.loc[df.rating != "-"]
# Select the first 3 characters and convert the column to numeric
df.rating = pd.to_numeric(df.rating.str[:3])
df.rating.head()
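Slicing the first three characters works when every remaining value starts with a one-decimal number. As a more defensive alternative, here is a minimal sketch that splits on the slash and coerces anything non-numeric to NaN. The sample values are hypothetical and assume the raw strings look like `4.1/5`; they are not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample of raw rating strings (assumed format, not real rows)
raw = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", None, "4.0/5"])

# Keep the part before the slash, trim whitespace, and turn
# anything that is not a number ("NEW", "-", missing) into NaN
parsed = pd.to_numeric(raw.str.split("/").str[0].str.strip(), errors="coerce")

print(parsed)
```

With this approach the placeholder values end up as NaN and can be dropped together with the other missing ratings.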
# Number of null values
df.isnull().sum().sort_values(ascending = False)
It appears that in the columns with null values, a missing value neither stands for zero nor tells us anything useful. Thus, it's better to drop the rows with null values.
In the dish_liked column, null values account for about half of the data, so it's better to drop the whole column.
df.dropna(subset = ["location", "rating", "rest_type", "cuisines", "cost_for_two_people"], inplace = True)
df.drop(["dish_liked"], axis = 1, inplace = True)
df.info()
df.head()
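The decision between dropping rows and dropping a whole column depends on the share of missing values in each column. A small sketch of how that share can be computed, using a made-up toy frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the calculation
toy = pd.DataFrame({
    "rating": [4.1, np.nan, 3.9, 4.0],
    "dish_liked": [None, None, "Pasta", None],
})

# isnull() gives booleans; their mean is the fraction of missing values per column
null_share = toy.isnull().mean().sort_values(ascending=False)

print(null_share)
```

A column with a large missing share is a candidate for removal, while a small share can usually be handled by dropping the affected rows.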
df.cuisines = df.cuisines.str.split(",")
df.rest_type = df.rest_type.str.split(",")
df.head()
Exploratory Analysis and Visualization
With these columns in mind, we will try to answer the following questions from the dataset:
- What locations are most popular for restaurants in Bengaluru?
- Which locations have the best rated restaurants?
- What relation does the rating and number of votes that a restaurant receives have? What about table booking and online order facility?
- Is a restaurant which offers online order facility rated better?
- Are restaurants offering expensive food rated better? Does a table booking facility make a difference?
- Does the number of cuisines that a restaurant provides have a relation to the rating it receives?
- How do Casual Dining and Fine Dining restaurants differ in their rating?
- How do Casual Dining and Fine Dining restaurants differ in their cost_for_two_people?
- What are the most common types of restaurants?
popular_locations = df.location.value_counts().head(15)
popular_locations
plt.figure(figsize = (10, 8))
sns.barplot(x = popular_locations, y = popular_locations.index)
The 5 most popular locations for restaurants are BTM, Koramangala 5th Block, HSR, Indiranagar, and JP Nagar, with BTM boasting nearly 4000 eateries.
# groupby location and get the count of each location along with the average rating of restaurants in that location.
location_rating = df.groupby(by = ["location"])["rating"].agg(["count", "mean"])
location_rating.head()
# select the locations with at least 50 eateries and then sort them by their average rating.
rated_locations = location_rating.loc[location_rating["count"] >= 50].sort_values(by = "mean", ascending = False)
# select the top 20 locations.
top20_rated_locations = rated_locations[:20]
top20_rated_locations.head()
# plot the observations
plt.figure(figsize = (15, 5))
sns.barplot(x = top20_rated_locations.index, y = top20_rated_locations["mean"])
plt.xticks(rotation = 45)
plt.ylim(3.5, 4.3)
plt.show()
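The minimum-count filter matters because a location with only a handful of restaurants can reach the top of the ranking by chance. A minimal sketch of the same group, filter, and sort pattern on a toy frame (locations, ratings, and the threshold of 2 are all hypothetical):

```python
import pandas as pd

# Toy data: location "C" has a single, very highly rated restaurant
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "C"],
    "rating": [4.0, 4.2, 3.8, 3.5, 3.7, 4.9],
})

stats = toy.groupby("location")["rating"].agg(["count", "mean"])

min_count = 2  # illustrative threshold; the notebook uses 50
ranked = stats.loc[stats["count"] >= min_count].sort_values("mean", ascending=False)

print(ranked)
```

Without the threshold, "C" would rank first on the strength of one restaurant; with it, only locations with enough eateries are compared.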
These are the top 20 locations in Bengaluru based on the eateries' ratings. The average rating in these locations ranges from 3.8 to 4.14.
# Relationship between `rating`, `votes` and `book_table`
plt.figure(figsize = (8, 5))
sns.scatterplot(x = "rating", y = "votes", hue = "book_table", data = df, s = 40)
The number of votes looks to be positively correlated with the rating of the restaurant, and it appears to increase roughly exponentially with the rating beyond a critical point. This is to be expected because better restaurants would attract more customers and thus more votes.
Also, the restaurants which don't provide a booking facility are clustered at low ratings and low vote counts. In other words, more popular restaurants with high ratings are more likely to offer a table booking facility, which can also be seen in the following graph.
sns.boxplot(x = "book_table", y = "rating", data = df)
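A visual impression of a monotonic but non-linear relationship can be backed up with a rank correlation. The sketch below uses made-up numbers (not values from the dataset) and computes Spearman's coefficient as the Pearson correlation of the ranks, which needs nothing beyond pandas:

```python
import pandas as pd

# Hypothetical ratings and vote counts that grow much faster than linearly
toy = pd.DataFrame({
    "rating": [3.0, 3.5, 4.0, 4.5],
    "votes": [10, 50, 400, 5000],
})

# Pearson measures linear association
pearson = toy["rating"].corr(toy["votes"])

# Spearman is the Pearson correlation of the ranks, so it captures
# any monotonic relationship, linear or not
spearman = toy["rating"].rank().corr(toy["votes"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

When votes rise steeply with rating, the rank-based coefficient is the more faithful summary, since the linear one understates a curved relationship.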
Now, we can look at the scatterplot for rating, votes, and online_order.
# Relationship between `rating`, `votes` and `online_order`
plt.figure(figsize = (8, 5))
sns.scatterplot(x = "rating", y = "votes", hue = "online_order", data = df, s = 40)
The data points for the online order facility are scattered across the graph, and thus the data doesn't reveal any relationship between these variables. But it may be worth exploring its relationship with the votes and the rating individually, which we'll do in the following sections.
sns.violinplot(x = "online_order", y = "rating", data = df)
As discussed in the previous question, the rating doesn't seem to differ between restaurants which offer the online order facility and those which don't.
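A per-group summary table is a useful companion to the violin plot when judging whether two distributions differ. A minimal sketch on a toy frame, assuming the online_order column holds "Yes"/"No" strings (both that encoding and the ratings are illustrative):

```python
import pandas as pd

# Toy data with an assumed "Yes"/"No" encoding for online_order
toy = pd.DataFrame({
    "online_order": ["Yes", "No", "Yes", "No"],
    "rating": [3.9, 3.7, 4.1, 3.9],
})

# Count, median, and mean rating for each group
summary = toy.groupby("online_order")["rating"].agg(["count", "median", "mean"])

print(summary)
```

If the medians and means of the two groups are close on the real data, that supports reading the violin plot as showing no meaningful difference.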
plt.figure(figsize = (8, 5))
sns.scatterplot(x = "rating", y = "cost_for_two_people", data = df, hue = 'book_table')
Although there are hints of an exponential relationship between the cost for two people and the rating of a restaurant, most of the data is clustered at low cost, and because there are few data points for expensive restaurants, the data is inconclusive for this relationship. But we can study the book_table facility on its own against the cost for two people, where restaurants which offer table booking seem to be more expensive than restaurants which don't.
sns.boxplot(x = "book_table", y = "cost_for_two_people", data = df)
This graph also supports the idea that table booking is correlated with the cost for two people.
The cuisines column shows all the cuisines that a restaurant offers. We can add a column to the DataFrame to store the number of cuisines that a restaurant offers.
df["no_of_cuisines"] = df.cuisines.str.len()
df["no_of_cuisines"].head()
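The count relies on the fact that the `.str` accessor also works element-wise on lists, so `.str.len()` returns the length of each list once the column has been split. A short sketch with made-up cuisine strings:

```python
import pandas as pd

# Hypothetical comma-separated cuisine strings
cuisines = pd.Series(["North Indian, Chinese", "Cafe", "Italian, Pizza, Desserts"])

# Splitting turns each string into a list of cuisines
as_lists = cuisines.str.split(",")

# On a Series of lists, .str.len() gives the number of items per list
counts = as_lists.str.len()

print(counts.tolist())
```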
plt.figure(figsize = (8, 5))
sns.stripplot(x = "no_of_cuisines", y = "rating", data = df)
The ratings seem to become more concentrated around the mean as the number of cuisines that a restaurant offers goes up. But this could also be an artifact of the small number of restaurants offering many cuisines. In general, the mean rating also seems to go up with the number of cuisines, but the graph is inconclusive. We'll explore this further in the boxplot.
sns.boxplot(x = "no_of_cuisines", y = "rating", data = df)
This graph reflects the relationship better and does support the idea that restaurants offering more cuisines are usually rated better.
#Look at the dataframe
df.head()
We'll need to explode the rest_type column to extract information.
# extract information from `rest_type` column
rest_type_exploded = df.explode(column = "rest_type")
rest_type_exploded["rest_type"] = rest_type_exploded["rest_type"].str.strip()
rest_type_exploded.head()
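To see what exploding does, here is a minimal sketch on a toy frame (the rows are hypothetical). Each list element gets its own row, the other columns are repeated, and the original index label is kept, so a restaurant with two types appears twice:

```python
import pandas as pd

# Toy frame: the first restaurant has two types, the second has one
toy = pd.DataFrame({
    "rating": [4.2, 3.9],
    "rest_type": [["Casual Dining", " Bar"], ["Quick Bites"]],
})

exploded = toy.explode(column="rest_type")

# Splitting on "," leaves a leading space on later items, hence the strip
exploded["rest_type"] = exploded["rest_type"].str.strip()

print(exploded)
```

Because multi-type restaurants are duplicated, counts and averages taken on the exploded frame are per restaurant-type pair rather than per restaurant.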
# separate data for casual dining restaurants and fine dining restaurants.
fine_dining_rest = rest_type_exploded.loc[rest_type_exploded.rest_type == "Fine Dining"]
casual_dining_rest = rest_type_exploded.loc[rest_type_exploded.rest_type == "Casual Dining"]
# plot the data
sns.kdeplot(fine_dining_rest.cost_for_two_people, fill = True)
sns.kdeplot(casual_dining_rest.cost_for_two_people, fill = True)
plt.legend(["Fine Dining", "Casual Dining"])
plt.show()
As expected, fine dining restaurants are much more expensive than casual dining restaurants.
sns.kdeplot(fine_dining_rest.rating, fill = True)
sns.kdeplot(casual_dining_rest.rating, fill = True)
plt.legend(["Fine Dining", "Casual Dining"])
plt.show()
Fine dining restaurants are usually rated better and their ratings show much less variance than the ratings of casual dining restaurants.
# Five most common types of restaurants
most_common_types_of_restaurants = rest_type_exploded.rest_type.value_counts()
most_common_types_of_restaurants.head()
sns.barplot(x = most_common_types_of_restaurants.head(), y = most_common_types_of_restaurants.index[:5])
plt.show()
# group and extract data for different `types` and `no of cuisines`
rating_data = df.groupby(by = ["type", "no_of_cuisines"])["rating"].agg("mean")
rating_data.head()
Now, we'll make a 2D DataFrame out of this multi-indexed pandas Series.
rating_data_df = rating_data.unstack()
rating_data_df.head()
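The reshaping step can be illustrated on a toy frame (all values hypothetical). `unstack()` moves the inner index level into the columns, and any combination that never occurs in the data becomes NaN, which a heatmap leaves blank:

```python
import pandas as pd

# Toy data: no "Cafes" row offers 2 cuisines
toy = pd.DataFrame({
    "type": ["Buffet", "Buffet", "Cafes", "Cafes"],
    "no_of_cuisines": [1, 2, 1, 1],
    "rating": [4.0, 4.2, 3.8, 4.0],
})

# Mean rating per (type, no_of_cuisines) pair: a Series with a two-level index
means = toy.groupby(["type", "no_of_cuisines"])["rating"].mean()

# Inner level (no_of_cuisines) becomes the columns of a 2D grid
grid = means.unstack()

print(grid)
```

Each cell is an average over however many restaurants fall into that combination, so cells backed by very few restaurants should be read with caution.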
# plot the data
plt.figure(figsize = (9, 7))
fig = sns.heatmap(data = rating_data_df, annot = True, cmap = "rocket_r")
fig.set(xlabel = "No. of cuisines", ylabel = "Type")
plt.show()
From the plot, Pubs and bars which offer more than 3 cuisines are all rated high. Similarly, Drinks & nightlife restaurants with multiple cuisines are also rated really high.
We can get the top 5 combinations from the rating_data Series.
# top 5 combinations for `type` and `no_of_cuisines`
rating_data.sort_values(ascending = False).head()
plt.figure(figsize = (8, 6))
sns.boxplot(x = "type", y = "cost_for_two_people", data = df)
plt.xticks(rotation = 45)
plt.show()
The first thing to note is that there are quite a few outliers in the data, almost all of them much more expensive than the rest of the distribution. Also, buffet, drinks & nightlife, and pubs are much more expensive than eateries of the type desserts and delivery.
Many questions could be asked and explored from the Zomato Dataset, but here we tried to answer 11 of them. We studied all the restaurants in Bengaluru that are registered on Zomato, and tried to explore the relationships of multiple variables with the restaurants' ratings. We also studied what factors go along with the food cost for two people in these restaurants.
There are two important things to note before drawing any conclusions. First, the analysis might apply only to restaurants registered on Zomato and similar online platforms, and might differ significantly for the offline food industry. Second, and very importantly, all the relationships that we studied are correlational in nature. This project thus does not establish causal relationships, although it might suggest some and can be taken as inspiration to conduct actual experimental studies of the variables discussed here. Keeping this in mind, we can look at what we actually did establish in this EDA of the Zomato Dataset.
- The most popular places for restaurants in Bengaluru are BTM, Koramangala 5th Block, HSR, Indiranagar, and JP Nagar, with BTM boasting nearly 4000 eateries.
- The top 5 locations according to avg rating of restaurants are Lavelle Road, Koramangala 3rd Block, St. Marks Road, Koramangala 5th Block and Church Street.
- Restaurants with higher ratings have generally received more votes than the restaurants with lower rating and they are more likely to offer table booking facility.
- Restaurants offering a table booking facility are also generally more expensive.
- Restaurants offering a larger number of cuisines are, on average, rated better.
- Fine dining restaurants are much more expensive than casual dining restaurants and they are also usually rated better with much less variance in the ratings.
- Quick Bites and Casual Dining restaurants are the most popular types of restaurant in Bengaluru on Zomato.
- Pubs and bars which offer more than 3 cuisines are all rated high. Similarly, Drinks & nightlife restaurants with multiple cuisines are also rated really highly.
- Buffet, drinks & nightlife, and pubs are much more expensive than eateries of the type desserts and delivery.
- For the other questions we asked, the data was more or less inconclusive. We may need more extensive data to answer those questions.
Apart from these inferences, many more interesting relationships can be studied and should be explored from this data.
Thanks for reading.