A Small Step-by-Step Guide for Your End-to-End Data Science Project

Anmol Sharma
7 min read · May 25, 2021

ELI5: Data to Deployment; DIY

E2E Deployment

How is a Data Science project made? How do you start it? Where do you search for what you need? How do you proceed?… If you are stuck on any of these questions, here is a simple step-by-step guide based on a fully deployed demo project I made.

The first and most important thing is the path you take towards your project, because Data Science projects move forward in one of two ways:

  1. You have a problem statement, you find a relevant dataset, and you get started.
  2. You have a random dataset, and you frame a problem statement by drawing insights from that data. Then you get started.

When this thing is clear in your mind, you are good to go.

In my case, it was the first one.

Now, let’s move to my project. I will be discussing the first approach where I got the problem statement, searched for a dataset, and started. If you want to follow the second approach, just omit the data collection step and go straight to the EDA section.

By the way, my project is a multiple linear regression model that can predict the calories you burn during exercise.

All the steps below can be followed in a local Jupyter notebook instance or in Google Colab, as you prefer. I did all the data wrangling, EDA, and model training in Google Colab.

I have simply laid out a blueprint you can take help from. For the exact dataset and code implementation, check out my GitHub repo.

1. Data collection

The first step is always collecting relevant data for your problem. In my case, I wanted to build a machine learning model capable of predicting the calories burnt during exercise, and I was fortunate enough to find a suitable dataset on Kaggle. Download yours from Kaggle or any other source you prefer.

Next comes the data wrangling (structuring, cleaning, discovering) part.

2. Data Wrangling

After getting the dataset, it’s time to check for data adequacy: whether your data is structured and clean enough to provide useful insights. For this, we need to explore our dataset.

head() of my dataset

As I explored my dataset, I found some missing values and simply filled them with the mean value. You may also find outliers or NaN values. Follow the steps below for handling them; a short code sketch follows the list.

  • For missing values: you can omit them if they don’t really affect your model’s performance or if there aren’t many of them, or you can fill them with the mean or the mode, depending on whether the missing value is quantitative or categorical, respectively.
  • For outliers: you can simply omit them, or you can normalize the data if that helps.
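
As a minimal sketch of both ideas, assuming a pandas DataFrame named df with a quantitative column Heart_Rate and a categorical column Gender (the file and column names here are placeholders for illustration):

    import pandas as pd

    df = pd.read_csv("calories.csv")  # placeholder file name

    # Missing quantitative values -> fill with the column mean
    df["Heart_Rate"] = df["Heart_Rate"].fillna(df["Heart_Rate"].mean())

    # Missing categorical values -> fill with the column mode
    df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

    # One simple outlier rule: drop rows more than 3 standard deviations from the mean
    mean, std = df["Heart_Rate"].mean(), df["Heart_Rate"].std()
    df = df[(df["Heart_Rate"] - mean).abs() <= 3 * std]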

This is just an eagle’s-eye view of what goes on in data wrangling; the real story runs behind the scenes. But don’t worry, it’s a good start.

After this comes EDA.

3. EDA (Exploratory Data Analysis)

This step covers much of what a Data Analyst’s role is about. It includes:

  • Analyzing all the features in the dataset, both for their cumulative effect and as individual contributors.
  • Carrying out operations like correlation plots, visualizations, and summaries of the data distribution to learn more and more about your data.

Below is a plot showing how the different features are related to each other, in the form of a correlation plot.

Correlation plot of my Dataset
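
If you want to reproduce a plot like this one, a minimal sketch with pandas and seaborn could look like the following, assuming the DataFrame is named df:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Correlation matrix over the numeric columns, drawn as an annotated heatmap
    corr = df.corr(numeric_only=True)
    sns.heatmap(corr, annot=True, cmap="coolwarm")
    plt.show()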

From the plot shown above, I made several conclusions about my data.

  • The rightmost three features (excluding Calories, since it is the target variable), i.e. Duration, Heart_Rate, and Body_Temp, have the highest correlation with Calories. They are the deciding factors in the prediction of burnt calories.

For further details, I went for a pair plot as shown below.

Pair plot between calories and duration, body_temp, heart_rate

From this pair plot, it is clear how the data points are distributed in the sample space, as well as how tightly they are correlated with each other.
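
A similar sketch for the pair plot, with the column names assumed from the dataset table above:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pair plot restricted to the target and its three strongest predictors
    sns.pairplot(df[["Calories", "Duration", "Heart_Rate", "Body_Temp"]])
    plt.show()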

You can also go for various other graphical techniques, like box plots, histograms, and scatter plots, to dig deeper into your data.

Next comes the choice of model.

4. Model Selection and Building

This step is based on various factors:

  • Your problem statement: whether you are solving a regression problem or a classification one. In my case, it was a regression problem.
  • The complexity of the dataset and the accuracy required: in my case the dataset was pretty simple and self-explanatory, so linear regression worked pretty well.

I used linear regression because my problem was concerned with predicting a quantitative target variable (calories).

Python’s scikit-learn gives you tremendous power for applying models like linear regression, decision trees, naive Bayes classifiers, etc. It helped me out with my linear regression.

First of all, I performed a simple linear regression between duration and calories, as shown below:

It shows a line of best fit (red line) for all data points. The above plot is a linear regression with only one independent feature, i.e. duration, and one dependent (target) feature, i.e. calories.
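
Here is a minimal sketch of that single-feature regression with scikit-learn, assuming the df from earlier (not the exact original code):

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    X = df[["Duration"]]   # one independent feature
    y = df["Calories"]     # the dependent (target) feature

    lin = LinearRegression().fit(X, y)

    # Scatter the data and draw the fitted line in red
    plt.scatter(X, y, s=5)
    plt.plot(X, lin.predict(X), color="red")
    plt.xlabel("Duration")
    plt.ylabel("Calories")
    plt.show()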

After this, I tried a quadratic (degree = 2) regression on the same data, as shown below:

Here we can see two lines: blue for quadratic regression and red for linear regression. The blue line fits the dataset better than the red line.
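
The quadratic fit can be sketched with scikit-learn’s PolynomialFeatures; again, this is an illustration under the same assumptions, not the exact original code:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    X = df[["Duration"]].to_numpy()
    y = df["Calories"].to_numpy()

    # Expand Duration into [1, x, x^2] and fit an ordinary linear model on top
    poly = PolynomialFeatures(degree=2)
    quad = LinearRegression().fit(poly.fit_transform(X), y)

    # Evaluate the curve on a sorted grid so it draws smoothly
    grid = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
    plt.scatter(X, y, s=5)
    plt.plot(grid, quad.predict(poly.transform(grid)), color="blue")
    plt.show()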

After this, I ran regressions with (2 independent, 1 dependent), (3 independent, 1 dependent), (4 independent, 1 dependent) features, and so on. Plots are not available because such high-dimensional fits are hard to visualize.

The final model was implemented using all 7 independent features (shown above in the dataset table) and 1 dependent feature.
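
A sketch of that final multiple regression, with the usual train/test split. The feature names are assumed from the dataset table, and I assume Gender is stored as the strings “male”/“female”:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Gender must be numeric before fitting; a simple 0/1 mapping will do
    df["Gender"] = df["Gender"].map({"male": 0, "female": 1})

    features = ["Gender", "Age", "Height", "Weight", "Duration", "Heart_Rate", "Body_Temp"]
    X = df[features]
    y = df["Calories"]

    # Hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on test data:", model.score(X_test, y_test))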

All this work was going on in Google Colab.

Now the model was done with training as well as testing (refer to the image below). The next step was to save the model for future use, so that we don’t need to retrain it every time we run our program.

For saving my model, I used a wonderful Python module named pickle.

Training and saving model
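
Saving and reloading with pickle takes only a few lines; the file name is a placeholder, and loaded_model_l matches the variable referenced below:

    import pickle

    # Serialize the trained model to disk ...
    with open("calories_model.pkl", "wb") as f:
        pickle.dump(model, f)

    # ... and load it back later without retraining
    with open("calories_model.pkl", "rb") as f:
        loaded_model_l = pickle.load(f)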

I also evaluated the model on a custom test instance, as shown below: ‘loaded_model_l’ is the linear model; ‘ar’ is an array containing the input features; ‘arr’ is the transformed array; the ‘.predict()’ function is the real hero behind the prediction; and [[238.42637952]] is the predicted calorie value.

Prediction by model
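
Reconstructing that step as a sketch: the feature values here are made up, and since the post doesn’t show how ‘arr’ was transformed, I assume a plain reshape (the actual transform may have been different, e.g. a polynomial expansion):

    import numpy as np

    # One made-up session, in training order: gender, age, height, weight, duration, heart rate, body temp
    ar = np.array([0, 25, 175.0, 70.0, 20.0, 100.0, 40.0])
    arr = ar.reshape(1, -1)  # sklearn expects shape (n_samples, n_features)

    # loaded_model_l comes from the pickle.load() step above
    print(loaded_model_l.predict(arr))  # prints the predicted calorie value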

After completing the evaluation of the model, it’s time to focus on the frontend and deployment part.

5. Frontend

Python provides a wonderful module specially designed for data apps: “Streamlit”.

It’s easy to use; just follow its official documentation page.

I used it myself for the frontend part as well as for deployment.
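
As a rough sketch of what such a Streamlit frontend can look like, with widget labels, default values, and the model file name all being my own placeholders rather than the exact app:

    import pickle
    import numpy as np
    import streamlit as st

    st.title("Calories Burnt Predictor")

    # Load the pickled regression model once per run
    with open("calories_model.pkl", "rb") as f:
        model = pickle.load(f)

    # Collect the seven independent features from the user
    gender = st.selectbox("Gender", ["Male", "Female"])
    age = st.number_input("Age", min_value=1, max_value=100, value=25)
    height = st.number_input("Height (cm)", min_value=0.0, value=170.0)
    weight = st.number_input("Weight (kg)", min_value=0.0, value=70.0)
    duration = st.number_input("Duration (min)", min_value=0.0, value=20.0)
    heart_rate = st.number_input("Heart rate (bpm)", min_value=0.0, value=100.0)
    body_temp = st.number_input("Body temperature (°C)", min_value=0.0, value=39.0)

    if st.button("Predict"):
        features = np.array([[0 if gender == "Male" else 1, age, height, weight,
                              duration, heart_rate, body_temp]])
        st.write("Predicted calories burnt:", float(model.predict(features)[0]))

You can run this locally with “streamlit run app.py”, where app.py is the file holding the code above.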

Left (Frontend), Right (Backend)

For full code, here👇 is a link to my GitHub repository:

For the deployment part, you just need to follow one more step. While your app is running on localhost, click the three horizontal bars in the upper left corner, as shown in the image below, and select “Deploy this app”.

Deploy this app

After clicking “Deploy this app”, you will be directed to another page, as shown below. Here you need to sign up for a new account. That account will be linked with your GitHub account, which carries all the files related to your project.

You can think of it like this: Streamlit works as a frontend to the program files and code resting in one of the GitHub repositories linked to it.

Note: don’t forget to add all the necessary files of your project, along with a requirements.txt file, to your GitHub repo.
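
For reference, a plausible requirements.txt for a project like this (your exact dependencies and versions may differ):

    streamlit
    scikit-learn
    pandas
    numpy
    matplotlib
    seaborn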

After creating an account, click “Request an invite”. This will add you to the request queue.

Request for streamlit service

As Streamlit is still growing, the hosting service usually takes 2–3 days to set up your account. Once your account is active, you can host at most 3 apps on a single GitHub account.

After 2–3 days, once you have received the account-activation mail, you can go straight in and deploy a new app, as shown in the image below.

New app using existing repo or a new one
Final one-click deployment

After clicking “Deploy”, it hardly takes 3–4 minutes to deploy your app. The live link to your app will then be visible on the share.streamlit.io main page.

Now it's your turn to try things out on your own.

For more such insightful and exciting projects, follow me here👇

Thank you for reading.


Anmol Sharma

Machine learning enthusiast | Data science aficionado | Web Designer | Always curious to know something new and innovative