Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
Task to be done:
I performed exploratory data analysis on the well-known Iris dataset to gain useful insights from the data. The features present in the dataset are:
- Sepal Width
- Sepal Length
- Petal Width
- Petal Length
Steps to perform EDA
- Gather the dataset and do preliminary processing of the data
- Check the total number of entries and the column types
- Check for null values
- Check for duplicate entries
- Plot the distribution of numeric data (univariate and pairwise joint distributions)
- Plot the count distribution of categorical data
- Analyse the frequencies of the different varieties in the given dataset
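The tabular checks in the list above can be sketched with pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the helper name `basic_checks` and the column names (`sepal_length`, `sepal_width`, `petal_length`, `petal_width`, `variety`) are assumptions, and the toy rows are made up purely for illustration.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, label_col: str = "variety") -> dict:
    """Collect the preliminary EDA checks into one dictionary (hypothetical helper)."""
    return {
        "n_rows": len(df),                                   # total number of entries
        "dtypes": df.dtypes.astype(str).to_dict(),           # column types
        "null_counts": df.isnull().sum().to_dict(),          # nulls per column
        "n_duplicates": int(df.duplicated().sum()),          # fully repeated rows
        "variety_counts": df[label_col].value_counts().to_dict(),  # class frequencies
    }


# Toy rows made up for illustration only; in the notebook the frame would
# presumably come from something like pd.read_csv("Iris1.csv").
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 5.1],
    "sepal_width": [3.5, 3.2, 3.3, 3.5],
    "petal_length": [1.4, 4.7, 6.0, 1.4],
    "petal_width": [0.2, 1.4, 2.5, 0.2],
    "variety": ["Setosa", "Versicolor", "Virginica", "Setosa"],
})

report = basic_checks(toy)
print(report)
```

On the real data, `df.describe()` and `df.info()` give a similar overview in one call each; the dictionary form is used here only so that each step in the list maps to one named entry.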
Source dataset: Iris (taken from Kaggle)
The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
Library used
- Pandas: a high-level, fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool
- NumPy: used for numerical arrays and for mathematical and statistical calculations
- Matplotlib: visualizations generally consist of bar charts, pie charts, line plots and scatter plots
- Seaborn: provides a variety of visualization patterns
A notebook named EDA_Prince.ipynb is created in the Module Presentation folder.
The Iris1.csv dataset is in the same Module Presentation folder.
Visualizations of the different varieties of iris:
- Sepal length and width
- Petal length and width
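Plots like the two named above could be produced along these lines. This is only a sketch under assumptions: the helper `plot_by_variety`, the column names, and the toy rows are all hypothetical, and plain Matplotlib is used here rather than whatever calls the notebook actually makes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display;
                       # drop this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_by_variety(df, pairs, label_col="variety"):
    """Draw one scatter panel per (x, y) column pair, one colour per variety."""
    fig, axes = plt.subplots(1, len(pairs), figsize=(5 * len(pairs), 4), squeeze=False)
    for ax, (x, y) in zip(axes[0], pairs):
        for name, group in df.groupby(label_col):
            ax.scatter(group[x], group[y], label=name)
        ax.set_xlabel(x)
        ax.set_ylabel(y)
        ax.legend(title=label_col)
    fig.tight_layout()
    return fig


# Toy rows made up for illustration only (assumed column names).
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "variety": ["Setosa", "Versicolor", "Virginica"],
})

fig = plot_by_variety(
    toy,
    [("sepal_length", "sepal_width"), ("petal_length", "petal_width")],
)
```

With seaborn, `sns.pairplot(df, hue="variety")` would give all pairwise panels at once; the explicit loop above just keeps the two panels from the slides side by side.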
- The Iris dataset used has equal class frequencies, i.e. the same number of records is present for each of the three species.
- There were four numerical columns and one categorical column, which is our target column. One unnamed column was unwanted, so we attempted feature selection by dropping it.
- I have visualised the correlation between petal width and petal length using seaborn and matplotlib.
- If 0 ≤ petal_length ≤ 2 and 0 ≤ petal_width ≤ 0.7, then setosa; else if 2 ≤ petal_length ≤ 5.2 and 1 ≤ petal_width ≤ 1.7, then versicolor; else virginica.
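The threshold rule in the last finding can be written as a small function. This is a sketch, assuming the second condition is meant as a bound on petal width (1 to 1.7) rather than a second bound on petal length; the function name `classify` and the sample inputs are illustrative, not taken from the notebook.

```python
def classify(petal_length: float, petal_width: float) -> str:
    """Apply the hand-derived petal thresholds in order; first match wins."""
    if 0 <= petal_length <= 2 and 0 <= petal_width <= 0.7:
        return "setosa"
    # Assumption: the second bound applies to petal width, not petal length.
    if 2 <= petal_length <= 5.2 and 1 <= petal_width <= 1.7:
        return "versicolor"
    # Everything that matches neither box falls through to virginica.
    return "virginica"


# Illustrative inputs only (petal_length, petal_width in cm).
samples = [(1.4, 0.2), (4.7, 1.4), (6.0, 2.5)]
labels = [classify(length, width) for length, width in samples]
print(labels)
```

Because the rule is a pair of fixed boxes with a catch-all, any flower outside both boxes is labelled virginica, including ones that sit between the setosa and versicolor ranges.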
In a nutshell, EDA is a powerful tool that can highlight problems to be addressed, lead to insights, and suggest patterns through visualizations. It allows strong comparisons to be drawn and gives an estimate of how confident the analyst can be in their work, provided they have prior knowledge of the data and its patterns.
- Applying model and algorithm
- Decision making
- Finding accuracy and precision score.