Exploratory Data Analysis on Iris Flower Dataset

Introduction

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

Task to be done:

I performed an exploratory data analysis of the well-known Iris dataset to extract useful insights from the data. The dataset contains the following features:

  • Sepal Width
  • Sepal Length
  • Petal Width
  • Petal Length

Steps to perform EDA

  • Gather the dataset and do preliminary processing
  • Check the total number of entries and the column types
  • Check for null values
  • Check for duplicate entries
  • Plot the distribution of numeric data (univariate and pairwise joint distributions)
  • Plot the count distribution of categorical data
  • Analyse the frequency of each variety in the dataset
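
The non-plotting checks above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's actual code: the helper name basic_checks, the column names, and the five-row stand-in frame are assumptions made for illustration. In the notebook the frame would presumably be loaded from Iris1.csv instead.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, target: str) -> dict:
    """Collect the preliminary EDA checks into one dictionary (hypothetical helper)."""
    return {
        "n_rows": len(df),                                     # total number of entries
        "dtypes": df.dtypes.astype(str).to_dict(),             # column types
        "null_counts": df.isnull().sum().to_dict(),            # null values per column
        "n_duplicates": int(df.duplicated().sum()),            # fully duplicated rows
        "class_counts": df[target].value_counts().to_dict(),   # frequency of each variety
    }


# Tiny stand-in frame with assumed column names; the real data has more rows.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3, 5.1],
        "sepal_width": [3.5, 3.0, 3.2, 3.3, 3.5],
        "petal_length": [1.4, 1.4, 4.7, 6.0, 1.4],
        "petal_width": [0.2, 0.2, 1.4, 2.5, 0.2],
        "species": ["setosa", "setosa", "versicolor", "virginica", "setosa"],
    }
)

report = basic_checks(sample, target="species")
for name, value in report.items():
    print(name, value)
```

For the plotting steps, seaborn's pairplot (with the hue argument set to the target column) and countplot are one reasonable way to get the pairwise and count distributions.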

Dataset

Source dataset: Iris (taken from Kaggle)

The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

Libraries used

  • Pandas
    a fast, powerful, flexible and easy-to-use open-source tool for data analysis and manipulation
  • Numpy
    for numerical arrays and mathematical and statistical calculations
  • Matplotlib
    a general-purpose plotting library, used here for bar charts, pie charts, line plots and scatter plots
  • Seaborn
    a statistical visualization library built on Matplotlib that provides a variety of plot types

Code Structure

The notebook EDA_Prince.ipynb is in the Module Presentation folder.

The dataset Iris1.csv is in the same Module Presentation folder.

Output

Visualizations of different varieties of iris

Sepal length and width

Petal length and width
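
Scatter plots like the two named above could be produced roughly as follows. This is only a sketch under assumptions: the column names and the four-row stand-in frame are illustrative, and the notebook's actual plots may have been drawn differently (for example with seaborn).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in frame with assumed column names; the notebook would use the full dataset.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.3],
        "sepal_width": [3.5, 3.0, 3.2, 3.3],
        "petal_length": [1.4, 1.4, 4.7, 6.0],
        "petal_width": [0.2, 0.2, 1.4, 2.5],
        "species": ["setosa", "setosa", "versicolor", "virginica"],
    }
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Draw one scatter series per species so each gets its own colour and legend entry.
for species, group in sample.groupby("species"):
    axes[0].scatter(group["sepal_length"], group["sepal_width"], label=species)
    axes[1].scatter(group["petal_length"], group["petal_width"], label=species)

axes[0].set(xlabel="sepal length", ylabel="sepal width", title="Sepal length and width")
axes[1].set(xlabel="petal length", ylabel="petal width", title="Petal length and width")
axes[0].legend()
fig.tight_layout()  # in a notebook the figure then renders inline
```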

Summary

  • The Iris dataset used here is balanced, i.e. an equal number of records is present for each of the three species.
  • There were four numerical columns and one categorical column, which is the target column. One unnamed column was unwanted, so it was dropped as a feature-selection step.
  • I visualised the correlation between petal width and petal length using seaborn and matplotlib.
  • A simple rule separates the species: if 0 ≤ petal_length ≤ 2 and 0 ≤ petal_width ≤ 0.7 then setosa; else if 2 ≤ petal_length ≤ 5.2 and 1 ≤ petal_width ≤ 1.7 then versicolor; otherwise virginica.
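
The threshold rule in the last point can be written as a small function. This is a sketch of that rule only: the function name classify_by_petal is hypothetical, and the second condition is assumed to bound petal width (1 to 1.7), since a second bound on petal length would contradict the first.

```python
def classify_by_petal(petal_length: float, petal_width: float) -> str:
    """Apply the petal-based threshold rule from the summary (hypothetical helper)."""
    # Small petals in both dimensions: setosa.
    if 0 <= petal_length <= 2 and 0 <= petal_width <= 0.7:
        return "setosa"
    # Mid-sized petals: versicolor (the width bound is an assumption, see lead-in).
    if 2 <= petal_length <= 5.2 and 1 <= petal_width <= 1.7:
        return "versicolor"
    # Everything else falls through to virginica.
    return "virginica"


print(classify_by_petal(1.4, 0.2))
print(classify_by_petal(4.7, 1.4))
print(classify_by_petal(6.0, 2.5))
```

Because the rule falls through to virginica, any measurement outside both ranges is labelled virginica, including values between the two petal-width bands.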

Conclusion

In a nutshell, EDA is a powerful tool that can highlight problems to be addressed, lead to insights, and suggest patterns through visualizations. It allows strong comparisons to be drawn and gives an estimate of how confident the analyst can be in their work, provided they have prior knowledge of the data and its patterns.

Next Steps

  • Apply a model and algorithm
  • Decision making
  • Compute accuracy and precision scores