top of page

Data Analysis Toolkit

Data Exploration and Cleaning

Data Analysis Toolkit
This Python program is designed to be an all-in-one solution for various stages of data analysis, including data exploration, cleaning, manipulation, wrangling, and even initial analysis with visualization. This toolkit streamlines the process of transforming raw data into valuable insights and can serve as a powerful asset for data enthusiasts, analysts, and researchers. Features 1. Data Exploration: This toolkit offers an interactive data exploration module that allows you to quickly get insights into your dataset. With just a few clicks, you can access information about data types, basic statistics, top and bottom rows, column names, null values, duplicates, and more. 2. Data Cleaning: The Data Cleaning module helps you tidy up your dataset for more accurate analysis. You can easily remove duplicate rows, handle missing values through various methods like imputation or removal, and correct inconsistent data formats (such as date and currency formats). 3. Data Manipulation and Standardization: You can standardize text data to lowercase, uppercase, or title case, making it easier to work with. Additionally, you can convert data to proper units or formats, handle special characters or symbols, and even address data entry errors. 4. Data Wrangling and Analysis: While this toolkit doesn't replace advanced data analysis tools, it provides a foundation for basic analysis. The visualization module allows you to create meaningful graphs and charts, providing initial insights into your dataset's trends and patterns. How to Use 1. Explore the Dataset: Begin by using the exploration module to understand your dataset's structure and characteristics. Identify potential issues like missing values or duplicates. 2. Clean the Data: Move on to the cleaning module, where you can remove duplicates, handle missing values, and correct data formats. This step ensures your data is accurate and consistent. 3. Data Manipulation: Standardize text data, handle data entry errors, and convert data to appropriate formats. 4. Data Visualization: Although not as comprehensive as dedicated visualization tools, this toolkit provides basic data visualization capabilities to give you a visual sense of your data's distribution and relationships. 5. Save Cleaned Data: After cleaning and preparing your dataset, the toolkit allows you to save the cleaned version as a CSV file for further analysis. Benefits Efficiency: With a user-friendly interface and predefined functions, this toolkit streamlines the data analysis process, saving you time and effort. Customizability: While the toolkit provides predefined cleaning and manipulation options, you can extend it to include additional functionalities tailored to your specific dataset's needs. Learning Tool: For beginners in data analysis, this toolkit serves as an educational resource, showcasing various techniques and best practices in data manipulation and cleaning. This Data Analysis Toolkit has been created with the aim of simplifying the early stages of data analysis, providing a foundation for more advanced techniques. Feel free to explore, clean, and manipulate your datasets with ease and efficiency. Whether you're a data enthusiast, an analyst, or a researcher, this toolkit is designed to support you on your data journey.

import pandas as pd
import numpy as np
import time

print("Welcome to the world of data!")
class DataExplorer:
    def __init__(self, file_path):
        self.df = pd.read_csv(file_path)
    
    def explore_dataset(self):
        while True:
            print('''Choose an option below:
                1. Get dataset information - Press 1
                2. Describe the dataset - Press 2
                3. Display the top 5 rows and bottom 5 rows - Press 3
                4. List columns - Press 4
                5. Check for null values - Press 5
                6. Find duplicates in the dataset - Press 6
                7. Find duplicates by columns - Press 7
                8. Determine data range by column - Press 8
                9. Return to the main menu - Press 9''')

            try:
                explore_input = int(input("Enter your choice\n"))
            except ValueError:
                print("Please enter an integer value only")
                continue

            if explore_input == 1:
                self.get_dataset_info()
            elif explore_input == 2:
                self.describe_dataset()
            elif explore_input == 3:
                self.get_dataset_head_tail()
            elif explore_input == 4:
                self.get_columns()
            elif explore_input == 5:
                self.get_null_values()
            elif explore_input == 6:
                self.find_duplicates()
            elif explore_input == 7:
                self.find_duplicates_by_columns()
            elif explore_input == 8:
                self.data_range()
            elif explore_input == 9:
                break  # Return to the main menu
            else:
                print("Invalid option, please enter a number from the menu")

    def get_dataset_info(self):
        print(self.df.info())

    def describe_dataset(self):
        print(self.df.describe())

    def get_dataset_head_tail(self):
        print("Here are the top 5 rows:")
        print(self.df.head(5))
        print("Here are the bottom 5 rows:")
        print(self.df.tail(5))

    def get_columns(self):
        print(f"Total Columns: {self.df.columns}")

    def get_null_values(self):
        null_values = self.df.isnull().sum()
        print(f"Total Null Values:\n{null_values}")

    def find_duplicates(self):
        duplicates_count = self.df.duplicated().sum()
        print(f"Total Duplicates: {duplicates_count}")

    def find_duplicates_by_columns(self):
        duplicates_count = {}
        for column in self.df.columns:
            duplicates_count[column] = self.df[column].duplicated().sum()

        # Print the duplicate counts for each column
        for column, count in duplicates_count.items():
            print(f"Column {column}: {count} duplicates")

    def data_range(self):
        print(f"Here is the list of all columns: {self.df.columns.tolist()}")
        column_input = input("Please enter the column name to see Max and Min values:\n")
        column_max = self.df[column_input].max()
        column_min = self.df[column_input].min()
        print(f"The maximum value for column {column_input} is {column_max}")
        print(f"The minimum value for column {column_input} is {column_min}")

class DataCleaning:
    def __init__(self, file_path):
        self.df = pd.read_csv(file_path)
    
    def datasetcleaning(self):
        while True:
            print('''
                  Choose an option below to clean the dataset:
                  1. Remove Duplicate Rows - Press 1
                  2. Handle Missing Values (Impute or Remove) - Press 2
                  3. Correct Inconsistent Data Formats (e.g., Date, Currency, etc.) - Press 3
                  4. Standardize Text - Press 4
                  5. Delete Rows - Press 5
                  6. Drop a Column - Press 6
                  7. Handle Special Characters or Symbols - Press 7
                  8. Save the Cleaned Dataset - Press 8
                  9. Return to the previous menu - Press 9
                  ''')
            try:
                cleaning_input = int(input("Enter your choice\n"))
            except ValueError:
                print("Please enter an integer value only")
                continue

            if cleaning_input == 1:
                self.removeduplicates()
            elif cleaning_input == 2:
                self.imputeorremove()
            elif cleaning_input == 3:
                self.correctformat()
            elif cleaning_input == 4:
                self.standardizetext()
            elif cleaning_input == 5:
                self.deleterows()
            elif cleaning_input == 6:
                self.dropcolumn()
            elif cleaning_input == 7:
                self.handlespecialcharacter()
            elif cleaning_input == 8:
                self.savecleandataset()
            elif cleaning_input == 9:
                break
            else:
                print("Invalid Input")
                continue
    
    def removeduplicates(self):
        print(f"Checking for duplicates....")
        print(f"Total Duplicates: {self.df.duplicated().sum()}")
        self.df = self.df.drop_duplicates()
        time.sleep(3)
        print(f"Removing duplicates...")
        time.sleep(4)
        print(f"Duplicates removed successfully. Total Duplicates: {self.df.duplicated().sum()}")
    
    def imputeorremove(self):
        while True:
            print('''
                  To Impute missing values - Press 1
                  To Remove rows with null values - Press 2
                  To Fill null values with Mean - Press 3
                  To Fill null values with Median - Press 4
                  To Fill null values with Mode - Press 5
                  To return - Press 6
                  ''')
            try:
                missing_handle_initial_input = int(input("Enter your choice\n"))
            except ValueError:
                print("Enter an integer only")
                continue

            if missing_handle_initial_input == 1:
                impute_value_input = input("Enter the value you'd like to impute")
                self.df = self.df.fillna(impute_value_input)
                print(f"Dataset has been updated with the requested value: {impute_value_input}")
                print(self.df)
            
            elif missing_handle_initial_input == 2:
                self.df = self.df.dropna()
                print(f"Dataset has been updated:\n {self.df}")

            elif missing_handle_initial_input == 3:
                self.df = self.df.replace('NaN', np.nan)
                self.df = self.df.fillna(self.df.mean())
                print(f"Dataset has been updated:\n {self.df}")
                
            elif missing_handle_initial_input == 4:
                self.df = self.df.replace('NaN', np.nan)
                self.df = self.df.fillna(self.df.median())
                print(f"Dataset has been updated:\n {self.df}")

            elif missing_handle_initial_input == 5:
                self.df = self.df.replace('NaN', np.nan)
                self.df = self.df.fillna(self.df.mode().iloc[0])
                print(f"Dataset has been updated:\n {self.df}")

            elif missing_handle_initial_input == 6:
                break

            else:
                print("Invalid Input")
                continue

    def correctformat(self):
        while True:
            print('''
                  1. To correct date and time format - Press 1
                  2. To correct currency format - Press 2
                  3. To return - Press 3
                  ''')
            try:
                input_format_option = int(input("Enter your choice\n"))
            except ValueError:
                print("Enter an integer only")
                continue

            if input_format_option == 1:
                print(f"Please enter the column with date and time data:\n{self.df.columns}")
                date_column_input_to_format = input("Enter the column name")

                if date_column_input_to_format in self.df.columns:
                    self.df[date_column_input_to_format] = pd.to_datetime(self.df[date_column_input_to_format], errors='coerce', format='%d-%m-%Y %H:%M')
                    # If you only need the date without time, use the following line:
                    # self.df[date_column_input_to_format] = pd.to_datetime(self.df[date_column_input_to_format])
                    # self.df[date_column_input_to_format] = self.df[date_column_input_to_format].dt.strftime('%d-%m-%Y')
                    print(f"The date column {date_column_input_to_format} has been updated to the desired format")
                
                else: 
                    print("Column not found")

            if input_format_option == 2:
                print(f"Please enter the column with currency data:\n{self.df.columns}")
                currency_column_input_to_format = input("Enter the column name")

                if currency_column_input_to_format in self.df.columns:
                    self.df[currency_column_input_to_format] = self.df[currency_column_input_to_format].str.replace('$', '').str.replace(',', '').astype(float)
                    print(f"The currency column {currency_column_input_to_format} has been updated to the desired format")
                
                else: 
                    print("Column not found")

            elif input_format_option == 3:
                break

            else:
                print("Invalid Input")
                continue

    def standardizetext(self):
        while True:
            print('''
                  1. Convert all text to lowercase - Press 1
                  2. Convert all text to uppercase - Press 2
                  3. Convert all text to proper case - Press 3
                  4. Return - Press 4
                  ''')
            try:
                standardize_text_input = int(input("Enter your choice\n"))
            except ValueError:
                print("Enter an integer only")
                continue

            if standardize_text_input == 1:
                print(f"Please enter the column with text data:\n{self.df.columns}")
                column_input_to_standardize = input("Enter the column name")

                if column_input_to_standardize in self.df.columns:
                    self.df[column_input_to_standardize] = self.df[column_input_to_standardize].str.lower().str.strip()
                    print(f"The selected {column_input_to_standardize} column has been updated to all lowercase!")

            elif standardize_text_input == 2:
                print(f"Please enter the column with text data:\n{self.df.columns}")
                column_input_to_standardize = input("Enter the column name")

                if column_input_to_standardize in self.df.columns:
                    self.df[column_input_to_standardize] = self.df[column_input_to_standardize].str.upper().str.strip()
                    print(f"The selected {column_input_to_standardize} column has been updated to all uppercase!")

            elif standardize_text_input == 3:
                print(f"Please enter the column with text data:\n{self.df.columns}")
                column_input_to_standardize = input("Enter the column name")

                if column_input_to_standardize in self.df.columns:
                    self.df[column_input_to_standardize] = self.df[column_input_to_standardize].str.title().str.strip()
                    print(f"The selected {column_input_to_standardize} column has been updated to proper (title) case!")

            elif standardize_text_input == 4:
                break

            else:
                print("Invalid Input")
                continue

    def deleterows(self):
        while True:
            print('''Please check the columns based on which you want to delete rows - currently supporting numeric comparison only
                1. View columns - Press 1
                2. Return to the previous menu - Press 2
                ''') 
            try:
                deleterows1stinput = int(input("Enter your choice\n"))    
            except ValueError:
                print("Enter an integer only")
                continue  # Continue the loop to get a valid input

            if deleterows1stinput == 1:
                print(self.df.columns)
                deleterows2ndinput = input("Enter the column name\n")
                deletecondition = input('''
                                    Enter the condition
                                    1. Enter 1 for greater >
                                    2. Enter 2 for lower <
                                    3. Enter 3 for equal ==
                                    4. Enter 4 for >=
                                    5. Enter 5 for <=
                                    6. Enter 6 to return to the previous menu
                                    ''')
                basevaluetodeleterows = int(input("Enter the value\n"))
                if deletecondition == '1':
                    self.df = self.df[self.df[deleterows2ndinput] > basevaluetodeleterows]
                    print("The dataset has been updated. Please save it!!")
                elif deletecondition == '2':
                    self.df = self.df[self.df[deleterows2ndinput] < basevaluetodeleterows]
                    print("The dataset has been updated. Please save it!!")
                elif deletecondition == '3':
                    self.df = self.df[self.df[deleterows2ndinput] == basevaluetodeleterows]
                    print("The dataset has been updated. Please save it!!")
                elif deletecondition == '4':
                    self.df = self.df[self.df[deleterows2ndinput] >= basevaluetodeleterows]
                    print("The dataset has been updated. Please save it!!")
                elif deletecondition == '5':
                    self.df = self.df[self.df[deleterows2ndinput] <= basevaluetodeleterows]
                    print("The dataset has been updated. Please save it!!")
                elif deletecondition == '6':
                    break
                else:
                    print("Invalid Selection")

            elif deleterows1stinput == 2:
                break
            else:
                print("Invalid Selection")

    def dropcolumn(self):
         while True:
            print('''Please check the columns based on which you want to delete rows - currently supporting numeric comparison only
                1. View columns - Press 1
                2. Return to the previous menu - Press 2
                ''') 
            try:
                columndropinput = int(input("Enter your choice\n"))    
            except ValueError:
                print("Enter an integer only")
                continue  # Continue the loop to get a valid input
            
            if columndropinput == 1:
                print(self.df.columns)
                columndropinput2 = input("Enter the column name\n")
                self.df = self.df.drop(columndropinput2, axis=1)
                print("The dataset has been updated. Please save it!!")

            elif columndropinput == 2:
                break
            else:
                print("Invalid Input")

    def handlespecialcharacter(self):
        while True:
            print('''Please check the columns based on which you want to modify values - currently supporting string replacement only
                1. View columns - Press 1
                2. Return to the previous menu - Press 2
                ''') 
            try:
                specialcharacter1stinput = int(input("Enter your choice\n"))    
            except ValueError:
                print("Enter an integer only")
                continue  # Continue the loop to get a valid input

            if specialcharacter1stinput == 1:
                print(self.df.columns)
                column_name = input("Please enter the column name: ")
                if column_name not in self.df.columns:
                    print("Column not found in DataFrame.")
                    continue

                while True:
                    print('''
                        1. Replace character - Press 1
                        2. Remove character - Press 2
                        3. To Go Back - Press 3
                    ''') 
                    try:
                        choicesselect = int(input("Enter your choice\n"))
                    except ValueError:
                        print("Enter an integer only")
                        continue  # Continue the loop to get a valid input

                    if choicesselect == 1:
                        special_character = input("Please enter the character to replace: ")
                        replacement_value = input("Please enter the value with which you want to replace the character: ")
                        
                        self.df[column_name] = self.df[column_name].str.replace(special_character, replacement_value)
                        print("Values replaced successfully.")
                        
                    elif choicesselect == 2:
                        special_character = input("Please enter the character to remove: ")
                        
                        self.df[column_name] = self.df[column_name].str.replace(special_character, '')
                        print("Characters removed successfully.")
                        
                    elif choicesselect == 3:
                        # Go back to the previous menu
                        break
                    else:
                        print("Invalid choice. Please enter 1, 2, or 3.")
            elif specialcharacter1stinput == 2:
                break
            else:
                print("Invalid choice. Please enter 1 or 2.")


    def savecleandataset(self):
        self.df.to_csv(r"YOUR UPDATED CODE",index=False)
        print("The cleaned dataset has been saved successfully!!")

def main():
    original_data_set = r"YOUR CODE"
    file_path = original_data_set
    explorer = DataExplorer(file_path)
    cleaning = DataCleaning(file_path)
    
    while True:
        print('''Choose an option below:
            1. Explore the dataset
            2. Clean the dataset
            3. Exit''')
        
        try:
            first_input = int(input("Enter your choice\n"))
        except ValueError:
            print("Please enter an integer value only")
            continue
        
        if first_input == 1:
            explorer.explore_dataset()
        elif first_input == 2:
            cleaning.datasetcleaning()
        elif first_input == 3:
            print("Exiting the program")
            break
        else:
            print("Invalid option")

if __name__ == "__main__":
    main()

bottom of page