top of page
Data Analysis Toolkit
Data Exploration and Cleaning
This Python program is designed to be an all-in-one solution for various stages of data analysis, including data exploration, cleaning, manipulation, wrangling, and even initial analysis with visualization. This toolkit streamlines the process of transforming raw data into valuable insights and can serve as a powerful asset for data enthusiasts, analysts, and researchers.
Features
1. Data Exploration: This toolkit offers an interactive data exploration module that allows you to quickly get insights into your dataset. With just a few clicks, you can access information about data types, basic statistics, top and bottom rows, column names, null values, duplicates, and more.
2. Data Cleaning: The Data Cleaning module helps you tidy up your dataset for more accurate analysis. You can easily remove duplicate rows, handle missing values through various methods like imputation or removal, and correct inconsistent data formats (such as date and currency formats).
3. Data Manipulation and Standardization: You can standardize text data to lowercase, uppercase, or title case, making it easier to work with. Additionally, you can convert data to proper units or formats, handle special characters or symbols, and even address data entry errors.
4. Data Wrangling and Analysis: While this toolkit doesn't replace advanced data analysis tools, it provides a foundation for basic analysis. The visualization module allows you to create meaningful graphs and charts, providing initial insights into your dataset's trends and patterns.
How to Use
1. Explore the Dataset: Begin by using the exploration module to understand your dataset's structure and characteristics. Identify potential issues like missing values or duplicates.
2. Clean the Data: Move on to the cleaning module, where you can remove duplicates, handle missing values, and correct data formats. This step ensures your data is accurate and consistent.
3. Data Manipulation: Standardize text data, handle data entry errors, and convert data to appropriate formats.
4. Data Visualization: Although not as comprehensive as dedicated visualization tools, this toolkit provides basic data visualization capabilities to give you a visual sense of your data's distribution and relationships.
5. Save Cleaned Data: After cleaning and preparing your dataset, the toolkit allows you to save the cleaned version as a CSV file for further analysis.
Benefits
Efficiency: With a user-friendly interface and predefined functions, this toolkit streamlines the data analysis process, saving you time and effort.
Customizability: While the toolkit provides predefined cleaning and manipulation options, you can extend it to include additional functionalities tailored to your specific dataset's needs.
Learning Tool: For beginners in data analysis, this toolkit serves as an educational resource, showcasing various techniques and best practices in data manipulation and cleaning.
This Data Analysis Toolkit has been created with the aim of simplifying the early stages of data analysis, providing a foundation for more advanced techniques. Feel free to explore, clean, and manipulate your datasets with ease and efficiency. Whether you're a data enthusiast, an analyst, or a researcher, this toolkit is designed to support you on your data journey.
import pandas as pd
import numpy as np
import time
print("Welcome to the world of data!")
class DataExplorer:
def __init__(self, file_path):
self.df = pd.read_csv(file_path)
def explore_dataset(self):
while True:
print('''Choose an option below:
1. Get dataset information - Press 1
2. Describe the dataset - Press 2
3. Display the top 5 rows and bottom 5 rows - Press 3
4. List columns - Press 4
5. Check for null values - Press 5
6. Find duplicates in the dataset - Press 6
7. Find duplicates by columns - Press 7
8. Determine data range by column - Press 8
9. Return to the main menu - Press 9''')
try:
explore_input = int(input("Enter your choice\n"))
except ValueError:
print("Please enter an integer value only")
continue
if explore_input == 1:
self.get_dataset_info()
elif explore_input == 2:
self.describe_dataset()
elif explore_input == 3:
self.get_dataset_head_tail()
elif explore_input == 4:
self.get_columns()
elif explore_input == 5:
self.get_null_values()
elif explore_input == 6:
self.find_duplicates()
elif explore_input == 7:
self.find_duplicates_by_columns()
elif explore_input == 8:
self.data_range()
elif explore_input == 9:
break # Return to the main menu
else:
print("Invalid option, please enter a number from the menu")
def get_dataset_info(self):
print(self.df.info())
def describe_dataset(self):
print(self.df.describe())
def get_dataset_head_tail(self):
print("Here are the top 5 rows:")
print(self.df.head(5))
print("Here are the bottom 5 rows:")
print(self.df.tail(5))
def get_columns(self):
print(f"Total Columns: {self.df.columns}")
def get_null_values(self):
null_values = self.df.isnull().sum()
print(f"Total Null Values:\n{null_values}")
def find_duplicates(self):
duplicates_count = self.df.duplicated().sum()
print(f"Total Duplicates: {duplicates_count}")
def find_duplicates_by_columns(self):
duplicates_count = {}
for column in self.df.columns:
duplicates_count[column] = self.df[column].duplicated().sum()
# Print the duplicate counts for each column
for column, count in duplicates_count.items():
print(f"Column {column}: {count} duplicates")
def data_range(self):
print(f"Here is the list of all columns: {self.df.columns.tolist()}")
column_input = input("Please enter the column name to see Max and Min values:\n")
column_max = self.df[column_input].max()
column_min = self.df[column_input].min()
print(f"The maximum value for column {column_input} is {column_max}")
print(f"The minimum value for column {column_input} is {column_min}")
class DataCleaning:
def __init__(self, file_path):
self.df = pd.read_csv(file_path)
def datasetcleaning(self):
while True:
print('''
Choose an option below to clean the dataset:
1. Remove Duplicate Rows - Press 1
2. Handle Missing Values (Impute or Remove) - Press 2
3. Correct Inconsistent Data Formats (e.g., Date, Currency, etc.) - Press 3
4. Standardize Text - Press 4
5. Delete Rows - Press 5
6. Drop a Column - Press 6
7. Handle Special Characters or Symbols - Press 7
8. Save the Cleaned Dataset - Press 8
9. Return to the previous menu - Press 9
''')
try:
cleaning_input = int(input("Enter your choice\n"))
except ValueError:
print("Please enter an integer value only")
continue
if cleaning_input == 1:
self.removeduplicates()
elif cleaning_input == 2:
self.imputeorremove()
elif cleaning_input == 3:
self.correctformat()
elif cleaning_input == 4:
self.standardizetext()
elif cleaning_input == 5:
self.deleterows()
elif cleaning_input == 6:
self.dropcolumn()
elif cleaning_input == 7:
self.handlespecialcharacter()
elif cleaning_input == 8:
self.savecleandataset()
elif cleaning_input == 9:
break
else:
print("Invalid Input")
continue
def removeduplicates(self):
print(f"Checking for duplicates....")
print(f"Total Duplicates: {self.df.duplicated().sum()}")
self.df = self.df.drop_duplicates()
time.sleep(3)
print(f"Removing duplicates...")
time.sleep(4)
print(f"Duplicates removed successfully. Total Duplicates: {self.df.duplicated().sum()}")
def imputeorremove(self):
while True:
print('''
To Impute missing values - Press 1
To Remove rows with null values - Press 2
To Fill null values with Mean - Press 3
To Fill null values with Median - Press 4
To Fill null values with Mode - Press 5
To return - Press 6
''')
try:
missing_handle_initial_input = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue
if missing_handle_initial_input == 1:
impute_value_input = input("Enter the value you'd like to impute")
self.df = self.df.fillna(impute_value_input)
print(f"Dataset has been updated with the requested value: {impute_value_input}")
print(self.df)
elif missing_handle_initial_input == 2:
self.df = self.df.dropna()
print(f"Dataset has been updated:\n {self.df}")
elif missing_handle_initial_input == 3:
self.df = self.df.replace('NaN', np.nan)
self.df = self.df.fillna(self.df.mean())
print(f"Dataset has been updated:\n {self.df}")
elif missing_handle_initial_input == 4:
self.df = self.df.replace('NaN', np.nan)
self.df = self.df.fillna(self.df.median())
print(f"Dataset has been updated:\n {self.df}")
elif missing_handle_initial_input == 5:
self.df = self.df.replace('NaN', np.nan)
self.df = self.df.fillna(self.df.mode().iloc[0])
print(f"Dataset has been updated:\n {self.df}")
elif missing_handle_initial_input == 6:
break
else:
print("Invalid Input")
continue
def correctformat(self):
while True:
print('''
1. To correct date and time format - Press 1
2. To correct currency format - Press 2
3. To return - Press 3
''')
try:
input_format_option = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue
if input_format_option == 1:
print(f"Please enter the column with date and time data:\n{self.df.columns}")
date_column_input_to_format = input("Enter the column name")
if date_column_input_to_format in self.df.columns:
self.df[date_column_input_to_format] = pd.to_datetime(self.df[date_column_input_to_format], errors='coerce', format='%d-%m-%Y %H:%M')
# If you only need the date without time, use the following line:
# self.df[date_column_input_to_format] = pd.to_datetime(self.df[date_column_input_to_format])
# self.df[date_column_input_to_format] = self.df[date_column_input_to_format].dt.strftime('%d-%m-%Y')
print(f"The date column {date_column_input_to_format} has been updated to the desired format")
else:
print("Column not found")
if input_format_option == 2:
print(f"Please enter the column with currency data:\n{self.df.columns}")
currency_column_input_to_format = input("Enter the column name")
if currency_column_input_to_format in self.df.columns:
self.df[currency_column_input_to_format] = self.df[currency_column_input_to_format].str.replace('$', '').str.replace(',', '').astype(float)
print(f"The currency column {currency_column_input_to_format} has been updated to the desired format")
else:
print("Column not found")
elif input_format_option == 3:
break
else:
print("Invalid Input")
continue
def standardizetext(self):
while True:
print('''
1. Convert all text to lowercase - Press 1
2. Convert all text to uppercase - Press 2
3. Convert all text to proper case - Press 3
4. Return - Press 4
''')
try:
standardize_text_input = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue
if standardize_text_input == 1:
print(f"Please enter the column with text data:\n{self.df.columns}")
column_input_to_standardize = input("Enter the column name")
if column_input_to_standardize in self.df.columns:
self.df[column_input_to_standardize] = self.df[column_input_to_standardize].str.lower().str.strip()
print(f"The selected {column_input_to_standardize} column has been updated to all lowercase!")
elif standardize_text_input == 2:
print(f"Please enter the column with text data:\n{self.df.columns}")
column_input_to_standardize = input("Enter the column name")
if column_input_to_standardize in self.df.columns:
self.df[column_input_to_standardize] = self.df[column_input_to_standardize].str.upper().str.strip()
print(f"The selected {column_input_to_standardize} column has been updated to all uppercase!")
elif standardize_text_input == 3:
print(f"Please enter the column with text data:\n{self.df.columns}")
column_input_to_standardize = input("Enter the column name")
if column_input_to_standardize in self.df.columns:
self.df[column_input_to_standardize] = self.df[column_input_to_standardize].str.title().str.strip()
print(f"The selected {column_input_to_standardize} column has been updated to proper (title) case!")
elif standardize_text_input == 4:
break
else:
print("Invalid Input")
continue
def deleterows(self):
while True:
print('''Please check the columns based on which you want to delete rows - currently supporting numeric comparison only
1. View columns - Press 1
2. Return to the previous menu - Press 2
''')
try:
deleterows1stinput = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue # Continue the loop to get a valid input
if deleterows1stinput == 1:
print(self.df.columns)
deleterows2ndinput = input("Enter the column name\n")
deletecondition = input('''
Enter the condition
1. Enter 1 for greater >
2. Enter 2 for lower <
3. Enter 3 for equal ==
4. Enter 4 for >=
5. Enter 5 for <=
6. Enter 6 to return to the previous menu
''')
basevaluetodeleterows = int(input("Enter the value\n"))
if deletecondition == '1':
self.df = self.df[self.df[deleterows2ndinput] > basevaluetodeleterows]
print("The dataset has been updated. Please save it!!")
elif deletecondition == '2':
self.df = self.df[self.df[deleterows2ndinput] < basevaluetodeleterows]
print("The dataset has been updated. Please save it!!")
elif deletecondition == '3':
self.df = self.df[self.df[deleterows2ndinput] == basevaluetodeleterows]
print("The dataset has been updated. Please save it!!")
elif deletecondition == '4':
self.df = self.df[self.df[deleterows2ndinput] >= basevaluetodeleterows]
print("The dataset has been updated. Please save it!!")
elif deletecondition == '5':
self.df = self.df[self.df[deleterows2ndinput] <= basevaluetodeleterows]
print("The dataset has been updated. Please save it!!")
elif deletecondition == '6':
break
else:
print("Invalid Selection")
elif deleterows1stinput == 2:
break
else:
print("Invalid Selection")
def dropcolumn(self):
while True:
print('''Please check the columns based on which you want to delete rows - currently supporting numeric comparison only
1. View columns - Press 1
2. Return to the previous menu - Press 2
''')
try:
columndropinput = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue # Continue the loop to get a valid input
if columndropinput == 1:
print(self.df.columns)
columndropinput2 = input("Enter the column name\n")
self.df = self.df.drop(columndropinput2, axis=1)
print("The dataset has been updated. Please save it!!")
elif columndropinput == 2:
break
else:
print("Invalid Input")
def handlespecialcharacter(self):
while True:
print('''Please check the columns based on which you want to modify values - currently supporting string replacement only
1. View columns - Press 1
2. Return to the previous menu - Press 2
''')
try:
specialcharacter1stinput = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue # Continue the loop to get a valid input
if specialcharacter1stinput == 1:
print(self.df.columns)
column_name = input("Please enter the column name: ")
if column_name not in self.df.columns:
print("Column not found in DataFrame.")
continue
while True:
print('''
1. Replace character - Press 1
2. Remove character - Press 2
3. To Go Back - Press 3
''')
try:
choicesselect = int(input("Enter your choice\n"))
except ValueError:
print("Enter an integer only")
continue # Continue the loop to get a valid input
if choicesselect == 1:
special_character = input("Please enter the character to replace: ")
replacement_value = input("Please enter the value with which you want to replace the character: ")
self.df[column_name] = self.df[column_name].str.replace(special_character, replacement_value)
print("Values replaced successfully.")
elif choicesselect == 2:
special_character = input("Please enter the character to remove: ")
self.df[column_name] = self.df[column_name].str.replace(special_character, '')
print("Characters removed successfully.")
elif choicesselect == 3:
# Go back to the previous menu
break
else:
print("Invalid choice. Please enter 1, 2, or 3.")
elif specialcharacter1stinput == 2:
break
else:
print("Invalid choice. Please enter 1 or 2.")
def savecleandataset(self):
self.df.to_csv(r"YOUR UPDATED CODE",index=False)
print("The cleaned dataset has been saved successfully!!")
def main():
original_data_set = r"YOUR CODE"
file_path = original_data_set
explorer = DataExplorer(file_path)
cleaning = DataCleaning(file_path)
while True:
print('''Choose an option below:
1. Explore the dataset
2. Clean the dataset
3. Exit''')
try:
first_input = int(input("Enter your choice\n"))
except ValueError:
print("Please enter an integer value only")
continue
if first_input == 1:
explorer.explore_dataset()
elif first_input == 2:
cleaning.datasetcleaning()
elif first_input == 3:
print("Exiting the program")
break
else:
print("Invalid option")
if __name__ == "__main__":
main()
bottom of page