Incorporate what you have learned so far in the program, including conditional statements (i.e., IF functions), which make decisions based on criteria you specify.
Note: The data contains some errors, which reflects a realistic business situation. Answer Q1 through Q3 using the data as is. In Q4, you will handle the errors through data cleaning. In Q5, you will recalculate the summary statistics with the cleaned data to see how the numbers differ between the uncleaned and cleaned data.
Q1. When you clean your data, sometimes you may want to combine similar information found in separate columns into one column. Do you have similar categories of data that could be combined in order to conduct your analysis? You should combine some of your columns that contain similar information into a new column.
- Getting started: look at some of the sociodemographic columns; for example, information related to marital status contained in separate columns can be combined into one new column.
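The combining step can be sketched in code. The example below is a minimal illustration in Python rather than Excel, and it assumes the marital status information is stored as 0/1 indicator columns. The column names (Marital_Married, Marital_Single, Marital_Divorced) and the sample rows are hypothetical, so substitute the names found in your own data. The logic mirrors a nested IF: check each column in turn and return the first label whose flag is set.

```python
# Hypothetical indicator columns; the real dataset's column names may differ.
MARITAL_COLUMNS = {
    "Marital_Married": "Married",
    "Marital_Single": "Single",
    "Marital_Divorced": "Divorced",
}


def combine_marital(row):
    """Return one marital-status label for a row of 0/1 flags."""
    # Same idea as a nested IF: take the first column whose flag equals 1.
    for column, label in MARITAL_COLUMNS.items():
        if row.get(column) == 1:
            return label
    # No flag is set: mark the row instead of guessing a category.
    return "Unknown"


# Made-up sample rows, for illustration only.
rows = [
    {"Marital_Married": 1, "Marital_Single": 0, "Marital_Divorced": 0},
    {"Marital_Married": 0, "Marital_Single": 0, "Marital_Divorced": 1},
    {"Marital_Married": 0, "Marital_Single": 0, "Marital_Divorced": 0},
]

combined = [combine_marital(r) for r in rows]
print(combined)
```

Returning "Unknown" when no flag is set keeps such rows visible, which is useful later when you look for values that do not make sense.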
Q2. Statistics provide useful summary information about data. You should calculate basic summary statistics for the most "important" columns (give thought to which information you consider important given the business case).
Note that summary statistics paint an overall picture of your data. Examples include the minimum, maximum, range, mean, mode, and median, which are helpful in understanding the underlying data.
Please calculate the following statistics:
- What are the extreme values (min, max)?
- Which values occur the most (mode)?
- How often do values occur (frequency)?
- What is the range of values?
- Within the range of values, is the data concentrated more toward the lower end of the range or the higher, or is it evenly distributed? Notice how the data changes.
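The statistics listed above can be sketched for a single column as follows. This is a minimal Python illustration, not the required Excel solution, and the sample values are made up. Comparing the mean with the median is only a rough heuristic for where the data is concentrated, so a chart should confirm the result.

```python
from collections import Counter
import statistics


def summarize(values):
    """Compute basic summary statistics for one numeric column."""
    low, high = min(values), max(values)
    mean = statistics.mean(values)
    median = statistics.median(values)

    # Rough heuristic: a mean above the median suggests most values sit
    # toward the lower end, with a few high values pulling the mean up.
    if mean > median:
        concentration = "lower"
    elif mean < median:
        concentration = "higher"
    else:
        concentration = "even"

    return {
        "min": low,
        "max": high,
        "range": high - low,
        "mode": statistics.mode(values),      # most frequent value
        "frequency": dict(Counter(values)),   # how often each value occurs
        "mean": mean,
        "median": median,
        "concentration": concentration,
    }


# Made-up sample column, for illustration only.
sample = [20, 25, 25, 30, 35, 90]
summary = summarize(sample)
print(summary)
```

In Excel, the corresponding built-in functions are MIN, MAX, MODE.SNGL, COUNTIF, AVERAGE, and MEDIAN.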
Q3. Are there simple Excel charts that can assist you in visually analyzing the data from the previous questions? Please provide examples of these charts for some of the "important" variables.
Q4. Part of data curation involves cleaning your data and removing values that are incorrect or are significantly different from the rest of the data as they don't contribute valuable information and may hinder your analysis. While you may consider removing irrelevant rows, be cautious, as seemingly unimportant values may hold valuable information that needs to be considered in your analysis. You should identify which values of each of the columns, if any, do not make sense and then determine if you should remove the row.
- Getting started: look for blanks and missing values, values that are outside of the typical range (e.g., negative or extreme values) or do not make sense given what the data represents (e.g., decimals).
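The checks suggested above can be written as explicit rules, one condition per kind of problem, much like a set of IF tests. The sketch below is a Python illustration under assumptions: the columns Income and Kids are hypothetical, and the rules (income must not be negative, a count of children must be a whole non-negative number) are examples, not the rules for your dataset. It flags rows and lists the reasons rather than deleting anything, so you can still decide whether a row should be removed.

```python
def find_problems(row):
    """Return a list of reasons why a row looks suspicious (empty if none)."""
    problems = []
    income = row.get("Income")  # hypothetical column
    kids = row.get("Kids")      # hypothetical column

    # Blank or missing values come first; other checks need a number.
    if income is None or income == "":
        problems.append("Income is blank")
    elif income < 0:
        problems.append("Income is negative")

    if kids is None or kids == "":
        problems.append("Kids is blank")
    elif kids < 0 or kids != int(kids):
        # A count of children should be a whole, non-negative number.
        problems.append("Kids is not a whole, non-negative number")

    return problems


# Made-up sample rows, for illustration only.
rows = [
    {"Income": 52000, "Kids": 2},
    {"Income": -100, "Kids": 1},
    {"Income": "", "Kids": 1.5},
]

# Keep the row position so flagged rows can be reviewed before removal.
flagged = [(i, find_problems(r)) for i, r in enumerate(rows) if find_problems(r)]
print(flagged)
```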
Q5. Now that you have cleaned the data, recalculate the statistics from Q2. Do you notice any significant differences from the Q2 values?
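A before-and-after comparison can be sketched as follows. This Python illustration uses made-up values and a hypothetical valid range of 0 to 120; both are assumptions chosen only to show the effect. It illustrates why comparing more than one statistic is worthwhile: a few bad values can move the mean a great deal while leaving the median unchanged.

```python
import statistics

# Made-up column containing two implausible entries (-1 and 999).
raw = [52, 48, 50, -1, 51, 999]

# Hypothetical cleaning rule: keep only values inside an assumed valid range.
VALID_LOW, VALID_HIGH = 0, 120
cleaned = [v for v in raw if VALID_LOW <= v <= VALID_HIGH]


def compare(before, after):
    """Return (before, after) pairs for a few summary statistics."""
    return {
        "mean": (statistics.mean(before), statistics.mean(after)),
        "median": (statistics.median(before), statistics.median(after)),
        "min": (min(before), min(after)),
        "max": (max(before), max(after)),
    }


comparison = compare(raw, cleaned)
for name, (before, after) in comparison.items():
    print(f"{name}: before={before}, after={after}")
```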