Hi,
I'm cleaning data to use in my project, and the deduplication criteria are quite complicated. The data include disease (5 values), ID, name, accession number, specimen date, specimen source, result date, and result status (3 values), all of which are in the attached sample, plus other variables that are not included. The criteria for removing observations, and what I have done so far for each, are as follows:
For observations with the same "name", "specimen_date", "Result_Status" and different "disease", keep all as in observations 2 & 3.
- I removed duplicate rows using "disease", "Name", "specimen_date" and "result_status" columns.
For observations with the same "disease", "name", and "result status", keep only the one with the earliest "specimen date" (or "result_date"), as in observations 4 & 5 and 16 & 17.
- I sorted the data set by "specimen_date" and "result_date" before removing duplicate rows.
For observations with the same "disease", "name", "Specimen_Date" and different "result status", keep the one with 'Final Result' and delete the 'preliminary' one, as in observations 21 & 22, 18 & 19, and 6 & 8.
- I removed duplicate rows using "disease", "Name", and "specimen date".
For observations with the same "name", "disease", and "result status" whose specimen dates differ by more than one year, keep both, as in observations 23 & 24.
- I have no idea how to implement this one.
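To make the four criteria concrete, here is a minimal Python sketch of how they might be combined in one pass. It is only an illustration under several assumptions that are not in the data description: each row is a dict with hypothetical keys `id`, `disease`, `name`, `specimen_date`, `result_date`, and `result_status`; "more than one year" is taken as more than 365 days; and each row is compared against the last row that was kept in its group. The sample rows are made up and are not the observations from the attachment.

```python
from collections import defaultdict
from datetime import date, timedelta

ONE_YEAR = timedelta(days=365)  # assumption: "one year" means 365 days


def deduplicate(rows):
    """Apply the four criteria to a list of row dicts (hypothetical layout)."""
    # Criterion 3: for the same disease, name and specimen date,
    # drop the preliminary row when a final result exists.
    has_final = {
        (r["disease"], r["name"], r["specimen_date"])
        for r in rows
        if r["result_status"].casefold() == "final result"
    }
    rows = [
        r for r in rows
        if not (
            r["result_status"].casefold() == "preliminary"
            and (r["disease"], r["name"], r["specimen_date"]) in has_final
        )
    ]

    # Criterion 1: disease is part of the grouping key, so rows that differ
    # only in disease never compete with each other.
    groups = defaultdict(list)
    for r in rows:
        groups[(r["disease"], r["name"], r["result_status"])].append(r)

    kept = []
    for group in groups.values():
        # Criterion 2: earliest specimen date first, result date breaks ties.
        group.sort(key=lambda r: (r["specimen_date"], r["result_date"]))
        last_kept = None
        for r in group:
            # Criterion 4: keep a later row only if it falls more than one
            # year after the last row kept in this group.
            if (last_kept is None
                    or r["specimen_date"] - last_kept["specimen_date"] > ONE_YEAR):
                kept.append(r)
                last_kept = r
    return kept


# Made-up sample rows, for illustration only.
sample = [
    {"id": "A", "disease": "flu", "name": "Ann", "specimen_date": date(2020, 1, 1),
     "result_date": date(2020, 1, 3), "result_status": "Final Result"},
    {"id": "B", "disease": "covid", "name": "Ann", "specimen_date": date(2020, 1, 1),
     "result_date": date(2020, 1, 3), "result_status": "Final Result"},
    {"id": "C", "disease": "flu", "name": "Ann", "specimen_date": date(2020, 2, 1),
     "result_date": date(2020, 2, 2), "result_status": "Final Result"},
    {"id": "D", "disease": "flu", "name": "Ann", "specimen_date": date(2021, 6, 1),
     "result_date": date(2021, 6, 2), "result_status": "Final Result"},
    {"id": "E", "disease": "flu", "name": "Bob", "specimen_date": date(2020, 3, 1),
     "result_date": date(2020, 3, 2), "result_status": "Preliminary"},
    {"id": "F", "disease": "flu", "name": "Bob", "specimen_date": date(2020, 3, 1),
     "result_date": date(2020, 3, 4), "result_status": "Final Result"},
]

kept_ids = sorted(r["id"] for r in deduplicate(sample))
print(kept_ids)
```

In this sample, C is dropped because it falls within a year of A, D is kept because it is more than a year after A, B is kept because the disease differs, and E is dropped in favour of F. If the one-year window should instead be measured from the immediately preceding row rather than the last kept row, the comparison inside the loop would need to change accordingly.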
Any help on this is greatly appreciated.
Thank you