Hello: I have a plain-text data file containing rows of data, each with three space-separated columns, that serves as a dataset for a visualization program (Cytoscape).

The data is of the form

a pi b
a pi c
a pi d
c pi a
etc.

where the first and last columns are genes, and the middle column is the relationship (interaction type) between them.

My problem is that the rows (a pi c) and (c pi a) are informationally equivalent (gene "a" interacts with gene "c" with interaction type "pi"), but Cytoscape maps (displays) both, adding superfluous redundancy to the visual images, e.g.

a=c (with two lines connecting genes a and c)

and not

a-c (with one line connecting genes a and c).

I would like to filter out (delete) these duplicate rows.

I imported (copied/pasted) the sorted data from Notepad into Excel, splitting it across two Excel files, since at ~120,000 rows the data exceeds Excel's maximum row count by roughly 2x, but I cannot figure out how to delete these duplicates.

I thought that if I reversed the order of the cells (the "reverse complement," i.e. (a pi b) converted to (b pi a)) and then combined the original and reversed lists, I could sort out the duplicates (which I am able to do), but this action also generates new duplicates, a "vicious cycle."

What I really need, either in Excel (or a linux script?), is a script that performs these actions:

for each row (a pi b) in the data file
    if the reversed row (b pi a) has already been seen, then delete (a pi b)
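(To make my idea concrete, here is roughly what I imagine that loop looking like as an awk one-liner; the file names interactions.txt and deduplicated.txt are just placeholders, and the printf line only creates a small sample of the data for illustration.)

```shell
# Sample of the data; in real use, interactions.txt would be the
# actual ~120,000-row file.
printf 'a pi b\na pi c\na pi d\nc pi a\n' > interactions.txt

# Build a canonical key for each row by putting the two gene columns
# in alphabetical order, then print a row only the first time its
# canonical key is seen (so "c pi a" is dropped when "a pi c" came first).
awk '{ key = ($1 < $3) ? $1 FS $2 FS $3 : $3 FS $2 FS $1 }
     !seen[key]++' interactions.txt > deduplicated.txt

cat deduplicated.txt   # a pi b / a pi c / a pi d  (c pi a removed)
```

This keeps each surviving row in its original orientation and original order.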

That is, look at the list

a pi b
a pi c
a pi d
c pi a

if (b pi a) were present, then delete (a pi b)

and so on. In this example, when the last row is evaluated,

if (a pi c) is present, then delete (c pi a)

this "duplicate" (redundant) entry will be deleted.
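(If it helps a would-be answerer: since the deduplicated output feeds a visualization rather than needing to preserve row order, I imagine the same result could come from rewriting every row into a canonical orientation and letting sort -u drop the exact duplicates. Again, interactions.txt is just a placeholder name, and the printf line stands in for the real file.)

```shell
# Stand-in for the real data file.
printf 'a pi b\na pi c\na pi d\nc pi a\n' > interactions.txt

# Rewrite each row with its two genes in alphabetical order, then
# let sort -u remove exact duplicates. Note: surviving rows come out
# in canonical orientation and sorted order, not the original order.
awk '{ key = ($1 < $3) ? $1 FS $2 FS $3 : $3 FS $2 FS $1; print key }' \
    interactions.txt | sort -u > deduplicated.txt
```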

Is there a simple way to code this in Excel, and/or as a linux script?

I apologize for the length of this message - I wanted to clearly explain the problem. If someone could provide a solution, I would really appreciate it! Sincerely, Greg S. :-)