Hi everyone. I think what I'm trying to do is doable, but I'm asking for guidance so I don't muddle through it with a pieced-together solution. Also, this is a long post, so thank you in advance to those who take the time to read it all.

I have a text file containing data that I want to analyze. The file has whitespace, headers, and footers such that the data is spread across multiple "pages". That is, if you print the text file, each printed page has similar header/footer information followed by a list of data, with each piece of data on its own line. Sort of like this:

Applesauce Name Date Official Stuff

---
data1
data2
data3

Bananas Page:1
Time: ##:##:##

Applesauce Name Date Official Stuff2

---
data4
data5
data6

Bananas Page:2
Time: ##:##:##

.
.
.
etc

I've got no problem importing the whole text file, but I want to know how to remove all the whitespace and header/footer stuff and just retain the pieces of data in between.

These are the things I know about all the text files in general:

-Each "page" has 66 total lines, including header, data, whitespace, and footer.

-Each "page" may have a minimum of 1 and maximum of 52 data lines.

-The text that always marks the beginning of the headers is: "Applesauce" (I've used a nonsense word to protect confidential data for this post).

-The text that usually marks the end of the header is: "---" and will be on line 11 if present.

-If the "page" does not include the text "---", data will start on line 7.

-If the "page" has the text "Cucumber", data will start on line 12.

-The text that always marks the beginning and end of the footer is "Bananas" and "Time: ##:##:##", which appear on lines 65 and 66, respectively.

-I can get the first page's ##:##:## time value from the filename if I need to, but each subsequent "page" can have a slightly different time value depending on how long the source program took to create the data file.

-The number of pages isn't known ahead of time, but it's possible to determine because the footer includes "Page: #" for each sequential page.

-The total number of lines of data isn't known in advance.
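To make the rules above concrete, here's a minimal sketch in Python (purely to illustrate the logic; the function name is made up, and the same checks would translate directly to `InStr` tests in a VBA macro) that decides where the data starts on a 66-line page:

```python
def data_start(page):
    """Return the 1-based line number where data begins on a 66-line page.

    Rules assumed from above: pages containing "Cucumber" start at
    line 12; pages with a "---" separator (on line 11) also start at
    line 12, just after the separator; all other pages start at line 7.
    """
    if any("Cucumber" in line for line in page):
        return 12
    if any(line.strip() == "---" for line in page):
        return 12
    return 7
```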

There's got to be enough there to program a macro to extract the data from the other stuff, but I'm not sure what the right approach is.

Some thoughts off the top of my head:

-After import, dividing the total number of imported rows by 66 will give the number of pages.

-Looking for "---" in each group of 66 rows will tell whether that group's data starts on line 12 (just after the "---") or on line 7.

-Looking for "Cucumber" in each group of 66 rows will tell if that group's data starts on line 12.
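Putting those three thoughts together, a single-pass sketch might look like this (again Python just for illustration, with hypothetical names; it assumes the file has already been read into a list of lines, and the same loop could be written against rows on one sheet in VBA):

```python
def extract_data(lines, page_len=66):
    """Walk the imported lines one 66-line page at a time and keep
    only the data lines, per the layout rules described above."""
    n_pages = len(lines) // page_len       # total imported rows / 66
    data = []
    for p in range(n_pages):
        page = lines[p * page_len:(p + 1) * page_len]
        # Decide where data starts on this page (0-based indices).
        if any("Cucumber" in ln for ln in page):
            start = 11                     # line 12
        elif any(ln.strip() == "---" for ln in page):
            start = 11                     # data follows the "---" on line 11
        else:
            start = 6                      # line 7
        # The footer ("Bananas", "Time:") sits on lines 65-66, so stop at 64.
        for ln in page[start:64]:
            if ln.strip():                 # skip whitespace-only lines
                data.append(ln.strip())
    return data
```

This keeps everything on one pass over one list, so no per-page copying is needed.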

What's the right way to handle this task? Should I copy each block of 66 rows to a new sheet and then run the same filtering logic against each sheet? Or should I just keep the imported lines on one sheet and process each block of 66 rows one after another?

Advice or suggestions would be greatly appreciated, as always!