Delete HTML tags

**sglick** · 10-30-2007, 06:45 PM

Hi --
I have lots of text with HTML tags attached. I would like to delete the tags and be left with only the text. Is there an easy way to do this?

e.g.

A1:
PHI 21    Story Assignment  

Result:
PHI 21 Story Assignment

Thanks. Stephanie

**shg** · 10-30-2007, 07:32 PM

Maybe Reafidy will stop by and give you a regular-expression solution. He's a RegEx wiz.

**Reafidy** · 10-30-2007, 07:45 PM

Hmm, tricky.

Going out to lunch; will take a look when I get back if someone hasn't solved it.

**shg** · 10-30-2007, 07:58 PM

Well, this has no elegance (it makes no attempt to match opening and closing tags), and for your sample string it returns PHI 21    Story Assignment  . What's the rule to suppress the &nbsp;?

[Code not shown.]

Reafidy's RegEx solution is likely to be much better.

**Paul** · 10-30-2007, 08:15 PM

Hi folks,

shg, I was thinking along your lines, and found this site (http://www.codeproject.com/asp/removehtml.asp) that gave me some ideas. I added to it so that it not only removes anything between < and >, but also removes anything left over that is between & and ; (primarily non-breaking spaces, &nbsp;, but also any other &___; code).

As with both of our examples, it can remove valid text and not just tags. If my code sees any ampersand, it will delete that sign and the text after it. Likewise, if either of our routines finds a stray < in the text, it will start deleting from that point forward.

[Code not shown.]
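The code from this post is not available here. As a rough illustration only, the idea described above (drop everything between < and >, then everything between & and ;) could be sketched like this in Python rather than VBA. The function names `strip_delimited` and `strip_tags_and_entities` are made up for this sketch and are not from the thread.

```python
def strip_delimited(text, open_ch, close_ch):
    """Drop everything from open_ch through the next close_ch.

    If open_ch appears with no later close_ch, the rest of the text
    is dropped too, which is the stray-character pitfall noted above.
    """
    out = []
    skipping = False
    for ch in text:
        if skipping:
            # Inside a delimited run: discard until the closing character.
            if ch == close_ch:
                skipping = False
        elif ch == open_ch:
            skipping = True
        else:
            out.append(ch)
    return "".join(out)


def strip_tags_and_entities(text):
    # First remove <...> runs, then &...; runs from what is left.
    without_tags = strip_delimited(text, "<", ">")
    return strip_delimited(without_tags, "&", ";")


if __name__ == "__main__":
    print(strip_tags_and_entities("<p>PHI 21 <b>Story</b> Assignment&nbsp;</p>"))
    # A stray "&" and a stray "<" both eat valid text:
    print(strip_tags_and_entities("AT&T < 5"))
```

The second call shows the downside: with a stray ampersand and a stray <, most of the valid text disappears.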

Hopefully Reafidy gets back from lunch soon. There is a RegEx example on the site I linked above, but I don't know how to use RegEx and patterns. Oh well, just another thing to learn.

**Reafidy** · 10-30-2007, 08:50 PM

You can remove anything between < > with regular expressions, but you are going to have to account for any anomalies like &nbsp; manually, so:

[Code not shown.]
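The posted code is not available here, so the following is only a sketch of the general approach, written in Python rather than VBA: a regular expression removes anything between < and >, and &nbsp; is handled by a plain string replacement. The pattern, the function name `strip_tags`, and the final whitespace collapsing are assumptions made for this example (the collapsing is a guess based on the result the original poster asked for).

```python
import re

# Anything from "<" up to the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]*>")


def strip_tags(text):
    text = TAG.sub("", text)            # remove <...> runs
    text = text.replace("&nbsp;", " ")  # handle the &nbsp; anomaly by hand
    # Assumption: collapse repeated whitespace and trim the ends.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "<span>PHI 21</span>&nbsp;&nbsp;<b>Story Assignment</b>&nbsp;"
    print(strip_tags(sample))
```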

**shg** · 10-30-2007, 09:01 PM

Pjoaquin, I like your improvement. Reafidy, your RegEx has got to be the better approach.

I think ideally either approach should remove opening and closing tags only in pairs ...

**Reafidy** · 10-30-2007, 09:13 PM

Originally Posted by shg

I think ideally either approach should remove opening and closing tags only in pairs ...

Do you mean like this:

[Code not shown.]

**Paul** · 10-30-2007, 09:36 PM

Reafidy,

The new code still deletes anything after a < symbol even without a matching >.

For example, if you put a < between PHI and 21, but no ending > symbol, it still deletes the 21. I get the same results with the original code I had posted earlier.

In the link I provided earlier, there was another script that used some kind of array of valid HTML tags, e.g. HREF, SPAN, TABLE, etc..

Could each instance of <**** be checked against such an array, and if it doesn't exist in the array to leave it alone until the next < symbol?
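One way such a check might look, sketched in Python rather than VBA: each <name ...> or </name ...> candidate is compared against a set of known tag names, and anything not in the set is left in the text. The set `KNOWN_TAGS` is a small illustrative sample, not a complete list of HTML tags, and the function name `strip_known_tags` is made up for this sketch.

```python
import re

# Illustrative sample only; a real list of HTML tags would be much longer.
KNOWN_TAGS = {"a", "b", "i", "p", "br", "div", "span", "table", "tr", "td", "font"}

# "<" or "</", a tag name starting with a letter, then anything up to ">"
# that does not contain another "<" or ">".
CANDIDATE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def strip_known_tags(text):
    def replace(match):
        name = match.group(1).lower()
        # Remove the candidate only if its name is a known tag;
        # otherwise put it back unchanged.
        return "" if name in KNOWN_TAGS else match.group(0)

    return CANDIDATE.sub(replace, text)


if __name__ == "__main__":
    print(strip_known_tags("PHI <21 <b>Story</b>"))
    print(strip_known_tags("x <foo> y <SPAN class='a'>z</SPAN>"))
```

A stray < followed by a digit never matches, and an unknown name such as <foo> is kept as ordinary text.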

**Reafidy** · 10-30-2007, 10:53 PM

pjoaquin,

Yes, I think it's possible, but it's probably a bit of extra work, so let's see what the OP requires first.

This change should get rid of the problem you mentioned where < is in the middle.

[Code not shown.]
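The actual change is not visible here. One plausible way to get this behaviour, shown as a Python sketch rather than the VBA from the thread, is to forbid a second < inside the match, so that a stray < with no matching > is skipped. Both patterns and both function names below are assumptions for illustration.

```python
import re

# Matches from "<" to the next ">", even if another "<" sits in between.
TO_NEXT_GT = re.compile(r"<[^>]*>")

# Matches only when there is no other "<" (or ">") between the brackets.
NO_INNER_LT = re.compile(r"<[^<>]*>")


def strip_to_next_gt(text):
    return TO_NEXT_GT.sub("", text)


def strip_without_inner_lt(text):
    return NO_INNER_LT.sub("", text)


if __name__ == "__main__":
    sample = "PHI < 21 <b>Story</b> Assignment"
    print(strip_to_next_gt(sample))        # the 21 is lost
    print(strip_without_inner_lt(sample))  # the stray "<" and the 21 survive
```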

**shg** · 10-30-2007, 11:37 PM

What I mean is that the routine should iteratively replace

<thistagstring>*</thistagstring>

with *.

**Reafidy** · 10-30-2007, 11:53 PM

Originally Posted by shg

What I mean is that the routine should iteratively replace

<thistagstring>*</thistagstring>

with *.

Isn't that what the code in post 8 does?

**shg** · 10-31-2007, 09:28 AM

[Code not shown.]

That doesn't look for matching tags, does it: Hello. You'd need a back reference in the second expression, wouldn't you?

**sglick** · 10-31-2007, 11:17 AM

Originally Posted by Reafidy

Isn't that what the code in post 8 does?

I copied the code in #8 into the VB module and it returned a text box with the text of A1. How would I get it to look at the rest of the page?

I appreciate the help. Thank you.

Stephanie

**lecxe** · 10-31-2007, 02:01 PM

Hi Stephanie

You just have to adapt the Sub TestIt().

For example, if you have the HTML in column A and want to write just the text in column B, you can use

[Code not shown.]

Remark: You can use code #6.

**sglick** · 10-31-2007, 03:14 PM

Thank you all for the input. I used #6 with the added bit above.

You all saved me Lots and Lots of time!

Thank You!

Stephanie

**shg** · 10-31-2007, 03:26 PM

Great, thanks for posting back, Stephanie.

**lecxe** · 10-31-2007, 04:21 PM

Now that Stephanie's problem is solved, going back a little:

shg posted:

That doesn't look for matching tags, does it: Hello. You'd need a back reference in the second expression, wouldn't you?

I think we could include the backreference without making the pattern much more complex.
Using part of Reafidy's code #8 (hope he doesn't mind), I believe this code would delete HTML tag pairs:

[Code not shown.]

Remark: Not all HTML tags come in pairs, so this code alone wouldn't be enough. If it's XHTML we could also remove strings that start with "<" and end with "/>". Pure HTML would be wilder, as it allows a looser syntax.
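Since the code itself is not shown here, this is only a sketch of the backreference idea in Python rather than VBA. The pattern captures the opening tag name, requires the same name in the closing tag via `\1`, and the replacement is repeated until nothing changes so that nested pairs are unwrapped one layer at a time. The pattern and the function name `strip_tag_pairs` are assumptions for illustration.

```python
import re

# Group 1: tag name. Group 2: the content between the pair.
# "\1" requires the closing tag to carry the same name as the opening one.
PAIR = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>(.*?)</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_tag_pairs(text):
    # Repeat until a pass makes no replacement, so nested pairs
    # are unwrapped layer by layer.
    while True:
        text, count = PAIR.subn(r"\2", text)
        if count == 0:
            return text


if __name__ == "__main__":
    print(strip_tag_pairs("<b><i>Hello</i></b>"))  # matched pairs removed
    print(strip_tag_pairs("<b>Hello</i>"))         # mismatched: left alone
    print(strip_tag_pairs("line one<br>line two")) # unpaired tag: left alone
```

As the remark above says, unpaired tags such as <br> are untouched by this approach and would need separate handling.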

Best regards
lecxe
