Delete HTML tags

  1. #1
    Registered User
    Join Date
    03-20-2007
    Posts
    89

    Delete HTML tags

    Hi --
    I have lots of text with HTML tags attached. I would like to delete the tags and be left with only the text. Is there an easy way to do this?

    e.g.

    A1:
    <P class=MsoNormal style="MARGIN: 0in 0in 0pt">PHI 21<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>Story Assignment<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></P>

    Result:
    PHI 21 Story Assignment

    Thanks. Stephanie

  2. #2
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2010, 2019
    Posts
    40,689
    Maybe Reafidy will stop by and give you a regular expressions solution. He's a RegEx wiz.

  3. #3
    Forum Contributor
    Join Date
    12-12-2006
    Location
    New Zealand
    Posts
    151
    Hmm, tricky.

    Going out to lunch; I'll take a look when I get back if no one has solved it by then.

  4. #4
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2010, 2019
    Posts
    40,689
    Well, this has no elegance (it makes no attempt to match opening and closing tags), and for your sample string it returns PHI 21&nbsp;&nbsp;&nbsp; Story Assignment&nbsp; . What's the rule to suppress the &nbsp;?

    [code block not available in this copy]
    Reafidy's RegEx solution is likely to be much better.
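Since shg's code is not visible in this copy, here is a minimal sketch of the kind of scan described in post #4, written in Python rather than VBA. The function name `strip_tags_naive` is hypothetical. It drops everything from a `<` up to the next `>` and makes no attempt to pair opening and closing tags, so it leaves the `&nbsp;` entities in place just as shg reports.

```python
def strip_tags_naive(text: str) -> str:
    """Drop every character from a '<' up to and including the next '>'."""
    out = []
    inside_tag = False
    for ch in text:
        if ch == "<":
            inside_tag = True          # start skipping
        elif ch == ">" and inside_tag:
            inside_tag = False         # stop skipping, drop the '>' itself
        elif not inside_tag:
            out.append(ch)             # ordinary text is kept
    return "".join(out)


# Stephanie's sample from post #1
sample = ('<P class=MsoNormal style="MARGIN: 0in 0in 0pt">PHI 21'
          '<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>'
          'Story Assignment<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></P>')

print(repr(strip_tags_naive(sample)))
# 'PHI 21&nbsp;&nbsp;&nbsp; Story Assignment&nbsp; '
```

Note that a stray `<` with no closing `>` makes this version discard the rest of the string.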

  5. #5
    Forum Expert Paul's Avatar
    Join Date
    02-05-2007
    Location
    Wisconsin
    MS-Off Ver
    2016/365
    Posts
    6,887
    Hi folks,

    shg, I was thinking along your lines, and found this site (http://www.codeproject.com/asp/removehtml.asp) that gave me some ideas. I added to it so that it not only removes anything between < and >, but also removes anything left over between & and ; (primarily non-breaking spaces, &nbsp;, but also any other &___; code).

    As with both of our examples, this can remove valid text and not just tags. If my code sees any ampersand, it deletes that sign and additional text after it. Likewise, if either of our routines finds a stray < in the text, it starts deleting from that point forward.
    [code block not available in this copy]
    Hopefully Reafidy gets back from lunch soon. There is a RegEx example on the site I linked above, but I don't know how to use RegEx and patterns. Oh well, just another thing to learn.
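Paul's code is not visible here either. The following is a rough Python sketch of the behavior post #5 describes, assuming the routine simply skips from `<` to `>` and from `&` to `;`. The name `strip_tags_and_entities` is made up for illustration. The second call shows the pitfall Paul warns about: ordinary ampersands and stray `<` characters swallow valid text.

```python
def strip_tags_and_entities(text: str) -> str:
    """Skip '<...>' spans and '&...;' spans; keep everything else."""
    out = []
    closer = None  # character that ends the span currently being skipped
    for ch in text:
        if closer is not None:
            if ch == closer:
                closer = None      # span finished, resume copying
        elif ch == "<":
            closer = ">"
        elif ch == "&":
            closer = ";"
        else:
            out.append(ch)
    return "".join(out)


sample = ('<P class=MsoNormal style="MARGIN: 0in 0in 0pt">PHI 21'
          '<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>'
          'Story Assignment<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></P>')

print(repr(strip_tags_and_entities(sample)))
# 'PHI 21 Story Assignment '

# Pitfall: a plain '&' and a stray '<' both delete valid text.
print(repr(strip_tags_and_entities("Fish & chips; salt < pepper")))
# 'Fish  salt '
```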

  6. #6
    Forum Contributor
    Join Date
    12-12-2006
    Location
    New Zealand
    Posts
    151
    You can remove anything between < > with regular expressions, but you are going to have to account for any anomalies like &nbsp; manually, so:

    [code block not available in this copy]
    Last edited by Reafidy; 10-30-2007 at 09:17 PM.
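Reafidy's VBA is hidden in this copy, so the following is only a sketch of the idea in post #6, using Python's `re` module as a stand-in for the VBA regular expression object. The pattern `<[^>]*>` and the final whitespace collapse are assumptions, not taken from the original code.

```python
import re

# Anything from '<' to the next '>' is treated as a tag.
TAG = re.compile(r"<[^>]*>")


def strip_html(text: str) -> str:
    no_tags = TAG.sub("", text)
    # Entities are not tags, so &nbsp; has to be handled separately.
    no_nbsp = no_tags.replace("&nbsp;", " ")
    # Assumption: runs of spaces should collapse to one, as in the desired result.
    return " ".join(no_nbsp.split())


sample = ('<P class=MsoNormal style="MARGIN: 0in 0in 0pt">PHI 21'
          '<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>'
          'Story Assignment<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></P>')

print(strip_html(sample))  # PHI 21 Story Assignment
```

This reproduces the result Stephanie asked for in post #1 for her sample cell.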

  7. #7
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2010, 2019
    Posts
    40,689
    Pjoaquin, I like your improvement. Reafidy, your RegEx has got to be the better approach.

    I think ideally either approach should remove opening and closing tags only in pairs ...

  8. #8
    Forum Contributor
    Join Date
    12-12-2006
    Location
    New Zealand
    Posts
    151
    Quote Originally Posted by shg
    I think ideally either approach should remove opening and closing tags only in pairs ...
    Do you mean like this:

    [code block not available in this copy]
    Last edited by Reafidy; 10-30-2007 at 09:17 PM.

  9. #9
    Forum Expert Paul's Avatar
    Join Date
    02-05-2007
    Location
    Wisconsin
    MS-Off Ver
    2016/365
    Posts
    6,887
    Reafidy,

    The new code still deletes anything after a < symbol even without a matching >.

    For example, if you put a < between PHI and 21, but no ending > symbol, it still deletes the 21. I get the same results with the original code I had posted earlier.

    In the link I provided earlier, there was another script that used some kind of array of valid HTML tags, e.g. HREF, SPAN, TABLE, etc.

    Could each instance of <**** be checked against such an array, and if it doesn't exist in the array to leave it alone until the next < symbol?
    Last edited by Paul; 10-30-2007 at 09:39 PM.
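Paul's whitelist idea could look something like the sketch below (Python, not the script from the linked site). The set `KNOWN_TAGS` is a short illustrative list, not a complete inventory of HTML elements, and `strip_known_tags` is a hypothetical name. Anything that looks like a tag but whose name is not in the set is left alone.

```python
import re

# Illustrative subset only; a real list would be much longer.
KNOWN_TAGS = {"a", "b", "i", "p", "span", "div", "table", "tr", "td", "br"}

# '<' or '</', a tag name, optional attributes, then '>'.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def strip_known_tags(text: str) -> str:
    def replace(match: re.Match) -> str:
        name = match.group(1).lower()
        # Remove the match only when the name is a recognised tag.
        return "" if name in KNOWN_TAGS else match.group(0)

    return TAG.sub(replace, text)


print(strip_known_tags("<P>PHI <21</P>"))       # PHI <21
print(strip_known_tags("x < y and <b>z</b>"))   # x < y and z
print(strip_known_tags("a <foo> b"))            # a <foo> b
```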

  10. #10
    Forum Contributor
    Join Date
    12-12-2006
    Location
    New Zealand
    Posts
    151
    pjoaquin,

    Yes, I think it's possible, but it's probably a bit of extra work, so let's see what the OP requires first.

    This change should get rid of the problem you mentioned where < is in the middle.

    [code block not available in this copy]
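The change Reafidy made in post #10 is not visible, so this is only one possible way to get the behavior he describes, sketched in Python. The assumption is that a tag may not contain another `<`, so a `<` that is never closed before the next tag starts is left in the text. `strip_closed_tags` is a hypothetical name.

```python
import re

# A tag is '<', then any characters except '<' and '>', then '>'.
# Excluding '<' inside the tag means an unclosed '<' cannot swallow
# the text up to some later tag's '>'.
CLOSED_TAG = re.compile(r"<[^<>]*>")


def strip_closed_tags(text: str) -> str:
    return CLOSED_TAG.sub("", text)


print(strip_closed_tags('<P>PHI <21<SPAN style="x">&nbsp;</SPAN></P>'))
# PHI <21&nbsp;
print(strip_closed_tags("PHI <21"))
# PHI <21
```

Text such as `a < b > c` would still be treated as a tag, so this narrows the problem rather than eliminating it.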

  11. #11
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2010, 2019
    Posts
    40,689
    What I mean is that the routine should iteratively replace

    <thistagstring>*</thistagstring>

    with *.

  12. #12
    Forum Contributor
    Join Date
    12-12-2006
    Location
    New Zealand
    Posts
    151
    Quote Originally Posted by shg
    What I mean is that the routine should iteratively replace

    <thistagstring>*</thistagstring>

    with *.
    Isn't that what the code in post 8 does?

  13. #13
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2010, 2019
    Posts
    40,689
    [code block not available in this copy]
    That doesn't look for matching tags, does it: <B>Hello</B>. You'd need a back reference in the second expression, wouldn't you?

  14. #14
    Registered User
    Join Date
    03-20-2007
    Posts
    89

    Thanks for the input

    Quote Originally Posted by Reafidy
    Isn't that what the code in post 8 does?
    I copied the code in #8 into the VB module and it returned a text box with the text of A1. How would I get it to look at the rest of the page?

    I appreciate the help. Thank you.

    Stephanie

  15. #15
    Valued Forum Contributor
    Join Date
    10-15-2007
    Location
    Home
    MS-Off Ver
    Office 2010, W10
    Posts
    373
    Hi Stephanie

    You just have to adapt the Sub TestIt().

    For example, if you have the HTML in column A and want to write just the text in column B, you can use

    [code block not available in this copy]
    Remark: You can use code #6.
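The loop lecxe describes (read each cell in column A, write the cleaned text to column B) is hidden in this copy. As a language-neutral illustration, the sketch below models the two columns as plain Python lists. The cell values and the `strip_html` helper are assumptions for the example, with the helper following the post #6 approach.

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_html(text: str) -> str:
    """Remove tags, turn &nbsp; into a space, collapse whitespace."""
    cleaned = TAG.sub("", text).replace("&nbsp;", " ")
    return " ".join(cleaned.split())


# Hypothetical contents of column A; empty cells are passed through.
column_a = ["<B>Hello</B>", "", "<P>PHI 21&nbsp;Story</P>"]

# Column B gets the cleaned text, row for row.
column_b = [strip_html(cell) if cell else "" for cell in column_a]

print(column_b)  # ['Hello', '', 'PHI 21 Story']
```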

  16. #16
    Registered User
    Join Date
    03-20-2007
    Posts
    89

    Thank You

    Thank you all for the input, I used #6 with the added bit above.

    You all saved me Lots and Lots of time!

    Thank You!

    Stephanie

  17. #17
    Forum Expert shg's Avatar
    Join Date
    06-20-2007
    Location
    The Great State of Texas
    MS-Off Ver
    2010, 2019
    Posts
    40,689
    Great, thanks for posting back, Stephanie.

  18. #18
    Valued Forum Contributor
    Join Date
    10-15-2007
    Location
    Home
    MS-Off Ver
    Office 2010, W10
    Posts
    373
    Now that Stephanie's problem is solved, going back a little:

    shg posted:
    That doesn't look for matching tags, does it: <B>Hello</B>. You'd need a back reference in the second expression, wouldn't you?
    I think we could include the backreference without making the pattern much more complex.
    Using part of Reafidy's code #8 (hope he doesn't mind), I believe this code would delete HTML tag pairs:

    [code block not available in this copy]
    Remark: Not all HTML tags come in pairs. That's why this code wouldn't be enough. If it's XHTML, we could also remove strings that start with "<" and end with "/>". Plain HTML would be messier, since it allows a looser syntax.

    Best regards
    lecxe
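lecxe's code is not visible in this copy. The sketch below shows, in Python, how a backreference can restrict removal to matching tag pairs, applied repeatedly as shg suggested in post #11. The patterns and the name `strip_tag_pairs` are assumptions. The XHTML-style self-closing step follows lecxe's remark and is optional.

```python
import re

# <name ...> inner </name>, where \1 forces the closing name to match.
PAIR = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>(.*?)</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
# XHTML-style empty elements such as <br/>.
SELF_CLOSING = re.compile(r"<[^<>]*/>")


def strip_tag_pairs(text: str) -> str:
    text = SELF_CLOSING.sub("", text)
    previous = None
    # Each pass unwraps one layer of matched pairs; repeat until stable.
    while previous != text:
        previous = text
        text = PAIR.sub(r"\2", text)
    return text


print(strip_tag_pairs("<B>Hello</B>"))            # Hello
print(strip_tag_pairs("<P>one <B>two</B></P>"))   # one two
print(strip_tag_pairs("<B>Hello"))                # <B>Hello
print(strip_tag_pairs("a<br/>b"))                 # ab
```

An opening tag with no matching close, as in the third call, is left untouched, which is the behavior shg was asking about.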
