Regular expressions for HTML - html

I am trying to write regular expressions for the following items, to use in a program of mine that parses a given HTML file. Could you help me with any of those?
<div>
<div class=”menuItem”>
<span>
class=”emph”
Any string beginning with < and ending with >, i.e. all tags.
The contents of the body tag.
The contents of all divs
All divs that make menus
I have managed to figure out that the single div tag is simply "<div>"
and the "all tags" expression is <(\"[^\"]*\"|'[^']*'|[^'\">])*>
Do you think you could help me with any of the rest?
Thank you in advance guys...
I know that HTML parsing is an already solved problem and that regex is not an efficient way to do it. However, I have been asked to do it this way, in order to demonstrate how regular expressions can work by making them (sometimes) long and detailed. That's why I'm handling the HTML file as a simple text file and need to apply those regular expressions to it.
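For the items listed above, a minimal sketch in Python's re module could look like the following. It assumes straight double quotes around attribute values, a single class per element, and divs that are not nested inside each other; the sample html string and all variable names are made up for illustration.

```python
import re

# Hypothetical sample input, not taken from the question.
html = """<html><body>
<div class="menuItem"><span class="emph">Home</span></div>
<div>Plain text</div>
</body></html>"""

# A bare <div> tag with no attributes.
DIV_TAG = re.compile(r"<div>")
# An opening div whose class attribute is exactly "menuItem".
MENU_DIV = re.compile(r'<div\s+class="menuItem"\s*>')
# The attribute class="emph" on its own.
EMPH_ATTR = re.compile(r'class="emph"')
# Any tag; quoted attribute values may contain '>' (pattern from the question).
ANY_TAG = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")
# Contents of the body element (non-greedy; '.' also matches newlines).
BODY = re.compile(r"<body\b[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)
# Contents of each div, assuming divs are NOT nested.
DIV_CONTENT = re.compile(r"<div\b[^>]*>(.*?)</div>", re.DOTALL | re.IGNORECASE)

tags = [m.group(0) for m in ANY_TAG.finditer(html)]
menu_divs = MENU_DIV.findall(html)
div_contents = DIV_CONTENT.findall(html)
body = BODY.search(html).group(1)
```

The non-greedy (.*?) stops at the first closing tag, which is why the div pattern only holds up while divs are not nested.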

Please, for your own sanity, consider using an HTML parser library for the language you are using. Regexps are not a suitable tool for this application - they cannot reliably or cleanly handle structured data like HTML.
https://stackoverflow.com/a/1732454/457201
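To illustrate the parser route, here is a rough sketch using html.parser from the Python standard library (the question does not say which language is in use, so Python is an assumption). The class name MenuDivCollector and the helper menu_items are hypothetical; the sketch collects the text inside every div whose class is exactly "menuItem", including nested divs.

```python
from html.parser import HTMLParser

class MenuDivCollector(HTMLParser):
    """Collects the text inside every <div class="menuItem"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside a menu div (0 = outside)
        self.items = []   # one text entry per menu div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div inside a menu div
        elif tag == "div" and dict(attrs).get("class") == "menuItem":
            self.depth = 1               # entering a menu div
            self.items.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data

def menu_items(html_text):
    collector = MenuDivCollector()
    collector.feed(html_text)
    collector.close()
    return collector.items
```

Unlike the non-greedy regex approach, the depth counter keeps track of nested divs, so the closing tag of an inner div does not end the menu item early.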

Related

Look for Nested XML tag with Regex

This is my first post here, and I'm hoping to get some response. I've read through a few similar posts, and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than the questions in those other posts, so I'm giving it a shot.
I'm trying to find all the nested tags, here are some examples
I want to catch:
<a><a></a></a>
I don't want to catch
<a></a><a></a>
So in plain English, I want to catch every
<a> that follows another <a> without a </a> in between them. I also want to look through the entire string, so matching should continue even when it hits a newline or line break.
Hoping to have this problem solved.
Thanks all!
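The condition as stated (an opening <a> followed by another opening <a> with no </a> in between) can be sketched in Python like this. It is only a sketch for that narrow case: it ignores comments, CDATA and self-closing tags, and the names NESTED_A and has_nested_a are made up.

```python
import re

# An opening <a ...> followed by another opening <a ...> with no </a> between.
# The "tempered dot" (?:(?!</a>).)*? consumes one character at a time, but
# only while the upcoming text is not "</a>". re.DOTALL lets '.' match
# newlines, so the search continues across line breaks.
NESTED_A = re.compile(r"<a\b[^>]*>(?:(?!</a>).)*?<a\b[^>]*>", re.DOTALL)

def has_nested_a(text):
    """True if some <a> is opened while an earlier <a> is still open."""
    return NESTED_A.search(text) is not None
```

The \b after "a" keeps the pattern from firing on other tag names that merely start with "a", such as <abbr>.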
I hope you are ready for parsing XML with regex.
First of all, let's define what XML tags would look like!
<tag_name␣(optional space (then whatever that doesnt end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>
To match one of these tags we can then use the following regex:
/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
Obviously, no tags are going to nest within our second type of XML tag. So our two-level nested regex would then be:
/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):
/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s
Done - The regex should do.
No seriously, try it out.
I stole an XML file fragment from w3schools XML tutorial and tried it with my regex, I copied a Maven project .xml from aliteralmind's question and tried it with my regex as well. Works best with heavily nested elements.
Cheers.
If you want a 100% correct solution, for example one that works with arbitrary content in comments and CDATA sections and in internal/external entities, and with author-chosen namespace prefixes, then it can't be done with regular expressions.
And since a 100% correct solution is very easy to achieve with XSLT, I think you are using the wrong technology.
No doubt you can achieve an acceptably high hit rate with regular expressions if you're prepared to put enough work in, but the details depend on aspects of the specification that you haven't made clear: for example, what you want to do with the nested elements that you find, and whether you want to locate elements nested 3-deep or 4-deep as well as those nested 2-deep.
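As a parser-based alternative (not XSLT, but the same idea of working on the parsed tree rather than the text), here is a small sketch using xml.etree.ElementTree from the Python standard library. It assumes the input is well-formed XML and that "nested" means an element with a descendant of the same name at any depth; the function name nested_same_name is hypothetical.

```python
import xml.etree.ElementTree as ET

def nested_same_name(xml_text):
    """Return the elements that contain a descendant with the same tag name."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():   # every element, in document order
        # elem.iter(tag) yields elem itself first when the tag matches,
        # so the element is excluded from its own descendant check.
        if any(d is not elem for d in elem.iter(elem.tag)):
            hits.append(elem)
    return hits
```

Because the parser deals with comments and CDATA sections itself, a tag that only appears inside a comment is not counted.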

How to add html code to a website but not have it actually turn to code?

I'm not sure how to describe this one. Basically, I want to show the code of my website on my website, so I'm not sure what tag to use. I've done some googling, but I'm not even sure what I should be googling for. I've found these tags, but none of them seemed to work: I found the script tag, which turned out not to be what I needed, and the code tag, but that didn't do what I wanted. Anyone know what to do?
Thanks in advance
Write the < as &lt; and the > as &gt; (their character entity equivalents). This will prevent the browser from interpreting them as tags.
So instead of
<p>This is an <i>example</i> of html.</p>
You'd write
&lt;p&gt;This is an &lt;i&gt;example&lt;/i&gt; of html.&lt;/p&gt;
Tedious, but necessary. Technically speaking, a > is only seen as a tag closer if it was preceded by a < somewhere, so you MIGHT be able to get by with just doing < -> &lt;, but it's safer to write out both characters in entity format.
Once that's done you may want to wrap it with the <pre> tag (as in preformatted) which will cause the spacing and line breaks to be rendered just as they appear in your source code.
Utilize the HTML entity for each applicable symbol. Alternatively, you could use PHP and run the string through htmlentities.
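If PHP is not available, the same escaping can be done in other languages. As a sketch, Python's standard library has html.escape, which plays a similar role to htmlentities for the characters that matter here (<, > and &, plus quotes by default). The variable names are just for illustration.

```python
import html

source = "<p>This is an <i>example</i> of html.</p>"

# Replace <, >, & (and quotes) with their character entities.
escaped = html.escape(source)

# Wrap in <pre> so spacing and line breaks are shown as written.
page_fragment = "<pre>" + escaped + "</pre>"
```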
I am not sure if you can do it with a simple tag, but I use this php library to convert my code to nice html code
http://qbnz.com/highlighter/
Take a look at this; I think it may be what you are looking for.
http://code.google.com/p/google-code-prettify/

How can I take an xml string and display it on my page similar to how StackOverflow does it with 'insert code'?

I'm using the DataContractSerializer to convert an object returned from a WCF call to xml. The client would like to see that xml string in a webpage. If I output the string directly to a label, the browser obviously strips out the angle brackets. My question is: how can I do something similar to StackOverflow? Are they doing a find & replace to replace angle brackets with their html entities? I see they are putting a code tag inside a pre tag and then making spans with the appropriate class. Is there an existing utility out there I can use to do this instead of writing some kind of parsing routine? I'm sure something free must be out there. If anyone can direct me to the right place or to some code that can easily accomplish this, I would greatly appreciate it. I apologize if this is more of a meta.stackoverflow question. Thanks for any tips.
The basic answer is that to get HTML displayed as typed, special characters and all, you need to replace the special characters (<, > etc.) with their escaped equivalents (&lt;, &gt; etc.). Beyond that, if you want syntax colouring you'll have to parse the input to identify the keywords etc.
A full list of the special characters and their escape codes can be found here, but this is just one site of many.
You're talking about "pretty print". If you want to display source code you could use this link: 16 Free Javascript Code Syntax Highlighters For Better Programming
But if you want to display only XML, there are some functions on the web that could help you with that, like this one: XML PHP Pretty Printer
And don't forget the special characters =)
good luck
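The pretty printers mentioned above are PHP and JavaScript. As a rough equivalent, here is a Python sketch that pretty-prints an XML string with xml.dom.minidom and then escapes the special characters so a browser shows the markup as text. The function name xml_to_html_block is made up, and the exact whitespace produced by toprettyxml may differ between Python versions.

```python
import html
from xml.dom import minidom

def xml_to_html_block(xml_text):
    """Indent the XML, escape it, and wrap it for display in a page."""
    # Re-serialize with two-space indentation (adds an XML declaration).
    pretty = minidom.parseString(xml_text).toprettyxml(indent="  ")
    # Escape <, >, & and quotes so the browser shows the tags literally.
    return "<pre><code>" + html.escape(pretty) + "</code></pre>"
```

Syntax colouring would still need a separate highlighter; this only covers indentation and escaping.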

Howto remove HTML <a> tags in a CDATA element

I have HTML in a CDATA element (HTML is too crappy to be parsed) and I would like to remove <a href> tags, but keep text in the tags.
I've been searching for a regex to do this but still haven't found a good way.
All advice is welcome!
You could remove anything from a string that looks like a HTML link via regex. Results heavily depend on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
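As a sketch of that replacement in Python (the question does not name a language, so this is an assumption), with a made-up helper strip_links. The re.IGNORECASE flag is an addition to cover <A HREF=...>.

```python
import re

# Opening or closing <a> tags; \b stops it from matching <abbr>, <article>...
A_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

def strip_links(fragment):
    """Remove <a ...> and </a> tags but keep the text between them."""
    return A_TAG.sub("", fragment)
```

One known limitation: [^>]* stops at the first >, so an attribute value that itself contains > would leave debris behind.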
In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known and all you need to do is some throw-away one-time conversion code then crappy and ad-hoc may be enough and you could get away with it.
If you are developing code that is intended to be of the long-lived sort, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.

Building Regular Expression (RegEx) to extract text of HTML tag [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
I am trying to build a regular expression to extract the text inside the HTML tag as shown below. However I have limited skills in regular expressions, and I'm having trouble building the string.
How can I extract the text from this tag:
text
That is just a sample of the HTML source of the page. Basically, I need a regex string to match the "text" inside of the <a> tag. Can anyone assist me with this? Thank you. I hope my question wasn't phrased too horribly.
UPDATE: Just for clarification, report_drilldown is absolute, but I don't really care if it's present in the regex as absolute or not.
145817 is a random 6 digit number that is actually a database id. "text" is just simple plain text, so it shouldn't be invalid HTML. Also, most people are saying that it's best to not use regex in this situation, so what would be best to use? Thanks so much!
The answer is... DON'T!
Use a library, such as this one
([^<]*)
This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle, the slightest change to the markup and it won't match. If report_drilldown isn't meant to be absolute, replace it with [^']*, and/or capture both it and the number if you need.
If you need something that parses HTML, then it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know something similar for C#. (Anyone know of a similar tag soup parsing library for C#?)
I agree regex might not be the best way to parse this, but using backreference it's easily done:
<(?<tag>\w*)(?:.*)>(?<text>.*)</\k<tag>>
Where tag and text are named capture groups.
hat-tip: expresso library
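For comparison, the same idea in Python syntax, where named groups are written (?P<name>...) and the backreference is (?P=name). This sketch is not a literal port: the greedy .* parts are tightened to [^>]* and .*? so that one match does not swallow several tags, and the sample href value in the usage note is made up.

```python
import re

# tag  = the element name
# text = everything up to the first closing tag with the same name
TAG_TEXT = re.compile(r"<(?P<tag>\w+)[^>]*>(?P<text>.*?)</(?P=tag)>", re.DOTALL)

def first_tag_text(fragment):
    """Return (tag, text) for the first element found, or None."""
    m = TAG_TEXT.search(fragment)
    return (m.group("tag"), m.group("text")) if m else None
```

For example, first_tag_text("<a href='page.html?id=1'>text</a>") gives ("a", "text").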
<a href\=\"[^\x00]*?\">
should get you the opening tag.
<\/a>
will give you the closing tag. Just extract out what is in between. Untested though.
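Since the answer is marked as untested, here is one way to try the two patterns in Python. The helper text_between and the sample markup in the usage note are hypothetical; the patterns are the ones above, minus the backslashes that Python does not need.

```python
import re

OPEN_A = re.compile(r'<a href="[^\x00]*?">')   # opening tag, lazy match
CLOSE_A = re.compile(r"</a>")                  # closing tag

def text_between(fragment):
    """Return the text between the first <a href="..."> and the next </a>."""
    opening = OPEN_A.search(fragment)
    if opening is None:
        return None
    # Look for the closing tag only after the end of the opening tag.
    closing = CLOSE_A.search(fragment, opening.end())
    if closing is None:
        return None
    return fragment[opening.end():closing.start()]
```

With text_between('<p><a href="report.html">text</a></p>') this returns "text". Note that the opening pattern only handles double-quoted href values with href as the first attribute.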