find all occurrences between tag - html

I am trying to find all <br> instances in the following string, but only in the "categories" class:
<td>bla<br>bla</td><td class="categories">cat1<br>cat2<br>cat3</td>
I am really new to regex. This is what I have tried so far, but it only finds the first <br> after cat1 and includes everything in front of it in the match as well:
(?>categories">).*?<br>
EDIT: I want to find all <br> occurrences so I can replace them with a comma. For the moment I'm using a text editor (Sublime Text) to achieve this.

Why would you want to use regex? What are you trying to validate?
http://en.wikipedia.org/wiki/Regular_expression
If you want to find HTML elements by their class, then you want to look into JavaScript or jQuery
http://api.jquery.com/class-selector/
Hope that helps a bit
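If a script is an option instead of the editor, the replacement can be sketched in Python. This is only a minimal sketch under assumptions: it assumes the cell is written exactly as <td class="categories">, that cells are not nested, and the names CELL and replace_breaks are made up for illustration. The idea is to match the whole categories cell first and then replace the <br> tags only inside that match.

```python
import re

html = '<td>bla<br>bla</td><td class="categories">cat1<br>cat2<br>cat3</td>'

# Match each categories cell. The body is captured lazily, so the match
# stops at the first closing </td> after the opening tag.
CELL = re.compile(r'(<td class="categories">)(.*?)(</td>)', re.DOTALL)

def replace_breaks(match):
    opening, body, closing = match.groups()
    # Only <br> tags inside the matched cell are replaced.
    return opening + re.sub(r'<br\s*/?>', ',', body) + closing

result = CELL.sub(replace_breaks, html)
print(result)
```

The <br> in the first cell is left alone because it is never part of a CELL match.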

Related

Using Code <> As Actual Text

Really having trouble with this and can't find any results on it.
I want my HTML text to use angle brackets (<>) around some of my text.
Specifically for a navbar menu item. But I can't seem to build it without the text being treated as an actual div.
I want it to say "< Dev>" without the quotes or spaces, but when I take the quotes/spaces away it gets treated as a div. How do I keep the entire message "< Dev>" without turning it into a div item?
E.g:
<p> Welcome to my <Dev> portfolio</p>
Also what is the term used to override reserved code functions as text? Will help me research answers for other issues too. Like when using & as text and not as code.
Thanks for the assistance!
You'll want to use <p> Welcome to my &lt;Dev&gt; portfolio</p>
You can find a list of HTML character codes Here
Try using the HTML numeric character references for those characters instead.
Welcome to my &#60;Dev&#62; portfolio
Sorry, the forum rendered those references as the actual characters when I first posted them. Each one is an &, then a #, then the character's number, then a semicolon.
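On the terminology question: this is usually called "escaping", and the codes themselves are called character references (or entities). As a rough illustration, assuming the page text is produced from a script, Python's standard html module does the escaping for you. The sample string is made up for the example.

```python
import html

text = "Welcome to my <Dev> portfolio & more"

# html.escape replaces &, < and > (and quotes) with character references,
# so the browser shows them as text instead of reading them as markup.
escaped = html.escape(text)
print(escaped)

# html.unescape goes the other way and also understands numeric references.
print(html.unescape("&#60;Dev&#62;"))
```

The same applies to & itself, which is written as &amp; when it should appear as text.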

ruby tags for Sphinx/rst

I create HTML documents from reST-formatted text with the help of Sphinx. I need to display some Japanese words with furigana (small characters above the words).
I'd like to produce HTML that displays the furigana using the <ruby> tag.
I can't figure out how to get this result. I tried to:
insert raw HTML code with the .. raw:: html directive but it breaks my line into several paragraphs.
use the :superscript: directive but the text in furigana is written beside the text, not above.
use the :role: directive to create a link between the text and a CSS class of my own. But the :role: directive can only be applied to a segment of text, not to TWO segments as required by the furiganas (=text + text above it).
Any ideas?
As far as I know, there's no simple way to get the expected result.
For a specific project, I chose not to generate the furigana with Sphinx but to modify the .html files afterwards. See the add_ons/add_furiganas.py script and the result here. Yes, it's a quick-and-dirty trick :(
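To give an idea of what such a post-processing step could look like, here is a minimal sketch. It is not the add_furiganas.py script mentioned above. It assumes a hypothetical marker syntax, {base|reading}, typed in the .rst source and passed through unchanged into the generated HTML; the names MARKER and add_furigana are also made up.

```python
import re

# Hypothetical marker written in the .rst source: {base|reading}
MARKER = re.compile(r'\{([^{}|]+)\|([^{}|]+)\}')

def add_furigana(html_text):
    # Turn every marker into a <ruby> element with the reading in <rt>.
    return MARKER.sub(r'<ruby>\1<rt>\2</rt></ruby>', html_text)

page = '<p>{漢字|かんじ} を勉強します。</p>'
print(add_furigana(page))
```

In practice the function would be run over each generated .html file after the Sphinx build. Curly braces that appear in code samples would need to be excluded, which this sketch does not attempt.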

Find and replace a lot of <a> tags, but the url are different? Is there some way to do it?

I have a page that I need to fix..
There are thousands of <a> tags like <a href="kl1j23l123l12j3"> that I need to get rid of, but the problem is that each <a> tag has a different URL in its href attribute. So I am wondering if there is some advanced way to get rid of the whole anchor/link but keep the link text, as that would save me a whole lot of time.
Example
Input : StackOverflow.com
Output: StackOverflow.com
Thanks.
Maybe this is a solution using JavaScript and jQuery. It also can be tweaked to only get the values of links that do not start with http. I wasn't quite sure whether this is relevant according to the links in the question.
// get all links within the document
var links = $('a');
// simply get all link texts
var x = links.text();
// or just get the texts of links like 'kl1j23l123l12j3', whose href does not start with 'http'
var x = links.not('[href^="http"]').text();
Here's a demo: http://jsfiddle.net/rg3ET/
Instead of collecting them all together into one variable ("x"), you could of course loop through them and output them individually.
The following would work under the assumption that each anchor tag is on its own line.
Example:
asdf
<div>
</div>
asdf2
Notepad++ has a regex find and replace feature, which may work for your need.
1. Replace all </a> tags with nothing.
2. Use a regex to find all <a href="anything"> and replace with nothing.
For step 2 I used the regex <a .*>. For this to work properly, there should be only one > character per line. Otherwise the regex will make the longest possible match, possibly including a bunch of other tags. This is why I said the procedure only works for anchor tags that are on their own lines.
The steps in Notepad++ (again, this only works when each anchor tag is on its own line):
From Notepad++ menu: Search > Replace
Select Regular expression
In Find what box, put <a .*>
Click Replace All
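If the file can be run through a script instead, the one-tag-per-line restriction can be avoided by matching [^>]* rather than .* inside the tag. A minimal Python sketch follows. The names ANCHOR_TAG and strip_anchors and the sample string are made up, and it assumes no attribute value contains a > character.

```python
import re

# Matches opening <a ...> tags and closing </a> tags. The \b keeps tags
# such as <abbr> from matching, and [^>]* cannot run past the end of the tag.
ANCHOR_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)

def strip_anchors(html_text):
    # Remove the tags but keep whatever text was between them.
    return ANCHOR_TAG.sub('', html_text)

sample = ('Visit <a href="kl1j23l123l12j3">StackOverflow.com</a> or '
          '<a href="abc">this page</a>, see <abbr>HTML</abbr>.')
print(strip_anchors(sample))
```

The same pattern should also work in an editor's regex replace, since it does not depend on line breaks.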

regex: selecting everything but img tag

I'm trying to select some text using regular expressions leaving all img tags intact.
I've found the following code that selects all img tags:
/<img[^>]+>/g
but actually having a text like:
This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
this is a link
using the code above will select the img tag only
/<img[^>]+>/g #--> using this code will result in:
<img src="http://example.com/image.png" alt=""/>
but I would like to use some regex that select everything but the image like:
/magical regex/g # --> results in:
This is an untagged text.
<p>this is my paragraph text</p>
this is a link
I've also found this code:
/<(?!img)[^>]+>/g
which selects all tags except the img one. But in some cases I will have untagged text or text between tags, so this won't work for my case. :(
Is there any way to do it?
Sorry, I'm really new to regular expressions, so I've been struggling for a few days trying to make it work, but I can't.
Thanks in advance
UPDATE:
OK, for those thinking I want to parse the HTML: sorry, I don't. I just want to select text.
Another thing: I'm not using any specific language. I'm using Yahoo Pipes, which only provides regex and some string tools to accomplish the job, and it doesn't involve any programming code.
For better understanding, here is how the regex module works in Yahoo Pipes:
http://pipes.yahoo.com/pipes/docs?doc=operators#Regex
UPDATE 2
Fortunately I was able to strip the text around the img tag, but on a step-by-step basis as @Blixt recommended, like:
<(?!img)[^>]+> , replace with "" #-> strips out every tag that is not img
(?s)^[^<]*(.*), replace with $1 #-> removes all the text before the img tag
(?s)^([^>]+>).*, replace with $1 #-> removes all the text after the img tag
The problem with this is that it will only catch the first img tag, and then I would have to catch the others manually by hard-coding them, so I'm still not sure this is the best solution.
The regexp you have to find the image tags can be used with a replace to get what you want.
Assuming you are using PHP:
$htmlWithoutIMG = preg_replace('/<img[^>]+>/', '', $html); // no g modifier in PHP: preg_replace replaces every match by default
If you are using Javascript:
var htmlWithoutIMG = html.replace(/<img[^>]+>/g, '');
This takes your text, finds the <img> tags and replaces them with nothing, i.e. it deletes them from the text, leaving what you want. I can't recall whether < and > need escaping.
Regular expression matches have a single start and length. This means the result you want is impossible in a single match (since you want the result to end at one point, then continue later).
The closest you can get is to use a regular expression that matches everything from start of string up to start of <img> tag, everything between <img> tags and everything from end of <img> tag to end of string. Then you could get all matches from that regular expression (in your example, there would be two matches).
The above answer is assuming you can't modify the result. If you can modify the result, simply replace the <img> tags with the empty string to get your result.
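Both options can be sketched in Python, in case a scripting step is available (Yahoo Pipes itself only offers the regex module, so this is just an illustration). The sample text is a simplified version of the one in the question, and the variable names are made up. Substituting removes the <img> tags. Splitting on the same pattern returns the separate pieces of text around them.

```python
import re

IMG_TAG = re.compile(r'<img[^>]+>', re.IGNORECASE)

text = (
    'This is an untagged text.\n'
    '<p>this is my paragraph text</p>\n'
    '<img src="http://example.com/image.png" alt=""/>\n'
    'this is a link'
)

# Option 1: modify the text by deleting every <img> tag.
without_images = IMG_TAG.sub('', text)

# Option 2: keep the text as it is and collect the pieces between <img> tags.
segments = IMG_TAG.split(text)

print(without_images)
print(len(segments))
```

With one image in the text there are two segments, which matches the "two matches" described above.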

RegEx: Link Twitter-Name Mentions to Twitter in HTML

I want to do THIS, just a little bit more complicated:
Let's say I have an HTML input:
Don't break!
Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c.
You can't reach me at blam4c@example.com.
Is there a good RegEx to replace the Twitter username mentions with links to Twitter, but leave @example (in the email address at the bottom) AND @test (in the link title, i.e. inside HTML tags) alone?
It probably should also try not to add links inside existing links, i.e. not break this:
Hello @someone there!
My current attempt is to add ">" at the beginning of the string, then use this RegEx:
Search: '/>([^<]*\s)\@([a-z0-9_]+)([\s,.!?])/i'
Replace: '>\1@\2\3'
Then remove the ">" I added in step 1.
But that won't match anything but the "@blam4c". I know WHY it does so, that's not the problem.
I would like to find a solution that finds and replaces all twitter user name mentions without destroying the HTML. Maybe it might even be better to code this without RegEx?
First, keep the angle brackets out of your regexps.
Use an HTML parser and XPath to select the text nodes you are interested in processing, then consider a regexp for matching only the @refs in those nodes.
I'll leave it to other people to try and give a specific answer to the regex part.
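A rough sketch of that idea in Python, using the standard library's html.parser instead of XPath (a swapped-in technique, not what the answer names). It is written under assumptions: mentions start with @ as they do on Twitter, the link target https://twitter.com/<name> is assumed, and the names MENTION, MentionLinker and link_mentions are made up. The parser copies tags through unchanged and only rewrites text nodes that are not inside an existing <a> element.

```python
import re
from html.parser import HTMLParser

# An @name that is not preceded by a word character or another @,
# so the "@example" in blam4c@example.com is skipped.
MENTION = re.compile(r'(?<![\w@])@(\w+)')

class MentionLinker(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.link_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.link_depth:
            self.link_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if self.link_depth == 0:  # never add links inside existing links
            data = MENTION.sub(r'<a href="https://twitter.com/\1">@\1</a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

def link_mentions(html_text):
    parser = MentionLinker()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.out)

sample = ('<p><a title="@test" href="#">Don\'t break!</a> Users: @codinghorror '
          'and @blam4c. Mail: blam4c@example.com. '
          '<a href="#">Hello @someone there!</a></p>')
print(link_mentions(sample))
```

Because attribute values never reach handle_data, the @test in the title attribute is left alone without any extra regex work. Text inside script or style elements is not treated specially in this sketch.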
I agree with ddaa, there's almost no sane way to attack this without stripping the html links out first.
Presumably you'd be starting out with an actual Twitter message, which cannot by definition include any manually entered hyperlinks.
For example, here's how I found this question (the link resolves to this question so don't bother clicking it!)
Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c. http://bit.ly/2phvZ1
In this case, it's easy:
var msg = "Some Twitter Users: @codinghorror, @spolsky, @jarrod_dixon and @blam4c. http://bit.ly/2phvZ1";
var html = Regex.Replace(msg, @"(?<!\w)(@(\w+))",
"$1");
(this might need some tweaking, I'd like to test it against a corpus, but it seems correct for the average Twitter message)
As for your more complicated cases (with HTML markup embedded in the tweets), I have no idea. Way too hard for me.
This regexp might work a bit better: /\B\@([\w\-]+)/gim
Here's a jsFiddle example of it in action: http://jsfiddle.net/2TQsx/4/