Using regex to extract information from a file and need help - html

I have an html file that has a table of information and I'm trying to extract specific columns. The pattern is like this with alternating "TableDarkRow" and "TableLightRow":
'>817338284254611</A></td><td Class='TableDarkRow' NOWRAP> 01/14/2011</td>
I'm trying to extract an array of number and date pairs:
817338284254611
01/14/2011
I tried and came up with this:
>([0-9])+</A>(.*)NOWRAP> ?([0-9]{2}\/[0-9]{2}\/[0-9]{4})
But the (.*) is greedy, so the match spans everything in the document between the first and last occurrences.

Replace the .* with .*? for non-greedy matching. Also note that ([0-9])+ captures only the last digit; use ([0-9]+) to capture the whole number.
Reference: Watch Out for The Greediness!
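As a rough illustration of the difference, here is a minimal Python sketch. The second table row and the surrounding `<td><A href='#'>` markup are made up for the example (the question only shows one fragment), and the pattern uses ([0-9]+) so the whole number is captured.

```python
import re

# Two hypothetical rows modelled on the fragment in the question.
html = (
    "<td><A href='#'>817338284254611</A></td>"
    "<td Class='TableDarkRow' NOWRAP> 01/14/2011</td>"
    "<td><A href='#'>817338284254612</A></td>"
    "<td Class='TableLightRow' NOWRAP> 01/15/2011</td>"
)

DATE = r"([0-9]{2}/[0-9]{2}/[0-9]{4})"

# Greedy .* runs to the LAST "NOWRAP>" in the input.
greedy_pairs = re.findall(r">([0-9]+)</A>.*NOWRAP> ?" + DATE, html)

# Lazy .*? stops at the FIRST "NOWRAP>" after each number.
pairs = re.findall(r">([0-9]+)</A>.*?NOWRAP> ?" + DATE, html)

print(greedy_pairs)
print(pairs)
```

With the greedy version the first number is paired with the last date; the lazy version yields one pair per row.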

Try this one (I haven't tested it):
/[0-9\/ ]+/

You can replace .* with `[A-Za-z'<>=/ \t]+`. Because the class contains no digits, the match cannot run past the next number.

Related

RegEx valid relative urls in href and src links of html

I have this RegEx and have tested it against the below dataset:
RegEx: /(href|src)\=\"(?!(ht|f)tp|www|:|\/\/)(\/)?/g
Dataset:
href="/hello
href="hello/bob
href="new/hello/bob
href="hello/test.com/hello
href="abc.hello.com/hello <-- I want to exclude this type of url
href="www.google.com/hello
href="https://www.google.com
href="http://google.com
href="ftp://www.google.com
href="://google.com
href="//google.com
Here is a demo link with the above inputs:
https://regex101.com/r/1mCFWL/4
The issue I am having is that the 5th test item, abc.hello.com/hello, also matches the RegEx, and I would like to exclude all URLs which contain a .com before a /.
I am trying to do a lookup ahead but have been unable to get this working.
Can anyone help improve the above RegEx to add support to exclude URLs which contain a .com before a /?
EDIT:
A successful match criterion is matching only the first 4 items in the dataset.
You may add a [^"\/]*\.com or [^"\/]*\.com(?![^\/]) alternative to the negative lookahead:
(?:href|src)="(?!(?:ht|f)tp|www|:|\/\/|[^"\/]*\.com)
The (?![^\/]) will require / or end of string if you add that pattern after com.
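To check the suggested pattern against the dataset from the question, a small Python sketch might look like this. It uses Python's `re` module rather than the regex101 flavor from the question, and the variable names are only illustrative.

```python
import re

# Pattern from the answer; "/" needs no escaping in a Python raw string.
RELATIVE_URL = re.compile(r'(?:href|src)="(?!(?:ht|f)tp|www|:|//|[^"/]*\.com)')

dataset = [
    'href="/hello',
    'href="hello/bob',
    'href="new/hello/bob',
    'href="hello/test.com/hello',
    'href="abc.hello.com/hello',
    'href="www.google.com/hello',
    'href="https://www.google.com',
    'href="http://google.com',
    'href="ftp://www.google.com',
    'href="://google.com',
    'href="//google.com',
]

# Keep only the entries the pattern accepts.
matched = [item for item in dataset if RELATIVE_URL.search(item)]
for item in matched:
    print(item)
```

Only the first four entries are accepted: [^"\/]* cannot cross a /, so hello/test.com/hello still passes while abc.hello.com/hello is rejected.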

Select only URLs separated by commas with REGEX

My objective is to put all the URLs between "" so I'm trying to select them without the comma , then I will use the regular expression to do a large search/replace.
My current REGEX: "BigImage":\s(\[(.*)\])
I tried this but it doesn't work: "BigImage":\s(\[([^,]+)\])
"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]
Example: https://regex101.com/r/nE5eV3/30
You can write a regex for the URLs themselves, though I don't know whether they always look the same. For your links, the regex would look like this:
(https?://(www)?[a-zA-Z0-9]*\.[a-zA-Z]{2,4}/[^\.]*\.(jpg|jpeg|png|gif))
This regex matches all of the URLs you posted in your question.
To match the full block:
("BigImage": \[([^,\]]*,?)*\])
If you want to filter the string you posted above, you can use the regex above.
If you post a more complete example of your data, we can help you more.
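If a script is an option instead of an editor search/replace, the quoting can be done in one pass. The following Python sketch is only one possible approach; the function name `quote_big_image_urls` is hypothetical, and it assumes the URLs themselves contain no commas or closing brackets.

```python
import re

def quote_big_image_urls(text):
    """Wrap every comma-separated item of a "BigImage" list in double quotes."""
    def quote_items(match):
        # group(1) is the key prefix, group(2) the raw list contents.
        items = [item.strip() for item in match.group(2).split(",")]
        quoted = ",".join('"' + item + '"' for item in items if item)
        return match.group(1) + "[" + quoted + "]"

    return re.sub(r'("BigImage":\s*)\[([^\]]*)\]', quote_items, text)

sample = ('"BigImage": [http://example.com/1.jpg,'
          'http://example.com/2.jpg,http://example.com/3.jpg]')
result = quote_big_image_urls(sample)
print(result)
```

The regex only locates the bracketed list; splitting on commas is left to ordinary string handling, which avoids having to capture a repeated group.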

Remove/strip specific Html tag and replace using NotePad++

Here is my text:
<h3>#6</h2>
Is he eating a lemon?
</div>
I have a few of these in my articles. The #number is always different, and so is the text.
I want to make this out of it:
<h3>#6 Is he eating a lemon?</h3>
I tried it via regex in notepad++ but I am still very new to this:
My Search:
<h3>.*?</h2>\r\n.*?\r\n\r\n</div>
It now always selects the right part of the text.
What does my replace pattern need to look like to get the output above?
You should modify your original regex to capture the text you want in groups, like this:
<h3>(.*?)</h2>\r\n(.*?)\r\n\r\n</div>
    (   )         (   )
//    ^             ^   These are your capture groups
You can then access these groups with the \1 and \2 tokens respectively.
So your replace pattern would look like:
<h3>\1 \2</h3>
Your search could be <h3>(.*)<\/h2>\r\n(.*)\r\n\r\n<\/div>
and the replace is <h3>$1 $2</h3>, where $1 and $2 represent the strings captured in the parentheses.
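The same capture-and-replace idea can be tried outside Notepad++. This is a minimal Python sketch, assuming Windows (\r\n) line endings as in the search pattern above; Python's `re.sub` uses \1 and \2 for the groups.

```python
import re

# Sample text from the question, with explicit Windows line endings.
text = "<h3>#6</h2>\r\nIs he eating a lemon?\r\n\r\n</div>"

# Group 1 captures the "#number", group 2 the following line of text.
pattern = r"<h3>(.*?)</h2>\r\n(.*?)\r\n\r\n</div>"
fixed = re.sub(pattern, r"<h3>\1 \2</h3>", text)

print(fixed)
```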

How to replace a specific line of HTML code with Regular Expression In Dreamweaver?

I want to replace <whatever>Some Title</whatever> with <something>Some Title</something> using the Find and Replace tool inside of Dreamweaver. How do I do this?
Not a Dreamweaver user, but this simple approach works in my editor (Emacs):
Replace:
<whatever>\(.*\)</whatever>
With:
<something>\1</something>
This is a pretty straightforward approach but it may fall short of your needs. Do some or all of your <whatever> element pairs occupy more than one line of text? Or do you have more than one <whatever> pair on a single line?
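The second case matters because .* is greedy. As a rough demonstration in Python (not Emacs or Dreamweaver, whose regex syntax differs slightly), with a made-up input line:

```python
import re

# Hypothetical line with two <whatever> pairs.
line = "<whatever>First</whatever> and <whatever>Second</whatever>"

# Greedy: (.*) swallows everything up to the LAST closing tag.
greedy = re.sub(r"<whatever>(.*)</whatever>", r"<something>\1</something>", line)

# Lazy: (.*?) stops at the FIRST closing tag, so each pair is handled.
lazy = re.sub(r"<whatever>(.*?)</whatever>", r"<something>\1</something>", line)

print(greedy)
print(lazy)
```

The greedy version leaves the inner `</whatever> and <whatever>` untouched, while the lazy version converts both pairs.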
I guess what you want is to change all your <whatever> tags to <something> tags without changing your text, right?
If so, you want to use find and replace with regular expressions. Find (in source code) <whatever>(.*)</whatever> and replace it with <something>$1</something>. The $1 acts as a variable holding whatever the (.*) part matched in each instance DW finds.
For example, if you want to comment out all instances of a statement like
document.NAMEOFANYFORMONTHEPAGE.WHATEVERNAME.focus();
in a JavaScript file, you would use find:
document\.(.*)\.focus\(\);
and replace it with:
// document.$1.focus();
Don't forget to escape special characters and, please, try a few instances before using Replace All.
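For a quick dry run before touching real files, the same find/replace can be sketched in Python. The form and field names below are invented for the example, and Python writes the backreference as \1 where Dreamweaver uses $1.

```python
import re

# Hypothetical JavaScript source with two focus() calls.
source = (
    "document.loginForm.username.focus();\n"
    "var x = 1;\n"
    "document.searchForm.query.focus();"
)

# The dots and parentheses are escaped so they match literally.
commented = re.sub(r"document\.(.*)\.focus\(\);", r"// document.\1.focus();", source)

print(commented)
```

Since . does not match a newline by default, each statement is matched on its own line.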

How do I check if values between html tags are blank or empty using regular expressions in notepad plus plus

I'm conducting a mass search of files in Notepad++ and I need to determine if there are no values between a set of tags (e.g. <author></author>).
".*?" will search for 0 or more characters (well, most), which is fine. But I'm looking for a set of tags with at least one character between them.
".+?" is similar to the above and does work in notepad++.
I tried the following, which was unsuccessful:
<author>.{0}?</author>
Thank you for any help.
Since you look for something that doesn't exist you don't have to make it that complicated. Simply searching for <author></author> would do the trick, wouldn't it? If you want to include space-characters as "nothing" you could modify it to the following:
<author>\s*?</author>
Output:
<author></author> Match
<author> </author> Match
<author>something</author> No match
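The table above can be reproduced with a short Python check. This is a sketch using Python's `re` module rather than Notepad++'s engine; for this pattern \s* and \s*? accept the same strings, so the plain \s* is used.

```python
import re

# Matches an <author> element that is empty or contains only whitespace.
EMPTY_AUTHOR = re.compile(r"<author>\s*</author>")

samples = [
    "<author></author>",
    "<author> </author>",
    "<author>something</author>",
]

results = {sample: bool(EMPTY_AUTHOR.search(sample)) for sample in samples}
for sample, is_empty in results.items():
    print(sample, "Match" if is_empty else "No match")
```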
I don't understand why you are using the "?" operator; ".+" should yield the result you need.