Regular expressions - finding and comparing the first instance of a word - html

I am currently trying to write a regular expression to pull links out of a page I have. The problem is that the links need to be pulled out only if the item has stock. This is an outline of what I have, code-wise:
<td class="prd-details">
<a href="somepage">
...
<span class="collect unavailable">
...
</td>
<td class="prd-details">
<a href="somepage">
...
<span class="collect available">
...
</td>
What I would like to do is pull out the links only if 'collect available' is in the tag. I have tried to do this with the regular expression:
(?s)prd-details[^=]+="([^"]+)" .+?collect{1}[^\s]+ available
However, on running it, it finds the first 'prd-details' class and keeps going until it reaches 'collect available', thereby capturing the wrong link. I thought that specifying {1} after the word 'collect' would make it use only the first instance of the word it finds, but apparently I'm wrong. I've been trying different things such as positive and negative lookaheads, but I can't seem to get anything to work.
Might anyone be able to help me with this issue?
Thanks,
Dan

You need an expression that knows "collect unavailable" is junk. You should be able to use a negative lookahead with your wildcard after the link capture. Something like:
prd-details[^=]+="([^"]+)"(.(?!collect un))+?collect available
This will match any character after the link capture that isn't followed by "collect un". This should stop the match from running through the "collect unavailable" chunk on its way to a later "collect available".
I tested in C# treating the text as a single line. You may need a slightly different syntax and options depending on your language and regex library.
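For illustration only, here is a rough Perl sketch of the same lookahead idea (the sample HTML, variable names, and the switch to a non-capturing group are assumptions on my part, not part of the answer above):
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: sample markup standing in for the real page.
my $html = <<'HTML';
<td class="prd-details">
<a href="unavailable-page">
<span class="collect unavailable">
</td>
<td class="prd-details">
<a href="available-page">
<span class="collect available">
</td>
HTML

# Same pattern as above, with /s so "." can cross newlines.
while ($html =~ /prd-details[^=]+="([^"]+)"(?:.(?!collect un))+?collect available/sg) {
    print "$1\n";    # prints only "available-page"
}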

If you insist on doing this with regex, I recommend a 2-step split-then-check approach:
First, split into each prd-details.
Then, within each prd-details, see if it contains collect available
If yes, then pull out the href
This is easier than trying to do everything in one step. Easier to read, write, and maintain.
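A rough Perl sketch of that two-step approach (the sample markup and the exact split pattern are assumptions, not part of the answer):
#!/usr/bin/perl
use strict;
use warnings;

my $html = do { local $/; <DATA> };    # slurp the sample markup below

# 1. Split the page into individual prd-details chunks.
for my $chunk (split /(?=<td class="prd-details">)/, $html) {
    # 2. Keep only chunks that mention "collect available".
    next unless $chunk =~ /collect available/;
    # 3. Pull the href out of the chunk.
    print "$1\n" if $chunk =~ /<a href="([^"]+)"/;
}

__DATA__
<td class="prd-details">
<a href="unavailable-page">
<span class="collect unavailable">
</td>
<td class="prd-details">
<a href="available-page">
<span class="collect available">
</td>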

Related

Simple macros for HTML

My html file contains in many places the code
&nbsp;&nbsp;&nbsp;
It is too short, and it doesn't really make sense to replace it with code like
<span class="three-spaces"></span>
I would like to replace it with something like
##TS##
or
%%TS%%
and the file should start with something like:
SET TS = "&nbsp;&nbsp;&nbsp;"
Is there any way to write the HTML this way? I am not looking for compiling a source file into HTML; I am looking for a solution that allows writing macros directly in the HTML files.
Later edit: I'm coming with another example:
I also need to transform
lnk(http://www.example.com)
into
<a target="_blank" href="http://www.example.com">http://www.example.com</a>
Instead of telling him WHY he should not do something, how about telling him HOW he could do it? Maybe his example is not an appropriate need for it, but there are other situations where being able to create a macro would be nice.
For example... I have an HTML page that I'm working on that deals with unit conversions, and quite often I'm having to type things like "cm/in" as "<sup>cm</sup>/<sub>in</sub>", or for volumes "cu-cm/cu-in" as "<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>". It would be really nice from a typing and readability standpoint if I could create macros that were just typed as %%cm-per-in%%, %%cc-per-cu-in%%, or something like that.
So, the line in the 'sed' file might look like this:
s/%%cc-per-cu-in%%/<sup>cm<sup>3<\/sup><\/sup>\/<sub>in<sup>3<\/sup><\/sub>/g
Since the "/" is a field separator for the substitute command, you need to explicitly quote it with the backslash character ("\") within the replacement portion of the substitute command.
The way that I have handled things like this in the past was to either write my own preprocessor to make the changes or if the "sed" utility was available, I would use it. So for this sort of thing, I would basically have a "pre-HTML" file that I edited and after running it through "sed" or the preprocessor, it would generate an HTML file that I could copy to the web server.
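As a rough illustration of that kind of home-grown preprocessor, here is a minimal Perl sketch (the macro table, the %%name%% syntax, and the file names are assumptions, not the author's actual tool):
#!/usr/bin/perl
use strict;
use warnings;

# Minimal preprocessor sketch: expand %%name%% macros in a pre-HTML file.
my %macros = (
    'cm-per-in'    => '<sup>cm</sup>/<sub>in</sub>',
    'cc-per-cu-in' => '<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>',
);

while (my $line = <>) {
    $line =~ s{%%([\w-]+)%%}{ exists $macros{$1} ? $macros{$1} : '<B>***MACRO-ERROR***</B>' }ge;
    print $line;
}

# usage (hypothetical file names): perl prehtml.pl page.pre.html > page.html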
Now, you could create a javascript function that would do the text substitution for you, but in my opinion, it is not as nice looking as an actual preprocessor macro substitution. For example, to do what I was doing in the sed script, I would need to create a function that would take as a parameter the short form "nickname" for the longer HTML that would be generated. For example:
function S( x )
{
    if (x == "cc-per-cu-in") {
        document.write("<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>");
    } else if (x == "cm-per-in") {
        document.write("<sup>cm</sup>/<sub>in</sub>");
    } else {
        document.write("<B>***MACRO-ERROR***</B>");
    }
}
And then use it like this:
This is a test of cc-per-cu-in <SCRIPT>S("cc-per-cu-in");</SCRIPT> and
cm-per-in <SCRIPT>S("cm-per-in");</SCRIPT> as an alternative to sed.
This is a test of an error <SCRIPT>S("cc-per-in");</SCRIPT> for a
missing macro substitution.
This generates the following:
This is a test of cc-per-cu-in cm3/in3
and cm-per-in cm/in as an alternative to sed. This is a test of an error ***MACRO-ERROR*** for a missing macro substitution.
Yeah, it works, but it is not as readable as if you used a 'sed' substitution.
So, decide for yourself... Which is more readable...
This...
This is a test of cc-per-cu-in <SCRIPT>S("cc-per-cu-in");</SCRIPT> and
cm-per-in <SCRIPT>S("cm-per-in");</SCRIPT> as an alternative to sed.
Or this...
This is a test of cc-per-cu-in %%cc-per-cu-in%% and
cm-per-in %%cm-per-in%% as an alternative to sed.
Personally, I think the second example is more readable and worth the extra trouble to have pre-HTML files that get run through sed to generate the actual HTML files... But, as the saying goes, "Your mileage may vary"...
EDITED: One more thing that I forgot about in the initial post that I find useful when using a pre-processor for the HTML files -- Timestamping the file... Often I'll have a small timestamp placed on a page that says the last time it was modified. Instead of manually editing the timestamp each time, I can have a macro (such as "%%DATE%%", "%%TIME%%", "%%DATETIME%%") that gets converted to my preferred date/time format and put in the file.
Since my background is in 'C' and UNIX, if I can't find a way to do something in HTML, I'll often just use one of the command line tools under UNIX or write a small 'C' program to do it. My HTML editing is always in 'vi' (or 'vim' on the PC) and I find that I am often creating tables for alignment of various portions of the HTML page. I got tired of typing all the TABLE, TR, and TD tags, so I created a simple 'C' program called 'table' that I can execute via the '!}' command in 'vi', similar to how you execute the 'fmt' command in 'vi'. It takes as parameters the number of rows & columns to create, whether the column cells are to be split across two lines, how many spaces to indent the tags, and the column widths and generates an appropriately indented TABLE tag structure. Just a simple utility, but saves on the typing.
Instead of typing this:
<TABLE>
    <TR>
        <TD width=200>
        </TD>
        <TD width=300>
        </TD>
    </TR>
    <TR>
        <TD>
        </TD>
        <TD>
        </TD>
    </TR>
    <TR>
        <TD>
        </TD>
        <TD>
        </TD>
    </TR>
</TABLE>
I can type this:
!}table -r 3 -c 2 -split -w 200 300
Now, with respect to the portion of the original question about being able to create a macro to do HTML links, that is also possible using 'sed' as a pre-processor for the HTML files. Let's say that you wanted to change:
%%lnk(www.stackoverflow.com)
to:
<a href="www.stackoverflow.com">www.stackoverflow.com</a>
you could create this line in the sed script file:
s/%%lnk(\(.*\))/<a href="\1">\1<\/a>/g
'sed' uses regular expressions and they are not what you might call 'pretty', but they are powerful if you know what you are doing.
One slight problem with this example is that it requires the macro to be on a single line (i.e. you cannot split the macro across lines) and if you call the macro multiple times in a single line, you get a result that you might not be expecting. Instead of doing the macro substitution multiple times, it assumes the argument to the macro starts with the first '(' of the first macro invocation and ends with the last ')' of the last macro invocation. I'm not a sed regular expression expert, so I haven't figured out how to fix this yet. For the multiple line portion though, a possible fix would be to replace all the LF characters in the file with some other special character that would not normally be used, run sed on that result, and then convert the special characters back to LF characters. Of course, the problem there is that the entire file would be a single line and if you are invoking the macro, it is going to have the results that I described above. I suspect awk would not have that problem, but I have never had a need to learn awk.
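For what it's worth, both problems can be sidestepped by restricting the argument match to non-parenthesis characters and slurping the whole file; a hedged Perl one-liner sketch of that idea (the file names are placeholders):
# [^)]* keeps each macro invocation separate, and -0777 slurps the file
# so a macro whose argument is split across lines still matches.
perl -0777 -pe 's/%%lnk\(([^)]*)\)/<a href="$1">$1<\/a>/g' page.pre.html > page.html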
Upon further reflection, I think there might be an easier solution to both the multi-line and multiple invocation of a macro on a single line -- the 'm4' macro preprocessor that comes with the 'C' compiler (e.g. gcc). I haven't tested it much to see what the downside might be, but it seems to work well enough for the tests that I have performed. You would define a macro as such in your pre-HTML file:
define(`LNK', `<a href="$1">$1</a>')
And yeah, it does use the backwards single quote character to start the text string and the normal single quote character to end the text string.
The only problem that I've found so far is that for the macro names, it only allows the characters 'A'-'Z', 'a'-'z', '0'-'9', and '_' (underscore). Since I prefer to type '-' instead of '_', that is a definite disadvantage to me.
Technically inline JavaScript with a <script> tag could do what you are asking. You could even look into the many templating solutions available via JavaScript libraries.
That would not actually provide any benefit, though. JavaScript changes what is ultimately displayed, not the file itself. Since your use case does not change the display it wouldn't actually be useful.
It would be more efficient to consider why the &nbsp;&nbsp;&nbsp; is appearing in the first place and fix that.
This …
My html file contains in many places the code &nbsp;&nbsp;&nbsp;
… is actually what is wrong in your file!
&nbsp; is not meant to be used for layout purposes; you should fix that and use CSS instead to lay things out correctly.
&nbsp; is meant to stop a line break between words that are separated by a space. For example, numbers and their unit: 5 liters can end up with 5 at the end of one line and liters on the next line.
To keep that together you would use 5&nbsp;liters. That's what you use &nbsp; for and nothing else, especially not for layout purposes.
To still answer your question:
HTML is a markup language not a programming language. That means it is descriptive/static and not functional/dynamic. If you try to generate HTML dynamically you would need to use something like PHP or JavaScript.
Just an observation from a novice. If everyone did as purists suggest (i.e.-the right way), then the web would still be using the same coding conventions it was using 30 years ago. People do things, innovate, and create new ways, then new standards, and deprecate others all the time. Just because someone says "spaces are only for separating words...and nothing else" is silly. For many, many years, when people typed letters, they used one space between words, and two spaces between end punctuation and the next sentence. That changed...yeah, things change. There is absolutely nothing wrong with using spaces and non-breaking spaces in ways which assist layout. It is neither useful nor elegant for someone to use a long span with style over and over and over, rather than simple spaces. You can think it is, and your club of do it right folks might even agree. But...although "right", they are also being rather silly about it. Question: Will a page with 3 non-breaking spaces validate? Interesting.

Regular expressions to match several characters Editpad

I have a wordpress export with older posts which are going to be imported into a different installation. But I have an issue where there is a part of the content that needs to be converted into a different type of content in the other installation.
The original code is like this:
<a href="000d3c4c.mp3">000d3c4c</a>]]></content:encoded>
And I need to make it into this
</content:encoded><wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
I'm using EditPad Pro to search through the file, and to avoid spending several hours doing this change manually, I want to use the extensive search-and-replace feature in EditPad, but I have some issues trying to match this. Here is what I have so far.
The first search and replace will work no matter what, because I can change
<a href="000
into
</content:encoded>
<wp:postmeta>
<wp:meta_key>
<![CDATA[mkd_post_audio_link_meta]]></wp:meta_key><wp:meta_value>><![CDATA[000d2aa8.mp3]]></wp:meta_value>
</wp:postmeta>
But I struggle with the next part.
How can I change everything after .mp3 to this
]]></wp:meta_value>
</wp:postmeta>
To make it brief: I want to replace all occurrences of
">000d3c4c</a>]]></content:encoded>
with
]]></wp:meta_value>
</wp:postmeta>
The 000d3c4c part is different for each occurrence of the mp3 link.
Match this
<a[^>]*>([^<]+)
and replace by this
</content:encoded>
<wp:postmeta>
<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>
<wp:meta_value><![CDATA[$1.mp3]]></wp:meta_value>
</wp:postmeta>
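If you would rather run the change outside the editor, the same match and replacement can be scripted; a rough Perl one-liner sketch (the file name is a placeholder, and the pattern is exactly the one above, so anything after the link text, such as the closing </a> and ]]>, is left in place just as in the EditPad version):
perl -i.bak -pe 's{<a[^>]*>([^<]+)}{</content:encoded>\n<wp:postmeta>\n<wp:meta_key><![CDATA[mkd_post_audio_link_meta]]></wp:meta_key>\n<wp:meta_value><![CDATA[$1.mp3]]></wp:meta_value>\n</wp:postmeta>}g' export.xml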

Capybara won't find a button by its "name" attribute

A W3C-validated HTML 5 web page contains this working, simple button inside a login form.
<input data-disable-with="Signing in, please wait&hellip;"
name="commit" type="submit" value="Sign in" />
I'm writing a largely pointless test :-) in a Rails 3.2.17 application that's just to get the hang of Capybara and I've already got completely stuck Googling, reading documentation and reading source code to the test framework, with no joy - attempting to find this button by its name (i.e. "commit") fails.
click_button("commit")
find_button("commit")
Both result in Capybara::ElementNotFound: Unable to find button "commit". If I use the visible button text of Sign in then the element is found, i.e. these:
click_button("Sign in")
find_button("Sign in")
...both work fine, so it would appear that the XML parser isn't having any trouble finding the element.
Documentation for click_button says that the locator works on "id, text or value", with "text" being meaningless for an input element like this (the visible text is taken from the value attribute), but relevant perhaps for button elements. So, we might expect that to fail, though if we view the code via the documentation, find that it calls down to find in the same way as find_button. Yet find_button is documented differently; it says it locates by "id, name or value". So sadly, we know from this that the documentation is broken because it says two different things for what turns out to be an identical call at the back end.
Either way, the element isn't found by name, and that means the lower level find call isn't searching name attributes as far as I can see. This means Capybara (2.2.1, on Nokogiri 1.6.1) is rather broken in that respect. How come nobody has noticed? I've Googled for ages and it doesn't seem to come up. I seem to be rather missing the point :-)
Why don't I just search for the English text in the button, you might ask? Because of internationalisation. This old, Rails 1 -> 2 -> 3 upgraded app has some I18n parts and other static text parts. I don't want to be forced to put I18n into any view that Capybara tests, just so I can have the test use I18n.t() to ensure a match despite different languages or locale file updates. Likewise, it would clearly be very stupid in 2014 to write hard-coded English strings into my tests.
That's why we have names and IDs and such... The unique (in theory!) identifiers that are machine-read, not human-read.
I could hack up something that CSS-selected by "type=submit" but seriously, why isn't Capybara searching the name attribute when its documentation says it does, and why does the documentation disagree on what attributes are searched on two methods that call down to exactly the same back-end implementation with exactly the same parameters?
TIA :)
It turns out the docs are misleading for both calls, as neither look at the attributes listed. It's also clearly very confusing what exactly a "button" means, since a couple of people herein seemed to think it literally only meant an HTML button element but that's not the case.
If you view the source for the documentation of, say, click_button:
https://github.com/jnicklas/capybara/blob/a94dfbc4d07dcfe53bbea334f7f47f584737a0c0/lib/capybara/node/actions.rb#L36
...you will see that this just calls (as I've mentioned elsewhere) to find with a type of :button, which in turn passes through to Capybara's Query engine which, in turn, ends up just using the standard internal selection mechanism to find things. It's quite elegant; in the same way that an external client can add their own custom selectors to making finding things more convenient:
http://rubydoc.info/github/jnicklas/capybara/master/Capybara#add_selector-class_method
...so Capybara adds its own selectors internally, including, importantly, :button:
https://github.com/jnicklas/capybara/blob/a94dfbc4d07dcfe53bbea334f7f47f584737a0c0/lib/capybara/selector.rb#L133
It's not done by any special case magic, just some predefined custom selectors. Thus, if you've been wondering what custom selectors are available from the get-go in Capybara, that's the file to read (it's probably buried in the docs too but I've not found the list myself yet).
Here, we see that the button code is actually calling XPath::HTML.button, which is a different chunk of code in a different repository, with this documentation:
http://rdoc.info/github/jnicklas/xpath/XPath/HTML#button-instance_method
...which is at the time of writing slightly out of date with respect to the code, since the code shows quite a lot more stuff being recognised, including input types of reset and button (i.e. <input type="button"...> rather than <button...>...</button>, though the latter is also included of course).
https://github.com/jnicklas/xpath/blob/59badfa50d645ac64c70fc6a0c2f7fe826999a1f/lib/xpath/html.rb#L22
We can also see in this code that the finder method really only finds by id, value and title - i.e. not by "text" and not by name either.
So assuming XPath is behaving as intended, though it's not clear from docs, we can see that Capybara isn't documenting itself correctly but probably ought to make the link down to XPath APIs for more information, to avoid the current duplication of information and the problems this can cause for both maintainers and API clients.
In the mean time, I've filed this issue:
https://github.com/jnicklas/capybara/issues/1267
You can also use CSS selectors, which are the default Capybara locators. People say they are faster.
find('[name=commit]').click
Capybara does not look at the name attribute in its finders :(
You can use an XPath selector if you want:
find(:xpath, "//input[contains(@name, 'commit')]").click()
If anyone wants, it is possible to add (quite easily) a find-by-name selector. In order to do so:
Add the following code to test/test_helper.rb (for minitest)
Capybara.add_selector(:name) do
xpath { |name| XPath.descendant[XPath.attr(:name).contains(name)] }
end
Use it
Now in your tests you can use the following selector:
find(:name, 'part_of_the_name_attribute')
It will find every element whose name attribute contains the searched value.
Example
find(:name, 'user')
This will find elements (element could be of any type):
<select name='user_name'>
<input name='name_of_user'>
<textarea name='some_user_info'>
You can use this selector to find a button on a page with RSpec and Capybara:
expect(page).to have_selector(:link_or_button, "Button text")
Check your gem dependencies. RSpec 3 or higher works with gem 'rspec-rails', '~> 3.7.1'; then the Capybara version must be gem 'capybara', '~> 2.18.0' and Poltergeist should be gem 'poltergeist', '~> 1.17.0'.

How can I extract the HREF value from an HTML link?

My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is to have my Perl script automatically extract the string inside the <A HREF="">.
As I am very new to regex, I want to ask if my regex is a badly formed one? If so can someone provide some suggestion to make it look nicer?
Secondly, I do not know why my second output is "><HR>"; I thought the expected behavior was that output 2 would be skipped, since that line does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy" - it will by default match as much as it can. Alternatives would be using the non-greedy form, .*?, or being a bit more exacting about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.
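As a rough sketch of the HTML::Parser route (the file name and the choice to collect only <a href> values are assumptions):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @hrefs;
my $p = HTML::Parser->new(
    api_version => 3,
    # Collect the href attribute of every <a> start tag.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        push @hrefs, $attr->{href} if $tag eq 'a' && defined $attr->{href};
    }, 'tagname, attr' ],
);
$p->parse_file('listing.html');    # hypothetical input file
print "$_\n" for @hrefs;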
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath
It gives you the power of XPath in non-well-formed HTML.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
When trying to match against HTML (or XML) with a regex you have to be careful about using ".*". You rarely ever want ".*", because the star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class specifier [^"]* to match all characters except a quote. This will match up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.
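A minimal sketch applying that pattern to the two sample lines (the script scaffolding around it is an assumption):
#!/usr/bin/perl
use strict;
use warnings;

# Print the HREF value from any line that actually has one.
while (my $line = <DATA>) {
    if ($line =~ /HREF="([^"]*)"[^>]*>/i) {
        print "$1\n";    # prints: yahoo.com.jp/
    }
}

__DATA__
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>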

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the <a href="http://example.com/foo">foo</a> link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <a href="http://example.com/foo"><b>foo</b></a> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
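A quick sketch of that substitution in context (the sample string and variable names are assumptions):
#!/usr/bin/perl
use strict;
use warnings;

my $key  = 'foo';
my $html = 'This is the <a href="http://example.com/foo">foo</a> link';

# Bold the search term only where it appears outside of tags.
$html =~ s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g;
print "$html\n";    # the "foo" inside the href is left untouched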
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
simple_replace {
my $n = shift;
my $text = $n->textContent;
$text =~ s/foo/<foo>/g;
return XML::LibXML::Text->new($text);
} '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
It matches all words between tags.
To strip out variable-sized content from even nested tags, you can use this regex, which is in effect a mini recursive grammar for the job (note: PCRE engine):
(?<=>)((?:\w+)(?:\s*))(?1)*