Extract a html tag that contains a string in openrefine? - html

There is not much to add to the title. It's what i'm trying to do. Any suggestions?
I reviewed the docs at github and googled extensively.
The best i got is:
value.parseHtml().select('p[contains('xyz')]')
It results in a syntax error.

The 'select' syntax is based on the select syntax in Beautiful Soup (http://jsoup.org/cookbook/extracting-data/selector-syntax)
In this case I believe the syntax you need is:
value.parseHtml().select("p:contains(xyz)")
Owen

Perhaps you missed my writeup (and WARNING) on the wiki :) here ?
https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML#extract-html-attributes-text-links-with-integrated-grel-jsoup-commands
WARNING: Make sure to use .toString() suffixes when needed to output strings into Refine cells while working with the built-in HTML GREL commands (the default output is org.jsoup.nodes objects). Otherwise you'll get a preview just fine in the Expression Editor, BUT no data shown in the Refine cells when you apply it!
BTW, How could we make the docs better and where, so that someone doesn't miss this in the future ?
I even gave folks a nice example in our docs that shows using .toString() :
https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#selectelement-e-string-s

Related

Search for id="* *" in Eclipse

I am searching in my Eclipse project for id="* *" in order to find something like below in all files of my project
id="Text1 Text2"
Since space between attribute value needs to be removed I am searching like this to find out all the instances in my project.
But it searches and brings up lots of results since * represent any string.
For example:
id="xyzImg" onLoad="test();" SRC="abc.png"
It taking the complete element instead of desired results.
Please suggest me is there any way to get my desired result or is there any online tool that serve the same purpose.
I am going to write it here as an answer instead. Try with *|*. By asterisk I meant the wildcard symbol.

Matching nested constructs in TextMate / Sublime Text / Atom language grammars

While writing a grammar for Github for syntax highlighting programs written in the Racket language, I have stumbled upon a problem.
In Racket #| starts a multiline comment and |# ends it.
The problem is that multiline comments can be nested:
#| a comment #| still a comment |# even
more comment |#
Here is my non-working attempt:
repository:
multilinecomment:
begin: \#\|
end: \|\#
name: comment
contentName: comment
patterns:
- include: "#multilinecomment"
name: comment
- match: ([^\|]|\|(?=[^#]))*
name: comment
The intent of the match patterns are:
"#multilinecomment"
A multiline comment can contain another multiline comment.
([^\|]|\|(?=[^#]))*
The meaning of the subexpressions:
[^\|] any characters not an `|`
\|(?=[^#]) an `|` followed by a non-`#`
The entire expression thus matches a string not containg |#
Update:
Got an answer from Allan Odgaard on the TextMate mailing list:
http://textmate.1073791.n5.nabble.com/TextMate-grammars-and-nested-multiline-comments-td28743.html
So I've tested a bunch of languages in Sublime that have multiline comments (C/C++, Java, HTML, PHP, JavaScript), and none of the language syntaxes support multiline comments embedded in multiline comments - the syntax highlighting for the comment scope ends with the first "comment close" marker, not with symmetric markers. Now, this isn't to say that it's impossible, because the BracketHighlighter plugin works great for matching symmetric tags, brackets, and other markers. However, it's written in Python, and uses custom logic for its matching algorithms, something that may not be available in the Oniguruma engine that powers Sublime's syntax highlighter, and apparently Github's as well.
Basically, from your description of the problem, you need a code parser to ensure that nested comments are legal, something you can't do with just a syntax highlighting definition. If you're writing this just for Sublime, a custom plugin could take care of that, but I don't know enough about Github's Linguist syntax highlighting system to say if you're allowed to do that. I'm not a regex master yet, but it seems to me that it would be rather difficult to achieve this purely by regex, as you'd need to somehow keep track of an arbitrary number of internal symmetric "open" and "close" markers before finding (and identifying!) the final one.
Sorry I couldn't provide a definitive answer other than I'm not sure this is possible, but that's the best I can come up with without knowing more about Sublime's and Github's internals, something that (at least in Sublime's case) won't happen unless it's open-sourced. Good luck!
Old post, and I don't have the reputation for a comment, but it is emphatically NOT possible to detect arbitrarily nested comments using purely regular expressions. Intuitively, this is because all regular expressions can be transformed into a finite state machine, and keeping track of nesting depth requires a (theoretically) infinite amount of state (the number of states needs to be equal to at least the different possible nesting depths, which here is infinite).
In practice this number grows very slowly, so if you don't want to go to too much trouble you could probably write something that allows nesting up to a reasonable depth. Otherwise you'll probably need a separate phase that parses through and finds the comments to tell the syntax highlighter to ignore them.
You had the correct idea but it looks like your second pattern also matches for the "begin nested comment" sequence #| which will never give a chance for your recursive #multilinecomment pattern to kick in.
All you have to do is replace your second pattern with something similar to
(#(?=[^|])|\|(?=[^#])|[^|#])+
Take the last match out. You do not need it. Its redundant to what textmate will do naturally, which is to match all additional text in to the comment scope until the end marker comes along, or the entire pattern recurses upon itself.

Matlab Regular expression query

Very new to regex and haven't found a descriptive explaination to narrow down my understanding of regex to get me to a solution.
I use a script that scrapes html script from Yahoo finance to get financial options table data. Yahoo recently changed their HTML code and the old algorithm no longer works. The old expression was the following:
Main_Pattern = '.*?</table><table[^>]*>(.*?)</table';
Tables = regexp(urlText, Main_Pattern, 'tokens');
Where Tables used to return data, it no longer does. An HTML inspection of the HTML suggests to me that the data is no longer in <table>, but rather in <tbody>...
My question is "what does the Main_Pattern regex mean in layman's terms?" I'm trying to figure how to modify that expression such that is is applicable to the current HTML.
While I agree with #Marcin and Regular Expressions are best learned by doing and leveraging the reference of your chosen tool, I'll try and break down in what it is doing.
.*?</table>: Match anything up to the first </table> literal (This is a Lazy expression due to the ?).
<table: Match this literal.
[^>]*>: Match as much as possible that isn't > from after <table literal to the last occurrence of a > that satisfies the rest of the expression (this is a Greedy expression since there is no ? after the *).
(.*?)</table: Match and capture anything between the > from the previous part up to the </table literal; what was captured can be retrieved using the 'tokens' options of regexp (you can also get the entire string that was matched using the 'match' option).
While I broke it into pieces, I'd like to emphasize that the entire expression itself works as a whole, which is why some parts refer to the previous parts.
Refer to the Operators and Characters section of the MATLAB documentation for more in-depth explanations of the above.
For the future, a more robust option might be to use MATLAB's xmlread and DOM object to traverse the table nodes.
I do understand that that is another API to learn, but it may be more maintainable for the future.

Passing a command line argument (as a string) into my Perl script

I'm extremely new at Perl and trying to prove I can pick it up quickly. What I was asked to do is add a string as an argument on my command line, and then feed that into my script. From there it is supposed to search a MySQL table I've made for matches in one column, and spit the contents of another column into an array. It was suggested I used the Getops:Std but I'm uncertain how exactly to do that, and if that's the best technique.
For example: I have a MySQL table with car manufacturers, and car models. I want to run, Perl myscript.pl Ford, and then have it shoot me back an array with
Mustang
Escape
Focus
But I'm uncertain how to get that string input in the first place. Would Getops:Std be best? If so how would it be written? I'm picking this up quickly, but I've been at it less than a week, so the simpler the explanation, the better.
Edit: Basically I was confused why it was suggested I should use GetOpts::Std for this. It seems to be completely inappropriate for what I'm trying to do.
GetOpts::Std is overkill for this. Your command line arguments are in #ARGV. If you haven't been able to work that out after a week, then you need better references for Perl.
The first argument will be in $ARGV[0], the second in $ARGV[1] , and so on.
You should check the DBI module. Google for some tutorial out there.
Then try to write your script and post more specific questions with some code if you need more help.

R, individual ?argument function that leads to own online help

I have an internal wiki and I created a function w(argument), which directly opens the corresponding page on my wiki using browseURL(url, browser). However, instead of w(argument), I'd like to replace it by #argument, similar to ?argument. Does somebody know if such a function definition with a shortkey is possible within R
Thanks a lot for your help
BR
Martin
No. What you are looking for is to define a new unary operator in R, and that isn't possible. (And # is the comment character in R so is used already anyway, so that wouldn't work.)
This post by Brian Ripley, in response to a similarly motivated question, has a bit more explanation (not much)
'#' starts a comment in R, so that will never get passed the parser. You'll have to modify the core and recompile R if you really want #foo to do something other than nothing.
You can change what ?foo does by reassigning it:
> assign("?",function(x){cat("HALP!\n")})
> ?foo
HALP!
Obviously you'd make it fall through to the default help system if the arg isn't what you are interested in, but this is pretty ugly.
You could define a binary operator, then pass anything in to the first argument, e.g.,
"%w%" <- function(x, y) w(y)
1%w%argument
It's 4 keys rather than 1, but that's about as close as you can get without major reworking of R.