Replacing words while processing documents before tokenizing in RapidMiner

I have a set of documents and would like to replace some word sets with a single word before tokenizing,
e.g. "follow up" --> follow-up,
"Set up" --> Setup and
"with out" --> without
I tried using Replace (Dictionary) by loading a CSV file with the potential words, but I can't tokenize afterwards.
How do I do this?
Thanks,
Aji

Have a look at Stem (Dictionary). It can be misused to do this trick.
Cheers,
Martin
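Outside RapidMiner, the intended transformation can be sketched in a few lines of Python. This is only an illustration of the idea (merge multi-word phrases into one token, then tokenize), not of how the RapidMiner operators work internally. The PHRASES dictionary and both function names are hypothetical, and the tokenizer simply splits on whitespace.

```python
import re

# Hypothetical phrase dictionary built from the examples in the question.
PHRASES = {"follow up": "follow-up", "set up": "Setup", "with out": "without"}

# One alternation of all phrases, matched case-insensitively on word boundaries.
PHRASE_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in PHRASES) + r")\b",
    re.IGNORECASE,
)

def merge_phrases(text):
    """Replace each known multi-word phrase with its single-word form."""
    return PHRASE_RE.sub(lambda m: PHRASES[m.group(1).lower()], text)

def tokenize(text):
    """Whitespace tokenizer (an assumption; it keeps 'follow-up' intact)."""
    return text.split()

tokens = tokenize(merge_phrases("Please follow up with out delay"))
print(tokens)
```

A tokenizer that splits on non-letter characters would break "follow-up" apart again, so the choice of tokenization mode matters for hyphenated replacements.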

Related

Create an NCX file with Notepad++ and Regular expression

I have a HTML Table of Contents page containing list of book chapters with hyperlinks:
Multimedia Implementation<br/>
Table of Contents<br/>
About the Author<br/>
About the Technical Reviewers<br/>
Acknowledgments<br/>
Part I: Introduction and Overview<br/>
Chapter 1. Technical Overview<br/>
...
I want to create an NCX file for a Kindle book, which must contain entries like the following:
<navPoint id="n1" playOrder="1">
<navLabel>
<text>Multimedia Implementation</text>
</navLabel>
<content src="final/main.html"/>
</navPoint>
<navPoint id="n2" playOrder="2">
<navLabel>
<text>Table of Contents</text>
</navLabel>
<content src="final/toc.html"/>
</navPoint>
<navPoint id="n3" playOrder="3">
<navLabel>
<text>About the Author</text>
</navLabel>
<content src="final/pref01.html"/>
</navPoint>
...
I'm using Notepad++: is it possible to automate this process with a regular expression?
You cannot do everything using regex, but you can split the problem into two parts:
generate strings like <navPoint id="n1" playOrder="1"> using program logic (an incrementing variable)
the rest you can do with regex
Use the following regex to match:
<a\shref="([^"]*)">([^<]*)<\/a><br\/>
And replace with:
(generated string)<navLabel>\n<text>\2</text>\n</navLabel>\n<content src="\1"/>\n</navPoint>
See DEMO
Yes, it is possible to replace the links with <navPoint> tags. The only thing I found no solution for is the incremental numbering of the <navPoint> attributes id and playOrder...
The following regex will do most of the work:
/^<a[^>]*href="([^"]+)"[^>]*>([^<]+).*$/gm
substitute with:
<navPoint id="n" playOrder="">\n<navLabel><text>$2</text></navLabel>\n<content src="$1" />\n</navPoint>\n
Regex details
/^<a .. only parse lines that start with an `<a` tag
[^>]*href=" .. find the first occurrence of `href="` inside the tag
([^"]+) .. capture the text and stop when a " is found
"[^>]*> .. find the end of the <a> tag
([^<]+) .. capture the text and stop when a < is found (i.e. the </a> tag)
.*$/ .. continue to end of the line
gm .. search the whole string and parse each line individually
A more detailed (but also more confusing) explanation is here:
https://regex101.com/r/gA0yJ2/1
This link also demonstrates how the regex is working. You can test changes there if you like
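The numbering that the regex alone cannot produce is easy to add with a short script. The following is a minimal Python sketch, assuming each TOC line looks like <a href="final/main.html">Multimedia Implementation</a><br/> (the anchor markup is an assumption, since the question's sample shows only the link text). The function name toc_to_navpoints is hypothetical.

```python
import re

# Capture the href and the link text of each anchor.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>([^<]+)</a>')

def toc_to_navpoints(html):
    """Turn each TOC link into a numbered NCX <navPoint> block."""
    blocks = []
    # enumerate() supplies the incrementing id / playOrder value.
    for order, match in enumerate(LINK_RE.finditer(html), start=1):
        href, title = match.group(1), match.group(2).strip()
        blocks.append(
            f'<navPoint id="n{order}" playOrder="{order}">\n'
            f'<navLabel>\n<text>{title}</text>\n</navLabel>\n'
            f'<content src="{href}"/>\n'
            f'</navPoint>'
        )
    return "\n".join(blocks)

sample = (
    '<a href="final/main.html">Multimedia Implementation</a><br/>\n'
    '<a href="final/toc.html">Table of Contents</a><br/>'
)
ncx_body = toc_to_navpoints(sample)
print(ncx_body)
```

The link text is copied as-is, on the assumption that it is already valid XML text (entities such as &amp; left untouched).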

Regex find two characters in order, between others, ignoring punctuation

I'm trying to filter using regex in MySQL.
The field is a text field and I want to find all that match 'MD' or similar ('M.D.', 'M. D.', 'DDS, M.D.' etc.).
I do not want to accept those that contain M and D as a part of another acronym (e.g., 'DMD'). However 'DMD, M.D.' I would want to find.
Apologies if this is a simple task - I read through some regex tutorials and couldn't figure this out! Thanks.
Update:
With help from the suggestions I arrived at the following solution:
(\s|^)M\.?\s*D\.?
which works for all of my cases. The quotes in my question were only there to indicate a string; they are not part of the string.
You can use a regex like this:
\b(M\.?\s*D\.?|D\.?\s*D\.?\s*S\.?)
Working demo
If I have understood your requirement:
'([^'.]*[ ,]*M[. ]*D[. ]*)'
This looks for MD preceded by a space, a comma or ', with the letters separated by zero or more dots and spaces, and followed by '.
It matches all the contents between the ' marks.
test: https://regex101.com/r/oV2kV8/2
In the end I found this solution works:
(\s|^)M\.?\s*D\.?(\s|$)
This allows for the 'MD' to be at the start or after another credential and to have spaces or periods or nothing between the letters.
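The final pattern can be checked against the cases from the question with a quick script. This sketch uses Python's re module rather than MySQL, so it assumes both engines treat this pattern the same way; MySQL's regex dialect and string escaping differ between versions and should be verified separately.

```python
import re

# The pattern from the final solution above.
CREDENTIAL_RE = re.compile(r'(\s|^)M\.?\s*D\.?(\s|$)')

samples = ['MD', 'M.D.', 'M. D.', 'DDS, M.D.', 'DMD', 'DMD, M.D.']

# Map each sample to whether the pattern is found anywhere in it.
results = {s: bool(CREDENTIAL_RE.search(s)) for s in samples}

for text, hit in results.items():
    print(f'{text!r}: {"match" if hit else "no match"}')
```

Because the pattern requires whitespace or end of string after the credential, a value such as 'M.D., PhD' (trailing comma) would not match.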

How to prevent link search from spilling across tags?

I have a local web site whose pages contain hyperlinks of various classes and would like to know how to prevent search results from spilling across several tags. (I need to do a batch modification of the address of a particular link type.)
E.g., my page may contain lists of links such as
Best solution:<br>
AAA<br> but see also
BBB<br> and
CCC<br>.
Now when I try to search the site for only the links of class "zzz" using the regex search term
<a href="+[].html" class="zzz">
my results include long strings such as
AAA<br> but see also BBB<br> and <a href="ccc.html" class="zzz>
What has happened is that the search engine (Funduc Search & Replace, if this helps) finds the <a href= of the first link (aaa.html), the matching class of the third link (ccc.html), and includes everything in between.
What expression must I use to ensure only the link of the file with the correct class, and nothing else, appears in the search result?
E.g.,
<a href="ccc.html" class="zzz">
Thanks for your help.
Use a DOM library (preferably one that supports XPath) instead of a regular expression. Regular expressions are poorly suited to dealing with HTML.
The + quantifier (one or more occurrences) is greedy in most regex engines. That means [a-z]+ reads as "match a or b or ... or z as many times as possible".
The Perl regex engine has a special quantifier +? for lazy matching, so [a-z]+? means "match a..z as few times as possible".
More simply, you can exclude " and > from "any char":
[^">]+
The regex will look like:
<a href="([^">]+\.html)" class="zzz">
A more precise Perl version:
<a\s+.*?\bhref\s*=\s*"(.+?\.html)"\s*class\s*=\s*"zzz".*?>
Here the () form a capture group.
I haven't tried this with Funduc Search and Replace for Windows; I hope it works.
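The difference between the variants can be seen in a small Python sketch. The page fragment below is hypothetical: the question's sample lost its anchor tags, so the class names "xxx" and "yyy" are made up. Python's re module stands in for Funduc's engine, which may behave differently.

```python
import re

# Hypothetical fragment with three links; only the last has class "zzz".
html = (
    '<a href="aaa.html" class="xxx">AAA</a><br> but see also '
    '<a href="bbb.html" class="yyy">BBB</a><br> and '
    '<a href="ccc.html" class="zzz">CCC</a><br>.'
)

# Greedy "any char": starts at the first link and runs to the third.
greedy = re.search(r'<a href=".+\.html" class="zzz">', html)

# Lazy "any char": still starts at the first <a href=" it can match from.
lazy = re.search(r'<a href=".+?\.html" class="zzz">', html)

# Negated class: the value can never cross a quote or a tag boundary.
negated = re.search(r'<a href="([^">]+\.html)" class="zzz">', html)

print(greedy.group(0))
print(lazy.group(0))
print(negated.group(0))
```

In this sketch the lazy quantifier alone does not stop the spill, because the engine still begins matching at the first link and only then looks for the shortest continuation. Only the negated character class keeps the match inside one tag.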

How do I check if values between HTML tags are blank or empty using regular expressions in Notepad++

I'm conducting a mass search of files in Notepad++ and I need to determine if there are no values between a set of tags (i.e. <author></author>).
".*?" will search for 0 or more characters (well, most), which is fine. But I'm looking for a set of tags with at least one character between them.
".+?" is similar to the above and does work in notepad++.
I tried the following, which was unsuccessful:
<author>.{0}?</author>
Thank you for any help.
Since you look for something that doesn't exist you don't have to make it that complicated. Simply searching for <author></author> would do the trick, wouldn't it? If you want to include space-characters as "nothing" you could modify it to the following:
<author>\s*?</author>
Output:
<author></author> Match
<author> </author> Match
<author>something</author> No match
I don't understand why you are using the "?" operator; ".+" should yield the result you need.
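The three cases listed above can be reproduced with a short Python check. It is only a sketch: it uses Python's re module rather than Notepad++'s search engine, and the helper name has_empty_author is hypothetical.

```python
import re

# Empty or whitespace-only content between the tags.
EMPTY_AUTHOR_RE = re.compile(r'<author>\s*</author>')

def has_empty_author(line):
    """Return True if the line contains an empty <author> element."""
    return EMPTY_AUTHOR_RE.search(line) is not None

for line in ['<author></author>', '<author> </author>', '<author>something</author>']:
    print(line, '->', 'Match' if has_empty_author(line) else 'No match')
```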

Reg Exp To Remove 'onclick' from HTML elements (Notepad++)

I have a big html file (87000+ lines) and want to delete all occurrences of onclick from all elements. Two examples of what I want to catch are:
1. onclick="fnSelectNode('name2',2, true);"
2. onclick="fnExpandNode('img3','node3','/include/format_DB.asp?Index2=3');"
The problem is that both function names and parameters passed to them are not the same. So I need a Regular expression that matches onclick= + anything + );"
And I need one that works in Notepad++
Thanks for helping ;-)
Not familiar with notepad++, but what I use in vim is:
onclick="[^"]+"
Of course this depends on there being double quotes around the onclick in every case...
This regular expression will fail if you have a " or ' character included within quotes escaped by a \. Other than that, this should do it.
(onclick="[^"]+")|(onclick='[^']+')
onclick="[^"]+" works for me, for those two strings.
If you want to go with a regex:
/onclick=".*?"/
You could also use something which is DOM-aware, such as a HTML/XML parser, or even just load up jQuery:
$("[onclick]").removeAttr("onclick");
... and then copy the body HTML into a new file.
Could
onclick=\".+;\"
Work?
onclick=\".*\);\"
This regex should do the trick.
(\s+onclick=(?:"[^"]+"|'[^']+'))
Open your file in Dreamweaver, choose Edit from the toolbar, select Find and Replace,
put onclick="[^"]+" in the Find field and leave the Replace field blank.
This will do the whole thing.
Enjoy
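For a file this large, the same idea can also be scripted. Below is a minimal Python sketch using the quoted-value pattern from the answers above. The function name strip_onclick is hypothetical, and the sketch assumes attribute values contain no escaped quotes of the same kind that delimits them.

```python
import re

# Leading whitespace, then onclick= with a double- or single-quoted value.
ONCLICK_RE = re.compile(r'''\s+onclick=(?:"[^"]*"|'[^']*')''')

def strip_onclick(html):
    """Remove every onclick attribute (and the whitespace before it)."""
    return ONCLICK_RE.sub('', html)

# The surrounding tags are made up; the onclick values come from the question.
before = (
    '<a href="#" onclick="fnSelectNode(\'name2\',2, true);">x</a>\n'
    '<img id="img3" onclick="fnExpandNode(\'img3\',\'node3\','
    '\'/include/format_DB.asp?Index2=3\');">'
)
after = strip_onclick(before)
print(after)
```

Matching on the quote characters instead of on );" means the pattern does not depend on how each handler happens to end.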