Exclude some characters from a Lex regex - html

I am trying to build a regex for Lex that matches bold text in Markdown syntax, for example: __strong text__. I came up with this:
__[A-Za-z0-9_ ]+__
And then replace the text by
<strong>Matched text</strong>
But in Lex, this rule leaves the variable yytext set to __Matched text__. How can I get rid of the underscores? Would it be better to write a regex that does not match the underscores, or to process yytext afterwards to remove them?
With capturing groups it would be easier, because I would only need the regex:
__([A-Za-z0-9 ]+)__
and then use \1. But Lex does not support capturing groups.
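For comparison, this is roughly what the capturing-group approach described above looks like in an engine that supports it. The sketch uses Python's re module purely as an illustration (it is not Lex), and the helper name bold_to_strong is made up for this example.

```python
import re

# The parentheses capture the text between the underscores;
# \1 in the replacement refers back to that captured text.
BOLD = re.compile(r"__([A-Za-z0-9 ]+)__")

def bold_to_strong(text: str) -> str:
    """Hypothetical helper: turn __text__ into <strong>text</strong>."""
    return BOLD.sub(r"<strong>\1</strong>", text)

print(bold_to_strong("some __strong text__ here"))
# some <strong>strong text</strong> here
```

Lex has no equivalent of \1, which is why the answers below either trim yytext by hand or use start conditions.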
Answer
I finally went with the first option offered by João Neto, slightly modified (len is the length of the delimiter, 2 for __):
yytext[strlen(yytext)-len]='\0'; // exclude last len characters
yytext+=len; // exclude first len characters
I also tried start conditions, which he mentioned as the second option, but could not get them to work.

You can process yytext by removing the first and last two characters.
yytext[strlen(yytext)-2]='\0'; // exclude last two characters
yylval.str = &yytext[2]; // exclude first two characters
Another option is to use start conditions with a state stack:
%option stack
%x bold
%%
"__" { yy_push_state(bold); yylval.str = new std::string(); }
<bold>"__" { yy_pop_state(); return BOLD_TOKEN; }
<bold>.|\n { *yylval.str += yytext; /* str is a std::string*, so dereference it before appending */ }
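The idea behind the start-condition version can be sketched outside Lex as a two-state scanner. The following is only an illustration in Python under that assumption, not generated scanner code; convert_bold and its variable names are invented for the example.

```python
def convert_bold(text: str) -> str:
    """Replace __...__ spans with <strong>...</strong> using two states."""
    out = []          # converted output pieces
    buf = []          # characters collected while inside a bold span
    in_bold = False   # plays the role of the <bold> start condition
    i = 0
    while i < len(text):
        if text.startswith("__", i):
            if in_bold:
                # Closing delimiter: emit the collected text and leave the state.
                out.append("<strong>" + "".join(buf) + "</strong>")
                buf = []
            in_bold = not in_bold
            i += 2
        else:
            # Ordinary character: goes to the buffer inside a span, else to output.
            (buf if in_bold else out).append(text[i])
            i += 1
    if in_bold:
        # Unterminated span: put the delimiter and the text back unchanged.
        out.append("__" + "".join(buf))
    return "".join(out)

print(convert_bold("a __b c__ d"))  # a <strong>b c</strong> d
```

Because the delimiters are consumed by their own rule, the collected text never contains the underscores, so nothing has to be trimmed afterwards.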

Related

How to write regex expression for this type of text?

I'm trying to extract the price from the following HTML.
<td>$75.00/<span class='small font-weight-bold text-danger'>Piece</span></small> *some more text here* </td>
What is the regular expression to get the number 75.00?
Is it something like:
<td>$*/<span class='small font-weight-bold text-danger'>
The dollar sign is a special character in regex, so you need to escape it with a backslash. Also, you only want to capture digits, so you should use character classes.
<td>\$(\d+[.]\d\d)/<span
As the other respondent mentioned, regex changes a bit with each implementing language, so you may have to make some adjustments, but this should get you started.
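As a concrete check of the escaped-dollar pattern, here is a minimal sketch in Python (chosen only for illustration; the original question does not say which language is in use). The html string is the sample from the question, and the pattern includes the "/" that sits between the price and <span in that sample.

```python
import re

html = ("<td>$75.00/<span class='small font-weight-bold text-danger'>"
        "Piece</span></small> *some more text here* </td>")

# \$ matches a literal dollar sign; the group captures digits, a dot, two digits.
match = re.search(r"<td>\$(\d+[.]\d\d)/<span", html)
price = match.group(1) if match else None
print(price)  # 75.00
```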
I think you can go with /[0-9]+\.[0-9]+/.
[0-9] matches a single digit. In this example it matches the 7.
The + afterwards says to match one or more digits, so [0-9]+ matches 75. It stops there because the character after the 5 is a period.
Next we add a period to the regex and make sure it is escaped. An unescaped period means "any character"; escaped, it matches only a literal period. So we have /[0-9]+\./ so far.
Then we just add [0-9]+ so that it matches the digits after the period too.
Don't add the global flag, as in /[0-9]+\.[0-9]+/g, unless you want to find more than just the first number with a decimal point.
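The difference between finding the first match and finding all of them can be sketched as follows. This uses Python rather than JavaScript, where search plays the role of a non-global match and findall the role of the g flag; the sample text with a second price is invented for the example.

```python
import re

# Made-up sample containing two prices, to show first-match vs. all-matches.
text = "<td>$75.00/<span>Piece</span> or $120.50 per box</td>"

number = re.compile(r"[0-9]+\.[0-9]+")

first = number.search(text).group(0)   # only the first number
all_numbers = number.findall(text)     # every number, like the g flag

print(first)        # 75.00
print(all_numbers)  # ['75.00', '120.50']
```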
There is another regex you can use. It uses the parentheses to group the part you're looking for like this: /<td>\$(.+)<span/
It will match everything from <td>$ up to <span. From there you can filter out the group/part you're looking for. See the examples below.
// JavaScript
const text = "<td>$something<span class='small font-weight..."
const regex = /<td>\$(.+)<span/g
const match = regex.exec(text) // this will return an Array
console.log( match[1] ) // prints out "something"
# Python
import re
text = "<td>$something<span class='small font-weight..."
regex = re.compile(r"<td>\$(.+)<span")
print(regex.search(text).group(1))  # prints out "something"
As an alternative you could use a DOMParser.
Wrap your <td> inside a table, use for example querySelector to get your element and get the first node from the childNodes.
That would give you $75.00/.
To remove the $ and the trailing forward slash you could use slice or use a regex like \$(\d+\.\d+) and get the value from capture group 1.
let html = `<table><tr><td>$75.00/<span class='small font-weight-bold text-
danger'>Piece</span></small> *some more text here* </td></tr></table>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
let result = doc.querySelector("td");
let textContent = result.childNodes.item(0).nodeValue;
console.log(textContent.slice(1, -1));
console.log(textContent.match(/\$(\d+\.\d+)/)[1]);

Find the word and replace with html tag using regex

I have a text equation like: 10x^2-8y^2-7k^4=0.
How can I find each ^ and wrap the exponent that follows it in <sup></sup> throughout the whole string using regex? The result should be like: 10x<sup>2</sup>-8y<sup>2</sup>-7k<sup>4</sup>=0
I tried str = str.replace(/\^\s/g, "<sup>$1</sup> ") but I’m not getting the expected result.
Any ideas that can help to solve my problem?
I think you're looking for something like
\^(\d+)
It matches the ^ and captures the exponent; you then replace the match with
<sup>$1</sup>
See it here at regex101.
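Applied to the equation from the question, the substitution looks like this. The sketch is in Python for illustration, where the backreference is written \1 instead of JavaScript's $1.

```python
import re

equation = "10x^2-8y^2-7k^4=0"

# \^ matches a literal caret; (\d+) captures the exponent digits.
result = re.sub(r"\^(\d+)", r"<sup>\1</sup>", equation)

print(result)  # 10x<sup>2</sup>-8y<sup>2</sup>-7k<sup>4</sup>=0
```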
Edit:
To meet your new demands, check this fiddle. It handles the sub as well using replace with a function.
Your current pattern matches a caret followed by a whitespace character (space, tab, newline, etc.), but you want to match a caret followed by either a single character or multiple characters wrapped in braces, as your string is in TeX notation.
/\^(?:([\w\d])|\{([\w\d]{2,})\})/g
Only one of the two groups takes part in any given match, so the replacement has to use both: str = str.replace(/\^(?:([\w\d])|\{([\w\d]{2,})\})/g, "<sup>$1$2</sup>"); should do the job.
You can make a more generic function from this expression that can wrap characters prefixed by a specific character with a specific tag.
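function wrapPrefixed(string, prefix, tagName) {
  // "g" replaces every occurrence; $1$2 covers both the bare and the braced form
  return string.replace(new RegExp("\\" + prefix + "(?:([\\w\\d])|\\{([\\w\\d]{2,})\\})", "g"), "<" + tagName + ">$1$2</" + tagName + ">");
}
For instance, calling wrapPrefixed("1_2 + 4_{32}", "_", "sub"); results in 1<sub>2</sub> + 4<sub>32</sub>. Note that the braces may only contain word characters, so an expression such as {3+2} is left untouched.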
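The same generic helper can be sketched with a replacement function, which makes the "whichever group matched" step explicit. This is a Python illustration under the same assumptions as the pattern above (a single word character, or two or more word characters in braces); wrap_prefixed is a hypothetical name.

```python
import re

def wrap_prefixed(text: str, prefix: str, tag_name: str) -> str:
    """Wrap the token following each prefix character in <tag_name>."""
    # Either one word character, or two or more word characters in braces.
    pattern = re.compile(re.escape(prefix) + r"(?:(\w)|\{(\w{2,})\})")

    def repl(match: re.Match) -> str:
        # Exactly one of the two groups matched; use whichever it was.
        inner = match.group(1) or match.group(2)
        return f"<{tag_name}>{inner}</{tag_name}>"

    return pattern.sub(repl, text)

print(wrap_prefixed("x^2 + y^{10}", "^", "sup"))  # x<sup>2</sup> + y<sup>10</sup>
```

Using re.escape on the prefix avoids hand-escaping characters such as ^ that are special in a regex.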

Find word that starts at a newline

I have a simple loop to delete all words from the end of a text that start with a # and space.
AS3:
// messageText is usually taken from a users input field - therefore the newline is not present in the "messageText"
var messageText = "hello world #foo lorem ipsum #findme"
while (messageText.lastIndexOf(" ") == messageText.lastIndexOf(" #")){
messageText = messageText.slice(0,messageText.lastIndexOf(" "));
}
How can I check whether the character before the # is a newline rather than a space?
I tried this, but nothing is found:
while (messageText.lastIndexOf(" ") == messageText.lastIndexOf("\n#")){
messageText = messageText.slice(0,messageText.lastIndexOf(" "));
}
\n is the newline character on Unix-like systems, including OS X/macOS.
\r\n is the Windows version.
\r is the classic Mac OS (pre-OS X) version.
See also: this previous (dupe) post.
First thing is I'd manually try replacing "\n" with "\r\n" and then "\r" to see if there is some other newline in use. If so, then you just need a better search term that will match each version in one go.
A better solution might be to use a regular expression (RegExp). You could use this pattern to look for a line that starts with a whitespace character:
var pattern:RegExp = /^\s/m;
if (yourString.search(pattern) >= 0) { ... }
The ^ caret anchors the match to the start of a line; the m (multiline) flag is needed for that, because without it ^ matches only at the start of the whole string. The \s matches any whitespace character, so if you don't want to match tabs, change it to a literal space. (I'm not familiar with ActionScript specifically, but that syntax looks OK, and search() returns -1 if the pattern isn't found.)
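Since \s matches spaces, \n, \r and \r\n alike, one regular expression can cover both the "space before #" and the "newline before #" cases from the question. The following is a sketch in Python rather than AS3, with an invented helper name, and it assumes that the goal is to drop every trailing word that starts with #.

```python
import re

# One or more trailing "#word" tokens, each preceded by whitespace
# (space, tab, \n, \r or \r\n), up to the end of the string.
TRAILING_TAGS = re.compile(r"(?:\s+#\S+)+\s*$")

def strip_trailing_tags(message: str) -> str:
    """Hypothetical helper: remove hashtag words from the end of a message."""
    return TRAILING_TAGS.sub("", message)

print(strip_trailing_tags("hello world #foo lorem ipsum #findme"))
# hello world #foo lorem ipsum
```

A tag in the middle of the text (#foo above) is kept, because it is followed by ordinary words. A message that consists of nothing but a single tag is also left alone, since the pattern requires whitespace before the #.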

preg_replace not working

I have this function in my website.
function autolink($content) {
$pattern = "/>>[0-9]/i" ;
$replacement = ">>$0";
return preg_replace($pattern, $replacement, $content, -1);
}
This is for turning certain character sequences into clickable hyperlinks.
For example, in a thread, a user can type '>>4' to refer to reply number 4, and that text should become a link.
But it's not working: the characters are not converted into a hyperlink. They just remain as plain, non-clickable text.
Could someone tell me what is wrong with the function?
So the objective is to convert:
This is a reference to the >>4 reply
...into:
This is a reference to the >>4 reply
...where "&gt;" is the HTML entity for ">". (Remember, you don't want to create HTML issues.)
The problems: (1) you forgot to escape the quotes in the replacement; (2) since you want to isolate the number, you need parentheses to create a sub-pattern that the replacement can reference.
Once you do this, you arrive at:
function autolink($contents) {
return preg_replace( "/>>([0-9])/i",
">>$1",
$contents,
-1
);
}
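To show the sub-pattern and back-reference idea end to end, here is a minimal sketch in Python (not PHP). The link target "#reply-N" is purely an assumption made for the example, since the actual URL scheme is not shown above, and [0-9]+ is used so that multi-digit reply numbers are captured as well.

```python
import html
import re

def autolink(content: str) -> str:
    """Turn >>N references into links; the '#reply-N' target is hypothetical."""
    escaped = html.escape(content)  # ">" becomes "&gt;", so no raw markup leaks
    # The parentheses capture the reply number; \1 reuses it in the replacement.
    return re.sub(r"&gt;&gt;([0-9]+)",
                  r'<a href="#reply-\1">&gt;&gt;\1</a>',
                  escaped)

print(autolink("This is a reference to the >>4 reply"))
```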
Good luck

AS3 validate form fields?

I wrote an AS3 script with two fields to validate: email and name.
For email I use:
function isValidEmail(Email:String):Boolean {
var emailExpression:RegExp = /^[a-z][\w.-]+@\w[\w.-]+\.[\w.-]*[a-z][a-z]$/i;
return emailExpression.test(Email);
}
How about the name field? Can you show me some sample code?
EDIT:
A name is invalid if it is blank. A valid name must:
be between 4 and 20 characters long
be alphanumeric only (special characters not allowed)
start with a letter
I think you probably want a function like this:
function isNameValid(firstname:String):Boolean
{
var nameEx:RegExp = /^([a-zA-Z])([ \u00c0-\u01ffa-zA-Z']){4,20}+$/;
return nameEx.test(firstname);
}
Rundown of that regular expression:
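[a-zA-Z] - Checks that the first char is a plain ASCII letter.
[ \u00c0-\u01ffa-zA-Z'] - Checks that every other char is an ASCII letter, a character in the range \u00c0-\u01ff (which covers accented Latin letters), a space, or an apostrophe. So names like "Mc'Neelan" will work.
{4,20} - Requires 4 to 20 of those chars after the first letter, so the whole name is 5 to 21 chars long. Use {3,19} if the total length should be 4 to 20.
You can remove the space at the start of the character class if you don't want spaces.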
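For a stricter reading of the rules in the question (4 to 20 characters, alphanumeric only, starting with a letter), the check can be sketched as below. This is a Python illustration rather than AS3, and it deliberately differs from the pattern above by rejecting spaces, apostrophes and accented letters; is_name_valid is a made-up name.

```python
import re

# One leading ASCII letter plus 3 to 19 further letters or digits
# gives a total length of 4 to 20 characters.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{3,19}")

def is_name_valid(name: str) -> bool:
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return NAME.fullmatch(name) is not None

print(is_name_valid("Ajit"))       # True
print(is_name_valid("Abc"))        # False (too short)
print(is_name_valid("Mc'Neelan"))  # False (apostrophe is not alphanumeric)
```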
Hope this helps. Here are my references:
Regular expression validate name asp.net using RegularExpressionValidator
Java - Regular Expressions: Validate Name
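function isNameValid(firstname:String):Boolean
{
var nameEx:RegExp = /^([a-zA-Z])([ \u00c0-\u01ffa-zA-Z']){3,19}+$/;
return nameEx.test(firstname);
}
{3,19} instead of {4,20}: the first letter is matched separately, so this gives a total length of 4 to 20.
This avoids the problem of four-letter names like Ajit being rejected.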