Extracting text string between 2 characters with Regex - json

Say I have this string:
["teads.tv, 15429, reseller, 15a9c44f6d26cbe1 ","video.unrulymedia.com,367782854,reseller","google.com, pub-8173359565788166, direct, f08c47fec0942fa0","google.com, pub-8804303781641925, reseller, f08c47fec0942fa0 "]
I am trying to extract all the text strings like teads.tv, google.com, etc.
Each text string is placed in the following way "text.text,, but there are also combinations of ", without any character in between.
I tried this Regex expression:
"(.*?)\,
but I also capture the empty combinations, you can check it out here.
How can I modify the Regex expression, so it would capture only the combination with a string between ",?
Cheers,

If there should be at least a single non-whitespace char present other than " , [ ], you can match optional whitespace chars, then use a negated character class listing all the characters that should not be matched, repeated 1 or more times.
"(\s*[^\][\s",]+),
Regex demo
The more broad variation is to repeat 1+ times any char except a comma:
"([^,]+),
Regex demo

How about using + (one or more) instead of * (zero or more) as quantifier:
"(.+?),
Additionally, you may not need to escape , with backslash.

Reading the question as retrieving the string with a dotted notation such as domain names means that we are looking for the first string after a ".
This regex will grab strings with dots within them, but avoid the quote characters.
const regEx = /(?:\")([\w\d\.\-]+)/g;
const input = '["teads.tv, 15429, reseller, 15a9c44f6d26cbe1 ","video.unrulymedia.com,367782854,reseller","google.com, pub-8173359565788166, direct, f08c47fec0942fa0","google.com, pub-8804303781641925, reseller, f08c47fec0942fa0 "]';
const regMatch = Array.from(input.matchAll(regEx), m => m[1]);
console.log(regMatch)

Related

Regex grouping: must start with /, optional group of characters alpha-numeric with forward slashes and total 1-255 characters

I have an HTML5 input element with a pattern attribute. I'm having some trouble with an optional group.
The (relative) URL must start with a forward slash (I have this working).
The total (relative) URL may contain a total of up to 255 characters.
All characters from 2-255 must be (lowercase) alpha-numeric or a forward slash.
Separately the forward slash regex works and the 2-255 part works for alpha-numeric and forward slashes. However I'm having trouble allowing both groups with the second group being optional.
What I have confirmed to work:
pattern="^\/"
pattern="[a-z0-9\/]"
However I can't determine how to allow the second group as an option (I've tried adding the ? after the ending square bracket in example without luck).
I also am not sure how to combine the length ({255,}) bit to the total pattern expression.
How do I combine all three aspects of the regular expression?
You can use
pattern="/[a-z0-9/]{0,254}"
By the way, you do not need ^ or $ in the pattern regex: it must match the whole string anyway, and it will be parsed as the ^(?:/[a-z0-9/]{0,254})$ pattern. That is, it will match a string that starts with / and then contains 0 to 254 lowercase ASCII letters, digits or slashes until the end of the string.
Note that / should only be escaped in regex literals where / is used as a delimiter char. pattern regexps are defined with literal strings.
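To try the length limits outside a browser, the whole-string behaviour can be approximated in Python with fullmatch. This is only a sketch: RELATIVE_URL and is_valid_relative_url are hypothetical names, and Python's re engine stands in for the browser's pattern handling.

```python
import re

# The pattern attribute is matched against the whole value, so fullmatch()
# is the closest Python equivalent of the implicit ^(?: ... )$ wrapping.
RELATIVE_URL = re.compile(r"/[a-z0-9/]{0,254}")

def is_valid_relative_url(value: str) -> bool:
    """Return True if value is a slash plus at most 254 allowed characters."""
    return RELATIVE_URL.fullmatch(value) is not None

print(is_valid_relative_url("/"))              # a lone slash: second part is optional
print(is_valid_relative_url("/blog/2020/01"))  # lowercase letters, digits, slashes
print(is_valid_relative_url("blog"))           # no leading slash
print(is_valid_relative_url("/" + "a" * 255))  # 256 characters in total
```

The {0,254} quantifier covers both requirements at once: the lower bound of 0 makes the second group optional, and the upper bound caps the total length at 255.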

How to replace delimited words in regex

I have an underscore-delimited string of words such as:
word1_word2_word3_word4 and a list of allowed values such as word1, word3
The goal is to filter out not allowed values and replace them with let's say … so the resulting string will be word1_..._word3_...
This needs to be done in MySQL and I plan to use REGEXP_REPLACE, but all my attempts to come up with a working regex that handles all instances (such as the first and last word) failed.
To simplify things I tried adding a leading and trailing underscores to the string so it becomes _word1_word2_word3_word4_ and do:
(?<=_)[^_]+(?=_) which nicely matches all string between delimiters, however I could not figure out how to exclude word1 and word3.
Just negative lookahead for word1|word3 right before the start of the match:
(?<=_)(?!word1|word3)[^_]+(?=_)
If the match may also start at the beginning of the string or end at the end of the string (without a _ delimiter), then alternate the lookarounds with ^ and $:
(?<=_|^)(?!word1|word3)[^_]+(?=_|$)
https://regex101.com/r/QFb1p7/1
You may use
(?<![^_])(?!(?:word1|word3)(?![^_]))[^_]+(?![^_])
See the regex demo.
Note: (?<![^_]) = (?<=^|_) and (?![^_]) = (?=_|$), but the negated forms are more efficient. Why is (?![^_]) used in (?!(?:word1|word3)(?![^_]))? Because you may still want to match word10 or word345.
Details
(?<![^_]) - start of string or _
(?!(?:word1|word3)(?![^_])) - no word1 or word3 up to the end of string or _ are allowed
[^_]+ - 1+ chars other than underscores
(?![^_]) - end of string or _
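The breakdown above can be tried out in Python, whose re module accepts the same lookarounds. This is a sketch rather than the MySQL call: mask_disallowed and its parameters are hypothetical names, and the allowed list is joined into the alternation at run time.

```python
import re

def mask_disallowed(text, allowed, placeholder="..."):
    """Replace every underscore-delimited word that is not in `allowed`."""
    # Escape each allowed word so regex metacharacters are taken literally.
    allowed_alt = "|".join(re.escape(word) for word in allowed)
    pattern = rf"(?<![^_])(?!(?:{allowed_alt})(?![^_]))[^_]+(?![^_])"
    # A function replacement keeps the placeholder from being parsed
    # for backslash escapes.
    return re.sub(pattern, lambda match: placeholder, text)

print(mask_disallowed("word1_word2_word3_word4", ["word1", "word3"]))
# word1_..._word3_...
print(mask_disallowed("word10_word1", ["word1", "word3"]))
# ..._word1
```

The second call shows why the inner (?![^_]) matters: word10 starts with an allowed word but is still replaced, because the allowed word must run up to the next underscore or the end of the string.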

Regexp to match JSON key:value pairs with commas in value [duplicate]

Can't get why this regex (regex101)
/[\|]?([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures all the input, while this (regex101)
/[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures only |Func
Input string is |Func(param1, param2, param32, param54, param293, par13am, param)|
Also how can i match repeated capturing group in normal way? E.g. i have regex
/\(\(\s*([a-z\_]+){1}(?:\s+\,\s+(\d+)*)*\s*\)\)/gui
And input string is (( string , 1 , 2 )).
Regex101 says "a repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations...". I've tried to follow this tip, but it didn't help me.
Your /[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g regex does not match because you did not define a pattern to match the words inside parentheses. You might fix it as \|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?, but all the values inside parentheses would be matched into one single group that you would have to split later.
It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer.
What you may do is get multiple matches with preg_match_all, capturing the initial delimiter.
So, to match the second string, you may use
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\()\K\w+
See the regex demo.
Details:
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\() - either the end of the previous match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*,\s*), or 1+ | symbols (\|+), followed with 1+ alphanumeric chars (captured into Group 1, ([a-z0-9A-Z]+)) and a ( symbol (\()
\K - omit the text matched so far
\w+ - 1+ word chars.
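Python's built-in re module has neither \G nor \K, so the sketch below follows the other route mentioned in this answer: match the whole parameter list into one group and split it afterwards. CALL and parse_call are hypothetical names; the pattern is the "fixed" one quoted earlier in the answer.

```python
import re

# Group 1: the function name. Group 2: the whole comma-separated parameter list.
CALL = re.compile(r"\|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?")

def parse_call(text):
    """Return (name, [params]) for the first |Name(a, b, ...)| found, else None."""
    match = CALL.search(text)
    if match is None:
        return None
    name, args = match.group(1), match.group(2)
    # A repeated group keeps only its last iteration, so split the one big group.
    params = re.split(r"\s*,\s*", args) if args else []
    return name, params

print(parse_call("|Func(param1, param2, param32, param54, param293, par13am, param)|"))
# ('Func', ['param1', 'param2', 'param32', 'param54', 'param293', 'par13am', 'param'])
```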

problems using replaceText for special characters: [ ]

I want to replace "\cite{foo123a}" with "[1]" and backwards. So far I was able to replace text with the following command
body.replaceText('.cite{foo}', '[1]');
but I did not manage to use
body.replaceText('\cite{foo}', '[1]');
body.replaceText('\\cite{foo}', '[1]');
Why?
The back conversion I cannot get to work at all
body.replaceText('[1]', '\\cite{foo}');
This replaces only the "1", not the [ ], which means the [] are interpreted as a regex character set. Escaping them does not help:
body.replaceText('\[1\]', '\\cite{foo}');//no effect, still a char set
body.replaceText('/\[1\]/', '\\cite{foo}');//no matches
The documentation states
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.
Can I find a full description of what is supported and what not somewhere?
I'm not familiar with Google Apps Script, but this looks like ordinary regular expression troubles.
Your second conversion is not working because the string literal '\[1\]' is just the same as '[1]'. You want to quote the text \[1\] as a string literal, which means '\\[1\\]'. Slashes inside of a string literal have no relevant meaning; in that case you have written a pattern which matches the text /1/.
Your first conversion is not working because {...} denotes a quantifier, not literal braces, so you need \\\\cite\\{foo\\}. (The four backslashes are because to match a literal \ in a regular expression is \\, and to make that a string literal it is \\\\ — two escaped backslashes.)
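The two layers of escaping described here are not specific to Apps Script. The sketch below shows them with Python's re module, which is only a stand-in: it says nothing about which regex features replaceText supports, and the variable names are made up for the illustration.

```python
import re

text = r"See \cite{foo} for details."

# Regex that matches the literal text \cite{foo}:   \\cite\{foo\}
# Written as an ordinary string literal, every backslash is doubled again.
cite_pattern = "\\\\cite\\{foo\\}"
# A raw string removes the second layer: r"\\cite\{foo\}" is the same pattern.

numbered = re.sub(cite_pattern, "[1]", text)

# Going back: escape the brackets so they are not read as a character class.
bracket_pattern = "\\[1\\]"
# A function replacement avoids a third layer of escaping in the replacement.
restored = re.sub(bracket_pattern, lambda match: r"\cite{foo}", numbered)

print(numbered)  # See [1] for details.
print(restored)  # See \cite{foo} for details.
```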

Is there a way to include commas in CSV columns without breaking the formatting?

I've got a two column CSV with a name and a number. Some people's name use commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know excel supports other delimiters but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?
To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See wikipedia.
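Most CSV libraries apply these quoting rules for you. As a sketch, Python's standard csv module reproduces the examples above; encode_row and decode_row are hypothetical helper names wrapped around csv.writer and csv.reader.

```python
import csv
import io

def encode_row(fields):
    """Encode one row with the default (Excel-style) dialect, without the line ending."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue().rstrip("\r\n")

def decode_row(line):
    """Parse one CSV line back into a list of fields."""
    return next(csv.reader([line]))

print(encode_row(["Hello, World!", 'You "matter" to us.']))
# "Hello, World!","You ""matter"" to us."
print(encode_row(['"']))
# """"
print(decode_row('"Fresh, brown ""eggs"""'))
# ['Fresh, brown "eggs"']
```

Fields that contain neither the delimiter nor a quote are written unquoted, which matches the regular_value example.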
I found that some applications, like Numbers on Mac, ignore the double quote if there is a space before it.
a, "b,c" doesn't work while a,"b,c" works.
The problem with the CSV format is that there's not one spec: there are several accepted methods, with no way of distinguishing which should be used (for generating or interpreting). I discussed all the methods to escape characters (newlines in that case, but same basic premise) in another post. Basically it comes down to using a CSV generation/escaping process for the intended users, and hoping the rest don't mind.
Reference spec document.
If you want to do what you described, you can use quotes. Something like this:
$name = "Joe Blow, CFA.";
$arr[] = "\"".$name."\"";
so now, you can use comma in your name variable.
You need to quote those values.
Here is a more detailed spec.
In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print('%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4))
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print('%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4))
It was this subtlety that brought me here.
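The same sensitivity to a space before the opening quote can be seen in Python's csv reader, which is used here only as an illustration of how a parser may behave; Excel and Numbers are separate implementations. The variable names are made up for the sketch.

```python
import csv

spaced = 'a, "b,c"'   # space between the comma and the opening quote
tight = 'a,"b,c"'     # quote directly after the comma

# Default settings: the quote is not at the start of the field,
# so it is kept as a literal character and the comma still splits.
strict = next(csv.reader([spaced]))

# skipinitialspace=True drops the space, so the quote opens a quoted field.
lenient = next(csv.reader([spaced], skipinitialspace=True))

normal = next(csv.reader([tight]))

print(strict)   # ['a', ' "b', 'c"']
print(lenient)  # ['a', 'b,c']
print(normal)   # ['a', 'b,c']
```

When you control the writer, emitting no space after the delimiter is the safer choice, since not every reader has an option like skipinitialspace.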
You can use Template literals (Template strings)
e.g -
`"${item}"`
CSV files can actually be formatted using different delimiters; comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file, that is if you want your delimiter to be semi-colon. You can change it to any other character.
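A minimal sketch of producing such a file with Python's csv module follows. write_with_sep_hint is a hypothetical helper, and the sep= line is treated here purely as the Excel hint described above; other CSV readers may well see it as an ordinary first row.

```python
import csv
import io

def write_with_sep_hint(rows, delimiter=";"):
    """Return CSV text whose first line tells Excel which delimiter is used."""
    buffer = io.StringIO()
    buffer.write(f"sep={delimiter}\r\n")  # hint line, not part of the data
    csv.writer(buffer, delimiter=delimiter).writerows(rows)
    return buffer.getvalue()

output = write_with_sep_hint([["Joe Blow, CFA", 42]])
print(output)
```

Because the delimiter is now a semicolon, the comma in the name needs no quoting; a name containing a semicolon would be quoted instead.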
This isn't a perfect solution, but you can replace every comma with ‚ (a low quotation mark). It looks very similar to a comma and serves the same visual purpose. No quotes are required.
In JS this would be
stringVal.replaceAll(',', '‚')
You will need to be super careful of cases where you need to directly compare that data though
Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.
I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.
May not be what is needed here but it's a very old question and the answer may help others. A tip I find useful with importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively you can change the default separator in Windows but a bit long-winded:
Control Panel>Clock & region>Region>Formats>Additional>Numbers>List separator [change from comma to your preferred alternative]. That means Excel will also default to exporting CSVs using the chosen separator.
You could encode your values, for example in PHP base64_encode($str) / base64_decode($str)
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.
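The same idea in Python, as a sketch using the standard base64 module; encode_field and decode_field are hypothetical helper names.

```python
import base64

def encode_field(value: str) -> str:
    """Base64-encode a field so it contains only A-Z, a-z, 0-9, +, / and =."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")

def decode_field(token: str) -> str:
    """Reverse encode_field."""
    return base64.b64decode(token).decode("utf-8")

name = "Joe Blow, CFA"
line = ",".join([encode_field(name), encode_field("42")])

# Every comma in the line is now a separator, so a plain split is safe.
decoded = [decode_field(token) for token in line.split(",")]
print(decoded)  # ['Joe Blow, CFA', '42']
```

The trade-off is that the file can no longer be read by eye or opened usefully in a spreadsheet without a decoding step.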
You can set the Text_Qualifier field in your Flat File connection manager to ". This should wrap your data in quotes and separate only on commas that are outside the quotes.
First, if the item value contains a double quote character ("), replace it with two double quote characters (""):
item = item.ToString().Replace("""", """""")
Finally, wrap item value:
ON LEFT: With double quote character (")
ON RIGHT: With double quote character (") and comma character (,)
csv += """" & item.ToString() & ""","
Plain double quotes did not work for me; \" did. For example, to place a double quote in the output you can write \"\".
You can build formulas, as example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
will write a quoted field to the CSV that Excel reads as the formula:
=IF(C3=1,"",B3)
A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your csv is not mangled.
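// yourEscapeCharacter is a placeholder for your delimiter, e.g. ",".
private string EscapeDelimiter(string field)
{
    // Quote the field if it contains the delimiter, a double quote, or a line break.
    if (field.Contains(yourEscapeCharacter) || field.Contains("\"")
        || field.Contains("\n") || field.Contains("\r"))
    {
        field = field.Replace("\"", "\"\""); // double any embedded quotes
        field = $"\"{field}\"";              // wrap the whole field in quotes
    }
    return field;
}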