How to get a particular word/string in lua regex

I'm having trouble extracting a string.
Here is my code:
conf = "option fn_o 'Operator'"
print(conf)
local s, e, pa = string.find(conf, "\b(?!option|fn_o)\b\w+")
print(s, e, pa)
I want to get only Operator. In JavaScript that regex works well, but in Lua it doesn't. I assumed there would be no problem because Lua is based on json so it is similar to JavaScript. What is wrong?

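Lua does not support full regular expressions out of the box, but there are libraries that do. Lua's own patterns have no lookahead such as (?!...) and no alternation with |, and they use % rather than \ as the escape character, so your expression cannot work as written.
Lua patterns do suffice for your task, though.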
This code gets the string inside single quotes:
print(string.match(conf, "'(.-)'"))
The pattern reads: find a single quote and capture everything until the next single quote.
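For comparison, here is a minimal sketch of the same idea in Python's re module, where the lazy quantifier .*? plays the role of Lua's .- in the pattern above. The variable names are only illustrative.

```python
import re

conf = "option fn_o 'Operator'"

# Lazily capture everything between the first pair of single quotes.
# `.*?` is the regex counterpart of Lua's `.-` (shortest possible match).
match = re.search(r"'(.*?)'", conf)
value = match.group(1) if match else None
print(value)
```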

Related

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I found on SO, but none of them worked on https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions,
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex can I use to get just the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/

Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. This information comes from the article about numbered backreferences on Regular-Expressions.info.
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
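As a rough illustration, the Method 2 expression can be tried in Python, one of the flavours listed as supporting backreferences. This is only a sketch: the dictionary comprehension and the variable names are my own additions, and the input is the sample line from the question.

```python
import re

line = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

# Group 1: attribute name, group 2: opening quote, group 3: attribute value.
# \2 forces the closing quote to be the same character as the opening one.
pattern = re.compile(r"""\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2""")

attrs = {m.group(1): m.group(3) for m in pattern.finditer(line)}
print(attrs)
```

Note that htmlUrl is not picked up: the \b word boundary has to sit directly in front of the attribute name, and htmlUrl does not contain xmlUrl as a substring in any case.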
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute value might end up in either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"]*)"|'([^']*)')
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpattern names are not often supported. See this Regular-Expressions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) for grabbing anything between two " symbols. Note that it does not account for backslash-escaped quotes; it will match any non-" characters between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"]*)"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.

For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).
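To make the parser suggestion concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The surrounding opml/body wrapper is an assumption added so the sample is a complete document; only the outline element comes from the question.

```python
import xml.etree.ElementTree as ET

# Assumed minimal OPML wrapper around the outline element from the question.
opml = """<opml version="2.0">
  <body>
    <outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
  </body>
</opml>"""

root = ET.fromstring(opml)

# Collect (title, feed URL) for every outline element that has an xmlUrl attribute.
feeds = [(o.get("text"), o.get("xmlUrl"))
         for o in root.iter("outline")
         if o.get("xmlUrl")]

for title, url in feeds:
    print(title)
    print(url)
```

With a parser, quoting style and whitespace around = stop mattering, which is the main thing the regex variants above have to work around.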

NLTK letter 'u' in front of text result?

I'm learning NLTK with a tutorial and whenever I try to print some text contents, it returns with 'u' in front of it.
In the tutorial it looks like this,
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
But in my result, it looks like this
(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')
I am not sure why. I followed the tutorial exactly. Can someone help me understand this problem? Thank you!

That leading u just means that the string is Unicode; Python 2 shows it when it prints the repr of a unicode string. All strings are Unicode in Python 3. The parentheses mean that you are dealing with a tuple. Both will go away if you print the individual elements of the tuple, as with t[0], t[1], and so on (assuming that t is your tuple).
If you want to print the tuple as a whole, without the u's and parentheses, try the following:
print(" ".join(t))
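A short sketch of the difference, written for Python 3, where no u prefix appears at all. The tuple t is copied from the output in the question, minus the u markers.

```python
t = ('firefox.txt',
     'Cookie Manager: "Don\'t allow sites that set removed cookies to se',
     '...')

# Printing the tuple itself shows its repr: parentheses, commas and quotes.
print(t)

# Joining the elements prints only their text, as in the tutorial.
joined = " ".join(t)
print(joined)
```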

As mentioned in the other answer, the leading u just means that the string is Unicode. str() can be used to convert unicode to str, but there doesn't seem to be a direct way to convert all the values in a tuple from unicode to str.
You can write a simple function like the one below and use it whenever you are dealing with a tuple in nltk.
>>> def str_tuple(t, encoding="ascii"):
...     return tuple([i.encode(encoding) for i in t])
...
>>> str_tuple(nltk.corpus.gutenberg.fileids())
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')

I guess you are using Python 2.6 or some other version before 3.0.
Python 2 lets you use str and unicode objects largely interchangeably. In some cases it converts between them implicitly, relying on the default encoding, which on most platforms is ASCII. That is probably the cause of your problem. Here are two ways that may solve it:
First, specify the encoding explicitly. For example:
>>> for name in nltk.corpus.gutenberg.fileids():
...     print(name.encode('utf-8'))
The other way is to update your Python to version 3.0+ (recommended). This problem was fixed in Python 3.0. Here is a link to the detailed description of the change:
https://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Hope this helps you.

Regex not matching URL Params

I am currently working on a stub server I can plug into a webpage so I do not need to hit sagepay every time I test my payment screen. I need the server to receive a request from the web page and use the dynamic parameters contained in the URL to build the server response. The stub uses regex targets to pick out the parameters I need and add them to the response.
I am using this stub server
I built the accepted URL piece by piece, using the regex tester contained here to test each bit of logic. The expressions work separately, but when I try to join two or more of them together they refuse to work. Each parameter is separated by an ampersand (&) and the name of the parameter.
Here is a sample of the parameters:
paymentType=A&amount=147.06&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A&script=Retail/accept.py&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A&description=New Business Payment&firstName=Adam&surname=Har&addressLine1=20 Potters Road&city=London&postalCode=EC1 4JS&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b&cardType=valid&continuousAuthority=true&makeCurrent=true
and in a list for ease of reading (without &'s)
paymentType=A
amount=147.06
policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad
paymentMethod=A
script=Retail/accept.py
scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A
description=New Business Payment
firstName=Adam
surname=Har
addressLine1=20 Chase road
city=London
postalCode=EC1 3PF
payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b
cardType=valid
continuousAuthority=true
makeCurrent=true
And here is my accepted URL parameters with the regex logic:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))&description=([a-zA-Z0-9 ]+s)&firstName=[A-Za-z]&surname=[A-Za-z]&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
again in a list:
registerPayment?outputType=xml
country=GB
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))
description=([a-zA-Z0-9 ]+s)
firstName=[A-Za-z]
surname=[A-Za-z]
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
My question is: why do my regex and sample match fine separately, but not when I put them all together?
Additional question:
I am using the logic (([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+))) for the whole ScriptParams parameter (the &'s here are part of the parameter.) If I just want to get the 'uid' part and leave the rest, what expression would I need to target this (it is made up of A-z a-z 0-9 and dashes)?
thank you
UPDATE
I have tweaked your answer slightly, because the stub server I am using will not accept (?:[\s-]) when it loads the file containing the URL templates. I have also added a lot of % and 0-9 to the character classes because the request is URL-encoded (percent-encoded) before it is matched, which I had not anticipated, and a few of the params have rogue spaces beyond my control. Other than that, your solution worked great :)
Here is my new version of the scriptParams regex:
&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+
This accepts the whole parameter, and works fine in the regex tester. Now when I link anything after this part, there is an unsuccessful match.
I do not understand why this is a problem as the regex seem to string together nicely otherwise. Any ideas are appreciated.
Here is the full regex:
paymentType=[-%a-zA-Z0-9 ]+&amount=[0-9]+.[0-9]{2}&policyUid=([-A-Za-z0-9]+)&paymentMethod=([%a-zA-Z0-9]+)&script=[%/.a-zA-Z0-9]+&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+&firstName=[-%A-Za-z0-9]+&surname=[-%A-Za-z0-9]+&addressLine1=[-%a-zA-Z0-9 ]+&city=[-%a-zA-Z 0-9]+&postalCode=[-%a-zA-Z 0-9]+&payerUid=([-A-Za-z0-9]+)&cardType=[%A-Za-z0-9]+&continuousAuthority=[A-Za-z]+&makeCurrent=[A-Za-z]+
And here is the full set of URL params (with the URL encoding present):
paymentType=A&amount=104.85&policyUid=16a9cc22-0000-0000-5a96-5654d9a31f92&paymentMethod=A%20&script=RetailQuotes%2FacceptQuote.py%20&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07%26invokePCL%3dtrue%26paymentType%3dA%20&description=New%2520Business%2520Payment&firstName=Adam&surname=Har%20&addressLine1=26%2520Close&city=Potters%2520Town&postalCode=EC1%25206LR%20&payerUid=16a9c24e-0000-0000-5a96-5654b3f956e0&cardType=valid%20&continuousAuthority=true&makeCurrent=true
Thank you
PS
(Solved the server problem. Was a slight mistake I was making in the usage of URL params.)

First, not all of your regexes work: some are missing quantifiers, others contain a $ for some reason, and some parameters are missing altogether! Here's what they should have been:
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))
invokePCL=([a-z]+)
paymentType=A
description=([a-zA-Z0-9 ]+)
firstName=[A-Za-z]+
surname=[A-Za-z]+
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
And combined, you get:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))&invokePCL=([a-z]+)&paymentType=A&description=([a-zA-Z0-9 ]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
regex101 demo
[Note: I took your regexes where they matched and made minimal edits to them.]
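The effect of the stray $ can be reproduced in isolation. The sketch below uses Python's re module rather than the stub server's engine, so treat it as an illustration only. $ matches only at the end of the input, so the fragment works on its own but can never be followed by &paymentMethod=... in a combined expression.

```python
import re

fragment_input = "policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
combined_input = fragment_input + "&paymentMethod=A"

uid = r"[A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*"

# Tested alone, the fragment with $ matches: the value really is at the end.
alone = re.search(r"policyUid=(" + uid + r"$)", fragment_input)

# Joined with the next parameter, $ demands end-of-input before "&", so it fails.
broken = re.search(r"policyUid=(" + uid + r"$)&paymentMethod=([a-zA-Z]+)", combined_input)

# Without the $ the combined expression matches and both groups are captured.
fixed = re.search(r"policyUid=(" + uid + r")&paymentMethod=([a-zA-Z]+)", combined_input)

print(alone is not None, broken is None, fixed.groups())
```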
For your second question, I'm not sure what you mean by the uid part and by the & being part of the parameter. Given that there are 3 uids in the URL with a similar format (policy, scriptParams, user), you will have to put them all in the expression, unless you know a pattern specific to the scriptParams uid.
In the expression below, I made use of the fact that only scriptparams' uid was in lowercase:
uid=[0-9a-f]+(?:-[0-9a-f]+)+
regex101 demo
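A quick check of that expression against part of the sample parameters, again sketched with Python's re module. The capture group around the value is my addition so the uid can be read out without the uid= prefix.

```python
import re

params = ("paymentType=A&amount=147.06"
          "&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
          "&paymentMethod=A&script=Retail/accept.py"
          "&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf"
          "&invokePCL=true&paymentType=A")

# Matching is case-sensitive, so "policyUid=" (capital U) is skipped and only
# the lowercase "uid=" inside scriptParams is found.
m = re.search(r"uid=([0-9a-f]+(?:-[0-9a-f]+)+)", params)
script_uid = m.group(1) if m else None
print(script_uid)
```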

Iterating over a string in Vimscript or Parse a JSON file

So I'm creating a vim script that needs to load and parse a JSON file into a local object graph. I searched and I couldn't find any native way to process a JSON file, and I don't want to add any dependencies to the script. So I wrote my own function to parse the JSON string (gotten from the file), but it's really slow. At the moment, I iterate through each character in the file like so:
let len = strlen(jsonString) - 1
let i = 0
while i < len
  let c = strpart(jsonString, i, 1)
  let i += 1
  " A lot of code to process file....
  " Note: I've tried shortcutting the process by searching for the closing double quote when I come across the opening double quote (also taking the escaping '\' character into account). It doesn't help.
endwhile
I've also tried this method:
for c in split(jsonString, '\zs')
  " Do a lot of parsing ....
endfor
For reference, a file with ~29,000 characters takes about 4 seconds to process, which is unacceptable.
Is there a better way to iterate over a string in vim script?
Or better yet, have I missed a native function to parse JSON?
Update:
I asked for no dependencies because I:
Didn't want to deal with them
Genuinely wanted some ideas for best way to do this without someone else's work.
Sometimes I just like to do things manually even though the problem has already been solved.
I'm not against plugins or dependencies at all, it's just that I'm curious. Thus the question.
I ended up creating my own function to parse the JSON file. I was creating a script that could parse the package.json file associated with node.js modules. Because of this, I could rely on a fairly consistent format and quit the processing whenever I'd retrieved the information I needed. This usually cut out large chunks of the file since most developers put the largest chunk of the file, their "readme" section, at the end. Because the package.json file is strictly defined, I left the process somewhat fragile. It assumed a root dictionary { } and actively looks for certain entries. You can find the script here: https://github.com/ahayman/vim-nodejs-complete/blob/master/after/ftplugin/javascript.vim#L33.
Of course, this doesn't answer my own question. It's only the solution to my unique problem. I'll wait a few days for new answers and pick the best one before the bounty ends (already set an alarm on my phone).

The simplest solution with the fewest dependencies is Vim's built-in json_decode() function.
let dict = json_decode(jsonString)

Vim long predates JSON, but its internal string()/eval() representation happens to be so close to JSON that eval() is likely to work unless you need special characters.
You can look up an implementation here, which even supports true/false/null if you want:
https://github.com/MarcWeber/vim-addon-json-encoding
Better to use that library (vim-addon-manager makes it easy to install dependencies).
Whether this is good enough depends on your data.
Benjamin Klein posted your question to vim_use, which is why I'm replying.
The best and fastest replies come if you subscribe to the Vim mailing list.
Go to vim.sf.net and follow the community link.
You cannot expect the Vim community to scrape stackoverflow.
I've added the keywords "json" and "parsing" to that little piece of code so that it can be found more easily.
If this solution does not work for you, you can try one of the many :h if_* bindings, or write an external script that either extracts the information you're looking for or turns the JSON into Vim's dictionary representation, which eval() can read as long as the special characters you care about are escaped correctly.

If you are after a completely correct solution, avoiding dependencies is one of the worst things you can do. The eval() variant mentioned by @MarcWeber is one of the fastest, but it has its disadvantages:
Using the solution for securing eval() that I mentioned in a comment means it is no longer the fastest. In fact it makes eval() slower by more than an order of magnitude (0.02s vs 0.53s in my test).
It does not respect surrogate pairs.
It cannot be used to verify that you have correct JSON: it accepts some strings (e.g. "\<C-o>") that are not JSON strings and it allows trailing commas.
It fails to give normal error messages. It fails badly if you use the vam#VerifyIsJSON check I mentioned in the first point.
It fails to load floating-point values like 1e10 (Vim requires numbers to look like 1.0e10, whereas JSON allows numbers like 1e10: note the “and/or” in the first paragraph).
All of the above statements (except for the first) also apply to vim-addon-json-encoding mentioned by @MarcWeber, because it uses eval(). There are some other possibilities:
The fastest and most correct option is Python: pyeval('json.loads(vim.eval("varname"))'). It is not faster than eval(), but it is the fastest of the other possibilities (0.04s in my test: approximately two times slower than eval()).
Note that I use pyeval() here. If you want a solution for a Vim version that lacks this function, it will no longer be one of the fastest.
Use my json.vim plugin. Its error reporting is slightly better than a failed vam#VerifyIsJSON and slightly worse than eval(), and it loads floating-point numbers correctly. It can be used for verifying strings (it does not accept "\<C-a>"), but it loads lists with a trailing comma just fine. It does not support surrogate pairs. It is also very slow: in the test I used (with strings 279702 characters long) it takes 11.59s to load. json.vim tries to use Python if possible, though.
For the best error reporting you can take yaml.vim and strip the YAML support out of it, leaving only JSON (I once did the same thing for pyyaml, though in Python: see the markedjson library used in powerline, which is pyyaml minus the YAML stuff plus classes with marks). But this variant is even slower than json.vim and should only be used if error reporting is the main thing you need: 207 seconds to load the same 279702-character string.
Note that the only variant mentioned that satisfies both requirements, “no dependencies” and “no python”, is eval(). If you cannot live with its disadvantages you have to drop one or both of these requirements, or copy-paste code. If you take speed into account, though, only two candidates are left: eval() and Python. If you want to parse JSON fast you really must use C, and only these two solutions spend most of their time in functions written in C.
Most other interpreters (ruby/perl/TCL) do not have a pyeval() equivalent, so they will be slower even if their JSON implementation is written in C. Some others (lua/racket (mzscheme)) do have a pyeval() equivalent, but e.g. luaeval('{}') is zero, meaning that you will have to add an extra step that explicitly and recursively converts objects into Vim dictionaries and lists (e.g. luaeval('vim.dict({})')), which will impact performance. I cannot say anything about mzeval(), but I have never heard of anybody actually using racket (mzscheme) with Vim.
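The Python route relies on json.loads from the standard library. Outside Vim, the behaviours discussed above can be checked with a small sketch. The sample document and the is_valid_json helper are made up for illustration, and the vim module and pyeval() are deliberately left out so the snippet runs anywhere.

```python
import json

# Hypothetical sample: an exponent-only float and an escaped surrogate pair.
text = '{"name": "demo", "version": "1.0.0", "weight": 1e10, "emoji": "\\ud83d\\ude00"}'
pkg = json.loads(text)

def is_valid_json(s):
    """Return True if s parses as strict JSON, False otherwise."""
    try:
        json.loads(s)
    except json.JSONDecodeError:
        return False
    return True

print(pkg["weight"])               # 1e10 is accepted and loaded as a float
print(len(pkg["emoji"]))           # the surrogate pair becomes one character
print(is_valid_json("[1, 2,]"))    # trailing commas are rejected
```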

Swig and %rename filters

Is there a way to run multiple filters in a single %rename call in SWIG?
I know from the manual that I can use a line like this:
%rename("%(strip:[H3D])s") "";
which will turn all methods such as "H3DFoo" into "Foo". There are other built-in filters for doing case transformation, but there is no documentation on how to do multiple steps.
Using another %rename replaces the filter, and I haven't found a separator to run multiple filters on the string. So, it appears possible to convert type casing or remove a prefix and not both.
In this particular case it might be possible to use the regex filter, but it would be nice to be able to both remove a prefix and convert type casing. The other option is to put a %rename on every single declaration, but this defeats the purpose of %rename being able to apply to a module in general.

If your rules are more complicated than a single filter or a regex, I think I'd be inclined to go for the variant of %rename that can call a command.
Personally I would use perl, which has a plethora of CPAN modules for things like renaming, e.g.:
%rename("%(command:perl build/rename.pl <<<)s") "";
The manual warns against this because spawning a process for each rename is slow. Given that you typically don't run SWIG very often, I don't see that as a huge drawback.
