Determining a Part-of-Speech without nlkt or spacey but rather regex - nltk

How can I identify if a word is a noun, adjective or verb without using an external library like nlkt or spacey but simply using regex

Related

Regex for replacing unnecessary quotation marks within a JSON object containing an array

I am currently trying to format a JSON object using LabVIEW and have ran into the issue where it adds additional quotation marks invalidating my JSON formatting. I have not found a way around this so I thought just formatting the string manually would be enough.
Here is the JSON object that I have:
{
"contentType":"application/json",
"content":{
"msgType":2,
"objects":"["cat","dog","bird"]",
"count":3
}
}
Here is the JSON object I want with the quotation marks removed.
{
"contentType":"application/json",
"content":{
"msgType":2,
"objects":["cat","dog","bird"],
"count":3
}
}
I am still not an expert with regex and using a regex tester I was only able to grab the "objects" and "count" fields but I would still feel I would have to utilize substrings to remove the quotation marks.
Example I am using (would use a "count" to find the start of the next field and work backwards from there)
"([objects]*)"
Additionally, all the other Regex I have been looking at removes all instances of quotation marks whereas I only need a specific area trimmed. Thus, I feel that a specific regex replace would be a much more elegant solution.
If there is a better way to go about this I am happy to hear any suggestions!
Your question suggests that the built-in LabVIEW JSON tools are insufficient for your use case.
The built-in library converts LabVIEW clusters to JSON in a one-shot approach. Bundle all your data into a cluster and then convert it to JSON.
When it comes to parsing JSON, you use the path input terminal and the default type terminals to control what data is parsed from a JSON string.
If you need to handle JSON in a manner similar to say JavaScript, I would recommend something like the JSONText Toolkit which is free to use (and distribute) under the BSD licence. This allows more complex and iterative building of JSON strings from LabVIEW types and has text-path style element access along with many more features.
The Output controls from both my examples are identical - although JSONText provides a handy Pretty Print vi.
After using a regex from one of the comments, I ended up with this regex which allowed me to match the array itself.
(\[(?:"[^"]*"|[^"])+\])
I was able to split the the JSON string into before match, match and after match and removed the quotation marks from the end of 'before match' and start of 'after match' and concatenated the strings again to form a new output.

Custom reason for word wrap in VS Code extension to enable working with multiline values in Json

I write Jsons for an API that often requires to have multiline values because scripts are in between the data in the attributes. I've written an extension for me that can escape and unescape multiline values, therefore I can cycle between those states:
{
"value": "
multiline
value
"
}
{
"value": "multiline\n value"
}
However, in the un-escaped, formatted, status, I have an invalid Json, which just causes trouble. I have to switch between escaped and unescaped states to do any Json operation (like format), which I work around by replacing \n with \\n and back.
I have even considered switching to another format, but neither of those I tried had a killer feature making me switch. Among those: Jsonc (no multiline value support), XML (hard to write and read, but supports multiline values and indentation), YAML (would be an option, but does not support indentation in multiline values).
Can I force VS Code to render a specific sequence of characters as line break (in this case, it would be \\n) without changing the document data? The intended functionality is like what the Alt+Z word wrap does, just in a different place.
After some research, I've found the following:
it is not possible or hard to do to hijack the default editor and make it render lines in a way I want
even if I managed to, it might have an underlying issue of not tokenizing long lines
I have decided to go in a way, where I define a custom language, because:
* the API has a constant Json structure
* I can define a new grammar for the API scripts and embed it to the Jsons
This approach seems to be a way to go for me, although it might be a temporary solution. I'm losing Json validations, therefore I do not get a Json new line in value error, but I'm also losing errors in missing commas. This is something I want to approach with the following precautions:
if a JSON is valid and contains attributes of the API-defined classes, offer a button to switch to my defined language, which also un-escapes \\\n to \n
if language is my language, escape the newlines and try to pass it to a JSON validation.

How to read a file and write to other file in tcl with replacing values

I have three files: Conf.txt, Temp1.txt and Temp2.txt. I have done regex to fetch some values from config.txt file. I want to place the values (Which are of same name in Temp1.txt and Temp2.txt) and create another two file say Temp1_new.txt and Temp2_new.txt.
For example: In config.txt I have a value say IP1 and the same name appears in Temp1.txt and Temp2.txt. I want to create files Temp1_new.txt and Temp2_new.txt replacing IP1 to say 192.X.X.X in Temp1.txt and Temp2.txt.
I appreciate if someone can help me with tcl code to do same.
Judging from the information provided, there basically are two ways to do what you want:
File-semantics-aware;
Brute-force.
The first way is to read the source file, parse it to produce certain structured in-memory representation of its content, then serialize this content to the new file after replacing the relevant value(s) in the produced representation.
Brute-force method means treating the contents of the source file as plain text (or a series of text strings) and running something like regsub or string replace on this text to produce the new text which you then save to the new file.
The first way should generally be favoured, especially for complex cases as it removes any chance of replacing irrelevant bits of text. The brute-force way me be simpler to code (if there's no handy library to do this, see below) and is therefore good for throw-away scripts.
Note that for certain file formats there are ready-made libraries which can be used to automate what you need. For instance, XSLT facilities of the tdom package can be used to to manipulate XML files, INI-style file can be modified using the appropriate library and so on.

JSON alternatives (for the purpose of specifying configuration)?

I like json as a format for configuration files for the software I write. I like that it's lightweight, simple, and widely supported. However, I'm finding that there are some things I'd really like in json that it doesn't have.
Json doesn't have multiline strings or here documents ( http://en.wikipedia.org/wiki/Here_document ), and that is often very awkward when you want your json file to be human-readable and -editable. You can use arrays of strings, but that's a kludgy workaround.
Json doesn't allow comments.
If you look at the formats of unix configuration files, you see a lot of people designing their own awkward formats for things that it would really make more sense to do using some kind of general-purpose thing. For example, here's some code from an Apache config file:
RewriteEngine on
RewriteBase /temp
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0
RewriteCond %{REQUEST_URI} \.html
RewriteCond %{THE_REQUEST} HTTP/1\.1
RewriteRule t\.html t.xhtml [T=application/xhtml+xml]
Essentially, what's going on here is that they've invented an extremely painful way of writing a boolean function f(w,x,y,z)=w&!x&y&z. You want a logical "or"? They've got some separate (ugly) mechanism for that, too.
What this seems to point toward is some kind of data description language that is simple and Turing-incomplete, but still more expressive, flexible, and convenient than json. Does anyone know of such a language?
To my taste, XML is too complicated, and lisp expressions have the wrong features (Turing-completeness) and lack the right features (here documents, expressive syntax).
[EDIT] The title is misleading. I'm not literally interested in the next iteration of json. I'm not interested in languages that are a subset of javascript. I'm interested in alternative data-description languages.
The EDN format is one option based on Clojure literals. It is almost a superset of JSON, except that no special symbol separates keys and values in maps (as : does in JSON); rather, all elements are separated by whitespace and/or a comma and a map is encoded as a list with an even number of elements, enclosed in {..}.
EDN allows for comments (to newline using ;, or to end of the next element read using #_), but not here-docs. It is extensible to new types using a tag notation:
#myapp/Person {:first "Fred" :last "Mertz"}
The argument of the myapp/Person tag (i.e. {:first "Fred" :last "Mertz"}) must be a valid EDN expression, which makes it unextensible to here-doc support.
It has two built-in tags: #inst for timestamps and #uuid. It also supports namespaced symbol (i.e. identifier) and keyword (i.e. map key consts) types; it distinguishes lists (..) and vectors [..]. An element of any type may be used as a key in a map.
In the context of your above problem, one could invent an #apache/rule-or tag which accepts a sequence of elements, whose semantics I leave up to you!
Have a look at http://github.com/igagis/puu/
It is even simpler than JSON.
It has C++ style comments.
It is possible to format multiline strings and use escaped new line \n and tab \t chars if "real" new line or tab is needed.
Here is the example snippet:
"String object"
AnotherStringObject
"String with children"{
"child 1"
Child2
"child three"{
SubChild1
"Subchild two"
Property1 {Value1}
"Property two" {"Value 2"}
//comment
/* multi-line
comment */
"multi-line
string"
"Escape sequences \" \n \r \t \\"
}
R"qwerty(
This is a
raw string, "Hello world!"
int main(argc, argv){
int a = 10;
printf("Hello %d", a);
}
)qwerty"
}
Consider TOML.
Designed for configuration. Appears to be pretty friendly and powerful. Easy to read and supports a wide range of datatypes and structures. There are parsers for a lot of languages:
C
C#
C++
Common Lisp
Crystal
Dart
Erlang
Fortran
Go
Janet
Java
JavaScript
Julia
Kotlin
Lua
Nim
OCaml
Perl
Perl6/Raku
Python
Rust
Swift
V
The 'J' in JSON is "Javascript". If a particular desired syntax construct isn't in Javascript, then it won't be on JSON.
Heredocs are beyond JSON's purview. That's a language syntax construct for simplified multi-line string definition, but JSON is a transport notation. It has nothing to do with construction. It does, however, have multiline strings, simply by allowing \n newline characters within strings. There's nothing in JSON that says you can't have a linebreak in a string. As long as the containing quote characters are correct, it's perfectly valid. e.g.
{"x":"y\nz"}
is 100% legitimate valid JSON, and is a multiline string, whereas
{"x":"y
z"}
isn't and will fail on parsing.
There's always what I like to call "real JSON". JSON stands for JavaScript Object Notation, and JavaScript does have comments and something close enough to heredocs.
For the heredoc, you would use JavaScript's E4X inline XML:
{
longString: <>
Hello, world!
This is a long string made possible with the magic of E4X.
Implementing a parser isn't so difficult.
</>.toString() // And a comment
/* And another
comment */
}
You can use Firefox's JavaScript engine (FF is the only browser to support E4X currently) or you can implement your own parser, which really isn't so difficult.
Here's the E4X quickstart guide, too.
Since March 2018 you can use JSON5 which seems to have added everything you (& many others) were missing from JSON.
Short Example (JSON5)
{
// comments
unquoted: 'and you can quote me on that',
singleQuotes: 'I can use "double quotes" here',
lineBreaks: "Look, Mom! \
No \\n's!",
hexadecimal: 0xdecaf,
leadingDecimalPoint: .8675309, andTrailing: 8675309.,
positiveSign: +1,
trailingComma: 'in objects', andIn: ['arrays',],
"backwardsCompatible": "with JSON",
}
The JSON5 Data Interchange Format (JSON5) is a superset of JSON that
aims to alleviate some of the limitations of JSON by expanding its
syntax to include some productions from ECMAScript 5.1.
Summary of Features
The following ECMAScript 5.1 features, which are not supported in
JSON, have been extended to JSON5.
Objects
Object keys may be an ECMAScript 5.1 IdentifierName.
Objects may have a single trailing comma.
Arrays
Arrays may have a single trailing comma.
Strings
Strings may be single quoted.
Strings may span multiple lines by escaping new line characters.
Strings may include character escapes.
Numbers
Numbers may be hexadecimal.
Numbers may have a leading or trailing decimal point.
Numbers may be IEEE 754 positive infinity, negative infinity, and NaN.
Numbers may begin with an explicit plus sign.
Comments
Single and multi-line comments are allowed.
White Space
Additional white space characters are allowed.
GitHub: https://github.com/json5/json5
One important attribute of JSON (probably the most important) is that you can easily "flip" between the string representation and the representation in object form, and the objects used to represent the object form are relatively simple arrays and maps. This is what makes JSON so useful in a networking context.
The functions you want would conflict with this dual nature of JSON.
For configuration you could use an embeddable scripting language, such as lua or python, in fact this is not an uncommon thing to do for configuration. That gives you multiline strings or here documents, and comments. It also makes it easier to have things like the boolean function you describe. However, the scripting languages are, of course, Turing complete.
There is also ELDF.
Although it does not support comments, they can be emulated via empty keys:
config_var1 = value1
=some comment
config_var2 = value2

convert html entities to unicode(utf-8) strings in c? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to decode HTML Entities in C?
This question is very similar to that one, but I need to do the same thing in C, not python. Here are some examples of what the function should do:
input output
< <
> >
ä ä
ß ß
The function should have the signature char *html2str(char *html) or similar. I'm not reading byte by byte from a stream.
Is there a library function I can use?
There isn't a standard library function to do the job. There must be a large number of implementation available in the Open Source world - just about any program that has to deal with HTML will have one.
There are two aspects to the problem:
Finding the HTML entities in the source string.
Inserting the appropriate replacement text in its place.
Since the shortest possible entity is '&x;' (but, AFAIK, they all use at least 2 characters between the ampersand and the semi-colon), you will always be shortening the string since the longest possible UTF-8 character representation is 4 bytes. Hence, it is possible to edit in situ safely.
There's an illustration of HTML entity decoding in 'The Practice of Programming' by Kernighan and Pike, though it is done somewhat 'in passing'. They use a tokenizer to recognize the entity, and a sorted table of entity names plus the replacement value so that they can use a binary search to identify the replacements. This is only needed for the non-algorithmic entity names. For entities encoded as 'ß', you use an algorithmic technique to decode them.
This sounds like a job for flex. Granted, flex is usually stream-based, but you can change that using the flex function yy_scan_string (or its relatives). For details, see The flex Manual: Scanning Strings.
Flex's basic Unicode support is pretty bad, but if you don't mind coding in the bytes by hand, it could be a workaround. There are probably other tools that can do what you want, as well.