RDF - Right strategy for defining new namespaces

I am struggling to find the right strategy for defining new namespaces.
Strategy 1:
Use standard namespaces whenever possible - understood
Strategy 2:
Create my own namespace, basically to increase readability and maintainability, e.g.:
@prefix dimd: <https://www.company.com/products/di#> .
dimd:Table a rdfs:Resource ;
    rdfs:subClassOf dimd:Dataset .
dimd:column rdfs:range dimd:Column ;
    rdfs:domain dimd:Dataset .
Strategy for long URIs?
But how do I deal with the long URIs, e.g. datasets with columns:
<https://mycompany.com/sub/dataset/connection/shared/catalog/EU/Population.csv> a dimd:Dataset ;
    dimd:column <https://mycompany.com/sub/dataset/connection/shared/catalog/EU/Population.csv/SINGLE> .
<https://mycompany.com/sub/dataset/connection/shared/catalog/EU/Population.csv/SINGLE> a dimd:Column ;
    xsd:boolean .
Is there a notation to replace "https://mycompany.com/sub/dataset/connection/shared/catalog" with e.g. "instance"?
instance:/catalog/EU/Population.csv a dimd:Dataset ;
    dimd:column instance:/catalog/EU/Population.csv/SINGLE .
instance:/catalog/EU/Population.csv/SINGLE a dimd:Column ;
    xsd:boolean .
In this case the concept of namespace/prefix is not working due to the slashes, is it?

The slashes in the local part of your prefixed names need to be escaped with a backslash:
instance:\/catalog\/EU\/Population.csv
instance:\/catalog\/EU\/Population.csv\/SINGLE
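In context, a minimal sketch (assuming the instance: prefix is bound to the part of the URI before /catalog):
@prefix instance: <https://mycompany.com/sub/dataset/connection/shared> .

instance:\/catalog\/EU\/Population.csv a dimd:Dataset ;
    dimd:column instance:\/catalog\/EU\/Population.csv\/SINGLE .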
To make these names more friendly and shorter, you could create a prefix for each catalog (with a trailing slash):
@prefix eu-population: <https://example.com/sub/dataset/connection/shared/catalog/EU/Population.csv/> .
@prefix us-population: <https://example.com/sub/dataset/connection/shared/catalog/US/Population.csv/> .
<https://example.com/sub/dataset/connection/shared/catalog/EU/Population.csv> a dimd:Dataset ;
    dimd:column eu-population:SINGLE .
File vs. Dataset
You might want to differentiate between the catalog file and the dataset this file contains. This would not only allow you to make different statements about each entity (creation/update date, author etc.), but you could also use the shorter prefix to refer to the dataset:
eu-population: a dimd:Dataset ;
    ex:file <https://example.com/sub/dataset/connection/shared/catalog/EU/Population.csv> ;
    dimd:column eu-population:SINGLE .
So https://example.com/sub/dataset/connection/shared/catalog/EU/Population.csv would be the URI for the file, and https://example.com/sub/dataset/connection/shared/catalog/EU/Population.csv/ would be the URI for the dataset. (You could also consider omitting the .csv from the dataset URI, so that you don’t refer to a file type in case it changes in the future.)
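That last variant might look like this (a sketch; the trailing-slash dataset URI without .csv is hypothetical):
@prefix eu-population: <https://example.com/sub/dataset/connection/shared/catalog/EU/Population/> .

eu-population: a dimd:Dataset ;
    ex:file <https://example.com/sub/dataset/connection/shared/catalog/EU/Population.csv> ;
    dimd:column eu-population:SINGLE .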
However, if these URIs aren’t under your control, you should follow the definitions of the publisher.

Related

Rails - convert ascii to characters

I'm using Rails 5 to show database content in a web browser.
In the db, all of the special characters are written in their ascii form. For instance, instead of an apostrophe, it's written as &#39;.
Thus, my view is showing the ascii code. Is there a way to convert them all to characters for the view?
To transform ANY string containing HTML character entities, using Rails:
CGI.unescape_html "It doesn&#39;t look right" # => "It doesn't look right"
The CGI module is in the Ruby standard library and is required by Rails by default. If you want to do the same in a non-Rails project:
require 'cgi'
CGI.unescape_html "It doesn&#39;t look right" # => "It doesn't look right"
Based on your example, here's a simple Ruby solution if you want to define your own helper:
39.chr # => "'"
'&#39;'.delete('&#;').to_i.chr # => "'"
module ApplicationHelper
  # Convert a single decimal HTML entity like "&#39;" to its character
  def ascii_to_char(ascii)
    ascii.delete('&#;').to_i.chr
  end
end
# in the views
ascii_to_char('&#39;') # => "'"
If what you really need is full HTML unescaping, see @forsym's answer.
Characters were fed through some "HTML entities" conversion before being stored in the database. Go back and fix that.

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top of it using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it a new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
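For example, a post containing a line break would keep one record per line by escaping the newline inside the JSON string (field names taken from the sketch above):
{"id":"entry1","post":"first line\nsecond line"}
{"id":"entry2","post":"a single-line post"}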
Indented JSON won't work well with hadoop/hive, since to distribute processing, hadoop must be able to tell when a record ends, so it can split the processing of a file of N bytes among W workers into W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having stray "\n" characters around would confuse Hadoop, and it would give you incomplete records.
As an alternative (I wouldn't recommend it), you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, using a character that won't appear in the JSON (for instance, \001 or CTRL-A, which is commonly used by Hive as a field delimiter). But that can be tricky, since it also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies/uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly and will need special code to do so. Keeping "\n" as the delimiter, the files remain normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde
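With that SerDe, a table over the dumped files might look like this (a sketch: the SerDe class name is the one used by that project; the column names are assumed from the question and the HDFS location is hypothetical):
CREATE EXTERNAL TABLE posts (id STRING, post STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/data/social/posts';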

Is it possible to search for a phrase in opengrok containing curly brackets?

I have tried using something like "struct a {" to look for the declaration of "a". But it seems OpenGrok just ignores the curly brackets. Is there a way to search for the phrase "struct a {"?
Grok supports escaping special characters that are part of the query syntax.
The current list of special characters is:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape these characters, use \ before the character.
For example, to search for (1+1):2, use the query: \(1\+1\)\:2
You should be able to search with "struct a {" (with quotes)
From the OpenGrok documentation:
Escaping special characters:
OpenGrok supports escaping special characters that are part of the query syntax. The current special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these characters, use \ before the character. For example, to search for (1+1):2, use the query: \(1\+1\)\:2
NOTE on analyzers: Indexed words are made up of Alpha-Numeric and Underscore characters. One letter words are usually not indexed as symbols!
Most other characters (including single and double quotes) are treated as "spaces/whitespace" (so even if you escape them, they will not be found, since most analyzers ignore them).
The exceptions are: # $ % ^ & = ? . : which are mostly indexed as separate words.
Because some of them are part of the query syntax, they must be escaped with a reverse slash as noted above.
So searching for +1 or + 1 will both find +1 and + 1.
Valid FIELDs are
full
Search through all text tokens (words,strings,identifiers,numbers) in
index.
defs
Only finds symbol definitions (where e.g. a variable (function, ...)
is defined).
refs
Only finds symbols (e.g. methods, classes, functions, variables).
path
path of the source file (no need to use dividers, or if so, then use "/".
Windows users: "\" is an escape character in Lucene query syntax, so please don't use "\"; replace it with "/"). Also note that if you want
just an exact path, enclose it in double quotes, e.g. "src/mypath"; otherwise
dividers will be removed and you get more hits.
hist
History log comments.
type
Type of analyzer used to scope down to certain file types (e.g. just C
sources). Current mappings: [ada=Ada, asm=Asm, bzip2=Bzip(2), c=C,
clojure=Clojure, csharp=C#, cxx=C++, eiffel=Eiffel, elf=ELF,
erlang=Erlang, file=Image file, fortran=Fortran, golang=Golang,
gzip=GZIP, haskell=Haskell, jar=Jar, java=Java, javaclass=Java class,
javascript=JavaScript, json=Json, kotlin=Kotlin, lisp=Lisp, lua=Lua,
mandoc=Mandoc, pascal=Pascal, perl=Perl, php=PHP, plain=Plain Text,
plsql=PL/SQL, powershell=PowerShell script, python=Python, ruby=Ruby,
rust=Rust, scala=Scala, sh=Shell script, sql=SQL, swift=Swift,
tar=Tar, tcl=Tcl, troff=Troff, typescript=TypeScript,
uuencode=UUEncoded, vb=Visual Basic, verilog=Verilog, xml=XML,
zip=Zip]
The term (phrases) can be boosted (made more relevant) using a caret ^, e.g. help^4 opengrok will boost the term help.
OpenGrok search is powered by Lucene; for more detail on query syntax, refer to the Lucene docs.

What is the best value for "Unit Separator" in XML?

I used the Unit Separator (US/0x1f) in a database. When I export to an XML 1.0 file, it is not accepted and leaves the attribute with an empty value.
I have data in database like this:
"option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"
I expect the export to the XML 1.0 file to look like this:
<elementname attr1="option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"/>
However, the [US] is not accepted by XML 1.0. Any suggestions?
I can replace '\37' (oct 37, hex 1f) with something like "XXX", "$", "(0x1f)"... before writing to XML;
I can replace it when importing from XML and writing to the database. However, if I replace it with "&#x1F;", which is the HTML entity for Unit Separator, I end up with "&amp;#x1F;", which is definitely not what I wanted.
If I manually modify the XML file to "&#x1F;", I cannot use MSXML to load it; it gives the error "Invalid Unicode Character".
Any suggestions?
Thank you
Summary:
Let's make an analogy with how a compiler works: there are two phases, "Pre-Compile" and "Compile".
For XML file generation, it acts like the "Compile" phase, e.g. converting "<" to "&lt;".
However, the Unit Separator is not supported by XML 1.0, so the "Compile" phase will not convert it to the HTML entity "&#x1F;".
So we have to seek the solution in the "Pre-Compile" phase, which is our own application's responsibility.
When writing:
Option 1: <unit>aaa</unit><unit>bbb</unit>
Option 2: simply use "_x241F_" to replace "\37" in the string, provided "_x241F_" does not conflict with any existing token in the string.
When reading:
According to Option 1: load the elements and concatenate them into a single string with "\37" as the separator.
According to Option 2: simply replace "_x241F_" back with "\37".
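Applied to the attribute from the question, Option 2 would produce something like:
<elementname attr1="option1=10;option2=20;option3=aaa_x241F_bbb_x241F_ccc;"/>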
I've also found out that MSXML (even the highest version, MSXML6.dll) will not load XML 1.1.
So if we are unfortunately using MSXML, we have to write our own "Pre-Compile" code to handle the Unicode characters before feeding the "Compile" phase.
Note: I borrowed the idea of "_x241F_" from here.
Thanks for everyone's help
There is no HTML entity for U+001F UNIT SEPARATOR. Besides, HTML entities would be irrelevant when dealing with generic XML.
The character references would be &#31; and &#x1F;, in HTML and in XML, but the character is not allowed in either HTML or XML. For XML 1.0, which this seems to be about, please refer to section 2.2 Characters, where the normative definition is the following production (the associated comment is misleading, and comments are non-normative):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
The conclusions to be drawn depend on the meaning and purpose of UNIT SEPARATOR in the text. It has no generally defined meaning; it is up to applications to assign a meaning to it and process it accordingly.
Usually UNIT SEPARATOR is used to separate units of some kind, so the natural approach would be to process the incoming data so that instead of such separators, the data, when converted to XML format, has units denoted by markup. So for data like aaa[US]bbb[US]ccc where [US] is UNIT SEPARATOR, you would generate something like <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>.
This website
http://www.fileformat.info/info/unicode/char/1f/index.htm
suggests one of the following:
HTML Entity (decimal): &#31;
HTML Entity (hex): &#x1f;

Why doesn't my variable interpolate correctly when I build up a Mysql query?

I am trying to write a regex expression in MySQL from a Perl program. I want to have a query such as this:
WHERE a.keywords REGEXP '[[:<:]]something[[:>:]]'
However, in Perl, when I build this query I get an error when concatenating:
for ($i = 0; $i < $count; $i++) {
    $where = $where . "'[[:<:]]$andkeywords[$i][[:>:]]' "; # errors
}
Whereas this does not give me an error:
for ($i = 0; $i < $count; $i++) {
    $where = $where . "'[[:<:]] $andkeywords[$i] [[:>:]]' "; # no error
}
In the 'no error' code, notice that there are extra spaces. But if I have extra spaces, then I do not get the results I want, because in the DB there are no 'extra spaces'.
Just for completeness sake, this works too:
for ($i = 0; $i < $count; $i++) {
    $where .= "'[[:<:]]${andkeywords[$i]}[[:>:]]' ";
}
${blah} isn't valid outside of a string, but inside of an interpolatable string, it's equivalent to $blah.
I would have thought that this pattern is more common than the other answers, though... after all, how else do you want to type "foo${var}bar"? Obviously "foo$var\bar" doesn't work, since \b is a recognized escape sequence.
The reason in this case is that "$andkeywords[$i][[:>:]]" is being interpreted as a multi-dimensional array, and :>: is not a valid array index.
I personally prefer Mykroft's approach, but you could also achieve the same result by escaping the final opening bracket, like so:
$where=$where."'[[:<:]]$andkeywords[$i]\[[:>:]]' ";
<Obligatory security moan>
Please use a DBI parameter for each regex value instead of interpolating it. Why?
There are no longer any constraints on what characters are allowed. Currently, if any element of @andkeywords contains a quote, backslash or special regex character, things will break. E.g. the keyword "O'Reilly" will cause a database error.
People won't be able to construct malicious keywords to reveal information they shouldn't see or wreak havoc. (Imagine if a user entered "'; drop database;" as a keyword.) This is called an SQL injection attack, and the web is rife with poorly coded websites that are susceptible to them. Don't let yours be one of them.
Even if @andkeywords is not populated from user-entered data, it takes almost no extra effort to use DBI parameters, and your code will be safe for use in future unknown environments.
</Obligatory security moan>
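A minimal sketch with DBI placeholders (assuming a connected DBI handle $dbh; the table and column names here are hypothetical). \Q...\E (quotemeta) additionally neutralizes regex metacharacters inside each keyword:
# assumes: use DBI; my $dbh = DBI->connect(...); my @andkeywords = (...);
my @patterns = map { "[[:<:]]\Q$_\E[[:>:]]" } @andkeywords;  # \Q...\E escapes regex metacharacters
my $sql = 'SELECT a.id FROM articles a WHERE '               # table/column names are hypothetical
        . join(' AND ', ('a.keywords REGEXP ?') x @patterns);
my $sth = $dbh->prepare($sql);
$sth->execute(@patterns);                                    # one bound value per placeholder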
I've never really trusted the auto-replacement of variables in strings like that. You may want to consider explicitly doing the concatenation you want, like this:
for ($i = 0; $i < $count; $i++) {
    $where = $where . "'[[:<:]]" . $andkeywords[$i] . "[[:>:]]' ";
}
EDIT:
As ephemient points out, the generally accepted way to do this inline is:
for ($i = 0; $i < $count; $i++) {
    $where = $where . "'[[:<:]]${andkeywords[$i]}[[:>:]]' ";
}
Personally, I find the first way more readable, but as with all things Perl, TIMTOWTDI.
It would be helpful if you would include the text of any error messages.
Something tells me that
for ($i = 0; $i < $count; $i++) {
    $where = $where . "'[[:<:]]" . $andkeywords[$i] . "[[:>:]]' ";
    ...
}
Could be simplified to
for (@andkeywords) {
    $where .= qq('[[:<:]]${_}[[:>:]]' );
    ...
}
Or perhaps
$where .= join ' ', map { qq('[[:<:]]${_}[[:>:]]') } @andkeywords;