How to get all allowed languages for Wikidata

How to get all allowed languages for Wikidata - mediawiki

I'm writing a tool for interacting with Wikidata where labels and descriptions are added to items. But I would like to validate that the language is supported before trying to add it.
So my question is how do I get a list of the allowed language codes. The documentation describes this as UserLanguageCode but gives no info on retrieving the allowed values.
I know I can get a list of all of the used languages by doing the following SQL operation on the database, but that is both slow and inefficient: SELECT DISTINCT term_language FROM wb_terms.
As an aside is the list of allowed languages the same for MonolingualText statements?

There is now an API for getting the supported content languages (API sandbox):
https://www.wikidata.org/w/api.php?action=query&meta=wbcontentlanguages&wbclcontext=term&format=json&formatversion=2
By default it just returns the language code, but you can add the name and/or autonym (name in that language) via the wbclprop parameter. (To control the language in which the name is returned, set the global uselang parameter.)
To get allowed monolingual text languages, set wbclcontext to monolingualtext instead of term; on Wikidata, you can also set it to term-lexicographical for all language codes supported on lexicographical data (almost but not quite identical to the term languages).

User hoo on IRC channel #wikidata found this solution:
Get the JSON payload at this address:
https://www.wikidata.org/w/api.php?action=paraminfo&modules=wbsetlabel
And extract
modules[0].parameters[8].type
There are indeed less languages in this list than all the UI languages for MediaWiki.

Related

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found in SOF, but none supported here : https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions.
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex I can use to so I can just get the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/

Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according this article about numbered backreferences from Regular-Expressions.info
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute values might either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"])"|'([^']*))
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.

For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).

IBM Websphere scripting ID 'containment paths' explained (I.e. AdminConfig.getid(...) )

I have been digging through IBM knowledge center and have returned frustrated at the lack of definition to some of their scripting interfaces.
IBM keeps talking about a containment path used in their id definitions and I am only assuming that it corresponds to an xml element within file in the websphere config folder hierachy, but that's only from observation. I haven't found this declared in any documentation as yet.
Also, to find an ID, there is a syntax that needs to be used that will retrieve it, yet I cannot find a reference to the possible values of 'types' used in the AdminConfig.getid(...) function call.
So to summarize I have a couple questions:
What is the correct definition of the id "heirachy" both for fetching an id and for the id itself?
e.g. to get an Id of the server I would say something like: AdminConfig.getid('/Cell:MYCOMPUTERNode01Cell/Node:MYCOMPUTERNode01/Server:server1'), which would give me an id something like: server1(cells/MYCOMPUTERNode01Cell/nodes/MYCOMPUTERNode01/servers/server1|server.xml#server_1873491723429)
What are the possible /type values in retrieving the id from the websphere server?
e.g. in the above example, /Cell, /Node and /Server are examples of type values used in queries.
UPDATE: I have discovered this document that outlines the locations of various configuration files in the configuration directory.
UPDATE: I am starting to think that the configuration files represent complex objects with attributes and nested attributes. not every object with an 'id' is query-able through the AdminConfig.getid(<containment_path>). Object attributes (and this is not attributes in the strict xml sense as 'attributes' could be nested nodes with a simple structure within the parent node) may be queried using AdminConfig.showAttribute(<element_id>, <attribute_name>). This function will either return the string value of the inline attribute of the element itself or a string list representation of the ids of the nested attribute node(s).
UPDATE: I have discovered the AdminConfig.types() function that will display a list of all manipulatible object types in the configuration files and the AdminConfig.parent() function that displays what nodes are considered parents of the specified node.
Note: The AdminConfig.parent() function does not reveal the parents in order of their hierachy, rather it seems to just present a list of parents. e.g. AdminConfig.parent('JDBCProvider') gives us this exact list: 'Cell\nDeployment\nNode\nServer\nServerCluster' and even though Server comes before ServerCluster , running AdminConfig.parent('Server') reveals it's parents as: 'Node\nServerCluster'. Although some elements can have no parents - e.g. Cell - Some elements produce an error when running the parent function - e.g. Deployment.
Due to the apparent lack of a reciprocal AdminConfig.children() function, It appears as though obtaining a full hierachical tree of parents/children lies in calling this function on each of the elements returned from the AdminConfig.parent(<>) call and combining the results. That being said, a trial and error approach is mostly fruitful due to the sometimes obvious ordering.

Formating Output in Information Design Tool

When working with IDT 4.1 and when making a query in business layer, is it possible to format the yielded numbers? What I mean is - the output looks something like "1.9982121921**E7**" (please noctice E7 part). I would like BO to display the whole number without any suffixes.
Additionally, it would be even better to add a delimiter after thousands, millions,...

Is 1.9982121921**E7** the value that is returned from the database? Thus not a number but a string (alphanumeric)? In that case, you'll have to change the select statement and use a database function to trim the non-numeric characters off (e.g. SUBSTR, MID, LEFT, …).
Once you have numeric data, you can use the Display Format function to change the layout of your object.
If you're not happy with the predefined formats to choose from, you can always define a custom format. The formatting options are described in the Information Design Tool User Guide (links to the documentation for IDT in BI 4.1 SP3), section 12.10.23 Creating and editing display formats for business layer objects.
Right-click and object and select Create Display Format… from the context menu.
Or click the Create Display Format… button in the object's properties (located in the Advanced tab).

Set the type to numeric and enable the high precision check mark.

What data format is this?

I was checking one share trading site's AJAX response and below is what it showed up in Firebug Response tab of XHR section. Can anyone explain me what format is this and how is it parsed ?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>

I think what we are dealing with here is some proprietary format, likely an Eldricht SGML Horror of some sort.
Banking in general has all sorts of Eldricht horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by <>; with the parts of the statements separated by = or v=. = seems to indicate a parameter to a control statement, indicated by a two-letter code. (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by other applications than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.

It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.

Where to translate message strings - in the view or in the model?

We have a multilingual (PHP) application and use gettext for i18n. There are a few classes in the backend/model that return messages or message formats for printf().
We use xgettext to extract the strings that we want to translate.
We apply the gettext function T_() in the frontend/view - this seems to be where it belongs. So far we kept the backend clean from T_() calls, this way we can also unit-test messages.
So in the frontend we have something like
echo T_($mymodel->getMessage());
or
printf(T_($mymodel->getMessageFormat()), $mymodel->getValue());
This makes it impossible to apply xgettext to extract the strings, unless we put some dummy T_("my message %s to translate") call in the MyModel class.
So this leads to the more general question:
Do you apply translation in the backend classes, resp. where do you apply translation and how do you keep track of the strings which you have to translate?
(I am aware of Question: poedit workaround for dynamic gettext.)

My backend classes usually output english strings with parameters left out. Example
["Good job %s you have %i points", "Paul", 10]
Then the key for the translation is the English string (since I don't really like message codes).

Translation is for me totally a View issue except for clearly defined business reasons, like having to store displayed messages as shown. The latter could e.g. happen if you want to store a sent invoice as delivered to the client.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How to get all allowed languages for Wikidata - mediawiki

User hoo on IRC channel #wikidata found this solution: Get the JSON payload at this address: https://www.wikidata.org/w/api.php?action=paraminfo&modules=wbsetlabel And extract modules[0].parameters[8].type There are indeed less languages in this list than all the UI languages for MediaWiki.

Related

Regex getting the tags from an <a href= ...> </a> and the likes

IBM Websphere scripting ID 'containment paths' explained (I.e. AdminConfig.getid(...) )

Formating Output in Information Design Tool

What data format is this?

Where to translate message strings - in the view or in the model?

Categories

Resources