Wiktionary: Get list of pages for a given word

I would like to get a list of pages that exist in Wiktionary for a given word.
The case: I am searching for the definition of the Georgian word მამა (meaning "dad"). There is no page for this word in the Georgian Wiktionary, so I would like to get a list of all its synonyms.
I have searched and made tests with this page:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=revisions&format=json&rvprop=ids&rvlimit=10&titles=Foo
Any idea?
Thanks for the help

There is no API for Wiktionary at the dictionary level, only at the wiki level, which is probably not much help. See this question for more details: How to retrieve Wiktionary word content?
So there is no API that will let you query for synonyms, or even query for Georgian-language entries or data.
You can fetch Wiktionary pages either in raw wikitext or in HTML, and you can download entire database dumps in an XML format, which you can try to parse from scratch.
One thing that will help in the case of Georgian is that it has a unique alphabet, so the vast majority of Wiktionary entries in the Georgian script will be for the Georgian language. (There are also a very small number of entries for words in Laz, Mingrelian, Old Georgian, and Svan, which also use the Georgian script.)
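Since the wiki-level API is all there is, fetching an entry's raw wikitext and parsing it yourself is the usual workaround. Here is a minimal sketch that builds the standard MediaWiki query-API URL for an entry's wikitext; the endpoint and parameters are the stock MediaWiki ones, and the network call itself is left commented out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed if you uncomment the fetch

def wikitext_url(title, site="en.wiktionary.org"):
    # Standard MediaWiki query API: latest revision content for a title
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
        "titles": title,
    }
    return f"https://{site}/w/api.php?{urlencode(params)}"

url = wikitext_url("მამა")
# data = json.load(urlopen(url))
# page = data["query"]["pages"][0]
# wikitext = page["revisions"][0]["slots"]["main"]["content"]
```

The returned wikitext still has to be split by language section (e.g. the ==Georgian== heading) before you can pull out anything like synonyms.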

Related

Find special characters in FileMaker with normal letters

I want to search for "Cole" in FileMaker. When I search for that string, I want to find entries like "Čole". When I use the internal search function of FileMaker, this entry does not show in the results.
It depends on the language you have selected for indexing the searched field. For example, if the selected language is Czech or Unicode, then you will get the behavior you describe. When the language is English or Default, you will get the behavior you expect.

Finding common phrases in rows that have dynamic content

I'm using MySQL, and I am trying to find common strings over a given character length within a series of highly dynamic messages. Each message may contain a common phrase, but the phrases are appended with reference codes or names that don't follow a specific format on either side of the string. For example, here are the kinds of common phrases I'm trying to scan for; they have dynamic content embedded as well, in different formats (https://screencast.com/t/rlABTWitQ).
The end result I am looking for is something akin to this (https://screencast.com/t/qXzrGNFuf)
Because of the highly variable formats of these messages, my attempts with SUBSTRING_INDEX and REGEXP (as far as my amateur familiarity with REGEXP has taken me) haven't produced anything.
SELECT LEFT("first_middle_last", CHAR_LENGTH("first_middle_last") - LOCATE('_', REVERSE("first_middle_last")));
I can't use something like this, as it just strips at one specific type of character. As you can see, the string formats are too variable for that.
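There is no SQL answer in this thread, but as an illustration outside SQL: Python's standard-library difflib can find the longest common substring between two messages, which is essentially the "common phrase" the question is after. The messages below are made-up examples, not the asker's data.

```python
import difflib

# Made-up example messages sharing a common phrase with dynamic parts
messages = [
    "Your order REF-1234 has shipped and is on its way",
    "Your order A-99 has shipped and is on its way to John",
]

def longest_common_substring(a: str, b: str) -> str:
    # find_longest_match returns the longest matching block between a and b
    m = difflib.SequenceMatcher(None, a, b).find_longest_match(
        0, len(a), 0, len(b)
    )
    return a[m.a : m.a + m.size]

common = longest_common_substring(messages[0], messages[1])
print(repr(common))
```

Applied pairwise to rows exported from MySQL, and filtered by a minimum length, this yields candidate common phrases despite the variable reference codes around them.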

TTL file format - I have no idea what this is

I have a file which has a structure, but I don't know what format it is, nor how to parse it. The file extension is .ttl, but I have never encountered this before.
Some lines from the file look like this:
<http://data.europa.eu/esco/label/790ff9ed-c43b-435c-b6b3-6a4a6e8e8326>
a skosxl:Label ;
skosxl:literalForm "gérer des opérations d’allègement"@fr .
<http://data.europa.eu/esco/label/98570af6-b237-4cdd-b555-98fe3de26ef8>
a skosxl:Label ;
esco:hasLabelRole <http://data.europa.eu/esco/label-role/neutral> , <http://data.europa.eu/esco/label-role/male> , <http://data.europa.eu/esco/label-role/female> ;
skosxl:literalForm "particleboard machine technician"@en .
<http://data.europa.eu/esco/label/aaac5531-fc8d-40d5-bfb8-fc9ba741ac21>
a skosxl:Label ;
esco:hasLabelRole "http://data.europa.eu/esco/label-role/female" , "http://data.europa.eu/esco/label-role/standard-female" ;
skosxl:literalForm "pracovnice denní péče o děti"@cs .
And it goes on like this for 400 more MB. Additional attributes are added for some, but not all, nodes.
It reminds me of some form of XML, but I don't have much experience working with different formats. It also looks like something that could be modelled as a graph.
Do you have any idea what data format it is, and how I could parse it in python?
Yes, @Phil is correct: that is Turtle syntax for storing RDF data.
I would suggest you import it into an RDF store of some sort rather than try to parse 400+ MB yourself. You can use GraphDB, Blazegraph, Virtuoso, and so on; a search for RDF stores will give many other options.
Then you can use SPARQL (which is to RDF stores roughly what SQL is to relational databases) to query the store from Python with RDFLib. Here is an example from RDFLib.
That looks like Turtle, a data description language for the semantic web.
The esco:hasLabelRole and skosxl:Label terms are defined in two different vocabularies built to share data (esco and skosxl; there should not be much problem finding these vocabularies with a search engine, assuming the data is on the semantic web). skosxl:literalForm could be thought of as the value in an XML tag.
The data represents an ontology as a set of subject/predicate/object triples, for example:
Subject : 10
Predicate : Name
Object : John
As for Python: read the data as a file, use the subjects as the keys of a dictionary, and put the values in a database; it's unclear what you want to do with the data.
Semantic data is open, incomplete, and can have an unusual, complex structure. The example above is very simple; the primer linked above may help.
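The "subject as dictionary key" idea from this answer can be sketched in a few lines; the triples below are invented for illustration, not taken from the ESCO data.

```python
from collections import defaultdict

# Invented triples in (subject, predicate, object) form
triples = [
    ("ex:label1", "rdf:type", "skosxl:Label"),
    ("ex:label1", "skosxl:literalForm", "particleboard machine technician"),
    ("ex:label2", "rdf:type", "skosxl:Label"),
]

# Group by subject: each subject maps predicates to lists of objects,
# which copes with nodes that have extra attributes and repeated predicates
by_subject = defaultdict(dict)
for s, p, o in triples:
    by_subject[s].setdefault(p, []).append(o)

print(dict(by_subject))
```

This is only workable for toy data; a real Turtle parser (or a store, as the other answer suggests) handles prefixes, multi-object lines, and escaping that a hand-rolled dictionary approach will not.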

What is the likely meaning of this character sequence? A&#C

I'm working on an application that imports data from a CSV file. I am told that the data in the CSV file comes from SAP, which I am totally unfamiliar with.
My client indicates that there is an issue. One column of data in the CSV file contains postal addresses. Sometimes, the system doesn't see a valid address. Here is a slightly fictionalized example:
1234 MAIN ST A&#C HOUSTON
As you can see, there is a street number, a street name, and a city, all in capital letters. There is no state or zip code specified. In the CSV file, all addresses are assumed to be in the same state.
Normally, when there is text between the street name and the city, it is an apartment number or letter. In the above example, we get errors when we try to use the address with other services, such as Google geolocation. One suggested fix is to simply strip out these special characters, but I believe there must be a better way.
I want to know what this A&#C means. It looks like some sort of escape sequence, but it isn't in a format I'm familiar with. Please tell me what this strange character sequence means.
I'm not totally sure, but I doubt there's a "canonical" escape sequence that looks like this. In the ABAP environment, # is used to replace non-printable characters. It might be that the data was improperly sanitized when it was imported into the SAP system in the first place, and when writing to the output file, some non-printable character was replaced by #. Another explanation might be that one of the fields contained a non-ASCII Unicode character and the export program failed to convert it to the selected target codepage. It's hard to tell without examining the actual source dataset. Of course, it might also be some programming error or a weird custom field separator...
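If the # really is ABAP's placeholder for an unprintable character, one pragmatic cleanup before geocoding is to drop the whole token containing it rather than stripping individual characters. This is only a sketch of that idea, using the fictionalized address from the question.

```python
import re

# Fictionalized address from the question
address = "1234 MAIN ST A&#C HOUSTON"

# Remove any whitespace-delimited token that contains '#'
cleaned = re.sub(r"\s*\S*#\S*", "", address)
# Collapse any double spaces left behind
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
print(cleaned)
```

Dropping the token loses the apartment designation, of course, so it's a fallback for geocoding only; recovering the original character would require going back to the source dataset, as the answer says.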

QR code limits

I have to generate codes with custom fields: the id of the field + the name of the field + the values of the field.
How much data can I encode inside a QR code? I need to know how many fields/values I can insert.
Should I use XML, JSON, or CSV? Which is the most generic and efficient?
XML/JSON will not qualify for a QR code's alphanumeric mode, since it will include lower-case letters. You'll have to use byte mode. The maximum is 2,953 bytes. But the practical limit is far less -- perhaps a few hundred characters.
It is far better to encode a hyperlink to data if you can.
As Terence says, no reader will do anything with XML/JSON except show it. You need a custom reader anyway to do something useful with that data (which suggests this is not a good use case for QR codes). But if you're making your own reader, you can use gzip compression to make the payload much smaller; your reader would know to unzip it.
You might get away with something workable, but this is not a good approach in general.
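The gzip idea can be sketched in a few lines; the field names and values below are invented, and 2,953 bytes is the byte-mode capacity mentioned above.

```python
import gzip
import json

# Invented custom fields: id + name + values per field
payload = json.dumps([
    {"id": 1, "name": "color", "values": ["red", "green", "blue"]},
    {"id": 2, "name": "size", "values": ["S", "M", "L"]},
]).encode("utf-8")

compressed = gzip.compress(payload)

# A QR code in byte mode holds at most 2,953 bytes
fits = len(compressed) <= 2953
print(len(payload), len(compressed), fits)
```

A custom reader would run gzip.decompress on the scanned bytes before parsing the JSON; a stock phone scanner would show only binary garbage, which is the point the answers above are making.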
The maximum number of alphanumeric characters you can have is 4,296, although this will require the lowest level of error correction and will be very hard to scan.
JSON is generally more efficient at data storage than XML.
However, you will need to write your own app to scan the code - I don't know of any which will process raw JSON or XML. All the scanners will show you the text, though.