I am using a list of keywords to put into a meta tag for a localized Japanese website (based on an English one). The site is for the Japanese regional branch of a client, so the native speakers from the branch have translated the keywords themselves for us. When we sent the list to them, we had it organized with commas, as one would typically do for keywords:
foo, bar, baz
However, it seems that Japanese (a language in which I have essentially no expertise) has its own comma character, and the translators used it when translating the list of keywords. So something akin to the above would come out as something like the following (taken from Google Translate purely as an example, not for translation accuracy; for the curious, I used fu instead of foo):
フー、バー、バズ
This uses the Japanese comma character, 、, instead of a normal Latin comma, ,.
Will this affect how the keywords are used? Is 、 preferred for separating the keyword tokens for Japanese-targeted SEO, or is ,?
I searched Google for some hint about what to do, but the pages I found on localized keyword text either dealt with languages that use a Latin-based alphabet (like French) or, in the case of a couple of Japanese ones, did not show any examples that might have suggested which comma character to use (they really only talked about not using literal translations, which we've already avoided by having native speakers translate the content). The one place I found with an essentially duplicate question was the forum question posted here, but it has no answers (and isn't likely to get any, since it's 1.5 years old...).
Note: I've seen talk about search engines making little use of keywords. The client wants keywords, though, so we will be doing keywords, meaning there's little use in bringing up this point in comments/answers.
The keywords have to be separated by the , character, no matter which language the keywords are in. For keywords it is defined that the value "must be a set of comma-separated tokens", which is defined as:
[…] a string containing zero or more tokens each separated from the next by a single "," (U+002C) character […]
Note that this , is not part of the keywords. It's like a reserved character. If a keyword itself should contain a ,, it would have to be encoded (for example as &#44;).
If you hand over keywords for translation, you shouldn't include the separator character (unless it is part of the keyword itself).
So better send the translator a list like …
foo
bar
baz
… instead of "foo, bar, baz".
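If a translated list has already come back with Japanese separators, it can be normalized before it goes into the meta tag. The following is only a minimal sketch in Python: the function names are made up for illustration, and it assumes that 、 (U+3001) and the fullwidth comma ， (U+FF0C) were used purely as separators, never as part of a keyword.

```python
import re
from html import escape

# Characters treated as separators in the incoming list (an assumption):
# "," (U+002C), "、" (U+3001) and "，" (U+FF0C).
SEPARATORS = re.compile("[,\u3001\uFF0C]")


def normalize_keywords(raw):
    """Split on any of the separators and re-join with "," (U+002C)."""
    tokens = [token.strip() for token in SEPARATORS.split(raw)]
    # Drop empty tokens left behind by doubled or trailing separators.
    return ", ".join(token for token in tokens if token)


def keywords_meta(raw):
    """Build a keywords meta element from a raw, possibly mixed list."""
    content = escape(normalize_keywords(raw), quote=True)
    return f'<meta name="keywords" content="{content}">'


if __name__ == "__main__":
    print(keywords_meta("フー、バー、バズ"))
```

The space after each comma is optional; it is only there for readability, since leading and trailing white space around a token is not part of the token.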
Related
Is there a Unicode character that is specifically not meant to be used normally, but instead only functions as a CSV separator? I know CSV stands for comma-separated values, but I use it here since it is the most common term for the concept I'm trying to ask about. Basically, I would like to know whether there is a code point that was added to Unicode only for the purpose of being used as a separator character between records in a text file.
Yes, 0x1C … 0x1F. They were specifically created for what you intend (and then standardised into ANSI_X3.4-1968 and later into Unicode).
Summary from English Wikipedia:
Can be used as delimiters to mark fields of data structures. If used for hierarchical levels, US is the lowest level (dividing plain-text data items), while RS, GS, and FS are of increasing level to divide groups made up of items of the level beneath it.
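As a rough illustration of the two lowest levels, here is a small Python sketch that uses US (0x1F) between fields and RS (0x1E) between records. The helper names are invented for this example, and it assumes the data itself never contains those two control characters.

```python
US = "\x1f"  # Unit Separator: divides the fields inside one record
RS = "\x1e"  # Record Separator: divides one record from the next


def encode(records):
    """Join a list of records (each a list of strings) into one string."""
    for fields in records:
        for field in fields:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(fields) for fields in records)


def decode(text):
    """Inverse of encode(); an empty string decodes to no records."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]


if __name__ == "__main__":
    rows = [["foo, bar", "baz"], ["フー、バー", "バズ"]]
    assert decode(encode(rows)) == rows
```

Because commas (Latin or Japanese) are ordinary data here, no quoting or escaping rules are needed for them.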
I am using Delphi 2009 to build up a string variable containing a simple JSON string from values I get from a database. This results in a string of the form below (although the real string could be much longer)
{"alice#example.com": {"first":"Alice", "id": 2},"bob#example.com": {"first":"Bob", "id":1},"cath#example.com": {"first":"Cath", "id":3},"derek#example.com": {"first":"Derek", "id": 4}}
This string gets sent as a header called Recipient-Variables in an email to a company.
The instructions I have for sending the emails to the company say
Note The value of the “Recipient-Variables” header should be
valid JSON string, otherwise we won’t be able to parse it. If
your “Recipient-Variables” header exceeds 998 characters,
you should use folding to spread the variables over multiple lines.
I have looked at these SO posts to try to understand what is meant by folding but cannot really understand the replies as they often seem to be referencing a particular editor.
notepad++ user defined regions with folding
Folding JSON at specific points
Can you customize code folding?
Please can somebody use my example to show me what syntax I should use or what characters I need to insert in my string to comply with the instruction and fold my JSON string, say in between the records for bob and cath?
(BTW I understand what is meant by folding when viewing JSON or other code in an editor but I don't understand how a simple JSON string needs to be formatted in order for the folding to happen at a specific place)
I finally found the answer myself so posting here to help others, just in case.
The answer is given in RFC 2822, published in 2001 by the Network Working Group (P. Resnick, Editor)
https://www.rfc-editor.org/rfc/rfc2822#page-11
The document ...
specifies a syntax for text messages that are sent between computer
users, within the framework of "electronic mail" messages.
...and describes how emails are constructed and, in particular, how to deal with long headers.
Section 2.2.3 covers long header fields (lines longer than 998 characters) and says such headers need to be folded by inserting a CRLF followed immediately by some white space, e.g. a space character.
If the receiving server follows the same standard, it will strip out the CRLF before parsing the header, and the parsing will itself include stripping the space characters.
Though structured field bodies are defined in such a way that
folding can take place between many of the lexical tokens (and even
within some of the lexical tokens), folding SHOULD be limited to
placing the CRLF at higher-level syntactic breaks. For instance, if
a field body is defined as comma-separated values, it is recommended
that folding occur after the comma separating the structured items in
preference to other places where the field could be folded, even if
it is allowed elsewhere.
Later, in section 3.2.3 it explains how comments may be combined with folding white space.
So it seems that, when generating the string through code, it is necessary to fold long header lines by detecting a higher-level syntactic boundary, such as a comma, that is less than 998 characters from the start of the header (or from the last fold point) and inserting the three bytes 0x0D 0x0A 0x20 (CR, LF, space) there. This could be done after the header has been constructed or on the fly as it is generated.
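To make the idea concrete, here is a minimal sketch of that folding logic. It is written in Python rather than Delphi, purely for illustration, and the function names are my own. It assumes folds should only go after commas that lie outside JSON strings, so the inserted space can never end up inside a value.

```python
import json
import re

FOLD = "\r\n "  # CR, LF and one space (bytes 0x0D 0x0A 0x20)


def fold_points(value):
    """Yield the index just after every comma that is outside a JSON string."""
    in_string = escaped = False
    for i, ch in enumerate(value):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == ",":
            yield i + 1


def fold_header(name, value, limit=998):
    """Return "name: value", folded so no line exceeds `limit` characters."""
    line = f"{name}: {value}"
    offset = len(name) + 2
    points = [p + offset for p in fold_points(value)]
    parts, start, indent = [], 0, 0  # indent is 1 on continuation lines
    while len(line) - start + indent > limit:
        fits = [p for p in points if p > start and p - start + indent <= limit]
        if not fits:
            raise ValueError("no fold point within the line limit")
        parts.append(line[start:fits[-1]])
        start, indent = fits[-1], 1
    parts.append(line[start:])
    return FOLD.join(parts)


def unfold(header):
    """Undo folding: remove every CRLF that is followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", header)


if __name__ == "__main__":
    value = '{"bob#example.com": {"first":"Bob", "id":1},"cath#example.com": {"first":"Cath", "id":3}}'
    folded = fold_header("Recipient-Variables", value, limit=60)
    body = unfold(folded)[len("Recipient-Variables: "):]
    assert json.loads(body) == json.loads(value)
    print(folded)
```

The small limit in the demo is only there to force a fold on a short string; for real mail the default of 998 applies. After unfolding, the extra space sits between JSON tokens, where a JSON parser ignores it.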
As a follow up, I now notice that the Overbytes ICS component I am using (TSslSmtpCli) has a boolean property FoldHeaders so this might do all the work for me.
I'm using MySQL, and I am trying to find common strings over a given character length within a series of highly dynamic messages. Each message may contain a common phrase, but the phrase will have reference codes or names appended on either side, and these don't follow a specific format. For example, here are the kinds of common phrases I'm trying to scan for, with dynamic content embedded in different formats (https://screencast.com/t/rlABTWitQ)
The end result I am looking for is something akin to this (https://screencast.com/t/qXzrGNFuf)
Because the formats of these messages vary so much, I can't get anywhere with substring_index or regexp (as far as my amateur familiarity with REGEXP has taken me).
SELECT LEFT("first_middle_last", CHAR_LENGTH("first_middle_last") - LOCATE('_', REVERSE("first_middle_last")));
I can't use something like this, as it only strips on one specific character. As you can see, the strings vary too much in format.
I don't understand what it really means when someone refers to data types in HTML5.
I googled it, and found http://www.w3.org/TR/html-markup/datatypes.html
It says,
data types (microsyntaxes) that are referenced by attribute
descriptions
But now I'm even more confused about what is meant by microsyntaxes.
Wikipedia says:
[...] the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language. This applies both to programming languages, where the document represents source code, and markup languages, where the document represents data.
So in order for an HTML document to be read and understood by a browser, it should adhere to the syntax of HTML: That is, it should follow the rules that define the language. A microsyntax is essentially a very small syntax, applying to a very specific thing.
A data type is simply a type of data. The HTML specifications refer to various data types (e.g. String, Token, Integer, Date, Set of comma-separated strings, etc) and the document you linked describes exactly what those things are. It does this by defining a set of rules, or a microsyntax.
E.g. the microsyntax which defines a Set of comma-separated strings is:
Zero or more strings that are themselves each zero or more characters, each optionally with leading and/or trailing space characters, and each separated from the next by a single "," (comma) character. Each string itself must not begin or end with any space characters, and each string itself must not contain any "," (comma) characters.
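To show how small such a microsyntax is, here is a rough Python sketch of a reader for a set of comma-separated strings. The function name is made up for this example, and the set of space characters (space, tab, line feed, form feed, carriage return) is my reading of the HTML definition rather than something quoted above.

```python
# HTML "space characters" (assumed set): space, tab, LF, FF, CR.
HTML_SPACE = " \t\n\f\r"


def parse_comma_separated(value):
    """Split on "," (U+002C) and trim space characters around each string."""
    return [item.strip(HTML_SPACE) for item in value.split(",")]


if __name__ == "__main__":
    print(parse_comma_separated(" foo , bar,baz "))
    # Only U+002C separates items; other comma-like characters are data.
    print(parse_comma_separated("フー、バー"))
```

After splitting and trimming, each resulting string automatically satisfies the two "must not" rules in the definition: it has no leading or trailing space characters and contains no commas.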
I need to test how Box search works in my application, and for this I need more information about the search pattern. I see that search terms are matched against both file titles and file content.
Search behaves differently when my file names contain special characters. Will search work when file names contain special characters?
Following is the query I am using
boxSearch = client.getSearchManager().search(searchFileName, boxDefaultRequestObject);
Can you share the pattern used during search, which characters are allowed, and which character combinations return results?
Here are some resources on search:
https://support.box.com/hc/en-us/articles/200519888-How-do-I-search-for-files-and-folders-in-Box-
Box's search matches folder/file names and content, and it also accepts Boolean operators. Just don't use mixed case (aNd is NOT okay, while AND or and is okay).
Box also accepts special characters in uploads and search. See the description here, as this was a fairly recent product update that came in mid-2013.
Additional special character support – Box will add support for more types of special characters across the Box website, desktop and mobile apps. Once the change is live, Box products will support almost all printable characters (except / \ or empty file names; also will not support leading or trailing spaces on files and folders).