Custom Translator - For Japanese, I need to use zenkaku ［］ and （） rather than hankaku [] and () - microsoft-translator

We enclose UI strings, such as button names, in zenkaku (full-width) brackets in Japanese. I trained my EN-to-JA project with our TM, in which all UI strings use the Japanese-style "［］". After training, the translation quality improved to match our style and terminology, but the brackets in the output are still English style (half-width).
For example:
Source 1: Click Users.
TM imported for training: ［ユーザー］をクリックします。 == Japanese style "［］"
MT result: [ユーザー]をクリックします。 == English style "[]"
Source 2: Thursday, March 17, 2022
TM imported for training: 2022年3月17日（木） == Japanese style "（）"
MT result: 2022年3月17日(木) == English style "()"
How can I tell my custom MT to use Japanese-style brackets and parentheses?

Does text from a rich text editor not inherit styles when rendered in an HTML document?

To be clear, I used an RTE in the backend to store a description. Later, I receive that description, along with other details, in an API response. Up to this point the styles are intact, for example bold headings. But when I render the description in the HTML document using the innerHTML property, all I see is unformatted text. The headings are no longer bold.
Here's part of the response:
</p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">Features</span> \n </p>\r\n\n\r\n<p>Gives even skin tone, smoother complexion and sculpted facial features.
The font-weight: bold style is clearly present here, but the rendered version does not show it.
Here's the full response:
{
"cart_count":2,
"images":[
],
"success":true,
"message":"Sucessfully",
"data":{
"product_id":1,
"name":"Dr G Butterfly Gua Sha",
"category_id":1,
"category":"Skin Tool",
"description":"<p>Dr G Butterfly Rose Quartz Gua Sha is a beauty and wellness tool designed to heal and enhance natural beauty. It lifts and sculpts your face, drains the lymph node, which reduces puffy eyes and face. By scraping with repeated strokes on the surface of the skin, this tool helps stimulate muscles and increases the blood flow. \n </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">Features</span> \n </p>\r\n\n\r\n<p>Gives even skin tone, smoother complexion and sculpted facial features. Reduces the signs of ageing and gives younger-looking skin. Increases lymphatic function. Stimulates blood circulation. Improves the appearance of dark circles and reduces under-eye puffiness. </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">How To Use \n</span></p>\r\n\n\r\n<p>Apply Dr G oil or Dr G gel as per your skin type covering the face and neck. </p>\r\n<p>Hold the butterfly gua sha tool firmly and sweep across gently up and out, starting with the neck, cheeks, jawline, chin, around the mouth, and slowly glide under the eyes, across your eyebrows and from your forehead up to your hairline. </p>\r\n<p>You can sweep it 3-5 times per area. </p>\r\n<p>Recommended at least a few times a week for best results. </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">About Dr G</span> \n </p>\r\n\n\r\n<p>Dr G offers luxury skincare products, backed by over a decade of dermatology expertise and on-ground practice. Made for Indian weather conditions, with variants for different skin types, including sensitive skin, and to address specific skin concerns - these innovative products are a perfect balance of nature and science. Drawing from ancient Ayurveda and combining natural extracts with skin-safe science, Dr G's range of products bridge modern skincare with holistic science.</p>",
"short_description":"Sculpts, Tones, Reduces Puffiness, Lifts",
"max_quantity":500,
"status":1,
"in_stock":1,
"measurement":[
{
"is_cart":true,
"ordered_quantity":2,
"is_wish":false,
"discounted_price":1400.0,
"weight":"200 Gram",
"price":1400.0,
"prod_id":1,
"percentage":100,
"max_quantity":500
}
]
}
}
The HTML in your response isn't valid. You can easily test this by copying the HTML string from your response into a text file with an .html extension (index.html, for example) and opening it in your browser. Alternatively, use a validator like this one: https://www.freeformatter.com/html-validator.html
Let's pick one part of the HTML string that has wrong characters and gets displayed unformatted:
<span style=\"font-weight: bold;\">Features</span> \n
If you remove the backslashes (\) here, this piece gets rendered correctly:
<span style="font-weight: bold;">Features</span> \n
I would recommend encoding the HTML before sending it to the frontend. You could use Base64, which is easy to encode on the backend and to decode on the frontend before displaying it.
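The Base64 round trip suggested above can be sketched as follows. This is a minimal illustration in Python, not the asker's actual backend; the helper names encode_html and decode_html are hypothetical, and the frontend would need an equivalent decoding step in its own language.

```python
import base64


def encode_html(markup: str) -> str:
    """Encode an HTML string as Base64 text (hypothetical backend helper)."""
    return base64.b64encode(markup.encode("utf-8")).decode("ascii")


def decode_html(payload: str) -> str:
    """Decode the Base64 text back into the original HTML string."""
    return base64.b64decode(payload).decode("utf-8")


snippet = '<span style="font-weight: bold;">Features</span>'
payload = encode_html(snippet)

# The payload uses only the Base64 alphabet, so it contains no quotes
# that would need escaping in transit.
print(payload)
print(decode_html(payload) == snippet)
```

The trade-off is that the payload grows by roughly a third and is no longer human-readable in the raw response.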
If these "wrong" characters are already there when you receive the HTML on your backend, you have to parse it first to clean it.
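One way such a parsing step might look, assuming the backslashes are JSON string escapes (an assumption; the full response in the question looks like JSON): a standard JSON parser turns \" back into a plain quote and \n into a newline. The raw_body value below is a shortened, made-up stand-in for the real response.

```python
import json

# Shortened stand-in for the raw response body (assumption: it is JSON text).
# The raw string keeps the backslashes exactly as they appear on the wire.
raw_body = r'{"description": "<p><span style=\"font-weight: bold;\">Features</span> \n </p>"}'

data = json.loads(raw_body)
description = data["description"]

# After parsing, the escaped quotes are ordinary quotes again.
print(description)
```

If the string still contains backslashes after proper JSON parsing, it was probably escaped twice somewhere upstream, and that is the place to fix it.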

python pylatex line spacing, units and math equations in strings

I have a text block as a string that contains some SI units and equations. How can I for example use superscript numbers (e.g. 10^-10 m^2) and math equations in strings? Greek letters and e.g. the ± symbol work fine.
from pylatex import Document, Section, Subsection, Command, Figure
from pylatex.utils import italic, bold, NoEscape

doc = Document('Test', geometry_options={"head": "2cm", "margin": "2cm", "bottom": "2cm"})
with doc.create(Section('Header 1')):
    doc.append('The average area is less than 10m^2 (±0.5m^2).')
doc.generate_pdf(clean_tex=False, compiler='pdflatex')
I also wonder how I can define the line spacing (linespread) in pylatex.

extract text from html tags using regex

My HTML text looks like this. I want to extract only the PLAIN TEXT from it using REGEX in Python (NOT USING HTML PARSERS).
<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>
How can I find the exact regex to get the plain text?
You might be better off using a parser here:
import html
import xml.etree.ElementTree as ET

# the HTML fragment to parse
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
# decode HTML entities, then construct the DOM
root = ET.fromstring(html.unescape(string))
# search it: print the text of every direct child of the root element
for p in root.findall("*"):
    print(p.text)
This yields
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
You will probably want to change the XPath expression for your real documents, so have a look at the possibilities.
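ElementTree only accepts well-formed XML, which real-world HTML often is not. As an alternative sketch, the standard library's html.parser is more tolerant. The TextExtractor class and extract_text function below are hypothetical names for illustration, and the sample markup is a shortened version of the fragment from the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


sample = """<p style="text-align: justify;"><span style="font-size: small;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
print(extract_text(sample))
```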
Addendum:
It is possible to use a regular expression here, but this approach is really error-prone and not advisable:
import re
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')
print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']
The idea is to look for an uppercase letter and match word characters, whitespaces and commas up to a dot. See a demo on regex101.com.
You can do this in JavaScript with a simple selector method and then retrieve the .innerHTML property.
// Select the elements with the class whose HTML you want to read
let div = document.getElementsByClassName('text-div');
// Take the first element of the HTMLCollection returned by the selector method and get its inner HTML
let text = div[0].innerHTML;
This selects the element whose HTML you want to retrieve and then pulls its inner HTML, that is, what sits between the element's opening and closing tags. Note that innerHTML still includes the markup of any nested elements; textContent returns only the text.
Regex is not necessary for this. You would have to implement the regex in JS or on some back end anyway, and as long as you can insert a JS script into your project, you can read the inner HTML directly.
If you're scraping data, the library you use, in whatever language, will most likely have selector methods and ways to retrieve the text without regex.

how to determine past perfect tense from POS tags

The past perfect form of 'I love.' is 'I had loved.' I am trying to identify such past perfects from POS tags (using NLTK, spaCy, or Stanford CoreNLP). Which POS tag should I look for? Or should I instead look for the past form of the word "have"? Would that be exhaustive?
I PRP PRON
had VBD VERB
loved VBN VERB
. . PUNCT
The complete POS tag list used by CoreNLP (and I believe all the other libraries trained on the same data) is available at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
I think your best bet is to let the library annotate a list of sentences in which you want to identify a specific verbal form, and then manually derive a set of rules (e.g., sequences of POS tags) that match what you need. For example, you could look for VBD ("I loved"), VBD VBN ("I had loved"), VBD VBG ("I was loving somebody"), etc.
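A rule of this kind could be sketched as below. The sketch works on sentences that are already tagged, given as (token, Penn Treebank tag) pairs, so it does not depend on a particular library. The function name find_past_perfect and the choice to skip adverbs between the auxiliary and the participle are assumptions for illustration, and the rule is not claimed to be exhaustive.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (token, Penn Treebank tag)

ADVERB_TAGS = {"RB", "RBR", "RBS"}


def find_past_perfect(tagged: List[Tagged]) -> List[Tuple[int, int]]:
    """Return (auxiliary_index, participle_index) pairs for 'had' + VBN."""
    hits = []
    for i, (word, tag) in enumerate(tagged):
        # The auxiliary must be the past form of "have".
        if tag != "VBD" or word.lower() != "had":
            continue
        j = i + 1
        # Allow adverbs in between, as in "had never loved".
        while j < len(tagged) and tagged[j][1] in ADVERB_TAGS:
            j += 1
        if j < len(tagged) and tagged[j][1] == "VBN":
            hits.append((i, j))
    return hits


sentence = [("I", "PRP"), ("had", "VBD"), ("loved", "VBN"), (".", ".")]
print(find_past_perfect(sentence))
```

Checking the word as well as the tag matters here: VBD alone also matches simple past forms such as "loved" in "I loved".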

Can OpenNLP use HTML tags as part of the training?

I'm creating a training set for the TokenNameFinder from HTML documents converted into plain text, but my precision is low, so I want to use the HTML tags as part of the training, for example words in bold and sentences with different margin sizes.
Will OpenNLP accept and use those tags to create rules?
Is there another way to make use of those tags to improve precision?
It is not clear what you mean by using HTML tags to train OpenNLP.
The training input is an annotated, tokenized sentence:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of <START:company> Elsevier N.V. <END> , the Dutch publishing group .
To train an OpenNLP model with the standard tooling, your annotations need to follow this convention. Note that the annotations do not follow the XML standard.
You can embed the annotations directly in the HTML documents you will use for training. The extra context might even help the classifier, but I've never read any experimental results about it.
Keep in mind that the training data should be tokenized. That means you should include white space between words and punctuation, as well as between text elements and HTML tags:
<p> <i> Mr . <START:person> Vinken <END> </i> is chairman of <b> <START:company> Elsevier N.V. <END> </b> , the Dutch publishing group .
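The whitespace requirement could be enforced with a small preprocessing step like the one below. This is only a sketch: pad_tags is a hypothetical helper, it treats anything between < and > as one token (HTML tags and OpenNLP markers alike), and it does not separate punctuation from words, which a real tokenizer would still have to do.

```python
import re

# Any <...> run: an HTML tag or an OpenNLP marker such as <START:person> or <END>.
TAG = re.compile(r"\s*(<[^>]+>)\s*")


def pad_tags(line: str) -> str:
    """Put exactly one space around every <...> token and collapse extra whitespace."""
    padded = TAG.sub(r" \1 ", line)
    return " ".join(padded.split())


raw = ("<p><i>Mr . <START:person> Vinken <END></i> is chairman of "
       "<b><START:company> Elsevier N.V. <END></b>, the Dutch publishing group .")
print(pad_tags(raw))
```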