How to exclude text from translation using /TranslateArray - microsoft-translator

We're trying to use the Microsoft Translator API to batch translate text. Each piece of text may contain parts we don't want translated (normally social network @handles or hashtags). We've tried wrapping these parts of the text as shown in the documentation:
<div class="notranslate">This will not be translated.</div>
This works fine when passing text to the /Translate single API. However, when we pass multiple pieces of text to the /TranslateArray API, we can't work out the correct syntax. Any text item which contains the notranslate div is not returned in the response.
Here's the body we're trying to use:
curl -i -X POST \
-H "Ocp-Apim-Subscription-Key:******" \
-H "Content-Type:text/html" \
-d \
'<TranslateArrayRequest>
<AppId />
<From>en</From>
<Options>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/html</ContentType>
</Options>
<Texts>
<div xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">With great power comes great <div class="notranslate">#responsibility</div> </div>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">Hello World</string>
</Texts>
<To>fr</To>
</TranslateArrayRequest>' \
'https://api.microsofttranslator.com/V2/Http.svc/TranslateArray'
Any ideas on the correct format to pull this off?

The section as posted doesn't match the schema for the request: the first <div> needs to be a <string> element.
<Texts>
<div xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">With great power comes great <div class="notranslate">#responsibility</div> </div>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">Hello World</string>
</Texts>
Try:
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">With great power comes great <div class="notranslate">#responsibility</div> </string>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">Hello World</string>
</Texts>
If this doesn't work, then it's possible that because the request is XML, you may also need to XML-escape the markup within the string element:
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
With great power comes great &lt;div class="notranslate"&gt;#responsibility&lt;/div&gt;
</string>
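The XML-escaping step suggested above can be sketched in Python. This is a minimal illustration, not the service's official client: the helper name `build_texts` is hypothetical, and it only builds the `<Texts>` fragment of the request, using the standard library's `xml.sax.saxutils.escape`. The round-trip at the end shows that an XML parser hands the original HTML back as plain text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

ARRAYS_NS = "http://schemas.microsoft.com/2003/10/Serialization/Arrays"


def build_texts(items):
    """Return a <Texts> fragment with each HTML snippet XML-escaped (hypothetical helper)."""
    strings = "".join(
        f'<string xmlns="{ARRAYS_NS}">{escape(item)}</string>' for item in items
    )
    return f"<Texts>{strings}</Texts>"


texts_xml = build_texts([
    'With great power comes great <div class="notranslate">#responsibility</div>',
    "Hello World",
])
print(texts_xml)

# Round-trip check: parsing the fragment recovers the unescaped HTML as text.
parsed = ET.fromstring(texts_xml)
first = parsed[0].text
```

Escaping through a library call rather than by hand also covers any `&` characters that appear in the source text.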


How do I do a regex only the specific selection between two tags?

There have been dozens of similar questions asked, but my question is about a specific selection between the tags. I don't want the entire selection from <a href to </a>; I only need to target the "> between those tags.
I am trying to convert a href links into wikilinks. For example, if the sample text has:
Light is light.
<div class="reasons">
I wanted to edit the file itself and change from <a href="link.html">Link</a> into [[link.html|Link]]. The basic idea that I have right now uses 3 sed edits as follows:
<a href="link.html">Link</a> -> <a href="link.html|Link</a>
<a href="link.html|Link</a> -> [[link.html|Link</a>
[[link.html|Link</a> -> [[link.html|Link]]
My problem lies with the first step; I can't find the regex that only targets "> between <a href and </a>.
I understand that the basic idea would be to match the target between a lookahead and a lookbehind. But trying it on regexr failed. I also tried using a conditional regex. I can't find the syntax I used, but it either returned an error or it worked but also captured the div class.
Edit: I'm on Ubuntu and using a bash script using sed to do the text manipulation.
The basic idea that I have right now uses 3 sed edits
Assuming you've also read the answers underneath those dozens of similar questions, you could've known that it's a bad idea to parse HTML with sed (regex).
With an HTML-parser like xidel this would be as simple as:
$ xidel -s '<a href="link.html">Link</a>' -e 'concat("[[",//a/@href,"|",//a,"]]")'
$ xidel -s '<a href="link.html">Link</a>' -e '"[["||//a/@href||"|"||//a||"]]"'
$ xidel -s '<a href="link.html">Link</a>' -e 'x"[[{//a/@href}|{//a}]]"'
[[link.html|Link]]
Three different queries to concatenate strings. The 1st query uses the XPath concat() function, the 2nd query uses the XPath || operator and the 3rd uses xidel's extended string syntax.
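For anyone without xidel at hand, the same "use a parser, not a regex" idea can be sketched with Python's standard-library `html.parser`. This is only an illustrative sketch: the class `WikiLinker` and the function `to_wikilinks` are hypothetical names, and comments and doctype declarations in the input are dropped rather than preserved.

```python
from html.parser import HTMLParser


class WikiLinker(HTMLParser):
    """Rewrite <a href="X">Y</a> as [[X|Y]] and pass other markup through."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.out.append(f"[[{href}|")
            self.in_link = True
        else:
            # Re-emit the tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.out.append("]]")
            self.in_link = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def to_wikilinks(html):
    parser = WikiLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(to_wikilinks('<a href="link.html">Link</a>'))
```

Because only `<a>` elements with an `href` are rewritten, surrounding markup such as `<div class="reasons">` passes through untouched, which is the case the sed approach struggled with.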

Xidel extract data inside the tag -- raw output

Pleased to be member of StackOverflow, a long time lurker in here.
I need to parse text between two tags, so far I've found a wonderful tool called Xidel
I need to parse text in between
<div class="description">
Text. <tag>Also tags.</tag> More text.
</div>
However, said text can include HTML tags in it, and I want them to be printed out in raw format. So using a command like:
xidel --xquery '//div[@class="description"]' file.html
Gets me:
Text. Also tags. More text.
And I need it to be exactly as it is, so:
Text. <tag>Also tags.</tag> More text.
How can I achieve this?
Regards, R
Can be done in a couple of ways with Xidel, which is why I love it so much.
HTML-templating:
xidel -s file.html -e "<div class='description'>{inner-html()}</div>"
XPath:
xidel -s file.html -e "//div[@class='description']/inner-html()"
CSS:
xidel -s file.html -e "inner-html(css('div.description'))"
BTW, on Linux: swap the double quotes for single and vice versa.
You can show the tags by adding the --output-format=xml option.
xidel --xquery '//div[@class="description"]' --output-format=xml file.html

How to parse out text from a span using curl in bash?

I am using Geektools (a bash desktop widget app for Mac) to try to display text from a website. I have been trying to curl the site and then grep the text, but I am finding it more difficult than I imagined. Just looking for some help.
HTML:
<div is>
<div class="page-status status-none">
<span class="status font-large">
All Systems Operational
</span>
<span class="last-updated-stamp font-small"></span>
</div>
Above is the span that is displayed when I cURL the site. I just need to display the text "All Systems Operational".
Thanks in advance for your assistance.
getting in the habit of using regular expressions with html is a slippery slope; it's not the right tool for the job, as mentioned here; I'd suggest either
hxselect from the html-xml-utils package
xidel
both of which let you use css3 selectors to target content in your input
for example:
curl -s $website_url | hxselect '.status.font-large'
All Systems Operational
You could pipe the output of curl into gawk. This gawk command seems to do the trick (I'm using Cygwin's gawk on Windows):
gawk "/status font-large/ {wantedLine=NR+1} {if (NR==wantedLine) {print $0}}"
I was able to parse out the status that I was looking for by using Nokogiri.
curl -s $Website_URL | nokogiri -e 'puts $_.at_css("span.status").text'
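If neither hxselect nor Nokogiri is installed, the same extraction can be sketched with Python's standard-library `html.parser`. This is a rough sketch under a few assumptions: `StatusText` and `status_text` are hypothetical names, only the first matching span is read, and a span nested inside the status span would end the capture early.

```python
from html.parser import HTMLParser


class StatusText(HTMLParser):
    """Collect the text of the first <span> whose class list contains 'status'."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "status" in classes and not self.done:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "span" and self.capturing:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def status_text(html):
    parser = StatusText()
    parser.feed(html)
    parser.close()
    # Collapse the newlines and indentation around the text.
    return " ".join("".join(parser.chunks).split())


sample = """<div is>
<div class="page-status status-none">
<span class="status font-large">
All Systems Operational
</span>
<span class="last-updated-stamp font-small"></span>
</div>"""

print(status_text(sample))
```

In a widget this could be fed from curl by reading the page from standard input (`sys.stdin.read()`) instead of the `sample` string.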

Find specific tags in a HTML file

I have some html files and want to extract the contents between some tags:
The title of the page
some tagged content here.
<p>A paragraph comes here</p>
<p>A paragraph comes here</p><span class="more-about">Some text here</span><p class="en-cpy">Copyright © 2012 </p>
I just want these tags: head, p
but as can be seen in the second paragraph, the last tag is <p class="en-cpy">, which starts with p but is not one of my desired tags, and I don't want its content.
I used following script for extracting my desired text, but I can't filter out the tags such as the last one in my example.... How is it possible to extract just <p> tags?
grep "<p>" $File | sed -e 's/^[ \t]*//'
I have to add that, the last tag (which I don't want to appear in the output) is right after one of my desired tags (as is in my example) and using grep command all the content of that line would be returned as output... (This is my problem)
Don't. Trying to use regex to parse HTML is going to be painful. Use something like Ruby and Nokogiri, or a similar language + library that you are familiar with.
to extract text between <p> and </p>, try this
perl -ne 'BEGIN{$/="</p>";$\="\n"}s/.*(<p>)/$1/&&print' < input-file > output-file
or
perl -n0l012e 'print for m|<p>.*?</p>|gs'
xmllint --html --xpath "//*[name()='head' or name()='p']" "$file"
If you're dealing with broken HTML you might need a different parser. Here's a "one-liner" basically the same using lxml. Just pass the script your URL
#!/usr/bin/env python3
import sys
from lxml import etree

# sys.argv[1] is the file path or URL to parse (sys.argv[0] is the script itself)
tree = etree.parse(sys.argv[1], etree.HTMLParser(remove_blank_text=True, remove_comments=True))
for el in tree.xpath("//*[name()='p' or name()='head']"):
    print(etree.tostring(el, encoding="utf-8", with_tail=False).decode("utf-8"))

selenium xpath scrape of mixed content html span

I'm trying to scrape a span element that has mixed content
<span id="span-id">
<!--starts with some whitespace-->
<b>bold title</b>
<br/>
text here that I want to grab....
</span>
And here's a code snippet of a grab that identifies the span. It picks it up without a problem but the text field of the webelement is blank.
IWebDriver driver = new FirefoxDriver();
driver.Navigate().GoToUrl("http://page-to-examine.com");
var query = driver.FindElement(By.XPath("//span[@id='span-id']"));
I've tried adding /text() to the expression which also returns nothing. If I add /b I do get the text content of the bolded text - which happens to be a title that I'm not interested in.
I'm sure with a bit of xpath magic this should be easy but I'm not finding it so far!! Or is there a better way? Any comments gratefully received.
I've tried adding /text() to the expression which also returns nothing
This selects all the text-node-children of the context node -- and there are three of them.
What you refer to as "nothing" is most probably the first of these, which is a white-space-only text node (thus you see "nothing" in it).
What you need is:
//span[@id='span-id']/text()[3]
Of course, there are other variations possible:
//span[@id='span-id']/text()[last()]
Or:
//span[@id='span-id']/br/following-sibling::text()[1]
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="node()|@*">
"<xsl:copy-of select="//span[@id='span-id']/text()[3]"/>"
</xsl:template>
</xsl:stylesheet>
This transformation simply outputs whatever the XPath expression selects. When applied on the provided XML document (comment removed):
<span id="span-id">
<b>bold title</b>
<br/>
text here that I want to grab....
</span>
the wanted result is produced:
"
text here that I want to grab....
"
I believe the following XPath query should work for your case. The following-sibling axis is useful for what you're trying to do.
//span[@id='span-id']/br/following-sibling::text()
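As a quick way to check the idea outside Selenium, here is a small Python sketch using the standard-library `xml.etree.ElementTree`. ElementTree's limited XPath support cannot select text nodes directly, so the sketch reads the `tail` of the `<br/>` element, which is how ElementTree stores the text that follows an element. It uses the comment-free document from the XSLT verification above and is only an illustration, not a replacement for the XPath expressions.

```python
import xml.etree.ElementTree as ET

doc = """<span id="span-id">
<b>bold title</b>
<br/>
text here that I want to grab....
</span>"""

span = ET.fromstring(doc)

# The text following <br/> is stored as that element's "tail",
# which plays the role of br/following-sibling::text()[1].
br = span.find("br")
wanted = (br.tail or "").strip()
print(wanted)
```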