Creating a common embedding for two languages - deep-learning

My task deals with multiple languages (English and Hindi). For that, I need a common embedding to represent both languages.
I know there are methods for learning multilingual embeddings, like MUSE, but those represent the two embeddings in a common vector space; the spaces end up similar, but not the same.
So I wanted to know if there is any method or approach that can learn a single embedding that represents both languages.
Any lead is strongly appreciated!!!

I think a good lead would be to look at past work that has been done in the field. A good overview to start with is Sebastian Ruder's talk, which gives you a multitude of approaches, depending on the level of information you have about your source/target languages. This is basically what MUSE is doing, and I'm relatively sure that it is considered state-of-the-art.
The basic idea in most approaches is to map the embedding spaces onto each other such that you minimize some (usually Euclidean) distance between the two (see p. 16 of the link). This obviously works best if you have a known dictionary and can precisely map the different translations, and it works even better if the two languages have similar linguistic properties (not so sure about Hindi and English, to be honest).
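To make the mapping idea concrete, here is a minimal sketch of the usual supervised alignment step (orthogonal Procrustes over a bilingual dictionary), assuming you already have the two monolingual embedding matrices; the array shapes and the random placeholders are purely illustrative:

```python
import numpy as np

# Hypothetical placeholders: rows of X are source-language (e.g. Hindi) vectors
# and rows of Y are the corresponding target-language (e.g. English) vectors
# for a known bilingual dictionary, one translation pair per row.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300))   # replace with real dictionary vectors
Y = rng.standard_normal((5000, 300))

# Orthogonal Procrustes: find W minimizing ||XW - Y||_F subject to W^T W = I.
# Closed form: W = U V^T, where U S V^T is the SVD of X^T Y.
# (A refinement step along these lines is part of MUSE's supervised pipeline.)
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def to_shared_space(src_vectors):
    """Map source-language vectors into the (approximately) shared space."""
    return src_vectors @ W
```

After alignment you still have two vocabularies, just living in one space; that is exactly the "similar but not the same" situation described in the question.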
Another recent approach is Multilingual BERT (mBERT) or, similarly, XLM-RoBERTa, but those learn embeddings based on a shared vocabulary. This might again be less desirable if you have morphologically dissimilar languages, and it also has the drawback that these models incorporate a bunch of other, unrelated languages.
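If you do go the mBERT/XLM-R route, extracting vectors that live in a single shared space is straightforward with the Hugging Face transformers library. This is only a sketch; the checkpoint name and the mean-pooling choice are common defaults, not the only options:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-multilingual-cased" is one publicly available mBERT checkpoint;
# a multilingual alternative such as "xlm-roberta-base" can be swapped in.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentence):
    """Return one vector per sentence by mean-pooling the last hidden layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

english = embed("The weather is nice today.")
hindi = embed("आज मौसम अच्छा है।")
similarity = torch.cosine_similarity(english, hindi, dim=0)
```

Sentence-level pooling like this is crude; depending on your task you may want token-level vectors or a model fine-tuned for cross-lingual similarity.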
Otherwise, I'm unclear on what exactly you are expecting from a "common embedding", but happy to extend the answer once clarified.

Related

What language should I use for editing documents?

Document editors are nice but they have their limitations.
What is a good alternative to them?
I already know HTML and CSS and while they can do the job, they are ill-suited for printed documents.
I was thinking of learning LaTeX, because many scholars use it. But I wonder if someone would recommend another language, such as PostScript.
LaTeX is fine. You don't want to write PostScript by hand.
I'm using LaTeX almost exclusively nowadays, at least for text documents (everything from CVs and letters to manuals).
For quick one-off notes, I'm actually using Markdown (without a renderer; I just think that Markdown preserves document structure quite nicely even when used in text-only mode).
For presentations and spreadsheets, I use appropriate applications, though. In particular, I don’t think LaTeX is that well-suited to do the former (depending on your style of presentations, obviously. Mine have next to no text though …).
I finally got a chance to write an entire paper in LaTeX for my final semester of college and found it to be easier than I thought it would be. A couple of the nice things I found about it were:
A fairly lightweight syntax for most things (tables being the only real offender, but no one can get text tables right).
An extremely wide array of syntax for doing anything from automatically marking up a chemical formula to writing inline lists.
Beautiful output automatically.
Extremely easy to write modular documents where I might store a chapter in a file and then simply \include{} it in another. One particularly nice use I found for this was to include code that I had written in the document simply by referencing the files.
Wonderful support for footnotes and bibliographic references.
Libraries for just about anything you can imagine.
The major drawbacks are, IMHO:
A lack of any real direction or life in the language. It feels dead, and not because it's done.
A frustrating build process, although there are tools to help with that, from a simple bash script to a full-fledged makefile.
If you're interested in learning LaTeX, I would recommend starting out by reading the Not So Short Introduction to LaTeX 2e PDF.
However, I decided against using LaTeX for most things that I write these days, specifically because it feels dead and has a frustrating build process. I instead switched over to MultiMarkdown, as it is well supported and can be transformed into a large array of other formats, including LaTeX, which can then be hand-massaged if you really need to in order to get it into the format expected by some publication. If you haven't played with MultiMarkdown or Markdown before, then I highly recommend checking them out. The syntax is extremely lightweight and natural, even compared to LaTeX. I find that, except for some of the higher-level typographical constructs, MultiMarkdown supports everything I need on a regular basis.
My 2 cents.
It depends on what you want to do. If you are planning to write a formal document, maybe for printing too, just go for LaTeX.
It is not as difficult as it may appear at the very beginning, but it is professional and fulfilling.
If the Web is your goal, go for HTML/CSS.
OpenOffice or Word would do the trick in most cases; do not underestimate them. If you are going to use them (for example, for work), take the time to learn them properly.
To expand on zzzzBov's comment, LaTeX is SUPPOSED to allow the writer to concentrate on the content and let the compiler/documentclass handle the formatting (and that is usually true). If you use HTML/CSS to format, you will probably spend more time (rather than less) on formatting. Imagine that the LaTeX documentclass is the CSS, only it is already written for you, and your LaTeX source is the content, only the tags are more functional (such as italics or equations) than the glue between the HTML and the CSS (<div ...>). I recommend the LaTeX wikibook as an easy way to start, and the short-math-guide if you need mathematics. Enjoy!

What are situations with western languages where you'd use HTML 5's Ruby element?

HTML 5 is introducing a new element: <ruby>; here's the W3C's description:
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana.
They then go on to give a few examples of ruby annotations in use for Chinese and Japanese text. I'm wondering, though: is this element going to be useful only for East Asian HTML documents, or are there good semantic applications for the <ruby> element in Western languages like English, German, Spanish, etc.?
id-ee-oh-SINK-ruh-sees
Could be useful for people learning English, as our writing system has many idiosyncrasies that make it somewhat less than phonetic.
As a linguist, I can see the benefits in using <ruby> for marking up linguistic examples with various theoretical notational conventions. One example that comes to mind is indicating tonal levels in autosegmental phonology. Here's a quick example I threw together that can be seen in the latest Webkit/Chromium (at least):
http://miketaylr.com/code/western_ruby.html
Currently, this type of notation is left to LaTeX and friends, and if on the web, it is generally a non-accessible image.
As I understand it, ruby annotations are not really relevant in Western languages because Western alphabets are (more or less) phonetic. In Japanese they are used to give a pronunciation guide for logographic characters which don't have obvious pronunciations (unless you've memorized them). I suppose the Western analog would be IPA notation in brackets following a word, but those are rarely used and I don't know if Ruby annotations would be appropriate for them.
My list:
theoretical notational conventions (miketaylr's answer): http://miketaylr.com/code/western_ruby.html
language learning (Adam Bellaire's answer): "id-ee-oh-SINK-ruh-sees" shown above "idiosyncrasies" - made with ASCII 'nbsp' art
abbreviation, acronym, initialism (possibly - why hover?)
learning technical terms of English origin that have been accidentally translated into your non-English native language
I'm often forced to do the latter at uni. While the translated terminology is often consistent, very often it is not at all self-explanatory, or not as much as the original English term.
Also, the same term may have been translated using several translation systems by different authors/groups.
Another problem group is when, for example, queue, row, and series (and sometimes tuple) are all translated to the very same word in your language.
Given a Western language with fewer users, and the low percentage of technical people in the population, this actually makes it much easier to learn the topic directly from English and then learn the translations in a second step.
Ruby could be a tool to transform this into a one-step process, providing either the translations or the original as a "Furigana".

Which tools do you use to analyze text?

I'm in need of some inspiration. For a hobby project I am playing with content analysis. I am basically trying to analyze input to match it to a topic map.
For example:
"The way on Iraq" > History, Middle East
"Halloumni" > Food, Middle East
"BMW" > Germany, Cars
"Obama" > USA
"Impala" > USA, Cars
"The Berlin Wall" > History, Germany
"Bratwurst" > Food, Germany
"Cheeseburger" > Food, USA
...
I've been reading a lot about taxonomy, and in the end, whatever I read concludes that all people tag differently and therefore the system is bound to fail.
I thought about tokenizing the input and using stop word lists, but they are of course a lot of work to come up with and build. Building the relevant links between words and topics seems exhausting and also never-ending, because whatever language you deal with is very rich, and most languages also rely heavily on context. Let alone maintaining it all.
I guess I need to come up with something smart and train it with topics I want it to be able to guess. Kind of like an Eliza bot.
Anyway, I don't believe there is something that does that out of the box, but does anyone have any leads or examples of technology I could use to analyze input and extract meaning?
Hiya. I'd first look to OpenCalais (from the Reuters guys) for finding entities within texts or input. It's great, and I've used it plenty myself.
After that you can analyze the text further, creating associations between entities and words. I'd probably look them up in something like WordNet and try to typify them, or even auto-generate an ontology that matches the domain you're trying to map.
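As a small illustration of the WordNet idea, here is a sketch using NLTK's WordNet interface to walk up the hypernym chain of a word, which is one cheap way to "typify" an entity; the example word is arbitrary, and the chain you get back is whatever WordNet contains, not a curated topic map:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def typify(word):
    """Return the hypernym chain of the most common noun sense of `word`."""
    senses = wn.synsets(word, pos=wn.NOUN)
    if not senses:
        return []
    chain = []
    synset = senses[0]
    while synset.hypernyms():
        synset = synset.hypernyms()[0]
        chain.append(synset.name())
    return chain

print(typify("cheeseburger"))
# climbs through more and more general categories, e.g. sandwich -> snack
# food -> dish -> ... -> entity, which you can then map onto your own topics
```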
As to how to pull it all together, there are many things you can do: the above, or two- or three-pass models of trying to figure out what words are and what they mean. Or, if you control the input, make up a format that is easier to parse, or go down the murky path of NLP (which is a lot of fun).
Or you could look at something like Jena for parsing arbitrary RDF snippets, although I don't like the RDF premise myself (I'm a Topic Mapper). I've written stuff that looks up words, phrases, or names in Wikipedia and rates their hit rate based on the semantics found in the Wikipedia pages (I could tell you more of the details if requested, but isn't it more fun to work it out yourself and come up with something better than mine? :), i.e. the number of links, the number of "See also" entries, the amount of text, how big the discussion page is, etc.
I've written tons of stuff over the years (even in PHP and Perl; look at Robert Barta's Topic Maps stuff on CPAN, especially the TM modules, for some kick-ass stuff), from engines to parsers to something weird in the middle: associative arrays that break words and phrases apart, creating cumulative histograms to sort out their components, and so forth. It's all fun stuff, but as for shrink-wrapped tools, I'm not so sure. Everyone's goals and needs seem to be different. It depends on how complex and sophisticated you want to become.
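For the histogram part, here is a minimal sketch of that kind of pipeline (tokenize, drop stop words, accumulate counts) using only the Python standard library; the stop word list is a tiny placeholder:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "on", "of", "and", "in"}  # placeholder list

def histogram(texts):
    """Accumulate a word-frequency histogram over many input strings."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

hist = histogram(["The Berlin Wall", "Bratwurst", "The war on Iraq"])
print(hist.most_common(5))
```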
Anyway, hope this helps a little. Cheers! :)
SemanticHacker does exactly what you want, out-of-the-box, and has a friendly API. It's somewhat inaccurate on short phrases, but just perfect for long texts.
“The war on Iraq” > Society/Issues/Warfare and Conflict/Specific Conflicts
“Halloumni” > N/A
“BMW” > Recreation/Motorcycles/Makes and Models
“Obama” > Society/Politics/Conservatism
“Impala” > Recreation/Autos/Makes and Models/Chevrolet
“The Berlin Wall” > Regional/Europe/Germany/States
“Bratwurst” > Home/Cooking/Meat
“Cheeseburger” > Home/Cooking/Recipe Collections; Regional/North America/United States/Maryland/Localities
Sounds like you're looking for a Bayesian Network implementation. You may get by using something like Solr.
Also check out CI-Bayes. Joseph Ottinger wrote an article about it on theserverside.net earlier this year.
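If you want to play with the Bayesian idea before committing to Solr or CI-Bayes, here is a toy sketch of a multinomial naive Bayes topic classifier with scikit-learn; the training phrases and labels are placeholders, and it assigns only one topic per input, unlike the multi-topic examples in the question:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: short phrases, each labelled with a single topic.
texts = ["Berlin Wall", "Bratwurst recipe", "BMW 3 series",
         "Cheeseburger", "Obama speech", "Impala V8 engine"]
topics = ["Germany", "Food", "Cars", "Food", "USA", "Cars"]

# Character n-grams help short, unseen phrases still share features
# with the training examples.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
model.fit(texts, topics)

print(model.predict(["Currywurst"]))
```

For multi-topic output you could train one binary classifier per topic, or call predict_proba and keep every topic above some threshold.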

Writing XSS Filter for (X)HTML Based on White List

I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use existing high-quality filters
written in PHP, because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white
list of attributes for these tags. For example, typical HTML input can consist of
<b>, <i>, and <a> tags, with href allowed on <a>. But a straightforward implementation is not
good enough, because even allowed simple links may include XSS, e.g. a link like
<a href="javascript:alert('XSS')">Click On Me</a>
Many other such examples can be found. So I also thought about the possibility of creating a white list of prefixes for attributes like href/src -- then I always need to check whether the value starts with (https?|ftp)://
Questions:
Are these assumptions good enough for most purposes? That is, if I do not
allow any style options and check src/href against a white list of prefixes, does that solve the XSS problems? Are there problems that can't be fixed this way?
Is there a good reference for a formal grammar of HTML/XHTML, so that I can write a simple
parser that would clean up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which is trying to accomplish the same thing. It's Java and .NET, though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1, a bit extra:
You can potentially come up with very strict white listing. It should be well structured, pretty tight, and not very flexible. When you combine flexibility, many tags and attributes, and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on the <img> tag); then you need to whitelist the attribute values, as you stated: http|https|ftp, or style="color|background-color", etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
Also, you need to think about character whitelisting or UTF-8 normalization, because different encodings can cause awkward issues, such as newlines in attributes or invalid UTF-8 sequences.
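The question asks for C++, but just to illustrate the white-list logic discussed above (allowed tags, allowed attributes per tag, allowed URL schemes for href/src), here is a short sketch in Python built on the standard library's HTMLParser. The ALLOWED sets are illustrative placeholders, and a real filter would also need to balance tags and deal with the encoding issues mentioned above:

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

ALLOWED = {"b": set(), "i": set(), "a": {"href"}}   # tag -> allowed attributes
ALLOWED_SCHEMES = {"http", "https", "ftp"}          # for href/src values

class WhitelistFilter(HTMLParser):
    """Rebuilds the input, keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag entirely; its text content is still escaped
        safe = []
        for name, value in attrs:
            if name in ALLOWED[tag] and value is not None:
                if name in ("href", "src") and urlparse(value).scheme not in ALLOWED_SCHEMES:
                    continue  # rejects javascript:, data:, etc.
                safe.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(safe)))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))

def sanitize(html_text):
    f = WhitelistFilter()
    f.feed(html_text)
    f.close()
    return "".join(f.out)

print(sanitize('<a href="javascript:alert(1)">Click On Me</a><script>x()</script>'))
# -> <a>Click On Me</a>x()
```

The same structure carries over to C++: a SAX-style HTML parser plus lookup tables for the white lists, with everything not explicitly allowed being escaped or dropped.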
All the details of HTML parsing are specified in HTML 5. However, implementing it is quite a lot of work, and it doesn't matter whether you parse HTML exactly, with all its corner cases. At worst you'll end up with a different DOM, but you have to sanitize the DOM anyway.
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex of an implementation you want to come up with.
A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive I would look into doing a conversion of one of the established versions to C++, as opposed to trying to write your own from scratch. There are so many tricks to worry about, that I think you'd be better off standing on the shoulders of others that have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ won't be able to duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it'd definitely still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained; there's a comparison document where the author discusses some differences between his approach and others', which is probably worth reading.
Whatever you come up with, definitely test it with all the examples you link, and make sure it passes all those. Good luck!

What is semantic markup, and why would I want to use that?

Like it says.
Using semantic markup means that the (X)HTML code you use in a page contains metadata describing its purpose -- for example, an <h2> that contains an employee's name might be marked class="employee-name". Originally, some people hoped that search engines would use this information, but as the web has evolved, semantic markup has mostly been used for providing hooks for CSS.
With CSS and semantic markup, you can keep the visual design of the page separate from the markup. This results in bandwidth savings, because the design only has to be downloaded once, and easier modification of the design because it's not mixed in to the markup.
Another point is that the elements used should have a logical relationship to the data contained within them. For example, tables should be used for tabular data, <p> should be used for textual paragraphs, <ul> should be used for unordered lists, etc. This is in contrast to early web designs, which often used tables for everything.
Semantics literally means using "meaningful" language; in Web Development, this basically means using tags and identifiers which describe the content.
For example, applying IDs such as #Navigation, #Header and #Content to your <div> tags, rather than #Left and #Main, or using unordered lists for a list of navigational links, rather than a table.
The main benefits are in future maintenance; you can easily change the layout or the presentation without losing the meaning of your content. Your navigation bar can move from the left to the right, or your links can be displayed horizontally rather than vertically, without losing the meaning.
From http://www.digital-web.com/articles/writing_semantic_markup/ :
semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information.
Besides the already mentioned goal of allowing software to 'understand' the data, there are more practical applications in using it to translate between ontologies, or to map between dissimilar representations of data, without having to translate or standardize the data (which can result in a loss of information and typically prevents you from improving your understanding in the future).
There were at least two sessions at OSCon this year related to the use of semantic technologies. One was on BigData (slides are available at http://en.oreilly.com/oscon2008/public/schedule/proceedings); the other was by the guys from FreeBase.
BigData was using it to map between two dissimilar data models (including the use of query languages that were specifically created for working with semantic data sets). FreeBase is mapping between different data sets and then performing further analysis to derive meaning across those data sets.
Related topics to look into: OWL, OQL, SPARQL, Franz (AllegroGraph, RacerPRO and TopBraid).
Here is an example of an HTML5, semantically tagged website that I've been working on; it uses the recently accepted micro-formats specified at http://schema.org, along with the new, more semantic tagging elements of HTML5.
http://blog-to-book.com/view/stuff/about/semantic%20web
Google has a handy semantic tagging test tool that will show you how adding semantic tags to content enables search engines to 'understand' far more about your web pages.
Here is the test tool: http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fblog-to-book.com%2Fview%2Fstuff%2Fabout%2Fsemantic+web&view=
Notice how Google now knows that the 'things' on the page are books and that they have an isbn13 identifier. Adding additional metadata, such as price and author, enables further inferences to be made.
Hope this points you in some interesting directions. More detailed semantic tagging can be achieved using the Good Relations Ontology which is pretty much the most comprehensive I can think of right now.