Is there any library for automatic detection of phrasal (compound) verbs in English texts? And maybe of other kinds of word groups that carry a special meaning?
I'm not aware of a library for that. However, you might find this article useful. It discusses using n-gram (Markov) models and statistical methods (such as SVM) to detect not only phrasal verbs, but also other phrasal terms.
André Luiz da Costa Carvalho, Edleno Silva de Moura, Pável Calado. "Using Statistical Features to Find Phrasal Terms in Text Collections". Journal of Information and Data Management 1(3), October 2010: 583–597.
http://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/79/39
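If you only need a rough, rule-based starting point rather than the statistical approach from the paper, a dependency parser can already surface many phrasal verbs: in spaCy's English models, the particle of a phrasal verb is labeled with the dependency relation prt. A minimal sketch (spaCy and the en_core_web_sm model are my assumptions, not something the paper uses):

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def phrasal_verbs(text):
        """Return (verb, particle) pairs found via the 'prt' dependency label."""
        doc = nlp(text)
        return [(tok.head.lemma_, tok.text)
                for tok in doc
                if tok.dep_ == "prt" and tok.head.pos_ == "VERB"]

    print(phrasal_verbs("She looked up the word and then gave up."))
    # typically: [('look', 'up'), ('give', 'up')]

This misses idiomatic multi-word expressions that are not verb + particle constructions; for those, the statistical methods described in the paper are more appropriate.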
I have a set of short phrases and a set of texts. I want to predict whether a phrase is related to an article; a phrase that does not appear in the article may still be related.
Some examples of annotated data (not real) look like this:
Example 1
Phrase: Automobile
Text: Among the more affordable options in the electric-vehicle marketplace, the 2021 Tesla Model 3 is without doubt the one with the most name recognition. It borrows some styling cues from the company's Model S sedan and Model X SUV, but goes its own way with a unique interior design and an all-glass roof. Acceleration is quick, and the Model 3's chassis is playful as well—especially the Performance model's, which receives a sportier suspension and a track driving mode. But EV buyers are more likely interested in driving range than speediness or handling, and the Model 3 delivers there too. The base model offers up to 263 miles of driving range according to the EPA, and the more expensive Long Range model can go up to 353 per charge.
Label: Related (PS: for a given text, one and only one phrase is labeled 'Related'; all other phrases are 'Unrelated'.)
Example 2
Phrase: Programming languages
Text: Python 3.9 uses a new parser, based on PEG instead of LL(1). The new parser’s performance is roughly comparable to that of the old parser, but the PEG formalism is more flexible than LL(1) when it comes to designing new language features. We’ll start using this flexibility in Python 3.10 and later.
The ast module uses the new parser and produces the same AST as the old parser.
In Python 3.10, the old parser will be deleted and so will all functionality that depends on it (primarily the parser module, which has long been deprecated). In Python 3.9 only, you can switch back to the LL(1) parser using a command line switch (-X oldparser) or an environment variable (PYTHONOLDPARSER=1).
Label: Related (i.e. all other phrases are 'Unrelated')
I think I may have to use, for example, a pre-trained BERT, because this kind of prediction needs additional knowledge. But this does not seem like a standard classification problem, so I can't find out-of-the-box code. May I have some advice on how to combine existing building blocks and train them?
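For concreteness, this is the kind of pairing I have in mind, written as a sentence-pair (cross-encoder) classification sketch with the Hugging Face transformers library; the model name and the two-label setup are my assumptions, and the classification head here is still untrained:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Untrained classification head on top of pre-trained BERT; it would need
    # fine-tuning on labeled (phrase, text) pairs before the scores mean anything.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # 0 = Unrelated, 1 = Related

    phrase = "Programming languages"
    text = "Python 3.9 uses a new parser, based on PEG instead of LL(1)."

    # Encode phrase and text as one sequence pair: [CLS] phrase [SEP] text [SEP]
    inputs = tokenizer(phrase, text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)
    print(probs)  # [P(Unrelated), P(Related)] once fine-tuned

Is this a reasonable way to set it up, or should the "one and only one related phrase per text" constraint be modeled differently, e.g. by ranking all candidate phrases for each text?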
I'm pretty new to Custom Translator and I'm working on a fashion-related EN_KO project.
There are many cases where a single English term has two possible translations into Korean. An example: if "fastening" is related to "bags, backpacks..." it is 잠금, but if it is related to "clothes, shoes..." it is 여밈.
I'd like to train the machine to recognize these differences. Could it be useful to upload a phrase dictionary? Any ideas? Thanks!
The purpose of training a custom translation system is to teach it how to translate terms in context.
The best way to teach the system how to translate is training with parallel documents of full sentence prose: the same document in two languages. A translation memory extract in a TMX or XLIFF file is the best material, but many other document formats are suitable as well, as long as you have both languages. Have at least 10000 sentences in both languages, upload to http://customtranslator.ai, and build a custom system with it.
If you have documents in Korean that are representative of the terminology and style you want to achieve, but no English match, you can automatically translate them to English and add them to the training material as parallel documents. Be sure not to use the automatically translated documents in the other direction.
A phrase dictionary is of limited help, because it is unaware of context. It is useful only in bootstrapping your custom system or for very rare terms where you cannot find or create a sentence.
This might go to cs or cstheory stack exchange, but I have seen the most questions tagged with formal-verification here.
Is there extensive literature on using denotational semantics for program verification?
With a quick search I have found
Wolfgang Polak. Program verification based on denotational semantics. In Conference Record of the Eighth ACM Symposium on Principles of Programming Languages, pages 149-158. ACM, January 1981.
http://www.pocs.com/Papers/POPL-81.pdf
Jacques Loeckx, Kurt Sieber. The Foundations of Program Verification, 2nd Edition. ISBN: 978-0-471-91282-8
and this course:
https://moves.rwth-aachen.de/teaching/ss-15/sv-sw/
Also, is there a practical program verification tool for some language using denotational semantics?
If you have used indeed.com before, you may know that for the keywords you search for, it returns traditional search results along with multiple search refinement options on the left side of the screen.
For example, searching for keyword "designer", the refinement options are:
Salary Estimate
$40,000+ (45982)
$60,000+ (29795)
$80,000+ (15966)
$100,000+ (6896)
$120,000+ (2828)
Title
Floral Design Specialist (945)
Hair Stylist (817)
GRAPHIC DESIGNER (630)
Hourly Associates/Co-managers (589)
Web designer (584)
more »
Company
Kelly Services (1862)
Unlisted Company (1133)
CyberCoders Engineering (1058)
Michaels Arts & Crafts (947)
ULTA (818)
Elance (767)
Location
New York, NY (2960)
San Francisco, CA (1633)
Chicago, IL (1184)
Houston, TX (1057)
Seattle, WA (1025)
more »
Job Type
Full-time (45687)
Part-time (2196)
Contract (8204)
Internship (720)
Temporary (1093)
How does it gather these statistics so quickly (e.g. the number of job offers in each salary range)? It looks like the refinement options are computed in real time, since even rare keywords load fast.
Is there a specific SQL technique for building such a feature? Or is there a manual on the web explaining the technology behind this?
The technology used by Indeed.com and other search engines is known as inverted indexing, and it is at the core of how search engines (e.g. Google) work. The filters you refer to ("refinement options") are known as facets.
You can use Apache Solr, a full-fledged search server built on Lucene that is easily integrated into your application through its RESTful API. It comes out of the box with features such as faceting, caching, scaling, and spell-checking, and it is used by sites such as Netflix, CNET, and AOL, so it is stable, scalable, and battle-tested.
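For illustration, a faceted query can be sent to Solr over HTTP; the core name and field names below are hypothetical, but the facet parameters are standard Solr:

    import requests

    # Hypothetical local Solr core "jobs" with fields company, location, job_type.
    params = {
        "q": "designer",          # the search keyword
        "rows": 10,               # number of hits to return
        "facet": "true",          # turn faceting on
        "facet.field": ["company", "location", "job_type"],
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/jobs/select", params=params)
    # Facet counts come back in the same response as the hits,
    # under facet_counts/facet_fields.
    print(resp.json()["facet_counts"]["facet_fields"])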
If you want to dig deeper into how facet-based filtering works, look up bitsets/bitarrays, which are described in this article.
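To sketch the idea behind facet counting (a toy illustration, not Indeed's actual implementation): keep a posting list of document ids per term and per facet value, then intersect the sets at query time. In a real engine the id sets are compressed bitsets, so these intersections are extremely fast.

    from collections import defaultdict

    docs = [
        {"id": 0, "text": "graphic designer", "location": "New York, NY", "job_type": "Full-time"},
        {"id": 1, "text": "web designer",     "location": "Chicago, IL",  "job_type": "Contract"},
        {"id": 2, "text": "hair stylist",     "location": "New York, NY", "job_type": "Part-time"},
    ]

    # Inverted index: term -> set of doc ids (a compressed bitset in a real engine).
    index = defaultdict(set)
    facets = defaultdict(lambda: defaultdict(set))
    for d in docs:
        for term in d["text"].split():
            index[term].add(d["id"])
        for field in ("location", "job_type"):
            facets[field][d[field]].add(d["id"])

    # Query "designer", then count each facet value by intersecting id sets.
    hits = index["designer"]
    for field, values in facets.items():
        counts = {v: len(ids & hits) for v, ids in values.items() if ids & hits}
        print(field, counts)
    # location {'New York, NY': 1, 'Chicago, IL': 1}
    # job_type {'Full-time': 1, 'Contract': 1}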
Why do you think they load "too fast"? They certainly have a well-scaled architecture, they use caching for sure, and they may be using a denormalized datastore to speed up some computations and queries.
Take a look at Google and the number of web pages worldwide: do you also think Google works "too fast"?
In addition to what Mios said, and as Daimon mentioned, Indeed does use a denormalized document store. Here is a link to Indeed's tech talk about its docstore:
http://engineering.indeed.com/blog/2013/03/indeedeng-from-1-to-1-billion-video/
There is also a related article on their engineering blog:
http://engineering.indeed.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/
I greatly enjoyed Douglas Crockford's recent lecture series, particularly the talk which covered the history of programming languages. I'd like to learn about this subject in more detail.
Consider this question language agnostic. I'm not interested in books that teach programming. I'm interested in books which discuss decisions made during the design of one or more languages.
The following three are, IMO, the must-read books for any programming languages junkie :)
Project Oberon by Niklaus Wirth
Language Implementation Patterns by Terence Parr
Programming Language Pragmatics by Michael Scott
Every 15 years, the ACM puts on a History of Programming Languages conference (affectionately known as HoPL). The proceedings are of exceptionally high quality but are, unfortunately, only available behind the ACM paywall. (However, if you access them from a university, college or school IP address, you should be able to get at them.)
For HoPL-III (2007), Guido van Rossum wanted to submit a paper about Python, but he wasn't able to meet the review requirements in time, so he published it in the form of a blog post instead.
Several presenters also published their papers for free, in addition to the official conference proceedings. Also, several presenters gave the same talk again, at a different venue. For example, Guy L. Steele, Jr. and Richard P. "Dick" Gabriel repeated their "50 in 50" talk (which, as you can imagine if you've ever seen a talk by Guy Steele or Dick Gabriel, is not really a talk, more like multimedia performance art crossed with poetry slam meets Broadway), which presents 50 programming languages in 50 words each.
As @Missing Faktor mentioned above, not only Project Oberon but all of Niklaus Wirth's languages are tremendously well documented: Algol-60, Algol-X, Algol-W, Pascal, Modula-2, and Oberon.
Structure and Interpretation of Computer Programs. I have a print copy, but it's now available online for free:
http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-4.html#%_toc_start
The Design and Evolution of C++
http://www2.research.att.com/~bs/dne.html
Programming Language Essentials
Rationale for the Design of the Ada Programming Language:
http://www.amazon.com/Rationale-Design-Programming-Language-Companion/dp/0521392675
Although the book discusses the original version of the language, it still makes interesting reading. For each design decision, rationale and discussion are included, both from the point of view of the programmer and from that of the compiler implementer.
"Architecture of Concurrent Programs", by the late Per Brinch Hansen, includes a good overview of the design and rationale for his Concurrent Pascal language, which added monitors (and other things) to his Sequential Pascal, a proper subset of Pascal.
The big thing missing from Sequential Pascal is pointers. However, given the restrictions intended to be placed on Sequential Pascal programs, everything you can do with a pointer you can also do with an array index, and in a more secure way: "secure" in the sense that it is impossible (and checked by the compiler!) to do illegal things.