Machine learning of word structure [closed] - language-agnostic

I am working on a system that can create made-up fantasy words based on a variety of user input, such as syllable templates or a modified Backus-Naur Form. One planned mode, though, is machine learning: the user does not explicitly define any rules, but pastes some text, and the system learns the structure of the given words and creates similar words.
My current naïve approach would be to create a table of letter-neighborhood probabilities (including a special end-of-word "letter") and fill it by scanning the input in letter pairs (using whitespace and punctuation as word boundaries). Creating a word would then mean looking up the probabilities of every letter that can follow the current letter, randomly choosing one according to those probabilities, appending it, and repeating until end-of-word is encountered.
But I am looking for more sophisticated approaches that (probably?) produce better results. I do not know much about machine learning, so pointers to topics, techniques, or algorithms are appreciated.

I think that for independent words (and especially names), a simple Markov chain system (which is what you seem to be describing when you talk about using letter pairs) can perform really well. Feed it a lexicon and throw it a seed to generate a new name based on what it learned. You may want to tweak the prefix length of the Markov chain to get nice-sounding results (as pointed out in a comment to your question, two letters work much better than one).
I once tried it with Elvish and Orcish name dictionaries and got very satisfying results.
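For illustration, a minimal sketch of such a letter-level Markov chain generator in Python; the lexicon, the function names, and the order parameter are made up for this example:

import random
from collections import defaultdict

def train(words, order=2):
    # For each prefix of `order` letters, count which letter follows it.
    table = defaultdict(lambda: defaultdict(int))
    for word in words:
        padded = "^" * order + word.lower() + "$"   # ^ pads the start, $ marks end-of-word
        for i in range(len(padded) - order):
            table[padded[i:i + order]][padded[i + order]] += 1
    return table

def generate(table, order=2, max_len=12):
    # Sample letters according to the learned counts until end-of-word is drawn.
    prefix, out = "^" * order, []
    while len(out) < max_len:
        choices = table.get(prefix)
        if not choices:
            break
        letters, counts = zip(*choices.items())
        letter = random.choices(letters, weights=counts)[0]
        if letter == "$":
            break
        out.append(letter)
        prefix = prefix[1:] + letter
    return "".join(out)

lexicon = ["galadriel", "legolas", "elrond", "arwen", "thranduil"]
model = train(lexicon, order=2)
print([generate(model, order=2) for _ in range(5)])

With order=1 this reduces to the letter-pair table described in the question; raising the order makes the output track the training lexicon more closely.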

Related

Is it good to keep multi-valued attributes in a table existing in a 24/7 running service? [closed]

I am working on a project where I am using a table with a multi-valued attribute that has 5-10 values. Is it good to keep multi-valued attributes, or should I normalize the table into normal forms?
But I think normalizing unnecessarily increases the number of rows: if an attribute has 10 values, then each row or tuple gets replaced with 10 new rows, which might increase query running time.
Can anyone give suggestions on this?
The first normal form requires that each attribute be atomic.
I would say that the answer to this question hinges on the meaning of "atomic": it is too narrow to define it as "indivisible", because then no string would be atomic, as it can be split into letters.
I prefer to define it as “a single unit as far as the database is concerned”. So if this array (or whatever it is) is stored and retrieved in its entirety by the application, and its elements are never accessed inside the database, it is atomic in this sense, and there is nothing wrong with the design.
If, however, you plan to use elements of that attribute in WHERE conditions, if you want to modify individual elements with UPDATE statements or (worst of all) if you want the elements to satisfy constraints or refer to other tables, your design is almost certainly wrong. Experience shows that normalization leads to simpler and faster queries in that case.
Don't try to get away with a few large table rows. Databases are optimized for dealing with many small table rows.
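As an illustration of the normalized design (the table and column names are hypothetical, and sqlite3 merely stands in for whatever database is actually used), a minimal sketch in Python:

import sqlite3

# Instead of storing e.g. a comma-separated list of tags in one column of `item`,
# the values go into a child table that references the parent row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE item_tag (
    item_id INTEGER NOT NULL REFERENCES item(item_id),
    tag     TEXT NOT NULL,
    PRIMARY KEY (item_id, tag)
);
""")
conn.execute("INSERT INTO item VALUES (1, 'widget')")
conn.executemany("INSERT INTO item_tag VALUES (?, ?)",
                 [(1, "red"), (1, "metal"), (1, "fragile")])

# Individual values can now be used directly in WHERE clauses, constraints, and joins.
rows = conn.execute("""
    SELECT i.name FROM item i
    JOIN item_tag t ON t.item_id = i.item_id
    WHERE t.tag = 'metal'
""").fetchall()
print(rows)   # [('widget',)]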

Optimal way to check if given sentence(query) contains any of the predefined keywords [closed]

It might look like a simple question that has been answered countless times, but I could not find the optimal way (using some database).
I have a list of a few thousand keywords (let's say abusive words). Whenever someone posts a message (a long sentence or a paragraph), I want to check whether it contains any of the keywords, so that I can block the user or take other action.
I am looking for a database/schema that can solve the above problem and respond within a few milliseconds (<15 ms).
There are many databases that solve the reverse of this problem: given the keywords, find the documents containing them (full-text search).
Try ClickHouse for your workload.
According to the docs:
multiMatchAny(...) returns 0 if none of the regular expressions match and 1 if any of the patterns matches. It uses the hyperscan library. For searching substrings in a string, it is better to use multiSearchAny since it works much faster.
The length of any haystack string must be less than 2^32 bytes.
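A minimal sketch of how this could look from Python; the clickhouse_driver client, the keyword list, and the simplified quoting are assumptions for this example, not something the answer specifies:

from clickhouse_driver import Client

client = Client("localhost")
keywords = ["badword1", "badword2", "badword3"]   # a few thousand entries in practice

def contains_keyword(message):
    # Build a ClickHouse array literal of needles; real code should escape the
    # keywords properly or keep them in a table instead of inlining them here.
    needles = ", ".join("'" + k.lower().replace("'", "\\'") + "'" for k in keywords)
    result = client.execute(
        "SELECT multiSearchAny(%(msg)s, [" + needles + "])",
        {"msg": message.lower()},
    )
    return bool(result[0][0])

print(contains_keyword("Some long sentence or paragraph posted by a user"))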

Encode probability distribution in single cell of table [closed]

Is there any mechanism for storing all the information of a probability distribution (discrete and/or continuous) into a single cell of a table? If so, how is this achieved and how might one go about making queries on these cells?
Your question is very vague, so only general hints can be provided.
I'd say there are two typical approaches for this (if I got your question right):
you can store some complex data in a single "cell" (as you call it) inside a database table. The easiest way is JSON encoding: you have an array of values, encode it to a string, and store that string. If you want to access the values again, you query the string and decode it back into an array. Newer versions of MariaDB or MySQL offer extensions to access such values at the SQL level too, though access is pretty slow that way.
you use an additional table for the values and store only a reference in the cell. This is actually the typical and preferred approach; it is how the relational database model works. The advantage of this approach is that you can directly access each value separately in SQL, that you can use mathematical operations like sums and averages at the SQL level, and that you are not limited in storage space the way you are with a single cell. You can also filter the values, for example by date ranges or value boundaries.
In the end, taken together, both approaches offer the same thing, though they require different handling of the data. The first approach additionally requires some scripting language on the client side to handle encoding and decoding, but that is typically available anyway.
The second approach is considered cleaner and will be faster in most cases, except if you always access the whole set of values at once. So a decision can only be made knowing more specific details about the environment and the goal of the implementation.
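A minimal sketch of the first approach (JSON in a single cell) in Python; the table, the column names, and the example distribution are hypothetical, and sqlite3 stands in for whatever database is used:

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, distribution TEXT)")

# Encode a discrete distribution (outcome -> probability) as a JSON string
# and store it in a single cell.
dist = {"heads": 0.5, "tails": 0.5}
conn.execute("INSERT INTO experiment VALUES (?, ?)", (1, json.dumps(dist)))

# The database only sees an opaque string; the application decodes it and
# works with the individual values on the client side.
row = conn.execute("SELECT distribution FROM experiment WHERE id = 1").fetchone()
restored = json.loads(row[0])
print(restored["heads"])   # 0.5

For the second approach, each (outcome, probability) pair would instead become a row in a separate table that references the experiment, as described above.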
Say we have a distribution listed one value per row in column B (say B1:B13), and we want to place the whole distribution in a single cell. In C1 enter:
=B1
and in C2 enter:
=B2 & CHAR(10) & C1
and copy downwards. Finally, format cell C13 with text wrapping turned on, so the entire distribution is displayed in that single cell.

"Stop words" list for English? [closed]

I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the".
Where can I find some lists of these uninteresting words?
Is a list of these words the same as a list of the most frequently used words in English?
update: these are apparently called "stop words" and not "skip words".
The magic word to put into Google is "stop words". This turns up a reasonable-looking list.
MySQL also has a built-in list of stop words, but this is far too comprehensive for my tastes. For example, at our university library we had problems because "third" in "third world" was considered a stop word.
These are called stop words; check this sample.
Depending on the subdomain of English you are working in, you may have to (or wish to) compile your own stop word list. Some generic stop words can be meaningful in a domain; e.g. the word "are" could actually be an abbreviation or acronym somewhere. Conversely, depending on your application you may want to ignore some domain-specific words that you would not ignore in general English; e.g. if you are analyzing a corpus of hospital reports, you may wish to ignore words like "history" and "symptoms", since they appear in every report and may not be useful (from a plain vanilla inverted index perspective).
Otherwise, the lists returned by Google should be fine. The Porter stemmer uses such a list, and so does the Lucene search engine implementation.
Get statistics about word frequency in a large text corpus, and ignore all words whose frequency is above some threshold.
I think I used the stop word list for German from here when I built a search application with lucene.net a while ago. The site contains a list for English, too, and the lists on the site are apparently the ones that the Lucene project uses as defaults.
Typically these words will appear in documents with the highest frequency.
Assuming you have a global list of words with their counts:
{ Word: Count }
If you order the words from highest count to lowest, you get a curve (count on the y axis, word rank on the x axis) that looks like an inverse logarithmic function. All of the stop words sit at the left, and the cutoff point for the "stop words" lies where the magnitude of the first derivative (the steepness of the drop) is greatest.
This approach is better than a dictionary-based one because:
It is a universal approach, not bound to a particular language.
It learns which words are to be deemed "stop words".
It produces better results for collections of very similar documents, and unique word listings for the items in those collections.
The stop words can be recalculated at a later time (which allows caching, plus a statistical check of whether the stop words have changed since they were last calculated).
It can also eliminate time-based or informal words and names (such as slang, or a company name that appears as a header in a bunch of documents).
The dictionary-based approach is better because:
The lookup time is much faster.
The results are precached.
It's simple.
Someone else came up with the stop words.
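A minimal sketch of the frequency-based idea in Python; the tokenization, the tiny corpus, and the fixed-fraction cutoff (standing in for the "steepest drop in the curve" criterion) are assumptions for this example:

import re
from collections import Counter

def find_stop_words(documents, top_fraction=0.1):
    # Count every word across the collection, then treat the most frequent
    # fraction of the vocabulary as stop words.
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    ranked = [word for word, _ in counts.most_common()]
    cutoff = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:cutoff])

docs = [
    "the cat sat on the mat and the dog watched",
    "the dog and the cat are friends",
]
print(find_stop_words(docs))   # {'the'} for this tiny corpus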

How do you call those little annoying cases you have to check all the time? [closed]

What do you call those little annoying cases that have to be checked, like "it's the first time someone enters a record", "delete the last record in a linked list (in a C implementation)", ... ?
The only term I know translates not-very-nicely to "end-cases" . Does it have a better name?
Edge cases.
Corner cases
Every prof I have ever had has referred to them as boundary cases or special cases.
I use the term special cases
I call it work ;-).
Because they pay me for it.
But edge cases (as mentioned before) is probably a more correct name.
I call them "nigglies". But, to be honest, I don't care about the linked list one any more.
Because memory is cheap, I always implement lists so that an empty list contains two special nodes, first and last.
When searching, I iterate from first->next to last->prev inclusively (so I'm not looking at the sentinel first/last nodes).
When I insert, I use this same limit to find the insertion point - that guarantees that I'm never inserting before first or after last, so I only ever have to use the "insert-in-the-middle" case.
When I delete, it's similar. Because you can't delete the first or last node, the deletion code only has to handle the "delete-from-the-middle" case as well.
Granted, that's just me being lazy. In addition, I don't do a lot of C work any more anyway, and I have a huge code library to draw on, so my days of implementing new linked lists are long gone.
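A minimal sketch of that sentinel-node idea in Python (class and method names are made up here; the same structure translates directly to C):

class Node:
    def __init__(self, value=None):
        self.value = value
        self.prev = None
        self.next = None

class SentinelList:
    # Doubly linked list whose ends are guarded by two sentinel nodes,
    # so insertion and deletion only ever need the "middle" case.
    def __init__(self):
        self.first = Node()              # sentinel, never holds data
        self.last = Node()               # sentinel, never holds data
        self.first.next = self.last
        self.last.prev = self.first

    def insert_before(self, node, value):
        # Valid for any position between the sentinels, including in an empty list.
        new = Node(value)
        new.prev, new.next = node.prev, node
        node.prev.next = new
        node.prev = new
        return new

    def append(self, value):
        return self.insert_before(self.last, value)

    def delete(self, node):
        # Sentinels are never deleted, so no special head/tail handling is needed.
        node.prev.next = node.next
        node.next.prev = node.prev

    def __iter__(self):
        node = self.first.next
        while node is not self.last:
            yield node.value
            node = node.next

lst = SentinelList()
for v in (1, 2, 3):
    lst.append(v)
lst.delete(lst.first.next)               # removing the head needs no special case
print(list(lst))                         # [2, 3]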