How to find a dataset description - deep-learning

Give me the data description for any dataset used to perform translation from one language to another. It must contain the following:
Number of sentences
Number of tokens
Number of unique words


How to use key-value pairs in a relational database (MySQL)?

I want to use a relational database (MySQL) to store my data as key-value pairs.
The number of key-value pairs will be determined dynamically.
I can create a simple table to store them in separate columns.
Values can be of type int, varchar, text, or date.
The problem I am facing is:
When I need to run a query on a key whose value is an integer, I need to use a greater-than or less-than comparison with it. The same applies when I need to use a BETWEEN query on date fields.
How can I achieve this?
Edit:
For greater clarity, I am providing the background for this question which I have divided into three parts:
1. Data  2. Use Case  3. Possible Designs
1. Data
Suppose I'm creating a data store for a country's census (just an example). The fields for storing data would differ for a male, female, boy, or girl, and would also vary according to the person's profession. The number of fields depends on the requirements and can grow to 500 or more.
2. Use Case
Show a paginated list of persons whose monthly income is between $7,000 and $10,000. The user can click on any page number and the database should fetch the data for that page directly. For example, if we show 10 results per page and the user clicks on the 5th page, we should show the persons from rows 41 to 50.
Some of the values belonging to a particular group store descriptions which can be large, so they should be stored as TEXT.
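For illustration only (a hedged sketch; the person table and column names are hypothetical, not from the post), the paginated query for that use case might look like:
SELECT person_id, name, monthly_income
FROM person
WHERE monthly_income BETWEEN 7000 AND 10000
ORDER BY person_id
LIMIT 10 OFFSET 40;  -- page 5 when showing 10 results per page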
3. Possible Designs
I can create a separate table for each different type and store the data in the respective fields. The problem I see with this approach is that a MySQL table has a maximum row size limit of 65,535 bytes. Storing all data horizontally this way might exceed that limit, since the number of fields is not fixed and can change with the requirements.
Instead of storing data horizontally, I can store it vertically using an Entity-Attribute-Value design (key-value pairs). For now, the increase in the number of rows due to this design is not a problem. Using this, I can store the data of all males, females, or children in the same table. But the problems with this approach are:
I will lose the datatype of certain important fields. I cannot query and get the list of persons whose income is more than 1000.
To store the data of all fields in a single value column, I need to make it varchar. But some fields store large data which requires TEXT as the type.
Considering the above problems, I thought that instead of creating only one value field, I would create multiple value fields like value_int, value_varchar, value_date, or value_text.
[Image: proposed DB structure]
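A minimal sketch of that typed EAV layout, assuming illustrative table and column names (not taken from the post):
CREATE TABLE person_attribute (
    person_id     INT NOT NULL,
    attribute_key VARCHAR(64) NOT NULL,
    value_int     INT NULL,            -- used when the attribute is numeric
    value_varchar VARCHAR(255) NULL,   -- short strings
    value_date    DATE NULL,           -- dates, enabling BETWEEN queries
    value_text    TEXT NULL,           -- large descriptions
    PRIMARY KEY (person_id, attribute_key)
);

-- A range query can then target the typed column directly:
SELECT person_id
FROM person_attribute
WHERE attribute_key = 'monthly_income'
  AND value_int BETWEEN 7000 AND 10000;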
For this problem, I will be using MySQL and cannot change the DB due to certain restrictions, so I am looking for a design with MySQL only.
Is going with the key-value approach a good idea or not? Or is there any other possible design which could be used?
In very general terms, if you know the entities and attributes of your problem domain, and the data is relational, I'd use a relational schema (your "possible design 1"). If you actually encounter problems with the maximum row size, your problem domain might contain logical subgroupings of attributes, so you can split them into separate tables.
For instance:
Person (id, name, ...)
Person_demographics (person_id, age, location, ...)
Person_finance (person_id, income, wealth...)
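A hedged sketch of that split in SQL (column choices are illustrative, not prescribed by the answer):
CREATE TABLE Person (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE Person_finance (
    person_id      INT PRIMARY KEY,
    monthly_income DECIMAL(12,2),
    FOREIGN KEY (person_id) REFERENCES Person(id)
);

-- The income use case from the question stays a plain join:
SELECT p.id, p.name
FROM Person p
JOIN Person_finance f ON f.person_id = p.id
WHERE f.monthly_income BETWEEN 7000 AND 10000
ORDER BY p.id
LIMIT 10 OFFSET 40;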
If you don't know the entities and attributes in advance, I recommend using MySQL's JSON support or XML support. This gives you access to much better query options than EAV.
The problem with EAV-like solutions in your scenario is that any non-trivial queries end up being incredibly complicated - "find all responses where salary is between x and y, and the age is z, in locations (a, b, c)" turns into a horrible mess of SQL, but with XPath this is pretty straightforward.
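For example, with a JSON column (a sketch assuming MySQL 5.7+; table, column, and attribute names are illustrative), that kind of query stays manageable:
SELECT id
FROM response
WHERE CAST(attributes->>'$.salary' AS UNSIGNED) BETWEEN 7000 AND 10000
  AND CAST(attributes->>'$.age' AS UNSIGNED) = 30
  AND attributes->>'$.location' IN ('a', 'b', 'c');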

How to perform inexact matches on two data sets

I'm trying to compare two data sets (vendor masters) from two systems. We are moving to one system, so we want to avoid duplication. The issue is that the names, addresses, etc. could be slightly different. For example, the name might end in 'Inc' or 'Inc.', or the address could be 'St' or 'Street'. The vendor masters have been dumped to Excel, so I was thinking about pulling them into Access to compare them, but I'm not sure how to handle the inexact matches. The data fields I need to compare are: name, address, telephone number, federal tax ID (if populated), contact name.
Here is how I would proceed. You will rarely get answers like this on Stack Exchange, since your question is not focused enough. This is a rather generic set of steps not specific to a particular tool (i.e. database or spreadsheet). As I said in my comments, you'll need to search for specific answers (or ask new questions) about the particular tools you use as you go. Without knowing all the details, Access can certainly be useful in doing some preliminary matching, but you could also utilize Excel directly or even Oracle SQL since you have it as a resource.
Back up your data.
Make a copy of your data for matching purposes.
Ensure that each record in both sets of data has a unique key (i.e. AutoNumber field or similar), so that until you have a confirmed match the records can always be separately identified.
Create new matched-key table and/or fields containing the list of matched unique key values.
Create new "matching" fields and copy your key fields into these new fields.
Scrub the data in all possible matching fields (a SQL sketch of these scrubbing and matching steps follows this list) by:
Removing periods and other punctuation
Choosing standard abbreviations and replacing all variations with the same value in all records. Example: replace "Incorporation" and "Inc." with "Inc"
Trimming excess spaces from the ends and between terms
Formatting all phone numbers exactly the same way, or better yet removing all spaces and punctuation for comparison purposes, excluding extension information: ##########
Parsing and splitting multi-term fields into separate fields. Name -> First, Middle, Last Name fields; Address -> Street number, street name, extra address info.
The parsing process itself can identify and reconcile formatting differences.
It allows easier matching on terms separately.
Etc., etc.
Once the matching fields are sufficiently scrubbed, match on the different fields.
Define matching priorities, that is, which field or fields are likely to produce reliable matches with the least amount of uncertainty.
For records containing Tax ID numbers, that seems like the most logical place to start, since an exact match on that number should be valid or can indicate mistakes in your data.
For each type of match, update the matched-key fields mentioned above
For each successive matching query, exclude records that already have a match in the matched-key table/fields.
Refine and repeat all these steps until you are satisfied that all matches have been found.
Add all non-matched records to your final merged record set.
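A rough SQL sketch of the scrubbing and first matching pass described above, assuming MySQL 8+ (for REGEXP_REPLACE) and hypothetical table/column names; in Access or Oracle the functions differ but the idea is the same:
-- Scrub the copy of system A's data (repeat for system B's copy)
UPDATE vendor_a
SET match_name  = TRIM(REPLACE(REPLACE(vendor_name, '.', ''), 'Incorporated', 'Inc')),
    match_phone = REGEXP_REPLACE(phone, '[^0-9]', '');  -- keep digits only

-- First matching pass: exact match on Tax ID,
-- skipping records already recorded in the matched-key table
SELECT a.id AS id_a, b.id AS id_b
FROM vendor_a AS a
JOIN vendor_b AS b ON a.tax_id = b.tax_id     -- NULL tax ids never match on equality
LEFT JOIN matched_keys AS mk ON mk.id_a = a.id
WHERE mk.id_a IS NULL;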
You never said how many records you have. If possible, it may be worth your organization's time to manually verify the automated matches by listing them side by side and manually tweaking them when needed.
But even if you successfully pair non-exact matches, someone still needs to decide which record to keep for the merged system. I imagine you might have matches on company name and tax ID (essentially verifying the match) but still have different addresses and/or contact names. There is no technical answer that will help you know which data to keep or discard. Once again, human review should be done to finalize the merged records. If you set this up correctly, a couple of human eyeballs could probably go through thousands of records in just a day.

Modifying a table to accept different data elements

I have created a workflow system where users can see their work in a queue, mark it complete, and create an outgoing correspondence letter. In the Access DB, they essentially go into each record which has an "Incomplete" indicator and mark items as "Complete" from the queue table. Doing this triggers an insert, which inserts the customer data for that record into a letter table. A PDF is then batched/printed based on the inserted data from this letter table.
The beginning scope of the project was a single transaction type with essentially limited data fields, such as: Name, Address, State, Zip. I am finding now that I am being asked to expand the transaction types in the queue. Different letters, however, require different fields, and not all fields are readily available for each letter (some letters need and have an account issued date, others do not, etc.).
My question pertains to the methodology of "expanding" the fields available for the letter insert. I need to add more fields to the letter table for the new transaction types. But these fields are transaction specific. For example, I may have some kind of audit letter that does not use account premiums, while another transaction uses only account premiums. All my letters in the table will share that field if I create it, regardless of whether they actually use it or not. If that is the case, how should I deal with the nulls? Should I still create the field aptly named (such as "AccountPremium"), or should I use generic names like "extraField1" and just map it to the placeholder in the letter?
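Purely to illustrate the two options being weighed (generic SQL; the names are hypothetical and Access data types would differ):
-- Option A: aptly named, nullable, transaction-specific columns
ALTER TABLE letter ADD COLUMN AccountPremium DECIMAL(12,2) NULL;
ALTER TABLE letter ADD COLUMN AccountIssuedDate DATE NULL;

-- Option B: generic columns, mapped to letter placeholders elsewhere
ALTER TABLE letter ADD COLUMN extraField1 VARCHAR(255) NULL;
ALTER TABLE letter ADD COLUMN extraField2 VARCHAR(255) NULL;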

How to find a list of rows that contain any of the given INTs in a column that contains INTs as comma separated values

I have a case where we are maintaining a table containing resources. This table has a varchar column that contains role ids as comma-separated values (I know normalizing SHOULD have been the way to go, but I can't change a long-running working system). E.g. the role_ids column contains '1,4,6,9,10' in one row and '5,10,15' in another.
Then, for a user in the system, I have the associated role ids as a list, e.g. 4,15. Now I need to find 'any in many', i.e. any resource that has any of the user's role ids present in its role_ids column.
This question is similar to this one, but here the solution is not expected in Grails.
I'm looking for a MySQL solution - either a query or a stored procedure. Finding a set of resources could be achieved using FIND_IN_SET(), but I don't want to make multiple calls to the DB, one for each role id in the user's list.
Use a function like this one to turn your lists into individual records, then join everything up normally.
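A minimal sketch of that idea as a single MySQL query (table and column names follow the question; the derived table of the user's role ids would be built dynamically by the application):
SELECT DISTINCT r.*
FROM resource AS r
JOIN (
    SELECT 4 AS role_id      -- the user's role ids, e.g. 4 and 15
    UNION ALL
    SELECT 15
) AS user_roles
  ON FIND_IN_SET(user_roles.role_id, r.role_ids) > 0;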

How can I select MySQL fulltext index entries for autocomplete functionality?

Is there a way to select entries from fulltext index in MySQL?
No, not that I know of. It would be a great feature though.
I built a search interface with autocomplete on top of MySQL. I run a daily job that scans all columns in all tables that I want to search in, extract words with regular expressions, then store the words in a separate table. I also have a many-to-many table with one column to hold the id of the object, and one column to hold the id of the word so as to record the fact that "word is part of text belonging to object".
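A sketch of the two supporting tables that description implies (the names match the query below; the exact column definitions are my assumption):
CREATE TABLE word (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    word VARCHAR(64) NOT NULL UNIQUE    -- one row per extracted word
);

CREATE TABLE obj_word (
    obj_id  INT NOT NULL,               -- id of the object the word belongs to
    word_id INT NOT NULL,               -- id of the extracted word
    PRIMARY KEY (obj_id, word_id)
);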
The autocomplete works by taking the words typed into the box, and then generating a query that goes like:
SELECT obj.title
FROM obj_word
INNER JOIN obj
ON obj_word.obj_id = obj.id
INNER JOIN word
ON obj_word.word_id = word.id
WHERE word.word IN ('word1', 'word2', 'word3') -- generated dynamically, word1 etc are typed by the user
GROUP BY obj.id
HAVING COUNT(DISTINCT word.id) = 3 -- the 3 is generated, because user typed 3 words.
This works fairly well for me, but I don't have huge amounts of data to work with.
(The actual implementation is slightly fancier, because the last word is matched with LIKE to allow partial matches.)
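A sketch of that fancier variant, assuming the user has typed two complete words plus the prefix 'wor':
SELECT obj.title
FROM obj_word
INNER JOIN obj  ON obj_word.obj_id  = obj.id
INNER JOIN word ON obj_word.word_id = word.id
WHERE word.word IN ('word1', 'word2')   -- complete words typed so far
   OR word.word LIKE 'wor%'             -- the partially typed last word
GROUP BY obj.id
HAVING COUNT(DISTINCT CASE WHEN word.word IN ('word1', 'word2') THEN word.id END) = 2
   AND COUNT(DISTINCT CASE WHEN word.word LIKE 'wor%' THEN word.id END) >= 1;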
EDIT:
I just learned that the myisam_ftdump utility can be used to extract a list of words from the index file. The command line goes something like this:
myisam_ftdump -d film_text 1 > D:\tmp\out.txt
Here, -d means dump (get a list of all entries), film_text is the name of a MyISAM table with a full-text index, and 1 is the ordinal number identifying which index you want to dump.
I must say, the utility works, but I am not sure it is fast enough to pull a live list for autocompletion. You could of course have a periodic job that runs the command and dumps the output to a file. Unfortunately, this dumps index entries, not individual, unique words.
My hunch is that you could use this utility as a means to extract the words, but the output will need processing to turn it into a proper autocomplete list.