MySQL: Where to find sample database with large text content?

For a college assignment I need to run an experiment related to MySQL. I have chosen to compare the performance of LIKE to MATCH AGAINST in the context of searching text fields.
The MySQL sample "Employees" DB was recommended to us, but it doesn't really have any text fields with long text values. I'm of the opinion that I'd need something along the lines of a paragraph of text for each record to give decent results. Something like a paragraph about each employee's background would be ideal. But there are approximately 300k employees in that database.
Also, I'm guessing that the text values would need to be pretty distinct for each employee; I couldn't just reuse the same few for all of them.
Am I right in my assumptions, and if so, are there any other sample databases out there that would suit me?
Any ideas?
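For reference, the two query shapes I'd be timing look something like this (the bio column is hypothetical; nothing like it exists in the Employees DB):
-- LIKE with a leading wildcard has to scan every row's text
SELECT emp_no FROM employees WHERE bio LIKE '%database design%';
-- MATCH AGAINST needs a FULLTEXT index first, e.g.:
-- ALTER TABLE employees ADD FULLTEXT INDEX ft_bio (bio);
SELECT emp_no FROM employees WHERE MATCH(bio) AGAINST('database design');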

You might try any of the following links
Library of free data models
Wikimedia dumps in XML and SQL form
MySQL documentation #1 #2

You might want to look at this previous post. One of those datasets HAS to be in MySQL; if not, you can use www.talend.com to load it into your application.
Where can I find sample databases with common formatted data that I can use in multiple database engines?

Related

Column Captions in MySQL (in addition to column names)

In MS Access (2003), a column can have both a "name" (required obviously) and a "caption" (optional). It can also have a "description" (optional) and an "input mask".
The main advantage is that the column name can be code friendly (e.g. lower case with underscores), while the caption can be user friendly (e.g. title case / proper case, with spaces, or with completely different words).
If a column has a caption, then it is used by default in forms.
[Q1] Can this be achieved in MySQL? I'm most interested in the captions, but would also welcome advice regarding column-linked descriptions and input masks. I haven't searched regarding input masks yet.
I'd prefer to use an approach which can be implemented in almost any SQL database. So, the less proprietary, less hacky, and more "standard SQL" (sic) the approach, the better.
I note that, in MySQL, a column can have a "comment". I could use these, and use code to extract them when required. But (1) this is inconvenient and hacky, and (2) it seems to me that the comment should be used for another purpose (e.g. notes to self or other developers, or advice regarding data entry).
I'm currently using phpMyAdmin (via XAMPP), but I used MySQL Workbench previously.
I can't find any information about this on either of those sites or on Stack Overflow.
These keywords and links are related to the topic, but don't answer the question:
COLUMN_NAME
COLUMN_COMMENT
MS Access: setting table column Caption or Description in DDL?
MySQL query to get column names?
MySQL 5.7 Reference Manual :: 19 INFORMATION_SCHEMA Tables
Commenting your MySQL database... (yada yada)
Thanks in advance.
@Jim Garrison
Thanks for replying. I didn't vote for that answer because it's not doing a lot of what Access does. Not your fault though - you're just the messenger! I'm not sure how one should thank someone for their effort in Stack Overflow, since the comment area says not to use it for that purpose. Cheers anyway for the information.
I had an idea, although it too is far from ideal.
I could create a separate table to hold supplemental metadata (captions/aliases, descriptions, input masks etc) about all other tables in the database.
Its structure would be:
id | table_name | column_name | friendly_alias | description | input_mask
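A minimal sketch of that table (the names and types here are just my assumption):
CREATE TABLE column_metadata (
  id INT AUTO_INCREMENT PRIMARY KEY,
  table_name VARCHAR(64) NOT NULL,   -- 64 = MySQL's identifier length limit
  column_name VARCHAR(64) NOT NULL,
  friendly_alias VARCHAR(255) NULL,  -- the optional fields allow NULL
  description TEXT NULL,
  input_mask VARCHAR(255) NULL,
  UNIQUE KEY (table_name, column_name)
);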
Unfortunately, it would be a chore to maintain, unless there were a way of automatically populating it with table names and column names, and updating those values automatically if columns were renamed, added or deleted.
It really would be much better if it were an extension of the built-in information schema table(s), though, with null values allowed for those optional fields.
Could it be a separate table, but with a 1:1 relationship with the built-in information schema table(s)?
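On the populating question: I suspect it could be filled automatically from the built-in information schema with something like this (assuming the column_metadata sketch above, whose UNIQUE key makes re-runs safe):
-- INSERT IGNORE skips (table, column) pairs already tracked,
-- thanks to the UNIQUE key; the optional fields stay NULL
INSERT IGNORE INTO column_metadata (table_name, column_name)
SELECT TABLE_NAME, COLUMN_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_database';  -- hypothetical schema name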
Bear in mind that I'm an amateur enthusiast by the way!
There's nothing built in that is easily usable without metadata queries. However, you can use column aliases and name-quoting to get whatever name you want, as in:
select column_1 as `Date of Birth`,
       column_2 as `Home Address`,
       etc.
MySQL does allow comments, as you've noted, but you can't use them directly for queries. phpMyAdmin does show them, so if you're using phpMyAdmin that is the best solution/workaround.
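For completeness, setting and reading a column comment looks like this (the table and column names are made up):
-- attach a human-friendly caption as the column comment
ALTER TABLE customers
  MODIFY COLUMN dob DATE COMMENT 'Date of Birth';
-- read it back via the information schema
SELECT COLUMN_NAME, COLUMN_COMMENT
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_database'
  AND TABLE_NAME = 'customers';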
However, I think you're overcomplicating things. This isn't the same as Access, in that forms and labels are not automatically created as part of MySQL when you create a database/table. You've got to write code (in whatever programming language you wish) to interact with that database.
When you're writing code, especially standard SQL, you don't want to refer to column names like "address of the buyer"; you'll want "address", because that's how programming works and we programmers don't want to type the same long name again and again. And many systems choke on spaces. So in your application you can display to the user "Please enter here the permanent shipping address of the buyer using the standard address scheme of their home country so that it's accepted by the postal service with minimal hassle", but there's no way you'd use that as a variable name.
I hope this makes sense and isn't insulting; I'm just trying to demonstrate that your column names don't really correspond to anything the user sees. Access is a bit different because the program tries to make it easy for you to create a database structure and then quickly edit the form a user will use to insert or modify data, so it makes a bit more sense for it to offer a caption it uses whenever referring to that field.
Anyway, you asked about keeping mostly to standard SQL, and the concept of referring to a column by a caption is nonstandard, so in the interest of being able to implement this in any database, I'd suggest using the plain column name in your queries.
Hope this helps a bit.

How to turn a huge live database into a small testing database?

I'm currently developing an API for a company that didn't do a very good job of maintaining a good test database with test data. The MySQL database structure is quite big and complicated, and the live database is around 160-200GB.
And because I'm quite lazy and don't want to create test data for all the tables from scratch, I was wondering what would be the best way to turn such a big database into a smaller test database that keeps all data and relationships in a correct form. Is there an easy way to do this with some kind of script that checks the database model and knows what kind of data it needs to keep or delete when reducing the database to a smaller size?
Or am I doomed and have to go through the tedious task of creating my own test data?
Take a look at Jailer which describes itself as a "Database Subsetting and Browsing Tool". It is specifically designed to select a subset of data, following the database relationships/constraints to include all related rows from linked tables. To limit the amount of data you export, you can set a WHERE clause on the table you are exporting.
The issue of scrubbing your test data to remove customer data is still there, but this will be easier once you have a smaller subset to work with.
In addition to Liath's recommendation:
Maybe it's a hard way, but you can just export your schema (no data) and then write a stored procedure that iterates over your (original) tables and runs a simple:
insert into dest_table (fields) (select * from origin_table where (`external keys already inserted`) limit 100)
or something like that.
Thanks to @Liath: by "external keys already inserted" I mean you have to filter to ensure that every foreign key of the table already exists in your test database. So you also need to iterate over your tables in dependency order, parents before children.
Another way is to export your data and edit the sql.dump file to remove the unwanted data (a really hard way).
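To sketch that concretely (all database, table and column names here are invented for the example):
-- parent table first; LIMIT keeps the subset small
INSERT INTO test_db.customers
SELECT * FROM live_db.customers
LIMIT 100;
-- child table next: the IN filter is the "external keys already
-- inserted" check, so no orphaned rows end up in the test database
INSERT INTO test_db.orders
SELECT o.*
FROM live_db.orders o
WHERE o.customer_id IN (SELECT id FROM test_db.customers)
LIMIT 100;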
I would suggest that no matter how thorough you are, the risk of getting live customer details into a test database is too high. What happens if you accidentally email or charge a real customer for something you're testing!?
There are a number of products out there such as RedGate's Data Generator which will create test data for you based on your schema (there is a free trial I believe so you can check it meets your needs before committing).
Your other alternative? Hire a temp to enter data all day!
ETA: Sorry - I just saw you're looking more at MySQL rather than MSSQL, which probably rules out the tool I recommended. A quick Google search produces similar results.

sql command or Dynamic programming?

Suppose I have 1 GB of data in my database. I want to do something like this:
If user searches for a sentence, say 'Hello world I am here.', the program should be able to return the data (rows) where this exact string is found and also the rows which have similar texts e.g., 'Hello world is a famous string, I am sure!'.
My question is: which one will be more efficient, an SQL command or a dynamic programming approach?
If SQL is more efficient, what is the command that can be used to do this?
I am using MySQL 5.6.
You want to use the full text capabilities of MySQL, which are documented here.
Basically, the data structure that you need is an inverted index. For each word, this contains the positions of the word in all the documents. With this information, you can start to piece things together.
In most cases, you are much better off doing this using established software, rather than writing your own. I don't want to stop you, if you really want to, but the problem may be harder than you think.
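To give a flavor of it, the MySQL version looks roughly like this (articles and body are placeholder names):
-- one-time setup: a full-text index on the searched column
-- (InnoDB supports this as of MySQL 5.6)
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);
-- natural-language mode finds the exact sentence as well as
-- similar text, ranked by relevance
SELECT id,
       MATCH(body) AGAINST('Hello world I am here.') AS score
FROM articles
WHERE MATCH(body) AGAINST('Hello world I am here.')
ORDER BY score DESC;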

Building a MTurk-like app -- how to use a db when column names change for each task?

I'm building a very simple MTurk-ish app in Rails. The idea is that people will upload csvs containing whatever columns they want (e.g., some id, name of a user, some piece of text, a link, whatever -- these columns will change from task to task), and these csvs will contain all the information for the MTurk task.
My question is: how would I store these csvs in a database? One way is to store each csv row as a blob of unstructured data in MySQL (i.e., I basically leave each row as a string and stick it into a MySQL column). Maybe a better way is to use a NoSQL database like MongoDB, where I don't need a predefined schema.
Suggestions? Which way is better, or is there another option? I am using Rails for this, so options that work well with Rails would be great.
Well, you pretty much answered your own question.
Either use a NoSQL document-based database (like MongoDB), or split up the csv and save it in a 1:n relationship within your database, as key-value pairs attached to a row and a column each. Your idea to store blobs isn't ideal, however, as it would prevent you from searching within the columns.
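If you go the relational route, a rough sketch of that key-value layout (all names here are my own invention):
-- one row per uploaded csv
CREATE TABLE uploads (
  id INT AUTO_INCREMENT PRIMARY KEY,
  filename VARCHAR(255) NOT NULL
);
-- one row per cell: (upload, csv row, column name) -> value
CREATE TABLE cells (
  upload_id INT NOT NULL,
  row_num INT NOT NULL,
  column_name VARCHAR(255) NOT NULL,
  cell_value TEXT,
  PRIMARY KEY (upload_id, row_num, column_name),
  FOREIGN KEY (upload_id) REFERENCES uploads(id)
);
-- searching within one logical column is then a plain filter
SELECT row_num, cell_value
FROM cells
WHERE upload_id = 1 AND column_name = 'name';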

Search Short Fields Using Solr, Etc. or Use Straight-Forward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySQL table [word, entity_id], have word indexed, and then query using
select entity_id from search_index where word like '[query_word]%'
This obviously requires me to break down each title into its words and add a row for each word (see the sketch below).
Use Solr or some similar search engine, which from my reading are more oriented towards full text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future?
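For clarity, option 1 would look roughly like this (naming is my own):
CREATE TABLE search_index (
  word VARCHAR(100) NOT NULL,
  entity_id INT NOT NULL,
  PRIMARY KEY (word, entity_id)
);
-- the left-anchored LIKE can use the index on word
SELECT entity_id FROM search_index WHERE word LIKE 'quer%';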
Thank you!
Pros of a Database-Only Solution:
Less set up and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (if you periodically sync Solr with your database) or maintenance procedure (if you opt to add/update entries in Solr in real time everywhere you insert them into the database)
Pros of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Setup and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think what you're getting at is having some kind of logic around word search; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure you would need to create an entry for every word in your database, versus just doing a '%[query_word]%' search on the titles directly. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database performance.
I can tell you we use Solr on our site and we love the performance, and we use it even for very simple lookups. However, one thing we are missing is a way to combine Solr data with database data, and there is extra maintenance. At the end of the day there is no easy answer.