I have a table which stores admin login requests.
-- desc AdminLogins
+----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+----------------+
| id | int(10) | NO | PRI | NULL | auto_increment |
| adminID | mediumint(9) | YES | MUL | NULL | |
| loginTimestamp | datetime | YES | | NULL | |
| browser | varchar(255) | YES | | NULL | |
+----------------+--------------+------+-----+---------+----------------+
The browser field contains the user agent. I want to produce some graphs showing each user agent's popularity over the last six months, but I'm stuck with my query.
So far I have:
select distinct browser, left(loginTimestamp, 7) from AdminLogins group by left(loginTimestamp, 7);
But it isn't returning what I'm after.
Ideally I'd be grouping by the first 7 characters of the timestamp string (i.e. the year and month), and seeing the distinct user agents for each period.
select date_format(loginTimestamp, '%Y-%m') as loginmonth,
group_concat(distinct browser) as useragents
from AdminLogins
group by loginmonth
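If by "popularity" you actually want counts you can graph, a variant along these lines returns one row per user agent per calendar month (standard MySQL; the 6-month cutoff is just illustrative):
select date_format(loginTimestamp, '%Y-%m') as loginmonth,
browser,
count(*) as logins
from AdminLogins
where loginTimestamp >= now() - interval 6 month
group by loginmonth, browser
order by loginmonth, logins desc;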
This is only a partial answer, and doesn't provide the SQL you need, but does provide information that is going to be necessary for you at some point soon.
User agents are complicated, and may contain lots of detail - you can sometimes identify not just the browser, but the OS, the browser version, and sometimes random crap you'll probably never care about, like what version of the .NET framework they have installed.
If you're just dumping the complete user agent string into a MySQL table without doing any kind of processing on it beforehand, you will NOT be able to sanely extract human-meaningful information - much less information that can be reasonably used to form pretty graphs - from that table using SQL alone.
Instead, pull out all the user agents for the time period you're interested in and do your processing using a programming language, instead of SQL. If you're using PHP, you'll want to get an up-to-date browscap.ini file and use the get_browser function to parse the user agent.
You may want to consider restructuring your existing tracking code to call get_browser on user agents when you record them, and record all of the details you care about (e.g. browser, OS, major browser version number, minor browser version number) in separate columns. Then it'll be possible in future to extract useful information using just SQL.
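A sketch of what that restructuring might look like (the column names here are illustrative; the values would come from get_browser's output at login time):
ALTER TABLE AdminLogins
ADD browser_name varchar(50) NULL,
ADD browser_version varchar(20) NULL,
ADD os_name varchar(50) NULL;
-- after which monthly popularity per browser is a simple aggregate:
-- select left(loginTimestamp, 7) as loginmonth, browser_name, count(*)
-- from AdminLogins group by loginmonth, browser_name;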
The Problem
I landed a small gig to develop an online quoting system for an electronics distributor. He has roughly half a million parts - one little screw is considered a part, one little LED, etc. So there are a LOT of parts.
One Important Note: This is only an RFQ (Request for Quote). There are no prices client-side, or totals, or anything to do with money - just collecting a list of part numbers to send to my client.
I had to collect the part data from multiple sources (vendor website, scanned paper catalog, Excel spreadsheets, CSV files, and even a few JSON files). It was exhausting, but I got it done.
Results
Confusing at first. I had dozens of product categories, and some products had many attributes that weren't common to any other products. I could see this project getting very complicated, and given that I bid this job at $900 even, I had to simplify it somehow.
This is what I came up with, and received client approval.
Current Columns
+--------------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+--------------+------+-----+---------+-------+
| Datasheets | varchar(128) | YES | | NULL | |
| Image | varchar(85) | YES | | NULL | |
| DigiKey_Part_Number | varchar(46) | YES | | NULL | |
| Manufacturer_Part_Number | varchar(47) | YES | | NULL | |
| Manufacturer | varchar(49) | YES | | NULL | |
| Description | varchar(34) | YES | | NULL | |
| Quantity_Available | int(11) | YES | | NULL | |
| Minimum_Quantity | int(11) | YES | | NULL | |
+--------------------------+--------------+------+-----+---------+-------+
So all products will fit a single page template.
Autocomplete Off The Table?
Early on in the design, I implemented a nice autocomplete feature.
BUT... given the number of products in the table, is this even practical anymore?
FINAL PRODUCT COUNT: 223,347
What changes do I need to make to the PRODUCTS table so that querying it will not take forever?
These are the only queries the app will be making (not sure if this info will help with your advice)...
Get all products by category:
Select * from products where category = 'semiconductors'
Get single product:
Select * from products where Manufacturer_Part_Number = '12345'
Get product count by category:
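Something like this, I imagine:
Select category, count(*) from products group by category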
I think those three actually cover everything I need to do. Maybe a couple more, but not many.
In closing...
Is there a way to "index" this table with 223,000 records so that searching by one or more columns can be done efficiently?
I am very new to database design; I know I need to index SOMETHING, but ... WHAT???
Thank you for taking the time to look at this post.
Regards,
John
Listing the queries is essential to answering your question; thanks for including them.
INDEX(category)
INDEX(Manufacturer_Part_Number)
But I suggest your second query should include Manufacturer, too. Then this would be better:
INDEX(Manufacturer, Manufacturer_Part_Number)
Everything NULL? Seems unlikely.
(I've done jobs like yours; I can't imagine bidding only $900 for all that scraping.)
What will you do when there are a thousand items in a single category or manufacturer? A UI with a thousand-item list sucks.
For how to handle "so many attributes", I recommend http://mysql.rjweb.org/doc.php/eav (I should charge you $899 for the research that went into that document. Just kidding.)
Don't they need other lookups, like "Flash drive", which needs to match "FLASH DRV"?
223K rows -- no problem. The VARCHARs seem to be too short; were they based on the data?
And the table needs a PRIMARY KEY.
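Putting that together, a sketch of the DDL (assuming a category column exists, as your first query implies, and using a synthetic auto-increment id as the primary key):
ALTER TABLE products
ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
ADD INDEX (category),
ADD INDEX (Manufacturer, Manufacturer_Part_Number);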
We have a scenario where we want to store the 'status' (say, of a 'user').
We want to impose a restriction on the allowed values of the 'user' status.
So we considered two alternatives:
Have 'status' as a column of enumeration type in the 'user' table
Have a separate table for 'status' and populate with the allowed values during DB initialisation and have it as foreign key in the 'user' table.
Can you suggest which is the better approach, and why?
I'd appreciate any references on what the best practice is.
ENUM is less preferred; do a separate table for statuses. With a separate table it is easy to change or add statuses, to add related data (just add a new field in your status table if you ever need it in the future), and to get a list of the distinct statuses. You also have the option to make the status field in the main table NULL, or to give it some other value by default. And you can reuse the statuses in other tables.
If you have only 2 statuses, say 'active' and 'inactive', just use a BOOL (or TINYINT) field type in the main table.
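A minimal sketch of the lookup-table approach (table and column names are illustrative; 'suspended' is included just to show how easily values can be added):
CREATE TABLE status (
  id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(20) NOT NULL UNIQUE
);
INSERT INTO status (name) VALUES ('active'), ('inactive'), ('suspended');
CREATE TABLE user (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  status_id TINYINT UNSIGNED NULL,
  FOREIGN KEY (status_id) REFERENCES status (id)
);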
(There are many Q&A debating ENUM vs TINYINT UNSIGNED vs VARCHAR(..).)
If the set of options is not likely to change often, then I vote for ENUM.
Acts and feels like a human-readable string.
1 byte. (I would not make an enum with 256+ options; not even more than, say, a dozen.)
I would consider starting with option "unknown" instead of making the column nullable. This is a crude way to deal with spelling errors in input.
BOOL:
may have some hiccups; I avoid it.
In the grand scheme of things, it usually does not save enough space to matter.
I would consider using SET or *INT for a large number of boolean flags.
Any column (enum/tinyint/bool) with poor cardinality will not be useful alone in an index such as INDEX(status). OTOH, a composite index may be useful, such as INDEX(status, create_date).
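For comparison, a sketch of the ENUM option with the "unknown" starting value suggested above (names illustrative):
CREATE TABLE user (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  status ENUM('unknown', 'active', 'inactive') NOT NULL DEFAULT 'unknown'
);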
ENUM examples
Some enums I have encountered that have more than 2 options; judge for yourself whether they are good or bad:
| Database | Column | ENUM |
| mysql | sql_data_access | enum('CONTAINS_SQL','NO_SQL','READS_SQL_DATA','MODIFIES_SQL_DATA') |
| mysql | interval_field | enum('YEAR','QUARTER','MONTH','DAY','HOUR','MINUTE','WEEK','SECOND |
| mysql | ssl_type | enum('','ANY','X509','SPECIFIED') |
| performance_schema | TIMER_NAME | enum('CYCLE','NANOSECOND','MICROSECOND','MILLISECOND','TICK') |
| common_schema | hint_type | enum('step_into','step_over','step_out','run') |
| common_schema | statement_type | enum('sql','script','script,sql','unknown') |
| mworld | Continent | enum('Asia','Europe','North America','Africa','Oceania','Antarctic |
| try | priority | enum('LOW','NORMAL','HIGH','UBER') |
| alerts | Stage | enum('DISCOVER','NOTIFY','ACK','CLEAR') |
| todo | stage | enum('unk','load','priming','running','stopping') |
| zip | z_type | enum('STANDARD','UNIQUE','','PO BOX ONLY','Community Post Office', |
I've encountered a strange problem when using PuTTY to run the following MySQL query: select * from gts_camera
The output is extremely weird: PuTTY prints loads of "PuTTYPuTTYPuTTY..." over and over.
Maybe it's because of the table's structure:
mysql> describe gts_camera;
+---------+----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+----------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| datum | datetime | YES | | CURRENT_TIMESTAMP | |
| picture | longblob | YES | | NULL | |
+---------+----------+------+-----+-------------------+----------------+
This table stores some big pictures and their dates of creation.
(The weird ASCII characters in the output are the picture contents.)
Does anybody know why PuTTY outputs such strange stuff, and how to solve/clean this?
After it happens I can't type any other commands; I have to reopen the session.
Sincerely,
Michael.
This happens because of the contents of the column (you have a column defined as longblob). The raw binary data contains control characters that PuTTY interprets; in particular, each ENQ byte (0x05) triggers PuTTY's answerback feature, and the default answerback string is "PuTTY" - which is exactly the text you see repeated.
There is a configuration option that may help: in PuTTY's settings, under Terminal, you can clear the "Answerback to ^E" string.
You can also simply not select every column in that table (at least not the *blob ones):
select id, datum from gts_camera;
Or, if you still want to see the contents, use the MySQL function HEX:
select id, datum, HEX(picture) as pic from gts_camera;
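Note that HEX() on a multi-megabyte blob will still flood the terminal; to just peek at the first few bytes, wrap it in LEFT (which works on blobs too):
select id, datum, HEX(LEFT(picture, 16)) as pic_head from gts_camera;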
I am using MySQL as temporary storage for the results of one billion or more computations, which are calculated by processes executing in parallel.
Each result is calculated using a function [f] on the representations [r1] and [r2] of objects identified respectively by [o1] and [o2].
Currently, I use three tables to execute this process:
(1) A table mapping object identifiers to their representations:
mysql> describe v2_3282_fp;
+----------------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------+------+-----+---------+-------+
| objid | text | YES | | NULL | |
| representation | text | YES | | NULL | |
+----------------+------+------+-----+---------+-------+
(2) A table holding jobs that each compute process should retrieve and calculate:
mysql> describe v2_3282_job;
+----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+----------------+
| jobid | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| workerid | int(11) | YES | | NULL | |
| pairid1 | text | YES | | NULL | |
| pairid2 | text | YES | | NULL | |
+----------+---------------------+------+-----+---------+----------------+
(3) A table holding the results of compute jobs:
mysql> describe v2_3282_res;
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| resultid | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| pairid1 | text | YES | | NULL | |
| pairid2 | text | YES | | NULL | |
| pairscore | double(36,18) | YES | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
(The pairscore type is dynamically determined during execution, and is not fixed at (36,18).)
Once the representations have been registered, one process continually scans the result table for new results to transfer to an object existing in memory, and the remaining processes retrieve jobs to compute until they receive a job with a pair of identifiers signalling the end of computation.
During unit tests with 1,000,000 or so computations, this system works just fine.
However, as demands on this system have grown to 1,000,000,000+ computations, I see that it eventually gets bogged down swapping back and forth between memory and disk.
When I check the system memory and swap space, system memory is completely used, but typically less than 20% of swap is in use.
I have read that MySQL performance is best when entire tables can be read into memory, and resorting to disk I/O is the major bottleneck.
This seems to be the case for me as well, as running computations on my systems with 12 GB and 16 GB of RAM eventually requires more and more time between worker process cycles, though my lone system with 64 GB never seems to encounter this issue.
While the straightforward answer is, "Hey buddy, buy more RAM.", I think there is a more fundamental design issue causing my system to degrade as demands grow. I know that MySQL is a well-engineered, widely used product, and that database and table design decisions can greatly impact performance.
So without resorting to the brute-force solution of buying more memory, I am looking for suggestions on how to improve the engineering of the MySQL table design I came up with.
While I know the basics of MySQL table normalization and can create queries to implement my needs, I do not know much about each type of database engine, the details of indexing, and other database-specific design considerations.
The questions I have are:
(1) Would performance be any different if I split the result and job tables into smaller tables instead of single large tables? (I think not.)
(2) I currently issue a LIMIT clause programmatically to retrieve a fixed number of results in each retrieval cycle. However, I don't know if this can be further optimized over the simple "SELECT ... FROM [result table] LIMIT start, size". (I think so; see the sketch after this list.)
(3) Does it make sense to tell the worker processes to sleep between cycles in order to let MySQL "catch up"? (I think not.)
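For (2), here is the sort of thing I suspect would help, assuming the auto-increment resultid primary key can be used for seeking: "LIMIT start, size" has to scan and discard start rows on every cycle, so retrievals get slower as the table grows, whereas remembering the last resultid seen and seeking past it stays cheap.
SELECT resultid, pairid1, pairid2, pairscore
FROM v2_3282_res
WHERE resultid > @last_seen_id -- last id recorded from the previous batch
ORDER BY resultid
LIMIT 10000;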
My appreciation in advance for any advice from those experienced in database and table design.
I'm attempting to design a small database for a customer. My customer has an organization that works with public and private schools; for every school that's involved, there's an implementation (a chapter) at each school.
To design this, I've put together two tables; one for schools and one for chapters. I'm not sure, however, if I should merge the two together. The tables are as follows:
mysql> describe chapters;
+--------------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| school_id | int(10) unsigned | NO | MUL | | |
| is_active | tinyint(1) | NO | | 1 | |
| registration_date | date | YES | | NULL | |
| state_registration | varchar(10) | YES | | NULL | |
| renewal_date | date | YES | | NULL | |
| population | int(10) unsigned | YES | | NULL | |
+--------------------+------------------+------+-----+---------+----------------+
7 rows in set (0.01 sec)
mysql> describe schools;
+----------------------+------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+------------------------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| full_name | varchar(255) | NO | MUL | | |
| classification | enum('high','middle','elementary') | NO | | | |
| address | varchar(255) | NO | | | |
| city | varchar(40) | NO | | | |
| state | char(2) | NO | | | |
| zip | int(5) unsigned | NO | | | |
| principal_first_name | varchar(20) | YES | | NULL | |
| principal_last_name | varchar(20) | YES | | NULL | |
| principal_email | varchar(20) | YES | | NULL | |
| website | varchar(20) | YES | | NULL | |
| population | int(10) unsigned | YES | | NULL | |
+----------------------+------------------------------------+------+-----+---------+----------------+
12 rows in set (0.01 sec)
(Note that these tables are incomplete - I haven't implemented foreign keys yet. Also, please ignore the varchar sizes for some of the fields, they'll be changing.)
So far, the pros of keeping them separate are:
- Separate queries of schools and chapters are easier. I don't know if it's necessary at the moment, but it's nice to be able to do.
- I can make a chapter inactive without directly affecting the school information.
- General separation of data - the fields in "chapters" are directly related to the chapter itself, not the school in which it exists. (I like the organization - it makes more sense to me. Also follows the "nothing but the key" mantra.)
- If possible, we can collect school data without having a chapter associated with it, which may make sense if we eventually want people to select a school and autopopulate the data.
And the cons:
- Separate IDs for schools and chapters. As far as I know, there will only ever be a one-to-one relationship between the two, so doing this might introduce more complexity that could lead to errors down the line (like importing data from a spreadsheet, which is unfortunately something I'll be doing a lot of).
- If there's a one-to-one ratio, and the IDs are auto_increment fields, I'm guessing that the chapter_id and school_id will end up being the same - so why not just put them in a single table?
- From what I understand, the chapters aren't really identifiable on their own - they're bound to a school, and as such should be a subset of a school. Should they really be separate objects in a table?
Right now, I'm leaning towards keeping them as two separate tables; it seems as though the pros outweigh the cons, but I want to make sure that I'm not creating a situation that could cause problems down the line. I've been in touch with my customer and I'm trying to get more details about the data they store and what they want to do with it, which I think will really help. However, I'd like some opinions from the well-informed folks on here; is there anything I haven't thought of? The bottom line here is just that I want to do things right the first time around.
I think they should be kept separate. But, you can make the chapter a subtype of a school (and the school the supertype) and use the same ID. Elsewhere in the database where you use SchoolID you mean the school and where you use ChapterID you mean the chapter.
CREATE TABLE School (
  SchoolID int unsigned NOT NULL AUTO_INCREMENT,
  CONSTRAINT PK_School PRIMARY KEY (SchoolID)
);
CREATE TABLE Chapter (
  ChapterID int unsigned NOT NULL,
  CONSTRAINT PK_Chapter PRIMARY KEY (ChapterID),
  CONSTRAINT FK_Chapter_School FOREIGN KEY (ChapterID) REFERENCES School (SchoolID)
);
Now you can't have a chapter unless there's a school first. If such a time occurred that you had to allow multiple chapters per school, you would recreate the Chapter table with ChapterID as an identity/auto-increment column, add a SchoolID column populated with the same values, put the foreign key on SchoolID referencing School, and continue as before, only inserting the ID into SchoolID instead of ChapterID. MySQL does support inserting explicit values into an auto_increment column, so making ChapterID auto_increment ahead of time could save you trouble later.
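For example, a lookup against this structure (names as defined above) returns one row per school, with the chapter column NULL until a chapter exists:
SELECT s.SchoolID, c.ChapterID
FROM School s
LEFT JOIN Chapter c ON c.ChapterID = s.SchoolID;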
Additional benefits of keeping them separate:
You can make foreign key relationships directly with SchoolID or ChapterID so that the data you're storing is always correct (for example, if no chapter exists yet you can't store related data for such a thing until it is created).
Querying each table separately will perform better as the rows don't contain extraneous information.
A school can be created with certain required columns, but the chapter left uncreated (temporarily). Then, when it is created, you can have some NOT NULL columns in it as well.
Keep them separate.
They may be 1-1 currently... however, these are clearly separate concepts.
Will they eventually want to input schools which do not have chapters? Perhaps as part of a sales-lead system?
Can there really be only one chapter per school, or just one active chapter? What about across time? Is it possible they will request a report with all chapters in the past 10 years at school X?
You said the links will always be 1-to-1, but does a school always have a chapter, and can it change chapters? If so, then keeping chapters separate is a good idea.
Another reason to keep them separate is if the amount of information about the two entities combined would make the records longer than the database backend can handle. One-to-one tables are often built to keep the amount of data stored in a single record down to an appropriate size.
Further, is the requirement a firm 1-1, or does it have the potential to become 1-many? If the latter, make it a separate table now. Is there the potential to have schools without chapters? Again, I'd keep them separate.
And how are you intending to query this data? If you will generally need data about both the chapter and the school in the same queries, you might put them in one table, provided you are sure there is no possibility of it turning into a 1-many relationship. However, a proper join with the join fields indexed should be fast anyway.
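For instance, with the tables you posted (chapters.school_id already shows as MUL, i.e. indexed), a combined query is as simple as the following; the state filter is just an example:
SELECT s.full_name, ch.registration_date, ch.is_active
FROM schools s
JOIN chapters ch ON ch.school_id = s.id
WHERE s.state = 'TX';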
I tend to see these as separate entities and would keep them separate unless there was a critical performance problem that led to putting them together. Having separate entities in separate tables from the start tends to be less risky than combining them. And performance would normally be perfectly acceptable as long as the indexing is correct, and may even be better if you don't normally need to query data from both tables all the time.