MySQL - how to check which items in an arbitrary list (~1,000 items) are in a table?

Here's my problem...
I need to check which items in a list of about 1,000 items (the needles) are in a fairly large table containing about 500,000 rows (the haystack).
My question is, what's the best/fastest/most efficient way to do this?
I know that I can create a SQL statement like this:
SELECT id FROM haystack WHERE id IN (ID1, ID2, ID3, ..., IDn)
(assuming ID1, ID2, ID3, ..., IDn are the needles.)
However, I'm not sure how performant or wise that is if the needles list contains 1,000+ items.
I also know that, if my needles list were in a table of its own, I could join that table to the haystack table. However, the needles list isn't already in a table.
So - I guess another possible option is to put those 1,000 items into a temporary table and then join that to the haystack table. If that's the best option - then what's the best way to quickly load 1,000 items into a temporary table? (E.g., 1,000 individual INSERT statements? Insert all rows in a single INSERT statement? Is there a limit on how long an INSERT statement can be?)
A third possible option - write the needles list to a text file, then use LOAD DATA INFILE to load that into a (temporary) table, then join the temp table to the haystack table. But, wow... that seems like a lot of overhead.
Is there another, better option?
For what it's worth, the context of this is PHP, and I'm getting the needles list from a JSON web-service response, and using MySQLi for the database interaction.

According to this benchmark, it is faster in your case to use a temporary table and the JOIN method.
I am not sure, though, that this isn't premature optimisation. You should run your own benchmark and determine whether the added complexity is worth the effort. I would recommend going with the simple IN method and only starting to optimise when you detect a performance issue.
Just remember that according to the manual:
The number of values in the IN list is only limited by the max_allowed_packet value.

I think your query SELECT id FROM haystack WHERE id IN (ID1, ID2, ID3, ..., IDn) would be fine. I have a very similar use case where I have millions of "needles" and I pass them to the IN clause in blocks of 10,000 via PDO with no issues.
I would add that the column you are checking should be indexed. In my case it is the primary key of the table.
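To make the "blocks of 10,000" idea concrete, here is a minimal sketch of a chunked, parameterized IN lookup. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL (the question uses PHP/MySQLi), and the function name find_existing_ids and the chunk size of 500 are my own assumptions, not something from the answers above.

```python
import sqlite3

def find_existing_ids(conn, needles, chunk_size=500):
    """Return the set of needle ids that are present in haystack.id.

    Hypothetical helper: the needles are sent in chunks so that no single
    statement grows without bound (MySQL: max_allowed_packet; SQLite has
    its own cap on the number of bound parameters).
    """
    needles = list(needles)
    found = set()
    for start in range(0, len(needles), chunk_size):
        chunk = needles[start:start + chunk_size]
        # One "?" placeholder per value keeps the query parameterized.
        placeholders = ",".join("?" for _ in chunk)
        sql = f"SELECT id FROM haystack WHERE id IN ({placeholders})"
        found.update(row[0] for row in conn.execute(sql, chunk))
    return found

def demo():
    # Toy haystack holding only the even ids below 5000.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE haystack (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO haystack (id) VALUES (?)",
                     [(i,) for i in range(0, 5000, 2)])
    return find_existing_ids(conn, range(995, 2005))
```

The same shape should carry over to MySQLi or PDO prepared statements; only the placeholder binding calls differ.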

If the needles are going to be used to query the haystack frequently, you absolutely want to create a new table. For this example, I'm going to assume that the needles are int values and will label them as id in the table needle.
First, you need to create the table
CREATE TABLE needle (
id INT(11) PRIMARY KEY
)
Next, you need to insert the values
INSERT INTO needle (id)
VALUES (ID1),
(ID2),
...,
(IDn)
Now, you can query haystack using a join.
SELECT h.id
FROM haystack h
JOIN needle n
ON h.id = n.id
If this is an infrequent query and the number of needles won't grow beyond the 1,000, using the IN clause won't hurt your performance greatly.
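For a one-off lookup, the same join can be done against a temporary table that is dropped afterwards. The sketch below assumes Python with sqlite3 standing in for MySQL; the helper name find_via_temp_table is hypothetical, and INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3

def find_via_temp_table(conn, needles):
    """Load the needles into a temp table, join to haystack, then clean up."""
    conn.execute("CREATE TEMP TABLE needle (id INTEGER PRIMARY KEY)")
    try:
        # executemany sends all rows through one prepared statement;
        # OR IGNORE skips duplicate needles instead of failing on the PK.
        conn.executemany("INSERT OR IGNORE INTO needle (id) VALUES (?)",
                         [(n,) for n in needles])
        rows = conn.execute(
            "SELECT h.id FROM haystack h "
            "JOIN needle n ON h.id = n.id "
            "ORDER BY h.id"
        ).fetchall()
    finally:
        conn.execute("DROP TABLE needle")
    return [row[0] for row in rows]

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE haystack (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO haystack (id) VALUES (?)",
                     [(i,) for i in range(1, 11)])
    # 42 and 99 are not in the haystack; 3 appears twice in the needles.
    return find_via_temp_table(conn, [3, 42, 7, 3, 99])
```

Whether this beats a plain IN list for 1,000 needles is something you would have to benchmark on your own data.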

Related

How to get Count for large tables?

Sample Table:
+----+-------+-------+-------+-------+-------+---------------+
| id | col1 | col2 | col3 | col4 | col5 | modifiedTime |
+----+-------+-------+-------+-------+-------+---------------+
| 1 | temp1 | temp2 | temp3 | temp4 | temp5 | 1554459626708 |
+----+-------+-------+-------+-------+-------+---------------+
above table has 50 million records
(col1, col2, col3, col4, col5 these are VARCHAR columns)
(id is PK)
(modifiedTime)
Every column is indexed
For Ex: I have two tabs in my website.
FirstTab - I print the count of above table with following criteria [col1 like "value1%" and col2 like "value2%"]
SecondTab - I print the count of above table with following criteria [col3 like "value3%"]
As I have 50 million records, the count with those criteria takes too much time to get the result.
Note: I sometimes change the data (rows in the table): I insert new rows and delete records that are no longer needed.
I need a feasible solution instead of querying the whole table, for example caching the older count. Is anything like this possible?

While I'm sure it's possible for MySQL, here's a solution for Postgres, using triggers.
Count is stored in another table, and there's a trigger on each insert/update/delete that checks if the new row meets the condition(s), and if it does, add 1 to the count. Another part of the trigger checks if the old row meets the condition(s), and if it does, subtracts 1.
Here's the basic code for the trigger that counts the rows with temp2 = '5':
CREATE OR REPLACE FUNCTION updateCount() RETURNS TRIGGER AS
$func$
BEGIN
IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
EXECUTE 'UPDATE someTableCount SET cnt = cnt + 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
USING NEW;
END IF;
IF TG_OP = 'DELETE' OR TG_OP = 'UPDATE' THEN
EXECUTE 'UPDATE someTableCount SET cnt = cnt - 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
USING OLD;
END IF;
RETURN new;
END
$func$ LANGUAGE plpgsql;
Here's a working example on dbfiddle.
You could of course modify the trigger code to have dynamic where expressions and store counts for each in the table like:
CREATE TABLE someTableCount
(
whereExpr text,
cnt INT
);
INSERT INTO someTableCount VALUES ('temp2 = ''5''', 0);
In the trigger you'd then loop through the conditions and update accordingly.
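The same counter-table idea can be tried out without a Postgres server. The following is a rough sketch using SQLite triggers through Python's sqlite3 module; the table names sample and sample_count, the counter key 'col3_value3', and the condition col3 LIKE 'value3%' are assumptions chosen to mirror the question, and MySQL trigger syntax differs in the details.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sample (id INTEGER PRIMARY KEY, col3 TEXT NOT NULL);
CREATE TABLE sample_count (name TEXT PRIMARY KEY, cnt INTEGER NOT NULL);
INSERT INTO sample_count VALUES ('col3_value3', 0);

-- Add 1 when a matching row is inserted.
CREATE TRIGGER sample_ins AFTER INSERT ON sample
WHEN NEW.col3 LIKE 'value3%'
BEGIN
    UPDATE sample_count SET cnt = cnt + 1 WHERE name = 'col3_value3';
END;

-- Subtract 1 when a matching row is deleted.
CREATE TRIGGER sample_del AFTER DELETE ON sample
WHEN OLD.col3 LIKE 'value3%'
BEGIN
    UPDATE sample_count SET cnt = cnt - 1 WHERE name = 'col3_value3';
END;

-- On update, add 1 if the new row matches and subtract 1 if the old one did
-- (in SQLite, LIKE evaluates to 1 or 0 for non-NULL operands).
CREATE TRIGGER sample_upd AFTER UPDATE OF col3 ON sample
BEGIN
    UPDATE sample_count
    SET cnt = cnt + (NEW.col3 LIKE 'value3%') - (OLD.col3 LIKE 'value3%')
    WHERE name = 'col3_value3';
END;
"""

def cached_count(conn):
    return conn.execute(
        "SELECT cnt FROM sample_count WHERE name = 'col3_value3'"
    ).fetchone()[0]

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO sample (id, col3) VALUES (?, ?)",
                     [(1, "value3-a"), (2, "other"), (3, "value3-b")])
    conn.execute("UPDATE sample SET col3 = 'value3-c' WHERE id = 2")
    conn.execute("UPDATE sample SET col3 = 'zzz' WHERE id = 1")
    conn.execute("DELETE FROM sample WHERE id = 3")
    actual = conn.execute(
        "SELECT COUNT(*) FROM sample WHERE col3 LIKE 'value3%'"
    ).fetchone()[0]
    return cached_count(conn), actual
```

Reading the cached value is a single-row lookup regardless of table size; the cost moves to every write on the base table.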

FirstTab - I print the count of above table with following criteria [col1 like "value1%" and col2 like "value2%"]
That would benefit from a 'composite' index:
INDEX(col1, col2)
because it would be "covering". (That is, all the columns needed in the query are found in a single index.)
SecondTab - I print the count of above table with following criteria [col3 like "value3%"]
You apparently already have the optimal (covering) index:
INDEX(col3)
Now, let's look at it from a different point of view. Have you noticed that search engines no longer give you an exact count of rows that match? You are finding out why -- it takes too long to do the tally no matter what technique is used.
Since "col1" gives me no clue of your app, nor any idea of what is being counted, I can only throw out some generic recommendations:
Don't give the counts.
Precompute the counts, save them somewhere and deliver 'stale' values. This can be handy if there are only a few different "values" being counted. It is probably not practical for arbitrary strings.
Say "about nnnn" in the output.
Play some tricks to decide whether it is practical to compute the exact value or just say "about".
Say "more than 1000".
etc
If you would like to describe the app and the columns, perhaps I can provide some clever tricks.
You expressed concern about "insert speed". This is usually not an issue, and the benefit of having the 'right' index for SELECTs outweighs the slight performance hit for INSERTs.

It sounds like you're trying to use a hammer when a screwdriver is needed. If you don't want to run batch computations, I'd suggest using a streaming framework such as Flink or Samza to add and subtract from your counts when records are added or deleted. This is precisely what those frameworks are built for.
If you're committed to using SQL, you can set up a job that performs the desired count operations every given time window, and stores the values to a second table. That way you don't have to perform repeated counts across the same rows.

As a general rule of thumb when it comes to optimisation (and yes, a single SQL server node with 50 million entries per table needs it!), here is a list of a few possible optimisation techniques. Some are fairly easy to implement; others may need more serious modifications:
optimize your MySQL field types and sizes, e.g. use INT instead of VARCHAR if the data can be represented as numbers, use SMALLINT instead of BIGINT, etc. If you really need VARCHAR, make each field as short as possible,
look at your dataset: are there any repeating values? If, say, one of your fields has only 5 unique values across 50 million rows, save those values to a separate table and link its PK to this Sample Table,
MySQL partitioning (basic understanding is shown at this link): the general idea is to implement some kind of partitioning scheme, e.g. a new partition is created by a cron job every day at "night" when server utilization is at its minimum, or whenever you reach another 50k INSERTs or so (btw, some extra effort will also be needed for UPDATE/DELETE operations on different partitions),
caching is another very simple and effective approach, since you are requesting (almost) the same data over and over again (I am assuming your value1%, value2%, value3% are always the same?). So do a SELECT COUNT() once in a while, and then use a differential index count to get the actual number of selected rows,
an in-memory database can be used alongside traditional SQL DBs to serve often-needed data; a simple key-value store could be enough: Redis, Memcached, VoltDB, MemSQL are just some of them. MySQL also has an in-memory engine,
use other types of DBs, e.g. a NoSQL DB like MongoDB, if your dataset/system can make use of a different concept.

If you are looking for aggregation performance and don't really care about insert times, I would consider changing your row DBMS for a column DBMS.
A column RDBMS stores data as columns, meaning each column is indexed independently from the others. This allows much faster aggregations: I switched from Postgres to MonetDB (an open source column DBMS) and summing one field from a 6 million row table dropped from ~60s to 50ms. I chose MonetDB as it supports SQL querying and ODBC connections, which were a plus for my use case, but you should see similar performance improvements with other column DBMSs.
There is a downside to column storage, which is that you lose performance on insert, update and delete queries, but from what you said, I believe it won't affect you that much.

In Postgres, you can get an estimated row count from the internal statistics that are managed by the query planner:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'mytable';
Here you have more details: https://wiki.postgresql.org/wiki/Count_estimate
You could create a materialized view first. Something like this:
CREATE MATERIALIZED VIEW mytable AS SELECT * FROM the_table WHERE col1 like 'value1%' and col2 like 'value2%';
You can also materialize the count queries directly. If you have 10 tabs, you would materialize 10 views:
CREATE MATERIALIZED VIEW count_tab1 AS SELECT count(*) FROM the_table WHERE col1 like 'value1%' and col2 like 'value2%';
CREATE MATERIALIZED VIEW count_tab2 AS SELECT count(*) FROM the_table WHERE col2 like 'value2%' and col3 like 'value3%';
...
After each insert, you should refresh the views (asynchronously):
REFRESH MATERIALIZED VIEW count_tab1;
REFRESH MATERIALIZED VIEW count_tab2;
...

As noted in the critique, you have not posted what you have tried, so I will assume the scope of the question is exactly what you posted. Please report results for exactly that:
What is the current time you are spending on each subset of the problem, i.e. the count of [col1 like "value1%" and col2 like "value2%"] and, second, [col3 like "value3%"]?
The trick would be to scan the data source once and make the data source smaller by creating an index. So first create an index on (col1, col2, col3, id). The purpose of including col3 and id is so that the database scans just the index. I would then get both counts in the same SQL statement:
select sum
(
case
when col1 like 'value1%' and col2 like 'value2%' then 1
else 0
end
) cnt_condition_1,
sum
(
case
when col3 like 'value3%' then 1
else 0
end
) cnt_condition_2
from table
where (col1 like 'value1%' and col2 like 'value2%') or
(col3 like 'value3%')
So the 50M row table is probably very wide right now. This should trim it down - on a reasonable server I would expect above to return in a few seconds. If it does not and each condition returns < 10% of the table, second option will be to create multiple indexes for each scenario and do count for each so that index is used in each case.

If there are no bulk inserts or bulk updates happening in your system, can you try vertical partitioning of your table? With vertical partitioning, you can separate the data block of col1 and col2 from the other data of the table, which reduces your search space.
Also, indexing every column doesn't seem to be the best approach. Index only where it is absolutely needed. In this case, I would say Index(col1,col2) and Index(col3).
Even after indexing, you need to look into the fragmentation of those indexes and adjust them accordingly to get the best results, because sometimes a 50-million-entry index on one column can sit as one huge chunk, which restricts the multiprocessing capabilities of your SQL server.

Each database has its own peculiarities in how to "enhance" its RDBMS. I can't speak for MySQL or SQL Server, but for PostgreSQL you should consider making the indexes that you search GIN (Generalized Inverted Index)-based indexes.
CREATE INDEX name ON table USING gin(col1);
CREATE INDEX name ON table USING gin(col2);
CREATE INDEX name ON table USING gin(col3);
More information can be found here.
-HTH

this will work:
select count(*) from (
select * from tablename where col1 like 'value1%' and col2 like 'value2%' and col3 like 'value3%')
where REGEXP_LIKE(col1,'^value1(.*)$') and REGEXP_LIKE(col2,'^value2(.*)$') and
REGEXP_LIKE(col3,'^value3(.*)$');
Try not to index every column, as that slows down the processing of a SQL query; index only the columns that need it.

Can I add rows to MySQL before removing all old rows (except same primary)?

If I have a table that has these rows:
animal (primary)
-------
man
dog
cow
and I want to delete all the rows and insert my new rows (that may contain some of the same data), such as:
animal (primary)
-------
dog
chicken
wolf
I could simply do something like:
delete from animal;
and then insert the new rows.
But when I do that, for a split second, 'dog' won't be accessible through the SELECT statement.
I could simply insert ignore the new data and then delete the rest, one by one, but that doesn't feel like the right solution when I have a lot of rows.
Is there a way to insert the new data and then have MySQL automatically delete the rest afterward?
I have a program that selects data from this table every 5 minutes (and the code I'm writing now will be updating this table once every 30 minutes), so I would like to be as accurate as possible at all times, and I would rather have too many rows for a split second than too few rows for the same time.
Note: I know that this may seem like it is unnecessary but I just feel like if I leave too many of those unlikely possibilities in different places, there will be times where things go wrong.

You may want to use TRUNCATE instead of DELETE here. TRUNCATE is faster than DELETE and resets the table back to its empty state (meaning IDENTITY columns are reset to original values as well).
Not sure why you're having problems with selecting a value that was deleted and re-added, maybe I'm missing some context. But if you're wiping the table clean, you might want to use truncate instead.

You could add another column timestamp and change the select statement to accommodate this scenario where it needs to check for the latest value.
If this is for school, I would argue that you need a timestamp and that is what your professor is looking for. You shouldn't need to truncate a table to get the latest values, you need to adjust the thinking behind the table and how you are querying data. Hope this helps!
Check out these:
How to make a mysql table with date and time columns?
Why not update values instead?
My other questions would be:
How are you loading this into the table?
What does that code look like?
Can you change the way you Select from the table?
What values are being "updated" and change in such a way that you need to truncate the entire table?

If you don't want to add a new column, there is another method.
1. In the first step, update the table in some way that marks all existing rows for future deletion. For example:
UPDATE `table_name` SET `animal`=CONCAT('MUST_BE_DELETED_', `animal`)
2. In the second step, insert the new rows.
3. In the final step, remove all marked rows:
DELETE FROM `table_name` WHERE `animal` LIKE 'MUST_BE_DELETED_%'

You could implement this by having the updated_on column as timestamp and you may even utilize some default values, but let's go with an example without them.
I presume the table would look something like this:
CREATE TABLE `my_table` (
`animal` varchar(255) NOT NULL,
`updated_on` timestamp,
PRIMARY KEY (`animal`)
) ENGINE=InnoDB
This is just a dummy table example. What's important are the two queries later on.
You would simply perform a query to insert the data, such as:
insert into my_table(animal)
select animal from my_view where animal = 'dogs'
on duplicate key update
updated_on = current_timestamp;
Please notice that my_view is your table/view/query by which you supply the values to insert into your table. Also notice that you need to have primary/unique key constraint on your animal column in this example, in order to work.
Then, you proceed with the following query, to "purge" (delete) the old values:
delete from my_table
where updated_on < (
select *
from (
select max(updated_on) from my_table
) as max_date
);
Please notice that you could make a separate view in order to obtain this max_date value for updated_on entry. This entry should indicate the timestamp for your last updated/inserted values in a previous query, so you could proceed with utilizing it in a where clause in order to issue deletion of old records that you don't want/need anymore.
IMPORTANT NOTE:
Since you are running multiple queries that are supposed to act as a single operation, I'd advise you to wrap them in a single transaction and to roll back properly on the various potential failures (e.g. in case of MySQL exceptions). You might wish to use a stored procedure for that.
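As a rough illustration of the "single transaction" advice, here is a sketch that inserts the new rows first and deletes the stale ones afterwards, all inside one transaction, so a row like 'dog' never disappears. It assumes Python's sqlite3 as a stand-in for MySQL; replace_rows is a hypothetical helper, INSERT OR IGNORE is SQLite's spelling of MySQL's INSERT IGNORE, and this variant uses a NOT IN list rather than the updated_on column from the answer above.

```python
import sqlite3

def replace_rows(conn, new_animals):
    """Make the table contain exactly new_animals, atomically."""
    new_animals = list(new_animals)
    # "with conn" commits if the block succeeds and rolls back on an exception.
    with conn:
        # Step 1: add rows that are not there yet; existing keys are left alone.
        conn.executemany("INSERT OR IGNORE INTO animal (animal) VALUES (?)",
                         [(a,) for a in new_animals])
        # Step 2: remove everything that is not part of the new set.
        if new_animals:
            placeholders = ",".join("?" for _ in new_animals)
            conn.execute(
                f"DELETE FROM animal WHERE animal NOT IN ({placeholders})",
                new_animals)
        else:
            conn.execute("DELETE FROM animal")

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animal (animal TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO animal (animal) VALUES (?)",
                     [("man",), ("dog",), ("cow",)])
    conn.commit()
    replace_rows(conn, ["dog", "chicken", "wolf"])
    return [r[0] for r in conn.execute("SELECT animal FROM animal ORDER BY animal")]
```

With a transactional engine such as InnoDB, other connections should see either the old set or the new set, not an in-between state; for very long lists the NOT IN part would need a staging table instead.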

how to compare huge table of mysql

I have a huge MySQL table which contains more than 33 million records. How can I compare my table to find non-duplicate records? Unfortunately a SELECT statement doesn't work, because the table is so large.
Please provide me a solution

First, create a snapshot of your database or of the tables you want to compare.
Optionally, you can also limit the range of data you want to compare, for example to only 3 years of data. This way your select query won't hog all the resources.
The snapshot will be a bunch of files, each representing a table and containing the primary key or business key of each record (I am assuming you can compare data based on the aforementioned key; if that's not the case, record all the fields in your file).
Next, read each record from the file and do a select against the corresponding table. If there is more than 1 record, you know it is a duplicate.
Thanks

Look at the explain plan and see what the DB is actually doing for the NOT IN.
You could try refactoring, with an index on subscriber as Roy suggested if necessary. I'm not familiar enough with MySQL to know whether the optimizer will execute these identically.
SELECT *
FROM contracts
WHERE NOT EXISTS
( SELECT 1
FROM edms
WHERE edms.subscriber=contracts.subscriber
);
-- or
SELECT C.*
FROM contracts AS C
LEFT
JOIN edms AS E
ON E.subscriber = C.subscriber
WHERE E.subscriber IS NULL;

SQL: Select Keys that doesn't exist in one table

I got a table with a normal setup of auto inc. ids. Some of the rows have been deleted so the ID list could look something like this:
(1, 2, 3, 5, 8, ...)
Then, from another source (Edit: Another source = NOT in a database) I have this array:
(1, 3, 4, 5, 7, 8)
I'm looking for a query I can use on the database to get the list of ID:s NOT in the table from the array I have. Which would be:
(4, 7)
Does such exist? My solution right now is either creating a temporary table so the command "WHERE table.id IS NULL" works, or probably worse, using the PHP function array_diff to see what's missing after having retrieved all the ids from table.
Since the list of ids is closing in on millions of rows, I'm eager to find the best solution.
Thank you!
/Thomas
Edit 2:
My main application is a rather simple table which is populated with a lot of rows. The application is administered through a browser, and I'm using PHP as the interpreter for the code.
Everything in this table is to be exported to another system (a third-party product), and there is as yet no way of doing this other than manually using the import function in that program. It is also possible to insert new rows in the other system, although the agreed routine is to never ever do this.
The problem is that my system cannot be 100% sure that the user did everything correctly after he/she pressed the "export" key, or that no rows have ever been created in the other system.
From the other system I can get a CSV file containing all the rows that system has. So, by comparing the CSV file and my table I can see if:
* There are any rows missing in the other system that should have been imported
* Someone has created rows in the other system
The problem isn't "solving it". It's finding the best solution, since there is so much data in the rows.
Thanks again!
/Thomas

We can use the MySQL NOT IN operator.
SELECT id
FROM table_one
WHERE id NOT IN ( SELECT id FROM table_two )
Edited
If you are getting the source from a CSV file, then you can simply put these values directly into the query.
I am assuming that the CSV values look like 1,2,3,...,n:
SELECT id
FROM table_one
WHERE id NOT IN ( 1,2,3,...,n );
EDIT 2
Or, if you want to select the other way around, you can import the data into a temporary table in the MySQL database (for example with mysqlimport or LOAD DATA), retrieve the result, and then delete the table.
Like:
Create table
CREATE TABLE my_temp_table(
ids INT
);
load .csv file
LOAD DATA LOCAL INFILE 'yourIDs.csv' INTO TABLE my_temp_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(ids);
Selecting records
SELECT ids FROM my_temp_table
WHERE ids NOT IN ( SELECT id FROM table_one )
dropping table
DROP TABLE IF EXISTS my_temp_table

What about using a left join ; something like this :
select second_table.id
from second_table
left join first_table on first_table.id = second_table.id
where first_table.id is null
You could also go with a sub-query ; depending on the situation, it might, or might not, be faster, though :
select second_table.id
from second_table
where second_table.id not in (
select first_table.id
from first_table
)
Or with a not exists :
select second_table.id
from second_table
where not exists (
select 1
from first_table
where first_table.id = second_table.id
)
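To check that the three variants above agree, here is a small sketch that runs them against the sample ids from the question (table: 1, 2, 3, 5, 8; external list: 1, 3, 4, 5, 7, 8). It assumes Python's sqlite3 in place of MySQL, and it assumes the external list has already been loaded into second_table; the QUERIES dictionary and missing_ids are names made up for this example.

```python
import sqlite3

# The three anti-join formulations, with first_table as the database table
# and second_table holding the ids from the external source.
QUERIES = {
    "left_join": (
        "SELECT s.id FROM second_table s "
        "LEFT JOIN first_table f ON f.id = s.id "
        "WHERE f.id IS NULL ORDER BY s.id"
    ),
    "not_in": (
        "SELECT s.id FROM second_table s "
        "WHERE s.id NOT IN (SELECT f.id FROM first_table f) ORDER BY s.id"
    ),
    "not_exists": (
        "SELECT s.id FROM second_table s "
        "WHERE NOT EXISTS (SELECT 1 FROM first_table f WHERE f.id = s.id) "
        "ORDER BY s.id"
    ),
}

def missing_ids():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE first_table (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE second_table (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO first_table VALUES (?)",
                     [(i,) for i in (1, 2, 3, 5, 8)])
    conn.executemany("INSERT INTO second_table VALUES (?)",
                     [(i,) for i in (1, 3, 4, 5, 7, 8)])
    return {name: [row[0] for row in conn.execute(sql)]
            for name, sql in QUERIES.items()}
```

One caveat worth keeping in mind: if the column inside the NOT IN subquery can contain NULL, NOT IN returns no rows at all, while the LEFT JOIN and NOT EXISTS forms are unaffected. Which form is fastest depends on the engine and version, so check the explain plan.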

The function you are looking for is NOT IN (an alias for <> ALL)
The MYSQL documentation:
http://dev.mysql.com/doc/refman/5.0/en/all-subqueries.html
An Example of its use:
http://www.roseindia.net/sql/mysql-example/not-in.shtml
Enjoy!

The problem is that T1 could have a million rows or ten million rows, and that number could change, so you don't know how many rows your comparison table, T2, the one that has no gaps, should have, for doing a WHERE NOT EXISTS or a LEFT JOIN testing for NULL.
But the question is, why do you care if there are missing values? I submit that, when an application is properly architected, it should not matter if there are gaps in an autoincrementing key sequence. Even an application where gaps do matter, such as a check register, should not be using an autoincrementing primary key as a synonym for the check number.
Care to elaborate on your application requirement?

OK, I've read your edits/elaboration. Synchronizing two databases where the second is not supposed to insert any new rows, but might do so, sounds like a problem waiting to happen.
Neither approach suggested above (WHERE NOT EXISTS or LEFT JOIN) is air-tight and neither is a way to guarantee logical integrity between the two systems. They will not let you know which system created a row in situations where both tables contain a row with the same id. You're focusing on gaps now, but another problem is duplicate ids.
For example, if both tables have a row with id 13887, you cannot assume that database1 created the row. It could have been inserted into database2, and then database1 could insert a new row using that same id. You would have to compare all column values to ascertain that the rows are the same or not.
I'd suggest therefore that you also explore GUID as a replacement for autoincrementing integers. You cannot prevent database2 from inserting rows, but at least with GUIDs you won't run into a problem where the second database has inserted a row and assigned it a primary key value that your first database might also use, resulting in two different rows with the same id. CreationDateTime and LastUpdateDateTime columns would also be useful.
However, a proper solution, if it is available to you, is to maintain just one database and give users remote access to it, for example, via a web interface. That would eliminate the mess and complication of replication/synchronization issues.
If a remote-access web-interface is not feasible, perhaps you could make one of the databases read-only? Or does database2 have to make updates to the rows? Perhaps you could deny insert privilege? What database engine are you using?

I have the same problem: I have a list of values from the user, and I want to find the subset that does not exist in another table. I did it in Oracle by building a pseudo-table in the select statement. Here's a way to do it in Oracle; try it in MySQL without the "from dual":
-- find ids from user (1,2,3) that *don't* exist in my person table
-- build a pseudo table and join it with my person table
select pseudo.id from (
select '1' as id from dual
union select '2' as id from dual
union select '3' as id from dual
) pseudo
left join person
on person.person_id = pseudo.id
where person.person_id is null
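When the list comes from application code, the pseudo-table can be generated instead of typed by hand. This is only a sketch under assumptions: it uses Python's sqlite3 rather than Oracle or MySQL, the helper name ids_missing_from_person is invented, and it binds the ids as integers instead of the quoted strings used above.

```python
import sqlite3

def ids_missing_from_person(conn, candidate_ids):
    """Return the candidate ids that have no row in person."""
    candidate_ids = list(candidate_ids)
    if not candidate_ids:
        return []
    # Build "SELECT ? AS id UNION SELECT ? AS id ..." with one bound
    # parameter per candidate. SQLite caps the number of UNION terms
    # (500 by default), so longer lists would need chunking or a temp table.
    pseudo = " UNION ".join(["SELECT ? AS id"] * len(candidate_ids))
    sql = (
        f"SELECT pseudo.id FROM ({pseudo}) AS pseudo "
        "LEFT JOIN person ON person.person_id = pseudo.id "
        "WHERE person.person_id IS NULL "
        "ORDER BY pseudo.id"
    )
    return [row[0] for row in conn.execute(sql, candidate_ids)]

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO person VALUES (?)", [(2,), (10,)])
    return ids_missing_from_person(conn, [1, 2, 3])
```

For lists approaching millions of ids, as in the question, loading the CSV into a real temporary table is likely the more practical route.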

Can I optimize my database by splitting one big table into many small ones?

Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (let's say 100,000) and a lot of properties (let's say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So I use, for example, where user_name='Albert Gates'. So, every time, the MySQL server needs to analyze 1 billion rows to find those which contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?

No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correct indexing, it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
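To see the effect of the index on the query plan, here is a small sketch that compares the plan before and after creating it. It assumes Python's sqlite3 and SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN; the table name user_props and index name idx_user_name are made up for the example, and the exact plan wording varies between versions.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[-1] for row in rows)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_props ("
        "user_name TEXT, user_property TEXT, value_of_property TEXT)")
    # 100 users x 20 properties of toy data.
    conn.executemany(
        "INSERT INTO user_props VALUES (?, ?, ?)",
        [(f"user{u}", f"prop{p}", str(u * p))
         for u in range(100) for p in range(20)])
    query = ("SELECT user_property, value_of_property "
             "FROM user_props WHERE user_name = ?")
    before = plan(conn, query, ("user42",))   # full table scan expected
    conn.execute("CREATE INDEX idx_user_name ON user_props (user_name)")
    after = plan(conn, query, ("user42",))    # index search expected
    row_count = len(conn.execute(query, ("user42",)).fetchall())
    return before, after, row_count
```

The query and its result stay the same; only the access path changes from scanning every row to a lookup in the index.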

you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (500K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)

Properly indexing your database will be the number 1 way of improving performance. I once had a query that took half an hour (on a large dataset, but nonetheless). Then we found out that the tables had no indexes. Once they were indexed, the query took less than 10 seconds.

Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields, e.g. start and end date?
Why not simply have the properties as fields rather than some many-to-many relationship?
Have 1 flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic
Id, property Type, Property Name, Property Value table.
This is really lazy and exceptionally flexible, but it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
If data has a 1:1 relationship with an entity, then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.
