Hi, I'm a server developer and we have a big MySQL database (the biggest table has about 0.5 billion rows) running 24/7.
There's a lot of broken data. Most of it is logically wrong and involves multiple sources (several tables, S3). Since the logic is fairly complicated, we need Rails models to clean it (it can't be done with pure SQL queries).
Right now I am using my own small cleansing framework and an AWS Auto Scaling Group to scale up instances and speed things up. But since the database is in service, I have to be careful (table locks and other things) and limit the number of processes.
So I am curious about the following:
How do you (or big companies) clean your data while the database is in-service?
Do you use temporary tables and swap? or just update/insert/delete to an in-service database?
Do you use a framework, library, or other solution to clean data in an efficient way (such as distributed processing)?
How do you detect messed-up data in real time?
Do you use a framework, library, or other solution to detect broken data?
I have faced a problem similar in nature to what you are dealing with, but different in scale. This is how I would approach the situation.
First, assess the infrastructure concerns: can the database be taken offline or restricted from use for a few hours of maintenance? If so, read on.
Next, define what constitutes "broken data".
Once you arrive at your definition of "broken data", translate it into a way to programmatically identify it.
Write a script that leverages your programmatic identification algorithm and run some tests (a sketch follows after these steps).
Then back up your data in preparation.
Given the scale of your data set, you will probably need to increase your server resources so the script isn't bottlenecked.
Run the script.
Test your data to assess the effectiveness of your script.
If needed, adjust and rerun.
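To make the "identify, then fix" steps concrete, here is a minimal sketch in Python (your production version would more likely be a Rails task). It assumes a MySQL table reachable with PyMySQL; the table and column names (orders, id, status, total_cents), the connection details, and the is_broken()/repair() rules are placeholders for your own definition of broken data, which in your case would also pull in the other sources (other tables, S3). It works in small primary-key batches and commits per batch so locks on the in-service table stay short.

    import time
    import pymysql

    BATCH_SIZE = 5000       # small batches keep locks short on an in-service table
    PAUSE_SECONDS = 0.5     # throttle so the cleanup doesn't starve live traffic

    def is_broken(row):
        # Placeholder for your cross-source validation logic.
        return row["total_cents"] < 0

    def repair(row):
        # Placeholder: return the corrected column values for this row.
        return {"total_cents": 0, "status": "needs_review"}

    conn = pymysql.connect(host="db.example.internal", user="cleaner",
                           password="secret", database="app",
                           cursorclass=pymysql.cursors.DictCursor)
    last_id = 0
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT id, status, total_cents FROM orders "
                        "WHERE id > %s ORDER BY id LIMIT %s",
                        (last_id, BATCH_SIZE))
            rows = cur.fetchall()
        if not rows:
            break
        with conn.cursor() as cur:
            for row in rows:
                if is_broken(row):
                    fix = repair(row)
                    cur.execute("UPDATE orders SET total_cents = %s, status = %s "
                                "WHERE id = %s",
                                (fix["total_cents"], fix["status"], row["id"]))
        conn.commit()           # commit per batch so row locks are released quickly
        last_id = rows[-1]["id"]
        time.sleep(PAUSE_SECONDS)
    conn.close()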
It's possible to do this without closing the database for maintenance, but I think you will get better results if you do. Also, since this is a Rails app, I would look at your app's model validations and input field validations to prevent "broken data" in real time.
I'm going to attempt to create a basic monitoring tool in VB.NET. I'd like some advice on how to tackle the logging and reporting side of things, so I'd appreciate responses from users who I'm sure have a better idea than me and can point out far more efficient ways of doing things.
My plan is to have a client tool which reads values from a MySQL database and refreshes at a set interval; I'm thinking every 10-15 minutes at the moment. This side of the application is quite easy: I can read the database every so often and then change labels and display alerts based on the values. This is all well documented and I am probably okay with that.
The second part is to have a client that sits in the system tray of the server gathering the required information. I think the system tray part will probably be the trickiest bit, but that's not really part of my question.
I assume I can use the normal information-gathering commands, store the results (perhaps as strings), and then connect to the same database and write them to the relevant fields. For example, if I had a MySQL table called "server" and a column titled "Connection", I could check whether the server has an internet connection, store the result as 1 for yes and 0 for no, and then send a MySQL command to update the "Connection" value to 0 or 1.
Then, in the monitoring tool, I assume I can run a MySQL query to check the "Connection" column: if the value is 0, change a label or flag an error; if it is 1, report that connectivity is okay?
My main questions about the above are listed below.
Is using a MySQL database the most efficient way of doing something like this?
Obviously if my database goes down there's no more reporting; I still think that's a con I'll have to live with, though.
Is storing everything as values within the code the best way to store my data?
Is there any particular type or format I should use for the MySQL column? I was thinking maybe tinyint(9).
Is the above method redundant and pointless?
I assume all these database connections could cause some unwanted server load; however, the 15-minute refresh time should combat that.
Is there a way to properly handle delays, e.g. the client not updating in time for the reporter, so that it picks up stale data? Perhaps a fail-safe such as a column containing the last-updated time (see the sketch below)?
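For what it's worth, here is a rough sketch (in Python rather than VB.NET, purely to illustrate the idea) of the status-flag scheme described above plus the last-updated fail-safe from the final question. It assumes a server table with `Connection` and last_updated columns and a PyMySQL connection; the names, credentials, and the 20-minute staleness threshold are all placeholders.

    import pymysql

    # Assumed schema: server(name VARCHAR, `Connection` TINYINT(1), last_updated DATETIME)
    conn = pymysql.connect(host="localhost", user="monitor",
                           password="secret", database="monitoring")
    with conn.cursor() as cur:
        # Agent side: write the current status and a heartbeat timestamp.
        cur.execute("UPDATE server SET `Connection` = %s, last_updated = NOW() "
                    "WHERE name = %s", (1, "fileserver01"))
        conn.commit()

        # Monitor side: read the flag, but treat stale rows as unknown.
        cur.execute("SELECT name, `Connection`, "
                    "       last_updated < NOW() - INTERVAL 20 MINUTE AS stale "
                    "FROM server")
        for name, connected, stale in cur.fetchall():
            if stale:
                print(f"{name}: no recent update - treat as unknown/offline")
            elif connected:
                print(f"{name}: connectivity OK")
            else:
                print(f"{name}: connection DOWN")
    conn.close()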
You probably don't need the tool that gathers information per se. The web app (real time monitor) can do that, since the clients are storing their information in the same database. The web app can access the database every 15 minutes and display the data, without the intermediate step of saving it again. This will provide the web app with the latest information instead of a potential 29-minute delay.
In other words, the clients are saving the connection information once. Don't duplicate it in the database.
MySQL should work just about as well as anything.
It's a bad idea to hard code "everything". You can use application settings or a MySQL table if you need to store IPs, etc.
In an application like this, the conversion overhead will more than offset the storage savings of a tinyint. I would use the most convenient data type.
I am about 70% of the way through developing a web application which contains what is essentially a largish data table of around 50,000 rows.
The app itself is a filtering app providing various ways of filtering this table, such as range filtering by number, drag-and-drop filtering that ultimately performs regexp filtering, live text searching, and I could go on and on.
Because of this, I coded my MySQL queries in a modular fashion so that the actual query is put together dynamically depending on the type of filtering happening.
At the moment each filtering action takes between 250-350 ms in total on average. For example:
The user grabs one end of a visual slider and drags it inwards; when they let go, a range-filtering query is dynamically put together by my PHP code and the results are returned as a JSON response. The total time from the user letting go of the slider until all data is received and the table is redrawn is between 250-350 ms on average.
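For concreteness, the modular query assembly described above might look roughly like this (sketched in Python rather than the PHP the app actually uses): each active filter contributes one parameterized WHERE clause, and the table/column names here are hypothetical.

    def build_query(filters):
        clauses, params = [], []
        if "price_range" in filters:                 # slider -> numeric range filter
            lo, hi = filters["price_range"]
            clauses.append("price BETWEEN %s AND %s")
            params.extend([lo, hi])
        if "name_search" in filters:                 # live text search
            clauses.append("name LIKE %s")
            params.append("%" + filters["name_search"] + "%")
        if "pattern" in filters:                     # drag-and-drop -> regexp filter
            clauses.append("name REGEXP %s")
            params.append(filters["pattern"])
        sql = "SELECT id, name, price FROM parts"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, params

    # Example: the slider was released at 10..250 while a live search is active.
    sql, params = build_query({"price_range": (10, 250), "name_search": "bolt"})
    print(sql, params)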
I am concerned about scalability further down the line, as users can be expected to perform a huge number of these filtering actions in a short space of time in order to retrieve the data they are looking for.
I have toyed with trying to do some fancy cache-expiry work with memcached but couldn't get it to play ball with my dynamically generated queries. Although everything would cache correctly, I was having trouble expiring the cache when the query changed and keeping the data relevant. I am, however, extremely inexperienced with memcached, and my first few attempts have led me to believe that it isn't the right tool for this job (due to the highly dynamic nature of the queries), although this app could ultimately see very high concurrent usage.
So... my question really is: are there any caching mechanisms/layers that I can add to this sort of application that would reduce hits on the server, bearing in mind the dynamic queries?
Or... If memcached is the best tool for the job, and I am missing a piece of the puzzle with my early attempts, can you provide some information or guidance on using memcached with an application of this sort?
Huge thanks to all who respond.
EDIT: I should mention that the database is MySQL. The site itself is running on Apache with an nginx proxy. But this question relates purely to speeding up and reducing the database hits, of which there are many.
I should also add that the quoted 250-350 ms round-trip time is fully remote, i.e. from a remote computer accessing the website. The time includes DNS lookup, data retrieval, etc.
If I understand your question correctly, you're essentially asking for a way to reduce the number of queries against the database even though there will be very few identical queries.
You essentially have three choices:
Live with having a large amount of queries against your database, optimise the database with appropriate indexes and normalise the data as far as you can. Make sure to avoid normal performance pitfalls in your query building (lots of ORs in ON-clauses or WHERE-clauses for instance). Provide views for mashup queries, etc.
Cache the generic queries (that is, without some or all of the filters) in memcached or similar, and apply the filters in the application layer (see the sketch after this answer).
Implement a search index server, like SOLR.
I would recommend the first option, though. A round-trip time of 250-300 ms sounds a bit high even for complex queries, and it sounds like you have a lot to gain by just improving what you already have at this stage.
For much higher workloads, I'd suggest option 3; it will help you achieve what you are trying to do while being a champ at handling lots of different queries.
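A minimal sketch of option 2, assuming pymemcache and a memcached instance on localhost; the key name, the 60-second TTL, and the row layout are made up. The idea is that one broad, shared result set is cached, and the cheap per-request filters run in application code instead of hitting MySQL every time.

    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    GENERIC_KEY = "parts:all"          # one broad result set, shared by everyone

    def load_generic_rows(fetch_from_db):
        cached = cache.get(GENERIC_KEY)
        if cached is not None:
            return json.loads(cached)
        rows = fetch_from_db()                        # hits MySQL only on a miss
        cache.set(GENERIC_KEY, json.dumps(rows), expire=60)
        return rows

    def filter_rows(rows, price_min, price_max):
        # The per-request filtering happens here instead of in SQL.
        return [r for r in rows if price_min <= r["price"] <= price_max]

    # Usage: rows = filter_rows(load_generic_rows(my_db_query), 10, 250)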
Use Memcache and set the key to be the filtering query or some unique key based on the filter. Ideally you would write your application to expire the key as new data is added.
You can only make good use of caches when you occasionally run the same query.
A good way to work with memcached is to define a key that matches the function that calls it. For example, if the model named UserModel has a method getUser($userID), you could cache each user as USER_id. For more advanced functions (Model2::largerFunction($arg1, $arg2)) you can simply use MODEL2_arg1_arg2; this makes it easy to avoid namespace conflicts.
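Applied to the dynamically generated filter queries from the question, the same naming idea might look like this sketch (pymemcache assumed; hashing the sorted filter parameters is just one way to get a short, collision-safe key, and the 300-second TTL is arbitrary):

    import hashlib
    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def filter_cache_key(filters):
        # Sort the dict so {'a': 1, 'b': 2} and {'b': 2, 'a': 1} map to the same key.
        canonical = json.dumps(filters, sort_keys=True)
        return "FILTER_" + hashlib.sha1(canonical.encode()).hexdigest()

    def cached_filter_query(filters, run_query):
        key = filter_cache_key(filters)
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)
        rows = run_query(filters)                      # only on a cache miss
        cache.set(key, json.dumps(rows), expire=300)   # or delete keys when data changes
        return rows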
For full-text searches, use a search indexer such as Sphinx or Apache Lucene. They improve your queries a LOT (I was able to do a full-text search on a 10-million-record table on a 1.6 GHz Atom processor in less than 500 ms).
In my MS Access application I have several forms that are very data intensive (several subforms based on even more tables). My users are complaining that when opening the data across the network the load times are unbearably long.
I do have a split front-end/back-end setup using the excellent AutoFE application.
One solution I have come up with is that, instead of DoCmd.Close when the user clicks the "Save & Close" button, I do Me.Visible = False. The user still has the long wait the first time the form is opened after the application loads, but later loads are noticeably faster.
So far this has been working fairly well. I am just concerned that there may be some gotchas hidden in this strategy that I haven't encountered yet.
My users aren't overly intelligent and I don't use the application myself so I can't expect to get meaningful feedback if something is behaving erratically.
Anyone else employed this strategy successfully or know of a good reason not to do it?
Yes, that strategy is similar to recipe #8.1, Accelerate the Load Time of Forms, from the second edition of the Access Cookbook. However, that recipe pre-loads a set of forms, with WindowMode:=acHidden, at database startup. So the tradeoff is that database startup takes longer, but subsequent opens of the pre-loaded forms are comparatively fast.
The discussion for that recipe didn't mention any drawbacks for that technique. In limited use, I haven't discovered any. And since it seems to improve your users' experience, I would continue to use it.
Beyond that, I would take a hard look at the amount of data your forms pull from the back-end database. Limit the number of rows retrieved as the Record Sources for the main and subforms. Give the user a method to select a different record or small set of records. Also make sure you use indexing to support Record Source WHERE and ORDER BY clauses. Avoid WHERE conditions that use functions which will force a full table scan to figure out which rows to exclude from the Record Source. Similar considerations apply to combo and list boxes which use saved queries or SELECT statements as their Record Sources; if you can't limit the rows, at least make sure to optimize data retrieval.
At first glance, just hiding the form is not too bad, I think.
I would dig a bit more into WHY your load times are so long. You mentioned several subforms. Are they all displayed at the same time, or are they on the various pages of a Tab control?
In the latter case, you could quite easily unbind the subforms that are not visible and bind them on the PageClick event. That makes a big difference in performance.
EDIT:
Also, a bit out of scope for this question, but good for every performance issue:
- Did you double-check that the foreign keys in the related tables are properly indexed?
- Make sure the back-end is regularly compacted.
Are you making sure that the data gets refreshed in an appropriate timeframe?
Yes, I've done the same thing myself in very complex forms which had about 10 or 15 tabs, each with a subform. It worked for at least ten years. You had to watch for various form-level values or unbound controls which you assume start as Null or zero. But once it's running smoothly it should run just fine. We had to do this back in the Access 97 days because Access would crash with out-of-memory errors after the users had opened and closed various forms thousands of times per day.
We're thinking of "growing" a little MS-Access DB with a few tables, forms and queries for multiple users. (Using a different back-end is another, but more long-term option that is unfortunately currently not acceptable.)
Most users will be read-only, but there will be a few (currently one or two) users that have to be able to do changes (while the read-only users are also using the DB). We're not so much concerned about the security aspects, but more about some of the following issues:
How can we make sure that the write-user can make changes to the table data while other users use the data? Do the read-users put locks on tables? Does the write-user have to put locks on the table? Does Access do this for us or do we have to explicitly code this?
Are there any common problems with "MS Access transactions" that we should be aware of?
Can we work on forms, queries etc. while they are being used? How can we "program" without being in the way of the users?
Which settings in MS Access have an influence on how things are handled?
Our background is mostly in Oracle; how is Access different in handling multiple users? Is there such a thing as "isolation levels" in Access?
Any tips or pointers to helpful articles would be greatly appreciated.
I find the answers to this question to be problematic, confusing and incomplete, so I'll make an effort to do better.
Q1: How can we make sure that the write-user can make changes to the table data while other users use the data? Do the read-users put locks on tables? Does the write-user have to put locks on the table? Does Access do this for us or do we have to explicitly code this?
Nobody has really answered this in any complete fashion. The information on setting locks in the Access options has nothing to do with read vs. write locking. No Locks vs. All Records vs. Edited Record is how you set the default record locking for WRITES.
No locks means you are using OPTIMISTIC locking, which means you allow multiple users to edit the record and then inform them after the fact if the record has changed since they launched their own edits. Optimistic locking is what you should start with as it requires no coding to implement it, and for small users populations it hardly ever causes a problem.
All Records means that the whole table is locked any time an edit is launched.
Edited Record means that fewer records are locked, but whether or not it's a single record or more than one record depends on whether your database is set up to use record-level locking (first added in Jet 4) or page-level locking. Frankly, I've never thought it worth the trouble to set up record-level locking, as optimistic locking takes care of most of the problems.
One might think that you want to use record-level pessimistic locking, but the fact is that in the vast majority of apps, two users are almost never editing the same record. Now, obviously, certain kinds of apps might be exceptions to that, but if I ran into such an app, I'd likely try to engineer it away by redesigning the schema so that it would be very uncommon for two users to edit the same record (usually by going to some form of transactional editing instead, where changes are made by adding records, rather than editing the existing data).
Now, for your actual question, there are a number of ways to accomplish restricting some users to read-only and granting others write privileges. Jet user-level security was intended for this purpose and works fine insofar as it's "security" for any meaningful definition of the term. In general, as long as you're using a Jet/ACE data store, the best security you're going to get is that provided by Jet ULS. It's crackable, yes, but your users would be committing a firable offense by breaking it, so it might be sufficient.
I would tend to not implement Jet ULS at all and instead just architect the data editing forms such that they checked the user's Windows logon and made the forms read-only or writable depending on which users are supposed to get which access. Whether or not you want to record group membership in a data table, or maintain Windows security groups for this purpose is up to you. You could also use a Jet workgroup file to deal with it, and provide a different system.mdw file for the write users. The read-only users would log on transparently as admin, and those logged on as admin would be granted only read-only access. The write users would log on as some other username (transparently, in the shortcut you provide them for launching the app, supplying no password), and that would be used to set up the forms as read or write.
If you use Jet ULS, it can become really hairy to get it right. It involves locking down all the tables as read-only (or maybe not even that) and then using RWOP queries to provide access to the data. I haven't done but one such app in my 14 years of professional Access development.
To summarize my answers to the parts of your question:
How can we make sure that the write-user can make changes to the table data while other users use the data?
I would recommend doing this in the application, setting forms to read/only or editable at runtime depending on the user logon. The easiest approach is to set your forms to be read-only and change to editable for the write users when they open the form.
Do the read-users put locks on tables?
Not in any meaningful sense. Jet/ACE does have read locks, but they are there only for the purpose of maintaining state for individual views, and for refreshing data for the user. They do not lock out write operations of any kind, though the overhead of tracking them theoretically slows things down. It's not enough to worry about.
Does the write-user have to put locks on the table?
Access in combination with Jet/ACE does this for you automatically, particularly if you choose optimistic locking as your default. The key point here is that Access apps are databound, so as soon as a form is loaded, the record has a read lock, and as soon as the record is edited, whether or not it is write-locked for other users is determined by whether you are using optimistic or pessimistic locking. Again, this is the kind of thing Access takes care of for you with its default behaviors in bound forms. You don't worry about any of it until the point at which you encounter problems.
Does Access do this for us or do we have to explicitly code this?
Basically, other than setting editability at runtime (according to who has write access), there is no coding necessary if you're using optimistic locking. With pessimistic locking, you don't have to code, but you will almost always need to, as you can't just leave the user stuck with the default behaviors and error messages.
Q2: Are there any common problems with "MS Access transactions" that we should be aware of?
Jet/ACE has support for commit/rollback transactions, but it's not clear to me if that's what you mean in this question. In general, I don't use transactions except for maintaining atomicity, e.g., when creating an invoice, or doing any update that involves multiple tables. It works about the way you'd expect it to but is not really necessary for the vast majority of operations in an Access application.
Perhaps one of the issues here (particularly in light of the first question) is that you may not quite grasp that Access is designed for creating apps with data bound to the forms. "Transactions" is a topic of great importance for unbound and stateless apps (e.g., browser-based), but for data bound apps, the editing and saving all happens transparently.
For certain kinds of operations this can be problematic, and occasionally it's appropriate to edit data in Access with unbound forms. But that's very seldom the case, in my experience. It's not that I don't use unbound forms -- I use lots of them for dialogs and the like -- it's just that my apps don't edit data tables with unbound forms. With almost no exceptions, all my apps edit data with bound forms.
Now, unbound forms are actually fairly easy to implement in Access (particularly if you name your editing controls the same as the underlying fields), but going with unbound data editing forms is really missing the point of using Access, which is that the binding is all done for you. And the main drawback of going unbound is that you lose all the record-level form events, such as OnInsert, BeforeUpdate and so forth.
Q3. Can we work on forms, queries etc. while they are being used? How can we "program" without being in the way of the users?
This is one of the questions that's been well-addressed. All multi-user or replicated Access apps should be split, and most single-user apps should be, too. It's good design and also makes the apps more stable, as only the data tables end up being opened by more than one user at a time.
Q4. Which settings in MS Access have an influence on how things are handled?
"Things?" What things?
Q5. Our background is mostly in Oracle, where is Access different in handling multiple users? Is there such thing as "isolation levels" in Access?
I don't know anything specifically about Oracle (none of my clients could afford it even if they wanted to), but asking for a comparison of Access and Oracle betrays a fundamental misunderstanding somewhere along the line.
Access is an application development tool.
Oracle is an industrial strength database server.
Apples and oranges.
Now, of course, Access ships with a default database engine, originally called Jet and now revised and renamed ACE, but there are many levels at which Access the development platform can be entirely decoupled from Jet/ACE, the default database engine.
In this case, you've chosen to use a Jet/ACE back end, which will likely be just fine for small user populations, i.e., under 25. Jet/ACE can also be fine up to 50 or 100, particularly when only a few of the simultaneous users have write permission. While the 255-user limit in Jet/ACE includes both read-only and write users, it's the number of write users that really controls how many simultaneous users you can support, and in your case, you've got an app with mostly read-only users, so it oughtn't be terribly difficult to engineer a good app that has no problems with the back end.
Basically, I think your Oracle background is likely leading you to misunderstand how to develop in Access, where the expected approach is to bind your forms to recordsources that are updated without any need to write code. Now, for efficiency's sake it's a good idea to bind your forms to subsets of records, rather than to whole tables, but even with an entire table in the recordsource behind a data editing form, Access is going to be fairly efficient in editing Jet/ACE tables (the old myth about pulling the whole table across the wire is still out there) as long your data tables are efficiently indexed.
Record locking is something you mostly shouldn't have any cause to worry about, and one of the reasons for that is because of bound editing, where the form knows what's going on in the back end at all times (well, at intervals about a second apart, the default refresh interval). That is, it's not like a web page where you retrieve a copy of the data and then post your edits back to the server in a transaction completely unconnected to the original data retrieval operation. In a bound environment like Access, the locking file on the back-end data file is always going to be keeping track of the fact that someone has the record open for editing. This prevents a user's edits from stomping on someone else's edits, because Access knows the state and informs the user. This all happens without any coding on the part of the developer and is one of the great advantages of the bound editing model (aside from not having to write code to post the edits).
For all those who are experienced database programmers familiar with other platforms who are coming to Access for the first time, I strongly suggest using Access like an end user. Try out all the point and click features. Run the form and report wizards and check out the results that they produce. I can't vouch for all of them as demonstrating good practices, but they definitely demonstrate the default assumptions behind the way Access is intended to be used.
If you find yourself writing a lot of code, then you're likely missing the point of Access.
The first thing to do (if not already done) is to split your database into a front end (with all the forms, reports, etc.) and a back end (with all the data). The second thing is to set up version control on the front end.
The way I have done that in a lot of my databases is to have the users run a small "jumper" database to open the main database. This jumper does the following things:
• Checks to see if the user has the database on their C drive
• If they do not, installs it and runs it
• If they do, checks what version they have
• If the version numbers do not match, copies down the latest version
• Opens the database
This whole checking process normally takes under half a second. Using this model you can do all your development on a separate database; then, when you are ready to "release", you just put the new MDE onto the network share, and the next time the user opens the jumper the latest version is copied down.
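A real jumper would be a tiny Access database with a bit of VBA, but the flow it performs is roughly the following (sketched here in Python just to show the logic; the share path, the version file, and the front-end file name are all made up):

    import os
    import shutil

    SHARE = r"\\server\apps\mydb"       # network share holding the released front end
    LOCAL = r"C:\apps\mydb"             # per-user local copy
    FRONT_END = "frontend.mde"

    def read_version(folder):
        try:
            with open(os.path.join(folder, "version.txt")) as f:
                return f.read().strip()
        except FileNotFoundError:
            return None                 # no local copy yet

    def ensure_latest_and_launch():
        if read_version(LOCAL) != read_version(SHARE):
            os.makedirs(LOCAL, exist_ok=True)
            shutil.copy2(os.path.join(SHARE, FRONT_END), LOCAL)
            shutil.copy2(os.path.join(SHARE, "version.txt"), LOCAL)
        # Open the local front end with whatever Access is registered for .mde files.
        os.startfile(os.path.join(LOCAL, FRONT_END))

    ensure_latest_and_launch()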
There are also other things to think about in a multi-user database, and it might be worth checking for common mistakes such as binding a form to a whole table, etc.
I think Access is a good choice for your case, but you have to split the database; see:
http://accessblog.net/2005/07/how-to-split-database-into-be-and-fe.html
• How can we make sure that the write-user can make changes to the table data while other users use the data? Do the read-users put locks on tables? Does the write-user have to put locks on the table? Does Access do this for us or do we have to explicitly code this?
There are no read locks unless you put them in explicitly. Just use "No Locks".
• Are there any common problems with "MS Access transactions" that we should be aware of?
There should not be problems with one or two write users.
• Can we work on forms, queries etc. while they are being used? How can we "program" without being in the way of the users?
If you split the database, then there is no problem working on the FE design.
• Which settings in MS Access have an influence on how things are handled?
What do you mean?
• Our background is mostly in Oracle; how is Access different in handling multiple users? Is there such a thing as "isolation levels" in Access?
There are no isolation levels in Access.
BTW, you can later move the data to Oracle and keep the Access front end if you end up with a lot of users and a big database.
Table or record locking is available in Access during data writes. You can control the Default record locking through Tools | Options | Advanced tab:
No Locks
All Records
Edited Record
You can set this in a form's Record Locks property or in your DAO/ADO code for specific needs.
Transactions shouldn't be a problem if you use them correctly.
Best practice: Separate your tables from all your other code. Give each user their own copy of the code file and then share the data file on a network server. Work on a "test" copy of the code (with a link to a test data file) and then update users' individual code files separately. If you need to make data file changes (add tables, columns, etc.), you will have to have all users exit the application while you make the changes.
See other answers for Oracle comparison.
The correct way of building client/server Microsoft Access applications where the data is stored in an RDBMS is to use the linked table method. This ensures data isolation and concurrency are maintained between the Microsoft Access client application and the RDBMS data with no additional, unnecessary programming logic and code that would make maintenance more difficult and add to development time.
see: http://claysql.blogspot.com/2014/08/normal-0-false-false-false-en-us-x-none.html
Access is a great multi-user database with lots of built-in features for handling the multi-user situation; in fact, that is a large part of why it is so popular. There is an upper limit on how many users can all use the database at the same time doing updates and edits. Depending on how knowledgeable the developer is about Access and how the database has been designed, that is anywhere from about 20 to 50 concurrent users: some Access databases can be built to handle up to 50 concurrent users, while many others can handle 20 or 25 concurrent users updating the database. These figures have been observed for databases that have been in use for several or more years and have been discussed many times on the Access newsgroups.
I have found that the SMB2 protocol introduced in Vista can lock Access databases.
It can be disabled by the following regedit:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\LanmanServer\Parameters]
"Smb2"=dword:00000000