Final touches cleaning Mediawiki tables after removing spam pages


As a testament to how good my SEO efforts have been for one of our websites, a wiki residing on the same domain got 2601 spam pages in 2 days (coincidence, got listed on SERPs 2 days ago...).
I have locked the wiki down (read only), enabled block lists, Captchas etc. etc. and used the Nuke extension to remove all the spam.
Now, this is remarkable for just one extension, but it still left stuff here and there, which I'd love to trim out.
Basically, Nuke (which I think is an official extension) left "orphaned" records in the following tables: pagelinks, searchindex, users.
I have no problem deleting records, but I don't want to break the database's relational consistency by pruning rows at random.
I am able to understand how to execute SQL queries, Linux command line scripts and all sorts of advanced stuff.
So, here are some questions for some helpful StackOverflow readers who know Mediawiki internals:
May I freely delete users table rows? I just need to keep two rows so the SQL query is easy. I just don't want to cause side effects with whatever other tables could need to link to them.
What could I do to remove the orphaned records in pagelinks? They clearly point to now gone pages, yet the default maintenance Mediawiki scripts I have used (first the nuke extension, then rebuildall.php) don't trim those orphans away.
This leads me to believe I might still have garbage somewhere causing the script to not remove the links pointing to it. However I have triple checked the pages... only the few pages made by us are left any more. I have purged the revisions as well.
I have tried using the console refreshLinks.php and orphans.php scripts but they did nothing relevant.
I am sure the pagelinks table can be further trimmed down, because by using the dumpLinks.php console maintenance script I can easily grep all sorts of "inconvenient" words and links.

Hopefully, you back up your databases at least once a day. In that case, assuming the wiki is rather new, it might have been easiest to simply revert to a non-spammed version of your DB and announce or manually repeat the changes made during these two days.
Generally, a relational database should have strict relations that won't allow you to leave it in inconsistent state by either presenting an error or cascading your action. Not sure how well MediaWiki defined its relations though.
I've removed rows from the users table and haven't noticed any problems.
I'd suggest removing the rows from pagelinks table and see what happens.
You could verify the sanity of your wiki by launching an automated crawler on it and seeing if any errors come up.
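Before deleting anything, it may help to list the link rows whose source page no longer exists. Below is a minimal sketch using an in-memory SQLite database as a stand-in for the wiki's MySQL database. The two-table layout (page.page_id, pagelinks.pl_from) is a simplified assumption modelled on MediaWiki's schema, and the sample rows are made up; check the column names against your own installation and take a backup before running anything similar for real.

```python
import sqlite3

# Toy stand-in for the wiki database. Table and column names are assumptions
# modelled on MediaWiki's page/pagelinks tables, not a copy of the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT);
    INSERT INTO page VALUES (1, 'Main_Page'), (2, 'Help');
    INSERT INTO pagelinks VALUES (1, 'Help'), (2, 'Main_Page'),
                                 (901, 'Cheap_pills'), (902, 'Casino');
""")

# A link row is treated as orphaned when its source page id is gone.
ORPHANS = """
    FROM pagelinks
    WHERE NOT EXISTS (SELECT 1 FROM page WHERE page.page_id = pagelinks.pl_from)
"""

# Count first, so the number can be sanity-checked before deleting.
orphan_count = conn.execute("SELECT COUNT(*) " + ORPHANS).fetchone()[0]
conn.execute("DELETE " + ORPHANS)
conn.commit()

remaining = conn.execute(
    "SELECT pl_from, pl_title FROM pagelinks ORDER BY pl_from"
).fetchall()
print(orphan_count, remaining)
```

Counting before deleting gives a chance to compare the number against the roughly known amount of spam before any row is removed.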

Related

How to check which MySQL-databases are still in use? #CleaningUp

I am having a cleaning situation over here, the old programmer didn't clean up his out-of-use databases & users.
While some of the databases are still in use by external sites (on other ftp-servers), some are obsolete and just cluttering up the system.
My question is: How can I figure out which databases (& users) are still in use by other websites? (without checking every website that has ever been created, wherever it might be located)
I need to be sure that a MySQL database (& user) is not actively being used by any site anymore, so I can safely delete it to clean up the system.
p.s.: It could also be that a database is still in use, but doesn't do any INSERTs or UPDATEs at all, and only SELECTs the data to load the website.
p.p.s.: I can't (temporarily) deactivate/remove databases (& users), because this will cause clients to lose revenue, customers, search ranking etc. etc. and in the end will cost us/me.

Well, if you want to catch even SELECTs, I think your option is the general query log:
https://dev.mysql.com/doc/refman/5.7/en/query-log.html
Enable it, parse it and get used dbs/tables
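The "parse it" step could look roughly like the sketch below. The sample log lines are a simplified assumption about the general query log layout (timestamp, thread id, command, argument), and the database and user names are invented; compare against a few lines of your real log and adjust the patterns before trusting the counts.

```python
import re
from collections import Counter

# Hypothetical excerpt; a real general query log may be laid out differently.
SAMPLE_LOG = (
    "2016-03-01T10:00:00.000000Z\t   12 Connect\tshop_user@web1 on shop using TCP/IP\n"
    "2016-03-01T10:00:00.100000Z\t   12 Query\tSELECT * FROM products\n"
    "2016-03-01T10:00:05.000000Z\t   13 Connect\tblog_user@web2 on  using TCP/IP\n"
    "2016-03-01T10:00:05.100000Z\t   13 Init DB\tblog\n"
    "2016-03-01T10:00:05.200000Z\t   13 Query\tSELECT * FROM posts\n"
)

CONNECT = re.compile(r"\bConnect\t(\S+)@\S+ on (\S*) using")
INIT_DB = re.compile(r"\bInit DB\t(\S+)")


def summarize(log_text):
    """Count which databases and which users appear in the log."""
    dbs, users = Counter(), Counter()
    for line in log_text.splitlines():
        match = CONNECT.search(line)
        if match:
            users[match.group(1)] += 1
            if match.group(2):  # database chosen at connect time, may be empty
                dbs[match.group(2)] += 1
            continue
        match = INIT_DB.search(line)
        if match:  # database selected after connecting
            dbs[match.group(1)] += 1
    return dbs, users


dbs, users = summarize(SAMPLE_LOG)
print(dict(dbs), dict(users))
```

Queries that name a database explicitly (db.table) are not caught by this sketch, so a database missing from the counts is a hint rather than proof that it is unused.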

MySQL - What happens when multiple queries hit the database

I am working on a project which will be used by around 500 employees in my organization. It is still in the development phase, and very few people (around 10) are using it. I'm using MySQL. I just want to know what happens if many users make front-end edits and then save at the same time. Some SELECT queries that I've written take as long as 6 seconds to execute. As only one query can be executed at any point in time, if a query is already in progress and another hits the database, will it create a problem? If this is a common situation in large-scale projects, please let me know how I can handle it. I'm not sure if I've made myself clear :). Any advice or links will be very helpful.

From a technical standpoint, no - nothing bad will happen. The database won't go ballistic and die on you; databases are made for concurrent access.
From a logical point of view, something bad will happen. If two people edit the same thing at the same time and then post it at the same time, it gets saved to the hard drive one after another. The last one to save is the one whose updates will be on the HDD, effectively causing the first person to lose their changes.
You can approach this problem from several angles. Some projects introduce the concept of locking (not table locking but in-app locking). It revolves around marking a record as locked using a boolean column, and if anyone tries to access that record for updating, the software says that someone else is editing it. It's really difficult to implement and most of the time it doesn't work as expected (I vaguely remember Joomla! using something like that; it was one of the most annoying features ever).
The other option you have is to save each update as a revision. That way you can keep track of who updated what and when, and you never lose records that would otherwise be overwritten. I believe that SO and Wikipedia use that approach and it works really well, because you can inspect what two or more people have done and merge their contributions.

Optimistic Concurrency Control
http://en.wikipedia.org/wiki/Optimistic_concurrency_control
Make sure that each record contains date metadata on last changed/modified time, and load that as part of your data object. Then when attempting to commit the row to database, check the last_modified time in the table to ensure that it is the SAME as the one stored in memory for your object. If it matches, commit it, else throw exception.
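A minimal sketch of that check, using SQLite in place of MySQL so it runs on its own. The table, the column names, the integer timestamps and the StaleRecordError exception are all made up for illustration. The comparison is done inside the UPDATE's WHERE clause, so the check and the write happen in one statement rather than as a separate SELECT followed by an UPDATE.

```python
import sqlite3


class StaleRecordError(Exception):
    """Raised when the row changed after it was loaded (hypothetical name)."""


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, body TEXT, last_modified INTEGER)"
)
conn.execute("INSERT INTO article VALUES (1, 'original', 100)")


def load(article_id):
    """Load the row together with the last_modified value seen at read time."""
    body, last_modified = conn.execute(
        "SELECT body, last_modified FROM article WHERE id = ?", (article_id,)
    ).fetchone()
    return {"id": article_id, "body": body, "last_modified": last_modified}


def save(article, new_body, now):
    """Write only if last_modified still matches what this editor loaded."""
    cur = conn.execute(
        "UPDATE article SET body = ?, last_modified = ? "
        "WHERE id = ? AND last_modified = ?",
        (new_body, now, article["id"], article["last_modified"]),
    )
    if cur.rowcount == 0:  # no row matched: someone else saved first
        raise StaleRecordError(f"article {article['id']} was modified elsewhere")
    conn.commit()


# Two editors load the same version; the second save is rejected.
alice = load(1)
bob = load(1)
save(alice, "alice's edit", now=101)
try:
    save(bob, "bob's edit", now=102)
    conflict = False
except StaleRecordError:
    conflict = True

final_body = conn.execute("SELECT body FROM article WHERE id = 1").fetchone()[0]
print(conflict, final_body)
```

What to do on a conflict (reload and retry, or show both versions to the user) is an application decision that the sketch leaves open.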

Decentralized synchronized secure data storage

Introduction
Hi, I am going to ask a question which seems utopic for me, but I need to know if there is a way to achieve what I need. And if not, I need to know why not.
The idea
Suppose I have a database structure, in MySql.
I want to create some solution to allow anyone (no matter who, no matter where) to have a synchronized copy (updated clone) of this database (with its content)
Well, and it is not going to be just one synchronized copy, it could (and should) be a multiple replication (supposing the basic, this means, for example, ten copies all over the world)
And, the most important thing: It must be secure. By secure I mean only real-accepted transactions will be synchronized with all the others (no matter how many) database copies/clones.
Note: Since it would be quite difficult to make the synchronization in real-time, I will design everything to make this feature dispensable. So it is not required.
My auto-suggestion
This is how I am thinking to manage it:
Time identifiers and Updates checking: Every action (insert, update, delete...) will be stored as the action instruction itself, associated with the time identifier. [I think better than a DATETIME field, it'll be an INT one, with the number of milliseconds passed since 1 January 2013, for example]. So each copy is going to ask the "neighbour copy" for new actions done since the last update, and execute them after checking they are allowed.
Problem 1: the "neighbour copy" could be outdated too.
Solution 1: do not ask just one neighbour, create a random list with some of the copies/clones and ask them for news (I could avoid the list and ask ALL the clones for updates, but this will be inefficient if clones number ascends too much).
Problem 2: Real-time global synchronization is not active. What if...
Someone at CLONE_ENTERPRISING inserts a row into TABLE.
... this row goes to every clone ...
Someone at CLONE_FIXEMALL deletes this row.
... and at the same time, somewhere in an outdated clone ...
Someone at CLONE_DROPOUT edits this row (now nonexistent at the other clones)
Solution 2: easy stuff, force a GLOBAL synchronization before doing any new "depending-on-third-data action" (edit, for example). This global synch. will be unnecessary when making an INSERT, for instance.
Note: Well, someone could have some fun, and make the same insert in two clones... since they're not getting updated in real-time, this row will exist twice. But, it's the same as when we have one single database, in some needed cases we check if there is an existing same-row before doing the final action. Not a problem.
Problem 3: It is possible to edit the code so that it does not filter actions, so someone could spread instructions to delete everything, or just do some trolling. This is not a problem, since good clones will always exist somewhere. Clones that have gone bad will no longer be of interest.
I'd really appreciate it if you read this. I know this is not the perfect solution and it possibly has hundreds of holes, but it is my basic start. I will appreciate anything you can teach me. Thanks a lot.
PS.: It could be that all this I am trying already exists and has its own name. Sorry for asking then (I'd anyway thank this name, if it exists)

I would suggest a look at Sync Framework from Microsoft. It might be better suited to SQL Server but it should work with MySQL too. The problem you are tackling is quite a complex one.

Generate general schema of a huge unknown database

I am required to make a general schema of a huge database that I have never used.
The problem is that I do not know how or where to start because, leaving the size aside, I have no idea what each table is for. I can guess some, but for the majority of them the generic field names tell me nothing.
Do you have any advice? What could I do?
There is no documentation about the database and the creators are not able to help me because they are in another company now.
Thank you very much in advance.

This isn't going to be easy.
Start by gathering any documentation, notes, etc. that exist. Also, it'll greatly help to have a thorough understanding of the type of data being stored, and of the application. Keep ample notes of your discoveries, and build the documentation that should have been built before.
If your database contains declared foreign keys, you can start there, and at least get down the relationships between the tables, keeping in mind that this may be incomplete. As @John Watson points out, if the relationships are declared, there are tools to do this for you.
Check for stored functions and procedures, including triggers. Though these are somewhat uncommon in MySQL databases. Triggers especially will often yield clues ("every update to table X inserts a new row to table Y" -> "table Y is probably a log or audit table").
Some of the tables are hopefully obvious, and if you know what is related to them, you may be able to start figuring out those related tables.
Hopefully you have access to application code, which you can grep and read to find clues. Access to a test environment which you can destroy repeatedly will be useful too ("what happens if I change this in the app, where does the database change?"; "what happens if I scramble these values?"; etc.). You can dump tables and use diff on them, provided you dump them ordered by primary or unique key.
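The dump-and-diff idea can be sketched like this, with SQLite standing in for the real database. The table t_status, its columns and the UPDATE that plays the role of "something done in the app" are invented for illustration. The important part is the ORDER BY on the key, which keeps unchanged rows in the same position so that only real changes show up in the diff.

```python
import difflib
import sqlite3


def dump_table(conn, table, key):
    """Return the table as text lines, ordered by key so dumps are comparable.

    table and key are interpolated into the SQL, so pass trusted names only.
    """
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY {key}")
    return ["\t".join(map(str, row)) for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_status (id INTEGER PRIMARY KEY, code INTEGER, label TEXT);
    INSERT INTO t_status VALUES (2, 3, 'b'), (1, 3, 'a');
""")

before = dump_table(conn, "t_status", "id")
# Stand-in for an action performed through the application under study.
conn.execute("UPDATE t_status SET code = 7 WHERE id = 2")
after = dump_table(conn, "t_status", "id")

# Keep only the added/removed lines, dropping the diff headers and context.
changes = [
    line
    for line in difflib.unified_diff(before, after, lineterm="")
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changes)
```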
Doing queries like SELECT DISTINCT foo FROM table can help you see what different things can be in a column.
If it's possible to start from a mostly-empty database (e.g., minimal to get the app to run), you can observe what changes as you add data to the app. It's much quicker to dump the database when it's small. Same for diffing it, same for reading through the output. Some things are easier to understand in a tiny database, but some things are more difficult. When you have a huge dataset and a column is always 3, you can be much more confident it always is.
You can watch SQL traffic from the application(s) to get an idea of what tables and columns they access for each function, and how they join them. Watching SQL traffic can be done in application-specific ways (e.g., DBI trace) or server-specific ways (turn on the general query log) or with a packet tracer like Wireshark or tcpdump. Which is appropriate is going to depend on the environment you're working in. E.g., if you have to do this on a production system, you probably want Wireshark. If you are doing this in dev/test, the disadvantage of the MySQL query log is that all the apps may very well be mixed together, and if multiple people are hitting the apps it'll get confusing. The app-specific log probably won't suffer from this, but of course the app may not have that.
Keep in mind the various ways data can be stored. For example, all four of these could mean May 1, 1980:
1980-05-01 — As a DATE, TIMESTAMP, or text.
2444360.5 — Julian day (with time, specifies midnight)
44360 — Modified Julian day
326001600 — UNIX timestamp (with time, specifies midnight) assuming local time is US Eastern Time (seconds since Jan 1 1970 UTC)
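When a column holds a suspicious-looking number, it can help to compute the candidate encodings of a known date and compare. The sketch below does that for May 1, 1980 with the standard library only. The fixed UTC-4 offset is an assumption standing in for US Eastern daylight time on that date; other dates or zones need their own offset.

```python
from datetime import date, datetime, timedelta, timezone

DAY = date(1980, 5, 1)

# date.toordinal() counts days with 0001-01-01 as day 1; adding 1721424.5
# converts that count to the Julian day number at midnight.
julian_day = DAY.toordinal() + 1721424.5

# The modified Julian day is the Julian day minus 2400000.5.
modified_julian_day = julian_day - 2400000.5

# Midnight local time, assuming a fixed UTC-4 offset (US Eastern daylight time).
EDT = timezone(timedelta(hours=-4))
unix_timestamp = int(datetime(1980, 5, 1, tzinfo=EDT).timestamp())

print(julian_day, modified_julian_day, unix_timestamp)
```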
There may be things in the database which are denormalized, and some of them may be denormalized incorrectly. E.g., you may be wondering "why does this user have a first name Bob in one table, and a first name Joe in another?" and the answer is "data corruption".
There may be columns that aren't used. There may be entire tables that aren't used. Despite this, they may still have data from older versions of the app (or other, no-longer-in-use apps), queries run from the MySQL console, etc.
There may be things which aren't visible in the application anywhere, but are used. Their purpose may be completely non-obvious without knowledge of the algorithms implemented in the app(s). For example, a search function in an app may store all kinds of precomputed information about the documents to search and their connections. Worse, these tables may only be updated by batch jobs, so changing a document won't touch them (making you mistakenly believe they have nothing to do with documents). Then, you come in the next morning, and the table is mysteriously very different. Though, in the search case, a query log when running search will tell you.

Try using the free MySQL Workbench (it's specific to MySQL).
I have reverse engineered databases this way and also ended up with great Entity Relationship Diagrams!
I've worked with SQL for 20 years and this product really is great (it's free, from the MySQL folks themselves).
It can have occasional problems, crashes, etc. (at least it did on Ubuntu 10), but they've been relatively rare and far outweighed by the benefits! It's also actively developed, so bugs are actually fixed on an ongoing basis.

Assuming that nobody bothered to declare foreign keys in the table definition, and the database belongs to an application which is in use, after grabbing the current schema, the next step for me would be to enable logging of all queries (hoping that the data does NOT use a trivial ORM like [x]hibernate) to identify joins and data semantics.
This perl script may be helpful.

MS Access antiquated? Anything new in 2011?

Our company has a database of 17,000 entries. We have used MS Access for over 10 years for our various mailings. Is there something new and better out there? I'm not a techie, so keep in mind when answering. Our problems with Access are:
- no record of what was deleted,
- will not turn up a name in a search if cap's or punctuation is not entered exactly,
- is complicated for us to understand the de-duping process.
- We'd like a more nimble program that we can access from more than one dedicated computer.

The only applications I know of that are comparable to Access are FileMaker Pro, and the database component of the Open Office suite. FM Pro is a full-fledged product and gets good marks for ease of use from non-technical users, while Base is much less robust and is not nearly as easy for creating an application.
All of the answers recommending different databases really completely miss the point here -- the original question is about a data store and application builder, not just the data store.
To the specific problems:
Problem #1: no record of what was deleted
This is a design error, not a flaw in Access. There is no database that really keeps a record of what's deleted unless someone programs logging of deleted data.
But backing up a bit, if you are asking this question it suggests that you've got people deleting things that shouldn't be deleted. There are two solutions:
regular backups. That would mean you could restore data from the last backup and likely recover most of the lost data. You need regular backups with any database, so this is not really something that is specific to Access.
design your database so records are never deleted, just marked deleted and then hidden in data entry forms and reports, etc. This is much more complicated, but is very often the preferred solution, as it preserves all the data.
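The "marked deleted and then hidden" design can be sketched as follows. SQLite and Python are used here only so the example runs on its own; in Access the same idea would be a date field plus queries that filter on it. The contact table, the deleted_at column and the active_contact view are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- deleted_at stays NULL while the record is live.
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
    -- Forms and reports read from this view, so deleted rows stay hidden.
    CREATE VIEW active_contact AS
        SELECT id, name FROM contact WHERE deleted_at IS NULL;
    INSERT INTO contact (name) VALUES ('Ann Lee'), ('Bob Ray');
""")


def soft_delete(contact_id, when):
    """Mark the record as deleted instead of removing the row."""
    conn.execute(
        "UPDATE contact SET deleted_at = ? WHERE id = ?", (when, contact_id)
    )
    conn.commit()


soft_delete(2, "2011-06-01")

visible = [row[0] for row in conn.execute("SELECT name FROM active_contact ORDER BY id")]
total = conn.execute("SELECT COUNT(*) FROM contact").fetchone()[0]
print(visible, total)
```

The row for the deleted contact is still in the table, with the date it was removed, so it can be reviewed or restored later.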
Problem #2: will not turn up a name in a search if cap's or punctuation is not entered exactly
There are two parts to this, one of which is understandable, and the other of which makes no sense.
punctuation -- databases are stupid. They can't tell that Mr. and Mister are the same thing, for instance. The solution to this is that for all data that needs to be entered in a regularized fashion, you use all possible methods to ensure that the user can only enter valid choices. The most common control for this is a dropdown list (i.e., "combo box"), which limits the choices the user has to the ones offered in the list. It ensures that all the data in the field conforms to a finite set of choices. There are other ways of maintaining data regularity and one of those involves normalization. That process avoids the issue of repeatedly storing, say, a company name in multiple records -- instead you'd store the companies in a different table and just link your records to a single company record (usually done with a combo box, too). There are other controls that can be used to help ensure regularity of data entry, but that's the easiest.
capitalization -- this one makes no sense to me, as Access/Jet/ACE is completely case-insensitive. You'll have to explain more if you're looking for a solution to whatever problem you're encountering, as I can't conceive of a situation where you'd actually not find data because of differences in capitalization.
Problem #3: is complicated for us to understand the de-duping process
De-duping is a complicated process, because it's almost impossible for the computer to figure out which record among the candidates is the best one to keep. So, you want to make sure your database is designed so that it is impossible to accidentally introduce duplicate records. Indexing can help with this in certain kinds of situations, but when mailing lists are involved, you're dealing with people data which is almost impossible to model in a way where you have a unique natural key that will allow you to eliminate duplicates (this, too, is a very complicated topic).
So, you basically have to have a data entry process that checks the new record against the existing data and informs the user if there's a duplicate (or near match). I do this all the time in my apps where the users enter people -- I use an unbound form where they type in the information that is the bare minimum to create a new record (usually some combination of lastname, firstname, company and email), and then I present a list of possible matches. I do strict and loose matching and rank by closeness of the match, with the closer matches at the top of the list.
Then the user has to decide if there's a match, and is offered the opportunity to create the duplicate anyway (it's possible to have two people with the same name at the same company, of course), or instead to abandon adding the new record and go to one of the existing records that was presented as a possible duplicate.
This leaves it up to the user to read what's onscreen and make the decision about what is and isn't a duplicate. But it maximizes the possibility of the user knowing about the dupes and never accidentally creating a duplicate record.
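A rough sketch of the ranked possible-matches list described above, using Python's difflib for the loose comparison. This is not the Access form itself, only an illustration of the matching step; the record layout, the normalization rules and the 0.8 threshold are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case, drop periods and commas, and collapse whitespace."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def possible_matches(new, existing, threshold=0.8):
    """Return existing records that resemble the new one, closest first."""
    key = normalize(f"{new['first']} {new['last']}")
    scored = []
    for record in existing:
        candidate = normalize(f"{record['first']} {record['last']}")
        score = SequenceMatcher(None, key, candidate).ratio()
        if score >= threshold:  # exact matches score 1.0, near matches less
            scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]


existing = [
    {"first": "Jon", "last": "Smith"},
    {"first": "John", "last": "Smith"},
    {"first": "Mary", "last": "Jones"},
]
matches = possible_matches({"first": "john", "last": "smith."}, existing)
names = [record["first"] for record in matches]
print(names)
```

The exact match is listed first and the near match second, and the user still makes the final call on whether either one is really the same person.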
Problem #4: We'd like a more nimble program that we can access from more than one dedicated computer.
This one confuses me. Access is multi-user out of the box (and has been from the very beginning, nearly 20 years ago). There is no limitation whatsoever to a single computer. There are things you have to do to make it work, such as splitting your database into two parts, one part with just the data tables, and the other part with your forms and reports and such (and links to the data tables in the other file). Then you keep the back end data file on one of the computers that acts as a server, and give a copy of the front end (the reports, forms, etc.) to each user. This works very well, actually, and can easily support a couple of dozen users (or more, depending on what they are doing and how well your database is designed).
Basically, after all of this, I would tend to second @mwolfe02's answer, and agree with him that what you need is not a new database, but a database consultant who can design for you an application that will help you manage your mailing lists (and other needs) without you needing to get too deep into the weeds learning Access (or FileMaker or whatever). While it might seem more expensive up front, the end result should be a big productivity boost for all your users, as well as an application that will produce better output (because the data is cleaner and maintained better because of the improved data entry systems).
So, basically, you either need to spend money upfront on somebody with technical expertise who would design something that allows you to do better work (and more efficiently), or you need to invest time in upping your own technical skills. None of the alternatives to Access are going to resolve any of the issues you've raised without significant investment in interface design to further the goals you have (cleaner data, easier to find information, etc.).

At the risk of sounding snide, what you are really looking for is a consultant.
In the hands of a capable programmer, all of your issues with Access are easily handled. The problems you are having are not the result of using the wrong tool, but using that tool less than optimally.
Actually, if you are not a techie then Access is already the best tool for you. You will not find a more non-techie friendly way to build a data application from bottom to top.
That said, I'd say you have three options at this point:
Hire a competent database consultant to improve your application
Find commercial off-the-shelf (COTS) software that does what you need (I'm sure there are plenty of products to handle mailings; you'll need to research)
Learn about database normalization and building proper MS Access applications
If you can find a good program that does what you want then #2 above will maximize your Return on Investment (ROI). One caveat is that you'll need to convert all of your existing data, which may not be easy or even possible. Make sure you investigate that before you buy anything.
While it may be the most expensive option up-front, hiring a competent database consultant is probably your best option if you need a truly custom solution.

SQL Server sounds like a viable alternative for your scenario. If cost is a concern, you can always use SQL Server Express, which is free. Full-blown SQL Server provides a lot more functionality that might not be needed right away. Express is a lot simpler, as the number of features provided with it is much smaller. With either version, though, you will have a centralized store for your data and the ability to have all transactions recorded in the transaction log. Also, both have the ability to import data from an Access database.
The newest version of SQL Server is 2008 R2

You probably want to take a look at modern databases. If you're into Microsoft-based products, start with SQL Server Express
EDIT: However, since I understand that you're not a programmer yourself, you'd probably be better off having someone experienced look into your technical problem more deeply, like the other answer suggests.

It sounds like you may want to consider a front-end for your existing Access data store. Microsoft has yet to replace Access per se, but they do have a new tool that is a lot lower on the programming totem pole than some other options. Check out Visual Studio Lightswitch - http://www.microsoft.com/visualstudio/en-us/lightswitch.
It's fairly new (still in beta) but showing potential. With it, just as with any Visual Studio project, you can connect to an MS Access datasource and design a front-end to interact with it. The plus-side here is programming requirements are much lower than with straight-up Visual Studio (read: Wizards).

Given that replacing your Access DB will require some front-end programming, you may look into VistaDB. It should allow your front end to be created in .NET with an XCopy database on the backend without requiring a server. One plus is that it retains SQL Server syntax, so if you do decide to move to SQL Server you'll be one step ahead.
(Since you're not a techie and may not understand my previous statement, you might pass my answer on to the consultant/programmer/database guy who is going to do the work for you.)
http://www.vistadb.net/
