Some database sanitization questions - mysql

We have a database of around 250k records which we want to sanitize, and there are some queries which I just don't know how to write:
*clear words containing a substring. For example, if a word contains the substring "cache", delete the entire word:
"cachelkjdlkjalkjs here happened something" => "here happened something"
*delete rows that include a number of more than 2 digits, with the exception of a couple of cases, for example: the 3 digits 365 are permitted.
so:
"365 days a year, we do that" => Do nothing
"798 is a random number" => DELETE
*check the number of words, and delete records with fewer than X words.
Any help would be appreciated.

First back up the database!
I would first draw up a list of valid words (along with the numbers 0...99, 365 and any others you can think of). I would then write a script (in the language of your choosing) to go through the rows. For each row, retrieve the words, punctuation, and numbers, and check that each one is valid. Reconstruct the entry from the valid ones and spit out the bits that do not match. I would then look over the bits that did not match to make sure you have not missed anything.
I would first do this in a passive mode (i.e. do not change the database) until you are happy that things are OK.
Hope that helps.
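As a rough illustration of the three rules from the question, here is a minimal Python sketch that works in the passive mode described above: it only computes what would happen to a row and never touches the database. The names sanitize, ALLOWED_NUMBERS, BANNED_SUBSTRING and MIN_WORDS are made up for this example, "more than 2 digits" is read as "a run of three or more consecutive digits", and the value 3 merely stands in for the question's unspecified X.

```python
import re

ALLOWED_NUMBERS = {"365"}   # assumption: whitelisted exceptions to the digit rule
BANNED_SUBSTRING = "cache"  # taken from the question's example
MIN_WORDS = 3               # assumption: stands in for the question's "X"


def sanitize(text):
    """Return the cleaned text, or None if the row should be deleted."""
    # Rule 1: drop every word that contains the banned substring.
    words = [w for w in text.split() if BANNED_SUBSTRING not in w.lower()]
    cleaned = " ".join(words)

    # Rule 2: reject rows holding a run of more than 2 digits,
    # unless that number is explicitly whitelisted.
    for number in re.findall(r"\d+", cleaned):
        if len(number) > 2 and number not in ALLOWED_NUMBERS:
            return None

    # Rule 3: reject rows with too few words left.
    if len(words) < MIN_WORDS:
        return None

    return cleaned
```

Running this over an export of the table and reviewing the rows that come back as None (or changed) gives a dry run before any UPDATE or DELETE is issued.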

Related

Access 2013 Count

I am working on a report in Access 2013. I need to separate the first 20 records in a column that contain a value and assign a name to them. For example, records 1-20 should get Lot 1, records 21-40 should get Lot 2, etc. The report needs to be separated into lots of 20. I can also just insert a line when it reaches a set of 20, without a name, if that makes it easier. I just need something to show a break at sets of 20.
Example: As you can see, the report is separated by welder stencil. When the count in the VT column reaches 20, I need to enter a line or some type of divider to separate the data. Our client is asking us to separate the VT in sets of 20. I don't know what the easiest way to accomplish this is. I have researched it but haven't found anything.
Example Report with Divisions
Update the report's RecordSource query by adding "Lot" values for each row. There are multiple ways of doing this, but the easiest will be if your records already have a sequential, continuous numerical key. If they do not have such a key, you can research generating such sequential numbers for your query, but it is beyond the scope of this question and no details about the actual data schema were supplied in the question.
Let's imagine that you have such a key column [Seq]. You can use the modulo (Mod) and/or integer division (\ - backslash) operators to find the rows that start a new set of 20, e.g. ([Seq] - 1) Mod 20 = 0.
Generate a lot value for each row. An example SQL snippet that numbers the lots from 1: SELECT ("Lot " & ((([Seq] - 1) \ 20) + 1)) As LotNumber ...
Utilize Access report sorting and grouping features (grouping on the new LotNumber field) to print a line and/or label at the start of each group. You can also have the report start a new page at the beginning or end of such a group.
The details about grouping can be found elsewhere in tutorials and Access documentation and are beyond the scope of this question.
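To check the arithmetic outside Access, here is a small Python sketch of the same idea. It assumes a 1-based, gap-free sequence number like the [Seq] column imagined above, and numbers lots from 1 as the question asks. The function names are hypothetical, and Python's // and % play the role of the Access \ and Mod operators.

```python
def lot_label(seq, lot_size=20):
    """Label for the lot a row belongs to; seq is assumed 1-based and gap-free."""
    return f"Lot {(seq - 1) // lot_size + 1}"


def starts_new_lot(seq, lot_size=20):
    """True for the first row of each lot (seq = 1, 21, 41, ...)."""
    return (seq - 1) % lot_size == 0
```

Subtracting 1 before dividing is what keeps row 20 in the first lot instead of pushing it into the second.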

Capitalize Just the last Letter in string - MS Access

I have a column in my Access database table. I ran a query to convert it to proper case using StrConv([MyColumn],3), but the last two letters are state abbreviations, and the query turns SOmeThing, soMethINg, NY into Something, Something, Ny.
I want the result as Something, Something, NY
Is there another query I can run afterwards to capitalize the last letter?
You can use:
UcaseLast: Left([YourColumn], Len([YourColumn]) - 1) & UCase(Right([YourColumn], 1))
Well, most people would tell you to store your 'address', 'city', and 'state' as separate fields. Then you Proper Case each separately and concatenate them together. If you can do that... that is your best approach.
If this is a database or file that's been tossed at you and you can't make the field/table changes... it's still possible to get your desired results. However, you had better make sure all strings end with your state code. Also make sure you don't have foreign addresses, since addresses in other countries may not end with a two-letter code.
But if you are sure all records contain two letter state abbreviations, you can continue with the following:
MyColumnAdj: StrConv(Mid([MyColumn],1,len([MyColumn])-2),3) + StrConv(right([MyColumn],2),1)
This takes the substring of [MyColumn] from position 1 up to its length minus 2 (leaving off the state code) and converts it to proper case.
It then concatenates (using the plus sign) the rightmost 2 characters of [MyColumn], converted to upper case.
Once again, this is dangerous if the field doesn't have the State Code consistently at the end of the string.
Best of luck. Hope this helps. :)
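For anyone who wants to try the split-and-recombine logic outside Access, here is a rough Python equivalent. It is only a sketch: the function name is invented, and str.title() is used as a stand-in for StrConv(..., 3), which it resembles but does not match in every edge case (for example around apostrophes). As above, it assumes every value ends in a two-letter state code.

```python
def proper_case_keep_state(text):
    """Proper-case everything except the last two characters, which are upper-cased."""
    body, state = text[:-2], text[-2:]
    # str.title() approximates Access proper case; the tail is forced to upper case.
    return body.title() + state.upper()
```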

Remove string with wildcard in Notepad++

I'm trying to merge multiple JSON data sets into one large data set, due to a max limit of 100 on the server I'm pulling them from.
The easiest way to do this would be to eliminate the end of one set and the beginning of the next and replace it with "," so that there would be only one open and close to the entire large set. This is what appears between the last entry of one set and the first entry of the next currently:
],"version":"1.0"}{"error":"OK","limit":100,"offset":100,"number_of_page_results":100,
"number_of_total_results":20235,"status_code":1,"results":[
Again, I need that entire string replaced with just a comma, but the problem I'm encountering is that I had to change the offset between each data set to grab the next 100 entries, so the "offset":100, is different in each string ("offset":200, "offset":300, etc.). I can't seem to get wildcards to cooperate. I suspect it has something to do with all the brackets that are already in the string.
Any help would be appreciated. Thank you.
A regular expression that matches the whole input you provided (provided there are no newline characters in it) is:
\],"version":"1\.0"\}\{"error":"OK","limit":[0-9]+,"offset":[0-9]+,"number_of_page_results":[0-9]+,"number_of_total_results":[0-9]+,"status_code":[0-9]+,"results":\[
It accepts any run of digits in place of each number in your sample (except the version).
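The same pattern can be tried outside Notepad++ as well. The Python sketch below applies it with re.sub and replaces every boundary between two result pages with a comma; the function name merge_pages and the tiny sample data in the usage note are made up for illustration, and the input is assumed to contain no line breaks inside a boundary.

```python
import re

# The pattern from the answer, split across lines only for readability.
BOUNDARY = re.compile(
    r'\],"version":"1\.0"\}\{"error":"OK","limit":[0-9]+,"offset":[0-9]+,'
    r'"number_of_page_results":[0-9]+,"number_of_total_results":[0-9]+,'
    r'"status_code":[0-9]+,"results":\['
)


def merge_pages(text):
    """Replace each end-of-page/start-of-page boundary with a single comma."""
    return BOUNDARY.sub(",", text)
```

Because the offset is matched with [0-9]+, the same call handles "offset":200, "offset":300 and so on, and the merged text can then be checked with a JSON parser.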

SSRS - Expression to replace all but last four characters?

Apologies in advance if question is too basic. I searched but couldn't find anything specifically applicable to reporting services.
I'm working on a report that's currently returning the full value that's being queried.
Currently, the expression for that text field is
=Fields!VarName.Value
What I'm looking to do is return only the last four with a set of *s to represent whatever the preceding digits are. Even though the number of digits is going to vary, it's not important that I match digit for digit, so I'm fine with just inserting a set number of *s and then the last four. I figured this would be easier. I've tried this:
="*****" & Right(Fields!VarName.Value,4)
That returns the stars, but not the actual values. Am I just completely off the mark on how to get those last four numbers?
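Right(RTrim(Fields!VarName.Value),4) was the correct answer: the value presumably had trailing spaces, so they need to be trimmed off before taking the last four characters.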
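The effect of the trim is easy to see in a few lines of Python. This is only an illustration of the logic, not SSRS code; the function name and the padded sample value in the usage note are invented.

```python
def mask_all_but_last_four(value, stars="*****"):
    """Strip trailing whitespace, then keep only the last four characters."""
    return stars + str(value).rstrip()[-4:]
```

Without the rstrip() call, a value such as "1234567890    " would yield four spaces after the stars, which looks like the symptom described in the question.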

MySQL Behaviour of Wildcard LIKE matching

I have the following data in TableA...
ID | Text
---------------------------------------------
1 | let's find this document
2 | docments are closed
...and if I do the following select...
select Text from TableA where Text like '%doc%';
...I seem to get a strange result. Both rows are returned. With this select, should it not only return row 1? I would have thought that..
select Text from TableA where Text like 'doc%';
...would have returned just row 2. Am I missing something?
What I'm trying to do is run 3 separate searches across this data as part of my searching tool. The first search looks for the specified pattern "doc" at the beginning of a string, the second looks for the same pattern at the end of a string, and the third identifies whether the pattern appears anywhere within the text, so it can have text surrounding it. Ideally, the first search would only match row 2, the second search would return no results and the third search would only return row 1.
The reason for doing it like this is that I wanted to get a feel for how the pattern matched the string. It would make the results easier to read if I knew that the pattern for a given row matched either (a) at the beginning, (b) at the end, or (c) anywhere in the middle. I had thought about using regexp, but my data is unicode.
No, the first query returns both rows, because % means 0 or more characters. So if doc is the first thing appearing in the field, it matches the %doc% pattern as well.
But you're right on the second query, it will only return row 2.
doc_% should match it at the beginning, having at least one character after it.
%_doc should match it at the end, having at least one character before it.
%_doc_% should match it anywhere, having at least one character before and after it.
Note that these strict criteria fail to find the exact string "doc", i.e. with nothing before or after it. You may want to include this case in, say, query #1, by loosening it:
doc% should match it at the beginning, having any number of characters after it.
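To experiment with these patterns without a database, here is a small Python sketch that emulates LIKE by translating % to "any run of characters" and _ to "exactly one character". It is a simplification under stated assumptions: the helper name like is made up, escape characters are not handled, and matching is case-insensitive, which mirrors common MySQL collations but depends on how the column is defined.

```python
import re


def like(pattern, text):
    """Rough emulation of SQL LIKE: % is any run of characters, _ is exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # fullmatch mirrors LIKE, which must match the whole value.
    return re.fullmatch("".join(parts), text, re.DOTALL | re.IGNORECASE) is not None


rows = ["let's find this document", "docments are closed"]
at_start = [r for r in rows if like("doc_%", r)]    # only the second row
at_end = [r for r in rows if like("%_doc", r)]      # no rows
in_middle = [r for r in rows if like("%_doc_%", r)] # only the first row
```

With the two sample rows this gives the split the question was after, and like("doc_%", "doc") is False while like("doc%", "doc") is True, which is the exact-match caveat noted above.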