Does using NULL to fill empty cells in my tables make searches/queries faster?
For example, is this
90 2 65 2011-04-08 NULL NULL
134 2 64 2011-04-13 NULL 07:00:00
135 2 64 2011-04-13 NULL 07:00:00
136 2 64 2011-04-13 NULL 22:45:00
137 2 64 2011-04-14 NULL 19:30:00
better than
90 2 65 2011-04-08
134 2 64 2011-04-13 07:00:00
135 2 64 2011-04-13 07:00:00
136 2 64 2011-04-13 22:45:00
137 2 64 2011-04-14 19:30:00
If anyone could tell me any specific advantage to using NULL (performance, good practice, etc) that would be much appreciated.
There is a semantic difference.
NULL means "value unknown or not applicable".
If this describes the data for that column, then use it.
The empty string means "value known, and it's 'nothingness'".
If this describes the data for that column, then use it. Of course, this applies only to strings; it's unusual to have such a scenario for other data types, though for numeric fields a value of 0 would usually be the appropriate counterpart.
In short, it depends mostly on what your fields mean. Worry about performance and profiling later, when you know what data you're representing. (NULL vs "" is never going to be your bottleneck, though.)
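A minimal sketch of the practical difference in a query, assuming a hypothetical contacts table with a nullable middle_name column (MySQL syntax):

-- Hypothetical table for illustration only
CREATE TABLE contacts (
    id INT PRIMARY KEY,
    middle_name VARCHAR(50)   -- NULL = unknown; '' = known to be empty
);

INSERT INTO contacts VALUES (1, 'Marie'), (2, ''), (3, NULL);

-- Matches only id 2: the empty string is a known, comparable value
SELECT id FROM contacts WHERE middle_name = '';

-- Matches only id 3: NULL never satisfies '=' and must be tested with IS NULL
SELECT id FROM contacts WHERE middle_name IS NULL;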
I could really use some help with the problem I'm facing. I'm trying to dynamically add columns to a query based on the number of times a user_id occurs in this example event_table.
So, this is the example event_table. Every time the user_id comes up, I want another column with CONCAT("event_", times_occured) added to the result.
id   user_id   create_date
1    344       2021-05-25
2    25        2021-05-25
3    344       2021-07-06
4    344       2021-07-07
5    3245      2021-08-25
6    52        2021-09-14
7    52        2021-10-11
The query result should be formed this way.
user_id   event_1      event_2      event_3
25        2021-05-25   null         null
52        2021-09-14   2021-10-11   null
344       2021-05-25   2021-07-06   2021-07-07
3245      2021-08-25   null         null
I'm not sure if this is possible, and if it is, do I need to use recursion or loops?
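Not an answer from the original thread, but a sketch of the usual approach: in MySQL 8+ you can number each user's events with ROW_NUMBER() and pivot with conditional aggregation. A single static query cannot have a truly dynamic column count, so this hard-codes the maximum of three events; a fully dynamic version would build this SQL string in application code or a prepared statement.

SELECT user_id,
       MAX(CASE WHEN rn = 1 THEN create_date END) AS event_1,
       MAX(CASE WHEN rn = 2 THEN create_date END) AS event_2,
       MAX(CASE WHEN rn = 3 THEN create_date END) AS event_3
FROM (
    -- number each user's events in chronological order
    SELECT user_id, create_date,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY create_date, id) AS rn
    FROM event_table
) AS numbered
GROUP BY user_id
ORDER BY user_id;

No recursion or loops are needed for a fixed maximum; only the dynamic-column case requires generating the statement.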
I want to perform OCR on images like the one described here: a table of numerical data with commas as decimal separators.
It is not noisy, the contrast is good, and it is black text on a white background.
As an additional preprocessing step, in order to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues), and pass only that single-cell image to tesseract.
I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. Two examples of the input cells for tesseract are the ones containing "1,7" and "57", referenced below.
Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.
There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".
Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".
Any suggestions what else might be helpful in improving the output quality of tesseract here?
My tesseract version:
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE
Solution
Divide the image into rows
Apply division normalization to each row
Set psm to 6 (assume a single uniform block of text)
Read
From the original image, we can see the data is laid out in rows.
In each iteration, we will take one row, apply normalization, and read it.
We need to choose the crop indexes carefully.
import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale for normalization
(h, w) = gry.shape[:2]

start_index = 0
end_index = int(h/5)  # first attempt: split the height into 5 equal chunks
Question: Why do we declare start and end indexes?
We want to read a single row in each iteration. Let's look at an example:
The current image height and width are 645 and 1597 pixels.
Divide the image based on the indexes:
start-index   end-index
0             129
129           258 (129 + 129)
258           387 (258 + 129)
387           516 (387 + 129)
Let's see whether the cropped images are readable:
start-index   end-index
0             129
129           258
258           387
387           516
(cropped row images omitted)
No, they are not suitable for reading; the crops straddle the row boundaries. A little adjustment might help. Like:
start-index   end-index
0             109 (129 - 20)
109           218
218           327
327           436
436           545
545           654
(cropped row images omitted)
Now they are suitable for reading.
When we apply division normalization to each row:
(normalized row images omitted; they span the same index ranges as above)
Now when we read:
1,7 | 57 | 71 | 59 | .70 | 65
| 57 | 1,5 | 71 | 59 | 70 | 65
| 71 | 59 | 1,3 | 57 | 70 | 60
| 71 | 59 | 56 | 1,3 | 70 | 60
| 72 | 66 | 71 | 59 | 1,2 | 56
| 72 | 66 | 71 | 59 | 56 | 4,3
Code:
import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
# print(img.shape[:2])

start_index = 0
end_index = int(h/5) - 20

for i in range(0, 6):
    # print("{}->{}".format(start_index, end_index))
    gry_crp = gry[start_index:end_index, 0:w]         # crop the current row
    blr = cv2.GaussianBlur(gry_crp, (145, 145), 0)    # heavy blur estimates the background
    div = cv2.divide(gry_crp, blr, scale=192)         # division normalization removes shading
    txt = image_to_string(div, config="--psm 6")
    print(txt)
    start_index = end_index
    end_index = start_index + int(h/5) - 20
This has been a headache of a challenge for me in my project, since I am a PHP and SQL novice.
I should first mention that I'm running this code in PHP with bound params, since I am using PDO.
Now, as the title indicates: how can I rank students in a class, after having calculated their raw scores in an aggregate-SUM derived table (FROM clause), without user variables in SQL? Let's assume my data pulled from the database is:
studref Name English Maths Gov
Bd1 Ida 66 78 49
Bd2 Iyan 58 80 69
Bd3 Ivan 44 56 80
Bd4 Iven 63 92 68
Bd5 Ike 69 77 59
So using Aggregate Sum in derived table, I then add another column, which is summation of the marks from the various subjects like:
SUM(THE SCORES) AS accum_raw_scores
studref Name English Maths Gov accum_raw_scores
Bd1 Ida 66 78 49 193
Bd2 Iyan 58 80 69 207
Bd3 Ivan 44 56 80 180
Bd4 Iven 63 92 68 223
Bd5 Ike 69 77 59 205
So, what I want to do next is to add another column, which will represent the rank of each student, based on his/her total score from the subjects. Hence, this is where I want the code below to handle that for me:
1+(SELECT COUNT(*)
FROM (SELECT s.studref, SUM(s.subjectscore) AS total_score
FROM studentsreports s
GROUP BY s.studref) AS sub
WHERE sub.total_score > main.accum_raw_scores) AS overall_position,
So that in the end, I will have something like:
studref Name English Maths Gov accum_raw_scores rank
Bd1 Ida 66 78 49 193 4
Bd2 Iyan 58 80 69 207 2
Bd3 Ivan 44 56 80 180 5
Bd4 Iven 63 92 68 223 1
Bd5 Ike 69 77 59 205 3
Unfortunately, I have tried several approaches, but without success. Please help!
Meanwhile, I want to avoid user variables as much as possible.
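As a sketch (untested, and assuming a studentsreports table with one row per student per subject, with columns studref and subjectscore), the correlated count from the question can be attached to the same derived table like this:

SELECT main.studref,
       main.accum_raw_scores,
       1 + (SELECT COUNT(*)
            FROM (SELECT s2.studref, SUM(s2.subjectscore) AS total_score
                  FROM studentsreports s2
                  GROUP BY s2.studref) AS sub
            WHERE sub.total_score > main.accum_raw_scores) AS overall_position
FROM (SELECT s1.studref, SUM(s1.subjectscore) AS accum_raw_scores
      FROM studentsreports s1
      GROUP BY s1.studref) AS main
ORDER BY overall_position;

Ties would share a position (one plus the number of strictly higher totals), and no user variables are required; the per-subject columns can be joined back on studref.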
I tried to open the CSV file (http://archive.ics.uci.edu/ml/machine-learning-databases/00222/) with the pandas module, but the read_csv command did not parse the file properly.
import pandas
bankfull = pandas.read_csv('bank-full.csv')
print(bankfull.head())
and the result looks like
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y
0 58;"management";"married";"tertiary";"no";2143...
1 44;"technician";"single";"secondary";"no";29;"...
How can I fix the code so the CSV file imports as a proper pandas DataFrame?
Thank you!
You need to set the separator argument sep=';'; the default is a comma ','. You can check the docs for read_csv:
pd.read_csv('bank-full.csv', sep=';')
Out[27]:
age job marital education default balance housing loan \
0 58 management married tertiary no 2143 yes no
1 44 technician single secondary no 29 yes no
2 33 entrepreneur married secondary no 2 yes yes
3 47 blue-collar married unknown no 1506 yes no
4 33 unknown single unknown no 1 no no
5 35 management married tertiary no 231 yes no
6 28 management single tertiary no 447 yes yes
7 42 entrepreneur divorced tertiary yes 2 yes no
8 58 retired married primary no 121 yes no
9 43 technician single secondary no 593 yes no
10 41 admin. divorced secondary no 270 yes no
11 29 admin. single secondary no 390 yes no
I have a MySQL table like this
ownerlisting_access_id property_id mainaccess_id subaccess_id access_value
62 2 35 41 Yes
64 2 35 36 Yes
123 4 35 41 Yes
125 4 35 36 Yes
306 7 35 41 Yes
307 7 35 42 Yes
308 7 35 36 Yes
I need to get the property_id that serves subaccess_id 41, 42, and 36 together.
Here, that means I need to get the property_id 7.
This should work:
SELECT property_id FROM t
WHERE subaccess_id IN (41, 42, 36)
GROUP BY property_id
HAVING COUNT(DISTINCT subaccess_id) = 3
Fiddle here.
Bear in mind that you should match the number of elements in the IN clause with the count in the HAVING clause. Also note that if the same subaccess_id cannot occur more than once for a given property_id, you can remove the DISTINCT keyword.
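For example, to find properties serving only the (hypothetical) pair 41 and 36, both numbers change together:

SELECT property_id FROM t
WHERE subaccess_id IN (41, 36)          -- 2 elements...
GROUP BY property_id
HAVING COUNT(DISTINCT subaccess_id) = 2 -- ...so the count must be 2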