Print sum of rows and another row's value for each column in awk - csv

I have a csv file structured as the one below:
          | Taiwan |      | US   |
          | ASUS   | MSI  | DELL | HP
------------------------------------------
CPU       | 50     | 49   | 43   | 65
GPU       | 60     | 64   | 75   | 54
HDD       | 75     | 70   | 65   | 46
RAM       | 60     | 79   | 64   | 63
assembled | 235    | 244  | 254  | 269
I have to use an awk script to print, for each brand, the sum of the prices of the individual computer parts (rows 3 to 6) versus the assembled computer price (row 7), also displaying the country each brand comes from. The printed result in the terminal should look something like this:
Taiwan Asus 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269
Here the third column is the sum of the CPU, GPU, HDD and RAM prices, and the fourth column is the assembled price from row 7 for each computer brand.
So far I have been able to sum the individual columns by adapting the solution from the post linked below, but I don't know how to display the result in the desired format. Could anyone help me with this? I'm a bit desperate at this point.
Sum all values in each column bash
This is the content of the original csv file represented at the top of this message:
,Taiwan,,US,
,ASUS,MSI,DELL,HP
CPU,50,49,43,65
GPU,60,64,75,54
HDD,75,70,65,46
RAM,60,79,64,63
assembled,235,244,254,269
Thank you very much in advance.

$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
NR == 2 {
for (i=2; i<=NF; i++) {
corp[i] = (p[i] == "" ? p[i-1] : p[i]) OFS $i
}
}
NR > 2 {
for (i=2; i<=NF; i++) {
tot[i] += p[i]
}
}
{ split($0,p) }
END {
for (i=2; i<=NF; i++) {
print corp[i], tot[i], p[i]
}
}
$ awk -f tst.awk file
Taiwan ASUS 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269
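If awk were not a hard requirement, the same computation can be sketched with Python's standard csv module (a minimal sketch assuming the comma-separated layout shown above; the file name file.csv is made up):
import csv

with open("file.csv", newline="") as f:
    rows = list(csv.reader(f))

countries, brands = rows[0], rows[1]
parts = rows[2:6]        # CPU, GPU, HDD, RAM rows
assembled = rows[6]      # the "assembled" row

country = ""
for i in range(1, len(brands)):
    # An empty country cell inherits the previous non-empty value (Taiwan/US).
    country = countries[i] or country
    parts_sum = sum(int(r[i]) for r in parts)
    print(country, brands[i], parts_sum, assembled[i], sep="\t")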

Related

tesseract fails at simple number detection

I want to perform OCR on images like this one:
It is a table of numerical data with commas as decimal separators.
It is not noisy, contrast is good, black text on white background.
As an additional preprocessing step, in order to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues) and pass only that single cell image to tesseract.
I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:
Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.
There exist a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and all values of --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".
Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".
Any suggestions what else might be helpful in improving the output quality of tesseract here?
My tesseract version:
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE
Solution
Divide the image into 5 different rows
Apply division-normalization to each row
Set psm to 6 (Assume a single uniform block of text.)
Read
From the original image, we can see there are 5 different rows.
In each iteration, we take one row, apply the normalization and read it.
We need to set the image indexes carefully.
import cv2
from pytesseract import image_to_string
img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
start_index = 0
end_index = int(h/5)
Question: Why do we declare start and end indexes?
We want to read a single row in each iteration. Let's see an example:
The current image height and width are 645 and 1597 pixels.
Divide the images based on indexes:
start-index | end-index
0           | 129
129         | 258 (129 + 129)
258         | 387 (258 + 129)
387         | 516 (387 + 129)
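The raw indices above are just the image height stepped through in equal chunks, e.g. (a hypothetical standalone snippet, using the height from the example):
h = 645                      # image height from the example above
step = int(h / 5)            # 129
pairs = [(s, s + step) for s in range(0, h, step)]
print(pairs[:4])             # [(0, 129), (129, 258), (258, 387), (387, 516)]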
Let's see whether the images are readable:
start-index | end-index | image
0           | 129       | (row crop not shown)
129         | 258       | (row crop not shown)
258         | 387       | (row crop not shown)
387         | 516       | (row crop not shown)
Nope, they are not suitable for reading; maybe a little adjustment will help. Like:
start-index | end-index        | image
0           | 129 - 20 (= 109) | (row crop not shown)
109         | 218              | (row crop not shown)
218         | 327              | (row crop not shown)
327         | 436              | (row crop not shown)
436         | 545              | (row crop not shown)
545         | 654              | (row crop not shown)
Now they are suitable for reading.
When we apply the division-normalization to each row:
start-index | end-index | image
0           | 109       | (normalized crop not shown)
109         | 218       | (normalized crop not shown)
218         | 327       | (normalized crop not shown)
327         | 436       | (normalized crop not shown)
436         | 545       | (normalized crop not shown)
545         | 654       | (normalized crop not shown)
Now when we read:
1,7 | 57 | 71 | 59 | .70 | 65
| 57 | 1,5 | 71 | 59 | 70 | 65
| 71 | 59 | 1,3 | 57 | 70 | 60
| 71 | 59 | 56 | 1,3 | 70 | 60
| 72 | 66 | 71 | 59 | 1,2 | 56
| 72 | 66 | 71 | 59 | 56 | 4,3
Code:
import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
# print(img.shape[:2])

start_index = 0
end_index = int(h/5) - 20

for i in range(0, 6):
    # print("{}->{}".format(start_index, end_index))
    gry_crp = gry[start_index:end_index, 0:w]        # crop the current row
    blr = cv2.GaussianBlur(gry_crp, (145, 145), 0)   # heavy blur = background estimate
    div = cv2.divide(gry_crp, blr, scale=192)        # division-normalization
    txt = image_to_string(div, config="--psm 6")
    print(txt)
    start_index = end_index
    end_index = start_index + int(h/5) - 20
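If the recognized text then has to be turned into numbers, a small post-processing step might look like this (a minimal sketch assuming the pipe-separated, comma-decimal output shown above):
def parse_row(line):
    # Split a tesseract output line such as "| 57 | 1,5 | 71 |" on the pipe
    # characters and convert each cell, treating the comma as the decimal separator.
    cells = [c.strip() for c in line.split("|")]
    return [float(c.replace(",", ".")) for c in cells if c]

print(parse_row("| 57 | 1,5 | 71 | 59 | 70 | 65"))  # [57.0, 1.5, 71.0, 59.0, 70.0, 65.0]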

How to query out hours that are more than 72 hours in a derived table?

I am trying to query out only the root causes that have more than 72 hours, and when it finds 72 hours or more it should add them up. For example:
I have root cause A = 78 hours and root cause B = 100 hours; since these two are more than 72, they should add up to 178 hours as "MNPT". Anything that is less than 72 should add up to make up routine NPT.
I am using a derived table to query this out, but the outcome still displays the hours including those that are less than 72.
Select operation_uid, sum (npt_duration) as mnpt from fact_npt_root_cause where npt_duration>72 group by root_cause_code having sum (npt_duration)>72
See this table
| ROOT CAUSE CODE | NPT Duration |
| A               | 23           |
| B               | 78           |
| C               | 45           |
| D               | 100          |
| E               | 90           |
When the root cause value is more than 72 hours => add up those values, for example:
root cause code B, D, E = 78 + 100 + 90 = 268 as MNPT
When the root cause value is less than 72 hours => add up the values, as 23 + 45 = 68, as routine NPT
I'm not sure what you want to do, but selecting operation_uid while grouping by root_cause_code assumes that you always have the same operation_uid for a given root_cause_code... Don't you rather mean:
SELECT operation_uid,
sum (npt_duration) as mnpt
FROM fact_npt_root_cause
WHERE npt_duration>72
GROUP by operation_uid, root_cause_code
HAVING SUM (npt_duration)>72;
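If the goal is the two totals described in the question (one MNPT figure for root causes above 72 hours and one routine-NPT figure for the rest), conditional aggregation over the per-root-cause sums is one way to get there. Here is a self-contained sketch that runs the idea against an in-memory SQLite copy of the sample data (table and column names follow the question; adapt the SQL to your actual database):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_npt_root_cause (root_cause_code TEXT, npt_duration REAL)")
con.executemany("INSERT INTO fact_npt_root_cause VALUES (?, ?)",
                [("A", 23), ("B", 78), ("C", 45), ("D", 100), ("E", 90)])

# Sum per root cause first, then split those totals into the two buckets.
mnpt, routine_npt = con.execute("""
    SELECT SUM(CASE WHEN total >  72 THEN total ELSE 0 END) AS mnpt,
           SUM(CASE WHEN total <= 72 THEN total ELSE 0 END) AS routine_npt
    FROM (SELECT root_cause_code, SUM(npt_duration) AS total
          FROM fact_npt_root_cause
          GROUP BY root_cause_code) AS per_cause
""").fetchone()

print(mnpt, routine_npt)  # 268.0 68.0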

MySQL matching row values sets

I am relatively new to MySQL and PHP. I have developed a hockey stats DB. Until now, I have been doing pretty basic queries and reporting of the stats.
I want to do a little more advanced query now.
I have a table that records which players were on the ice (shown as "fk_pp1_id" through "fk_pp5_id") when a goal is scored. Here is the table:
pt_id | fk_gf_id | fk_pp1_id | fk_pp2_id | fk_pp3_id | fk_pp4_id | fk_pp5_id
1 | 1 | 19 | 20 | 68 | 90 | 97
2 | 2 | 1 | 19 | 20 | 56 | 91
3 | 3 | 1 | 56 | 88 | 91 | 93
4 | 4 | 1 | 19 | 64 | 88 | NULL
5 | 5 | 19 | 62 | 68 | 88 | 97
6 | 6 | 55 | 19 | 20 | 45 | 62
7 | 7 | 1 | 19 | 20 | 56 | 61
8 | 8 | 65 | 68 | 90 | 93 | 97
9 | 9 | 19 | 20 | 45 | 55 | 62
10 | 10 | 1 | 19 | 20 | 56 | 61
11 | 11 | 1 | 19 | 20 | 56 | 61
12 | 12 | 19 | 20 | 68 | 90 | 97
13 | 13 | 19 | 20 | 68 | 90 | 97
14 | 14 | 19 | 20 | 55 | 62 | 91
15 | 15 | 1 | 56 | 61 | 64 | 88
16 | 16 | 1 | 56 | 61 | 64 | 88
17 | 17 | 1 | 19 | 20 | 56 | 61
18 | 18 | 1 | 19 | 20 | 56 | 61
19 | 19 | 1 | 65 | 68 | 93 | 97
I want to do several queries:
Show which of the five players were together on the ice most often
when a goal was scored.
Select say 2 players and show which other players were on the ice most often with them when a goal was scored.
I was able to write a query which partially accomplishes query #1 above.
SELECT
fk_pp1_id,
fk_pp2_id,
fk_pp3_id,
fk_pp4_id,
fk_pp5_id,
count(*)
FROM TABLE1
group by
fk_pp1_id,
fk_pp2_id,
fk_pp3_id,
fk_pp4_id,
fk_pp5_id
Here are the results:
fk_pp1_id fk_pp2_id fk_pp3_id fk_pp4_id fk_pp5_id count(*)
1 19 20 56 61 4
1 19 20 56 91 1
1 19 64 88 (null) 1
1 56 61 64 88 2
1 56 88 91 93 1
1 65 68 93 97 1
19 1 20 56 61 1
19 20 45 55 62 1
19 20 55 62 91 1
19 20 68 90 97 3
19 62 68 88 97 1
55 19 20 45 62 1
65 68 90 93 97 1
See this sqlfiddle:
http://sqlfiddle.com/#!9/e3f5f/1
This seems to work at first, but I realized this query, as written, is sensitive to the order in which the players are listed. That is to say a row with:
1, 19, 20, 68, 90
will not match
19, 1, 20, 68, 90
So to fix this problem, I feel like I have a couple options:
Ensure the data is input into the table in numerical order
Re-write the query so the order of the data in the table doesn't matter
Make the resulting query a sub-query to another query that first
orders the column (left to right) in numerical order.
Change the schema to record/store the data in a better way
1, I can do, but would prefer to have the query be fool-proof.
2 or 3 I prefer, but don't know how to do either.
4, I don't know how to do and is least desirable as I already have some complex queries against this table that would need to be totally re-written.
Am I going about this in the wrong way, or is there a solution?
Thanks for your help
UPDATE -
OK, I have (hopefully) better normalized the data in the table. Thanks @Strawberry. Now my table has a column for the goal_id (foreign key) and a column for the player_id (another foreign key) of each player who was on the ice at the time the goal was scored.
Here is the new fiddle:
http://sqlfiddle.com/#!9/39e5a
I can easily get the one player who was on the ice most when goals were scored, but I can't get my mind around how to find the occurrences of a group of players who were on the ice together. For example, how many times a group of 5 players was on the ice together, and from there, how often a group of 2 players was on the ice together with 3 other players.
Any other clues???
I found a similar problem here and based on that I came up with this solution.
For the first part of your problem, to select how many times the same five players were on the ice when a goal was scored, your query could look like this:
SELECT GROUP_CONCAT(t1.fk_gf_id) AS MinOfGoal,
t1.players AS playersNumber,
COUNT(t1.fk_gf_id) AS numOfTimes
FROM (SELECT fk_gf_id, GROUP_CONCAT(fk_plyr_id ORDER BY fk_plyr_id) AS players
FROM Table1
GROUP BY fk_gf_id) AS t1
GROUP BY t1.players
ORDER BY numOfTimes DESC;
And for the second part of your question, where you want to select two players and find the three other players who were on the ice when a goal was scored, you should extend the previous query with a WHERE clause, like this:
SELECT GROUP_CONCAT(t1.fk_gf_id) AS MinOfGoal,
t1.players AS playersNumber,
COUNT(t1.fk_gf_id) AS numOfTimes
FROM (SELECT fk_gf_id, GROUP_CONCAT(fk_plyr_id ORDER BY fk_plyr_id) AS players
FROM Table1
WHERE fk_gf_id IN (SELECT fk_gf_id
FROM Table1
WHERE fk_plyr_id = 19)
AND fk_gf_id IN (SELECT fk_gf_id
FROM Table1
WHERE fk_plyr_id = 56)
GROUP BY fk_gf_id) AS t1
GROUP BY t1.players
ORDER BY numOfTimes DESC;
You can see how it works here in SQL Fiddle...
Note: I added some data to Table1 (don't be confused by the extra data being counted).
GL!
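The reason sorting the player ids inside GROUP_CONCAT works is that it makes the grouping key order-insensitive. The same idea, sketched in Python over a few hypothetical (goal_id, player_id) rows, just to illustrate the logic:
from collections import Counter, defaultdict

# Hypothetical normalized rows: one (goal_id, player_id) row per player on the ice.
rows = [
    (1, 19), (1, 20), (1, 68), (1, 90), (1, 97),
    (2, 1), (2, 19), (2, 20), (2, 56), (2, 91),
    (12, 19), (12, 20), (12, 68), (12, 90), (12, 97),
]

# Collect the players on the ice for each goal; a frozenset ignores their order.
on_ice = defaultdict(set)
for goal_id, player_id in rows:
    on_ice[goal_id].add(player_id)

line_counts = Counter(frozenset(players) for players in on_ice.values())
for line, count in line_counts.most_common():
    print(sorted(line), count)
# [19, 20, 68, 90, 97] 2
# [1, 19, 20, 56, 91] 1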

I would like to find the difference of each row, from the first row As Alias

To clarify my title:
I would like to tabulate how far behind the leader each successive finisher is from 1st place, as shown in my table below.
Finish | Points | Points Behind
1 | 102 |
2 | 92 | 10
3 | 82 | 20
4 | 71 | 31
5 | 61 | 41
6 | 50 | 52
7 | 40 | 62
8 | 30 | 72
9 | 20 | 82
10 | 10 | 92
Select
snpc_stats.gamedetail.Finish,
snpc_stats.gamedetail.Points,
some code I don't know As 'Points Behind'
From
snpc_stats.gamedetail
Where
snpc_stats.gamedetail.GamesID = 113
You can get the points from the first finish and do a cross join with the rest of the table.
SQL Fiddle
select gd.Finish, gd.Points,
       t.Points - gd.Points as PointsBehind
from gamedetail gd
cross join (select max(Points) as Points from gamedetail where Finish = 1) t
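A quick way to sanity-check the idea is to run it against an in-memory SQLite copy of the sample points from the question (a hypothetical, self-contained sketch; the real table lives in snpc_stats.gamedetail):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gamedetail (Finish INTEGER, Points INTEGER)")
con.executemany("INSERT INTO gamedetail VALUES (?, ?)",
                list(enumerate([102, 92, 82, 71, 61, 50, 40, 30, 20, 10], start=1)))

# Cross join every row with the leader's points and subtract.
for finish, points, behind in con.execute("""
    SELECT gd.Finish, gd.Points, t.Points - gd.Points AS PointsBehind
    FROM gamedetail gd
    CROSS JOIN (SELECT MAX(Points) AS Points FROM gamedetail WHERE Finish = 1) t
    ORDER BY gd.Finish
"""):
    print(finish, points, behind)  # 1 102 0, 2 92 10, ..., 10 10 92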

LINQ for returning first in repeating sequences

I have a Measurements table as follows:
SourceId : int
TimeStamp: date/time
Measurement: int
Sample data looks like this (more on the asterisks below):
SID| TimeStamp | Measurement
10 | 02-01-2011 12:00:00 | 30 *
10 | 02-01-2011 12:10:00 | 30
10 | 02-01-2011 12:17:00 | 32 *
10 | 02-01-2011 12:29:00 | 30 *
10 | 02-01-2011 12:34:00 | 30
10 | 02-01-2011 12:39:00 | 35 *
10 | 02-01-2011 12:46:00 | 36 *
10 | 02-01-2011 12:39:00 | 36
10 | 02-01-2011 12:54:00 | 36
11 | 02-01-2011 12:00:00 | 36 *
11 | 02-01-2011 12:10:00 | 36
11 | 02-01-2011 12:17:00 | 37 *
11 | 02-01-2011 12:29:00 | 38 *
11 | 02-01-2011 12:34:00 | 38
11 | 02-01-2011 12:39:00 | 37 *
11 | 02-01-2011 12:46:00 | 36 *
11 | 02-01-2011 12:39:00 | 36
11 | 02-01-2011 12:54:00 | 36
I need a LINQ query that will return only the rows where the Measurement value is different from that of the prior row with the same SourceId (i.e. each row marked with an asterisk). The table should be sorted by SourceId, then TimeStamp.
The data from the query will be used to plot a graph where each SourceId is a series. The source table has several million rows and the repeating measurements are in the thousands. Since these repeating measurement values don't make any difference to the resulting graph I'd like to eliminate them before passing the data to my graph control for rendering.
I have tried using Distinct() in various ways, and reviewed the Aggregate queries here http://msdn.microsoft.com/en-us/vcsharp/aa336746 but don't see an obvious solution.
Sometimes a plain old foreach loop will suffice.
var finalList = new List<MyRowObject>();
MyRowObject prevRow = null;
foreach (var row in myCollection)
{
    if (prevRow == null || (row.SID != prevRow.SID || row.Measurement != prevRow.Measurement))
    {
        finalList.Add(row);
    }
    prevRow = row;
}
Personally, I like the DistinctUntilChanged extension method that is included in the Rx Extensions library. It's very handy. As is the rest of the library, by the way.
But I do understand, you might not want to add a whole new dependency just for this. In this case, I propose Zip:
sequence.Take(1).Concat(
sequence.Zip( sequence.Skip(1), (prev,next) => new { item = next, sameAsPrevious = prev == next } )
.Where( (x,index) => !x.sameAsPrevious )
.Select( x => x.item )
)
There's no way to do this in a single query in sql. Ergo there's no way to do this in a single query in linq to sql.
The problem is you need to compare each row to the "next" row. That's just not something that sql does well at all.
Look at the first five rows:
10 | 02-01-2011 12:00:00 | 30 *
10 | 02-01-2011 12:10:00 | 30
10 | 02-01-2011 12:17:00 | 32 *
10 | 02-01-2011 12:29:00 | 30 *
10 | 02-01-2011 12:34:00 | 30
You want to keep 2 records with 30 and remove 2 records with 30. That rules out grouping.
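If it helps to see the "first of each run" idea outside of LINQ, here is the same logic sketched in Python with itertools.groupby, using a few hypothetical tuples standing in for the rows:
from itertools import groupby

# Rows already sorted by SourceId then TimeStamp: (SID, TimeStamp, Measurement).
rows = [
    (10, "12:00", 30), (10, "12:10", 30), (10, "12:17", 32),
    (10, "12:29", 30), (10, "12:34", 30), (10, "12:39", 35),
]

# groupby collapses only *consecutive* rows with the same (SID, Measurement) key,
# so taking the first row of each group keeps exactly the asterisked rows.
firsts = [next(group) for _, group in groupby(rows, key=lambda r: (r[0], r[2]))]
print(firsts)
# [(10, '12:00', 30), (10, '12:17', 32), (10, '12:29', 30), (10, '12:39', 35)]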