How to create a "grayscale" image from float or int values in ITK? - itk

If I have some float values:
0.9 0.6 0.3 0.1 0.0
0.7 0.5 0.1 0.0 0.0
0.3 0.2 0.1 0.0 0.0
or int values:
22 15 10 7 0
44 35 20 10 0
12 8 6 4 1
how can I create a grayscale image from these values in ITK, in both cases?

You might find what you are looking for in this example: http://itk.org/ITKExamples/src/Filtering/ImageIntensity/ConvertRGBImageToGrayscaleImage/Documentation.html
Either way, you need to convert your data matrix to an itk::Image before you can do that. For that step, see the section 'Importing Image Data from a Buffer' in the official guide: http://www.itk.org/ItkSoftwareGuide.pdf. Once that is done, look at the output image: it might already be exactly what you want, in which case there is no need to apply the luminance filter from the first example.
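A minimal sketch of the buffer-to-image step in Python rather than C++, assuming the `itk` and `numpy` packages are installed. The helper names `rescale_to_uint8` and `write_grayscale` are made up for illustration, and mapping the smallest value to 0 and the largest to 255 is only one possible reading of "grayscale":

```python
def rescale_to_uint8(rows):
    """Linearly map the values of a 2D list onto integers in 0..255."""
    flat = [v for row in rows for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [[round((v - lo) / span * 255) for v in row] for row in rows]


def write_grayscale(rows, path):
    """Wrap the rescaled values in an itk.Image and write it to `path`."""
    # Imported lazily so the rescaling helper works without ITK installed.
    import itk
    import numpy as np

    pixels = np.asarray(rescale_to_uint8(rows), dtype=np.uint8)
    image = itk.image_from_array(pixels)  # 2D unsigned char image
    itk.imwrite(image, path)
    return image


float_values = [
    [0.9, 0.6, 0.3, 0.1, 0.0],
    [0.7, 0.5, 0.1, 0.0, 0.0],
    [0.3, 0.2, 0.1, 0.0, 0.0],
]
int_values = [
    [22, 15, 10, 7, 0],
    [44, 35, 20, 10, 0],
    [12, 8, 6, 4, 1],
]
```

Calling `write_grayscale(float_values, "float_gray.png")` (the file name is a placeholder) would handle the float case, and the same call works for `int_values`, since both are rescaled to the same 8-bit range first.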

How to diagnose cause of slow JDBC write operations while running PySpark job on AWS Glue

I've been stuck on this one for a couple of days so any help is greatly appreciated. I have a Spark ETL job running on AWS Glue and am having a hard time optimizing write performance.
Summary of the job:
A large dataset is being read from S3 (~5 million records)
A smaller dataset is being read from an RDS MySQL instance (hundreds of records)
The task at hand calls for a pairwise comparison of each record from both datasets. I went with a cross join (cartesian product), which itself seems to work fine. I know that this is an expensive operation, but I haven't been able to figure out how to avoid expanding the dataset in that manner in order to do this pairwise comparison. It's just the nature of the problem. The good news is that all of the core transformation work is done on a row-by-row basis. There are no groupbys or aggregate functions required. There is 1 numerical filter that is executed after the cross join and subsequent computations. If I use .show() as an action to make all of these operations run, it takes around 40 minutes. Not bad.
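A back-of-the-envelope sketch of how many rows and JDBC batches such a cross join could produce. Only the 5 million figure and the batch size of 100 come from this post; the 300 rows for the smaller dataset and the "1 in 100 rows passes the filter" ratio are pure assumptions standing in for "hundreds of records" and the unspecified numerical filter.

```python
def cross_join_estimate(large_rows, small_rows, keep_one_in, batch_size):
    """Estimate row and batch counts for a filtered cross join."""
    joined = large_rows * small_rows        # every pair of records
    to_write = joined // keep_one_in        # rows surviving the filter (assumed ratio)
    batches = -(-to_write // batch_size)    # ceiling division
    return {"joined_rows": joined, "rows_to_write": to_write, "jdbc_batches": batches}


estimate = cross_join_estimate(
    large_rows=5_000_000,  # from the job summary
    small_rows=300,        # assumption
    keep_one_in=100,       # assumption
    batch_size=100,        # matches the batchsize option used in the write
)
for name, value in estimate.items():
    print(f"{name}: {value:,}")
```

Plugging in the real counts shows quickly whether the write is dealing with millions or billions of rows.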
The problem arises when I go to write this data to MySQL using the built-in Spark JDBC connector as such:
joined_df.write.format("jdbc") \
    .option("url", url_write) \
    .option("batchsize", 100) \
    .option("dbtable", "{0}.{1}".format(database, table)) \
    .option("user", user).option("password", password).mode("append").save()
After 4 hours, not a single record is written.
When I run the job over a smaller sample of data, the writes go slowly but they do occur so database connectivity isn't the issue. I have also tried modifying the connection string to set useServerPrepStmts=false and rewriteBatchedStatements=true and using a small batch size of 100.
I am using 10 DPUs with the G1X worker type. This provides 36 cores in total. I used the 3X rule of thumb to pick 108 as the number of partitions which I apply to the larger dataset before converting the DynamicFrame into a Dataframe as such:
df1 = glue_context.create_dynamic_frame.from_catalog(
    database = glue_database,
    table_name = "test_tbl1",
    push_down_predicate = "(upload_date >= CAST('{0}' AS TIMESTAMP))".format(lookback_date)
).repartition(108).toDF()
I know that 10 DPUs is relatively light on compute, but I have a feeling that something else is going on here, since no records are being written after hours. I have also tried using various numbers of partitions. I set up Spark UI to get some additional metrics. I am including the metrics for my latest attempt, which timed out after 4 hours with zero records written. I would have posted a screenshot if my ranking allowed it (sorry that this is so ugly). My question is: what should I focus on here to find out why the writes are hanging? Any wisdom would be greatly appreciated.
Executor ID | Address | Status | RDD Blocks | Storage Memory | Disk Used | Cores | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time (GC Time) | Input | Shuffle Read | Shuffle Write
driver | 172.31.33.169:37095 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 0 | 0 | 0 | 0 | 0 | 0.0 ms (0.0 ms) | 0.0 B | 0.0 B | 0.0 B
1 | 172.31.45.233:45821 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 3 | 0 | 4 | 7 | 1.5 min (4 s) | 91.1 MiB | 0.0 B | 306.3 MiB
2 | 172.31.47.118:44467 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 4 | 0 | 5 | 9 | 1.2 min (3 s) | 82.1 MiB | 0.0 B | 281.3 MiB
3 | 172.31.42.73:45151 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 3 | 0 | 4 | 7 | 1.4 min (4 s) | 89.7 MiB | 0.0 B | 303.4 MiB
4 | 172.31.43.123:34465 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 4 | 0 | 4 | 8 | 1.3 min (5 s) | 79.5 MiB | 0.0 B | 274.1 MiB
5 | 172.31.40.44:46117 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 3 | 0 | 4 | 7 | 1.5 min (5 s) | 89.1 MiB | 0.0 B | 304.6 MiB
6 | 172.31.36.20:42763 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 4 | 0 | 4 | 8 | 1.1 min (3 s) | 21.9 MiB | 0.0 B | 74.3 MiB
7 | 172.31.32.240:45533 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 4 | 0 | 7 | 11 | 1.3 min (4 s) | 91.6 MiB | 59 B | 312.4 MiB
8 | 172.31.36.49:34471 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 3 | 0 | 7 | 10 | 1.5 min (5 s) | 101.6 MiB | 0.0 B | 347.4 MiB
9 | 172.31.43.177:40385 | Active | 0 | 0.0 B / 5.8 GiB | 0.0 B | 4 | 3 | 0 | 5 | 8 | 1.4 min (5 s) | 85.7 MiB | 0.0 B | 294.1 MiB

tesseract fails to form shapetable

i am attempting to extract OCR data of a 3-digit counter within a video via tesseract 4.1.1 on Kubuntu 21.04. (full tesseract version string below.) i am failing to add characters during the shapetable phase, and no other troubleshooting has worked for me -- i turn to you with humble heart. n.b.: the images are of a small pixel font, which takes up the entirety of my source image
image preparation and collation
from the source videos, i: crop to only the counter, invert, grayscale, dump at 1 fps, and then increase resolution by 1000% to 780x180 resolution. the results are individual frames such as this. i take a section of sequential numbers counting down from 500 (without any duplicates or blank images) and combine them into a .tif. (i can't upload the file here, but find the set of images mosaic'd together here)
i import this file into jTessBoxEditor as, for example, type_3.font.exp0.tif. i run tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox to create a .box file, with understandably nonsensical results.
with the hand-chosen source frames and the consistent positions, i'm able to edit the .box file with known box sizes, quantities, like so:
5 0 0 240 180 0
0 270 0 510 180 0
0 540 0 780 180 0
4 0 0 240 180 1
9 270 0 510 180 1
9 540 0 780 180 1
4 0 0 240 180 2
9 270 0 510 180 2
8 540 0 780 180 2
4 0 0 240 180 3
9 270 0 510 180 3
7 540 0 780 180 3
...
i load the edited .box into jTessBoxEditor to check that it indeed matches my data. this is a 131-page .tif, meaning roughly 40 training samples per digit.
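Since the boxes sit at fixed positions and the counter is sequential, the same .box content could also be generated by a script instead of edited by hand. This is a sketch under stated assumptions: the counter starts at 500, drops by exactly one per page, and always shows three digits; `box_lines` and `DIGIT_BOXES` are hypothetical names, and the coordinates are copied from the excerpt above.

```python
# Box geometry from the .box excerpt: left, bottom, right, top.
DIGIT_BOXES = [(0, 0, 240, 180), (270, 0, 510, 180), (540, 0, 780, 180)]


def box_lines(start=500, pages=131):
    """Build .box lines for a counter that drops by one on every page."""
    lines = []
    for page in range(pages):
        digits = f"{start - page:03d}"  # e.g. page 1 -> "499"
        for char, (left, bottom, right, top) in zip(digits, DIGIT_BOXES):
            lines.append(f"{char} {left} {bottom} {right} {top} {page}")
    return lines


if __name__ == "__main__":
    # Print the first four pages to compare against the hand-edited file.
    print("\n".join(box_lines(pages=4)))
```

Comparing this output against the hand-edited file is a cheap way to rule out a typo in the boxes before looking at the later training steps.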
training steps (where the problems begin)
i create font_properties and load it with font 0 0 0 0 0. Please note that i've also tried type_3 0 0 0 0 0 and type_3.font.exp0 0 0 0 0 0, with no difference on the below results
i input tesseract type_3.font.exp0.tif type_3.font.exp0 nobatch box.train and a training file is created; however, each page is listed as blank (is this normal?). e.g.:
Page 108
Warning: Invalid resolution 1 dpi. Using 70 instead.
Estimating resolution as 2263
Empty page!!
i input unicharset_extractor font_name.font.exp0.box with success -- the resulting extraction contains the characters i've identified, with some extra lines
13
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
5 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 0 # 0 [30 ]0
4 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 4 # 4 [34 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 9 # 9 [39 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 8 # 8 [38 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 8 2 8 7 # 7 [37 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 9 2 9 6 # 6 [36 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 10 2 10 3 # 3 [33 ]0
2 8 0,255,0,255,0,0,0,0,0,0 Common 11 2 11 2 # 2 [32 ]0
1 8 0,255,0,255,0,0,0,0,0,0 Common 12 2 12 1 # 1 [31 ]0
but i know that failure has come for me when shapeclustering -F font_properties -U unicharset -O type_3.unicharset type_3.font.exp0.tr
results in
Reading type_3.font.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
...
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Master shape_table:Number of shapes = 0 max unichars = 0 number with multiple unichars = 0
It has not recognized any shapes at all.
my plea:
what have i missed?? what can i do to pass these 10 humble characters to tesseract?
full version string (installed via apt)
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.5

Encode data into 1 byte

I have to encode data to 1 byte. I have the following data as of now.
size - 500 ml and 1 litre
Frequency - 0 to 12
% - 0-100
So i decided to break the data into the following -
0 0 0 0 0 0 0 0
1st bit - Size - 0 for 500ml and 1 for 1 litre
2-5 bits - Frequency - 0 to 12 (0000 for 0 and 1100 for 12)
I am not sure how to fit the % into this layout. Am I approaching this the wrong way? Is there any other way to do it? Any direction is highly appreciated.
You are left with 3 bits. You need to store a value between 0 and 100, which needs at least 7 bits (2^7 = 128). However, if you only need 8 different percentage values, you can get away with using 3 bits.
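To see why the full range cannot fit: size (1 bit) + frequency (4 bits) + percentage (7 bits) adds up to 12 bits, and there are 2 × 13 × 101 = 2626 distinct combinations against only 256 byte values. A minimal sketch of the lossy 3-bit variant in Python follows. The bit layout (size in the top bit, frequency in the next four, percentage level in the low three) and the choice of 8 evenly spaced levels are assumptions for illustration, not something the question specifies.

```python
def pack(size_is_litre, frequency, percent):
    """Pack size (1 bit), frequency (4 bits) and a coarse percentage (3 bits)."""
    if not 0 <= frequency <= 12:
        raise ValueError("frequency must be in 0..12")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in 0..100")
    level = (percent * 7 + 50) // 100  # nearest of 8 levels, 0..7
    return (int(bool(size_is_litre)) << 7) | (frequency << 3) | level


def unpack(byte):
    """Return (size bit, frequency, approximate percent) from one byte."""
    size = (byte >> 7) & 0b1
    frequency = (byte >> 3) & 0b1111
    level = byte & 0b111
    percent = (level * 100 + 3) // 7  # level back to the nearest whole percent
    return size, frequency, percent


print(unpack(pack(True, 12, 100)))  # size and frequency survive exactly
print(unpack(pack(False, 5, 50)))   # the percentage comes back only approximately
```

If the exact percentage matters, the data needs a second byte (or at least 12 bits).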

GNUPLOT: Joining different series of points with vectors

I have a file with data in 2 columns, X and Y. There are several blocks, separated by blank lines. I want to join the points (given by their x and y coordinates in the file) in each block using vectors. I'm trying to use these functions:
prev_x = NaN
prev_y = NaN
dx(x) = (x_delta = x-prev_x, prev_x = ($0 > 0 ? x : 1/0), x_delta)
dy(y) = (y_delta = y-prev_y, prev_y = ($0 > 0 ? y : 1/0), y_delta)
which I've taken from Plot lines and vector in graphical gnuplot (first answer). The command to plot would be plot for[i=0:5] 'Field_lines.txt' every :::i::i u (prev_x):(prev_y):(dx($1)):(dy($2)) with vectors.
The problem with the output is that the point (0,0) is being included even though it's not in the file. I don't think I understand exactly what the functions dx and dy do, or how they are used with the option using (prev_x):(prev_y):(dx($1)):(dy($2)), so an explanation of this would help me a lot in trying to fix it.
This is the file:
#1
0 5
0 4
0 3
0.4 2
0.8 1
0.8 1
#2
2 5
2 4
2 3
2 2
2 1
2 0
#3
4 5
4.2 4
4.5 3
4.6 2
4.7 1
4.7 0
#4
7 5
7.2 4
7.5 3
7.9 2
7.9 1
7.9 0
#5
9 5
9 4
9.2 3
9.5 2
9.5 1
9.5 0
#6
11 7
12 6
13 5
13.3 4
13.5 3
13.5 2
13.6 1
14 0
Thanks!
I'm not completely sure what the real problem is, but I think you cannot rely on the columns in the using statement being evaluated from left to right, and your check $0 > 0 in dx and dy comes too late, in my opinion.
I usually put all the assignments and conditionals in the first column, and that works fine also in your case:
set offsets 1,1,1,1
unset key
prev_x = prev_y = 1
plot for [i=0:5] 'Field_lines.txt' every :::i::i \
u (x_delta = prev_x-$1, prev_x=$1, y_delta=prev_y-$2, prev_y=$2, ($0 == 0 ? 1/0 : prev_x)):(prev_y):(x_delta):(y_delta) with vectors backhead
Also, to draw a vector from the j-th row to the point in the following row, you must invert the definition of x_delta and use backhead to draw the vectors in the correct direction.
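For anyone unsure what the four using columns amount to: with vectors expects x:y:xdelta:ydelta, i.e. a start point plus the offset to the end point. Here is a sketch in Python (not gnuplot) of the values that should come out for one block. `block_vectors` is a hypothetical helper, and the points are block #1 of the data file.

```python
def block_vectors(points):
    """Return (x, y, dx, dy) for each pair of consecutive points in one block."""
    return [
        (x0, y0, x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


block_1 = [(0, 5), (0, 4), (0, 3), (0.4, 2), (0.8, 1), (0.8, 1)]
for vector in block_vectors(block_1):
    print(vector)
```

A block of n points yields n - 1 vectors, which is why the first row of each block has to be skipped (the $0 == 0 ? 1/0 part) instead of being paired with a leftover previous point.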

I am looking for a way to round the decimal portion of numbers up or down to the nearest .25, .5, .75, or whole number

If the decimal part is 0.01 to 0.12, it rounds down to the next lower integer
If the decimal part is 0.13 to 0.37 it rounds to 0.25
If the decimal part is 0.38 to 0.62 it rounds to 0.5
If the decimal part is 0.63 to 0.87 it rounds to 0.75
If the decimal part is 0.88 or more, it rounds up to the next higher integer
Multiply by 4, round to the nearest integer, divide by 4?
There is a general method for this:
Multiply your number by 4.
Round to the nearest integer.
Divide by 4.
In SQL:
ROUND(column * 4) / 4
I don't know the exact function name, but basically you use floor(4*x + 0.5)/4, since a plain floor(4*x)/4 would always round down. floor might be called int, to_int, or something like that.
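The same multiply, round, divide idea as a minimal Python sketch. It uses floor(x + 0.5) rather than Python's built-in round(), because round() sends exact halves to the nearest even integer; `round_to_quarter` is a made-up name.

```python
import math


def round_to_quarter(x):
    """Round x to the nearest multiple of 0.25 (halves round up)."""
    return math.floor(x * 4 + 0.5) / 4


for value in (1.12, 1.13, 1.37, 1.38, 1.62, 1.63, 1.87, 1.88):
    print(value, "->", round_to_quarter(value))
```

The sample values are the boundaries listed in the question, so the printed results can be checked against those rules directly.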