How can I search one pipeline with values from another pipeline on Google Cloud Dataflow?

I would like to use Google Cloud Dataflow to search streaming data for text that includes a specified word.
In detail, I will deal with the following two streams:
stream A: each element of the stream is a "word".
stream B: each element of the stream is a "text", and each text consists of "word"s. A text may contain a "word" from stream A.
Many "text"s flow into stream B frequently. On the other hand, a "word" flows into stream A only occasionally.
When a "word" flows into stream A, I would like to find every "text" that contains that "word" and flowed into stream B within the previous 5 minutes.
Example
time    stream A    stream B
00:01   -           this is an apple
00:02   -           this is an orange
00:03   -           I have an apple
00:04   apple       <= "this is an apple" and "I have an apple" are found
00:05   this        <= "this is an apple" and "this is an orange" are found
Can I do this kind of search with Google Cloud Dataflow?

If I understand your question correctly, there are multiple ways to achieve something like what you want. I will describe two variations.
The basic idea in my example code is to use an inner join and SlidingWindows of five minutes. You can implement the join using ParDo side inputs or CoGroupByKey, depending on your data sizes.
Here is how you set up your inputs and windowing:
PCollection<String> streamA = ...;
PCollection<String> streamB = ...;
PCollection<String> windowedStreamA = streamA.apply(
    Window.into(
        SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
PCollection<String> windowedStreamB = streamB.apply(
    Window.into(
        SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
You may want to adjust the size of windows or period to meet your specification & performance needs.
Here is a sketch of how to do the join with side inputs. This will iterate over the entire five-minute window of streamB for each element of streamA, so performance will suffer if the windows get large.
PCollectionView<Iterable<String>> streamBview = windowedStreamB.apply(View.<String>asIterable());
PCollection<String> matches = windowedStreamA.apply(
    ParDo.withSideInputs(streamBview).of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext context) {
        // Scan every text seen in the current five-minute window for the incoming word.
        for (String text : context.sideInput(streamBview)) {
          if (split(text).contains(context.element())) {
            context.output(text);
          }
        }
      }
    }));
Here is a sketch of how to do this with CoGroupByKey by pre-splitting the text and joining each keyword with the lines that contain that keyword. There is similar logic in the TfIdf example included with the SDK.
PCollection<KV<String, Void>> keyedStreamA = windowedStreamA.apply(
    MapElements
        .via((String word) -> KV.of(word, (Void) null))
        .withOutputType(new TypeDescriptor<KV<String, Void>>() {}));
PCollection<KV<String, String>> keyedStreamB = windowedStreamB.apply(
    FlatMapElements
        .via((String text) -> split(text).stream()
            .map(word -> KV.of(word, text))
            .collect(Collectors.toList()))
        .withOutputType(new TypeDescriptor<KV<String, String>>() {}));
TupleTag<Void> tagA = new TupleTag<Void>() {};
TupleTag<String> tagB = new TupleTag<String>() {};
KeyedPCollectionTuple<String> coGbkInput = KeyedPCollectionTuple
    .of(tagA, keyedStreamA)
    .and(tagB, keyedStreamB);
PCollection<String> matches = coGbkInput
    .apply(CoGroupByKey.<String>create())
    .apply(FlatMapElements
        .via((KV<String, CoGbkResult> result) ->
            // Only keep texts whose key (word) actually arrived on stream A.
            result.getValue().getAll(tagA).iterator().hasNext()
                ? result.getValue().getAll(tagB)
                : Collections.<String>emptyList())
        .withOutputType(new TypeDescriptor<String>() {}));
The best approach will depend on your data. If you are OK with getting more matches than just the last five minutes you can tune the amount of data duplication in windows by enlarging your sliding windows and having a larger period. You can also use triggers to tune when output is produced.
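The sample code above uses the Dataflow Java SDK. Purely as a hedged illustration of the window/trigger tuning mentioned here (not part of the original answer), a minimal sketch in the Apache Beam Python SDK might look like the following, where words is a placeholder for whichever PCollection you are windowing:
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# 10-minute sliding windows advancing every 5 minutes, with early firings every
# 30 seconds of processing time so matches appear before the window closes.
windowed_words = (
    words  # placeholder: your unwindowed PCollection of "word" strings
    | beam.WindowInto(
        window.SlidingWindows(size=10 * 60, period=5 * 60),
        trigger=trigger.AfterWatermark(early=trigger.AfterProcessingTime(30)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING))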

Related

How to calculate SAVI with MODIS in Google Earth Engine (Getting Error: Image.select: Pattern 'B2' did not match any bands.)

I am trying to calculate SAVI vegetation index using MODIS data. But I am getting an error showing:
Image.select: Pattern 'B2' did not match any bands.
Code:
countries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017")
canada = countries.filter(ee.Filter.eq("country_na", "Canada"))
image = ee.ImageCollection("MODIS/061/MOD09A1")\
    .filterDate('2017-01-01','2017-12-31')\
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE',10))\
    .filterBounds(canada)\
    .median()\
    .clip(canada)
savi = image.expression(
    '1.5*((NIR-RED)/(NIR+RED+0.5))',{
        'NIR':image.select('B2'),
        'RED':image.select('B1')
    }).rename('savi')
saviVis = {'min':0.0, 'max':1, 'palette':['yellow', 'green']}
Map = geemap.Map()
Map.addLayer(savi, saviVis, 'SAVI')
Map
Why am I getting this error? Isn't B1 designated to Red and B2 to NIR?
The general thing to do when you hit this type of problem is to start examining the dataset for what is actually there — how many images are you matching, what properties and bands those images have, etc. I found two problems:
Your filter criteria matched zero images. Therefore the collection is empty, and therefore the median() image from that collection has no bands at all. (You can check this by putting the collection in a variable and printing its size(); the short Python snippet after this list shows how.) You will need to adjust the criteria.
It seems that the main reason they didn't match is that the images in MODIS/061/MOD09A1 do not have a CLOUDY_PIXEL_PERCENTAGE property.
The band names for MODIS/061/MOD09A1 are not B1, B2, ... but sur_refl_b01, sur_refl_b02 and so on. You can see this with the Inspector in the Earth Engine Code Editor, or on the dataset description page.
Perhaps you were working from information about a different dataset?
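If you are using the Python API as in the question, a quick way to make both checks is something like the following sketch (it assumes the same ee session and the canada variable from the question):
collection = (ee.ImageCollection("MODIS/061/MOD09A1")
              .filterDate('2017-01-01', '2017-12-31')
              .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
              .filterBounds(canada))
print(collection.size().getInfo())   # 0 -> median() will produce an image with no bands
print(ee.Image(ee.ImageCollection("MODIS/061/MOD09A1").first()).bandNames().getInfo())
# ['sur_refl_b01', 'sur_refl_b02', ...] rather than 'B1'/'B2'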
With the two problems above fixed, your code produces some results. This is the (JS) version I produced while testing (Code Editor link):
var countries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017");
var canada = countries.filter(ee.Filter.eq("country_na", "Canada"));
var images = ee.ImageCollection("MODIS/061/MOD09A1")
    .filterDate('2017-01-01','2017-12-31')
    // .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE',10))
    .filterBounds(canada);
// print(images);
var image = images.median().clip(canada);
Map.addLayer(canada);
Map.addLayer(image);
var savi = image.expression(
    '1.5*((NIR-RED)/(NIR+RED+0.5))',{
        'NIR':image.select('sur_refl_b02'),
        'RED':image.select('sur_refl_b01')
    }).rename('savi');
var saviVis = {'min':0.0, 'max':1, 'palette':['yellow', 'green']};
Map.addLayer(savi, saviVis, 'SAVI')
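Since the question itself uses the Python API with geemap, here is a rough Python equivalent of the same fixes (a hedged, untested sketch; it assumes ee has been initialized and geemap imported as in the question):
countries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017")
canada = countries.filter(ee.Filter.eq("country_na", "Canada"))

# No CLOUDY_PIXEL_PERCENTAGE filter: MOD09A1 images do not carry that property.
images = (ee.ImageCollection("MODIS/061/MOD09A1")
          .filterDate('2017-01-01', '2017-12-31')
          .filterBounds(canada))
image = images.median().clip(canada)

# MOD09A1 bands are sur_refl_b01 (red) and sur_refl_b02 (NIR), not B1/B2.
savi = image.expression(
    '1.5*((NIR-RED)/(NIR+RED+0.5))', {
        'NIR': image.select('sur_refl_b02'),
        'RED': image.select('sur_refl_b01')
    }).rename('savi')

saviVis = {'min': 0.0, 'max': 1, 'palette': ['yellow', 'green']}
Map = geemap.Map()
Map.addLayer(savi, saviVis, 'SAVI')
Map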

anova_test not returning Mauchly's for three way within subject ANOVA

I am using a data set called sleep (found here: https://drive.google.com/file/d/15ZnsWtzbPpUBQN9qr-KZCnyX-0CYJHL5/view) to run a three-way within-subject ANOVA comparing Performance based on Stimulation, Deprivation, and Time. I have successfully done this before using anova_test from rstatix. I want to look at the sphericity output, but it doesn't appear in the output. I have gotten it to come up with other three-way within-subject datasets, so I'm not sure why this is happening. Here is my code:
anova_test(data = sleep, dv = Performance, wid = Subject, within = c(Stimulation, Deprivation, Time))
I also tried to save it to an object and use get_anova_table, but that didn't look any different.
sleep_aov <- anova_test(data = sleep, dv = Performance, wid = Subject, within = c(Stimulation, Deprivation, Time))
get_anova_table(sleep_aov, correction = "GG")
This is an ideal dataset I pulled from the internet, so I'm starting to think the data had a W of 1 (perfect sphericity) and so rstatix is skipping this output. Is this something anova_test does?
Here also is my code using a dataset that does return Mauchly's:
weight_loss_long <- pivot_longer(data = weightloss, cols = c(t1, t2, t3), names_to = "time", values_to = "loss")
weight_loss_long$time <- factor(weight_loss_long$time)
anova_test(data = weight_loss_long, dv = loss, wid = id, within = c(diet, exercises, time))
Not an expert at all, but it might be because your factors have only two levels.
From anova_summary() help:
"Value
return an object of class anova_test, a data frame containing the ANOVA table for independent measures ANOVA. However, for repeated/mixed measures ANOVA, a list containing the following components is returned:
ANOVA: a data frame containing ANOVA results
Mauchly's Test for Sphericity: If any within-Ss variables with more than 2 levels are present, a data frame containing the results of Mauchly's test for Sphericity. Only reported for effects that have more than 2 levels because sphericity necessarily holds for effects with only 2 levels.
Sphericity Corrections: If any within-Ss variables are present, a data frame containing the Greenhouse-Geisser and Huynh-Feldt epsilon values, and corresponding corrected p-values. "

Can Tesseract OCR recognize subscripts and superscripts?

I have problems with the general recognition of subscript and superscript in text fragments.
Example-image:
I used Tesseract 4.1.1 with the training data available under https://github.com/tesseract-ocr/tessdata_best. The numerous options had default values except:
tessedit_create_hocr = 1 (to get result as HOCR)
hocr_font_info = 1 (to get additional font infos like font size)
hocr_char_boxes = 1 (to get character-based result)
The language was set to eng. Neither with page segmentation mode 3 (PSM_AUTO_OSD) nor 11 (PSM_SPARSE_TEXT) nor 12 (PSM_SPARSE_TEXT_OSD) was the subscript/superscript recognized correctly.
In the output the sub/sup-fragments were all more or less wrong:
"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"
Using Tesseract for OCR, is there a way to ...
optimize subscript/superscript handling
get info about recognized subscripts/superscripts (in the HOCR output, ideally for each character)
Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.
Following these two links from the Tesseract Google group, at first it really seemed to be a question of training:
link1 and link2.
But after doing some experiments I found out that the OEM_DEFAULT OCR engine mode just doesn't bring up the needed information. I found a partial solution to the problem. Partial, because I now get most of the info about sub/sup, and the recognized characters are right in most cases, but not for all characters.
Using the OEM_TESSERACT_ONLY OCR engine mode (i.e. the legacy mode) and some API methods provided by Tess4J, I came up with the following Java test class:
public class SubSupEvaluator {
public void determineSubSupCharacters(BufferedImage image) {
//1. initialize Tesseract and set image infos
TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
try {
int bpp = image.getColorModel().getPixelSize();
int bytespp = bpp / 8;
int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);
//2. start actual OCR run
TessBaseAPIRecognize(handle, null);
//3. iterate over the result character-wise
TessResultIterator ri = TessBaseAPIGetIterator(handle);
TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
TessPageIteratorBegin(pi);
do {
//determine character
Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
String character = ptr.getString(0);
TessDeleteText(ptr); //release memory
//determine position information
IntBuffer leftB = IntBuffer.allocate(1);
IntBuffer topB = IntBuffer.allocate(1);
IntBuffer rightB = IntBuffer.allocate(1);
IntBuffer bottomB = IntBuffer.allocate(1);
TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);
//write info to console
System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
} while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
} finally {
TessBaseAPIDelete(handle); //release memory
}
}
}
The legacy mode only works with 'normal' training data; using the '-best' training data produces an error.
There is very little information on this topic.
One option to improve sub/superscript character recognition (even if not the position information itself) is to preprocess the image, e.g. with cv2 / PIL (Pillow), and then run Tesseract on it.
See
How to detect subscript numbers in an image using OCR?
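As a hedged illustration of that preprocessing idea (not a guaranteed fix), the image could be upscaled and binarized with OpenCV before handing it to Tesseract via pytesseract; the file name, scale factor and page segmentation mode below are just placeholders:
import cv2
import pytesseract

img = cv2.imread("example.png")                 # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Upscaling often helps Tesseract with small sub-/superscript glyphs.
gray = cv2.resize(gray, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
# Otsu binarization for a clean black-and-white input.
_, binarized = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Request HOCR output so positions of the recognized characters are kept.
hocr = pytesseract.image_to_pdf_or_hocr(binarized, extension="hocr", config="--psm 6")
print(hocr.decode("utf-8"))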
Related (but otherwise not answering the question):
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html
https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp
What do you guys think about getting Tesseract to recognize single letters?
Tesseract does not recognize single characters
I tried it with the option --psm 10
tesseract imTstg.png out5 --psm 10
but it did not seem to work. I am thinking about just running YOLO to detect the single letters.

Reading from binary file into several labels on a form in C#

I'm writing a trivia game app in C# that writes data to a binary file, then reads the data from the file into six labels. The six labels are as follows:
lblQuestion // This is where the question text goes.
lblPoints // This is where the question points goes.
lblAnswerA // This is where multiple choice answer A goes.
lblAnswerB // This is where multiple choice answer B goes.
lblAnswerC // This is where multiple choice answer C goes.
lblAnswerD // This is where multiple choice answer D goes.
Here is the code for writing to the binary file:
{
bw.Write(Question);
bw.Write(Points);
bw.Write(AnswerA);
bw.Write(AnswerB);
bw.Write(AnswerC);
bw.Write(AnswerD);
}
Now for the code to read data from the file into the corresponding labels:
{
FileStream fs = File.OpenRead(ofd.FileName);
BinaryReader br = new BinaryReader(fs);
lblQuestion.Text = br.ReadString();
lblPoints.Text = br.ReadInt32() + " points";
lblAnswerA.Text = br.ReadString();
lblAnswerB.Text = br.ReadString();
lblAnswerC.Text = br.ReadString();
lblAnswerD.Text = br.ReadString();
}
The Question string reads correctly into lblQuestion.
The Points value reads correctly into lblPoints.
AnswerA, AnswerB, and AnswerC DO NOT read into lblAnswerA, lblAnswerB and lblAnswerC respectively.
lblAnswerD, however, gets the string meant for lblAnswerA.
Looking at the code for reading data into the labels, is there something missing, some sort of incremental value that needs to be inserted into the code in order to get the strings to the correct labels?

Methods for deleting blank (or nearly blank) pages from TIFF files

I have something like 40 million TIFF documents, all 1-bit single page duplex. In about 40% of cases, the back image of these TIFFs is 'blank' and I'd like to remove them before I do a load to a CMS to reduce space requirements.
Is there a simple method to look at the data content of each page and delete it if it falls under a preset threshold, say 2% 'black'?
I'm technology agnostic on this one, but a C# solution would probably be the easiest to support. Problem is, I've no image manipulation experience so don't really know where to start.
Edit to add: The images are old scans and so are 'dirty', so this is not expected to be an exact science. The threshold would need to be set to avoid the chance of false positives.
You probably should:
open each image
iterate through its pages (using Bitmap.GetFrameCount / Bitmap.SelectActiveFrame methods)
access bits of each page (using Bitmap.LockBits method)
analyze contents of each page (simple loop)
if the contents are worthwhile, copy the data to another image (Bitmap.LockBits and a loop)
This task isn't particularly complex but will require some code to be written. This site contains some samples that you may search for using method names as keywords.
P.S. I assume that all of the images can be successfully loaded into a System.Drawing.Bitmap.
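Since the question is technology-agnostic, here is a rough sketch of the same thresholding idea in Python with Pillow instead of System.Drawing (the 2% threshold, file names and compression choice are placeholders you would tune for your scans):
from PIL import Image, ImageSequence

def nonblank_frames(path, black_threshold=0.02):
    # Yield copies of the pages whose fraction of black pixels is at least the threshold.
    with Image.open(path) as tiff:
        for frame in ImageSequence.Iterator(tiff):
            bw = frame.convert("1")        # 1-bit image: histogram bin 0 counts black pixels
            black_fraction = bw.histogram()[0] / float(bw.width * bw.height)
            if black_fraction >= black_threshold:
                yield frame.copy()

def drop_blank_pages(src_path, dst_path):
    frames = list(nonblank_frames(src_path))
    if frames:
        # Write the surviving pages to a new multi-page TIFF with G4 compression.
        frames[0].save(dst_path, save_all=True, append_images=frames[1:],
                       compression="group4")

drop_blank_pages("scan.tif", "scan_cleaned.tif")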
You can do something like that with DotImage (disclaimer, I work for Atalasoft and have written most of the underlying classes that you'd be using). The code to do it will look something like this:
public void RemoveBlankPages(Stream stm)
{
List<int> blanks = new List<int>();
if (GetBlankPages(stm, blanks)) {
// all pages blank - delete file? Skip? Your choice.
}
else {
// memory stream is convenient - maybe a temp file instead?
using (MemoryStream ostm = new MemoryStream()) {
// pulls out all the blanks and writes to the temp stream
stm.Seek(0, SeekOrigin.Begin);
RemoveBlanks(blanks, stm, ostm);
CopyStream(ostm, stm); // copies first stm to second, truncating at end
}
}
}
private bool GetBlankPages(Stream stm, List<int> blanks)
{
TiffDecoder decoder = new TiffDecoder();
ImageInfo info = decoder.GetImageInfo(stm);
for (int i=0; i < info.FrameCount; i++) {
try {
stm.Seek(0, SeekOrigin.Begin);
using (AtalaImage image = decoder.Read(stm, i, null)) {
if (IsBlankPage(image)) blanks.Add(i);
}
}
catch {
// bad file - skip? could also try to remove the bad page:
blanks.Add(i);
}
}
return blanks.Count == info.FrameCount;
}
private bool IsBlankPage(AtalaImage image)
{
// you might want to configure the command to do noise removal and black border
// removal (or not) first.
BlankPageDetectionCommand command = new BlankPageDetectionCommand();
BlankPageDetectionResults results = command.Apply(image) as BlankPageDetectionResults;
return results.IsImageBlank;
}
private void RemoveBlanks(List<int> blanks, Stream source, Stream dest)
{
// blanks needs to be sorted low to high, which it will be if generated from
// above
TiffDocument doc = new TiffDocument(source);
int totalRemoved = 0;
foreach (int page in blanks) {
doc.Pages.RemoveAt(page - totalRemoved);
totalRemoved++;
}
doc.Save(dest);
}
You should note that blank page detection is not as simple as "are all the pixels white(-ish)?" since scanning introduces all kinds of interesting artifacts. To get the BlankPageDetectionCommand, you would need the Document Imaging package.
Are you interested in shrinking the files, or do you just want to avoid people wasting their time viewing blank pages? You can do a quick and dirty edit of the files to rid yourself of known blank pages by just patching the offset of the second IFD to 0x00000000. Here's what I mean - TIFF files have a simple layout if you're just navigating through the pages:
TIFF Header (4 bytes)
First IFD offset (4 bytes - typically points to 0x00000008)
IFD:
Number of tags (2-bytes)
{individual TIFF tags} (12-bytes each)
Next IFD offset (4 bytes)
Just patch the "next IFD offset" to a value of 0x00000000 to "unlink" pages beyond the current one.
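To make that layout concrete, here is a small hedged sketch in Python that does this patch for a classic (non-BigTIFF) file: it walks to the first IFD and zeroes its "next IFD offset" field, unlinking every page after the first. It edits the file in place, so run it on a copy:
import struct

def unlink_pages_after_first(path):
    with open(path, "r+b") as f:
        header = f.read(8)
        # Bytes 0-1 are the byte-order mark: b'II' = little-endian, b'MM' = big-endian.
        endian = "<" if header[:2] == b"II" else ">"
        # Bytes 4-7 hold the offset of the first IFD.
        first_ifd = struct.unpack(endian + "I", header[4:8])[0]
        f.seek(first_ifd)
        num_tags = struct.unpack(endian + "H", f.read(2))[0]
        # Skip the 12-byte tag entries; the 4-byte "next IFD offset" follows them.
        f.seek(first_ifd + 2 + 12 * num_tags)
        f.write(struct.pack(endian + "I", 0))  # 0x00000000 = no further IFDs

unlink_pages_after_first("copy_of_scan.tif")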