Methods for deleting blank (or nearly blank) pages from TIFF files - language-agnostic

I have something like 40 million TIFF documents, all 1-bit single page duplex. In about 40% of cases, the back image of these TIFFs is 'blank' and I'd like to remove them before I do a load to a CMS to reduce space requirements.
Is there a simple method to look at the data content of each page and delete it if it falls under a preset threshold, say 2% 'black'?
I'm technology agnostic on this one, but a C# solution would probably be the easiest to support. Problem is, I've no image manipulation experience so don't really know where to start.
Edit to add: The images are old scans and so are 'dirty', so this is not expected to be an exact science. The threshold would need to be set to avoid the chance of false positives.

You probably should:
open each image
iterate through its pages (using Bitmap.GetFrameCount / Bitmap.SelectActiveFrame methods)
access bits of each page (using Bitmap.LockBits method)
analyze contents of each page (simple loop)
if contents is worthwhile then copy data to another image (Bitmap.LockBits and a loop)
This task isn't particularly complex but will require some code to be written. This site contains some samples that you may search for using method names as keywords).
P.S. I assume that all of images can be successfully loaded into a System.Drawing.Bitmap.

You can do something like that with DotImage (disclaimer, I work for Atalasoft and have written most of the underlying classes that you'd be using). The code to do it will look something like this:
public void RemoveBlankPages(Stream source stm)
{
List<int> blanks = new List<int>();
if (GetBlankPages(stm, blanks)) {
// all pages blank - delete file? Skip? Your choice.
}
else {
// memory stream is convenient - maybe a temp file instead?
using (MemoryStream ostm = new MemoryStream()) {
// pulls out all the blanks and writes to the temp stream
stm.Seek(0, SeekOrigin.Begin);
RemoveBlanks(blanks, stm, ostm);
CopyStream(ostm, stm); // copies first stm to second, truncating at end
}
}
}
private bool GetBlankPages(Stream stm, List<int> blanks)
{
TiffDecoder decoder = new TiffDecoder();
ImageInfo info = decoder.GetImageInfo(stm);
for (int i=0; i < info.FrameCount; i++) {
try {
stm.Seek(0, SeekOrigin.Begin);
using (AtalaImage image = decoder.Read(stm, i, null)) {
if (IsBlankPage(image)) blanks.Add(i);
}
}
catch {
// bad file - skip? could also try to remove the bad page:
blanks.Add(i);
}
}
return blanks.Count == info.FrameCount;
}
private bool IsBlankPage(AtalaImage image)
{
// you might want to configure the command to do noise removal and black border
// removal (or not) first.
BlankPageDetectionCommand command = new BlankPageDetectionCommand();
BlankPageDetectionResults results = command.Apply(image) as BlankPageDetectionResults;
return results.IsImageBlank;
}
private void RemoveBlanks(List<int> blanks, Stream source, Stream dest)
{
// blanks needs to be sorted low to high, which it will be if generated from
// above
TiffDocument doc = new TiffDocument(source);
int totalRemoved = 0;
foreach (int page in blanks) {
doc.Pages.RemoveAt(page - totalRemoved);
totalRemoved++;
}
doc.Save(dest);
}
You should note that blank page detection is not as simple as "are all the pixels white(-ish)?" since scanning introduces all kinds of interesting artifacts. To get the BlankPageDetectionCommand, you would need the Document Imaging package.

Are you interested in shrinking the files or just want to avoid people wasting their time viewing blank pages? You can do a quick and dirty edit of the files to rid yourself of known blank pages by just patching the second IFD to be 0x00000000. Here's what I mean - TIFF files have a simple layout if you're just navigating through the pages:
TIFF Header (4 bytes)
First IFD offset (4 bytes - typically points to 0x00000008)
IFD:
Number of tags (2-bytes)
{individual TIFF tags} (12-bytes each)
Next IFD offset (4 bytes)
Just patch the "next IFD offset" to a value of 0x00000000 to "unlink" pages beyond the current one.

Related

Putting C++ string in HTML code to show value on webserver

I've set up a webserver running on ESP8266 thats currently hosting 7 sites. The sites is written in plain HTML in each diffrent tab in the arduino ide. I have installed the library Pagebuilder to help with making everything look nice and run.
Except one thing. I have a button connected to my ESP8266 which by the time being imitates a sensor input. basicly when the button is pressed my integer "x" increments with 1. I also managed to make a string that replicates "x" and increments with the same value.
I also have a problem with Printing the IPadresse of the server, but thats not as important as the other.
My plan then was writing the string "score" (which contains x) into the HTML tab where it should be output. this obviously didnt work.
Things I've tried:
Splitting up the HTML code where I want the string to be printed and using client.println("");
This didnt work because the two libraries does not cooperate and WiFiClient does not find Pagebuilders server. (basicly, the client.println does nothing when I used it with Pagebuilder).
Reconstructing the HTML page as a literal really long string, and adding in the String with x like this: "html"+score+"html" and adding it into where the HTML page const char were. (basicly replacing the variable with the text that were in the variable).
This did neighter work because the argument "PageElement" from Pagebuilder does only expect one string, and errors out because theres an additional string inside the HTML string.
I've tried sending it as a post req. but this did not output the value either.
I have run out of Ideas to try.
//root page
#if defined(ARDUINO_ARCH_ESP8266)
#include <ESP8266WiFi.h>
#include <ESP8266WebServer.h>
#include <WiFiClient.h>
#elif defined(ARDUINO_ARCH_ESP32)
#include <WiFi.h>
#include <WebServer.h>
#endif
#include "PageBuilder.h"
#include "currentGame.h" //tab 1
#if defined(ARDUINO_ARCH_ESP8266)
ESP8266WebServer Server;
ESP8266WebServer server;
#endif
int sensorPin = 2; // button input
int sensorValue = 0;
int x = 0; // the int x
String score=""; //the string x will be in
PageElement CURRENT_GAME_ELEMENT(htmlPage1);
PageBuilder CURRENT_GAME("/current-game", {CURRENT_GAME_ELEMENT}); // this //only showes on href /current-game
void button() {
sensorValue = analogRead(sensorPin); //read the voltage
score="Team 1: "+String((int)x+1); //"make" x a string
if (sensorValue <= 10) { // check if button is pressed
x++; // increment x
Serial.println(x);
Serial.println(score);
delay(100);
}
}
void setup() {
Serial.begin(115200);
pinMode(2, INPUT);
WiFi.softAP("SSID", "PASS");
delay(100);
CURRENT_GAME.insert(Server);
Server.begin();
}
void loop() {
Server.handleClient();
button();
}
// tab 1
const char htmlPage1[] PROGMEM = R"=====(
/*
alot of HTML, basicly the whole website...
..............................................
*/
<div class="jumbotron">
<div align="center">
<h1 class="display-4"> score </h1> // <--- this is where
//I want to print the
//string:
</div>
</div>
)=====";
what I want to do is getting the value of the string score displayed on the website. If I put "score" directly into the HTML, the word score will be displayed, not the value. I want the value displayed.
Edit:
I have figured out how to make the string(score) be printed in the HTML code, thus, I only have to convert the HTML code string back to a char. explanation is in comment below.
Edit 2: (-------------------------solution-------------------------)
Many thanks for the help I've gotten and sorry for being so ignorant, its just so hard being so close and that thing doesnt work. but anyways, What I did was following Pagebuilders example, and making another element to print in current game..
String test(PageArgument& args) {
return score;
}
const char html[] = "<div class=\"jumbotron\"><div align=\"center\"><h1 class=\"display-4\">{{NAME}}</h1></div></div>";
PageElement FRAMEWORK_PAGE_ELEMENT(htmlPage0);
PageBuilder FRAMEWORK_PAGE("/", {FRAMEWORK_PAGE_ELEMENT});
PageElement body_elem(html, { {"NAME", test} });
PageElement CURRENT_GAME_ELEMENT(htmlPage1);
PageBuilder CURRENT_GAME("/current-game", { CURRENT_GAME_ELEMENT, body_elem});
suprisingly easy when I first understood it.. Thanks again.
You could try building your string first, then converting it to a const char
like this: const char * c = str.c_str();
if you can't use a pointer you could try this:
string s = "yourHTML" + score + "moreHTML";
int n = s.length();
char char_array[n + 1];
strcpy(char_array, s.c_str());
additionally you could try the stringstream standard library
This sort of thing is often done using magic tags in your markup that are detected by the server code before it serves the HTML and filled in by executing some sort of callback or filling in a variable, or whatever.
So with this in mind and hoping for the best, I nipped over to: PageBuilder on github and looked to see if there was something similar here. Good news! In one of the examples:
const char html[] = "hello <b>{{NAME}}</b>, <br>Good {{DAYTIME}}.";
...
PageElement body_elem(html, { {"NAME", AsName}, {"DAYTIME", AsDayTime} });
Where {{NAME}} and {{DAYTIME}} are magic tokens. AsName and AsDayTime are functions to be called when the respective tag is encountered while the page is being served.
EDIT: in response to a request to explain differently, I'm not convinced I can do a better job of explaining the code than the example on the library's own github page, so I'll try a wordy description instead:
When you want to serve a webpage to a client, the code needs to know what you want to serve. In the simplest case, it's a static page: the same every time. You can just write the HTML, stick it in a string an be done.
whole_page = "<html>My fixed content</html>";
webserver.serve(whole_page);
But you want some dynamic element(s). As noted, you can do it in a few ways, such as serving some static HTML, then the dynamic bit, then some more static HTML. It seems you've not had much luck like this, and it's rather clunky anyway.
Or you can pre-build a new string out of the three bits and serve that in one chunk, but that's also pretty clunky.
(Aside: taking big strings and adding them together is likely to be slow and memory intensive, two things you really don't want on a little CPU like the ESP8266).
So instead, you allow 'magic' markers in the HTML, using a marker in place of the dynamic content, and serve that instead.
whole_page = "<html>My dynamic content. Value is {{my_value}}</html>";
webserver.serve(whole_page, ...);
The clever bit is that as the page is being served, the webserver is watching the text go by, and when it sees a magic tag, it stops and asks you to fill in the blank, then carries on as before.
Obviously, there is some processing overhead with watching for tags, and some programming overhead with telling it what tags to watch for and how to ask you for the data it needs.
I got advice from a friend who told me I should make a unique argument where I wanted the string(x) and then using some syntax to replace it. I also took inspiration from you Jelle..
what I did was make a unique argument "VAR_CURRENT_SCORE" put that into the HTML where I want the score output, then convert htmlPage1 from a char to a string, use string.replace() and replace "VAR_CURRENT_SCORE" with the string(x) score. this workes as I can see in the serial monitor output.
This is what I did:
//root page
String HTMLstring(htmlstringPage);
delay(100);
HTMLstring.replace("VAR_CURRENT_SCORE", score);
delay(50);
Serial.println("string:");
Serial.println(HTMLstring);
//tab 1 char htmlstringPage[] PROGMEM = R"=====(
<div class="jumbotron">
<div align="center">
<h1 class="display-4">VAR_CURRENT_SCORE</h1>
</div>
</div>
)=====";
However, I still have a small problem left which is converting the string back to char to post it to the website.
To convert the string back:
request->send_P(200, "text/html", HTMLstring.c_str());

How to maintain the Table-like structure of a SWT Label or SWT Text?

I have to display the content of a file using some widgets of Eclipse SWT. The file has structure, it has unkown number of rows with unknown number of columns. The file is formatted properly (i.e. the second column in the first row starts exactly where the second column of the second, third, etc... row.) It has a table-like structure without having an actual table:
itt.van.juci tt.mm.yyyy // hh:mm:ss
juci 14.09.2017 // 08;08:08
If a read the file and print it again to another file I get the same structure without any distortion.
However if I read it into String and use this String to set an SWT Label or SWT.Text I get distorted appearance. The rows in the second (and any subsequent columns) do not start at the same place. The length of the content of the first column influences the starting point of the later columns:
itt.van.juci tt.mm.yyyy // hh:mm:ss
juci 14.09.2017 // 08;08:08
Is there a way to get rid of these extra spaces and keep the original structure?
Here is the setting-the-label part that I use:
Label historyLabel = new Label(compositeFiles, SWT.WRAP);
GridData gd_lblNewLabel = new GridData(SWT.FILL, SWT.FILL, true, true, 1, 1);
gd_lblNewLabel.widthHint = 566;
gd_lblNewLabel.heightHint = 237;
historyLabel.setLayoutData(gd_lblNewLabel);
historyLabel.setText(Dummy.dummyHistory());
}
and here is reading the file:
private static String java_8() {
StringBuilder sb = new StringBuilder();
try (Stream<String> stream = Files.lines(Paths.get(fileName_2))) {
stream.forEach(line->{
sb.append(line).append("\n");
});
} catch (IOException e) {
e.printStackTrace();
}
return sb.toString();
}
Any solution is highly appreciated. I do not need to use org.eclipse.swt.widgets.Text if other widgets suit more. However it would be difficult to use a Table since I do not know how many columns I need.
I need to set the font like this:
Font mono = new Font(compositeFiles.getDisplay(), "Courier New", 12, SWT.NONE);
historyLabel.setFont(mono);
My answare is based on the hints given by greg-449.

How can I search pipeline with another pipeline value on google cloud dataflow

I would like to search text which includes specified word from stream data with google cloud dataflow.
In detail, I will deal with following two stream.
stream A: element of stream is "word"
stream B: element of stream is "text". and each text consists of "word". This text may have "word" on stream A
Many "text" flow into stream B frequently. On the other hand, "word" flow into stream A occasionally.
When "word" flow into stream A, I would like to search "text" which has "word" and flow into stream B after 5 minutes ago.
Example
time stream A : stream B
00:01 - this is an apple
00:02 - this is an orange
00:03 - I have an apple
00:04 apple <= "this is an apple" and "I have an apple" are found
00:05 this <= "this is an apple" and "this is an orange" are found
Can I search text with google cloud dataflow?
If I understand your question correctly, there are multiple ways to achieve something like what you want. I will describe two variations.
The basic idea in my example code is to use an inner join and SlidingWindows of five minutes. You can implement the join using ParDo side inputs or CoGroupByKey, depending on your data sizes.
Here is how you set up your inputs and windowing:
PCollection<String> streamA = ...;
PCollection<String> streamB = ...;
PCollection<String> windowedStreamA = streamA.apply(
Window.into(
SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
PCollection<String> windowedStreamB = streamB.apply(
Window.into(
SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
You may want to adjust the size of windows or period to meet your specification & performance needs.
Here is a sketch of how to do the join with side inputs. This will iterate over the entire five minute window of streamB for each element of streamA, so performance will suffer if windows get large.
PCollectionView<Iterable<String>> streamBview = streamB.apply(View.asIterable());
PCollection<String> matches = windowedStreamA.apply(
ParDo.of(new DoFn<String, String>() {
#Override void processElement(ProcessContext context) {
for (String text : context.sideInput()) {
if (split(text).contains(context.element())) {
context.output(text);
}
}
}
});
Here is a sketch of how to do this with CoGroupByKey by pre-splitting the text and joining each keyword with the lines that contain that keyword. There is similar logic in the TfIdf example included with the SDK.
PCollection<KV<String, Void>> keyedStreamA = windowedStreamA.apply(
MapElements
.via(word -> KV.of(word, null))
.withOutputType(new TypeDescriptor<KV<String, Void>>() {}));
PCollection<KV<String, String>> keyedStreamB = windowedStreamB.apply(
FlatMapElements
.via(text -> split(text).forEach(word --> KV.of(word, text))
.withOutputType(new TypeDescriptor<KV<String, String>>() {}));
TupleTag<Void> tagA = new TupleTag<Void>() {};
TupleTag<String> tagB = new TupleTag<String>() {};
KeyedPCollectionTuple coGbkInput = KeyedPCollectionTuple
.of(tagA, keyedStreamA)
.and(tagB, keyedStreamB);
PCollection<String> matches = coGbkInput
.apply(CoGroupByKey.create())
.apply(FlatMapElements
.via(result -> result.getAll(tagB))
.withOutputType(new TypeDescriptor<String>()));
The best approach will depend on your data. If you are OK with getting more matches than just the last five minutes you can tune the amount of data duplication in windows by enlarging your sliding windows and having a larger period. You can also use triggers to tune when output is produced.

HTML5 File API - slicing or not?

There are some nice examples about file uploading at HTML5 Rocks but there are something that isn't clear enough for me.
As far as i see, the example code about file slicing is getting a specific part from file then reading it. As the note says, this is helpful when we are dealing with large files.
The example about monitoring uploads also notes this is useful when we're uploading large files.
Am I safe without slicing the file? I meaning server-side problems, memory, etc. Chrome doesn't support File.slice() currently and i don't want to use a bloated jQuery plugin if possible.
Both Chrome and FF support File.slice() but it has been prefixed as File.webkitSlice() File.mozSlice() when its semantics changed some time ago. There's another example of using it here to read part of a .zip file. The new semantics are:
Blob.webkitSlice(
in long long start,
in long long end,
in DOMString contentType
);
Are you safe without slicing it? Sure, but remember you're reading the file into memory. The HTML5Rocks tutorial offers chunking the upload as a potential performance improvement. With some decent server logic, you could also do things like recovering from a failed upload more easily. The user wouldn't have to re-try an entire 500MB file if it failed at 99% :)
This is the way to slice the file to pass as blobs:
function readBlob() {
var files = document.getElementById('files').files;
var file = files[0];
var ONEMEGABYTE = 1048576;
var start = 0;
var stop = ONEMEGABYTE;
var remainder = file.size % ONEMEGABYTE;
var blkcount = Math.floor(file.size / ONEMEGABYTE);
if (remainder != 0) blkcount = blkcount + 1;
for (var i = 0; i < blkcount; i++) {
var reader = new FileReader();
if (i == (blkcount - 1) && remainder != 0) {
stop = start + remainder;
}
if (i == blkcount) {
stop = start;
}
//Slicing the file
var blob = file.webkitSlice(start, stop);
reader.readAsBinaryString(blob);
start = stop;
stop = stop + ONEMEGABYTE;
} //End of loop
} //End of readblob

Flash ABC : What does the number part of <file>.as$<number> in a swfdump

If I take a swf, and run it through swfdump
swfdump.exe -abc file.swf > ABC.txt
One the first run I may get some output in ABC.txt like this
ObjectConfig.as$60
And on a subsequent run of the same SWF get a different output
ObjectConfig.as$61
What is the meaning of the number after the $ ?
This is part of the debug metadata that the mxmlc compiler adds to the bytecode when you do a debug compile, debug=true. If you do a normal release compile, this info is omitted.
This metadata stores filenames and line numbers so that you can see the location in your source while debugging. Although I'm not sure on the exact meaning of these particular numbers, they seem to be a unique identifier or index of that file for the debugger, perhaps in case of two classes with the same name.
The best I can see is in the source code for swfdump, it calls swf_GetString. Somewhere in this chain it adds what looks like a debugLine or a scopeDepth to the end of the class name:
char* swf_GetString(TAG*t)
{
int pos = t->pos;
while(t->pos < t->len && swf_GetU8(t));
/* make sure we always have a trailing zero byte */
if(t->pos == t->len) {
if(t->len == t->memsize) {
swf_ResetWriteBits(t);
swf_SetU8(t, 0);
t->len = t->pos;
}
t->data[t->len] = 0;
}
return (char*)&(t->data[pos]);
}