How to get sift to use a pager?

I recently stumbled upon sift ("A fast and powerful alternative to grep."). I quite like its feature-set, but there's one thing I'm missing: paged output (with colors).
I mean, it can be done:
$ sift 'search_string' --color | less --RAW-CONTROL-CHARS
The --color I can put into .sift.conf, but the | less -R I have to repeat for every invocation.
With ack, I can stick this into the .ackrc and forget about it (Using ack, how do you display one page at a time? / http://beyondgrep.com/documentation/).
How do I do this with sift?
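(If sift really has no pager setting of its own, one workaround is a small shell wrapper; a sketch for bash/zsh, reusing the --color / --RAW-CONTROL-CHARS combination from above — the function approach itself is just an assumption, not a sift feature:)
# in ~/.bashrc or ~/.zshrc: always force colors and page sift's output
sift() {
    command sift --color "$@" | less --RAW-CONTROL-CHARS
}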

Related

Formatting wide output via 'column' (or similar) command(s)

This question actually asks for the 'inverse' of the solution there: namely, I would like to wrap the long column (column 4) onto multiple lines. In effect, the output should look like:
cat test.csv | column -s"," -t -c5
col1  col2  col3  col4             col5
1     2     3     longLineOfText   5
                  ThatIWantTo
                  InspectAndWould
                  LikeToWrap
(excuse the u.u.o.c. duplicated over here :) )
The solution would ideally:
make use of standard *nix text processing utilities (e.g. column, paste, pr), which are usually present on any modern Linux machine nowadays, typically coming from the coreutils package;
avoid jq, as it is not necessarily present on every (production) system;
not overheat the brain: yes... I am looking mainly at you, awk & co. gurus :). "Normal" awk / perl / sed is fine;
as a special bonus, a solution using vim would be even more welcome (again, no brain smoke please), since that would allow for syntax coloring as well.
The background: I want to be able to make sense of the output of docker history, so as a last resort even some Go template magic would suit, as would using jq.
In extreme cases, downloading a new utility (preferably self-contained / statically linked) onto the server is OK, if the benefits of ease of remembering and use outweigh the inconvenience, as is using JSON processing commands (in which case Python's json module would be preferred).
Thanks !
LE:
Please keep in mind that docker's output has its columns separated by several spaces, which unfortunately confuses most commands :(
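Not a full answer, but a minimal awk sketch of the wrapping itself, assuming comma-separated input as in test.csv above; the column widths (6/17/18) and the wrap length w=15 are illustrative, and it wraps at a fixed width rather than at word boundaries:
awk -F',' -v w=15 '{
    # first physical line: columns 1-3, the first w characters of column 4, then column 5
    printf "%-6s%-6s%-6s%-17s%s\n", $1, $2, $3, substr($4, 1, w), $5
    # remaining chunks of column 4 go on continuation lines, indented to sit under it
    for (i = w + 1; i <= length($4); i += w)
        printf "%18s%s\n", "", substr($4, i, w)
}' test.csv
For docker history itself, one way to sidestep the multi-space separator might be docker history --format with a Go template that emits tab-separated fields (if your Docker version supports it), followed by the same kind of awk pass with -F'\t'.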

Tesseract receipt scanning advice needed

I have struggled off and on with Tesseract for various OCR projects, and today I found a use case that I thought would be a slam dunk for it, but after many hours I am still coming away unsatisfied. I wanted to pose the problem here and see if anyone else has advice on how to solve this task.
My wife came to me this morning and asked if there was any way she could easily scan her receipts from Wal-Mart and over time build a history of prices spent in categories and for specific items, so that we could do some trending and easily deep dive on where the spending is going. At first I felt like this was a very tall order, but after doing some digging I found a few things that make me feel this is within reach:
Wal-Mart receipts are, in general, very well structured and easy to read. They even include the UPC for every item (potential for lookups against a UPC database?) and appear to classify food items with an F or I (not sure what the difference is) and have a tax code column as well that may prove useful should I learn the secrets of what the codes mean.
I further discovered that there is some kind of Wal-Mart item lookup API that I may be able to get access to which would prove useful in the UPC lookup.
They have an app for smart phones that lets you scan a QR code printed on every receipt. That app looks up a "TC" code off the receipt and pulls down the entire itemized receipt from their servers. It shows you an excellent graphical representation of the receipt including thumbnail pictures of all the items and the cost, etc. If this app would simply categorize and summarize the receipt, I would be done! But alas, that's not the purpose of the app ....
The final piece of the puzzle is that you can export a computer generated PNG image of the receipt in case you want to save it and throw away the paper version. This to me is the money shot, as these PNGs are computer created and therefore not subject to the issues surrounding taking a picture or scanning a paper receipt.
An example of one of these (slightly edited to white out some areas but otherwise exactly as obtained from the app) is here:
https://postimg.cc/image/s56o0wbzf/
You can see that the important part of the text is perfectly aligned in 5 columns and that is ultimately what this question is about. How to get Tesseract to accurately OCR this into text. I have lots of ideas where to take it from here, but it all starts with the OCR!
The closest I have come myself is this example here:
http://pastebin.com/nuZJBVg8
I used -psm 6 and a limiting character set (whitelist) to force it to recognize only uppercase letters, numbers, and a few symbols:
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#()/*#%-.
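(For reference, such an invocation might look roughly like the sketch below; receipt.png and the output base receipt_ocr are placeholder names, and the whitelist line goes into a config file, here called receipt_whitelist, placed where Tesseract looks for configs, typically tessdata/configs. Newer builds may also accept it inline via -c tessedit_char_whitelist=....)
tesseract receipt.png receipt_ocr -psm 6 receipt_whitelist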
At first glance, the OCR seems to almost match. But as you dig deeper you will see that it fails pretty horribly overall. 3s and 8s are almost always wrong. Same with 6s and 5s. Then there are times it just completely skips over characters or just starts to fall apart (like line 31+ in the example). It starts seeing 2s as 1s, or even just missing characters. The SO PIZZA on line 33 should be "2.82" but comes out as "32".
I have tried doing some pre-processing on the image to thicken up the characters and make sure it's pure black and white but none of my efforts got any closer than the raw image from Wal-Mart + the above commands.
Ideally, since this is such a well-structured PNG that is presumably always the same width, I would love it if I could define the columns by pixel widths so that Tesseract would treat each column independently. I tried to research this, but the UZN files I've seen mentioned don't translate for me as far as pixel widths go, and they seem to require a height, which wouldn't work here since the height is always going to be variable.
In addition, I need to figure out how to train Tesseract to recognize the numbers 100% accurately (the letters aren't really important). I started researching how to train the program, but to be honest it got over my head pretty quickly, as the scope of training in the documentation is more about having it recognize entire languages, not just 10 digits.
The ultimate end game solution would be a pipeline chain of commands that took the original PNG from the app and gave me back a CSV with the 5 columns of data from the important part of the receipt. I don't expect that out of this question, but any assistance guiding me towards it would be greatly appreciated! At this point I just don't feel like being whipped by Tesseract once again and so I am determined to find a way to master her!
I ended up fully fleshing this out and am pretty happy with the results, so I thought I would post it in case anyone else ever finds it useful.
I did not have to do any image splitting and instead used a regex since the Wal-mart receipts are so predictable.
I am on Windows, so I created a PowerShell script to run the conversion commands and the regex find & replace:
# -----------------------------------------------------------------
# Script: ParseReceipt.ps1
# Author: Jim Sanders
# Date: 7/27/2015
# Keywords: tesseract OCR ImageMagick CSV
# Comments:
# Used to convert a Wal-mart receipt image to a CSV file
# -----------------------------------------------------------------
param(
[Parameter(Mandatory=$true)] [string]$image
) # end param
# create output and temporary files based on input name
$base = (Get-ChildItem -Filter $image -File).BaseName
$csvOutfile = $base + ".txt"
$upscaleImage = $base + "_150.png"
$ocrFile = $base + "_ocr"
# upscale by 150% to ensure OCR works consistently
convert $image -resize 150% $upscaleImage
# perform the OCR to a temporary file
tesseract $upscaleImage -psm 6 $ocrFile
# column headers for the CSV
$newline = "Description,UPC,Type,Cost,TaxType`n"
$newline | Out-File $csvOutfile
# read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
$lines = Get-Content "$ocrFile.txt"
Foreach ($line in $lines) {
# This wraps the 12 digit UPC code and the price with commas, giving us our 5 columns for CSV
$newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
$newline | Out-File -Append $csvOutfile
}
# clean up temporary files
del $upscaleImage
del "$ocrFile.txt"
The resulting file needs to be opened in Excel and then have the text-to-columns feature run so that it won't ruin the UPC codes by auto-converting them to numbers. This is a well-known problem I won't dive into, but there are a multitude of ways to handle it, and I settled on this slightly more manual one.
I would have been happiest to end up with a simple .csv I could double-click, but I couldn't find a great way to do that without mangling the UPC codes even more, such as by wrapping them in this format:
"=""12345"""
That does work but I wanted the UPC code to be just the digits alone as text in Excel in case I am able to later do a lookup against the Wal-mart API.
Anyway, here is how they look after importing and some quick formatting:
https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png
I still need to do some garbage cleaning on the rows that aren't line items but that all only takes a few seconds so doesn't bother me too much.
Thanks for the nudge in the right direction @RevJohn; I would not have thought to try simply scaling the image, but that made all the difference in the world with Tesseract!
Text recognition on receipts is one of the hardest problems for OCR to handle.
The reasons are numerous:
receipts are printed on cheap paper with cheap printers - to make them cheap, not readable!
they have a very large amount of dense text (especially Wal-Mart receipts)
existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
receipt structure, which is something between tabular and freeform, is hard for any layout engine to handle.
Your best bet is to perform the following:
Analyse the input images. If they are hard to read by eye, they will be hard for Tesseract to read as well.
Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes helps a lot, and so does cleaning up existing noise (see the sketch after this list).
Tesseract training. It's not that hard to do :)
OCR result postprocessing to recover the layout.
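For the preprocessing point, a rough ImageMagick sketch; the 200% scale factor and 60% threshold are guesses that would need tuning per image:
# upscale, convert to grayscale, and binarise to suppress background noise before OCR
convert receipt.png -resize 200% -colorspace Gray -threshold 60% receipt_clean.png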
Layout analysis is best done on the geometry of the results, not with regexes. Regexes have problems when the OCR has errors. Using geometry, for example, you find a good candidate for a UPC number, draw a line through the centers of the characters, and then you know exactly which price belongs to that UPC.
Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.
The company I'm working with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods:
pod try PPBlinkOCR

How to annotate a large file with Mercurial?

I'm trying to find out who made the last change to a certain line in a large XML file (~100,000 lines).
I tried to use the log file viewer of thg, but it does not give me any results.
Running hg annotate file.xml -l 95000 ... took forever and eventually died with an error message.
Is there a way to annotate a single line in a large file that does not take forever?
You can use hg grep to dig into a file if you have interest in very specific text:
hg grep "text-to-search" file.txt
You will likely need to add the --all switch to get every change that matches, and then limit your results to a specific changeset range with the -r firstchange:lastchange syntax.
I don't have a file on hand of the size you are working with, so this may also have trouble past a certain point, particularly if the search string matches many many lines in the file. But if you can get as specific as possible with your search string, you should be able to track it down.
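Putting those together, a sketch (the search string and the revision range are placeholders; on recent Mercurial releases --diff replaces --all):
# search the file's history within the given revision range and print every matching change
hg grep --all -r 10000:tip "text-to-search" file.xml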

gdalwarp too slow (compared to gdal_merge)

I have 70+ raster images in TIFF format that I am trying to merge.
Originals can be found here:
http://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/
After pre-processing (pct2rgb, gdalwarp individual charts, gdal_translate to cut the collars) I try to run them through gdalwarp to mosaic them using a command like this:
gdalwarp --config GDAL_CACHEMAX 3000 -overwrite -wm 3000 -r bilinear -srcnodata 0 -dstnodata 0 -wo "NUM_THREADS=3" /data/aeronav/sec/c/Albuquerque_c.tif .....70 other file names ...master.tif
After 12 hours of processing:
Creating output file that is 321521P x 125647L.
Processing input file /data/aeronav/sec/c/Albuquerque_c.tif.
0...10...20...30...40...
At this rate, gdalwarp is never going to finish.
In contrast, a gdal_merge command like this:
gdal_merge.py -n 0 -a_nodata 0 -o /data/aeronav/sec/master.tif /data/aeronav/sec/c/Albuquerque_c.tif ......70 plus files.....
finishes in a couple of hours.
The problem with gdal_merge is the inferior quality of the output because of its "average" resampling. I would like to use "bilinear" at a minimum, and "cubic" resampling if possible, and for that gdalwarp is required.
Why is there such a big difference in the performance of the two? Why doesn't gdalwarp want to finish? Is there any other command line option to speed things up in gdalwarp, or is there a way to add a resampling option to gdal_merge?
It seems gdalwarp is not the ideal command to merge these GeoTiffs (since I am not interested in warping again). Instead I used
gdalbuildvrt /data/aeronav/sec/master.virt .... 70+ files in order
to build a virtual mosaic. And then I used gdal_translate to convert the virt file into a GeoTiff:
gdal_translate -of GTiff /data/aeronav/sec/master.virt /data/aeronav/sec/master.tif
That's it: this took less than an hour (even faster than gdal_merge), and it preserves the quality of the original files.
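If bilinear or cubic resampling is still wanted (e.g. because the inputs don't all share the same resolution), newer GDAL versions let gdalbuildvrt take a -r option too; a sketch, with the compression creation options included only as an example (and the wildcard standing in for the ordered file list above):
gdalbuildvrt -r bilinear /data/aeronav/sec/master.virt /data/aeronav/sec/c/*.tif
gdal_translate -of GTiff -co TILED=YES -co COMPRESS=DEFLATE /data/aeronav/sec/master.virt /data/aeronav/sec/master.tif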

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regexes, and I'm having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parser that can handle that without failing or taking days. Furthermore, the fact that the text content is largely HTML, but not necessarily valid HTML, means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plain text).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
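On the sort -u worry: as far as I know both GNU and BSD sort spill to temporary files rather than buffering the whole input in memory, so running it as a separate pass afterwards should be safe. A sketch, using the file name from above:
# deduplicate in a second pass; forcing the C locale usually speeds up the comparisons
LC_ALL=C sort -u extrLinks1 -o extrLinks1.uniq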
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex: avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "(src|href)[[:space:]]*=[[:space:]]*[\"'][^\"']+[\"']"
That should be super fast, no?
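And chaining a second, narrower grep onto that, as suggested, might look something like this; the -o flag extracts just the matches, and the http/www filter is only an example:
# first pass pulls out quoted src/href values, second pass keeps only things that look like URLs
head -c 100000 myFile | grep -oE "(src|href)[[:space:]]*=[[:space:]]*[\"'][^\"']+[\"']" | grep -E "https?://|www\."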