Unexpected results when selecting objects using jq - json

When I add the body to the output list, some wrong names get output. I expected it to output only names for the nfl subreddit in both examples. Is this a feature or a bug? How can I output only the tuples for subreddit nfl?
The file:
{"author":"403and780","author_flair_css_class":"NHL-EDM4-sheet1-col01-row17","author_flair_text":"EDM - NHL","body":"Don't get why we do this but can't have a Grey Cup GDT.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn6","is_submitter":false,"link_id":"t3_7v9yqa","parent_id":"t3_7v9yqa","permalink":"/r/hockey/comments/7v9yqa/game_thread_super_bowl_lii_philadelphia_eagles_vs/dtqrsn6/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"hockey","subreddit_id":"t5_2qiel","subreddit_type":"public"}
{"author":"kygiacomo","author_flair_css_class":null,"author_flair_text":null,"body":"lol missed the extra wtf","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn7","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsn7/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"shitpostlord4321","author_flair_css_class":null,"author_flair_text":null,"body":"I really hope we get Bleeding Edge before we get the all new all different armor. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn8","is_submitter":false,"link_id":"t3_7v7whz","parent_id":"t3_7v7whz","permalink":"/r/marvelstudios/comments/7v7whz/a_great_new_look_at_iron_mans_avengers_infinity/dtqrsn8/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"marvelstudios","subreddit_id":"t5_2uii8","subreddit_type":"public"}
{"author":"namohysip","author_flair_css_class":null,"author_flair_text":null,"body":"Maybe. I mostly am just doing this to get a story out, and it\u2019s a huge one, so I\u2019m not sure that I\u2019ll be making another fic for many more months. I guess Pokemon Mystery Dungeon just isn\u2019t as popular with the older demographics of AO3.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn9","is_submitter":true,"link_id":"t3_7v9psr","parent_id":"t1_dtqrm3t","permalink":"/r/FanFiction/comments/7v9psr/how_do_you_deal_with_bad_reviews/dtqrsn9/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"FanFiction","subreddit_id":"t5_2r5kb","subreddit_type":"public"}
{"author":"SDsc0rch","author_flair_css_class":null,"author_flair_text":null,"body":"if it rates an upvote, I'll click it - I'm not gonna click on low quality \nnot gonna apologize for it either ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsna","is_submitter":false,"link_id":"t3_7vaam4","parent_id":"t3_7vaam4","permalink":"/r/The_Donald/comments/7vaam4/daily_reminderif_you_see_any_gray_arrows_on_the/dtqrsna/","retrieved_on":1518931297,"score":4,"stickied":false,"subreddit":"The_Donald","subreddit_id":"t5_38unr","subreddit_type":"public"}
{"author":"scarletcrawford","author_flair_css_class":null,"author_flair_text":null,"body":"Honestly, I wanted Takeshi to stay with Poe, but to each their own ship, I guess.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnb","is_submitter":false,"link_id":"t3_7upyc0","parent_id":"t1_dtppyry","permalink":"/r/alteredcarbon/comments/7upyc0/season_1_series_discussion/dtqrsnb/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"alteredcarbon","subreddit_id":"t5_3bzvp","subreddit_type":"public"}
{"author":"immortalis","author_flair_css_class":"vikings","author_flair_text":"Vikings","body":"The ghost of MN kickers will haunt this game.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnc","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnc/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"KryptoFreak405","author_flair_css_class":"48","author_flair_text":"","body":"His original backstory had him training to be an Imperial officer until a commanding officer ordered him to transport a shipment of slaves. He refused, freed the slaves, one of which was Chewie, and defected to become a smuggler","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnd","is_submitter":false,"link_id":"t3_7vanzc","parent_id":"t1_dtqr5q5","permalink":"/r/StarWars/comments/7vanzc/solo_a_star_wars_story_big_game_tv_spot/dtqrsnd/","retrieved_on":1518931297,"score":1102,"stickied":false,"subreddit":"StarWars","subreddit_id":"t5_2qi4s","subreddit_type":"public"}
{"author":"thwinks","author_flair_css_class":null,"author_flair_text":null,"body":"Oh. TIL","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsne","is_submitter":false,"link_id":"t3_7v8o0z","parent_id":"t1_dtqg97a","permalink":"/r/gifs/comments/7v8o0z/silly_walk_champion/dtqrsne/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"gifs","subreddit_id":"t5_2qt55","subreddit_type":"public"}
{"author":"Mimi108","author_flair_css_class":"lions","author_flair_text":"Lions","body":"The Big. The Dick. The Nick. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnf","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnf/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
Code example 1, which works OK:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .author'
kygiacomo
immortalis
Mimi108
Code example 2, whose output is unexpected to me:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .body, .author'
403and780
lol missed the extra wtf
kygiacomo
shitpostlord4321
namohysip
SDsc0rch
scarletcrawford
The ghost of MN kickers will haunt this game.
immortalis
KryptoFreak405
thwinks
The Big. The Dick. The Nick.
Mimi108
You can see that author 403and780 commented in the hockey subreddit, not nfl, unfortunately.

jq solution:
jq -r 'select(.subreddit == "nfl") as $o | $o.body, $o.author' head_rc.txt
... as $o binds each selected object to the variable $o, so both $o.body and $o.author are read from the same matching object.
The output:
lol missed the extra wtf
kygiacomo
The ghost of MN kickers will haunt this game.
immortalis
The Big. The Dick. The Nick.
Mimi108

I also need to add streaming once I get this syntax correct.
Some good news - you won't need to use the so-called "streaming parser", because your input has already been chopped up. The "streaming parser" is only needed when the input has one or more individually ginormous JSON entities, whereas you have a (long) stream of small JSON objects.
p.s.
As Charles Duffy suggested, the simplest solution to your selection problem is to pipe into a parenthesized group:
jq -r 'select(.subreddit=="nfl") | (.body, .author)' input.json
If CSV or TSV makes sense, then change the parentheses to brackets and tack on @csv or @tsv, e.g.
select(.subreddit=="nfl") | [.body, .author] | @tsv
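For example, run against the sample file above, the TSV variant becomes:
jq -r 'select(.subreddit=="nfl") | [.body, .author] | @tsv' head_rc.txt
which prints one tab-separated body/author line per matching object.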

Related

Is it possible to extract from a map JSON file a list of a city's neighborhoods in a tree-structure format?

Forgive my ignorance, I am not experienced with JSON files. I've been trying to get a tree structure list of all the neighborhoods and locations in the city of Cape Town and this seems to be my last resort.
Unfortunately, I can't even open the file that can be found on this website - http://odp.capetown.gov.za/datasets/official-suburbs?geometry=18.107%2C-34.187%2C19.034%2C-33.988
Could someone tell me if it's possible to extract such a list?
I'd be forever thankful if someone could help me. Thank you in advance.
[I am making my comments an answer since I see no other suggestions and no information provided]
I am on a Unix/Linux shell, but the following tools can also be found for Windows. My solution for getting a quick list would be:
curl https://opendata.arcgis.com/datasets/8ebcd15badfe40a4ab759682aacf8439_75.geojson |\
jq '.features | .[] | .properties.OFC_SBRB_NAME'
Which gives you:
"HYDE PARK"
"SPRINGFIELD"
"NIEUW MAASTRECHT-2"
"CHARLESVILLE"
"WILDWOOD"
"MALIBU VILLAGE"
"TUSCANY GLEN"
"VICTORIA MXENGE"
"KHAYELITSHA"
"CASTLE ROCK"
"MANSFIELD INDUSTRIA"
...
Explanation:
curl https://... - curl downloads the JSON file from the API you are using
jq: can process JSON in the terminal and extract information. I do this in three steps:
.features: the GeoJSON format seems to have a standard schema. All the returned entries are in the features array
.[] returns all elements in the array (docs here)
.properties.OFC_SBRB_NAME: each element of the array has a field called "properties", which from my understanding carries the metadata of the entry. One of those properties is OFC_SBRB_NAME, which looks like a name and is the only string in each element, so that is what I extract (the combined one-liner below does the same in one step).
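The combined one-liner, with -r added so the names come out without the surrounding quotes, would be:
curl -s https://opendata.arcgis.com/datasets/8ebcd15badfe40a4ab759682aacf8439_75.geojson |
  jq -r '.features[].properties.OFC_SBRB_NAME'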
Hope it helps. If you add more detail as to which platform or language you are using, I can update the answer; however, the methodology should remain the same, I think.

searching .CSV file with AWK - only working with first row

I've been trying to search through a specific column of a .csv file to find cells containing a particular word. However, it's only working for the first row (i.e. the headings) in my .csv file.
The file is a series of over 10,000 forum posts, with column 1 as the post key and column 2 as the post text. The headings as below are 'key', 'annotated sentence'.
key,annotated sentence
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
"(595, 0)",So you're actually claiming that it is a lie to say that the UK has a lower gun crime rate than the US? Even if the police were miscounting crimes it's still a huge and unjustified leap in logic to conclude from that that the UK does not have a lower gun crime rate.
"(736, 3)","The anti-abortionists claim a load of **** on many issues. I don't listen to them. To put the ""life"" of an unfertilized egg above that of a person is grotesquely sick IMO. I support any such stem cell research wholeheartedly."
The CSV separator is a comma, and the text delimiter is ".
if I try:
awk -F, '$1 ~ /key/ {print}' posts_file.csv > output_file.csv
it will output the headings row no problem. However, I have tried:
awk -F, '$1 ~ /212/ {print}' posts_file.csv > output_file.csv
awk -F, '$2 ~ /Canada/ {print}' posts_file.csv > output_file.csv
and neither of these works: no matches are found, though there should be. I can't figure out why. Any ideas? Thanks in advance.
awk to the rescue!
In general, complex CSV doesn't work well with awk, but in your case, since the key and the annotated sentence have very distinct value types, you can extend your pattern search to the whole record instead of a single field. The trick is defining the record, which, based on your format, can be done as well. For example:
$ awk -v RS='\n"' '/Canada/{print RT $0}' csv
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading syst
em of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost a
ny claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of havi
ng others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on ove
rturning the 2A."
and this
$ awk -v RS='\n"' '/(212, 2)/{print RT $0}' csv
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
Python's CSV parsing supports your format out of the box.
Below is a simple script that you could call as follows:
# csvfilter <1-basedFieldNdx> <regexStr> < <infile> > <outfile>
csvfilter 1 'key' < posts_file.csv > output_file.csv
csvfilter 1 '212' < posts_file.csv > output_file.csv
csvfilter 2 'Canada' < posts_file.csv > output_file.csv
Sample script csvfilter:
#!/usr/bin/env python
# coding=utf-8
import csv, sys, re
# Assign arguments to variables.
fieldNdx = int(sys.argv[1]) - 1 # Index of field to filter; Python arrays are 0-based!
reStr = sys.argv[2] if (len(sys.argv) > 2) else '' # Filter regex
# Read from stdin...
reader = csv.reader(sys.stdin)
# ... write to stdout.
writer = csv.writer(sys.stdout, reader.dialect)
# Read each line...
for row in reader:
    # Match the target field against the filter regex and
    # print the row only if it matches.
    if (re.search(reStr, row[fieldNdx])):
        writer.writerow(row)
OpenRefine could help with the search.
One way to use awk safely with complex CSV is to use a "csv2tsv" utility to convert the CSV file to a format that can be handled properly by awk.
Usually the TSV ("tab-separated values") format is just right for the job.
(If the final output must be CSV, then either a complementary "tsv2csv" utility can be used, or awk itself can do the job -- though some care may be required to get it exactly right.)
So the pipeline might look like this:
csv2tsv < input.csv | awk -F\\t 'BEGIN{OFS=FS} ....' | tsv2csv
There are several alternatives for csv-to-tsv conversion, ranging from roll-your-own scripts to Excel, but I'd recommend taking the time to check that whichever tool or toolset you select satisfies the "edge case" requirements that are of interest to you.
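For the tsv2csv direction, if the fields are known to contain no embedded tabs or newlines, a minimal awk sketch (quote every field and double any embedded quotes; not a full RFC 4180 implementation) could be:
awk -F'\t' 'BEGIN { OFS = "," }
{
    for (i = 1; i <= NF; i++) {
        gsub(/"/, "\"\"", $i)    # double any embedded double quotes
        $i = "\"" $i "\""        # wrap every field in quotes
    }
    print
}' input.tsv > output.csv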

Tesseract receipt scanning advice needed

I have struggled off and on again with Tesseract for various OCR projects and I found a use case today which I thought would be a slam dunk for it but after many hours I am still coming away unsatisfied. I wanted to pose the problem here and see if anyone else has advice on how to solve this task.
My wife came to me this morning and asked if there was any way she could easily scan her receipts from Wal-Mart and over time build a history of prices spent in categories and on specific items so that we could do some trending and easily deep dive into where the spending is going. At first I felt like this was a very tall order, but after doing some digging I found a few things that make me feel this is within reach:
Wal-Mart receipts are in general, very well structured and easy to read. They even include the UPC for every item (potential for lookups against a UPC database?) and appear to classify food items with an F or I (not sure what the difference is) and have a tax code column as well that may prove useful should I learn the secrets of what the codes mean.
I further discovered that there is some kind of Wal-Mart item lookup API that I may be able to get access to which would prove useful in the UPC lookup.
They have an app for smart phones that lets you scan a QR code printed on every receipt. That app looks up a "TC" code off the receipt and pulls down the entire itemized receipt from their servers. It shows you an excellent graphical representation of the receipt including thumbnail pictures of all the items and the cost, etc. If this app would simply categorize and summarize the receipt, I would be done! But alas, that's not the purpose of the app ....
The final piece of the puzzle is that you can export a computer-generated PNG image of the receipt in case you want to save it and throw away the paper version. This to me is the money shot, as these PNGs are computer-created and therefore not subject to the issues surrounding taking a picture or scanning a paper receipt.
An example of one of these (slightly edited to white out some areas but otherwise exactly as obtained from the app) is here:
https://postimg.cc/image/s56o0wbzf/
You can see that the important part of the text is perfectly aligned in 5 columns, and that is ultimately what this question is about: how to get Tesseract to accurately OCR this into text. I have lots of ideas where to take it from here, but it all starts with the OCR!
The closest I have come myself is this example here:
http://pastebin.com/nuZJBVg8
I used -psm 6 and a character whitelist to force it to recognize only uppercase letters, numbers, and a few symbols:
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#()/*#%-.
At first glance, the OCR seems to almost match. But as you dig deeper you will see that it fails pretty horribly overall. 3s and 8s are almost always wrong. Same with 6s and 5s. Then there are times it just completely skips over characters or just starts to fall apart (like line 31+ in the example). It starts seeing 2s as 1s, or even just missing characters. The SO PIZZA on line 33 should be "2.82" but comes out as "32".
I have tried doing some pre-processing on the image to thicken up the characters and make sure it's pure black and white but none of my efforts got any closer than the raw image from Wal-Mart + the above commands.
Ideally, since this is such a well-structured PNG which is presumably always the same width, I would love to be able to define the columns by pixel widths so that Tesseract would treat each column independently. I tried to research this, but the UZN files I've seen mentioned don't obviously translate to pixel widths, and height seems to be a factor, which wouldn't work here since the height is always going to be variable.
In addition, I need to figure out how to train Tesseract to recognize the numbers 100% accurately (the letters aren't really important). I started researching how to train the program, but to be honest it got over my head pretty quickly, as the scope of training in the documentation is more about having it recognize entire languages, not just 10 digits.
The ultimate end game solution would be a pipeline chain of commands that took the original PNG from the app and gave me back a CSV with the 5 columns of data from the important part of the receipt. I don't expect that out of this question, but any assistance guiding me towards it would be greatly appreciated! At this point I just don't feel like being whipped by Tesseract once again and so I am determined to find a way to master her!
I ended up fully fleshing this out and am pretty happy with the results, so I thought I would post it in case anyone else ever finds it useful.
I did not have to do any image splitting and instead used a regex since the Wal-mart receipts are so predictable.
I am on Windows, so I created a PowerShell script to run the conversion commands and the regex find & replace:
# -----------------------------------------------------------------
# Script: ParseReceipt.ps1
# Author: Jim Sanders
# Date: 7/27/2015
# Keywords: tesseract OCR ImageMagick CSV
# Comments:
# Used to convert a Wal-mart receipt image to a CSV file
# -----------------------------------------------------------------
param(
[Parameter(Mandatory=$true)] [string]$image
) # end param
# create output and temporary files based on input name
$base = (Get-ChildItem -Filter $image -File).BaseName
$csvOutfile = $base + ".txt"
$upscaleImage = $base + "_150.png"
$ocrFile = $base + "_ocr"
# upscale by 150% to ensure OCR works consistently
convert $image -resize 150% $upscaleImage
# perform the OCR to a temporary file
tesseract $upscaleImage -psm 6 $ocrFile
# column headers for the CSV
$newline = "Description,UPC,Type,Cost,TaxType`n"
$newline | Out-File $csvOutfile
# read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
$lines = Get-Content "$ocrFile.txt"
Foreach ($line in $lines) {
    # This wraps the 12 digit UPC code and the price with commas, giving us our 5 columns for CSV
    $newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
    $newline | Out-File -Append $csvOutfile
}
# clean up temporary files
del $upscaleImage
del "$ocrFile.txt"
The resulting file needs to be opened in Excel and then have the text-to-columns feature run so that it won't ruin the UPC codes by auto-converting them to numbers. This is a well-known problem I won't dive into, but there are a multitude of ways to handle it, and I settled on this slightly more manual one.
I would have been happiest to end up with a simple .csv I could double-click, but I couldn't find a great way to do that without mangling the UPC codes even more, e.g. by wrapping them in this format:
"=""12345"""
That does work but I wanted the UPC code to be just the digits alone as text in Excel in case I am able to later do a lookup against the Wal-mart API.
Anyway, here is how they look after importing and some quick formatting:
https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png
I still need to do some garbage cleaning on the rows that aren't line items, but that only takes a few seconds, so it doesn't bother me too much.
Thanks for the nudge in the right direction @RevJohn; I would not have thought to try simply scaling the image, but that made all the difference in the world with Tesseract!
Text recognition on receipts is one of the hardest problems for OCR to handle.
The reasons are numerous:
receipts are printed on cheap paper with cheap printers - to make them cheap, not readable!
they have a very large amount of dense text (especially Wal-Mart receipts)
existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
receipt structure, which is something between tabular and freeform, is hard for any layout-analysis engine to handle.
Your best bet is to perform the following:
Analyse the input images. If they are hard to read by eye, they will be hard for Tesseract as well.
Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes helps a lot. Cleaning up existing noise also helps.
Tesseract training. It's not that hard to do :)
OCR result postprocessing to reconstruct the layout.
Layout reconstruction is best performed by analysing the geometry of the results, not by regexes. Regexes run into problems when the OCR has errors. Using geometry, for example, you find a good candidate for a UPC number, draw a line through the centers of its characters, and then you know exactly which price belongs to that UPC.
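To make that concrete under some assumptions: newer Tesseract builds (3.05 and later) ship a tsv output config that emits one row per recognized word with its bounding box, which gives you geometry to group on. A rough shell sketch (receipt.png is a placeholder name; the 20-pixel bucket size is an arbitrary heuristic that would need tuning to the receipt's line spacing, and this is not how any commercial engine does it):
# Word-level geometry: columns are level, page_num, block_num, par_num,
# line_num, word_num, left, top, width, height, conf, text.
tesseract receipt.png stdout tsv > words.tsv
# Group words whose "top" coordinate falls in the same bucket, so words on the
# same printed line end up on one output line, then sort the lines top-to-bottom.
awk -F'\t' 'NR > 1 && $12 != "" {
    key = int($8 / 20)
    line[key] = line[key] " " $12
} END {
    for (k in line) printf "%d\t%s\n", k, line[k]
}' words.tsv | sort -n | cut -f2-
From there, any 12-digit token on a line is a UPC candidate and the rightmost numeric token is a price candidate.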
Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.
The company I'm working with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods:
pod try PPBlinkOCR

How to add a header to CSV export in jq?

I'm taking a modified command from the jq tutorial:
curl 'https://api.github.com/repos/stedolan/jq/commits?per_page=5' \
| jq -r -c '.[] | {message: .commit.message, name: .commit.committer.name} | [.[]] | @csv'
which does the CSV export well, but is missing the headers at the top:
"Fix README","Nicolas Williams"
"README: send questions to SO and Freenode","Nicolas Williams"
"usage() should check fprintf() result (fix #771)","Nicolas Williams"
"Use jv_mem_alloc() in compile.c (fix #771)","Nicolas Williams"
"Fix header guards (fix #770)","Nicolas Williams"
How can I add the header (in this case message,name) at the top? (I know it's possible manually, but how do I do it within jq?)
Just add the header text in an array in front of the values.
["Commit Message","Committer Name"], (.[].commit | [.message,.committer.name]) | #csv
Based on Anton's comments on Jeff Mercado's answer, this snippet will get the key names of the properties of the first element and output them as an array before the rows, thus using them as headers. If different rows have different properties, then it won't work well; then again, neither would the resulting CSV.
map({message: .commit.message, name: .commit.committer.name}) | (.[0] | to_entries | map(.key)), (.[] | [.[]]) | @csv
While I fully realize OP was looking for a purely jq answer, I found this question looking for any answer. So, let me offer one I found (and found useful) to others like me.
sudo apt install moreutils - if you don't have them yet. Moreutils website.
echo "Any, column, name, that, is, not, in, your, json, object" | cat - your.csv | sponge your.csv
Disadvantages: requires the moreutils package and is not jq-only, so some would understandably call it less elegant.
Advantages: you choose your headers, not your JSON keys. Also, pure-jq approaches can be tripped up by the sorting of the keys, depending on your jq version.
How does it work?
echo outputs your header
cat - takes the echo output from stdin (because of the -) and conCATenates it with your csv file
sponge waits until that is done and writes the result to same file, overwriting it.
But you could do it with tee without having to install any packages!
No, you could not, as Kos excellently demonstrates here. Not unless you're fine with losing your csv at some point.

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parsers that can handle that without failing or taking days. Furthermore, the fact that the text content is largely HTML but not necessarily valid HTML means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plain text).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex to avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example:
head -c 100000 myFile | grep -E "((src)|(href))\b*=\b*[\"'][\w://\.]+[\"']"
That should be super fast, no?