Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML, but it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size.
I don't know of any parser that can handle a file that size without failing, or without taking days. Furthermore, the fact that the text content is largely HTML, but not necessarily valid HTML, means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plaintext).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in regex engine performance? I'm using macOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.

I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
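For the sort -u step, note that GNU sort does an external merge sort, spilling to temporary files rather than holding the whole input in RAM, so it should cope with a file this size; you can also cap its memory buffer explicitly. A minimal sketch, assuming GNU coreutils (the buffer size and temp directory are illustrative):
sort -u -S 512M -T /var/tmp extrLinks1 > extrLinks1.uniq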

I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex: avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "(src|href)[[:space:]]*=[[:space:]]*[\"'][\w:/.]+[\"']"
That should be super fast, no?
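One more trick worth trying: forcing the C locale often speeds grep up considerably on large files, because it skips multibyte character handling. A hedged variation on the same idea (same illustrative file name as above):
head -c 100000 myFile | LC_ALL=C grep -E "(src|href)[[:space:]]*="
If the sample output looks right, drop the head and run it over the full file.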

Related

Formatting wide output via 'column' (or similar) command(s)

This question actually asks for the 'inverse' of the solution there; namely, I would like to wrap the long column (column 4) onto multiple lines. In effect, the output should look like:
cat test.csv | column -s"," -t -c5
col1  col2  col3  col4             col5
1     2     3     longLineOfText   5
                  ThatIWantTo
                  InspectAndWould
                  LikeToWrap
(excuse the UUOC, i.e. the useless use of cat, duplicated over here :) )
The solution would ideally:
make use of standard *nix text processing utilities (e.g. column, paste, pr, which are usually present on any modern Linux machine nowadays, typically coming from the coreutils package);
avoid jq as it is not necessarily present on every (production) system;
not overheat the brain: yes... I am looking mainly at you, awk & co. gurus :). "Normal" awk / perl / sed is fine.
as a special bonus, a solution using vim would be even more welcome (again, no brain smoke please), since that would allow for syntax coloring as well.
The background: I want to be able to make sense of the output of docker history, so as a last resort even some Go template magic would suit, as would using jq.
In extreme cases, downloading a new utility onto the server (preferably self-contained / statically linked) is OK, if the benefits of ease of remembering and use outweigh that inconvenience; so is using JSON processing commands (in which case Python's json module would be preferred).
Thanks !
LE:
Please keep in mind that docker's output has its columns separated by several spaces, which unfortunately confuses most commands :(
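For what it's worth, here is a minimal awk sketch of the wrapping itself, assuming the simple comma-separated test.csv above (the wrap width of 15 and the column widths are illustrative assumptions, and it does not yet handle docker's multi-space separators):
awk -F',' '{
  width = 15
  n = length($4); if (n == 0) n = 1
  # re-print the row once per 15-character chunk of column 4,
  # blanking the other columns on continuation lines
  for (i = 1; i <= n; i += width) {
    chunk = substr($4, i, width)
    if (i == 1)
      printf "%-5s %-5s %-5s %-15s %-5s\n", $1, $2, $3, chunk, $5
    else
      printf "%-5s %-5s %-5s %-15s %-5s\n", "", "", "", chunk, ""
  }
}' test.csv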

Extracting CREATE TABLE definitions from MySQL dump?

I have a MySQL dump file over 1 terabyte in size. I need to extract the CREATE TABLE statements from it so I can provide the table definitions.
I purchased Hex Editor Neo, but I'm kind of disappointed I did. I created a regex, CREATE\s+TABLE(.|\s)*?(?=ENGINE=InnoDB), to extract the CREATE TABLE clauses, and that seems to work well when testing in Notepad++.
However, the ETA for extracting all instances is over 3 hours, and I cannot even be sure it is doing so correctly. I don't even know if those lines can be exported when done.
Is there a quick way I can do this on my Ubuntu box using grep or something?
UPDATE
Ran this overnight and the output file came back blank. I created a smaller subset of the data and the procedure is still not working. It works in regex testers, but grep is not liking it and yields empty output. Here is the command I'm running. I'd provide the sample, but I don't want to breach confidentiality for my client. It's just a standard MySQL dump.
grep -oP "CREATE\s+TABLE(.|\s)+?(?=ENGINE=InnoDB)" test.txt > plates_schema.txt
UPDATE
It seems to not match across newlines right after the CREATE\s+TABLE part.
You can use Perl for this task... this should be really fast.
Perl's .. (range) operator is stateful - it remembers state between evaluations.
What this means is: if your table definition starts with CREATE TABLE and ends with something like ENGINE=InnoDB DEFAULT CHARSET=utf8;, then the one-liner below will do what you want.
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' INPUT_FILE.sql > OUTPUT_FILE.sql
EDIT:
Since you are working with a really large file and would probably like to know the progress, pv can give you this also:
pv INPUT_FILE.sql | perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' > OUTPUT_FILE.sql
This will show you a progress bar, speed, and ETA.
You can use the following:
grep -ioP "^CREATE\s+TABLE[\s\S]*?(?=ENGINE=InnoDB)" file.txt > output.txt
If you can run mysqldump again, simply add --no-data.
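For reference, a hedged example of that invocation (the user and database names are placeholders):
mysqldump --no-data -u USER -p DATABASE > schema.sql
The --no-data flag dumps only the table definitions, with no INSERT statements.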
Got it! grep does not support matching across multiple lines. I found this question helpful and ended up using pcregrep instead.
pcregrep -M "CREATE\s+TABLE(.|\n|\s)+?(?=ENGINE=InnoDB)" test.txt > plates.schema.txt

mysql startup shell problems

Since I've met so many startup errors, I decided to analyze the MySQL startup shell script, but there are some code fragments I can't understand clearly.
version:
mysql Ver 14.14 Distrib 5.5.43, for osx10.8 (i386) using readline 5.1
368 #
369 # First, try to find BASEDIR and ledir (where mysqld is)
370 #
372 if echo '/usr/local/mysql/share' | grep '^/usr/local/mysql' > /dev/null
373 then
374 relpkgdata=`echo '/usr/local/mysql/share' | sed -e 's,^/usr/local/mysql,,' -e 's,^/,,' -e 's,^,./,'`
375 else
376 # pkgdatadir is not relative to prefix
377 relpkgdata='/usr/local/mysql/share'
378 fi
What's the purpose of line 372? It seems a little weird.
Any help will be appreciated.
At first glance, this is very strange indeed... but here's a solution to this mystery.
372: if echo '/usr/local/mysql/share' | grep '^/usr/local/mysql' > /dev/null
373: then
grep returns true if it matches and false if it doesn't, so this is testing whether the string /usr/local/mysql/share begins with (^) /usr/local/mysql. Output goes to /dev/null because we don't need to see it; we only want grep's exit status.
"Well," you interject, "that's obvious enough. The question is why?" Stick with me.
If it matches:
374: relpkgdata=`echo '/usr/local/mysql/share' | sed -e 's,^/usr/local/mysql,,' -e 's,^/,,' -e 's,^,./,'`
Beginning with /usr/local/mysql/share, strip off the leading /usr/local/mysql, then strip off the leading /, then prepend ./.
So /usr/local/mysql/share becomes ./share.
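You can watch this happen by re-running the snippet's own pipeline at a shell prompt, which prints ./share:
echo '/usr/local/mysql/share' | sed -e 's,^/usr/local/mysql,,' -e 's,^/,,' -e 's,^,./,'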
Otherwise, use the string /usr/local/mysql/share.
375: else
376: # pkgdatadir is not relative to prefix
377: relpkgdata='/usr/local/mysql/share'
"That's all fine, too," I hear you say, "but why go through all these gyrations to (apparently) compare and massage two fixed literal strings?? We already know the answer, so what's up with all the tests and substitution?"
It's a fair question.
My first suspicion was that there was some sort of magic bash hackery going on that I didn't recognize, but no, this code is really all too simple to be something along those lines.
My second suspicion, since this is notably absent from MySQL 5.0.96 (which I am not running but keep on hand for reference), was that this was an abandoned attempt to introduce some new magical behavior into mysqld_safe which was never finished and replaced with actual variables, the testing and massaging of which would have made a lot more sense than doing the same thing to literal strings.
But, no. When the only tool you have is a hammer, everything looks like a nail. What this is, is an example of doing something simple... the hard way. At least that's what it looks like to me. There actually is a somewhat rational explanation. To find the answer, you have to look at the source code (not binary) distribution.
MySQL has a lot of "hard-coded" defaults. This turns out to be an example of these.
In the source file scripts/mysqld_safe.sh, the snippet above looks very different:
if echo '#pkgdatadir#' | grep '^#prefix#' > /dev/null
then
relpkgdata=`echo '#pkgdatadir#' | sed -e 's,^#prefix#,,' -e 's,^/,,' -e 's,^,./,'`
else
# pkgdatadir is not relative to prefix
relpkgdata='#pkgdatadir#'
fi
Ah, source munging. Pattern substitution.
When you're compiling MySQL from source, the file scripts/Makefile contains instructions that use sed to replace things like #prefix# and #pkgdatadir# with the literal values. The same thing, of course, happens when Oracle or the Linux distro maintainers compile their binary distribution from source. These paths get hard-coded into many, many, many places in the code, including this script... resulting in the otherwise incomprehensible comparison of two literal strings that somebody should already have known the answer to.
Instead of testing at build time whether one path is an anchored substring of the other (and, if so, expressing the relpkgdata value relative to the current directory and modifying this script accordingly), that logical test is deferred until runtime, comparing two literals that were substituted in for their placeholders at build time.
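To illustrate, the build-time munging amounts to something like this (a hypothetical sketch; the actual Makefile rules differ in detail):
sed -e 's,#prefix#,/usr/local/mysql,g' -e 's,#pkgdatadir#,/usr/local/mysql/share,g' scripts/mysqld_safe.sh > scripts/mysqld_safe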
I've gone to this amount of detail, not because it will help you troubleshoot, because I suspect it won't. It was, however, just bizarre enough to warrant some further investigation.
If you are having difficulty getting MySQL Server running... well, you shouldn't be, because it's a well-established system and it should work. If /bin/sh on your system isn't a symlink to /bin/bash, you might want to change mysqld_safe's shebang line from #!/bin/sh to #!/bin/bash, but beyond that, I suspect you are sniffing down the wrong rabbit hole by looking at mysqld_safe to get to the bottom of your issue. As convoluted as mysqld_safe is, it can't be said that it isn't time-tested. As they say, "the problem is somewhere else."
If I may, I'll suggest that you familiarize yourself with some of our other communities where you're likely to find the answer you need, particularly Ask Ubuntu, Super User, Server Fault, and Database Administrators. Familiarize yourself with each site's community, scope, and the level of existing expertise that each community expects on the part of those who ask questions there, and search the sites for the specific problem you're encountering. It's very likely someone has seen it and we've fixed it on one of them, if not here on SO.

Parse ClamAV logs in Bash script using Regex to insert in MySQL

Morning/Evening all,
I've got a problem where I'm making a script for work that uses ClamAV to scan for malware and then place its results in MySQL, by taking the resultant ClamAV logs and using grep with awk to convert the right parts of the log into variables. The problem I have is that whilst I have done the summary OK, the syntax of the detection lines makes it slightly more difficult. I'm no expert at regex by any means and this is a bit of a learning experience, so there is probably a far better way of doing it than I have!
The lines I'm trying to parse looks like these:
/net/nas/vol0/home/recep/SG4rt.exe: Worm.SomeFool.P FOUND
/net/nas/vol0/home/recep/SG4rt.exe: moved to '/srv/clamav/quarantine/SG4rt.exe'
As far as I was able to establish, I need a positive lookbehind to match what comes before and after the colon, without actually matching the colon or the space after it, and I can't see a clear way of doing that in RegExr without it thinking I'm trying to look for two colons. To make matters worse, we sometimes get these too...
WARNING: Can't open file /net/nas/vol0/home/laser/samples/sample1.avi: Permission denied
The end result is that I can build a MySQL query that inserts the path, the malware found, and where it was moved to; or, if there was an error, the path and the error encountered, so that each element can be converted to a variable's contents in a while loop.
I've done the scan summary as follows:
The summary looks like this:
----------- SCAN SUMMARY -----------
Known viruses: 329
Engine version: 0.97.1
Scanned directories: 17350
Scanned files: 50342
Infected files: 3
Total errors: 1
Data scanned: 15551.73 MB
Data read: 16382.67 MB (ratio 0.95:1)
Time: 3765.236 sec (62 m 45 s)
Parsing like this:
SCANNED_DIRS=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned directories" | awk '{gsub("Scanned directories: ", "");print}')
SCANNED_FILES=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned files" | awk '{gsub("Scanned files: ", "");print}')
INFECTED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Infected files" | awk '{gsub("Infected files: ", "");print}')
DATA_SCANNED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data scanned" | awk '{gsub("Data scanned: ", "");print}')
DATA_READ=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data read" | awk '{gsub("Data read: ", "");print}')
TIME_TAKEN=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Time" | awk '{gsub("Time: ", "");print}')
END_TIME=$(date +%s)
mysql -u scanner_parser --password=removed sc_live -e "INSERT INTO bs.live.bs_jobstat VALUES (NULL, '$CURRTIME', '$PID', '$IY', '$SCANNED_DIRS', '$SCANNED_FILES', '$INFECTED', '$DATA_SCANNED', '$DATA_READ', '$TIME_TAKEN', '$END_TIME');"
rm -f /srv/clamav/$IY-scan-$LOGTIME.log
Some of those variables are from other parts of the script and can be ignored. The reason I'm doing this is to save logfile clutter and have a simple web based overview of the status of the system.
Any clues? Am I going about all this the wrong way? Thanks for help in advance, I do appreciate it!
From what I can determine from the question, it seems like you are asking how to distinguish the lines you want from the logger lines that start with WARNING, ERROR, or INFO.
You can do this without getting too fancy with lookahead or lookbehind. Just grep for lines beginning with
"/net/nas/vol0/home/recep/SG4rt.exe: "
then using awk you can extract the remainder of the line. Or you can gsub the prefix out like you are doing in the summary processing section.
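For instance, keying on the trailing FOUND marker instead of a specific path, a minimal awk sketch (scan.log is a placeholder, and it assumes the detection lines look exactly like the samples above, with no ': ' inside the path):
awk -F': ' '/ FOUND$/ {
  path = $1                # everything before the first ": "
  sig = $2
  sub(/ FOUND$/, "", sig)  # drop the trailing " FOUND" marker
  print path "\t" sig
}' scan.log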
As far as the question about processing the summary goes, what strikes me most is that you are processing the entire file multiple times, each time pulling out one kind of line. For tasks like this, I would use Perl, Ruby, or Python and make one pass through the file, collecting the pieces of each line after the colon, storing them in regular programming language variables (not env variables), and forming the MySQL insert string using interpolation.
Bash is great for some things but IMHO you are justified in using a more general scripting language (Perl, Python, Ruby come to mind).
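That said, if you'd rather stay in bash, a single awk pass can still collect the summary fields in one go; a minimal sketch, assuming the log path from the question and the exact labels shown in the summary:
IFS=$'\t' read -r SCANNED_DIRS SCANNED_FILES INFECTED DATA_SCANNED < <(
  awk -F': ' '
    /^Scanned directories/ { d = $2 }
    /^Scanned files/       { f = $2 }
    /^Infected files/      { i = $2 }
    /^Data scanned/        { s = $2 }
    END { printf "%s\t%s\t%s\t%s\n", d, f, i, s }
  ' "/srv/clamav/$IY-scan-$LOGTIME.log"
)
This reads the file once and leaves the values in shell variables, ready for the same INSERT as before.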

FreeBSD, MySQL, Perl, bash: intermittent blocking on named pipes?

This is weird and I'm not sure who the culprit really is.
I'm doing some scripting on FreeBSD (6.2?) which makes extensive use of the following bashism:
do_something <(mysql --skip-column-names -B -e 'select ... from ... where ...;')
... where do_something is a somewhat crufty utility (in Perl) that won't read from a pipeline. If I use a regular file, it works fine. My bash scripts using things like exec 4< <(...) with these sorts of queries (followed by loops of the form while read x y z <&4; do ...) never seem to have any issues.
However, Perl (5.8.x) seems to periodically block (apparently forever). I tried replacing the chomp(my $data = <MYDATA>); with a routine that used sysread, and I wrote some test cases in Python for comparison. These seem to block far less often than the idiomatic Perl code, but they still do it sometimes. (The Python code using f.read() or os.read(f.fileno()...) seems to behave about equally in this regard.)
I've tried reproducing the issue using ... <(cat ...) (where I'm cat-ing the regular file) and that never seems to reproduce the stall.
I've glanced at some ktrace/kdump data ... but I'm far more familiar with Linux strace or even Solaris truss ... so I haven't figured out what's going from there yet, either.
I suppose we can mostly rule out Perl, because I've reproduced the same issue using Python... I don't see how bash could be doing anything wrong here (it's just creating a named pipe in /var/tmp/sh-np-xxx and wiring the processes up to it).
What could the mysql shell/utility be doing that might cause this? I don't think I've seen it from anything else (such as cat or dd). I haven't tested this scenario under Linux ... but I've used <(...) (process substitution) for years under Linux and don't recall ever seeing this.
Is it a FreeBSD issue?
Sure I can work around the issue using temporary files ... but I'd sure rather understand why it's doing this (and avoid some of the races and clean-up messiness that temporary files entail).
Any suggestions?
The big difference between operating on the output of mysql and directly on a file is timing. When the perl process is stalled, the big question is: "why is it not making forward progress?" You can use the "l" option to ps to see the wait channel for the perl process; that way you can see whether it is blocked on a read, or whether something else is going on. If it is really blocked on pipe input, I expect the MWCHAN entry for perl to be "piperd".
The same information would be interesting for the mysql process.
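On FreeBSD that check might look like this (the egrep pattern is illustrative; matching MWCHAN keeps the header line):
ps -axl | egrep 'MWCHAN|perl|mysql'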
What does your Python test code look like?
Another way of writing this while avoiding the bashism is this; that would allow you to rule out bash:
mysql --skip-column-names -B -e 'select ... from ... where ...;' | do_something /dev/stdin
Other interesting questions:
Does the --unbuffered option to mysql change anything?
Does piping the mysql output through dd change anything? (e.g. perlscript <(mysql ... | dd))
Summary: Need more information.