Formatting wide output via 'column' (or similar) command(s) - json

This question actually asks for the 'inverse' of the solution in the one here: I would like to wrap the long column (column 4) onto multiple lines. In effect, the output should look like:
cat test.csv | column -s"," -t -c5
col1  col2  col3  col4             col5
1     2     3     longLineOfText   5
                  ThatIWantTo
                  InspectAndWould
                  LikeToWrap
(excuse the u.u.o.c. duplicated over here :) )
The solution would ideally:
make use of standard *nix text-processing utilities (e.g. column, paste, pr, which are usually present on any modern Linux machine nowadays, typically coming from the coreutils package);
avoid jq, as it is not necessarily present on every (production) system;
not overheat the brain: yes... I am looking mainly at you, awk & co. gurus :). "Normal" awk / perl / sed is fine.
as a special bonus, a solution using vim would be even more welcome (again, no brain smoke please), since that would allow for syntax-coloring as well.
The background: I want to be able to make sense of the output of docker history, so as a last resort even some Go Template-magic would suit, as would using jq.
In extreme cases, if the benefit of being easy to remember and use outweighs the inconvenience, downloading a new utility (preferably self-contained / statically linked) onto the server is OK, as is using JSON-processing commands (in which case Python's json module would be preferred).
Thanks!
Later edit:
Please keep in mind that docker's output has its columns separated by several spaces, which unfortunately confuses most commands :(
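For what it's worth, here is a rough sketch of one way to do the wrapping with awk alone, wrapping column 4 of the comma-separated sample at an arbitrary width of 15 characters; for docker's output you would swap the field separator for -F'  +' (runs of two or more spaces) and pick the right field number:
awk -F',' -v OFS='  ' '{
    w = 15                              # wrap width for the long column (arbitrary)
    rest = substr($4, w + 1)            # whatever does not fit on the first line
    $4 = substr($4, 1, w)               # keep the first slice in place
    print
    while (rest != "") {                # print the remainder, indented, w chars at a time
        print "                  " substr(rest, 1, w)
        rest = substr(rest, w + 1)
    }
}' test.csv
Reassigning $4 makes awk rebuild the record with the two-space OFS, so the alignment is only approximate; this is a starting point rather than a polished solution.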

Related

Trying to get `sed` to fix a phpBB board - but I cannot get my regexp to work

I have an ancient phpBB3 board which has gone through several updates over its 15+ years of existence. Sometimes, in the distant past, such updates would partially fail, leaving all sorts of 'garbage' in the BBCode. I'm now trying to do a 'simple' regexp to match a particular issue and fix it.
What happened was the following... during a database update, long long ago, BBCode tags were, for some reason, 'tagged' with a pseudo-attribute — allegedly for the database updating script to figure out each token that required updating, I guess. This attribute was always an 8-char-long alphanumeric string, 'appended' to the actual BBCode with a colon, like this:
[I]something in italic[/I]
...
[I:i9o7y3ew]something in italic[/I:i9o7y3ew]
Naturally, phpBB doesn't recognise this as valid BBCode, and just prints the whole text out.
The replacement regexp is actually very basic:
s/\[(\/?)(.+):[[:alnum:]]{0,8}\]/[\1\2]/gim
You can see a working example on regex110.com (where capture groups use $1 instead of \1). The example given there includes a few examples from the actual database itself. [i] is actually the simplest case; there are plenty of others which are perfectly valid but a bit more complex, thus requiring a (.+) matcher, such as [quote=\"Gwyneth Llewelyn\":2m80kuso].
As you can see from the example on regex110.com, this works :-)
Why doesn't it work under (GNU) sed? I'm using version 4.8 under Linux:
$ sed -i.bak -E "s/\[(\/?)(.+):[[:alnum:]]+\]/[\1\2]/gim" table.sql
Just for the sake of the argument, I tried using [A-Za-z0-9]+ instead of [[:alnum:]]+; I've even tried (.+) (to capture the group and then just discard it)
None produced an error; none did any replacements whatsoever.
I understand that there are many different regexp engines out there (PCRE, PCRE2, Boost, and so forth), so perhaps sed is using a syntax that is inconsistent with what I'm expecting...?
Rationale: well, I could have done this differently; after all, MySQL has built-in regexp replacements, too. However, since this particular table is so big, it takes eternities. I thought I'd be far better off dumping everything to a text file, doing the replacements there, and importing the table again. There is a catch, though: the file is 95 MBytes in size, which means that most tools I've got (e.g. editors with built-in regexp search & replace) will fail on such a huge file. One notable exception is good old emacs, which has no trouble with files that large. Alas, emacs cannot match anything either, so I thought I'd give sed a try (it should be faster, too). sed also takes close to a minute or so to process the whole file — about the same as emacs, in fact — and has the same result, i.e. no replacements are being made. It seems to me that, although the underlying technology is so different (pure C vs. Emacs Lisp), both these tools somehow rely on similar algorithms... both of which fail.
My understanding is that some libraries use different conventions to signal literal vs. metacharacters and quantifiers. Here is an example from an instruction manual for vim: http://www.vimregex.com/#compare
Indeed, contemporary versions of sed seem to be able to handle two different kinds of conventions (thus the -E flag). The issue I have with my regexp is that I find it very difficult to figure out which convention to apply. Let's start with what I'm used to from PHP, Go, JavaScript and a plethora of other regexp implementations, which use the convention that metacharacters & quantifiers do not get backslashed (while literals do).
Thus, \[(\/?)(.+):[[:alnum:]]+\] presumes that there are a few literal matches for [, ], /, and only these few cases require backslashes.
Using the reverse convention — i.e. literals do not get backslashed, while metacharacters and some quantifiers do — this would be written as:
[\(/\?\)\(\.\+\):\[\[:alnum:\]\]\+]
Or so I would think.
Sadly, sed also rejects this with an error — and so do vim and emacs, BTW (they seem to use a similar regexp library, or perhaps even the same one).
So what is the correct way to write my regexp so that sed accepts it (and does what I intend it to do)?
UPDATE
I have since learned that, in the database, phpBB, contrary to what I assumed, does not store BBCode (!) but rather a variant of HTML (some tags are the same, some are invented on the spot). What happens is that BBCode gets translated into that pseudo-HTML, and back again when displaying; that, at least, explains why phpBB extensions such as Markdown for phpBB — but also BBCode add-ons! — can so easily replace, partially or even totally, whatever is in the database, which will continue to work (to a degree!) even if those extensions get deactivated: the parsed BBCode/Markdown is just converted to this 'special' styling in the database and, as such, will always be rendered correctly by phpBB3, no matter what.
In other words, fixing those 'broken' phpBB tags requires a bit more processing, and not merely search & replace with a single regexp.
Nevertheless, my question is still pertinent to me. I'm not really an expert with regexps but I know the basics — enough to make my life so much easier! — and it's always good to understand the different 'dialects' used by different platforms.
Notably, instead of using egrep and/or grep -E, I'm fond of using ugrep instead. It uses PCRE2 expressions (with the Boost library), and maybe that's the issue I'm having with the sed engine(s) — the different engines speak different regular expression dialects, and converting from one grep variant to a different one might not be useful at all (because some options will not 'translate' well enough)...
Using sed
(\[[^:]*) - capture everything from an opening bracket up to, but not including, the next colon; this group can later be recalled with the back reference \1
[^]]* - match (and, being outside the group, discard) everything else up to but not including the next closing bracket
$ sed -E 's/(\[[^:]*)[^]]*/\1/g' table.sql
[I]something in italic[/I]
...
[I]something in italic[/I]
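For completeness, the same pattern can also be written in sed's default BRE dialect, where grouping and the +/? quantifiers are the backslashed ones and the opening bracket is the literal that needs escaping; a rough sketch, assuming GNU sed (whose BRE accepts \? and \+ as extensions) and using | as the delimiter to avoid escaping the slash:
sed 's|\[\(/\?\)\(.\+\):[[:alnum:]]\+]|[\1\2]|g' table.sql
It carries the same greedy-matching caveat as the ERE version, so it is only meant to illustrate the dialect difference, not to fix the original problem.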

Extracting CREATE TABLE definitions from MySQL dump?

I have a MySQL dump file over 1 terabyte big. I need to extract the CREATE TABLE statements from it so I can provide the table definitions.
I purchased Hex Editor Neo but I'm kind of disappointed I did. I created a regex, CREATE\s+TABLE(.|\s)*?(?=ENGINE=InnoDB), to extract the CREATE TABLE clause, and that seems to work well when tested in Notepad++.
However, the ETA of extracting all instances is over 3 hours, and I cannot even be sure that it is doing it correctly. I don't even know if those lines can be exported when done.
Is there a quick way I can do this on my Ubuntu box using grep or something?
UPDATE
Ran this overnight and the output file came out blank. I created a smaller subset of the data and the procedure is still not working. It works in regex testers, however, but grep is not liking it and yields an empty output. Here is the command I'm running. I'd provide the sample but I don't want to breach confidentiality for my client. It's just a standard MySQL dump.
grep -oP "CREATE\s+TABLE(.|\s)+?(?=ENGINE=InnoDB)" test.txt > plates_schema.txt
UPDATE
It seems to not match on new lines right after the CREATE\s+TABLE part.
You can use Perl for this task... this should be really fast.
Perl's .. (range) operator is stateful - it remembers state between evaluations.
What this means is: if your table definition starts with CREATE TABLE and ends with something like ENGINE=InnoDB DEFAULT CHARSET=utf8; then the command below will do what you want.
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' INPUT_FILE.sql > OUTPUT_FILE.sql
EDIT:
Since you are working with a really large file and would probably like to know the progress, pv can give you this also:
pv INPUT_FILE.sql | perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' > OUTPUT_FILE.sql
This will show you a progress bar, speed, and ETA.
You can use the following:
grep -ioP "^CREATE\s+TABLE[\s\S]*?(?=ENGINE=InnoDB)" file.txt > output.txt
If you can run mysqldump again, simply add --no-data.
Got it! grep does not support matching across multiple lines. I found this question helpful and I ended up using pcregrep instead.
pcregrep -M "CREATE\s+TABLE(.|\n|\s)+?(?=ENGINE=InnoDB)" test.txt > plates.schema.txt
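If neither perl nor pcregrep is at hand, an awk range pattern gives much the same result; a minimal sketch (dump.sql and schema.sql are just placeholder names):
awk '/CREATE TABLE/,/ENGINE=InnoDB/' dump.sql > schema.sql
Like the perl range operator, this prints every line from a line matching CREATE TABLE through the next line matching ENGINE=InnoDB, inclusive.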

CSV transformation

I have been looking for a way to reformat a CSV (pipe-separated) file with some 'if' conditions. I'm pretty sure this can be done in PHP (strpos and if statements) or using XSLT, but I wanted to know if this is the best/easiest way to do it before I go and learn my way around a new language. Here is a small example of the kind of thing I'm trying to achieve (the real file is about 25,000 lines, if that changes the answer):
99407350|Math Book #13 (Random Information)|AB Collings|http:www.abc.com/ABC
497790366|English Book|Harold Herbert|http:www.abc.com/HH
Transform to this:
99407350|Math Book|#13|AB Collings|http:www.abc.com/ABC
497790366|English Book||Harold Herbert|http:www.abc.com/HH
Any advice about which direction I need to look in would be great.
PHP provides str_getcsv() (PHP 5.3+) and fgetcsv() (PHP 4 and 5) for this, so if you are working in a PHP environment, use those. See e.g. http://www.php.net/manual/en/function.fgetcsv.php
If you do something yourself, remember to cope with quoting like "...|..." and/or escaping like \| being used to put a | inside a field. Or test to make sure that can't happen - e.g. check the code that exports the database to CSV, if that's what's happening.
Note also - on Unix / Solaris / Linux / OS X systems,
awk -F '|' '(NF != 9)' yourfile.csv | wc
will count the number of lines with other than 9 fields (adjust 9 to the field count you expect; the sample above has 4); if you are certain | never occurs except as a field delimiter, awk is a perfectly fine language for this too, e.g. with
awk -F '|' -v OFS='|' '{ gsub(/ [(].*[)]/, "", $2); print }' yourfile.csv
Here, [(] matches ( in a way that works across different versions of awk, and the same goes for [)]; the parenthetical in the sample sits in field 2, hence $2, and setting OFS keeps the | separators when the field is rewritten.
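To get all the way to the desired output (dropping the parenthetical and splitting a trailing #13 out of the title into its own, possibly empty, field), here is a sketch under the assumptions that the file has exactly four |-separated fields as in the sample and that the #nn token, when present, sits at the end of field 2 just before the parenthetical:
awk -F '|' -v OFS='|' '{
    gsub(/ *[(].*[)]/, "", $2)          # drop "(Random Information)" from the title
    num = ""
    if (match($2, / #[^ ]+$/)) {        # peel a trailing "#13" off into its own field
        num = substr($2, RSTART + 1)
        $2 = substr($2, 1, RSTART - 1)
    }
    print $1, $2, num, $3, $4
}' yourfile.csv
On the sample above this prints the two requested output lines, including the empty third field for the English Book row.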

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parser that can handle that without failing, or without taking days. Furthermore, the fact that the text content is largely HTML, but not necessarily valid HTML, means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plaintext).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
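On the sort -u worry mentioned above: GNU sort does an external merge sort, spilling to temporary files rather than holding everything in memory, so deduplicating afterwards should be safe; a small sketch (the output name, and the -T directory, are just suggestions):
sort -u extrLinks1 > extrLinks1.dedup    # add -T /some/big/dir if /tmp is short on space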
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex to avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "((src)|(href))[[:space:]]*=[[:space:]]*[\"'][\w://\.]+[\"']"
That should be super fast, no?

FreeBSD, MySQL, Perl, bash: intermittent blocking on named pipes?

This is weird and I'm not sure who the culprit really is.
I'm doing some scripting on FreeBSD (6.2) which makes extensive use of the following bash-ism:
do_something <(mysql --skip-column-names -B -e 'select ... from ... where ...;')
... where "do_something is a somewhat crufty utility (in Perl) that won't read from a pipeline. If I use a regular file it works fine. My bash script using things like exec 4< <(...) with these sorts of queries (following by loops of the form while read x y z <&4; do ... never seem to have any issues.
However, Perl (5.8.x) seems to periodically block (apparently forever). I tried changing out the chomp(my $data = <MYDATA>); with a routine that used sysread and I wrote some test cases in Python for comparison. These seem to block far less often than the idiomatic Perl code, but they still do it sometimes. (The Python code using f.read() or os.read(f.fileno()...) seems to behave about equally in this issue).
I've tried reproducing the issue using ... <(cat ...) (where I'm cating the regular file) and that never seems to reproduce that stall.
I've glanced at some ktrace/kdump data ... but I'm far more familiar with Linux strace or even Solaris truss ... so I haven't figured out what's going on from there yet, either.
I suppose we can mostly rule out Perl, because I've reproduced the same issue using Python ... I don't see how the bash could be doing anything wrong here (it's just creating a named pipe in /var/tmp/sh-np-xxx and wiring the processes up to that).
What could the mysql shell/utility be doing that might cause this? I don't think I've seen it from anything else (such as cat or dd). I haven't tested this scenario under Linux ... but I've used <(...) (process substitution) for years under Linux and don't recall ever seeing this.
Is it a FreeBSD issue?
Sure I can work around the issue using temporary files ... but I'd sure rather understand why it's doing this (and avoid some of the races and clean-up messiness that temporary files entail).
Any suggestions?
The big difference between operating on the output of mysql and directly on a file is timing. When the perl process is stalled, the big question is: "why is it not making forward progress"? You can use the "l" option to ps to see the wait channel for the perl process; that way you can see if it blocked on a read, or if something else is going on. If it is really blocked on pipe input, I expect the MWCHAN entry for perl to be "piperd".
The same information would be interesting for the mysql process.
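For example, a minimal sketch (the PID is a placeholder; on FreeBSD the long listing includes the MWCHAN column):
ps -l -p 12345    # STAT and MWCHAN show what the process is waiting on; "piperd" means blocked reading a pipe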
What does your Python test code look like?
Another way of writing this while avoiding the bashism is this; that would allow you to rule out bash:
mysql --skip-column-names -B -e 'select ... from ... where ...;' | do_something /dev/stdin
Other interesting questions:
Does the --unbuffered option to mysql change anything?
Does piping the mysql output through dd change anything? (e.g. perlscript <(mysql ... | dd))
Summary: Need more information.