CSV transformation - csv

I have been looking for a way to reformat a CSV file (pipe-separated) according to a few conditional rules. I'm pretty sure this can be done in PHP (strpos and if statements) or with XSLT, but I wanted to know whether that is the best/easiest way to do it before I go and learn my way around a new language. Here is a small example of the kind of thing I'm trying to achieve (the real file is about 25,000 lines, if this changes the answer):
99407350|Math Book #13 (Random Information)|AB Collings|http:www.abc.com/ABC
497790366|English Book|Harold Herbert|http:www.abc.com/HH
Transform to this:
99407350|Math Book|#13|AB Collings|http:www.abc.com/ABC
497790366|English Book||Harold Herbert|http:www.abc.com/HH
Any advice about which direction I need to look in would be great.

PHP provides str_getcsv() (PHP 5.3 and later) and fgetcsv() (PHP 4 and 5) for this, so if you are working in a PHP environment, use those; both take the delimiter as an argument, so | works. See e.g. http://www.php.net/manual/en/function.fgetcsv.php
If you do something yourself, remember to cope with "...|..." and/or \| being used to put a | inside a field. Or test to make sure it can't happen, e.g. check the code that exports the database to CSV, if that's what's happening.
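Note also that on Unix / Solaris / Linux / OS X systems,
awk -F '|' '(NF != 4)' yourfile.csv | wc -l
will count the number of lines with other than 4 fields (the field count in your sample; substitute whatever your real file has). If you are certain | never occurs except as a field delimiter, awk is a perfectly fine language for this too, e.g. with
awk -F '|' -v OFS='|' '{ gsub(/ [(].*[)]/, "", $2); print }' yourfile.csv
Here, [(] matches ( in a way that works across different versions of awk, and the same goes for [)]. Setting OFS keeps | as the separator when awk rebuilds a line after modifying a field. Note that this command only strips the parenthesised note from the second field; it does not move #13 into a column of its own.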
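For comparison, here is a minimal Python sketch of the full transformation shown in the question, including the extra #13 column. The function name transform_line and the regular expression are illustrative assumptions, and the sketch assumes | never appears inside a field and that the title is always the second field.

```python
import re

# Hypothetical pattern: title, then an optional "#<digits>" issue marker,
# then an optional trailing note in parentheses.
_TITLE = re.compile(r"^(?P<title>.*?)(?:\s+(?P<issue>#\d+))?(?:\s+\(.*\))?\s*$")


def transform_line(line: str) -> str:
    """Split the second field into title and issue; drop the parenthesised note."""
    fields = line.rstrip("\n").split("|")
    match = _TITLE.match(fields[1])
    title = match.group("title")
    issue = match.group("issue") or ""  # empty column when there is no "#NN"
    return "|".join([fields[0], title, issue] + fields[2:])


if __name__ == "__main__":
    sample = [
        "99407350|Math Book #13 (Random Information)|AB Collings|http:www.abc.com/ABC",
        "497790366|English Book|Harold Herbert|http:www.abc.com/HH",
    ]
    for row in sample:
        print(transform_line(row))
```

At 25,000 lines, processing the file one line at a time like this should be quick in any of the languages mentioned.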

Related

Formatting wide output via 'column' (or similar) command(s)

This question asks for the 'inverse' of the solution given here: I would like to wrap the long column (column 4) over multiple lines. In effect, the output should look like:
cat test.csv | column -s"," -t -c5
col1  col2  col3  col4             col5
1     2     3     longLineOfText   5
                  ThatIWantTo
                  InspectAndWould
                  LikeToWrap
(excuse the u.u.o.c. duplicated over here :) )
The solution would ideally:
make use of standard *nix text-processing utilities (e.g. column, paste, pr), which are usually present on any modern Linux machine nowadays, usually coming from the core-utils package;
avoid jq, as it is not necessarily present on every (production) system;
not overheat the brain: yes... I am looking mainly at you, awk & co. gurus :). "Normal" awk / perl / sed is fine.
As a special bonus, a solution using vim would be even more welcome (again, no brain smoke please), since that would allow for syntax coloring as well.
The background: I want to be able to make sense of the output of docker history, so as a last resort even some Go template magic would suit, as would using jq.
In extreme cases (if the benefits of ease of remembering and use outweigh the inconvenience), downloading a new utility onto the server (preferably self-contained / statically linked) is OK, as is using JSON processing commands (in which case Python's json module would be preferred).
Thanks !
LE:
Please keep in mind that docker's output has its columns separated by several spaces, which unfortunately confuses most commands :(
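Since Python is mentioned as an acceptable fallback, one possible direction is sketched below. It assumes columns are separated by runs of two or more spaces (as described for docker's output) and that no cell itself contains a double space; the function name wrap_column and its parameters are illustrative, not part of any existing tool.

```python
import re
import textwrap


def wrap_column(lines, col=3, width=16):
    """Split rows on runs of 2+ spaces and wrap one column onto extra lines."""
    rows = [re.split(r" {2,}", ln.rstrip()) for ln in lines if ln.strip()]
    ncols = max(len(r) for r in rows)
    # Pad short rows (e.g. an empty trailing column) so every row has ncols cells.
    rows = [r + [""] * (ncols - len(r)) for r in rows]
    # Wrap the chosen column; an empty cell still yields one (empty) piece.
    wrapped = [textwrap.wrap(r[col], width) or [""] for r in rows]
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    widths[col] = max(len(piece) for pieces in wrapped for piece in pieces)
    out = []
    for row, pieces in zip(rows, wrapped):
        for k, piece in enumerate(pieces):
            # Only the first physical line of a record shows the other columns.
            cells = [row[i] if k == 0 else "" for i in range(ncols)]
            cells[col] = piece
            out.append("  ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip())
    return out


if __name__ == "__main__":
    import sys
    print("\n".join(wrap_column(sys.stdin.read().splitlines())))
```

textwrap breaks on whitespace and splits words longer than the width, so a long command line in the wrapped column ends up spread over several aligned rows.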

line feed within a column in csv

I have a CSV like the one below. Some of the columns contain line breaks, like column B below. When I run wc -l file.csv, Unix returns 4, but there are actually 3 records. I don't want to replace the line break with a space; I am going to load the data into a database using SQL*Loader and want to load it as it is. What should I do so that Unix treats a field containing a line break as part of a single record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (No quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like perl and python have CSV parsing libraries available, there are packages like csvkit that provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
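If csvkit is not installed, the same count can be sketched with Python's standard csv module, which also understands quoted fields with embedded newlines. The function name count_records is illustrative; like csvstat -H, it counts the header line as a record.

```python
import csv
import io


def count_records(stream) -> int:
    """Count CSV records (not physical lines) in an open text stream."""
    return sum(1 for _ in csv.reader(stream))


if __name__ == "__main__":
    sample = 'A,B,C,D\n1,"hello\nworld",sds,sds\n2,sdsd,sdds,sdds\n'
    # newline="" lets the csv module see embedded line breaks unchanged.
    print(count_records(io.StringIO(sample, newline="")))
```

For a real file, open it with open("file.csv", newline="") and pass the file object in.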

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display two numbers: the current PM2.5 and PM10 levels (59 and 15 in the excerpt above, i.e. the first value in each "v" array).
CITY="beijing"
AQIDATA=$(wget -q 0 http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
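Where jq is not available, step (2) can also be sketched with Python's standard json module instead. This is a swapped-in alternative, not the jq filter itself; the function name current_levels is illustrative, and it assumes the input is the JSON object extracted in step (1), with an "iaqi" array shaped like the excerpt in the question.

```python
import json


def current_levels(model_json: str, wanted=("pm25", "pm10")) -> dict:
    """Map each wanted pollutant to the first number in its "v" array."""
    model = json.loads(model_json)
    return {
        entry["p"]: entry["v"][0]
        for entry in model.get("iaqi", [])
        # Skip anything that is not an object with a wanted "p" key.
        if isinstance(entry, dict) and entry.get("p") in wanted
    }


if __name__ == "__main__":
    import sys
    for name, value in current_levels(sys.stdin.read()).items():
        print(name, value)
```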
I think your problem is that you have a single-line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm25","v":[ (or the pm10 equivalent) or a comma followed by some digits.
For pm25 this works, because it comes first, and there are no occurrences of ,21 or something similar before it.
However, for pm10 there are such occurrences ahead of it, namely the ones that belong to pm25. So the second field contains the empty string between ,21 and ,112.
@karakfa has a hack that seems to work, but he doesn't explain very well why it works.
What he does is take awk's record separator (which is usually a newline) and set it to any one of :, , or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and followed by a comma, also a separator.
Once it hits the matching content ("pm25") it sets a counter to 4. Then, for this and the following records, it counts the counter down: "pm25" itself, "v", the empty string between : and [, and finally the record with the number you want to output. The test evaluates as follows: 4 && ! 3 is false, 3 && ! 2 is false, 2 && ! 1 is false, but 1 && ! 0 is true. Since there is no execution block, awk simply prints this record, which is the value you want.
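To make the countdown concrete, here is a rough Python re-creation of the same idea. It is a sketch of the logic described above, not karakfa's code; the function name value_after is illustrative.

```python
import re


def value_after(text: str, key: str, skip: int = 4):
    """Split on ':', ',' or '[' and return the record `skip` steps from the key."""
    counter = 0
    for record in re.split(r"[:,\[]", text):
        if record == f'"{key}"':
            counter = skip          # like {c=4} in the awk one-liner
        if counter:
            counter -= 1            # like --c
            if counter == 0:        # like c && !--c being true
                return record
    return None


if __name__ == "__main__":
    sample = ('"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25"},'
              '{"p":"pm10","v":[15,5,69],"i":"Beijing pm10"}]')
    print(value_after(sample, "pm25"), value_after(sample, "pm10"))
```

The records seen after the key are the key itself, "v", an empty string and then the number, which is why the counter starts at 4.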
A more robust approach would probably be to use XPath to find the script, then a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTMLish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parsers that can handle that without failing or taking days. Furthermore, because the text is largely HTML but not necessarily valid HTML, it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plain text).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex: avoid the | operator and so many parentheses where you can. Also, use the head command to grab the first 100 KB for testing before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "(src|href)[[:space:]]*=[[:space:]]*[\"'][[:alnum:]_:/.]+[\"']"
That should be super fast, no?
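The same "simple pattern, streamed input" idea can be sketched in Python, which reads the file lazily so memory use stays flat. The pattern below is a deliberately loose assumption (anything starting with http://, https:// or www. up to whitespace, a quote, a bracket or a parenthesis), and the name extract_urls is illustrative. It also assumes the file contains line breaks; a 5 GB file on a single line would need to be read in fixed-size chunks instead.

```python
import re
from typing import Iterable, Iterator

# Loose, hypothetical URL pattern: no alternation-heavy validation, just a prefix
# followed by characters that cannot plausibly end or delimit a URL.
URL_RE = re.compile(r"""(?:https?://|www\.)[^\s"'<>()]+""")


def extract_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield URL-looking substrings from an iterable of lines."""
    for line in lines:
        for match in URL_RE.finditer(line):
            # Trim punctuation that usually belongs to the surrounding sentence.
            yield match.group(0).rstrip(".,;")


if __name__ == "__main__":
    import sys
    # errors="replace" keeps going when the input is not clean UTF-8.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for url in extract_urls(handle):
            print(url)
```

As with the grep version, it is worth trying this on the first 100 KB before running it over the whole file.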

Parse ClamAV logs in Bash script using Regex to insert in MySQL

Morning/Evening all,
I've got a problem with a script I'm writing for work. It uses ClamAV to scan for malware and then stores the results in MySQL, using grep and awk to pull the relevant parts of the ClamAV log into variables. I have the summary working OK, but the syntax of the detection lines makes them slightly more difficult. I'm no expert at regex by any means and this is a bit of a learning experience, so there is probably a far better way of doing it than mine!
The lines I'm trying to parse look like these:
/net/nas/vol0/home/recep/SG4rt.exe: Worm.SomeFool.P FOUND
/net/nas/vol0/home/recep/SG4rt.exe: moved to '/srv/clamav/quarantine/SG4rt.exe'
As far as I was able to establish, I need a positive lookbehind to match what happens after and before the colon, without actually matching the colon or the space after it, and I can't see a clear way of doing it from RegExr without it thinking I'm trying to look for two colons. To make matters worse, we sometimes get these too...
WARNING: Can't open file /net/nas/vol0/home/laser/samples/sample1.avi: Permission denied
The end result should be that I can build a MySQL query that inserts the path, the malware found and where it was moved to or, if there was an error, the path and the error encountered. To do that, I want to capture each element in a variable inside a while loop.
I've done the scan summary as follows:
Summary looks like:
----------- SCAN SUMMARY -----------
Known viruses: 329
Engine version: 0.97.1
Scanned directories: 17350
Scanned files: 50342
Infected files: 3
Total errors: 1
Data scanned: 15551.73 MB
Data read: 16382.67 MB (ratio 0.95:1)
Time: 3765.236 sec (62 m 45 s)
Parsing like this:
SCANNED_DIRS=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned directories" | awk '{gsub("Scanned directories: ", "");print}')
SCANNED_FILES=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned files" | awk '{gsub("Scanned files: ", "");print}')
INFECTED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Infected files" | awk '{gsub("Infected files: ", "");print}')
DATA_SCANNED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data scanned" | awk '{gsub("Data scanned: ", "");print}')
DATA_READ=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data read" | awk '{gsub("Data read: ", "");print}')
TIME_TAKEN=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Time" | awk '{gsub("Time: ", "");print}')
END_TIME=$(date +%s)
mysql -u scanner_parser --password=removed sc_live -e "INSERT INTO bs.live.bs_jobstat VALUES (NULL, '$CURRTIME', '$PID', '$IY', '$SCANNED_DIRS', '$SCANNED_FILES', '$INFECTED', '$DATA_SCANNED', '$DATA_READ', '$TIME_TAKEN', '$END_TIME');"
rm -f /srv/clamav/$IY-scan-$LOGTIME.log
Some of those variables are from other parts of the script and can be ignored. The reason I'm doing this is to save logfile clutter and have a simple web based overview of the status of the system.
Any clues? Am I going about all this the wrong way? Thanks for help in advance, I do appreciate it!
From what I can determine from the question, it seems like you are asking how to distinguish the lines you want from the logger lines that start with WARNING, ERROR, INFO.
You can do this without getting too fancy with lookahead or lookbehind. Just grep for lines beginning with
"/net/nas/vol0/home/recep/SG4rt.exe: "
then using awk you can extract the remainder of the line. Or you can gsub the prefix out like you are doing in the summary processing section.
As far as the question about processing the summary goes, what strikes me most is that you are processing the entire file multiple times, each time pulling out one kind of line. For tasks like this, I would use Perl, Ruby, or Python and make one pass through the file, collecting the pieces of each line after the colon, storing them in regular programming language variables (not env variables), and forming the MySQL insert string using interpolation.
Bash is great for some things but IMHO you are justified in using a more general scripting language (Perl, Python, Ruby come to mind).
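As a rough illustration of the single-pass approach, here is a minimal Python sketch. The regular expressions are guesses based only on the sample lines in the question, and parse_clamav_log and the dictionary keys are made-up names; real ClamAV logs may contain other line shapes.

```python
import re

# Patterns inferred from the sample log lines in the question (assumptions).
FOUND_RE = re.compile(r"^(?P<path>.+): (?P<sig>\S+) FOUND$")
MOVED_RE = re.compile(r"^(?P<path>.+): moved to '(?P<dest>.+)'$")
ERROR_RE = re.compile(r"^WARNING: Can't open file (?P<path>.+): (?P<reason>.+)$")
SUMMARY_KEYS = {"Scanned directories", "Scanned files", "Infected files",
                "Data scanned", "Data read", "Time"}


def parse_clamav_log(lines):
    """Make one pass over the log, sorting each line into the right bucket."""
    result = {"detections": {}, "moved": {}, "errors": [], "summary": {}}
    for raw in lines:
        line = raw.rstrip("\n")
        if m := ERROR_RE.match(line):
            result["errors"].append((m["path"], m["reason"]))
        elif m := FOUND_RE.match(line):
            result["detections"][m["path"]] = m["sig"]
        elif m := MOVED_RE.match(line):
            result["moved"][m["path"]] = m["dest"]
        else:
            # Summary lines look like "Key: value"; split at the first ": ".
            key, sep, value = line.partition(": ")
            if sep and key in SUMMARY_KEYS:
                result["summary"][key] = value
    return result


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        print(parse_clamav_log(handle))
```

When the values are then sent to MySQL, passing them as bound parameters through a database driver avoids quoting problems with paths that contain apostrophes.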