Searching a .CSV file with AWK - only working with the first row

I've been trying to search through a specific column of a .csv file to find cells containing a particular word. However, it's only working for the first row (i.e. the headings) in my .csv file.
The file is a series of over 10,000 forum posts, with column 1 as the post key and column 2 as the post text. The headings as below are 'key', 'annotated sentence'.
key,annotated sentence
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
"(595, 0)",So you're actually claiming that it is a lie to say that the UK has a lower gun crime rate than the US? Even if the police were miscounting crimes it's still a huge and unjustified leap in logic to conclude from that that the UK does not have a lower gun crime rate.
"(736, 3)","The anti-abortionists claim a load of **** on many issues. I don't listen to them. To put the ""life"" of an unfertilized egg above that of a person is grotesquely sick IMO. I support any such stem cell research wholeheartedly."
The CSV separator is a comma, and the text delimiter is ".
if I try:
awk -F, '$1 ~ /key/ {print}' posts_file.csv > output_file.csv
it will output the headings row no problem. However, I have tried:
awk -F, '$1 ~ /212/ {print}' posts_file.csv > output_file.csv
awk -F, '$2 ~ /Canada/ {print}' posts_file.csv > output_file.csv
and neither of these works: no matches are found, though there should be. I can't figure out why. Any ideas? Thanks in advance.

awk to the rescue!
In general awk doesn't handle complex CSV, but in your case, since the key and the annotated sentence have very distinct value types, you can extend your pattern search to the whole record instead of a single field. The trick is defining the record, which, again based on your format, can be done as well. For example:
$ awk -v RS='\n"' '/Canada/{print RT $0}' csv
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
and this
$ awk -v RS='\n"' '/(212, 2)/{print RT $0}' csv
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
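If you want to keep the match restricted to the key column, the same record trick can be combined with a field separator that splits on the "," between the two quoted cells. This is a sketch that assumes gawk, since multi-character RS and the RT variable are gawk extensions:
$ awk -v RS='\n"' -F'","' '$1 ~ /212, 2/ {print RT $0}' csv
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
Here $1 holds only the key (for example (212, 2)), so a pattern such as /Canada/ tested against $1 can no longer match the post text by accident.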

Python's CSV parsing supports your format out of the box.
Below is a simple script that you could call as follows:
# csvfilter <1-basedFieldNdx> <regexStr> < <infile> > <outfile>
csvfilter 1 'key' < posts_file.csv > output_file.csv
csvfilter 1 '212' < posts_file.csv > output_file.csv
csvfilter 2 'Canada' < posts_file.csv > output_file.csv
Sample script csvfilter:
#!/usr/bin/env python
# coding=utf-8
import csv, sys, re

# Assign arguments to variables.
fieldNdx = int(sys.argv[1]) - 1  # Index of field to filter; Python lists are 0-based!
reStr = sys.argv[2] if len(sys.argv) > 2 else ''  # Filter regex

# Read from stdin...
reader = csv.reader(sys.stdin)
# ... write to stdout.
writer = csv.writer(sys.stdout, reader.dialect)

# Read each row...
for row in reader:
    # Match the target field against the filter regex and
    # print the row only if it matches.
    if re.search(reStr, row[fieldNdx]):
        writer.writerow(row)

OpenRefine could help with the search.

One way to use awk safely with complex CSV is to use a "csv2tsv" utility to convert the CSV file to a format that can be handled properly by awk.
Usually the TSV ("tab-separated values") format is just right for the job.
(If the final output must be CSV, then either a complementary "tsv2csv" utility can be used, or awk itself can do the job -- though some care may be required to get it exactly right.)
So the pipeline might look like this:
csv2tsv < input.csv | awk -F\\t 'BEGIN{OFS=FS} ....' | tsv2csv
There are several alternatives for csv-to-tsv conversion, ranging from roll-your-own scripts to Excel, but I'd recommend taking the time to check that whichever tool or toolset you select satisfies the "edge case" requirements that are of interest to you.
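For the tsv2csv direction that awk can handle itself, here is a minimal sketch; it assumes the fields contain no literal tabs or newlines (which well-formed TSV forbids anyway):
awk -F'\t' '{
  out = ""
  for (i = 1; i <= NF; i++) {
    f = $i
    gsub(/"/, "\"\"", f)                      # double any embedded quotes
    if (f ~ /[",]/) f = "\"" f "\""           # quote fields containing commas or quotes
    out = out (i > 1 ? "," : "") f
  }
  print out
}'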

Related

Fuzzing command line arguments [argv]

I have a binary I've been trying to fuzz with AFL; the only thing is that AFL only fuzzes stdin and file inputs, and this binary takes its input through its arguments: pass_read [input1] [input2]. I was wondering if there are any methods/fuzzers that allow fuzzing in this manner?
I do not have the source code, so writing a harness is not really applicable.
Michal Zalewski, the creator of AFL, states in this post:
AFL doesn't support argv fuzzing, because TBH, it's just not horribly useful in
practice. There is an example in experimental/argv_fuzzing/ showing how to do it
in a general case if you really want to.
Link to the mentioned example on GitHub: https://github.com/google/AFL/tree/master/experimental/argv_fuzzing
There are some instructions in the file argv-fuzz-inl.h (I haven't tried it myself).
Bash-only solution
As an example, let's generate 10 random strings and store them in a file:
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 10 | head -n 10 > string-file.txt
Next, let's read two lines at a time from string-file.txt and pass them to our application:
exec {handle}< string-file.txt
while read -u "$handle" string1; do
    read -u "$handle" string2
    pass_read "$string1" "$string2" >> crash_file.txt
done
exec {handle}<&-
We then have any crashes stored within crash_file.txt for further analysis.
This may not be the most elegant solution, but perhaps it gives you an idea of some other possibilities if no existing tool fulfils the requirements.
I looked at the AFLplusplus repo on GitHub. Inside AFLplusplus/utils/argv_fuzzing/, there is a Makefile. If you run it, you will get a .so file (a shared library) that you can use to do argv fuzzing, even if you only have the binary. Obviously, you must use AFL_PRELOAD. You can read more in the README.
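As a rough sketch of how that could be wired up (the argvfuzz64.so name and the NUL-separated seed format are assumptions on my part; check the README in utils/argv_fuzzing/ before relying on them):
# Build the preload library.
cd AFLplusplus/utils/argv_fuzzing && make

# Seed corpus: the preloaded library rebuilds argv from stdin; the arguments
# are assumed here to be NUL-separated (verify against the README).
mkdir -p in
printf 'input1\0input2\0' > in/seed

# Fuzz the binary with the library preloaded into the target.
AFL_PRELOAD=$PWD/argvfuzz64.so afl-fuzz -i in -o out -- /path/to/pass_read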

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display two numbers: the current PM2.5 and PM10 levels (the first number in each of the v arrays above, i.e. 59 and 15).
CITY="beijing"
AQIDATA=$(wget -q http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
  gawk 'BEGIN { RS = "function" }
        $1 ~ /getAqiModel/ {
          sub(/.*var model=/, "");
          sub(/;return model;}/, "");
          print }'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm10","v":[ (respectively the pm25 version of it) or a comma followed by some digits.
For pm25 this works, because it is the first one, and there are no occurrences of ,21 or anything similar before it.
However, for pm10 there are several such delimiters, belonging to pm25, ahead of it. So the second field contains the empty string between ,21 and ,112.
#karakfa has a hack that seems to work -- but he doesn't explain very well why it works.
What he does is take awk's record separator (which is usually a newline) and set it to match any of the characters :, , and [ (i.e. RS='[:,[]'). So in your case, one of the records is "pm25", because it is preceded by a colon, which is a separator, and followed by a comma, also a separator.
Once awk hits the matching record ("pm25") it sets a counter to 4. Then, for this and each following record, the pattern c&&!--c counts that counter down: "pm25" itself, "v", the empty string between : and [, and finally the record holding the number you want. 4 && !3 is false, 3 && !2 is false, 2 && !1 is false, but 1 && !0 is true, so only that last record matches. Since there is no action block, awk simply prints that record, which is the value you want.
A more robust approach would probably be to use XPath to find the script, and then a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
awk to the rescue!
If you have to, you can use this hacky approach: smart counters with hand-crafted delimiters. Setting RS instead of FS hands the looping over fields to awk itself. Note that a multi-character (regex) RS is not available in all awks; gawk supports it.
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

Compare CSV files

I am currently using a Windows utility called TableTextCompare.
This tool can take 2 CSV files and compare them. The nice thing about it is that it can make the comparison even if the records of the 2 files are not sorted in the same order or the fields are not positioned in the same order.
As such, the following 2 files would result in a successful comparison
(File1.csv)
FirstName,LastName,Age
Mona,Sax,30
Max,Payne,43
Jack,Lupino,50
(File2.csv)
FirstName,Age,LastName
Max,43,Payne
Jack,50,Lupino
Mona,30,Sax
What I am looking for is to do the same thing from the command-line with just 1 difference:
I would like the comparison to be performed in one direction only, i.e. if File2.csv is as follows (a subset of File1.csv), the comparison should pass
(File2.csv)
FirstName,Age,LastName
Jack,50,Lupino
I do not particularly care if it's going to be in some programming language, a dedicated cli tool or a shell script (e.g. using awk). I have some experience with Java and Groovy but would like to be pointed to some initial direction.
I can offer a Python solution:
import csv

with open("file1.csv") as f1, open("file2.csv") as f2:
    r1 = list(csv.DictReader(f1))
    r2 = csv.DictReader(f2)
    for item in r2:
        if item not in r1:
            print("r2 is not a subset of r1!")
            break
This is actually a bit more verbose than necessary in Python (but easier to understand); I personally would have used a generator expression:
import csv

with open("file1.csv") as f1, open("file2.csv") as f2:
    r1 = list(csv.DictReader(f1))
    r2 = csv.DictReader(f2)
    if all(item in r1 for item in r2):
        print("r2 is a subset of r1")
If you can afford to do a case insensitive comparison, and if there are no duplicates within File2.csv that must be matched within File1.csv, and if File1.csv does not contain \\ or \", then all you need is a simple FINDSTR command.
The following will list lines in File2.csv that do not appear in File1.csv:
findstr /vxig:"File1.csv" "File2.csv"
If all you want is an indication whether File1.csv is a superset of File2.csv, then
findstr /vxig:"File1.csv" "File2.csv" >nul && (echo File1 is NOT a superset of File2) || (echo File1 IS a superset of File2)
The search should not have to be case insensitive, except there is a nasty FINDSTR bug: it may fail to find matches when there are multiple case sensitive literal search strings of varying size. The case insensitive option avoids the bug. See Why doesn't this FINDSTR example with multiple literal search strings find a match? for more info.
The search will not work properly if File2.csv contains \\ or \" because FINDSTR will treat them as \ and " respectively. See What are the undocumented features and limitations of the Windows FINDSTR command? for more info. The accepted answer has sections describing FINDSTR escape sequences about half way down.
You can take a look at q - Text as a Database, which allows executing SQL directly on CSV files, including joins. This will allow doing a compare easily, and much more, such as matching specific columns for equality, or getting specific columns from rows that don't match, etc.
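For example, here is a sketch of a one-direction check with q, using the column names from the sample files above (I haven't run this against real data, so treat it as a starting point):
# Print every row of File2.csv that has no exact counterpart in File1.csv;
# empty output means File2 is a subset of File1.
q -d , -H "SELECT f2.* FROM File2.csv f2
           LEFT JOIN File1.csv f1
             ON  f2.FirstName = f1.FirstName
             AND f2.LastName  = f1.LastName
             AND f2.Age       = f1.Age
           WHERE f1.FirstName IS NULL"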
Full disclosure - It's my own open source tool.
Harel Ben-Attia

CSV transformation

I have been looking for a way to reformat a CSV (pipe-separated) file with some "if" logic. I'm pretty sure this can be done in PHP (strpos and if statements) or using XSLT, but I wanted to know if this is the best/easiest way to do it before I go and learn my way around a new language. Here is a small example of the kind of thing I'm trying to achieve (the real file is about 25,000 lines; does this change the answer?):
99407350|Math Book #13 (Random Information)|AB Collings|http:www.abc.com/ABC
497790366|English Book|Harold Herbert|http:www.abc.com/HH
Transform to this:
99407350|Math Book|#13|AB Collings|http:www.abc.com/ABC
497790366|English Book||Harold Herbert|http:www.abc.com/HH
Any advice about which direction I need to look in would be great.
PHP provides str_getcsv() (PHP 5.3+) and fgetcsv() (PHP 4 and 5) for this, so if you are working in a PHP environment, use those. See e.g. http://www.php.net/manual/en/function.fgetcsv.php
If you do something yourself, remember to cope with "...|..." and/or \| to have | inside a field. Or test to make sure it can't happen - e.g. check the code that exports the database to CSV if that's what's happening.
Note also - on Unix / Solaris / Linux / OS X systems,
awk -F '|' '(NF != 4)' yourfile.csv | wc
will count the number of lines that do not have exactly 4 fields (the field count in your sample); if you are certain | never occurs except as a field delimiter, awk is a perfectly fine language for this too, e.g. with
awk -F '|' -v OFS='|' '{ gsub(/ [(].*[)]/, "", $2); print }' yourfile.csv
Here, [(] matches ( in a way that works across different versions of awk, and same for [)].
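A fuller sketch of the exact transformation shown in the question, still in awk; it assumes the parenthetical note is always to be dropped and that the '#' number, when present, sits at the end of the second field:
awk -F'|' -v OFS='|' '{
  gsub(/ *\(.*\)/, "", $2)                    # drop the "(Random Information)" style note
  if (match($2, / #[0-9]+$/)) {               # move a trailing "#13" into its own field
    $2 = substr($2, 1, RSTART - 1) OFS substr($2, RSTART + 1)
  } else {
    $2 = $2 OFS                               # no number: add an empty field instead
  }
  print
}' yourfile.csv
For the two sample lines above this produces exactly the two target lines shown in the question.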

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regexes, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know any parsers that can handle that without failing, or taking days. Furthermore, the fact that the text content is largely HTML, but not necessarily valid HTML, means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plaintext).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex to avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "((src)|(href))[[:space:]]*=[[:space:]]*[\"'][\w://\.]+[\"']"
That should be super fast, no?
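If absolute URLs are all you need, it may also be worth timing a single grep -E pass with -o against the chained version. This sketch only catches http(s)/ftp and www. links and ignores everything else:
grep -Eo '(https?|ftp)://[^"'\''<> ]+|www\.[^"'\''<> ]+' allContent | sort -u > extrLinks1
Since only the extracted URLs reach sort, rather than the whole 5 GB input, buffering is much less of a concern there.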