Read tab-delimited data into Hive array - csv

Data format I need:
12cef8e1b711a351 [1377045694501,1377045728475,1377045709652]
12cf3cb988f10a87 [1380741459591,1380739871201,1380739785397,1380740303830,1380739849591]
12d1be8adb90a88b [1375541238666,1375541281821]
12d29ba61341e7ce [1377855844089,1377855785342]
12d2e28e50d42d19 [1381974506104,1381973579872,1377988785664,1381976074258]
Data format I have - everything is tab-delimited:
12cef8e1b711a351 1377045694501 1377045728475 1377045709652
12cf3cb988f10a87 1380741459591 1380739871201 1380739785397 1380740303830 1380739849591
12d1be8adb90a88b 1375541238666 1375541281821
12d29ba61341e7ce 1377855844089 1377855785342
12d2e28e50d42d19 1381974506104 1381973579872 1377988785664 1381976074258
How do I process the tab-delimited data so that the first field stays separated from the rest by a tab, and everything else is comma-delimited and surrounded by []? Possibly each comma-delimited item also has to be enclosed in "".
I need to read these data into a Hive table:
CREATE TABLE id_timestamps (id STRING, timestamps array<STRING>);
Can I read it directly into Hive with some trick, or should I first transform the tab-delimited data with awk or sed? Please help with some suggestions and recipes.
Thanks!

This awk script produces the desired format:
awk '{printf "%s\t[", $1; for(i=2;i<=NF;++i) printf "%s%s", $i, (i<NF?",":"]\n")}' file
Print the first column, followed by a tab character and the opening "[". Print the rest of the columns followed by a ",", except the last, which is followed by a "]" and a newline.
Testing it out:
$ awk '{printf "%s\t[", $1; for(i=2;i<=NF;++i) printf "%s%s", $i, (i<NF?",":"]\n")}' file
12cef8e1b711a351 [1377045694501,1377045728475,1377045709652]
12cf3cb988f10a87 [1380741459591,1380739871201,1380739785397,1380740303830,1380739849591]
12d1be8adb90a88b [1375541238666,1375541281821]
12d29ba61341e7ce [1377855844089,1377855785342]
12d2e28e50d42d19 [1381974506104,1381973579872,1377988785664,1381976074258]
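As for reading it into Hive directly (the other half of the question): the [ ] in the desired format is only how Hive renders an ARRAY column on SELECT, so another option is to keep the file bracket-free, with the timestamps comma-joined after the tab, and let the table definition do the splitting. A sketch, assuming Hive's delimited SerDe and a hypothetical file name id_timestamps.tsv:

awk '{out=$2; for(i=3;i<=NF;++i) out=out "," $i; printf "%s\t%s\n", $1, out}' file > id_timestamps.tsv

CREATE TABLE id_timestamps (id STRING, timestamps ARRAY<STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'id_timestamps.tsv' INTO TABLE id_timestamps;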

Related

CSV Column Insertion via awk

I am trying to insert a column in front of the first column in a comma separated value file (CSV). At first blush, awk seems to be the way to go, but I'm struggling with how to move down the new column.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution available in bash. For other shells use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file
With awk alone you could try the following solution, written and tested with the shown samples.
awk -v value="$(echo -e "col\n0\nA\n2\nC")" '
BEGIN{
  FS=OFS=","
  split(value,arr,ORS)
}
{
  $1=arr[FNR] OFS $1
}
1
' Input_file
Explanation:
First, create an awk variable named value whose content is the output of the shell's echo command. NOTE: the -e option makes echo interpret the \n sequences as newlines rather than literal characters.
In the BEGIN section of the awk program, set FS and OFS to , for all lines of Input_file, and split value on ORS (newline) into the array arr, one element per input line.
In the main block, prepend arr[FNR] and the OFS to the first field, then print the line (the trailing 1).

Convert JSON objects to a JSON array without a programming language (shell scripting excepted)

I have a file containing multiple JSON objects and need to convert them to a JSON array. I have bash and Excel installed, but cannot install any other tool.
{"name": "a","age":"17"}
{"name":"b","age":"18"}
To:
[{"name": "a","age":"17"},
{"name":"b","age":"18"}]
Assuming one object per line, as shown in the OP's question:
echo -n "["; while IFS= read -r line; do echo "${line},"; done < test.txt | sed -re '$ s/(.*),/\1]/'
Result:
[{"name": "a","age":"17"},
{"name":"b","age":"18"}]
Inspired by https://askubuntu.com/a/475804
awk '(NR==FNR){count++} (NR!=FNR){ if (FNR==1) printf("["); printf("%s", $0); print (FNR==count)?"]":"," } END {if (count==0) print "[]"}' file file
A less compact but more readable version:
awk '
(NR==FNR) {
  count++;
}
(NR!=FNR) {
  if (FNR==1)
    printf("[");
  printf("%s", $0);
  if (FNR==count)
    print "]";
  else
    print ",";
}
END {
  if (count==0) print "[]";
}' file file
The trick is to give the same file twice to awk. Because NR==FNR is only true while reading the first file, the first pass is dedicated to counting the number of lines into the variable count.
The second pass, with NR!=FNR, applies the following algorithm to each line:
Write [ for the first line only
Then write the record, using printf instead of print in order to avoid the newline ending
Then write either ] or , depending on whether we are on the last line or not, using print in order to end with a newline
The END command is just a failsafe to output an empty array in case the file is empty.
Assumptions:
no requirement to (re)format the input data
Sample input:
$ cat raw.dat
{"name": "a","age":"17"}
{"name":"b","age":"18"}
{"name":"C","age":"23"}
One awk idea:
awk 'BEGIN {pfx="["} {printf "%s%s",pfx,$0; pfx=",\n"} END {printf "]\n"}' raw.dat
Where:
for each input line we printf the line without a terminating linefeed
for the first line we use a prefix (pfx) of [
for subsequent lines the prefix (pfx) is set to ,\n (ie, terminate the previous line with ,\n)
once the file has been processed we terminate the last input line with a printf "]\n"
requires a single pass through the input file
This generates:
[{"name": "a","age":"17"},
{"name":"b","age":"18"},
{"name":"C","age":"23"}]
Making sure @chepner's comment (re: a sed solution) isn't lost in the mix:
sed '1s/^/[/;2,$s/^/,/;$s/$/]/' raw.dat
This generates:
[{"name": "a","age":"17"}
,{"name":"b","age":"18"}
,{"name":"C","age":"23"}]
NOTE: I can remove this if @chepner wants to post this as an answer.

transform multiline text into csv with awk sed and grep

I run a shell command that returns a list of repeated values like this (note the indentation):
Name: vm346
cpu 1 (12%) 6150m (76%)
memory 1130Mi (7%) 1130Mi (7%)
Name: vm847
cpu 6 (75%) 30150m (376%)
memory 12980Mi (87%) 12980Mi (87%)
Name: vm848
cpu 3500m (43%) 17150m (214%)
memory 6216Mi (41%) 6216Mi (41%)
I am trying to transform that data like this (in csv):
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The problem is that any given dataset like the one above is always on more than one line.
when I pipe that into awk it drives me mad because even if I use:
BEGIN{ FS="\n" }
to try and stitch the data together into one line, it doesn't work. No matter what I do, awk keeps the name value as a separate line above everything else.
I am sorry I haven't much code to share but I have been spinning my wheels with this for a few hours now and I am running out of ideas...
I can solve this in Perl:
perl -ane 'print join ",", @F[1 .. $#F]; print $F[0] eq "memory" ? "\n" : ","'
It should be easy to translate it to awk if you need it.
How does it work?
-a splits each line on whitespace into the @F array
-n reads the input line by line and runs the code specified after -e for each line
We print all the elements but the first one separated by commas (see join)
We then look at the first column, if it's memory, we are at the last line of the block, so we print a newline, otherwise we print a comma
With AWK, one option is to set RS to "Name: ", and ignore the first record with NR > 1, e.g.
awk -v RS="Name: " 'BEGIN{OFS=","} NR > 1 {print $1, $3, $4, $5, $6, $8, $9, $10, $11}' file
#> vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
#> vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
#> vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
awk '{$1=""}1' | paste -sd' \n' - | awk '{$1=$1}1' OFS=,
Get rid of the first column. Join every three rows. Same idea with sed:
sed 's/^ *[^ ]* *//' | paste -sd'  \n' - | sed 's/  */,/g'
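The delimiter list given to paste cycles, one delimiter per join, which is what folds every three rows into one line; a quick illustration:
$ printf '%s\n' a b c d e f | paste -sd'  \n' -
a b c
d e f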
Something else:
awk -v OFS=, '
$1=="Name:" {
  sep=ors  # empty before the first block, ORS (newline) before later ones
  ors=ORS
} {
  for (i=2;i<=NF;++i) {
    printf "%s%s",sep,$i
    sep=OFS  # subsequent fields are comma-separated
  }
} END {printf "%s",ors}'
Or if you want to print an ORS based on the first field being "memory" (note that this program may end without printing a terminating ORS):
awk -v OFS=, '{for (i=2;i<=NF;++i) printf "%s%s",$i,(i==NF && $1=="memory" ? ORS : OFS)}'
something else else:
awk -v OFS=, '
index($0,$1)==1 {  # a non-indented "Name:" line starts a new block
  OFS=ors          # empty before the first block, a newline before later ones
  ors=ORS
} {
  $1=""            # drop the first field; the assignment rebuilds $0 with OFS
  printf "%s",$0
  OFS=ofs          # restore "," for the data fields
} END {printf "%s",ors} BEGIN {ofs=OFS}'
This might work for you (GNU sed):
sed -nE '/^ +\S+ +/{s///;H;$!d};x;/./s/\s+/,/gp;x;s/^\S+ +//;h' file
In overview, the sed program handles three cases: indented lines, lines already gathered (except when the current line is the first line of the file), and non-indented lines.
Turn off implicit printing and enable extended regexp's. (-nE).
If the current line is indented, remove the indent, the first field and any following spaces, append the result to the hold space and if it is not the last line, delete it.
Otherwise, check the hold space for gathered lines and if found, replace one or more whitespaces by commas and print the result. Then prep the current line by removing the first field and any following spaces and replace the hold space with the result.
The solution seems logically back-to-front, but programming in this style avoids checking for end-of-file multiple times and avoids labels and gotos; a commented layout follows below.
N.B. This solution will work for any number of indented lines.
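For reference, a commented layout of the same one-liner (still GNU sed):
sed -nE '
  # an indented line: remove the indent and the first field ...
  /^ +\S+ +/{
    s///
    # ... append what remains to the hold space ...
    H
    # ... and, unless this is the last line, start the next cycle
    $!d
  }
  # a non-indented line (or the last line): print the gathered block, if any,
  # with runs of whitespace (including embedded newlines) turned into commas
  x
  /./s/\s+/,/gp
  x
  # prime the hold space with the current line minus its first field
  s/^\S+ +//
  h
' file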
Here is a ruby to do that:
ruby -e '
s=$<.read
s.scan(/^([^ \t]+:)([\s\S]+?)(?=^\1|\z)/m). # parse blocks
map(&:last). # get data part
# parse and join the data fields:
map{|block| block.split(/\n[ \t]+[^ \t]+[ \t]+/)}.
map{|lines| lines.map(&:strip).join(" ").split().join(",")}.
each{|l| puts "#{l}"}
' file
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The advantage is that this is not dependent on the number of lines or the number of fields. It is parsing data that is in blocks of the form:
START: ([ \t]+[data_with_no_space])*\n
l1 ([ \t]+[data_with_no_space])*\n
...
START:
...
Works this way:
Parse the blocks with the scan regex shown above;
Save an array of the data elements;
Join the sub arrays and then split into data fields;
Join(',') to make a csv.

Remove double quotes if delimiter value is not present in data

An input file is given, each line of which has quotes around each column and ends with a carriage return/newline.
If a quoted field contains a newline, the broken lines have to be joined back together so the field sits on one line (see line 2 of the example below).
Remove the double quotes around each column if the delimiter (,) is not present in the data.
Remove the carriage return characters, i.e. (^M).
To exemplify, given the following input file
"name","address","age"^M
"ram","abcd,^M
def","10"^M
"abhi","xyz","25"^M
"ad","ram,John","35"^M
I would like to obtain the following output by means of a sed/perl/awk script/oneliner.
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
Solutions I have tried so far:
For appending to the previous line:
sed '/^[^"]*"[^"]*$/{N;s/\n//}' sample.txt
For replacing control-M characters:
perl -pne 's/\\r//g' sample.txt
But I didn't achieve the final output I required.
Use a library to parse CSV files. Apart from it being a good idea in general, here you have very specific reasons: embedded newlines and embedded delimiters.
In Perl a good library is Text::CSV (which wraps Text::CSV_XS if installed). A basic example
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\n+//g for @$row;
$csv->say(\*STDOUT, $row);
}
Comments
The binary option in the constructor is what handles newlines embedded in data
Once a line is read into the array reference $row I remove newlines in each field with a simplistic regex. By all means please improve this as/if needed
The pruning of $row works as follows. In a foreach loop each element is really aliased by the loop variable, so if that gets changed the array changes. I used default where elements are aliased by $_, which the regex changes so $row changes.
I like this compact shortcut because it has such a distinct look that I can tell from across the room that an array is being changed in place; so I consider it a sort-of-an-idiom. But if it is in fact confusing please by all means write out a full and proper loop
The processed output is printed to STDOUT. Or, open an output file and pass that filehandle to say (or to print in older module versions) so the output goes directly to that file
The above prints, for the sample input provided in the question
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
This might work for you (GNU sed):
sed ':a;/[^"]$/{N;s/\n//;ba};s/"\([^",]*\)"/\1/g' file
The solution is in two parts:
Join broken lines to make whole ones.
Remove double quotes surrounding fields that do not contain commas.
If the current line does not end with a double quote, append the next line, remove the newline and repeat. Otherwise, remove the double quotes surrounding fields that contain neither double quotes nor commas.
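Laid out with comments, the same script reads:
sed '
  :a
  # the line does not end with a double quote: it is broken, so
  # append the next line, remove the newline, and test again
  /[^"]$/{
    N
    s/\n//
    ba
  }
  # remove the quotes around fields that contain neither quotes nor commas
  s/"\([^",]*\)"/\1/g
' file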
N.B. This assumes that fields do not contain escaped (doubled) double quotes. If they do, the condition for the first step would need to be amended and the embedded double quotes would need to be catered for.
FPAT is the way to go with GNU awk; it handles comma-separated data with quoted fields. The steps:
remove the ^M characters
join the broken lines
remove the quotes
dos2unix sample.txt
awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt > tmp && mv tmp sample.txt
"name","address","age"
"ram","abcd,def","10"
"abhi","xyz","25"
"ad","ram,John","35"
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
All in one go:
dos2unix sample.txt && awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1'
Normally you set the field separator FS (or -F) to tell awk how fields are separated. FPAT="([^,]+)|(\"[^\"]+\")" instead tells awk what a field looks like, using a regex. This particular regex is complicated and often used with CSV (a small demo follows at the end of this answer).
for (i=1;i<=NF;i++) loops through the fields on the line one by one.
if($i!~",") if the field does not contain a comma, then
$i=substr($i,2,length($i)-2) removes its first and last characters, the quotes.
If a field for some reason does not contain ", this is more robust:
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1' file
It will not do anything to a field that does not contain a double quote.
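A quick way to see what FPAT is doing is to print each field on its own (GNU awk; a toy example):
$ echo '"abcd,def",10' | awk -v FPAT='([^,]+)|("[^"]+")' '{for (i=1;i<=NF;i++) print i": "$i}'
1: "abcd,def"
2: 10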
With perl, please try the following:
perl -e '
while (<>) {
s/\r$//; # remove trailing CR code
$str .= $_;
}
while ($str =~ /("(("")|[^"])*"\n?)|((^|(?<=,))[^,]*((?=,)|\n))/g) {
$_ = $&;
if (/,/) { # the element contains ","
s/\n//g; # then remove newline(s) if any
} else { # otherwise remove surrounding double quotes
s/^"//s; s/"$//s;
}
push(@ary, $_);
if (/\n$/) { # newline terminates the element
print join(",", @ary);
@ary = ();
}
}' sample.txt
Output:
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35

Get JSON files from particular interval based on date field

I have a lot of JSON files whose structure looks like below:
{
key1: 'val1'
key2: {
'key21': 'someval1',
'key22': 'someval2',
'key23': 'someval3',
'date': '2018-07-31T01:30:30Z',
'key25': 'someval4'
}
key3: []
... some other objects
}
My goal is to get only those files whose date field falls within some period.
For example, from 2018-05-20 to 2018-07-20.
I can't rely on the files' creation dates, because all of them were generated on the same day.
Maybe it is possible using sed or a similar program?
Fortunately, the date in this format can be compared as a string. You only need something to parse the JSONs, e.g. Perl:
perl -l -0777 -MJSON::PP -ne '
$date = decode_json($_)->{key2}{date};
print $ARGV if $date gt "2018-07-01T00:00:00Z";
' *.json
-0777 makes perl slurp the whole files instead of reading them line by line
-l adds a newline to print
$ARGV contains the name of the currently processed file
See JSON::PP for details. If you have JSON::XS or Cpanel::JSON::XS, you can switch to them for faster processing.
I had to fix the input (replace ' by ", add commas, etc.) in order to make the parser happy.
If your files actually contain valid JSON, the task can be accomplished in a one-liner with jq, e.g.:
jq 'if .key2.date[0:10] | (. >= "2018-05-20" and . <= "2018-07-31") then input_filename else empty end' *.json
This is just an illustration. jq has date-handling functions for dealing with more complex requirements.
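For instance, a sketch using fromdateiso8601 to compare the timestamps numerically, for the 2018-05-20 to 2018-07-20 window from the question:
jq 'if (.key2.date | fromdateiso8601) >= ("2018-05-20T00:00:00Z" | fromdateiso8601)
    and (.key2.date | fromdateiso8601) <= ("2018-07-20T23:59:59Z" | fromdateiso8601)
    then input_filename else empty end' *.json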
Handling quasi-JSON
If your files contain quasi-JSON, then you could use jq in conjunction with a JSON rectifier. If your sample is representative, then hjson
could be used, e.g.
for f in *.qjson
do
hjson -j "$f" | jq --arg f "$f" '
if .key2.date[0:7] == "2018-07" then $f else empty end'
done
Try like this:
Find an online converter (for example: https://codebeautify.org/json-to-excel-converter#) and convert JSON to CSV
Open CSV file with Excel
Filter your data