Merging CSV files in Hadoop [closed]

I am new to the Hadoop framework and would really appreciate it if someone could walk me through this.
I am trying to merge two .csv files.
The two files have the same headers and are ordered the same, etc.
The thing is that I have no idea how to merge these files into one and then clean out the empty lines and unused columns.

The two files have the same headers and are ordered the same, etc.
Since the files have the same structure, you can upload them to the same directory.
hdfs dfs -mkdir -p /path/to/input
hdfs dfs -put file1.csv /path/to/input
hdfs dfs -put file2.csv /path/to/input
Hadoop tools that read from hdfs:///path/to/input will treat every file in that directory as part of a single dataset.
Note, you'll want to strip the header from both files before placing them into HDFS in this fashion.
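For example, a sketch of stripping the headers with tail before the upload (file names here are placeholders):
# tail -n +2 prints everything from the second line on, dropping the header
tail -n +2 file1.csv > file1_noheader.csv
tail -n +2 file2.csv > file2_noheader.csv
hdfs dfs -put file1_noheader.csv file2_noheader.csv /path/to/input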
Another option would be to concatenate the files locally. (Again, remove the headers first, or at least from all but the first file)
cat file1.csv file2.csv > file3.csv
hdfs dfs -put file3.csv /path/to/input
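If you want to keep file1.csv's header as the single header of the merged file, a bash-specific variant using process substitution:
# keep the header from file1.csv, drop it from file2.csv
cat file1.csv <(tail -n +2 file2.csv) > file3.csv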
After that, use whatever Hadoop tools you know to read the files.
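Going the other direction, if you ever need the contents of that directory back as a single local file, getmerge concatenates everything under an HDFS path (paths are placeholders):
hdfs dfs -getmerge /path/to/input /local/path/merged.csv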

Since they have the same structure, load them both using Pig into two relations and then UNION the two relations. Finally, you can FILTER out the records that match certain criteria. I am assuming the files have two fields each for simplicity.
A = LOAD '/path/file1.csv' USING PigStorage(',') AS (a1:chararray, a2:chararray);
B = LOAD '/path/file2.csv' USING PigStorage(',') AS (b1:chararray, b2:chararray);
C = UNION A, B;
D = FILTER C BY ($0 IS NOT NULL AND $1 IS NOT NULL); -- drop records whose first or second column is null
DUMP D;
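A sketch of running this, assuming the statements are saved as merge.pig (-x local executes against the local filesystem, which is handy for testing before pointing the LOAD paths at HDFS):
pig -x local merge.pig
To write the cleaned result to a file instead of dumping it to the console, replace DUMP D; with STORE D INTO '/path/output' USING PigStorage(',');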

Related

Batch script to remove the first 6 lines of all CSV files

I tried a few solutions here, but I wasn't able to get one to work.
A short summary of what it needs to do:
I have a lot of CSV files whose first 6 lines contain header information. I want to create a batch script that deletes those first 6 lines from all of the CSV files.
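A minimal sketch of the idea in a POSIX shell (the question asks for a Windows batch file, but the logic carries over: keep everything from line 7 onward and rewrite each file in place):
# strip the first 6 lines of every CSV file in the current directory
for f in *.csv; do
  tail -n +7 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done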

Prettify a one-line JSON file [closed]

I downloaded a 203775480-byte (~200 MiB; the exact size matters for a later error) JSON file which has all its entries on one line. Needless to say, my text editor (Vim) cannot efficiently navigate it and I'm not able to understand anything from it. I'd like to prettify it. I tried cat file.json | jq '.', jq '.' file.json, and cat file.json | python -m json.tool, but none worked. The former two commands print nothing on stdout, while the latter says Expecting object: line 1 column 203775480 (char 203775479).
I guess it's broken somewhere near the end, but of course I cannot understand where as I cannot even navigate it.
Does anyone have another idea for prettifying it? (I've also tried gg=G in Vim; it did not work.)
I found that the file was indeed broken: I happened to notice a '[' at the beginning of the file, so I struggled to get to the end of the file and added a ']' there (it took me maybe 5 minutes).
Then I reran cat file.json | python -m json.tool and it worked like a charm.
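For reference, jq can also be used to validate before prettifying: the empty filter produces no output, so a broken file yields only the parse error (with its line and column), and a valid file yields nothing:
jq empty file.json               # prints a parse error and exits non-zero if the JSON is broken
jq '.' file.json > pretty.json   # pretty-print once the file parses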

How to convert a JSON array to a CSV row? [closed]

I'm trying to convert a large JSON file to CSV format. (I've just learned how to use jq, so I'm still a beginner.)
I've successfully managed to convert most of the data, however, I'm stuck at an array. Every JSON object in the file is supposed to be converted to a single CSV row, and I can't get that to work.
I've been trying to help myself with an existing answer:
Convert/Export JSON to CSV
But the problem is that this method writes a row for each item in the array, needlessly repeating information.
I am using the same type of command as the above answer, the only difference being the names of the columns, but the array is what blocks me.
For instance, I could have a JSON file resembling:
{"resources":[
{"id":"001","name":"Robert","items":[
{"label":"00A","name":"Pen"},
{"label":"00B","name":"Paper"}]},
{"id":"002","name":"Bruce","items":[
{"label":"00A","name":"Pen"},
{"label":"00B","name":"Computer"},
{"label":"00C","name":"Headphones"}]}
]
}
That I would like to become:
001,Robert,Pen,Paper
002,Bruce,Pen,Computer,Headphones
I only need the name column from the array items.
For the moment, the result is:
001,Robert,Pen
001,Robert,Paper
002,Bruce,Pen
002,Bruce,Computer
002,Bruce,Headphones
The problem is that the actual array is about 30 items long for each JSON object, making this approach unusable.
jq itself can flatten each resource into one row: collect the id, the name, and every item name into a single array, then format it with @csv:
$ jq -r '.resources[] | [.id, .name, .items[].name] | @csv' < /tmp/b.json
"001","Robert","Pen","Paper"
"002","Bruce","Pen","Computer","Headphones"
You can use awk to fix up your current output file: it uses the first two fields as a key and appends each line's third field to that key's accumulated value.
awk -F, '{a[$1","$2]=a[$1","$2]","$3}; END{for (v in a) print v a[v]}' in.txt | sort >out.txt
Input:
001,Robert,Pen
001,Robert,Paper
002,Bruce,Pen
002,Bruce,Computer
002,Bruce,Headphones
Output:
001,Robert,Pen,Paper
002,Bruce,Pen,Computer,Headphones

Can a TSV file contain tabs? [closed]

I have to store data containing tab characters in a file. I would like to use .TSV (Tab-Separated Values) files.
Here is an example of the data (I manually escaped the tabs and carriage returns for the example):
Computation Display
0 for (int i=0;i<10;i++)\n\tx*=3; printf ("<b>éàè'"</b>");
1 float pi=3.1415; printf("%d %f",x,xf);
Is there a proper way to escape tabs? Should I use \t, or should I use quotes or double quotes?
The abbreviation CSV means "Comma-Separated Values", but in practice it is used for all files containing values separated by some separator character. That's why spreadsheet applications like OpenOffice Calc or Microsoft Excel open a dialog window letting you configure the separator and quoting character when you open a file with the .csv extension.
If your question is how the separator character can be part of a value in a CSV file, the most common way is to quote the values. Here is an example of the quoting being done with the values
a,b
c"d
e
with , as the separator character and " as the quoting character:
"a,b","c""d",e
Doubling the quote character inside a quoted value (c""d) is the way Excel escapes it; you may also see variants where the quote is escaped differently, for example with a backslash (c\"d).
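As a minimal sketch of that quoting applied to a TSV file (hypothetical values; whether the quotes are honored depends on the consuming parser):
# one data row: the second field contains a literal tab, so it is wrapped
# in double quotes; a quote inside the value would be doubled ("")
printf 'id\tcode\n' > data.tsv
printf '1\t"x*=3;\tprintf(...);"\n' >> data.tsv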
There are libraries out there that do the parsing and creation of CSV files for you. We use the Ostermiller CSV library here (there might be better ones nowadays, but it does its job, so there was no need to change the library after we introduced it 10 years ago).

Combine files on commit in Mercurial

I've got a project with 2 files I want to source-control using Mercurial:
An SCX file, which is a binary (database) file
An SCT file, which is a text file
My filter:
[encode]
**.scx = tempfile: sccxml INFILE OUTFILE 0
[decode]
**.scx = tempfile: sccxml INFILE OUTFILE 1
Problem
sccxml only receives the path to the SCX file
The SCX file cannot be converted to a text file without the corresponding SCT file
Workarounds
Is it possible to combine the files before the filter runs?
Is it possible to pass both files' paths to the sccxml converter?
UPDATE:
No, I'm not using the Win32Text extension.
The sccxml executable needs both the SCT file and the SCX file as parameters to convert them to a text file (the text representations of both files get tar'ed into one file).
I want to have the binary file as a text file in the repo, to get meaningful diffs. I am currently trying to achieve this using a precommit hook.
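For the precommit-hook route, a minimal sketch of the wiring in .hg/hgrc (the sccxml invocation and file names are placeholders; the real arguments depend on the tool):
[hooks]
# regenerate the text representation from both files before every commit
# (hypothetical invocation; adjust to sccxml's actual CLI)
precommit.sccxml = sccxml project.scx project.sct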