JSON to CSV conversion on HDFS

I am trying to convert a JSON file into CSV.
I have Java code which does this perfectly on the Unix file system and on the local file system.
I have written the main class below to perform this conversion on HDFS.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ClassMain {
    public static void main(String[] args) throws IOException {
        String uri = args[1];
        String uri1 = args[2];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        FSDataOutputStream out = fs.create(new Path(uri1));
        try {
            in = fs.open(new Path(uri));
            JsonToCSV toCSV = new JsonToCSV(uri);
            toCSV.json2Sheet().write2csv(uri1);
            IOUtils.copyBytes(in, out, 4096, false);
        }
        finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
json2Sheet and write2csv are the methods which perform the conversion and the write operation.
I am running this jar using the command below:
hadoop jar json-csv-hdfs.jar com.nishant.ClassMain /nishant/large.json /nishant/output
The problem is, it does not write anything at /nishant/output. It creates a zero-sized /nishant/output file.
Maybe the usage of copyBytes is not a good idea here.
How can I achieve this on HDFS when it works fine on the Unix FS and the local FS?
Here I am trying to convert a JSON file to CSV, not trying to map JSON objects to their values.

FileSystem needs only one configuration key to successfully connect to HDFS:
conf.set("fs.defaultFS", "hdfs://host:port"); // older releases use the deprecated key "fs.default.name"
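For example, a minimal sketch of what that looks like (the host and port are illustrative):

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // illustrative host:port
FileSystem fs = FileSystem.get(conf); // resolves schemeless paths against HDFS

Without this key (or a core-site.xml on the classpath that sets it), the default is file:///, so schemeless paths such as /nishant/output resolve against the local file system rather than HDFS.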

Flink Data Stream CSV Writer not writing data to CSV file

I am new to Apache Flink and trying to learn data streams. I am reading student data, which has 3 columns (Name, Subject and Marks), from a CSV file. I have applied a filter on marks, selecting only those records where marks > 40.
I am trying to write this data to a CSV file, but the program runs successfully and the CSV file remains empty. No data gets written to the CSV file.
I tried different syntax for writing the CSV file, but none of it worked for me. I am running this locally through Eclipse. Writing to a text file works fine.
DataStream<String> text = env.readFile(format, params.get("input"),
        FileProcessingMode.PROCESS_CONTINUOUSLY, 100);

DataStream<String> filtered = text.filter(new FilterFunction<String>() {
    public boolean filter(String value) {
        String[] tokens = value.split(",");
        return Integer.parseInt(tokens[2]) >= 40;
    }
});

filtered.writeAsText("testFilter", WriteMode.OVERWRITE);

DataStream<Tuple2<String, Integer>> tokenized = filtered
        .map(new MapFunction<String, Tuple2<String, Integer>>() {
            public Tuple2<String, Integer> map(String value) throws Exception {
                return new Tuple2("Test", Integer.valueOf(1));
            }
        });

tokenized.print();
tokenized.writeAsCsv("file:///home/Test/Desktop/output.csv",
        WriteMode.OVERWRITE, "/n", ",");

try {
    env.execute();
} catch (Exception e1) {
    e1.printStackTrace();
}
}
}
Below is my input CSV format:
Name1,Subj1,30
Name1,Subj2,40
Name1,Subj3,40
Name1,Subj4,40
tokenized.print() prints all the correct records.
I did a little experimenting, and found that this job works just fine:
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WriteCSV {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.fromElements(new Tuple2<>("abc", 1), new Tuple2<>("def", 2))
                .writeAsCsv("file:///tmp/test.csv", FileSystem.WriteMode.OVERWRITE, "\n", ",");
        env.execute();
    }
}
If I don't set the parallelism to 1, then the results are different. In that case, test.csv is a directory containing four files, each written by one of the four parallel subtasks.
I'm not sure what's wrong in your case, but maybe you can work backwards from this example (assuming it works for you).
You should remove the tokenized.print(); call before tokenized.writeAsCsv();, because the print() sink consumes the data.
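Combining both answers, a hedged sketch of the adjusted tail of the job (the path is the asker's; note the row delimiter should presumably be "\n", not the "/n" in the original snippet):

env.setParallelism(1); // a single subtask writes one file instead of a directory of part files

// tokenized.print() removed, per the answer above
tokenized.writeAsCsv("file:///home/Test/Desktop/output.csv",
        WriteMode.OVERWRITE, "\n", ",");

env.execute();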

How to uncompress gzipped data with Apache Spark in Java

I have a sequence file. Each value in this file is a JSON file compressed with gzip. My problem: how do I read the gzipped JSON files with Apache Spark?
Here is my code:
JavaSparkContext jsc = new JavaSparkContext("local", "sequencefile");
JavaPairRDD<String, byte[]> file = jsc.sequenceFile("file:\\E:\\part-00004", String.class, byte[].class);

JavaRDD<String> map = file.map(new Function<Tuple2<String, byte[]>, String>() {
    public String call(Tuple2<String, byte[]> stringTuple2) throws Exception {
        byte[] uncompress = uncompress(stringTuple2._2);
        return uncompress.toString();
    }
});
But this code is not working.
Have a nice day.
While creating the Spark context, use the constructor that also takes the Spark configuration as its third parameter.
Set the Spark configuration value for the key "org.apache.hadoop.io.compression.codecs" as below:
"org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"
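A minimal sketch of the setup the answer describes (the configuration key and codec list are taken verbatim from the answer; master and app name follow the question):

SparkConf sparkConf = new SparkConf();
sparkConf.set("org.apache.hadoop.io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec");

// the constructor that also takes the configuration as its third parameter
JavaSparkContext jsc = new JavaSparkContext("local", "sequencefile", sparkConf);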

Convert CSVWriter to InputStream

I have the following code, where I want to write a list of objects to a CSV for which I have defined the attributes and items. I want to convert the writer into an input stream so I can read the values back and perform some computations. I also want to store this file in a datastore like Amazon S3.
How do I convert the writer into an input stream? I see no defined API. Can I read the file somehow, like CSVReader reader = new CSVReader(csvWriter)?
public <T> CSVWriter convertModelToObject(List<T> attributes, final Class<T> classType) throws IOException {
    CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), com.opencsv.CSVParser.DEFAULT_SEPARATOR,
            com.opencsv.CSVParser.DEFAULT_QUOTE_CHARACTER);
    BeanToCsv<T> bean = new BeanToCsv<>();
    HeaderColumnNameMappingStrategy<T> mappingStrategy = new HeaderColumnNameMappingStrategy<>();
    mappingStrategy.setType(classType);
    bean.write(mappingStrategy, writer, attributes);
    return writer;
}
Consider replacing the FileWriter you are using with a PipedWriter, created with a PipedReader that you would then use when creating the CSVReader. You can find an example of the PipedReader/PipedWriter pair here.
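A minimal sketch of that piped approach, assuming opencsv's CSVWriter/CSVReader (class name and data are illustrative; the writing side must run on its own thread or the pipe's buffer can fill up and deadlock):

import java.io.PipedReader;
import java.io.PipedWriter;

import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

public class PipedCsvExample {
    public static void main(String[] args) throws Exception {
        PipedWriter pipeOut = new PipedWriter();
        PipedReader pipeIn = new PipedReader(pipeOut);

        // Producer thread: writes CSV rows into the pipe and closes it when done.
        Thread producer = new Thread(() -> {
            try (CSVWriter writer = new CSVWriter(pipeOut)) {
                writer.writeNext(new String[] { "name", "marks" });
                writer.writeNext(new String[] { "Name1", "40" });
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer side: reads the rows back from the same pipe.
        try (CSVReader reader = new CSVReader(pipeIn)) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println(String.join("|", row));
            }
        }
    }
}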
Yes, you can.
The solution is to use an InputStreamReader to read the file, pass that stream to a BufferedReader, and read line by line, or however you want.
You can refer to this for more methods: https://www.geeksforgeeks.org/different-ways-reading-text-file-java/
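For instance, a small sketch of that reading pattern (the file name is illustrative):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadCsvLines {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("yourfile.csv"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process one CSV line here
                System.out.println(line);
            }
        }
    }
}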

How to read a CSV file from HDFS?

I have my data in a CSV file. I want to read the CSV file which is in HDFS.
Can anyone help me with the code?
I'm new to Hadoop. Thanks in advance.
The classes required for this are FileSystem, FSDataInputStream and Path. The client should look something like this:
public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));

    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream inputStream = fs.open(new Path("/path/to/input/file"));
    System.out.println(inputStream.readChar());
}
FSDataInputStream has several read methods. Choose the one which suits your needs.
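For a CSV file you will usually want whole lines rather than single characters. A hedged sketch of one common pattern, reusing the fs handle from the example above (needs java.io.BufferedReader and java.io.InputStreamReader; the path is illustrative):

BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/path/to/input/file"))));
try {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] fields = line.split(","); // naive split; quoted fields need a real CSV parser
        // process the fields here
    }
} finally {
    reader.close();
}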
If it is MapReduce, it's even easier:
public static class YourMapper extends
        Mapper<LongWritable, Text, Your_Wish, Your_Wish> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Framework does the reading for you...
        String line = value.toString(); // line contains one line of your csv file.
        // do your processing here
        // ....................
        // ....................
        context.write(Your_Wish, Your_Wish);
    }
}
If you want to use MapReduce, you can use TextInputFormat to read line by line and parse each line in the mapper's map function; a sketch of the job setup is shown below.
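A hedged sketch of the driver wiring (uses org.apache.hadoop.mapreduce classes; YourMapper is the mapper shown earlier, while YourDriver, the paths, and the output settings are illustrative):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "read-csv");

job.setJarByClass(YourDriver.class);            // illustrative driver class
job.setMapperClass(YourMapper.class);
job.setInputFormatClass(TextInputFormat.class); // hands the mapper one line of the CSV at a time
// also set job.setOutputKeyClass(...) / job.setOutputValueClass(...) to match the Your_Wish types

FileInputFormat.addInputPath(job, new Path("/path/to/input/file"));
FileOutputFormat.setOutputPath(job, new Path("/path/to/output"));

System.exit(job.waitForCompletion(true) ? 0 : 1);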
The other option is to develop (or find an already developed) CSV input format for reading data from the file.
There is one old tutorial here: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html, but the logic is the same in newer versions.
If you are using a single process for reading data from the file, it is the same as reading a file from any other file system. There is a nice example here: https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-read-a-file-from-hdfs
HTH

Batch convert visual foxpro dbf tables to csv

I have a huge collection of Visual FoxPro DBF files that I would like to convert to CSV.
(If you like, you can download some of the data here. Click on the 2011 link for Transaction Data, and prepare to wait a long time...)
I can open each table with DBF View Plus (an awesome freeware utility), but exporting them to CSV takes a few hours per file, and I have several dozen files to work with.
Is there a program like DBF View Plus that will allow me to set up a batch of DBF-to-CSV conversions to run overnight?
/Edit: Alternatively, is there a good way to import .dbf files straight into SQL Server 2008? They should all go into one table, as each file is just a subset of records from the same table and should have all the same column names.
Load up your list of FoxPro files in an array/list, then call ConvertDbf on each to convert them from FoxPro to CSV files. See the C# console application code below...
Credit: c# datatable to csv for the DataTableToCSV function.
using System;
using System.Data;
using System.Data.OleDb;
using System.IO;
using System.Linq;
using System.Text;

namespace SO8843066
{
    class Program
    {
        static void Main(string[] args)
        {
            string connectionString = @"Provider=VFPOLEDB.1;Data Source=C:\";
            string dbfToConvert = @"C:\yourdbffile.dbf";

            ConvertDbf(connectionString, dbfToConvert, dbfToConvert.Replace(".dbf", ".csv"));

            Console.WriteLine("End of program execution");
            Console.WriteLine("Press any key to end");
            Console.ReadKey();
        }

        static void DataTableToCSV(DataTable dt, string csvFile)
        {
            StringBuilder sb = new StringBuilder();

            var columnNames = dt.Columns.Cast<DataColumn>().Select(column => column.ColumnName).ToArray();
            sb.AppendLine(string.Join(",", columnNames));

            foreach (DataRow row in dt.Rows)
            {
                var fields = row.ItemArray.Select(field => field.ToString()).ToArray();
                for (int i = 0; i < fields.Length; i++)
                {
                    sb.Append("\"" + fields[i].Trim());
                    sb.Append((i != fields.Length - 1) ? "\"," : "\"");
                }
                sb.Append("\r\n");
            }

            File.WriteAllText(csvFile, sb.ToString());
        }

        static void ConvertDbf(string connectionString, string dbfFile, string csvFile)
        {
            string sqlSelect = string.Format("SELECT * FROM {0}", dbfFile);

            using (OleDbConnection connection = new OleDbConnection(connectionString))
            {
                using (OleDbDataAdapter da = new OleDbDataAdapter(sqlSelect, connection))
                {
                    DataSet ds = new DataSet();
                    da.Fill(ds);
                    DataTableToCSV(ds.Tables[0], csvFile);
                }
            }
        }
    }
}
In that case: SQL Server, I think, has the capability of connecting to FoxPro tables. I'm not exactly sure how, as I haven't done it recently (the last time I used SQL Server was about 8+ years ago). I'm sure there are other threads out there that can point you to connecting SQL Server to VFP.
I quickly searched and saw this thread.
In addition, you might need the latest OleDb provider to establish the connection, which I've also posted in a thread here. That thread also shows a sample of the connection string information you may need from SQL Server. The data source information should point to the PATH where the .DBF files are found, not the specific name of the .DBF you are trying to connect to.
Hope this helps you out.
This works very well, and thanks for the solution. I used this to convert some Visual FoxPro DBF tables to flat files. With these tables, there is the additional challenge of converting fields of type Currency.
Currency fields are a 64-bit (8-byte) signed integer inside a 36-element byte array, starting at byte offset 27. The integer is then divided by 10,000 to get the 4-decimal-precision equivalent (the /1E4 in the snippet below).
If you have this type of field, try this inside the fields FOR loop:
if (("" + fields[i]).Equals("System.Byte[]"))
{
StringBuilder db = new StringBuilder();
byte[] inbytes = new byte[36];
inbytes = ObjectToByteArray(fields[i]);
db.Append("" + (double)BitConverter.ToInt64(inbytes,27)/1E4);
sb.Append("\"" + db);
}
With the following helper method:
private static byte[] ObjectToByteArray(Object obj)
{
    BinaryFormatter bf = new BinaryFormatter();
    using (var ms = new MemoryStream())
    {
        bf.Serialize(ms, obj);
        return ms.ToArray();
    }
}
Check out my answer to Foxbase to PostgreSQL data transfer (dbf files reader).