Parsing missing column values in a CSV in Spark

I have a huge DB-imported table with ~270 columns. I created a JavaRDD and used it to fill a dataframe.
Scenario: if all the fields in the CSV file are present, then everything is great. But if there are empty fields in the CSV, e.g.
Value1,,,,,,value7,,,,,
then writing to Parquet or a Hive table store fails with an IndexOutOfBoundsException (the row has fewer values than the schema has columns). I don't want to use the spark-csv library.
I tried using filters, but to no avail, as I need all columns even when there is no data for them in the CSV. Please let me know if I am missing something.
JavaRDD<String> tLogRDD = jsc.textFile(dataFile);
String schema = tLogRDD.first();
List<StructField> columns = new ArrayList<StructField>();
for (String fieldName : schema.split(",")) {
    columns.add(DataTypes.createStructField(fieldName, DataTypes.StringType, false));
}
StructType schemaStructType = DataTypes.createStructType(columns);
System.out.println("XXXXXXXXXXXX-Row Read Start-XXXXXXXXXXXXXXX");
@SuppressWarnings("serial")
JavaRDD<Row> rowRDD = tLogRDD.map(
    new Function<String, Row>() {
        @Override
        public Row call(String record) throws Exception {
            String[] fields = record.split(",");
            Object[] fields_converted = fields;
            return RowFactory.create(fields_converted);
        }
    });
// apply schema to rows
DataFrame tLogfDataFrame = hContext.createDataFrame(rowRDD, schemaStructType);
System.out.println("DataFrame Constructed Successfully");
tLogfDataFrame.show(10);
tLogfDataFrame.save("C:/Users/Documents/1001.csv", "parquet");

You can use the CSV reader built into Spark, like:
sparkSession.read()
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(--file path--)
That's easier and comes with a set of options.
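If you would rather keep the plain textFile approach from the question, note that Java's String.split(",") drops trailing empty strings, so a row like Value1,,,,, produces fewer fields than the schema has columns, which is exactly the IndexOutOfBoundsException described. Below is a minimal sketch of a map function that keeps the empty fields and pads short rows; the class name and numColumns parameter are illustrative, not a drop-in fix.

import java.util.Arrays;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Hypothetical helper: pad/trim each CSV record to the schema width.
public class PadToSchema implements Function<String, Row> {
    private final int numColumns; // e.g. schemaStructType.fields().length

    public PadToSchema(int numColumns) {
        this.numColumns = numColumns;
    }

    @Override
    public Row call(String record) throws Exception {
        // limit -1 keeps trailing empty strings: "a,,".split(",", -1) -> ["a", "", ""]
        String[] fields = record.split(",", -1);
        // Defensive: pad short rows with nulls so the row width always matches the schema.
        Object[] values = Arrays.copyOf(fields, numColumns, Object[].class);
        return RowFactory.create(values);
    }
}

Used as tLogRDD.map(new PadToSchema(schemaStructType.fields().length)), every row then matches the schema width. You may also want to create the schema fields with nullable = true if padded nulls can occur.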

Related

Read a CSV file that has a JSON column in SSIS?

I have the following CSV file that has 4 columns. The last column, addresses, holds the history of 2 addresses in JSON format. When I try to read it in SSIS, it splits the JSON along the commas instead of grouping all the addresses under one column.
I am using a flat-file connector for this. Is there any other source component for this type of content? How can I parse this in SSIS so that there are just 4 columns and the addresses all appear under one column?
id,title,name,addresses
J44011,Mr,James,"{""address_line_1"": 45, ""post_code"": ""XY7 10PG""},{""address_line_1"": 15, ""post_code"": ""AB7 1HG""}"
You can use a script component to process the JSON into its own detail table.
I created a dataflow with a script component; here are the steps:
On Inputs, add the id and addresses columns.
On Inputs and Outputs, add a new output and create its columns (remember to set the data types).
The script:
public class Addresses
{
    public int address_line_1 { get; set; }
    public string post_code { get; set; }
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Test if addresses exist; if not, leave the row processing
    if (string.IsNullOrEmpty(Row.addresses)) return;

    // Fix the JSON to make it an array of objects
    string json = string.Format("[{0}]", Row.addresses);

    // Load into an array of Addresses
    Addresses[] adds = new System.Web.Script.Serialization.JavaScriptSerializer().Deserialize<Addresses[]>(json);

    // Process the array
    foreach (var a in adds)
    {
        rowsAddressesBuffer.AddRow();
        rowsAddressesBuffer.ID = Row.id;
        rowsAddressesBuffer.Address1 = a.address_line_1;
        rowsAddressesBuffer.PostalCode = a.post_code;
    }
}
Notes:
A class is added to store the results.
The JSON had to be fixed up to make it an array of objects.
You need to add a reference to System.Web.Extensions.
The output of this then goes to the load destination. Make sure the text qualifier is defined as a double quote (").
I have tried to read it in SSIS but it splits the JSON along the comma (,) instead of grouping all the addresses under one column.
In order to force SSIS to read the flat-file row as 4 columns, open the flat file connection manager, go to the Advanced tab, and add only 4 columns. Make sure the last column's length is 4000. This forces the 4th column to be read without splitting it.
After importing the data into SQL Server, you can parse the JSON content using the OPENJSON() function:
Parse and Transform JSON Data with OPENJSON (SQL Server)

Parsing CSV data format in Apache Camel

I followed an example from the book Camel in Action on how to marshal and unmarshal the CSV data format. However, I want to unmarshal a CSV file (comma-separated delimiter) and split the body. Then I will use content-based .choice() to distribute messages according to the required tasks.
In fact, the first and simplest example didn't work for me. I am using Camel 2.15.6 (camel-core, camel-context, camel-csv, commons-csv) and Java 7.
public void configure()
{
    CsvDataFormat csv = new CsvDataFormat();
    csv.setDelimiter(",");
    from("file:test?noop=true")
        .unmarshal().csv()
        .split(body())
        .to("file:out");
}
Please find below the stack trace.
Can you try removing noop=true? If noop is true, the file is not moved or deleted in any way. That option is good for read-only data, or for ETL-type requirements.
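If you do remove noop=true, Camel's file component by default moves the processed file into a .camel subdirectory. Here is a minimal sketch of the same route with an explicit move target instead (the done directory name is arbitrary):

import org.apache.camel.builder.RouteBuilder;

public class CsvRoutes extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // Move each processed file into a done/ subfolder instead of leaving it in place.
        // Alternatively, file:test?delete=true removes the input file after processing.
        from("file:test?move=done")
            .unmarshal().csv()
            .split(body())
            .to("file:out");
    }
}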
Pass csv as a parameter like this:
public void configure() throws Exception
{
    CsvDataFormat csv = new CsvDataFormat();
    csv.setDelimiter(",");
    from("file:test?noop=true")
        .unmarshal(csv)
        .split(body())
        .to("file:out");
}
Or it may help you to set up content-based routing. Here I filter according to the header of the CSV:
// Route 1: filter CSV based on header
from("file:/home/r2/Desktop/csvFile?noop=true")
    .choice().when(body().contains("partyName"))
        .to("direct:partyNameCSV")
    .when(body().contains("\"stuffName\""))
        .to("direct:stuffNameCSV")
    .otherwise().endChoice();

// Route 2: partyNameCSV
from("direct:partyNameCSV")
    .unmarshal(csv)
    .process(new PartyNameCSVProcessor())
    .end();

// Route 3: stuffNameCSV
from("direct:stuffNameCSV")
    .unmarshal(csv)
    .process(new StuffCSVProcessor())
    .end();
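PartyNameCSVProcessor and StuffCSVProcessor are not shown in the answer; as a rough, hypothetical sketch, after unmarshal(csv) the message body is a List of rows (each row itself a List of String fields), so such a processor could look like this:

import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;

// Hypothetical processor: logs each row of the unmarshalled CSV.
public class PartyNameCSVProcessor implements Processor {
    @Override
    public void process(Exchange exchange) throws Exception {
        // camel-csv unmarshals the body into a List<List<String>> (rows of fields)
        @SuppressWarnings("unchecked")
        List<List<String>> rows = exchange.getIn().getBody(List.class);
        for (List<String> row : rows) {
            System.out.println("partyName row: " + row);
        }
    }
}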

Reading massive JSON files into Spark Dataframe

I have a large nested NDJSON (newline-delimited JSON) file that I need to read into a single Spark dataframe and save to Parquet. In an attempt to flatten the schema, I use this function:
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}
on the dataframe returned by
val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)
I've also switched this to val df = spark.read.json(path) so that it only handles NDJSON and not multi-line JSON; same error.
This causes an out-of-memory error on the workers:
java.lang.OutOfMemoryError: Java heap space
I've altered the JVM memory options and the Spark executor/driver options, to no avail.
Is there a way to stream the file, flatten the schema, and add to the dataframe incrementally? Some lines of the JSON contain new fields that are absent from the preceding entries... so those would need to be filled in later.
No workaround. The issue was with the JVM object limit. I ended up using a Scala JSON parser and built the dataframe manually.
You can achieve this in multiple ways.
First, while reading, you can provide the schema for the dataframe to read the JSON, or you can allow Spark to infer the schema by itself.
Once the JSON is in a dataframe, you can flatten it in the following ways:
a. Using explode() on the dataframe - to flatten it.
b. Using Spark SQL and accessing the nested fields with the . operator. You can find examples here.
Lastly, if you want to add new columns to the dataframe:
a. The first option, using withColumn(), is one approach. However, this is done for each new column added and for the entire data set.
b. Using SQL to generate a new dataframe from the existing one - this may be the easiest.
c. Lastly, using map, then accessing the elements, getting the old schema, adding new values, creating a new schema, and finally getting the new df - as below.
One withColumn call works on the entire RDD, so it's generally not good practice to use that method for every column you want to add. There is a way to work with the columns and their data inside a map function: since a single map function is doing the job, the code that adds the new columns and their data runs in parallel.
a. Gather the new values based on your calculations.
b. Add these new column values to the main RDD as below:
val newColumns: Seq[Any] = Seq(newcol1, newcol2)
Row.fromSeq(row.toSeq.init ++ newColumns)
Here, row is the reference to the row inside the map method.
c. Create the new schema as below:
val newColumnsStructType = StructType(Seq(StructField("newcolName1", IntegerType), StructField("newColName2", IntegerType)))
d. Add it to the old schema:
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create the new dataframe with the new columns:
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
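For the explode() route from option (a), here is a minimal sketch, written against the Spark 2.x Java API for consistency with the rest of this page; the input path and the nested array column events are hypothetical:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlattenExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("flatten").getOrCreate();

        // Hypothetical NDJSON input with a nested array column "events".
        Dataset<Row> df = spark.read().json("/tmp/input.ndjson");

        // explode() emits one output row per array element; nested struct
        // fields are then addressed with the dot operator.
        Dataset<Row> flat = df
            .withColumn("event", explode(col("events")))
            .select(col("id"), col("event.type"), col("event.timestamp"));

        flat.write().parquet("/tmp/flattened");
    }
}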

Append data to existing file in Windows Store 8 using JSON

I have created an application in which I am inserting data into a file. It works fine. Here is my code:
private async void btnSearch_Click(object sender, RoutedEventArgs e)
{
    UserDetails details = new UserDetails
    {
        Name = TxtName.Text,
        Course = TxtCouse.Text,
        City = TxtCity.Text
    };
    string jsonContents = JsonConvert.SerializeObject(details);
    StorageFolder localFolder = await ApplicationData.Current.LocalFolder.CreateFolderAsync("Storage", CreationCollisionOption.ReplaceExisting);
    StorageFile textFile = await localFolder.CreateFileAsync("UserDetails.txt", CreationCollisionOption.ReplaceExisting);
    using (IRandomAccessStream textStream = await textFile.OpenAsync(FileAccessMode.ReadWrite))
    {
        // write the JSON string!
        using (DataWriter textWriter = new DataWriter(textStream))
        {
            textWriter.WriteString(jsonContents);
            await textWriter.StoreAsync();
        }
    }
    this.Frame.Navigate(typeof(BlankPage1));
}
Now I want that, when a user enters new data, the data is appended to the same existing file.
Appending data to a JSON text file would mean parsing the file to find the correct location to insert the text; because JSON is structured with {} delimiters, it's not a simple matter of appending text to the end of the file.
Given that your data doesn't look that large, the easiest thing to do is to deserialize the JSON from the existing file into memory, add your additional properties to that data structure, and then serialize back to JSON. In that case you probably just want to maintain the structure in memory during the app session and overwrite the file with new data whenever you need to. But of course you could also reopen the file, read/parse the JSON into memory, and then rewrite the contents.
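The deserialize-modify-reserialize pattern described above is language-agnostic; here is a minimal sketch in Java with Gson (2.8.6+ for JsonParser.parseString), assuming the file holds a JSON array of records rather than a single object:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class AppendJson {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("UserDetails.txt"); // hypothetical path

        // 1. Read and parse the whole file (assumed to hold a JSON array of records).
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        JsonArray records = JsonParser.parseString(text).getAsJsonArray();

        // 2. Append the new record to the in-memory structure.
        JsonObject record = new JsonObject();
        record.addProperty("Name", "Alice");
        record.addProperty("Course", "Math");
        record.addProperty("City", "Oslo");
        records.add(record);

        // 3. Serialize the whole structure back, overwriting the file.
        Files.write(file, new Gson().toJson(records).getBytes(StandardCharsets.UTF_8));
    }
}

In the Windows Store app, the same three steps map onto JsonConvert.DeserializeObject, adding to a list of UserDetails, and JsonConvert.SerializeObject before rewriting the file.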

How to insert Excel data in a database with Java

I want to insert data from an Excel file into a local database on a UNIX server with Java, without any manipulation of the data.
1. Someone told me that I have to convert the Excel file to .csv to conform with UNIX. I created a CSV file for each sheet (I have 12) with a macro. The problem is it changed the date format from DD-MM-YYYY to MM-DD-YYYY. How do I avoid this?
2. I used the LOAD DATA command to insert data from the CSV files into my database. There's a date column that is optionally specified in the Excel file, so in the CSV it becomes ,, and the LOAD DATA fails (an argument is needed). How can I fix this?
Thanks for your help.
It should be quite easy to read the values out of Excel with Apache POI. That way you save yourself the extra step of converting to another format, and you avoid the problems that arise when your data contains commas and you convert to CSV.
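A minimal sketch of that approach, assuming the poi-ooxml usermodel API, an .xlsx file, and a hypothetical layout with the date in the third column:

import java.io.FileInputStream;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelReader {
    public static void main(String[] args) throws Exception {
        // Keep the date format under your control instead of relying on a CSV export.
        SimpleDateFormat ddMMyyyy = new SimpleDateFormat("dd-MM-yyyy");

        try (FileInputStream in = new FileInputStream("data.xlsx"); // hypothetical file
             Workbook workbook = new XSSFWorkbook(in)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                Cell dateCell = row.getCell(2); // hypothetical: date in the third column
                if (dateCell == null) {
                    // Optional date missing: insert NULL into the database here.
                    continue;
                }
                // Assumes the cell is date-formatted in the workbook.
                Date date = dateCell.getDateCellValue();
                System.out.println(ddMMyyyy.format(date));
            }
        }
    }
}

Reading the date cells directly sidesteps the DD-MM-YYYY to MM-DD-YYYY swap from the CSV export, and a missing optional cell can simply become a NULL in the insert step.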
Save the Excel file in CSV (comma-separated values) format. It will make it easy to read and parse with fairly simple use of StringTokenizer.
Use MySQL (or SQLite, depending on your needs) and JDBC to load the data into the database.
Here is a CSVEnumeration class I developed:
package com.aepryus.util;

import java.util.*;

public class CSVEnumeration implements Enumeration {
    private List<String> tokens = new Vector<String>();
    private int index = 0;

    public CSVEnumeration(String line) {
        for (int i = 0; i < line.length(); i++) {
            StringBuffer sb = new StringBuffer();
            if (line.charAt(i) != '"') {
                // Unquoted field: consume up to the next comma
                while (i < line.length() && line.charAt(i) != ',') {
                    sb.append(line.charAt(i));
                    i++;
                }
                tokens.add(sb.toString());
            } else {
                // Quoted field: consume up to the closing quote
                i++;
                while (line.charAt(i) != '"') {
                    sb.append(line.charAt(i));
                    i++;
                }
                i++;
                tokens.add(sb.toString());
            }
        }
    }

    // Enumeration =================================================================
    public boolean hasMoreElements() {
        return index < tokens.size();
    }

    public Object nextElement() {
        return tokens.get(index++);
    }
}
If you break the lines of the CSV file up using split and then feed them one by one into the CSVEnumeration class, you can step through the fields. Or here is some code I have lying around that uses StringTokenizer to parse the lines; csv is a string that contains the entire contents of the file.
StringTokenizer lines = new StringTokenizer(csv, "\n\r");
lines.nextToken(); // skip the first (header) line
while (lines.hasMoreElements()) {
    String line = lines.nextToken();
    Enumeration e = new CSVEnumeration(line);
    for (int i = 0; e.hasMoreElements(); i++) {
        String token = (String) e.nextElement();
        switch (i) {
            case 0: /* do stuff */ break;
        }
    }
}
I suggest MySQL for its performance, and it is obviously open source.
There are two situations here:
If you just want to store the Excel cell values in the database, you can convert the Excel file to CSV format so that you can simply use the LOAD DATA command in MySQL.
If you have to do some manipulation before the values get into the tables, I suggest Apache POI. I've used it and it works just fine, whatever the format of your Excel file; you just have to use the correct implementation.
We are using SQLite in our Java application. It's serverless, really simple to use, and very efficient.
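As a minimal sketch of that setup (assuming the sqlite-jdbc driver is on the classpath; table and column names are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SqliteLoad {
    public static void main(String[] args) throws Exception {
        // Opens (or creates) a local database file; no server process needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:local.db")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS records (name TEXT, event_date TEXT)");

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO records (name, event_date) VALUES (?, ?)")) {
                ps.setString(1, "example");
                ps.setString(2, "31-12-2020"); // or null when the optional date is missing
                ps.executeUpdate();
            }
        }
    }
}

The JDBC URL is all the setup SQLite needs; there is no server to install or configure.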