Transferring CSV files into HDFS and converting them to Avro using Flume

I am new to Big Data and I have a task to transfer CSV files to HDFS using Flume, which should also convert those CSVs to Avro. I tried to do that using the following Flume configuration:
a1.channels = dataChannel
a1.sources = dataSource
a1.sinks = dataSink
a1.channels.dataChannel.type = memory
a1.channels.dataChannel.capacity = 1000000
a1.channels.dataChannel.transactionCapacity = 10000
a1.sources.dataSource.type = spooldir
a1.sources.dataSource.spoolDir = {spool_dir}
a1.sources.dataSource.fileHeader = true
a1.sources.dataSource.fileHeaderKey = file
a1.sources.dataSource.basenameHeader = true
a1.sources.dataSource.basenameHeaderKey = basename
a1.sources.dataSource.interceptors.attach-schema.type = static
a1.sources.dataSource.interceptors.attach-schema.key = flume.avro.schema.url
a1.sources.dataSource.interceptors.attach-schema.value = {path_to_schema_in_hdfs}
a1.sinks.dataSink.type = hdfs
a1.sinks.dataSink.hdfs.path = {sink_path}
a1.sinks.dataSink.hdfs.format = text
a1.sinks.dataSink.hdfs.inUsePrefix = .
a1.sinks.dataSink.hdfs.filePrefix = drone
a1.sinks.dataSink.hdfs.fileSuffix = .avro
a1.sinks.dataSink.hdfs.rollSize = 180000000
a1.sinks.dataSink.hdfs.rollCount = 100000
a1.sinks.dataSink.hdfs.rollInterval = 120
a1.sinks.dataSink.hdfs.idleTimeout = 3600
a1.sinks.dataSink.hdfs.fileType = DataStream
a1.sinks.dataSink.serializer = avro_event
The output was an Avro file with Flume's default schema. I also tried to use AvroEventSerializer, but I got a lot of different errors. I solved all of them except this one:
ERROR hdfs.HDFSEventSink: process failed
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hdfs.DFSOutputStream.computePacketChunkSize(DFSOutputStream.java:1305)
at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1243)
at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1266)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1101)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1059)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:232)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:75)
Thank you for any help.
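(A note for anyone hitting the same wall: the avro_event serializer always writes Flume's own event schema, which matches the output described above. As far as I know, the flume.avro.schema.url header is only honored by the schema-aware serializer from the HDFS sink package, wired in roughly like this — a sketch, assuming a stock Flume 1.x install:
a1.sinks.dataSink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
That serializer expects the event body to already be Avro-encoded binary rather than raw CSV, so a spooldir source still needs a conversion step in between, which is what the modified serializer below ends up doing.)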

Sorry for the mistakes in the config. I fixed them and found a way to convert CSV to Avro. I modified AvroEventSerializer a little, this way:
public void write(Event event) throws IOException {
    if (dataFileWriter == null) {
        initialize(event);
    }
    // Naive CSV parsing: split(",") will mis-handle quoted fields that contain commas.
    String[] items = new String(event.getBody()).split(",");
    city.put("deviceID", Long.parseLong(items[0]));
    city.put("groupID", Long.parseLong(items[1]));
    city.put("timeCounter", Long.parseLong(items[2]));
    city.put("cityCityName", items[3]);
    city.put("cityStateCode", items[4]);
    city.put("sessionCount", Long.parseLong(items[5]));
    city.put("errorCount", Long.parseLong(items[6]));
    dataFileWriter.append(city);
}
and here is the city definition:
private GenericRecord city = null;
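For reference, the schema file that flume.avro.schema.url points to has to match those fields. Here is a sketch inferred from the put() calls above (the record name and namespace are my assumptions):
{
  "type": "record",
  "name": "DeviceCityStats",
  "namespace": "example.flume",
  "fields": [
    {"name": "deviceID", "type": "long"},
    {"name": "groupID", "type": "long"},
    {"name": "timeCounter", "type": "long"},
    {"name": "cityCityName", "type": "string"},
    {"name": "cityStateCode", "type": "string"},
    {"name": "sessionCount", "type": "long"},
    {"name": "errorCount", "type": "long"}
  ]
}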
Please reply if you know a better way to do that.


Migrating MySQL blob (image) to FileMaker container using PowerShell

In searching, I've found a number of other people who have tried, but none who have been successful.
Here's the problem. I want to take a bunch of images I have stored on my MySQL server in blobs and move them into FileMaker containers.
The best lead I've got is the PutAs command. It looks something like PutAs('$Image', 'JPEG').
My particular application is as follows. $DataSet.Image1 is a JPEG file stored as "0xFFD8....". The data being in this format may well be the issue, but I don't know what I'd need to convert it to first.
$cmd.CommandText = "update Checklists set Image1 = PutAs('$($DataSet.Image1)', 'JPEG')"
$cmd.ExecuteNonQuery();
All I keep getting is a syntax error. I've tried the syntax many different ways, but I can't get it to go no matter what I do.
I'd very much like to see someone having success with this to post their example. Any other ideas or workarounds are welcome as well.
Edit:
Here is some extra info. Greg Lane at Skeleton Key gives this example, but I'm not sure how to translate it to PowerShell.
import java.sql.*;
import java.io.*;
def url = "jdbc:filemaker://localhost/fmserver_sample";
def driver = "com.filemaker.jdbc.Driver";
def user = "admin";
def password = "";
System.setProperty("jdbc.drivers", driver);
connection = DriverManager.getConnection (url, user, password);
filename = "/Users/Greg/Pictures/vacation/DSC_0202.jpg";
file = new File (filename);
inputstream = new FileInputStream (filename);
sql = "INSERT INTO english_nature (ID, img) VALUES (-1, PutAs(?, 'JPEG'))";
pstatement = connection.prepareStatement ( sql );
pstatement.setBinaryStream (1, inputstream, (int)file.length ());
pstatement.execute ();
//cleanup
pstatement = null;
inputstream = null;
file = null;
connection.close();
I figured it out. For anyone in the future, here is how you do it.
$cmd.CommandText = "update Checklists set Image1 = PutAs(?, 'JPEG') where serial = '$($DataSet.serial)' AND ChecklistNumber = 1"
$cmd.Parameters.Add('?', $DataSet.Image1)
$cmd.Prepare()
$cmd.ExecuteNonQuery();
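For what it's worth, this is the same trick the JDBC example above uses with setBinaryStream: the image bytes are bound as a parameter instead of being interpolated into the SQL string, which is presumably why the "0xFFD8..." hex-literal version kept tripping the syntax error.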

Connecting to an AWS MySQL Database from Scala with Slick

I just created my first AWS MySQL database and want to connect to it from my Scala application using Slick.
My config file shows:
awsMySQL = {
  profile = "slick.jdbc.MySQLProfile$"
  dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
  properties = {
    url = "jdbc:mysql://<databaseName>.cn17tbad2awy.eu-central-1.rds.amazonaws.com"
    user = "foo"
    password = "bar"
    driver = com.mysql.cj.jdbc.Driver
  }
  connectionPool = disabled
  keepAliveConnection = true
}
I just define a query to retrieve all my customers, but when executing this code I receive a SQLException: No database selected.
val db = Database.forConfig("awsMySQL")
val CustomersDAO = TableQuery[Customers]
val q1 = for (c <- CustomersDAO) yield c.name
val a = q1.result
val f = db.run(a)
Await.result(f, Duration.Inf)
I do not really understand this exception because, from my point of view, the URL specifies the database. Could you please help me?
Thanks in advance.
I think you are pointing to the host where the MySQL service is running, but not to the database itself.
Try to replace url = "jdbc:mysql://<databaseName>.cn17tbad2awy.eu-central-1.rds.amazonaws.com" with something like:
url = "jdbc:mysql://<databaseName>.cn17tbad2awy.eu-central-1.rds.amazonaws.com/DBSCHEMA"

JPA caching database results, need to "un-cache"

I'm seeing "caching" behavior with database (MySQL 5) records: I can't seem to see the new data on the application side without logging in/out or restarting the app server (GlassFish 3). This is the only place in the application where db records are "stuck." I'm guessing I'm missing something with JPA persistence.
I've attempted changing db records by hand; there's still some sort of caching mechanism in place "helping" me.
This is the editFile() method that saves new data.
After I fire this, I see the data updated in the db as expected.
this.file is the class-level property that the view uses to show file data. It shows old data. I attempt to move db data back into it after I've fired my UPDATE queries, with the filesList setter: this.setFilesList(newFiles);
When the application reads it back out, though, GlassFish seems to respond to requests for this data with old data.
public void editFile(Map<String, String> params) {
    // update file1 record
    File1 thisFile = new File1();
    thisFile.setFileId(Integer.parseInt(params.get("reload-form:fileID")));
    thisFile.setTitle(params.get("reload-form:input-small-name"));
    thisFile.setTitle_friendly(params.get("reload-form:input-small-title-friendly"));
    this.filesFacade.updateFileRecord(thisFile);
    // update files_to_categories record
    int thisFileKeywordID = Integer.parseInt(params.get("reload-form:select0"));
    this.filesToCategoriesFacade.updateFilesToCategoriesRecords(thisFile.getFileId(), thisFileKeywordID);
    this.file = this.filesFacade.findFileByID(thisFile.getFileId());
    List<File1> newFiles = (List<File1>)this.filesFacade.findAllByRange(low, high);
    this.setFilesList(newFiles);
}
Facades
My Facades are firing native SQL to update each of those DB tables. When I check the DB after they fire, the data is going in; that part is happening as I expect and hope.
File1
public int updateFileRecord(File1 file){
    String title = file.getTitle();
    String title_titleFriendly = file.getTitle_friendly();
    int fileID = file.getFileId();
    int result = 0;
    Query q = this.em.createNativeQuery("UPDATE file1 set title = ?1, title_friendly = ?2 where file_id = ?3");
    q.setParameter(1, title);
    q.setParameter(2, title_titleFriendly);
    q.setParameter(3, fileID);
    result = q.executeUpdate();
    return result;
}
FilesToCategories
public int updateFilesToCategoriesRecords(int fileId, int keywordID){
    Query q = this.em.createNativeQuery("UPDATE files_to_categories set categories = ?1 where file1 = ?2");
    q.setParameter(1, keywordID);
    q.setParameter(2, fileId);
    return q.executeUpdate();
}
How do I un-cache?
Thanks again for looking.
I don't think caching is the problem; I think it's transactions.
em.getTransaction().begin();
Query q = this.em.createNativeQuery("UPDATE file1 set title = ?1, title_friendly = ?2 where file_id = ?3");
q.setParameter(1, title);
q.setParameter(2, title_titleFriendly);
q.setParameter(3, fileID);
result = q.executeUpdate();
em.getTransaction().commit();
I recommend surrounding your writes to the DB with transactions to get them persisted. Unless you commit, requests may return results without the changes.
OK, JTA does the transaction management.
Why are you doing this when you are using JPA?
public int updateFileRecord(File1 file){
    String title = file.getTitle();
    String title_titleFriendly = file.getTitle_friendly();
    int fileID = file.getFileId();
    int result = 0;
    Query q = this.em.createNativeQuery("UPDATE file1 set title = ?1, title_friendly = ?2 where file_id = ?3");
    q.setParameter(1, title);
    q.setParameter(2, title_titleFriendly);
    q.setParameter(3, fileID);
    result = q.executeUpdate();
    return result;
}
This should work and update the internal state that JPA maintains:
public void updateFileRecord(File1 file){
    em.persist(file);
}
@daniel & @Tiny got me going on this one, thanks again guys.
I wanted to point out that I used the .merge() method from the EntityManager class.
It's important to note that for .merge() to UPDATE the record instead of INSERTing a new one, the object you're submitting to .merge() must include all properties corresponding to the fields in the database table (that your DAO knows about), or you will INSERT new database records.
public void updateFileRecord(File1 file){
    em.merge(file);
}
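An alternative that sidesteps that pitfall, as a sketch (the getters and setters on File1 are the ones used in the code above): load the managed entity by primary key and modify it, so fields you don't touch keep their database values.
public void updateFileRecord(File1 changes) {
    // Look up the managed instance by primary key.
    File1 managed = em.find(File1.class, changes.getFileId());
    // Copy over only the fields being edited.
    managed.setTitle(changes.getTitle());
    managed.setTitle_friendly(changes.getTitle_friendly());
    // No explicit save call is needed: the managed entity is
    // synchronized to the database when the transaction commits.
}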

How to read CSV file from POST?

I've been stuck for hours on this CSV problem. The following code is run after a form is posted:
fichier_en_lecture = request.FILES['fichier_csv'].read()
nom_du_fichier = request.FILES['fichier_csv'].name
importateur = request.user
traitement_du_fichier(fichier_en_lecture, nom_du_fichier, importateur)
And the "traitement_du_fichier" function goes like this :
def traitement_du_fichier(fichier_en_lecture, nom_du_fichier, importateur):
    nouveau_fichier = FichierAdhérents(importateur=importateur, fichier_csv=nom_du_fichier)
    nouveau_fichier.save()
    import csv
    lecteur = csv.reader(fichier_en_lecture, delimiter=",", quotechar='|')
    for row in lecteur:
        nouvel_adhérent = AdhérentDuFichier()
        nouvel_adhérent['fichier_adhérents'] = nouveau_fichier
        column_counter = 0
        nouvel_adhérent['fédération'] = row[column_counter]
        column_counter += 1
        nouvel_adhérent['date_première_adhésion'] = row[column_counter]
        column_counter += 1
        nouvel_adhérent['date_dernière_cotisation'] = row[column_counter]
I get the following error:
iterator should return strings, not int (did you open the file in text mode?)
I've tried to use open() but from what I understand, open() only works with a direct path to the uploaded file. However, I need to do this from memory.
In Python 3, I used:
import csv
from io import StringIO
csvf = StringIO(xls_file.read().decode())
reader = csv.reader(csvf, delimiter=',')
xls_file being the file received from the POST form. In Python 3, iterating over raw bytes yields ints, which is exactly what the "iterator should return strings, not int" error is complaining about; decoding to text and wrapping it in StringIO gives csv.reader the iterator of strings it expects.
I hope it helps.

Retrieving column mapping info in T4

I'm working on a T4 file that generates .cs classes based on an entity model, and one of the things I'm trying to get at is the mapping info in the model. Specifically, for each field in the model I'm trying to retrieve the database field name it is mapped to.
I've found that the mapping info is apparently stored in StorageMappingItemCollection, but am having an impossible time figuring out how to query it and retrieve the data I need. Has anyone worked with this class and can maybe provide guidance?
The code I have so far goes something like this (I've pasted everything up to the problematic line):
<#
System.Diagnostics.Debugger.Launch();
System.Diagnostics.Debugger.Break();
#>
<#@ template language="C#" debug="true" hostspecific="true"#>
<#@ include file="EF.Utility.CS.ttinclude"#>
<#@ output extension=".cs"#><#
CodeGenerationTools code = new CodeGenerationTools(this);
MetadataLoader loader = new MetadataLoader(this);
CodeRegion region = new CodeRegion(this, 1);
MetadataTools ef = new MetadataTools(this);
string inputFile = @"MyModel.edmx";
EdmItemCollection ItemCollection = loader.CreateEdmItemCollection(inputFile);
StoreItemCollection storeItemCollection = null;
loader.TryCreateStoreItemCollection(inputFile, out storeItemCollection);
StorageMappingItemCollection storageMappingItemCollection = null;
loader.TryCreateStorageMappingItemCollection(
inputFile, ItemCollection, storeItemCollection, out storageMappingItemCollection);
var item = storageMappingItemCollection.First();
storageMappingItemCollection has methods like GetItem() and such, but I can't for the life of me get it to return data on fields that I know exist in the model.
Thx in advance!
Parsing the MSL isn't really that hard with LINQ to XML:
string mslManifestResourceName = GetMslName(ConfigurationManager.ConnectionStrings["Your Connection String"].ConnectionString);
var stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(mslManifestResourceName);
XmlReader xreader = new XmlTextReader(stream);
XDocument doc = XDocument.Load(xreader);
XNamespace xmlns = "http://schemas.microsoft.com/ado/2009/11/mapping/cs";
var items = from entitySetMap in doc.Descendants(xmlns + "EntitySetMapping")
            let entityTypeMap = entitySetMap.Element(xmlns + "EntityTypeMapping")
            let mappingFragment = entityTypeMap.Element(xmlns + "MappingFragment")
            select new
            {
                EntitySet = entitySetMap.Attribute("Name").Value,
                TypeName = entityTypeMap.Attribute("TypeName").Value,
                TableName = mappingFragment.Attribute("StoreEntitySet").Value
            };
It may be easier to parse the EDMX file as XML rather than using the StorageMappingItemCollection.