CSV data streaming using Kafka

I am trying to send CSV file data through a producer into a Kafka topic, and then on the consumer side I am listening for the events.
The producer is the console producer run from the command line. I am sending the CSV file using the command below:
kafka-console-producer.bat --broker-list localhost:9092 --topic freshTopic < E:\csv\sample.csv
I am successfully receiving the events on the consumer side as well.
Now I have to save that data in some database like Elasticsearch. For this I have to convert the CSV records into a DataModel. I read the tutorial below for it but could not understand how to write this in Java. Can anyone help me convert the CSV file data into a data model? Thanks in advance.
CSV data streaming using Kafka

What you wrote will work fine to get data into Kafka. There are lots of ways to get data into Elasticsearch (which is not a database) from there...
You don't need Avro, as JSON will work too, but confluent-schema-registry doesn't handle conversion from CSV, and Kafka has no "DataModel" class.
Assuming you want Avro so that the data lands in Elasticsearch as individual fields, then:
You could use the Kafka Connect spooldir source instead of the console producer, which would get you further along, and then run the Elasticsearch sink connector from there.
Use something to parse the CSV to Avro, as the link you have shows (it doesn't have to be Python; KSQL could work too).
If you are fine with JSON, then Logstash would work as well
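If you would rather keep the console producer and do the conversion yourself in plain Java, a minimal consumer-side sketch is below. The topic name matches your command; the DataModel fields, the comma split, and printing the JSON instead of indexing it are assumptions you would replace with your real CSV columns and your Elasticsearch client of choice.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CsvTopicConsumer {

    // Hypothetical model; replace the fields with whatever columns your CSV actually has.
    public static class DataModel {
        public String id;
        public String name;
        public String value;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "csv-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        ObjectMapper mapper = new ObjectMapper();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("freshTopic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record value is one CSV line from the console producer.
                    String[] cols = record.value().split(",");
                    DataModel model = new DataModel();
                    model.id = cols[0];
                    model.name = cols[1];
                    model.value = cols[2];

                    // Convert the model to JSON and hand it to whatever Elasticsearch client you use.
                    String json = mapper.writeValueAsString(model);
                    System.out.println(json);
                }
            }
        }
    }
}

From there, indexing each JSON document is a single call with whichever Elasticsearch client (REST client, bulk API, or a Kafka Connect sink) you pick.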

Related

How do we name the files that are streamed via firehose?

I'm building an architecture using boto3, and I hope to dump the data in JSON format from an API to S3. What is blocking me right now is, first, that Firehose does NOT support JSON; my workaround right now is to not compress the records, but the output is still different from a JSON file. I would still like to see a better option to make the files more compatible.
And second, the file names can't be customized. All the data I collect will eventually be queried with Athena, so can boto3 do the naming?
Answering a couple of your questions. Firstly, if you stream JSON into Firehose it will write JSON to S3. JSON is the data structure of the file and compression is the file type; compressing JSON doesn't make it something else, you'll just need to decompress it before consuming it.
Re: file naming, you shouldn't care about that. Let the system name the files whatever it likes. If you define the Athena table with the S3 location, you'll be able to query it, and when new files are added you'll be able to query them immediately.
Here is an AWS tutorial that walks you through this process: JSON stream to S3 with an Athena query.
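For the first part (streaming JSON records into Firehose), a minimal sketch with the AWS SDK for Java v2 is below; the question uses boto3, but Java is the language the rest of this thread works in and the idea is identical. The delivery stream name and the payload are placeholders, and the key detail is appending a newline to each record so Athena sees one JSON object per line.

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class FirehoseJsonWriter {
    public static void main(String[] args) {
        String json = "{\"id\": 1, \"name\": \"example\"}"; // placeholder payload

        try (FirehoseClient firehose = FirehoseClient.create()) {
            PutRecordRequest request = PutRecordRequest.builder()
                    .deliveryStreamName("my-delivery-stream") // placeholder stream name
                    .record(Record.builder()
                            // Newline-delimit records so Athena reads one JSON object per line.
                            .data(SdkBytes.fromUtf8String(json + "\n"))
                            .build())
                    .build();
            firehose.putRecord(request);
        }
    }
}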

Read Array Of Jsons From File to Spark Dataframe

I have a gzipped JSON file that contains Array of JSON, something like this:
[{"Product":{"id"1,"image":"/img.jpg"},"Color":"black"},{"Product":{"id"2,"image":"/img1.jpg"},"Color":"green"}.....]
I know this is not the ideal data format to read into scala, however there is no other alternative but to process the feed in this manner.
I have tried :
spark.read.json("file-path")
which seems to take a long time (it processes very quickly for data in the MBs, but takes far too long for GBs worth of data), probably because Spark is not able to split the file and distribute it across the other executors.
I wanted to see if there is any way to preprocess this data and load it into the Spark context as a DataFrame.
The functionality I want seems to be similar to: Create pandas dataframe from json objects. But I wanted to see if there is any Scala alternative that could do something similar and convert the data to a Spark RDD/DataFrame.
You can read the "gzip" file using spark.read().text("gzip-file-path"). Since Spark API's are built on top of HDFS API , Spark can read the gzip file and decompress it to read the files.
https://github.com/mesos/spark/blob/baa30fcd99aec83b1b704d7918be6bb78b45fbb5/core/src/main/scala/spark/SparkContext.scala#L239
However, gzip is non-splittable, so Spark creates an RDD with a single partition. Hence, reading large gzip files with Spark does not parallelize well.
You may decompress the gzip file first and read the decompressed files to get the most out of the distributed processing architecture.
This appeared to be a problem with the format of the data being handed to Spark. I had to pre-process the data into a Spark-friendly format and run the Spark jobs over that. This is the preprocessing I ended up doing: https://github.com/dipayan90/bigjsonprocessor/blob/master/src/main/java/com/kajjoy/bigjsonprocessor/Application.java
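For reference, a minimal Java sketch of that kind of preprocessing is below, assuming Jackson is on the classpath and using placeholder file names. It streams the gzipped top-level array element by element and rewrites it as line-delimited JSON, which Spark can then split across executors:

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.util.zip.GZIPInputStream;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonArrayToJsonLines {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();

        // Stream the gzipped array element by element so the whole file never sits in memory.
        try (JsonParser parser = factory.createParser(
                 new GZIPInputStream(new FileInputStream("input.json.gz")));
             BufferedWriter out = new BufferedWriter(new FileWriter("output.jsonl"))) {

            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a top-level JSON array");
            }
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                JsonNode element = mapper.readTree(parser); // reads one array element
                out.write(element.toString());              // write one JSON object per line
                out.newLine();
            }
        }
        // Afterwards spark.read().json("output.jsonl") can split the line-delimited
        // output across executors instead of reading one giant single-partition file.
    }
}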

How to reorder CSV columns using apache NIFI processor?

In my scenario, users have the option to upload a CSV file and can map the columns of that CSV file to a predefined schema. I need to reorder the columns of that CSV file based on the user's mapping and upload it to HDFS. Is there any way to achieve this with a NiFi processor?
You can accomplish this with a ConvertRecord processor. Register an Avro schema describing the expected format in a Schema Registry (controller service), and configure a CSVReader to convert the incoming data to the generic Apache NiFi internal record format. Similarly, use a CSVRecordSetWriter with your output schema to write the data back to CSV in whatever column order you like.
For more information on the record processing philosophy and some examples, see Record-oriented data with NiFi and Apache NiFi Records and Schema Registries.

Yarn parsing job logs stored in hdfs

Is there any parser I can use to parse the JSON present in YARN job logs (.jhist files) stored in HDFS, in order to extract information from them?
The second line in the .jhist file is the Avro schema for the other JSON records in the file, meaning that you can create Avro data out of the .jhist file.
For this you could use avro-tools-1.7.7.jar:
# schema is the second line
sed -n '2p;3q' file.jhist > schema.avsc
# removing the first two lines
sed '1,2d' file.jhist > pfile.jhist
# finally converting to avro data
java -jar avro-tools-1.7.7.jar fromjson pfile.jhist --schema-file schema.avsc > file.avro
You now have Avro data, which you can, for example, import into a Hive table and run queries on.
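If you would rather read the resulting Avro file programmatically instead of through Hive, a minimal Java sketch using Avro's generic reader could look like this (the file name comes from the steps above):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadJhistAvro {
    public static void main(String[] args) throws Exception {
        // "file.avro" is the output of the avro-tools fromjson step above.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("file.avro"), new GenericDatumReader<GenericRecord>())) {
            System.out.println("Schema: " + reader.getSchema());
            while (reader.hasNext()) {
                GenericRecord record = reader.next();
                System.out.println(record); // each record is one job-history event
            }
        }
    }
}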
You can also check out Rumen, a parsing tool from the Apache ecosystem.
Alternatively, in the web UI, go to the job history and look for the job whose .jhist file you want to read. Click the Counters link on the left; you will then see an API that gives you all the parameters and values (such as CPU time in milliseconds), which are read from the .jhist file itself.

Parsing JSON data and storing locally on iphone

Has anyone done this? I want to parse data from JSON (the Google location API) and store it in a SQLite database on the iPhone.
The problem is: if the parsed data is huge in amount, how do I synchronize parsing and saving the data in SQLite locally on the iPhone?
The user interface includes a table of the saved data, which should keep working without any interruption.
The solution should be in the Cocoa framework using Objective-C.
You should read some tutorials:
How do I parse JSON with Objective-C?
http://cookbooks.adobe.com/post_Store_data_in_the_HTML5_SQLite_database-19115.html
http://html5doctor.com/introducing-web-sql-databases/
Parse JSON in JavaScript?