I would like to store the events data in Parquet format (e.g., on HDFS). Do I need to modify the code of the corresponding sinks, or is there a way around it, e.g., using a Flume interceptor? Thanks.
On the one hand, there was an issue regarding Cygnus about modifying the code with the goal of supporting multiple output formats when writing to HDFS. The modification was done, but only support for our custom JSON and CSV formats was coded. This means the code is ready to be modified in order to add a third format. I've added a new issue regarding specific Parquet support in OrionHDFSSink; if you finally decide to do the modification, I can assign you the issue :)
On the other hand, you can always use the native HDFS sink (which persists the whole notified body) and, effectively, program a custom interceptor.
As you can see, in both cases you will have to code the Parquet part (or wait until we have room for implementing it).
Related
I have millions of documents in different collections in my database. I need to export them to a csv onto my local storage when I specify the collection name.
I tried mlcp export, but it didn't work. We cannot use CoRB for this because of some issues.
I want the CSV to be in such a format that if I run an mlcp import, I can restore all the docs just the way they were.
My first thought would be to use the MLCP archive feature, and not to export to CSV at all.
If you really want CSV, CoRB2 would be my first thought. It provides CSV export functionality out of the box, so it might be worth digging into why that didn't work for you.
DMSDK might work too, but it involves writing code that handles the CSV output, which sounds cumbersome to me.
The last option that comes to mind would be Apache NiFi, for which there are various MarkLogic processors. It allows very generic orchestration of data flows, though it could be overkill for your purpose.
HTH!
ml-gradle has support for exporting documents and referencing a transform, which can convert each document to CSV - https://github.com/marklogic-community/ml-gradle/wiki/Exporting-data#exporting-data-to-csv .
Unless all of your documents are flat, you will likely need some custom code to determine how to map a hierarchical document into a flat row. So a REST transform is a reasonable solution there.
You can also use a TDE template to project your documents into rows, and the /v1/rows endpoint can return results as CSV. That of course requires creating and loading a TDE template, and then waiting for the matching documents to be re-indexed.
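For reference, a TDE template is itself a JSON (or XML) document that projects matching documents into rows. A minimal sketch, assuming flat documents with hypothetical `id` and `name` fields (the context path, schema name, view name, and columns would all need to match your actual data), might look like:

```json
{
  "template": {
    "context": "/",
    "rows": [
      {
        "schemaName": "export",
        "viewName": "documents",
        "columns": [
          { "name": "id",   "scalarType": "string", "val": "id" },
          { "name": "name", "scalarType": "string", "val": "name" }
        ]
      }
    ]
  }
}
```

Once the template is loaded and the documents are re-indexed, a query against the `export.documents` view via `/v1/rows` with an `Accept: text/csv` header would return the rows as CSV.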
I have a JSON file with a large array of JSON objects. I am using JsonTextReader on a StreamReader to read data from the files. But some attributes need to be updated as well.
Is it possible to use JsonTextWriter to find and update a particular JSON object?
Generally, to modify a file means reading the whole file to memory, making the change, then writing the whole thing back out to the file. (There are certain file formats that don't require this by virtue of having a static-size layout or other mechanisms designed to work around having to read in the whole file but JSON isn't one of those.)
JSON.net is capable of reading and writing JSON streams as a series of tokens, so it should be possible to minimize the memory footprint by using this. However you will still be reading the entire file into memory and then writing it back out. Because of the simultaneous read/write, you'd need to write to a temp file instead and then, once you're done, move/rename that temp file to the correct place.
Depending on how you've structured the JSON, you may also need to keep track of where you are in that structure. This can be done by tracking the tokens as they're received and using them to maintain a kind of "path" into the structure. That path can be used to determine when you're at a place that needs updating.
The general strategy is to read in tokens, alter them if required, then write them out again.
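As a minimal sketch of the temp-file-then-rename pattern described above (here in Python with the standard json module rather than Json.NET, and loading the whole array into memory rather than streaming tokens; the token-by-token version follows the same write-to-temp-then-replace shape):

```python
import json
import os
import tempfile


def update_json_file(path, update_fn):
    """Read a JSON file holding a top-level array, apply update_fn to each
    object, write the result to a temp file, then atomically replace the
    original so a crash mid-write never corrupts the file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # the whole file is read into memory
    for obj in data:
        update_fn(obj)       # alter objects in place as needed
    # Write to a temp file in the same directory so the final rename
    # stays on one filesystem and is atomic.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp_path, path)  # move the temp file over the original
    except Exception:
        os.remove(tmp_path)
        raise
```

With Json.NET you would instead read tokens from a JsonTextReader and echo them to a JsonTextWriter over the temp file, altering values as they pass through, which keeps the memory footprint small for very large arrays.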
I have run into a somewhat perplexing issue that has plagued me for several months now. I am trying to create an Avro schema (basically, as I understand it, a schema-enforced format for serializing arbitrary data) to convert some complex JSON files (arbitrary and nested) eventually to Parquet in a pipeline.
I am wondering: is there a reasonable way to get the superset of field names I need for this use case while staying in Apache Spark instead of Hadoop MR?
I think Apache Arrow, which is under development, might eventually help avoid this by treating JSON as a first-class citizen, but it is still a ways off.
Any guidance would be sincerely appreciated!
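For illustration, the "superset of field names" across nested JSON documents can be sketched in plain Python (a hypothetical helper, not Spark code; within Spark itself, `spark.read.json` already merges the schemas it infers across records):

```python
import json


def field_paths(obj, prefix=""):
    """Recursively collect dotted field paths from a nested JSON value.

    Lists are transparent: their elements contribute paths under the
    same prefix, mirroring how a record schema treats array elements.
    """
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(value, path)
    elif isinstance(obj, list):
        for item in obj:
            paths |= field_paths(item, prefix)
    return paths


def superset_of_fields(json_docs):
    """Union of field paths over a collection of JSON documents."""
    result = set()
    for doc in json_docs:
        result |= field_paths(json.loads(doc))
    return result
```

The resulting set of paths is the kind of superset an Avro record schema (with everything optional) would need to cover before converting to Parquet.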
One of the resources my app uses is data in two JSON files that are pulled from a third party and that are constantly updated with fresh content.
Each of these files have a specific structure that doesn't change.
However, sometimes the third party creates structural changes that may mess with my app.
My question is: how can I monitor their structure so I can detect changes as they occur?
Thanks!
For this you can use JSON Schema and validate the files against it. If the third party provides a schema, you're good; you just need to validate each download. If not, you can generate a schema from a known-good JSON file; there are online generators for that.
I am designing a system with 30,000 objects or so and can't decide between the two: either have a JSON file pre-computed for each one and get the data by pointing to the URL of the file (I think Twitter does something similar), or have a PHP/Perl/whatever else script that will produce the JSON object on the fly when requested, from, let's say, a database, and send it back. Is one approach better suited than the other? I guess if it takes a long time to generate the JSON data, it is better to have the JSON files already done. What if generating is as quick as accessing a database? Although I suppose one could have a dedicated table in the database specifically for that. The data doesn't change very often, so updating is not a constant thing. In that respect the data is static for all intents and purposes.
Anyways, any thought would be much appreciated!
Alex
You might want to try MongoDB, which retrieves objects as JSON and is highly scalable and easy to set up.