Split a monster JSON file into smaller JSON files

As the title says, I have a beast of a JSON file that I would like to split into smaller files. The file is 8.7 GB; its format is described at this link: Detailed book graph. It is big enough to saturate the RAM of my PC (32 GB).
I looked for tools online and on GitHub, but nothing worked. Does anyone have an idea how I can do this?
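One way to do this without loading the whole file into RAM is to stream-parse it. Below is a minimal Python sketch using the ijson library, assuming the file is a single top-level JSON array of book records; the file names and the 100,000-record chunk size are placeholders. (If the download is actually newline-delimited JSON, one object per line, you can skip the parser and simply read it line by line.)
# Sketch: split one huge JSON array into many smaller JSON files.
# Assumes `pip install ijson`; "books.json" and CHUNK_SIZE are placeholders.
import json
import ijson

CHUNK_SIZE = 100_000  # records per output file

def flush(chunk, part):
    with open(f"books_part_{part:04d}.json", "w") as out:
        # ijson can yield Decimal for floats; default=str keeps json.dump happy
        json.dump(chunk, out, default=str)

with open("books.json", "rb") as src:
    chunk, part = [], 0
    for record in ijson.items(src, "item"):   # "item" = each element of the top-level array
        chunk.append(record)
        if len(chunk) >= CHUNK_SIZE:
            flush(chunk, part)
            chunk, part = [], part + 1
    if chunk:                                  # final partial chunk
        flush(chunk, part)
Because ijson walks the stream incrementally, memory use stays roughly constant no matter how large the input is.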

Related

Why do .pdn files for paint.net contain a bunch of gibberish?

I was using paint.net (an image editing program) and decided to open a .pdn file as raw text because I was curious. What I saw was a bunch of gibberish! Why is the data stored like this?
It is most likely stored as binary, which won't make sense when viewed as text by a human. However, it makes the file quick and easy for the program to read, and it usually reduces the amount of space the file takes up. Most programs store their data like this.
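A tiny illustration of the difference, using a made-up record layout in Python (this is not paint.net's actual .pdn format):
# Illustrative only: a made-up fixed-width record, not the real .pdn layout.
# Packed binary is compact and trivially fast to decode, but looks like
# gibberish if you open the bytes in a text editor.
import struct

width, height, opacity = 1920, 1080, 0.75
record = struct.pack("<IIf", width, height, opacity)   # 12 bytes, fixed layout

print(record)                                # raw bytes: unreadable as text
w, h, o = struct.unpack("<IIf", record)      # decoded instantly, no text parsing
print(w, h, o)
A text version of the same values would be larger and would have to be parsed character by character.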

How do I train tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts.
The docs only explain the approach with fonts, not with images.
I know how this works in earlier versions of Tesseract, but I don't understand how to use the box/tiff files for LSTM training in Tesseract 4.
I looked into tesstrain.sh, which is used to generate LSTM training data but couldn't find anything helpful. Any ideas?
Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training: it takes hundreds of thousands of training samples to reach good accuracy from scratch, so starting from a good base model lets you fine-tune with much less data (~tens to hundreds of samples can be enough).
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but have different extensions. For example, you would have an image file named 001.png that is a picture of the text foobar, and a text file named 001.gt.txt containing the text foobar.
These files need to be single lines of text.
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file, tesstrain/data/my-custom-model.traineddata (named after MODEL_NAME). Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -
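If you want to use the custom model from Python rather than the command line, the same idea works through pytesseract (assuming pytesseract and Pillow are installed and my-custom-model.traineddata is in the tessdata directory; foo.png is just the example image from above):
# Python equivalent of: tesseract -l my-custom-model foo.png -
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("foo.png"), lang="my-custom-model")
print(text)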

.json to .csv "big" file

I recently downloaded my location history from Google, covering 2014 to the present.
The resulting .json file was 997,000 lines, plus a few.
All of the online converters would freeze and lock up unless I fed them really small slices, which isn't an option (time constraints).
I've gotten a manual process down between Sublime Text and Libre Office to get my information transferred, but I know there's an easier way somewhere.
I even tried the fastFedora plug-in, which I couldn't get to work.
Even though I'm halfway done, and will likely finish up using my process, is there an easier way?
I can play with Java, though I'm no pro. Are there any other languages that play well with .json?
I need a solution that supports nesting without flattening the file: location data is nested and needs to remain nested (or at least grouped) to make sense.
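For what it's worth, here is one hedged Python sketch for this kind of conversion: it streams the Takeout file with ijson and writes one CSV row per location, keeping the nested activity data as a JSON string in its own column instead of flattening it. The field names (locations, timestampMs, latitudeE7, longitudeE7, activity) follow the older Takeout layout and may differ in your export.
# Sketch: Google location history JSON -> CSV, preserving nested data as JSON.
# Assumes `pip install ijson`; adjust field names to match your export.
import csv
import json
import ijson

with open("LocationHistory.json", "rb") as src, \
     open("locations.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["timestampMs", "latitude", "longitude", "accuracy", "activity_json"])
    for loc in ijson.items(src, "locations.item"):   # streams, so RAM use stays small
        writer.writerow([
            loc.get("timestampMs"),
            int(loc["latitudeE7"]) / 1e7 if "latitudeE7" in loc else None,
            int(loc["longitudeE7"]) / 1e7 if "longitudeE7" in loc else None,
            loc.get("accuracy"),
            json.dumps(loc.get("activity", []), default=str),   # nested data kept intact
        ])
Keeping the nested parts as JSON strings means each row stays self-contained and can be re-expanded later, rather than exploding into dozens of flat columns.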

Massive data file conversion

I have a .data file that is over 19 MB in size. If I open it in a text editor on my Mac I get the circle of death. I've tried moving it into a JSON file in Atom, but that breaks too. It's all on one line, so trying to beautify it to break the file into manageable chunks also breaks things. I cut out about 150 lines' worth, beautified it, and it came out great, but that's only kilobytes of data, and I've got a 19 MB+ file.
I've tried multiple ways of cutting and pasting to make smaller files, but that also breaks everything. Can I get this into SQL?
I'm working on my personal project and this data file is critical to my project. Thanks!
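It can go into SQL without ever opening the file in an editor. Below is a rough Python sketch that assumes the .data file is really one big JSON array on a single line (which the beautify attempts suggest); it streams the file with ijson and stores each object as a row in SQLite, keeping the full JSON in a text column so nothing is lost. The file names and table layout are placeholders to adapt once the real structure is known.
# Sketch: stream a single-line JSON array into SQLite one object at a time.
# Assumes `pip install ijson`; "big.data" and the schema are placeholders.
import json
import sqlite3
import ijson

conn = sqlite3.connect("project.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, body TEXT)")

with open("big.data", "rb") as src:
    rows = ((json.dumps(obj, default=str),) for obj in ijson.items(src, "item"))
    conn.executemany("INSERT INTO records (body) VALUES (?)", rows)

conn.commit()
conn.close()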

Reverse engineering a custom data file

At my place of work we have a legacy document management system that for various reasons is now unsupported by the developers. I have been asked to look into extracting the documents contained in this system to eventually be imported into a new 3rd party system.
From tracing and process monitoring I have determined that the document images (mainly tiff files) are stored in a number of 1.5GB files. These files seem to be read from a specific offset and then written to a tmp file that is then served via a web app to the client, and then deleted.
I guess I am looking for suggestions as to how I can inspect these large files that contain the tiff images, and eventually extract and write them to individual files.
Are the TIFFs compressed in some way? If not, then your job may be pretty easy: stitch the TIFFs together from the 1.5G files.
Can you see the output of a particular 1.5G file (or series of them)? If so, then you should be able to piece together what the bytes should look like for that TIFF if it were uncompressed.
If the bytes don't appear to be there, then try some standard compressions (zip, tar, etc.) to see if you get a match.
I'd open a file, seek to the required offset, and then stream that data into a TIFF object (ideally one that supports streaming from memory or from a file). Then you've got it. Poke around at some of the other bytes too, as there is likely metadata about each document that may be useful to the next system.
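As a concrete starting point for the "stitch the TIFFs together" approach, here is a rough carving sketch in Python. It scans one of the 1.5 GB container files for the TIFF magic numbers (II*\0 little-endian, MM\0* big-endian) and dumps the bytes between consecutive hits as candidate .tif files. It assumes the images are stored raw and back-to-back; if the container wraps them in its own headers or compression, the offsets you found through process tracing are the more reliable route, and any extracted files should be spot-checked in an image viewer.
# Rough carving sketch: find TIFF signatures in a container file and write the
# byte ranges between consecutive hits as candidate .tif files. Verify the output.
import re

TIFF_MAGIC = re.compile(rb"II\x2a\x00|MM\x00\x2a")   # little- and big-endian TIFF headers

with open("container_001.dat", "rb") as f:
    data = f.read()          # 1.5 GB fits in RAM on most machines; use mmap if not

offsets = [m.start() for m in TIFF_MAGIC.finditer(data)]
offsets.append(len(data))    # sentinel so the final image is written too

for i, (start, end) in enumerate(zip(offsets, offsets[1:])):
    with open(f"carved_{i:05d}.tif", "wb") as out:
        out.write(data[start:end])

print(f"wrote {len(offsets) - 1} candidate TIFF files")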