How can I convert local ORC files to CSV?

I have an ORC file on my local machine and I need to convert it to some reasonable format (e.g. CSV, JSON, YAML, ...).
How can I convert ORC to CSV?

Use ORC-Tools:
Download the Apache ORC source and extract the files.
Go to the java folder and run Maven: mvn install
This is how I use the tools - you will likely need to adjust the paths:
java -jar ~/.m2/repository/org/apache/orc/orc-tools/1.5.4/orc-tools-1.5.4-uber.jar data ~/your_file.orc > output.json
The output is JSON Lines, which is easy to convert to CSV. First I needed to remove the last two lines from the output. Then:
import pandas as pd

# read the JSON Lines output and write it out as CSV
# (index=False avoids an extra index column)
df = pd.read_json('output.json', lines=True)
df.to_csv('output.csv', index=False)
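If you'd rather strip those two trailing lines programmatically instead of by hand, a minimal sketch (assuming, as in my case, the trailer is exactly the last two lines):
# drop the two non-JSON trailer lines that the tool appends
with open('output.json') as f:
    lines = f.readlines()
with open('output.json', 'w') as f:
    f.writelines(lines[:-2])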

Another option could be bigdata-file-viewer; it's a cross-platform application. You can open an ORC file and save it in CSV format.
The detailed usage is as follows:
Download the runnable jar from the release page or follow the Build section to build from source code.
Invoke it by java -jar BigdataFileViewer-1.2-SNAPSHOT-jar-with-dependencies.jar
Open a binary format file via "File" -> "Open". Currently, it can open files with the parquet, orc and avro suffixes. If no suffix is specified, the tool will try to parse it as a Parquet file.
Set the maximum rows of each page by "View" -> Input maximum row number -> "Go"
Set visible properties by "View" -> "Add/Remove Properties"
Convert to CSV file by "File" -> "Save as" -> "CSV"
Check schema information by unfolding "Schema Information" panel

Related

How would I save a doc/docx/docm file into a directory or S3 bucket using PySpark

I am trying to save a data frame as a document, but it fails with the error below:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html
My code is below:
#f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)
Note that I could save files in CSV, text and JSON format, but is there any way to save a docx file using PySpark?
My question here: do we have support for saving data in doc/docx format?
If not, is there any way to store the file, like writing a file stream object into a particular folder/S3 bucket?
In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. into a pandas DataFrame) and work from there.
Long answer:
A document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale and does not support the DOCX format out of the box.
If you want to write DOCX files programmatically, you can:
Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
Import a Python package to create the DOCX document and save it into a stream. See question: Writing a Python Pandas DataFrame to Word document
Upload the stream to an S3 blob using, for example, boto: Can you upload to S3 using a stream rather than a local file? (see the sketch after this list)
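A minimal sketch combining those three steps, assuming the python-docx and boto3 packages are installed; the bucket name and key are placeholders:
import io

import boto3
from docx import Document

# 1. collect the (small) Spark DataFrame onto the driver as pandas
pd_f_data = f_data.toPandas()

# 2. build the DOCX in memory as one table
doc = Document()
table = doc.add_table(rows=1, cols=len(pd_f_data.columns))
for cell, name in zip(table.rows[0].cells, pd_f_data.columns):
    cell.text = str(name)
for _, row in pd_f_data.iterrows():
    for cell, value in zip(table.add_row().cells, row):
        cell.text = str(value)

stream = io.BytesIO()
doc.save(stream)
stream.seek(0)

# 3. upload the stream to S3 without writing a local file
# ('my-bucket' and the key are placeholders - use your own)
boto3.client('s3').upload_fileobj(stream, 'my-bucket', 'reports/test.docx')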
Note: if your data has more than a hundred rows, ask the receivers how they are going to use the data. Use DOCX for reporting only, not as a file transfer format.

How to read CSV file values in a Jenkins pipeline using Groovy

I want to read values from a CSV file; inside the CSV are a URL and its credentials. I need to read the first URL and store it in a variable for a Jenkins pipeline.
def records = readCSV file: 'location of file'
def firstUrl = records[0][0]  // first column of the first row
You might need to install the Pipeline Utility Steps plugin for this (it provides the readCSV step).

Parsing YARN job logs stored in HDFS

Is there any parser which I can use to parse the JSON present in YARN job logs (.jhist files) stored in HDFS, to extract information from them?
The second line in the .jhist file is the Avro schema for the other JSON records in the file, meaning that you can create Avro data out of the .jhist file.
For this you could use avro-tools-1.7.7.jar
# schema is the second line
sed -n '2p;3q' file.jhist > schema.avsc
# removing the first two lines
sed '1,2d' file.jhist > pfile.jhist
# finally converting to avro data
java -jar avro-tools-1.7.7.jar fromjson --schema-file schema.avsc pfile.jhist > file.avro
You now have Avro data, which you can, for example, import into a Hive table and run queries on.
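If you just want to inspect the records locally rather than going through Hive, here is a minimal sketch using the fastavro package (my pick; any Avro reader works):
from fastavro import reader

# iterate over the job-history events stored in the Avro file
with open('file.avro', 'rb') as f:
    for record in reader(f):
        print(record)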
You can check out Rumen, a parsing tool from the Apache ecosystem.
Alternatively, when you visit the web UI, go to the job history and look for the job whose .jhist file you want to read. Hit the Counters link on the left; you will then see an API which gives you all the parameters and their values (like CPU time in milliseconds etc.), read from the .jhist file itself.

How to export JMeter results to JSON?

We run load tests with JMeter and would like to export result data (throughput, latency, requests per second etc.) to JSON, either a file or STDOUT. How can we do that?
JMeter can save the results in CSV format with a header.
(Do not forget to select Save Field Names - it is OFF by default.)
Then you can use this tool to convert the CSV to JSON:
http://www.convertcsv.com/csv-to-json.htm
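If you prefer doing the conversion locally rather than through a website, a minimal pandas sketch (assuming the results were saved as CSV with field names, here in results.jtl):
import pandas as pd

# read the CSV results and write one JSON object per sample
df = pd.read_csv('results.jtl')
df.to_json('results.json', orient='records', lines=True)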
EDIT
JMeter stores the results in XML or CSV format. XML is the default (with a .jtl extension), but it is always recommended to save the results in CSV format.
If you want to convert XML to JSON
http://www.utilities-online.info/xmltojson/#.U9O2ifldVBk
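Or convert it locally; a minimal Python sketch that turns each sample element of an XML .jtl into a JSON object (assuming the usual JTL layout, where each sample's metrics are stored as XML attributes):
import json
import xml.etree.ElementTree as ET

# each <httpSample>/<sample> child of <testResults> carries its metrics as attributes
root = ET.parse('results.jtl').getroot()
samples = [dict(el.attrib) for el in root]
with open('results.json', 'w') as f:
    json.dump(samples, f, indent=2)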
If you are planning to use CSV, you can have the results saved in CSV format automatically.
When running your test via the command line, to save the results as CSV for a specific test:
"%JMETER_HOME%\bin\jmeter.bat" -n -t %TESTNAME% -p %PROPERTY_FILE_PATH% -l %RESULT_FILE_PATH% -j %LOG_FILE_PATH% -Djmeter.save.saveservice.output_format=csv
Or
You can update jmeter.properties in the bin folder to enable the property below (for any test you run):
jmeter.save.saveservice.output_format=csv
Hope it is clear!
There is no OOTB solution for this, but you could take inspiration from this patch:
https://issues.apache.org/bugzilla/show_bug.cgi?id=53668

Best way to format large JSON file? (~30 mb)

I need to format a large JSON file for readability, but every resource I've found (mostly online) doesn't deal with data above, say, 1-2 MB. I need to format about 30 MB. Is there any way to do this, or any way to code something that does?
With Python >= 2.6 you can do the following:
For Mac/Linux users:
cat ugly.json | python -mjson.tool > pretty.json
For Windows users (thanks to the comment from dnk.nitro):
type ugly.json | python -mjson.tool > pretty.json
jq can format or beautify a ~100MB JSON file in a few seconds:
jq '.' myLargeUnformattedFile.json > myLargeBeautifiedFile.json
The command above will beautify a single-line ~120MB file in ~10 seconds, and jq gives you a lot of json manipulation capabilities beyond simple formatting, see their tutorials.
jsonpps is the only one that worked for me (https://github.com/bazaarvoice/jsonpps).
Unlike jq, jsonpp and the others that I tried, it doesn't load everything into RAM.
Some useful tips regarding installation and usage:
Download url: https://repo1.maven.org/maven2/com/bazaarvoice/jsonpps/jsonpps/1.1/jsonpps-1.1.jar
Shortcut (for Windows):
Create file jsonpps.cmd in the same directory with the following content:
@echo off
java -Xms64m -Xmx64m -jar %~dp0\jsonpps-1.1.jar %*
Shortcut usage examples:
Format stdin to stdout:
echo { "x": 1 } | jsonpps
Format stdin to file
echo { "x": 1 } | jsonpps -o output.json
Format file to file:
jsonpps input.json -o output.json
Background: I was trying to format a huge JSON file (~89 MB) in VS Code using the usual command (Alt+Shift+F), but it crashed. I used jq to format my file and store it in another file.
A Windows 11 use case is shown below.
Step 1 - download jq from the official site for your OS: https://stedolan.github.io/jq/
Step 2 - create a folder in the C drive named jq and paste the downloaded executable into it. Rename the file to jq (Error 1: beware, the file is by default an exe file, so do not save it as 'jq.exe'; save it only as 'jq').
Step 3 - add the folder to your PATH environment variable.
Step 4 - open cmd in the directory where the JSON file is stored and type the following command: jq . currentfilename.json > targetfilename.json
Replace currentfilename with the name of the file that you want to format,
and targetfilename with the file name that you want your formatted data in.
Within seconds you should see your target file in the same directory in a formatted version, which can now be opened in VS Code or any other editor. Any error about jq not being recognized as a command can most likely be traced back to Error 1.
You can use Notepad++ (https://notepad-plus-plus.org/downloads/) for formatting large JSON files (tested in Windows).
Install Notepad++
Go to Plugins -> Plugins Admin and install the 'JSON Viewer' plugin. The plugin source code is at https://github.com/kapilratnani/JSON-Viewer
After plugin installation, go to Plugins -> JSON Viewer -> Format JSON.
This will format your JSON file.