I have an issue with my Hive code. I want to extract JSON data using Hive. The following is the sample JSON format:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"versionModified"{"machine":"123.dfer","founder":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
I want to get the following fields:
ver
type
vehicle
ts
founder
state
The issue is that "founder" and "state" are nested inside the "Version" array.
Can anybody help with how to handle this?
Sometimes, instead of "versionModified", a different key may appear.
e.g.:
Sometimes my data will be like:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"anotherCriteria":{"engine":"123.dfer","developer":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
Adding some sample data below:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC"{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP"{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX"{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
I need to put this data into various tables based on that key: if it is "BOX", put it into one table; if it is "GAP", put it into another, and so on.
You can use a JSON SerDe to fetch all the fields.
Just follow the steps below.
1. Download the JSON SerDe from http://www.congiu.net/hive-json-serde/1.3/
2. Add the JSON SerDe JAR:
hive> ADD jar /root/json-serde-1.3-jar-with-dependencies.jar;
Added [/root/json-serde-1.3-jar-with-dependencies.jar] to class path
Added resources: [/root/json-serde-1.3-jar-with-dependencies.jar]
3. Create the table:
CREATE TABLE json_serde_table (
  Rtype struct<ver:int, os:string, type:string, vehicle:string,
               MOD:struct<Version:array<struct<versionModified:struct<machine:string, founder:string, state:string, fashion:string, cdc:string, dof:string, ts:string>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
4. Load the JSON file into the table:
hive> load data local inpath '/root/json.txt' INTO TABLE json_serde_table;
Loading data to table default.json_serde_table
Table default.json_serde_table stats: [numFiles=1, totalSize=234]
OK
Time taken: 0.877 seconds
5. Run the query below to get the result:
hive> select Rtype.ver ver, Rtype.type type, Rtype.vehicle vehicle, Rtype.MOD.Version[0].versionModified.ts ts, Rtype.MOD.Version[0].versionModified.founder founder, Rtype.MOD.Version[0].versionModified.state state from json_serde_table;
Query ID = root_20170412170606_a674d31b-31d7-477b-b9ff-3ebd76636cf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0018, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0018/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-12 17:06:44,990 Stage-1 map = 0%, reduce = 0%
2017-04-12 17:06:53,361 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1491484583384_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 4891 HDFS Write: 50 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
1 ns Mh-3412 2000-04-01T00:00:00.171Z 3.0 Florida
Time taken: 19.745 seconds, Fetched: 1 row(s)
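Since the key inside "Version" changes from record to record ("versionModified", "anotherCriteria", "ABC", "GAP", "BOX", ...), one possible variant is to declare that level as a map instead of a fixed struct, so the key name itself becomes data you can select and filter on. This is only a sketch built from the sample rows above; it assumes the same openx SerDe maps the varying keys onto a Hive map and that string values are acceptable for the inner fields:

CREATE TABLE json_serde_map_table (
  Rtype struct<ver:int, os:string, type:string, vehicle:string,
               MOD:struct<Version:array<map<string,map<string,string>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

-- map_keys() exposes the varying key ("versionModified", "BOX", "GAP", ...)
SELECT Rtype.ver                                                          AS ver,
       Rtype.type                                                         AS type,
       Rtype.vehicle                                                      AS vehicle,
       map_keys(Rtype.MOD.Version[0])[0]                                  AS version_key,
       Rtype.MOD.Version[0][map_keys(Rtype.MOD.Version[0])[0]]['ts']      AS ts,
       Rtype.MOD.Version[0][map_keys(Rtype.MOD.Version[0])[0]]['founder'] AS founder,
       Rtype.MOD.Version[0][map_keys(Rtype.MOD.Version[0])[0]]['state']   AS state
FROM json_serde_map_table;

A WHERE on version_key (or a Hive multi-table INSERT) can then route the "BOX" rows into one table, the "GAP" rows into another, and so on.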
I wonder what the best way is to read data from a CSV file (located on S3) and then insert it into a database table.
I have deployed Apache Flink on my k8s cluster.
I have tried the DataSet API in the following way:
Source(Read csv) -> Map(Transform POJO to Row) -> Sink(JdbcOutputFormat)
It seems that the Sink (writing into the DB) is the bottleneck. The Source and Map tasks are idle ~80% of the time, while at the same time the Sink is idle for 0ms/1s, with an input rate of 1.6MB/s.
I can only speed up the whole operation of inserting the CSV content into my database by splitting the work across new replicas of task managers.
Is there any room for improving the performance of my JDBC sink?
[edit]
DataSource<Order> orders = env.readCsvFile("path/to/file")
        .pojoType(Order.class, pojoFields)
        .setParallelism(6)
        .name("Read csv");

JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
        .setQuery("INSERT INTO orders(...) values (...)")
        .setBatchInterval(10000)
        .finish();

orders.map(order -> {
        Row r = new Row(29);
        // assign values from Order pojo to Row
        return r;
    })
    .output(jdbcOutput)
    .name("Postgre SQL Output");
I have experimented with the batch interval in the range 100-50000, but it didn't affect the processing speed significantly; it's still 1.4-1.6MB/s.
If, instead of writing to the external database, I print all entries from the CSV file to stdout (print()), I get a rate of 6-7MB/s, which is why I assumed the problem is with the JDBC sink.
With this post I just wanted to make sure my code doesn't have any performance issues and that I am getting maximum performance from a single Task Manager.
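For reference, here is the same sink wiring again as a minimal self-contained sketch, with the connection details spelled out and the sink parallelism set explicitly (the driver class, URL and the value 6 below are illustrative placeholders, not my actual settings):

// Sketch only: driver class, URL and the parallelism value are placeholders.
JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername("org.postgresql.Driver")               // placeholder driver
        .setDBUrl("jdbc:postgresql://db-host:5432/mydb")      // placeholder URL
        .setQuery("INSERT INTO orders(...) values (...)")     // same elided query as above
        .setBatchInterval(10000)
        .finish();

orders.map(order -> {
        Row r = new Row(29);
        // assign values from Order pojo to Row
        return r;
    })
    .output(jdbcOutput)
    .setParallelism(6)   // sink parallelism: each parallel task opens its own JDBC connection
    .name("Postgre SQL Output");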
I am new to BigQuery. I am trying to load data into a GCP BigQuery table which I have created manually. I have one bash file which contains the bq load command:
bq load --source_format=CSV --field_delimiter=$(printf '\u0001') dataset_name.table_name gs://bucket-name/sample_file.csv
My CSV file contains multiple rows with 16 columns. A sample row is:
100563^3b9888^Buckname^https://www.settttt.ff/setlllll/buckkkkk-73d58581.html^Buckcherry^null^null^2019-12-14^23d74444^Reverb^Reading^Pennsylvania^United States^US^40.3356483^-75.9268747
Table schema -
When I execute the bash script file from Cloud Shell, I get the following error:
Waiting on bqjob_r10e3855fc60c6e88_0000016f42380943_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'project-name-staging:bqjob_r10e3855fc60c6e88_0000ug00004521':
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
Please look into the errors[] collection for more details.
Failure details:
- gs://bucket-name/sample_file.csv: Error while reading data, error message: CSV table references column position 15,
  but line starting at position:0 contains only 1 columns.
What would be the solution? Thanks in advance.
You are trying to insert wrong values into your table, per the schema you provided.
Based on the table schema and your data example, I ran this command:
./bq load --source_format=CSV --field_delimiter=$(printf '^') mydataset.testLoad /Users/tamirklein/data2.csv
1st error
Failure details:
- Error while reading data, error message: Could not parse '39b888'
as int for field Field2 (position 1) starting at location 0
At this point, I manually removed the b from 39b888, and now I get this:
2nd error
Failure details:
- Error while reading data, error message: Could not parse
'14/12/2019' as date for field Field8 (position 7) starting at
location 0
At this point, I changed 14/12/2019 to 2019-12-14, which is the BQ date format, and now everything is OK:
Upload complete.
Waiting on bqjob_r9cb3e4ef5ad596e_0000016f42abd4f6_1 ... (0s) Current status: DONE
You will need to clean your data before uploading, or load a data sample with more lines using the --max_bad_records flag (some of the lines will load and some will not, depending on your data quality).
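For example, reusing the command from the question (the limit of 10 bad records is only an illustration):

bq load --source_format=CSV --field_delimiter=$(printf '\u0001') --max_bad_records=10 dataset_name.table_name gs://bucket-name/sample_file.csv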
Note: unfortunately, there is no way to control the date format during the upload; see this answer as a reference.
We had the same problem while importing data from local files into BigQuery. After inspecting the data, we saw that there were values starting with \r or \s.
After applying ua['ColumnName'].str.strip() and ua['District'].str.rstrip(), we could add the data to BigQuery.
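As a rough sketch of that cleanup (the file name, the ^ delimiter and the column handling below are assumptions based on the question's sample row, not our exact code):

# Strip stray whitespace / control characters such as \r from every column
# before writing the CSV that gets loaded into BigQuery.
import pandas as pd

ua = pd.read_csv("sample_file.csv", sep="^", header=None, dtype=str)
ua = ua.apply(lambda col: col.str.strip())
ua.to_csv("sample_file_clean.csv", sep="^", header=False, index=False)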
Thanks
I have an Orion Context Broker connected to Cosmos via Cygnus.
It works OK: I send new elements to the Context Broker, and Cygnus sends them to Cosmos and saves them in files.
The problem I have is when I try to do some searches.
I start Hive and see that there are some tables created related to the files that Cosmos has created, so I launch some queries.
The simple one works fine:
select * from Table_name;
Hive doesn't launch any MapReduce jobs.
But when I want to filter, join, count, or get only some fields, this is what happens:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = JOB_NAME, Tracking URL = JOB_DETAILS_URL
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -kill JOB_NAME
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-08 14:35:12,723 Stage-1 map = 0%, reduce = 0%
2015-07-08 14:35:38,943 Stage-1 map = 100%, reduce = 100%
Ended Job = JOB_NAME with errors
Error during job, obtaining debugging information...
Examining task ID: TASK_NAME (and more) from job JOB_NAME
Task with the most failures(4):
-----
Task ID:
task_201409031055_6337_m_000000
URL: TASK_DETAIL_URL
-----
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
I have found that the files created by Cygnus differ from the other files, because in the case of Cygnus they have to be deserialized with a JAR.
So I am wondering whether in those cases I have to apply some MapReduce method myself, or whether there is already a general way to do this.
Before executing any Hive statement, do this:
hive> add jar /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar;
If you are using Hive through JDBC, execute it as any other statement:
Connection con = ...
Statement stmt = con.createStatement();
stmt.executeQuery("add jar /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar");
stmt.close();
stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("select ...");
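Once the JAR is added, the queries from the question that launch MapReduce jobs (count, filter, join, selecting only some fields) should run; for instance, using the table name from the question:

hive> select count(*) from Table_name;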
I have some issues importing a large set of relationships (2M records) from a CSV file.
I'm running Neo4j 2.1.7 on Mac OSX (10.9.5), 16GB RAM.
The file has the following schema:
user_id, shop_id
1,230
1,458
1,783
2,942
2,123
etc.
As mentioned above - it contains about 2M records (relationships).
Here is the query I'm running using the browser UI (I was also trying to do the same with a REST call):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
This query takes ages to run, about 800 seconds. I do have indexes on :User(id) and :Shop(id). Created them with:
CREATE INDEX ON :User(id)
CREATE INDEX ON :Shop(id)
Any ideas on how to increase the performance?
Thanks
Remove the space before shop_id
try to run:
LOAD CSV WITH HEADERS FROM "file:test.csv" AS r
return r.user_id, r.shop_id limit 10;
to see if it is loaded correctly. With your original data, r.shop_id is null because the column name is actually " shop_id" (with the leading space).
Also make sure that you didn't store the ids as numeric values in the first place; in that case you have to use toInt(r.shop_id).
Try to profile your statement in Neo4j Browser (2.2.) or in Neo4j-Shell.
Remove the PERIODIC COMMIT for that purpose and limit the rows:
PROFILE
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
WITH relation LIMIT 10000
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
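If the ids do turn out to be stored as numbers on the nodes, the import itself needs the same conversion (a variant of the query from the question, assuming numeric ids):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: toInt(relation.user_id)})
MATCH (shop:Shop {id: toInt(relation.shop_id)})
MERGE (user)-[:LIKES]->(shop)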
I want to move one table with a self-reference from PostgreSQL to Neo4j.
PostgreSQL:
COPY (SELECT * FROM "public".empbase) TO '/tmp/empbase.csv' WITH CSV header;
Result:
$ cat /tmp/empbase.csv | head
e_id,e_name,e_bossid
1,emp_no_1,
2,emp_no_2,
3,emp_no_3,
4,emp_no_4,
5,emp_no_5,3
6,emp_no_6,2
7,emp_no_7,3
8,emp_no_8,1
9,emp_no_9,4
Size:
$ du -h /tmp/empbase.csv
631M /tmp/empbase.csv
I import data to neo4j with:
neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
> CREATE (:EmpBase:_EmpBase { neo_eb_id: toInt(row.e_id),
> neo_eb_bossID: toInt(row.e_bossid),
> neo_eb_name: row.e_name});
and this works fine:
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 20505764
Properties set: 61517288
Labels added: 41011528
846284 ms
The Neo4j console says:
Location:
/home/neo4j/data/graph.db
Size:
5.54 GiB
But then I want to proceed with the relationship that each employee has a boss, so a simple emp->bossid self-reference.
Now I do it like this:
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
But this runs for 5-6 hours and breaks in the end with system failures; it freezes the system.
I think this might be terribly wrong.
1. Am I doing something wrong, or is it a bug in Neo4j?
2. Why, out of a 631 MB CSV, do I now get 5.5 GB?
EDIT1:
$ du -h /home/neo4j/data/
20K /home/neo4j/data/graph.db/index
899M /home/neo4j/data/graph.db/schema/index/lucene/1
899M /home/neo4j/data/graph.db/schema/index/lucene
899M /home/neo4j/data/graph.db/schema/index
27M /home/neo4j/data/graph.db/schema/label/lucene
27M /home/neo4j/data/graph.db/schema/label
925M /home/neo4j/data/graph.db/schema
6,5G /home/neo4j/data/graph.db
6,5G /home/neo4j/data/
SOLUTION:
Wait until :schema in the console says ONLINE, not POPULATING
Change the log size in the config file
Add USING PERIODIC COMMIT 1000 to the second CSV import
Index only on the label
Only match on one label: MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
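Putting those points together, the second import roughly becomes:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);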
Did you create the index? CREATE INDEX ON :EmpBase(neo_eb_id);
Then wait for the index to get online (:schema in the browser),
OR, if it is a unique id: CREATE CONSTRAINT ON (e:EmpBase) ASSERT e.neo_eb_id IS UNIQUE;
Otherwise your MATCH will scan all nodes in the database.
For your second question, I think it's the transaction log files;
you can limit their size in conf/neo4j.properties with
keep_logical_logs=100M size
The actual nodes and properties files shouldn't be that large. Also you don't have to store the boss-id in the database. That's actually handled by the relationship :)