I have a table with 10 columns. Each row in the table is originally a JSON object that I receive in this format:
{"mainEntity":
"atlasId": 1234567
"calculatedGeography": False
"calculatedIndustry" : False
"geography": "G:6J"
"isPublic" = False
"name" = XYZ, Inc
"permId" = 12345678987
primaryRic=""
type=corporation
}
I am using JDBC and a MySQL driver. The problem is that my insert statements look very long and ugly (see the example below) because of the high number of columns. Is there a way to solve this, or is this the only way? Also, is there a way to insert multiple records at the same time with JDBC?
"INSERT INTO table_name VALUES(1234567, False, False, "G:6J", False, "XYZ, Inc", 12345678987, "", corporation"
Are you only wondering about style, or also about performance? Always use prepared statements when you make inserts; this will unclutter your code and make sure the data types are all correct.
If it is about speed, you might try transactions, or even LOAD DATA INFILE. The LOAD DATA method requires you to write a temporary CSV file that is loaded directly into the database.
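For example, a batched insert with a prepared statement could look roughly like this (the column names are taken from your JSON above; the connection URL and the Entity class stand in for your own code):

String sql = "INSERT INTO table_name "
    + "(atlasId, calculatedGeography, calculatedIndustry, geography, isPublic, name, permId, primaryRic, type) "
    + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)";
try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
     PreparedStatement ps = conn.prepareStatement(sql)) {
    conn.setAutoCommit(false);                 // one transaction for the whole batch
    for (Entity e : entities) {                // Entity/entities are placeholders for your parsed JSON objects
        ps.setLong(1, e.getAtlasId());
        ps.setBoolean(2, e.isCalculatedGeography());
        ps.setBoolean(3, e.isCalculatedIndustry());
        ps.setString(4, e.getGeography());
        ps.setBoolean(5, e.isPublic());
        ps.setString(6, e.getName());
        ps.setLong(7, e.getPermId());
        ps.setString(8, e.getPrimaryRic());
        ps.setString(9, e.getType());
        ps.addBatch();                         // queue the row instead of executing it one by one
    }
    ps.executeBatch();                         // sends all queued rows in one round trip
    conn.commit();
}

Listing the column names explicitly also means the code no longer depends on the table's column order.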
I have a data frame made up of 3 columns named INTERNAL_ID, NT_CLONOTYPE and SAMPLE_ID. I need to write a script in R that will transfer this data into the appropriate 3 columns with the exact names in a MySQL table. However, the table has more than 3 columns, say 5 (INTERNAL_ID, COUNT, NT_CLONOTYPE, AA_CLONOTYPE, and SAMPLE_ID). The MySQL table already exists and may or may not include preexisting rows of data.
I'm using the dbx and RMariaDB libraries in R. I've been able to connect to the MySQL database with dbxConnect(). When I try to run dbxUpsert(), this is what I use:
conx <- dbxConnect(adapter = "mysql", dbname = "TCR_DB", host = "127.0.0.1", user = "xxxxx", password = "xxxxxxx")
table <- "TCR"
records <- newdf #dataframe previously created with the update data.
dbxUpsert(conx, table, records, where_cols = c("INTERNAL_ID"))
dbxDisconnect(conx)
I expect to obtain an updated MySQL table with the new rows, which may or may not have null entries in the columns not contained in the data frame.
Ex.
INTERNAL_ID   COUNT   NT_CLONOTYPE   AA_CLONOTYPE   SAMPLE_ID
Pxxxxxx.01    NULL    CTTGGAACTG     NULL           PMA.01
The connection and disconnection both run fine, but instead of the expected output I obtain the following error:
Error in .local(conn, statement, ...) :
could not run statement: Field 'COUNT' doesn't have a default value
I suspect it's because the number of columns in the data frame and in the table are not the same, but I'm not sure. And if so, how can I get around this?
I figured it out. I changed the table definition so that "COUNT" defaults to NULL. This allowed the program to proceed by simply ignoring "COUNT".
My original problem is that I need to insert a lot of records into the DB, so to speed things up I want to use mysqlimport, which takes a file of row values and loads them into a specified table. So suppose I have a model Book: I can't simply use book.attributes.values, as one of the fields is a hash that is serialized to the DB (using serialize), so I need to know the format this hash will be stored in in the DB. Same for time and date fields. Any help?
How about using SQL insert statements instead of serialization?
book = Book.new(:title => 'Much Ado About Nothing', author: 'William Shakespeare')
sql = book.class.arel_table.create_insert
          .tap { |im| im.insert(book.send(
              :arel_attributes_with_values_for_create,
              book.attribute_names)) }
          .to_sql
Given billions of URLs of the following variable-length form, where the number of parameters depends on the parameter "type":
test.com/req?type=x&a=1&b=test
test.com/req?type=x&a=2&b=test2
test.com/req?type=y&a=4&b=cat&c=dog&....z=0
I would like to extract their parameters and store them in a database so that I can execute queries like "get the number of occurrences of each possible value of parameter "a" when "type" is x" as fast as possible, taking into account that:
There are 100 possible values for "type".
There will NOT be concurrent writes/reads in the DB. First I fill the DB, then I execute queries.
There will be ~10 clients querying the DB.
There is only one machine for storing the DB (no clusters/ distributed computing)
Which of the following options for the DB would be the fastest option?
1) MySQL using an EAV pattern
table 1
columns: id, type.
rows:
0 | x
1 | x
2 | y
table 2
columns: table1_id, param, value
rows:
0 | a | 1
0 | b | test
2) NoSQL (MongoDB)
Please feel free to suggest any other option.
Thanks in advance.
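For reference, the count query the clients would run against option 1 would look roughly like this from JDBC (table and column names as in the EAV sketch above; conn is an open connection):

String sql = "SELECT t2.value, COUNT(*) AS occurrences "
           + "FROM table2 t2 JOIN table1 t1 ON t1.id = t2.table1_id "
           + "WHERE t1.type = ? AND t2.param = ? "
           + "GROUP BY t2.value";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setString(1, "x");    // the "type" value
    ps.setString(2, "a");    // the parameter whose values are being counted
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("value") + " -> " + rs.getLong("occurrences"));
        }
    }
}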
I think you can try Elasticsearch. It's a very fast search engine which can be used as a document-oriented (JSON) NoSQL database. If insertion speed does not play a decisive role, it will be a good solution for your problem.
The structure of the JSON document would be {"url": "your url", "type": "type from url", "params": {"a": "val", "b": "val", ...}} or, more simply, {"url": "your url", "type": "type from url", "a": "val", "b": "val", ...}.
The size of params is not fixed, because it's schema-free.
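If you index from Java, a minimal sketch with the Elasticsearch 7.x high-level REST client could look like this (the index name "urls" and localhost:9200 are just assumptions):

RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));

Map<String, Object> doc = new HashMap<>();       // one document per URL
doc.put("url", "test.com/req?type=x&a=1&b=test");
doc.put("type", "x");
doc.put("a", "1");
doc.put("b", "test");

client.index(new IndexRequest("urls").source(doc), RequestOptions.DEFAULT);
client.close();

A terms aggregation on "a", filtered by type, would then give you the occurrence counts.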
I have written a Java program to do the following and would like opinions on my design:
1. Read data from a CSV file. The file is a database dump with 6 columns.
2. Write the data into a MySQL database table.
The database table is as follows:
CREATE TABLE MYTABLE
(
ID int PRIMARY KEY not null auto_increment,
ARTICLEID int,
ATTRIBUTE varchar(20),
VALUE text,
LANGUAGE smallint,
TYPE smallint
);
1. I created an object to store each row.
2. I used OpenCSV to read each row into a list of the objects created in 1.
3. I iterate over this list of objects and, using PreparedStatements, write each row to the database.
The solution should be highly amenable to changes in requirements and demonstrate a good approach, robustness, and code quality.
Does that design look ok?
Another method I tried was to use the 'LOAD DATA LOCAL INFILE' SQL statement. Would that be a better choice?
EDIT: I'm now using OpenCSV, and it's handling the issue of commas inside actual fields. The issue now is that nothing is being written to the DB. Can anyone tell me why?
public static void exportDataToDb(List<Object> data) {
Connection conn = connect("jdbc:mysql://localhost:3306/datadb","myuser","password");
try{
PreparedStatement preparedStatement = null;
String query = "INSERT into mytable (ID, X, Y, Z) VALUES(?,?,?,?);";
preparedStatement = conn.prepareStatement(query);
for(Object o : data){
preparedStatement.setString(1, o.getId());
preparedStatement.setString(2, o.getX());
preparedStatement.setString(3, o.getY());
preparedStatement.setString(4, o.getZ());
}
preparedStatement.executeBatch();
}catch (SQLException s){
System.out.println("SQL statement is not executed!");
}
}
From a purely algorithmic perspective, and unless your source CSV file is small, it would be better to:
1. prepare your insert statement
2. start a transaction
3. load one (or a few) line(s) from the file
4. insert the small batch into your database
5. return to 3 while there are some lines remaining
6. commit
This way, you avoid loading the entire dump into memory; a sketch of the loop follows below.
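A minimal sketch of that loop, assuming OpenCSV's CSVReader and the MYTABLE layout from the question (the batch size, file name, connection details, and CSV column order are all placeholders):

import com.opencsv.CSVReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        String sql = "INSERT INTO MYTABLE (ARTICLEID, ATTRIBUTE, VALUE, LANGUAGE, TYPE) VALUES (?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/datadb", "myuser", "password");
             PreparedStatement ps = conn.prepareStatement(sql);          // 1. prepare the insert
             CSVReader reader = new CSVReader(new FileReader("dump.csv"))) {
            conn.setAutoCommit(false);                                   // 2. start a transaction
            int batchSize = 1000;
            int count = 0;
            String[] row;
            while ((row = reader.readNext()) != null) {                  // 3. read one line at a time
                ps.setInt(1, Integer.parseInt(row[0]));                  // ARTICLEID (column order is assumed)
                ps.setString(2, row[1]);                                 // ATTRIBUTE
                ps.setString(3, row[2]);                                 // VALUE
                ps.setShort(4, Short.parseShort(row[3]));                // LANGUAGE
                ps.setShort(5, Short.parseShort(row[4]));                // TYPE
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch();                                   // 4. insert the small batch
                }
            }
            ps.executeBatch();                                           // flush the last partial batch
            conn.commit();                                               // 6. commit
        }
    }
}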
But basically, you would probably be better off using LOAD DATA.
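If you go the LOAD DATA route, it can also be issued straight through JDBC; a rough sketch (the file path is a placeholder, and Connector/J needs allowLoadLocalInfile=true for LOCAL to be permitted):

try (Connection conn = DriverManager.getConnection(
         "jdbc:mysql://localhost:3306/datadb?allowLoadLocalInfile=true", "myuser", "password");
     Statement st = conn.createStatement()) {
    st.execute("LOAD DATA LOCAL INFILE '/path/to/dump.csv' "
             + "INTO TABLE MYTABLE "
             + "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
             + "LINES TERMINATED BY '\\n' "
             + "(ARTICLEID, ATTRIBUTE, VALUE, LANGUAGE, TYPE)");   // ID is left to auto_increment
}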
If the number of rows is huge, the code will fail at step 2 with an out-of-memory error. You need to figure out a way to get the rows in chunks and perform a batch insert with a prepared statement for each chunk, continuing until all the rows are processed. This will work for any number of rows, and the batching will also improve performance. Other than that, I don't see any issue with the design.
This is my first attempt to throw data back and forth between a local MySQL database and R. That said, I have a table created in the database and want to insert data into it. Currently, it is a blank table (created with MySQL Query Browser) and has a PK set.
I am using the RODBC package (RMySQL gives me errors) and prefer to stick with this library.
How should I go about inserting the data from a data frame into this table? Is there a quick solution, or do I need to:
1. create a new temp table from my data frame,
2. insert the data, and
3. drop the temp table
with separate commands? Any help much appreciated!
See help(sqlSave) in the package documentation; the example shows
channel <- odbcConnect("test")
sqlSave(channel, USArrests, rownames = "state", addPK=TRUE)
sqlFetch(channel, "USArrests", rownames = "state") # get the lot
foo <- cbind(state=row.names(USArrests), USArrests)[1:3, c(1,3)]
foo[1,2] <- 222
sqlUpdate(channel, foo, "USArrests")
sqlFetch(channel, "USArrests", rownames = "state", max = 5)
sqlDrop(channel, "USArrests")
close(channel)
which hopefully should be enough to get you going.