Ruby Faker Library with JSONB in PostgreSQL db

My question is: does PostgreSQL actually store JSON data in a jsonb column with quotation marks?
The content in the column is stored as:
"{\"Verdie\":\"Barbecue Ribs\",\"Maurice\":\"Pappardelle alla Bolognese\",\"Vincent\":\"Tiramisù\"}"
I can't work out if this is a feature of PostgreSQL or of how I'm seeding my Rails database with Faker data:
# seeds.rb
require "faker"
10.times do
  con = Connector.create(
    user_id: 1,
    name: Faker::Company.name,
    description: Faker::Company.buzzword
  )
  rand(6).times do
    con.connectors_data.create(
      version: Faker::Number.number(digits: 5),
      metadata: Faker::Json.shallow_json(width: 3, options: { key: "Name.first_name", value: "Food.dish" }),
      comment: Faker::Lorem.sentence
    )
  end
end

It's due to what you're using to seed the data.
This is a classic "double encoding" issue.
When dealing with JSON columns you need to remember that the database adapter (the pg gem) will automatically serialize Ruby hashes, arrays, numbers, strings and booleans into JSON. If you feed the adapter something you have already converted into JSON, it will store it as a string, hence the escaped quotes. "JSON strings" in Ruby are not a distinct type, and the adapter has no way of knowing that you intended to store, for example, the JSON object {"foo": "bar"} and not the string "{\"foo\": \"bar\"}".
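For example, with a hypothetical model backed by a jsonb metadata column (model and column names are illustrative, not from the question), the difference looks like this:
record = ConnectorDatum.create!(metadata: { "foo" => "bar" })
record.reload.metadata #=> {"foo"=>"bar"} (stored as a JSON object)
record = ConnectorDatum.create!(metadata: { "foo" => "bar" }.to_json)
record.reload.metadata #=> "{\"foo\":\"bar\"}" (stored as a JSON string)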
This is also what commonly happens when serialize or store are used on JSON columns out of ignorance.
The result is that you get garbage data that can't be queried without unwrapping the string with something like (column_name #>> '{}')::jsonb on every row, which is extremely inefficient, or you need to fix the entire table once with:
UPDATE table_name SET column_name = (column_name #>> '{}')::jsonb;
Which can also be very costly.
While you could do:
rand(6).times do
  con.connectors_data.create(
    version: Faker::Number.number(digits: 5),
    metadata: JSON.parse(Faker::Json.shallow_json(width: 3, options: { key: "Name.first_name", value: "Food.dish" })),
    comment: Faker::Lorem.sentence
  )
end
It's very smelly, and the underlying method that Faker::Json uses to generate the hash is not public, so you might want to look around for a better alternative.
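One alternative is to skip the JSON round-trip entirely and build the hash yourself with the same Faker generators. A minimal sketch, assuming the Connector/connectors_data models from the seed above:
rand(6).times do
  con.connectors_data.create(
    version: Faker::Number.number(digits: 5),
    # Build a plain Ruby hash and let the adapter serialize it to jsonb.
    # Note: duplicate first names would collapse into fewer than 3 keys.
    metadata: Array.new(3) { [Faker::Name.first_name, Faker::Food.dish] }.to_h,
    comment: Faker::Lorem.sentence
  )
end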

Related

How to serialize/deserialize ruby hashes/structs with objects as keys to json

I would like to dump a nested data structure in Ruby to JSON (I am aware of the Marshal module, but I need a standard format) and be able to load/parse the data structure again. Catch: I use structs (or, easier for the example, hashes) as keys of hashes. Example:
require 'json'
h = {{hello: 123} => 123}
JSON.parse(JSON.generate(h)) #=> {"{:hello=>123}"=>123}
So the problem is that JSON.generate(h) serialises the key {:hello=>123} as a string, and when I parse the result again, it remains a string.
How can I solve this and regain the original structure after generate/parse?
JSON only allows strings as object keys. For this reason to_s is called for all keys.
You'll have the following options to solve your issue:
The best option is changing the data structure so it can properly be serialized to JSON.
You'll have to handle the stringified key yourself. A Hash converted to a string produces perfectly valid Ruby syntax, which can be turned back into a Hash using Kernel#eval, as Andrey Deineko suggested in the comments.
result = json.transform_keys { |key| eval(key) }
# json.transform_keys(&method(:eval)) is the same as the above.
The Hash#transform_keys method is relatively new (available since Ruby 2.5.0) and might not yet be available in your development environment. You can replace it with a simple Enumerable#map if needed.
result = json.map { |k, v| [eval(k), v] }.to_h
Note: if the incoming JSON contains any user-generated content, I highly suggest you stay away from eval, since you might allow the user to execute code on your server.
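If you can change the shape of the data instead (the first option above), one eval-free route is to serialize the hash as an array of [key, value] pairs, so keys stay structured JSON rather than stringified Ruby. A minimal sketch:
require 'json'
h = { { hello: 123 } => 123 }
# Dump as [[key, value], ...] so keys remain JSON objects, not strings.
dumped = JSON.generate(h.to_a) #=> "[[{\"hello\":123},123]]"
restored = JSON.parse(dumped, symbolize_names: true).to_h
#=> {{:hello=>123}=>123}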
I need a standard format
YAML is a standard format that would suffice here:
▶ h = {{hello: 123} => 123}
#⇒ {{:hello=>123}=>123}
▶ YAML.dump h
#⇒ "---\n? :hello: 123\n: 123\n"
▶ YAML.load _
#⇒ {{:hello=>123}=>123}
As already pointed out by mudasobwa, YAML is a good tool: it also allows you to store custom class objects:
require 'yaml'

class MyCaptain
  attr_accessor :name, :ship

  def initialize(name, ship)
    @name = name
    @ship = ship
  end
end

kirk = MyCaptain.new('James T. Kirk', 'USS Enterprise NCC-1701')
picard = MyCaptain.new('Jean-Luc Picard', 'Enterprise NCC-1701D')
captains = [kirk, picard]

File.open("my_captains.yml", "w") do |file|
  file.write captains.to_yaml
end

p YAML.load_file('my_captains.yml')
#=> [#<MyCaptain:0x007f889d0973b0 @name="James T. Kirk", @ship="USS Enterprise NCC-1701">, #<MyCaptain:0x007f889d096b40 @name="Jean-Luc Picard", @ship="Enterprise NCC-1701D">]

Rails i18n API Triple Dashes/Hyphens, Ellipses and Newlines

I am using the i18n API with an ActiveRecord backend. I seed a MySQL database with:
Translation.find_or_create_by(locale: 'en', key:'key1', value: 'value1')
However, after seeding, the data is saved in the database as:
locale: en
key: key1
value: --- value1\n...\n
All columns are varchar(255) and 'utf8_unicode_ci'.
On Rails i18n documentation, I could not find an explanation for this.
Because of that problem, I cannot use the find_or_create_by() method: it cannot match on the value column and adds duplicate entries.
Is there any solution for that?
Translation model:
Translation = I18n::Backend::ActiveRecord::Translation
if Translation.table_exists?
  I18n.backend = I18n::Backend::ActiveRecord.new
  I18n::Backend::ActiveRecord.send(:include, I18n::Backend::Memoize)
  I18n::Backend::Simple.send(:include, I18n::Backend::Memoize)
  I18n::Backend::Simple.send(:include, I18n::Backend::Pluralization)
  I18n.backend = I18n::Backend::Chain.new(I18n::Backend::Simple.new, I18n.backend)
end
What you're seeing in your value column is the value serialized to YAML (that's done by I18n::Backend::ActiveRecord::Translation), which is required, among other things, for pluralization.
#find_or_create_by doesn't work nicely when the value stored in the database needs serialization
To do a simple seed try:
Translation.create_with(value: 'value1').find_or_create_by(locale: 'en', key: 'key1')
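This works because create_with keeps value out of the lookup: find_or_create_by only matches on locale and key, and the serialized value is applied only when a new row is created. To illustrate the difference (same Translation model as above):
# Compares 'value1' against the raw column, which holds "--- value1\n...\n",
# so it never finds the existing row and inserts a duplicate:
Translation.find_or_create_by(locale: 'en', key: 'key1', value: 'value1')
# Looks up by locale/key only; value is used only on create:
Translation.create_with(value: 'value1').find_or_create_by(locale: 'en', key: 'key1')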

How to use ransack to search MySQL JSON array in Rails 5

Rails 5 now supports the native JSON data type in MySQL. If I have a column data that contains an array, e.g. ["a", "b", "c"], and I want to search whether this column contains certain values, I would basically like something like data_json_cont: ["b"]. Can this query be built using ransack?
Well, I found a way to do this for arrays (I'm not sure about JSON contains for hashes in MySQL). First, include this code in your ActiveRecord model:
self.columns.select { |column| column.type == :json }.each do |column|
  ransacker "#{column.name}_json_contains".to_sym,
            args: [:parent, :ransacker_args] do |parent, args|
    query_parts = args.map do |val|
      "JSON_CONTAINS(#{column.name}, '#{val.to_json}')"
    end
    query = query_parts.join(" * ")
    Arel.sql(query)
  end
end
Then, assuming you have a class Shirt with a JSON column size, you can do the following:
search = Shirt.ransack(
  c: [{
    a: {
      '0' => {
        name: 'size_json_contains',
        ransacker_args: ["L", "XL"]
      }
    },
    p: 'eq',
    v: [1]
  }]
)
search.result
It works as follows: it checks that the array stored in the JSON column contains all elements of the requested array, by computing each JSON_CONTAINS result separately, multiplying them all together, and comparing the product to 1 with the Arel eq predicate :) You can do the same with OR by using bitwise OR instead of multiplication.
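For reference, you can inspect the SQL that ransack builds via to_sql; with the example above it should reduce to something along these lines (a sketch; the exact quoting depends on your adapter):
puts search.result.to_sql
# Roughly (assumed output):
# SELECT `shirts`.* FROM `shirts`
# WHERE (JSON_CONTAINS(size, '"L"') * JSON_CONTAINS(size, '"XL"')) = 1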

Parse complex Json string contained in Hadoop

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed by using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach since I use the HCatalog Loader to load data from Hive.)
But the problem is the structure of my data; each value of the MAP structure contains the value part of a complex JSON document. For example:
1. My table looks like this (WARNING: the type of 'complex_data' is not STRING but MAP<STRING, STRING>!):
CREATE TABLE temp_table
(
  user_id BIGINT COMMENT 'user ID.',
  complex_data MAP<STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY (created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains the following (the value I want to get is marked with two *s, so basically #'d'#'f' from each PARSED_STRING(complex_data#'c')):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So I tried the following (same approach for Elephant Bird):
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD temp_table USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data per each row
-- e.g. ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--        {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--        {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
--      ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--        {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--        {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- IT ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value doesn't contain a full JSON document. However, when I CONCAT as in "attempt2", I generate additional \ marks (so each line starts with {\"key\":). I'm not sure whether these additional marks break the parsing rules or not. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a Jython UDF.
I know that I could solve it with Java or other languages.
But I think Jython with jyson is the simplest answer to this issue.
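For reference, a minimal sketch of what such a UDF might look like (this assumes jyson's JysonCodec API and Pig's Jython UDF support; the file name, jar path and output schema are illustrative, not taken from the original answer):
# json_udf.py -- Jython UDF sketch; requires jyson.jar on the classpath.
from com.xhaus.jyson import JysonCodec as json

@outputSchema("fs: bag{t: tuple(f: chararray)}")
def extract_f(json_string):
    # Parse the JSON array and collect each element's d.f value.
    out = []
    for item in json.loads(json_string):
        d = item.get('d')
        if d and 'f' in d:
            out.append((d['f'],))
    return out
Registered from Pig roughly as:
REGISTER '/path/to/jyson.jar';
REGISTER 'json_udf.py' USING jython AS judf;
fs = FOREACH data GENERATE judf.extract_f(complex_data#'c');
Note that the sample value above ends with ",]", which a strict JSON parser will reject, so the string may need cleanup first.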

ruby/hash:ActiveSupport::HashWithIndifferentAccess

I store a hash in MySQL, but the result confuses me:
hash:
{:unique_id=>35, :description=>nil, :title=>{"all"=>"test", "en"=>"test"}...}
and I use serialize in my model:
serialize :title
The result in MySQL looks like this:
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
all: test
en: test
Can anyone tell me what this means? Why is there a ruby/hash:ActiveSupport::HashWithIndifferentAccess in MySQL?
TL;DR:
serialize :title, Hash
What's happening here is that serialize will internally YAML-dump the class instance. And hashes in Rails are monkeypatched to allow indifferent access. The latter means that you are free to use both strings and the respective symbols as keys:
h = { 'a' => 42 }.with_indifferent_access
puts h[:a]
#⇒ 42
A Hash needs to be serialized; the default serializer is YAML, which, in its Ruby implementation, supports storing the type. Your hash is of type ActiveSupport::HashWithIndifferentAccess, so when it is fetched back, Ruby knows what object it should return (it deserializes it to a HashWithIndifferentAccess).
Notice that HashWithIndifferentAccess is a hash where you can access values by using either strings or symbols, so:
tmp = { foo: 'bar' }.with_indifferent_access
tmp[:foo] # => 'bar'
tmp['foo'] # => 'bar'
With a normal hash (Hash.new), you would get instead:
tmp = { foo: 'bar' }
tmp[:foo] # => 'bar'
tmp['foo'] # => nil
Obviously, in MySQL the hash is stored as a simple string (YAML is plain text with a convention).
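To see the mechanism in isolation, here is a minimal sketch outside of ActiveRecord (it assumes the activesupport gem; on Psych 4+ you need YAML.unsafe_load, since plain YAML.load refuses custom classes):
require 'yaml'
require 'active_support/core_ext/hash/indifferent_access'

title = { 'all' => 'test', 'en' => 'test' }.with_indifferent_access

dumped = YAML.dump(title)
# "--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nall: test\nen: test\n"

restored = YAML.unsafe_load(dumped) # YAML.load on older Psych versions
restored[:en]  # => "test"
restored['en'] # => "test"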