Add field of similar data with percentage in database - mysql

I am working on a scraping task where I have to collect product titles and prices from two websites. I have a large dataset after scraping, and this is the structure of the table:
[
{
"id": 1,
"title": "ProductA",
"price": 10,
"matches": []
},
{
"id": 2,
"title": "ProductB",
"price": 20,
"matches": []
},
{
"id": 3,
"title": "Another One",
"price": 30,
"matches": []
},
]
I am using MongoDB as the database right now. I have to run some script that will find the matching products with a similarity score and store them in a field within the large dataset. Example:
[
{
"id": 1,
"title": "ProductA",
"price": 10,
"matches": [{score: 0.75, productId: 2}]
},
{
"id": 2,
"title": "ProductB",
"price": 20,
"matches": [{score: 0.75, productId: 1}]
},
{
"id": 3,
"title": "Another One",
"price": 30,
"matches": []
},
]
I tried LIKE in SQL and $text in MongoDB, but both only work when finding similarity against a specific text.
Is there any built-in operation in a database that can go through all the documents, find similarities by title, generate a percentage of how closely they match, and then add it to the matches field?
NOTE: MongoDB is not mandatory; any database may be suggested.
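For illustration of the matching being asked for — not a built-in database operation — here is a rough sketch that computes pairwise title similarity in application code with Python's difflib and writes it back with pymongo; the connection string, database/collection names, and the 0.5 threshold are assumptions:

# Purely illustrative sketch (not a built-in DB feature): compare every pair of
# titles with difflib and store mutual matches above a threshold.
from difflib import SequenceMatcher
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
products = client["shop"]["products"]               # assumed db/collection names

docs = list(products.find({}, {"id": 1, "title": 1}))

for doc in docs:
    matches = []
    for other in docs:
        if other["id"] == doc["id"]:
            continue
        score = SequenceMatcher(None, doc["title"].lower(), other["title"].lower()).ratio()
        if score >= 0.5:                             # arbitrary example threshold
            matches.append({"score": round(score, 2), "productId": other["id"]})
    products.update_one({"id": doc["id"]}, {"$set": {"matches": matches}})

This is O(n²) over the collection and only meant to illustrate the desired output shape, not to perform well on a large dataset.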

Related

pyspark json nested generate field name

Suppose I have the following JSON:
{
"filename": "orderDetails",
"datasets": [
{
"orderId": "ord1001",
"customerId": "cust5001",
"orderDate": "2021-12-24 00.00.00.000",
"shipmentDetails": {
"street": "M.G.Road",
"city": "Delhi",
"state": "New Delhi",
"postalCode": "110040",
"country": "India"
},
"orderDetails": [
{
"productId": "prd9001",
"quantity": 2,
"sequence": 1,
"totalPrice": {
"gross": 550,
"net": 500,
"tax": 50
}
},
{
"productId": "prd9002",
"quantity": 3,
"sequence": 2,
"totalPrice": {
"gross": 300,
"net": 240,
"tax": 60
}
}
]
}
]
}
I would like to read the JSON into a Spark dataframe with column names like
filename, filename_datasets, filename_datasets_orderID, ...
filename_orderDetails_productID, filename_orderDetails_quantity, ...
What's a good way to do that? Can I first generate my custom field names from the JSON schema itself?
Firstly, there is no straightforward, easy way to do it.
The guidelines for what you are requesting are as follows (a rough sketch is shown after the list):
1. Create a struct with all of your relevant fields.
2. Cast your datasets field to the relevant schema using the definition from point 1.
3. Explode it using the explode method from pyspark.sql.functions.
4. Explode the relevant orderDetails field again.
5. Select the requested columns using a regular select statement.
Done.
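A minimal, untested sketch of those steps, assuming the sample above is stored in a file called orderDetails.json; it leans on Spark's schema inference instead of the explicit struct/cast from points 1-2, and the prefixed column names are the ones the question asks for:

# Hypothetical sketch: read the JSON, explode the nested arrays, and
# flatten the leaf columns under prefixed names.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.getOrCreate()

# multiLine=True lets Spark parse the pretty-printed JSON document
df = spark.read.option("multiLine", True).json("orderDetails.json")

# explode the datasets array: one row per order
orders = df.select("filename", explode("datasets").alias("dataset"))

# explode the nested orderDetails array: one row per order line
lines = orders.select(
    "filename",
    col("dataset.orderId").alias("filename_datasets_orderId"),
    col("dataset.customerId").alias("filename_datasets_customerId"),
    explode("dataset.orderDetails").alias("detail"),
)

# select and rename the leaf columns
flat = lines.select(
    "filename",
    "filename_datasets_orderId",
    "filename_datasets_customerId",
    col("detail.productId").alias("filename_orderDetails_productId"),
    col("detail.quantity").alias("filename_orderDetails_quantity"),
    col("detail.totalPrice.gross").alias("filename_orderDetails_totalPrice_gross"),
)
flat.show(truncate=False)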

How to group by? Django Rest Framework

I am developing a REST API based on Django. I have two models:
Album
Track
I am trying to get the right format for this JSON (this is what I am getting now):
[
{
"album": "album-123kj23mmmasd",
"track": {
"id": 6,
"uuid": "2c553219-9833-43e4-9fd1-44f536",
"name": "song name 1",
},
"duration": 2
},
{
"album": "album-123kj23mmmasd",
"track": {
"id": 7,
"uuid": "9e5ef1da-9833-43e4-9fd1-415547a",
"name": "song name 5",
},
"duration": 4
},
]
This is what I would like to get; I would like to group by album:
[
{
"album": "album-123kj23mmmasd",
"tracks": [{
"id": 6,
"uuid": "2c553219-9833-43e4-9fd1-44f536",
"name": "song name 1",
"duration": 2
},
{
"id": 7,
"uuid": "9e5ef1da-9833-43e4-9fd1-415547a",
"name": "song name 5",
"duration": 4
},
]
},
]
EDIT 1: I am using a ForeignKey instead of ManyToMany:
class Track(models.Model):
    ...  # name, creation_date, etc...

class Album(models.Model):
    track = models.ForeignKey(Track, on_delete=models.CASCADE)
Thanks in advance
SOLUTION:
Due to the model complexity, I decided to use BaseSerializer, which allows creating a custom response.
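A minimal sketch of that BaseSerializer approach, assuming the row/field names implied by the JSON above (an album string, a track ForeignKey, and a duration on each row); this is illustrative, not the exact code used:

# Hypothetical sketch: build the grouped response by hand in to_representation.
# Field names (album, track, duration) are assumptions based on the JSON above.
from rest_framework import serializers


class AlbumGroupSerializer(serializers.BaseSerializer):
    def to_representation(self, rows):
        grouped = {}
        for row in rows:
            grouped.setdefault(row.album, []).append({
                "id": row.track.id,
                "uuid": str(row.track.uuid),
                "name": row.track.name,
                "duration": row.duration,
            })
        return [{"album": album, "tracks": tracks} for album, tracks in grouped.items()]

In the view, something like AlbumGroupSerializer(Album.objects.select_related("track")).data would then produce the grouped list.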

What would be the best way to format JSON data consumed by a SPA?

I'm working with a friend on a single page application (in React, but I believe that the framework doesn't really matter, the same question applies to Angular as well).
There is a database with 2 tables:
Feature
Car
Both tables are connected in the database with many-to-many relation.
We differ on how we should pass the data from the backend to the frontend (more precisely, to a CarManagementComponent that will let the user work on cars/features (edit/update/delete etc.)). We want the ability to perform several actions before actually sending a request back to the backend to update the database, so that the user has a desktop-application-like experience.
Please keep in mind that there are more tables in the database, but for the example's simplicity we're only talking about 2 of them here.
1) My approach:
{
"Features": [
{
"Id": 0,
"Price": 3000,
"Name": "led lights",
"Color": "transparent",
"Brand": "Valeo",
"Guarantee": 12
},
{
"Id": 1,
"Price": 1000,
"Name": "air conditioning",
"Color": "",
"Brand": "Bosch",
"Guarantee": 12
},
{
"Id": 2,
"Price": 600,
"Name": "tinted windows",
"Color": "",
"Brand": "Bosch",
"Guarantee": 36
}
],
"Cars": [
{
"Id": 0,
"Name": "Ford Mustang GT",
"Weight": 2210,
"Features":[
{
"Id": 0, // id of many-to-many relations record
"FeatureId": 2
},
{
"Id": 1, // id of many-to-many relations record
"FeatureId": 1
}
]
},
{
"Id": 1,
"Name": "Volkswagen Arteon",
"Weight": 1650,
"Features":[
{
"Id": 2, // id of many-to-many relations record
"FeatureId": 2
}
]
}
]
}
2) My friend's approach:
{
"Cars": [
{
"Id": 0,
"Name": "Ford Mustang GT",
"Weight": 2210,
"Features": [
{
"Id": 1,
"Price": 1000,
"Name": "air conditioning",
"Color": "",
"Brand": "Bosch",
"Guarantee": 12
},
{
"Id": 2,
"Price": 600,
"Name": "tinted windows",
"Color": "",
"Brand": "Bosch",
"Guarantee": 36
}
]
},
{
"Id": 1,
"Name": "Volkswagen Arteon",
"Weight": 1650,
"Features": [
{
"Id": 2,
"Price": 600,
"Name": "tinted windows",
"Color": "",
"Brand": "Bosch",
"Guarantee": 36
}
]
}
]
}
I believe that the 1st approach is better because:
it weighs less (no data redundancy)
it would be easier to convert such data into an object-oriented structure
e.g. we are able to see all Feature records (in the 2nd approach, we'd only see the records connected with Cars, and another backend request would be needed)
e.g. unlike the 2nd approach, we're able to obtain all the needed data in just 1 request (fewer problems with synchronization), and we could save modified data in a single request as well
My friend says the 2nd approach is better because:
it'd be easier to achieve using an ORM (Hibernate)
he's never seen the 1st approach in his life (which could lead to the conclusion that it's being done the wrong way)
What do you think? Which solution is better? Maybe both of them, in some areas? Maybe there's a 3rd solution we didn't think of yet?
I would say the approach I like most is yours, for 2 main reasons:
Keeping in mind that data duplication is bad in an HTTP payload, your approach avoids it.
You keep the FeatureId inside the car object, and that is enough to find the feature with O(N) performance.
To make it even better, you could change your Features structure to this:
"Features": {
0: { // <- If the id is unique, you can use it as a key.
"Id": 0,
"Price": 3000,
"Name": "led lights",
"Color": "transparent",
"Brand": "Valeo",
"Guarantee": 12
},
1: {
"Id": 1,
"Price": 1000,
"Name": "air conditioning",
"Color": "",
"Brand": "Bosch",
"Guarantee": 12
},
2: {
"Id": 2,
"Price": 600,
"Name": "tinted windows",
"Color": "",
"Brand": "Bosch",
"Guarantee": 36
}
},
This way, you can get the Feature in O(1).

Redshift Copy Command Error "Overflow, Column type: Integer"

I am using the COPY command of the Redshift database to load a JSON file from an S3 bucket into the database, but I am getting the error "Overflow, Column type: Integer" with error code 1216, and the line number in the JSON file is 33.
Here is my json file:
{
"id": 119548805147,
"title": "Shoes",
"vendor": "xyz",
"product_type": "",
"handle": "shoes",
"options": [
{
"id": 171716739099,
"product_id": 119548805147,
"name": "Size",
"position": 1,
"values": [
"9",
"10",
"11"
]
},
{
"id": 171716771867,
"product_id": 119548805147,
"name": "Color",
"position": 2,
"values": [
"Red",
"white",
"Black"
]
}
],
"images": [],
"image": null
} //line number 33
{
"id": 119548805147,
"title": "Shoes",
"vendor": "xyz",
"product_type": "",
"handle": "shoes",
"options": [
{
"id": 171716739099,
"product_id": 119548805147,
"name": "Size",
"position": 1,
"values": [
"9",
"10",
"11"
]
},
{
"id": 171716771867,
"product_id": 119548805147,
"name": "Color",
"position": 2,
"values": [
"Red",
"white",
"Black"
]
}
],
"images": [],
"image": null
}
My table in Redshift is as below:
CREATE TABLE products (
"_id" int4 DEFAULT "identity"(297224, 0, '1,1'::text),
"id" int4,
title varchar(50),
product_type varchar(200),
vendor varchar(200),
handle varchar(200),
variants_id int4,
"options" varchar(65535),
images varchar(65535),
image varchar(65535)
);
And my Copy command in Redshift is here:
copy products
from 's3://kloudio-data-files'
access_key_id 'my access key'
secret_access_key 'my secret key'
json 'auto'
I think there is a mismatch between a column type and the JSON file's data types, but I am not seeing it.
The error suggests that the value you're trying to load is bigger than the type can hold, and I can see from your data sample that id takes the value 171716771867, which is greater than the maximum value an INTEGER can hold.
Integers are 4 bytes long in Redshift, so they can hold (2^8)^4 = 2^32 = 4,294,967,296 distinct values, which gives the range [-2,147,483,648, 2,147,483,647]; you can also read this off the table in the official documentation.
The solution is to use a different type for your data: use BIGINT if you want the id to stay numeric, or use a text field. Note: I only scanned your sample input for one overflow error; it may be necessary to correct the types of other fields as well.
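As a quick sanity check before reloading, a rough Python sketch like the one below (purely illustrative; it assumes one parsed product document per file, e.g. product.json) can list every value that would overflow INT4:

# Rough sketch (not Redshift-specific): walk one parsed product document and
# flag any integer that will not fit in Redshift's 4-byte INT column.
import json

INT4_MIN, INT4_MAX = -2_147_483_648, 2_147_483_647

def find_overflows(node, path="$"):
    """Yield (json_path, value) for every integer outside the INT4 range."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_overflows(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_overflows(value, f"{path}[{index}]")
    elif isinstance(node, bool):
        return  # bool is a subclass of int; skip it
    elif isinstance(node, int) and not INT4_MIN <= node <= INT4_MAX:
        yield path, node

with open("product.json") as fh:          # assumed: one product document per file
    product = json.load(fh)
for json_path, value in find_overflows(product):
    print(f"{json_path} = {value} overflows INT4")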

How correctly design data model in cassandra to store that json

I'm studying Cassandra to understand how to model data to best manage a JSON like this:
{
"summary": {
"elem": [
{
"score": 15.8,
"value": "xxx"
},
{
"score": 15.7,
"value": "yyy"
},
{
"score": 13.9,
"value": "zzz"
}
],
"sens": [
{
"score": 23,
"start": 0,
"end": 210,
"value": "kkk"
},
{
"score": 12.1,
"start": 212,
"end": 326,
"value": "nnn"
}
]
},
"cats": [
{
"name": "c1",
"val": 10245,
"sens": [
{
"val": "mmm",
"els": [
{
"start": 25,
"end": 38,
"value": "ccc"
}
],
"score": 810,
"start": 0,
"end": 210
},
...
]
},
...
],
"ecv": {
"ens": [
{
"val": "bbb",
"text": "jjj",
"matches": [
{
"start": 2706,
"end": 2719,
"value": "aaa"
}
],
"properties": [
{
"name": "id",
"value": "0001"
},
{
"name": "uni",
"value": "V"
},
...
]
},
...
],
"rels": [
{
"ens": [
{
"text": "pp",
"start": 0,
"end": 7,
"value": "uuu"
},
{
"type": "rrr",
"start": 25,
"end": 38,
"value": "www"
}
],
"act": {
"name": "rtr",
"type": "fff",
"start": 122,
"end": 125
},
"sens": {
"value": "ddd"
}
},
...
]
},
"doms": [
{
"value": "yyy",
"fas": [
{
"val": "ccw",
"sens": {
"start": 0,
"end": 210,
"value": "xxx"
},
"els": [
{
"start": 169,
"end": 178,
"value": "bhh"
},
...
],
"ents": [
{
"val": "ents1",
"type": "xxx",
"matches": [
{
"start": 0,
"end": 7,
"value": "bbb"
}
]
},
...
]
},
...
]
},
...
]
}
I used MongoDB for some months, so I think it would be simple to write this entire document to a MongoDB collection.
I don't know how to design my Cassandra model to store that JSON.
Can someone give me a way to start "thinking in Cassandra"?
Thanks!
A thing that's vital to understand is that there's no "one storage" model in Cassandra. You have a traditional ER model, from which you derive a logical model, that you combine with your constraints and performance requirements to get the final physical model. Your tables in cassandra are the last form, and the other forms are "in your head" or documented - ideas rather than tables. So, how do you get to those tables? You think in terms of your queries. Your tables cater to your queries, and as such without knowing something about access patterns, there's no way of saying how you should store it.
Data in Cassandra is stored in partitions. A partition is identified by the first key in the primary key. All rows in a partition are stored together on one machine (and its replicas). If your query hits one partition, it's fast. If it hits multiple partitions (e.g. an IN query on the partition key), it's slower. If it hits all partitions, it's slowest. However, if all queries hit a single partition, you get hotspots (some servers utilised heavily while others sit idle), which can be bad in large clusters.
Inside a partition, data is sorted according to the clustering keys (the "other" keys in the primary key). You can only do filter (WHERE) queries on a clustering key if all previous clustering keys have been specified, and you can only do a range query (>, <, etc.) on the last clustering key in a predicate. You can also create secondary indices, which let you query on equality conditions outside of the clustering-key requirements, though they are slower and are updated asynchronously.
There are a lot of intricacies there, and the queries you can perform are much more restricted (if any sort of performance is a consideration). So the "Cassandra way" is to think about your query patterns, and then store data based on those. If you have multiple query patterns, duplicate the same info in different forms. There is no "one true way" of storing data.
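As a purely illustrative example of "tables cater to queries": if one known query were "give me the summary elem entries for a document, highest score first", a sketch with the Python cassandra-driver might look like the following; the keyspace, table, and column names are assumptions, not taken from the JSON above:

# Hypothetical sketch: one table per query, here "elem scores for a document,
# ordered by score". Keyspace/table/column names are made up for illustration.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# doc_id is the partition key (all rows for one document live together);
# score/value are clustering keys, so rows come back sorted by score.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.summary_elem_by_doc (
        doc_id text,
        score  double,
        value  text,
        PRIMARY KEY ((doc_id), score, value)
    ) WITH CLUSTERING ORDER BY (score DESC, value ASC)
""")

session.execute(
    "INSERT INTO demo.summary_elem_by_doc (doc_id, score, value) VALUES (%s, %s, %s)",
    ("doc-1", 15.8, "xxx"),
)

# The query this table was designed for: one partition, already sorted.
rows = session.execute(
    "SELECT score, value FROM demo.summary_elem_by_doc WHERE doc_id = %s",
    ("doc-1",),
)
for row in rows:
    print(row.score, row.value)

The same elem data could be duplicated into other tables keyed differently for other queries; that duplication is the expected trade-off in Cassandra.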