I have a table for holding translations. It is laid out as follows:
id | iso | token | content
-----------------------------------------------
1 | GB | test1 | Test translation 1 (English)
2 | GB | test2 | Test translation 2 (English)
3 | FR | test1 | Test translation 1 (French)
4 | FR | test2 | Test translation 2 (French)
// etc
For the translation management tool to go along with the table I need to output it in something more like a spreadsheet grid:
token | GB | FR | (other languages) -->
-------------------------------------------------------------------------------------------
test1 | Test translation 1 (English) | Test translation 1 (French) |
test2 | Test translation 2 (English) | Test translation 2 (French) |
(other tokens) | | |
| | | |
| | | |
V | | |
I thought this would be easy, but it turned out to be far more difficult than I expected!
After a lot of searching and digging around I did find group_concat, which for the specific case above I can get to work and generate the output I'm looking for:
select
token,
group_concat(if (iso = 'FR', content, NULL)) as 'FR',
group_concat(if (iso = 'GB', content, NULL)) as 'GB'
from
translations
group by token;
However, this is, of course, totally inflexible. It only works for the two languages I have specified so far. The instant I add a new language I have to manually update the query to take it into account.
I need a generalized version of the query above, that will be able to generate the correct table output without having to know anything about the data stored in the source table.
Some sources claim you can't easily do this in MySQL, but I'm sure it must be possible. After all, this is the sort of thing databases exist for in the first place.
Is there a way of doing this? If so, how?
Because of MySQL's limitations, I would handle part of this on the application side and keep the database work to a single query, like this:
query:
select token, group_concat(concat(iso,'|',content)) as contents
from translations
group by token
"token";"contents"
"test1";"GB|Test translation 1 (English),FR|Test translation 1 (French),IT|Test translation 1 (Italian)"
"test2";"GB|Test translation 2 (English),FR|Test translation 2 (French),IT|Test translation 2 (Italian)"
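Then, while binding the rows, I would split each contents value on commas to get one language/content pair per entry, and split each pair on the pipe to get the column header (the ISO code) and the cell value.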
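The splitting step described above could look roughly like this in Python. This is a minimal sketch, not a definitive implementation: `parse_contents` is a hypothetical helper name, and it assumes no translation contains a comma or a pipe. If translations can contain those characters, GROUP_CONCAT's SEPARATOR clause can be used to pick a safer delimiter.

```python
def parse_contents(contents, pair_sep=",", field_sep="|"):
    """Turn 'GB|text,FR|text' into {'GB': 'text', 'FR': 'text'}."""
    translations = {}
    for pair in contents.split(pair_sep):
        # Split only on the first pipe: the ISO code comes before it.
        iso, _, content = pair.partition(field_sep)
        translations[iso] = content
    return translations


# Rows as they would come back from the group_concat query (sample data).
rows = [
    ("test1", "GB|Test translation 1 (English),FR|Test translation 1 (French)"),
    ("test2", "GB|Test translation 2 (English),FR|Test translation 2 (French)"),
]

# Grid keyed by token; the header is the union of all language codes seen.
grid = {token: parse_contents(contents) for token, contents in rows}
languages = sorted({iso for cells in grid.values() for iso in cells})

for token, cells in grid.items():
    # Missing translations become empty cells.
    print(token, [cells.get(iso, "") for iso in languages])
```

Note that MySQL truncates GROUP_CONCAT results at group_concat_max_len, so long translations may need that setting raised.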
What you seek is often called a dynamic crosstab wherein you dynamically determine the columns in the output. Fundamentally, relational databases are not designed to dynamically determine the schema. The best way to achieve what you want is to use a middle-tier component to build the crosstab SQL statement similar to what you have shown and then execute that.
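As a rough illustration of the middle-tier approach, the sketch below first asks the database which languages exist and then builds the crosstab statement from that list. Several things here are assumptions: Python's built-in sqlite3 stands in for MySQL so the example is self-contained, `build_crosstab_sql` is a hypothetical helper, and MAX(CASE ...) is used in place of the question's group_concat(if(...)). The double-quoted column aliases are SQLite/ANSI style; MySQL would normally use backticks.

```python
import sqlite3


def build_crosstab_sql(languages):
    """Build one aggregated column per language code (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language is required")
    # Language codes are bound as parameters; only the alias is interpolated,
    # with embedded double quotes escaped.
    columns = ",\n  ".join(
        'MAX(CASE WHEN iso = ? THEN content END) AS "{}"'.format(
            iso.replace('"', '""')
        )
        for iso in languages
    )
    return (
        "SELECT token,\n  " + columns
        + "\nFROM translations\nGROUP BY token\nORDER BY token"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE translations "
    "(id INTEGER PRIMARY KEY, iso TEXT, token TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO translations (iso, token, content) VALUES (?, ?, ?)",
    [
        ("GB", "test1", "Test translation 1 (English)"),
        ("GB", "test2", "Test translation 2 (English)"),
        ("FR", "test1", "Test translation 1 (French)"),
        ("FR", "test2", "Test translation 2 (French)"),
    ],
)

# Step 1: discover the columns. Step 2: build and run the crosstab.
languages = [
    row[0]
    for row in conn.execute("SELECT DISTINCT iso FROM translations ORDER BY iso")
]
sql = build_crosstab_sql(languages)
result = conn.execute(sql, languages).fetchall()

print(["token"] + languages)
for row in result:
    print(list(row))
```

Adding a new language to the table then changes the output without any change to the code, at the cost of two round trips to the database.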
I have two tables that I want to relate to each other. The issue is any product can have n-number of POs, so individual columns wouldn't work in a traditional DB.
I was thinking of using JSON fields to store an array, or using XML. I would need to insert additional POs, so I'm concerned with the lack of editing support for XML.
What is the standard way of handling n-number of attributes in a single field?
|id | Product | Work POs|
| - | ------- | ------- |
| 1 | bicycle | 002,003 |
| 2 | unicycle| 001,003 |
|PO | Job |
|-- | ---------------- |
|001|Install 1 wheel |
|002|Install 2 wheels |
|003|Install 2 seats |
The standard way to store multi-valued attributes in a relational database is to create another table, so you can store one value per row. This makes it easy to add or remove one new value, or to search for a specific value, or to count PO's per product, and many other types of queries.
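|id | Product |
| - | ------- |
| 1 | bicycle |
| 2 | unicycle|
|product_id | PO |
| --------- | --- |
| 1 | 002 |
| 1 | 003 |
| 2 | 001 |
| 2 | 003 |
|PO | Job |
|-- | ---------------- |
|001|Install 1 wheel |
|002|Install 2 wheels |
|003|Install seat |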
I also recommend reading my answer to Is storing a delimited list in a database column really that bad?
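The one-value-per-row layout can be tried out with a small sketch like the following. It is only an illustration under stated assumptions: sqlite3 is used so the example runs anywhere, and the table names `product`, `po` and `product_po` are hypothetical. PO numbers are stored as text so the leading zeros survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE po (po TEXT PRIMARY KEY, job TEXT NOT NULL);
-- Junction table: one row per (product, PO) pair.
CREATE TABLE product_po (
    product_id INTEGER NOT NULL REFERENCES product(id),
    po TEXT NOT NULL REFERENCES po(po),
    PRIMARY KEY (product_id, po)
);
INSERT INTO product VALUES (1, 'bicycle'), (2, 'unicycle');
INSERT INTO po VALUES
    ('001', 'Install 1 wheel'),
    ('002', 'Install 2 wheels'),
    ('003', 'Install seat');
INSERT INTO product_po VALUES (1, '002'), (1, '003'), (2, '001'), (2, '003');
""")

# Adding one more PO to a product is a single-row insert.
conn.execute("INSERT INTO product_po VALUES (1, '001')")

# Count POs per product.
pos_per_product = conn.execute("""
    SELECT p.name, COUNT(*)
    FROM product p
    JOIN product_po pp ON pp.product_id = p.id
    GROUP BY p.id, p.name
    ORDER BY p.id
""").fetchall()

# Find every product that uses a specific PO.
products_with_003 = [
    row[0]
    for row in conn.execute("""
        SELECT p.name
        FROM product p
        JOIN product_po pp ON pp.product_id = p.id
        WHERE pp.po = ?
        ORDER BY p.id
    """, ("003",))
]

print(pos_per_product)
print(products_with_003)
```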
In some cases you really do need to store array-like data in a single field.
In MySQL 5.7.8+ you can use a JSON column. The PO numbers are quoted as strings because JSON numbers cannot have leading zeros:
ALTER TABLE `some_table` ADD `po` JSON NOT NULL;
UPDATE `some_table` SET `po` = '["002","003"]' WHERE `some_table`.`id` = 1;
See examples here: https://sebhastian.com/mysql-array/
Community!
Story: I am trying to upload a CSV file with a huge batch of products to my e-commerce shop. There are many very similar products, each with slightly different values in every column. Luckily the plugin I use can handle this, but it needs either the same title for the entire product range or some reference to the parent product. Sadly, the reference is not there.
Now I want to know how I can find values in a CSV file that are nearly the same (in SQL there was something called '%LIKE%') to structure the table appropriately. I can hardly describe what I want to achieve, but here is an example for what I'm looking for.
I basically want to transform this table:
+---------------+---------------+---------------+---------------+
| ID | Title | EAN | ... |
+---------------+---------------+---------------+---------------+
| 1 | AquaMat 3.6ft | 1234567890 | ... |
+---------------+---------------+---------------+---------------+
| 2 | AquaMat 3.8ft | 1234567891 | ... |
+---------------+---------------+---------------+---------------+
| 3 | AquaMat 4ft | 1234567892 | ... |
+---------------+---------------+---------------+---------------+
into this:
+---------------+---------------+---------------+---------------+
| ID | Title | EAN | ... |
+---------------+---------------+---------------+---------------+
| 1 | AquaMat | 1234567890 | ... |
+---------------+---------------+---------------+---------------+
| 2 | AquaMat | 1234567891 | ... |
+---------------+---------------+---------------+---------------+
| 3 | AquaMat | 1234567892 | ... |
+---------------+---------------+---------------+---------------+
The extra data can be scraped. Can I do this with Excel? With Macros? With Python?
Thank you for taking time and reading this.
If you have any questions, then feel free to ask.
EDIT:
The Title column contains products with completely different names and might even contain more whitespaces. And some products might have 1 attribute but others have up to 3 attributes. But this can be sorted manually.
By "nearly the same" I mean what you can see in the table: the titles are basically the same but not identical, and I want to remove the attributes from them. Also, there are no other columns with any more details, only numbers and the attributes that I am trying to cut off the title.
Here's an idea using Python and .split():
import csv

with open('testfile.csv', 'r', encoding='utf-8-sig', newline='') as inputfile, \
        open('outputfile.csv', 'w', encoding='utf-8', newline='') as outputfile:
    csv_reader = csv.reader(inputfile, delimiter=',')
    w = csv.writer(outputfile)
    # Output header: the original columns plus the parts split out of the title.
    header = ['ID', 'Title', 'EAN', 'Product', 'Attr1', 'Attr2', 'Attr3']
    w.writerow(header)
    for row in csv_reader:
        if not row or row[0] == 'ID':
            # Skip blank lines and the input header; ours was written above.
            continue
        # Split the title on whitespace: the first word lands in 'Product',
        # the remaining words in 'Attr1', 'Attr2', ...
        parts = row[1].split()
        for item in parts:
            # If you want, you can add some other conditions on the attribute (item) in here.
            row.append(item)
        print('row: {}'.format(row))
        w.writerow(row)
I think we're going to need more information about what, exactly, you're trying to achieve. Is it just the extra text after the "AquaMat" (for example) that you want to remove? If so, you could simply loop through the CSV file and remove anything after "AquaMat" in the "Title" column.
I assume from your description, though, that there is more to it than this.
Perhaps a starting point would be to let us know what you mean by "nearly the same". Do you want exactly what SQL means by LIKE, or something different?
EDIT:
You might check out Python's regular expressions (the re module). If your "nearly the same" can be translated into a regular expression as described in the docs, then you could use Python to loop through the CSV file and search/replace terms based on that expression.
Are all the "nearly the same" things in the "Title" column, or could they be in other columns as well?
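To make the regular-expression idea concrete, here is a small sketch. It assumes, and this is only a guess at what "nearly the same" means, that attributes are trailing words shaped like a number with an optional unit, such as "3.6ft" or "4ft". `split_title` and `ATTRIBUTE` are hypothetical names.

```python
import re

# Assumed attribute shape: digits, optional decimal part, optional unit letters.
ATTRIBUTE = re.compile(r"^\d+(?:\.\d+)?[A-Za-z]*$")


def split_title(title):
    """Return (base_title, attributes) for a product title."""
    words = title.split()  # also collapses repeated whitespace
    attributes = []
    # Peel attribute-looking words off the end, but always keep one word.
    while len(words) > 1 and ATTRIBUTE.match(words[-1]):
        attributes.insert(0, words.pop())
    return " ".join(words), attributes


for title in ["AquaMat 3.6ft", "AquaMat 3.8ft", "AquaMat  4ft", "Garden Hose"]:
    print(split_title(title))
```

Titles whose attributes do not fit this pattern (for example a colour name) would pass through unchanged and would still need manual sorting.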
Edit: SQL doesn't work for this. I just found out about Solr/Sphinx and it seems like the right tool for this problem, so if you know Solr or Sphinx I'm eager to hear from you.
Basically, I have a .tsv with patent info and a .csv with product names. I need to match each row of the patents column against the product names and extract the occurrences in a new .csv column.
You can scroll down and see the example at the end.
Original question:
SQL newbie here so bear with me :). I can't figure out how to do this:
My database:
mysql> SHOW TABLES;
+-----------------------+
| Tables_in_prodpatdb |
+-----------------------+
| assignee |
| patents |
| patent_info |
| products |
+-----------------------+
mysql> DESCRIBE patents;
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| ... | | | | | |
| patent_id | varchar(20) | YES | | NULL | |
| text | text | YES | | NULL | |
| ... | | | | | |
+-------------+-------------+------+-----+---------+-------+
mysql> DESCRIBE products;
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| name | text | YES | | NULL | |
+-------------+-------------+------+-----+---------+-------+
I have to work with the columns name and text, they look like this:
name
product1
product2
product3
...
~10M rows
text
long text description 1
long text description 2
long text description 3
...
~88M rows
I need to check row 1 of patents.text and match it against the products.name column to find every product name in that row, then store those product names in a new table. Then check row 2 and repeat.
If a patents.text row has a product name several times only copy it to the new table once. If some row has no product names just skip it. The output should be something like this:
Operation Product
1 prod5, prod6
2 prod7
...
An example:
name
valve
a/c fan
farmed salmon
...
text
This patent deals with a new approach to air-conditioned fan. With some new valve the a/c fan is
so much better. The new valve is great.
This patent has no product names in it.
This patent talks about farmed salmon.
...
Desired output:
Operation Product
1 valve, a/c fan
2 farmed salmon
...
You can use GROUP_CONCAT with an inner SELECT query, e.g.:
SELECT p.text,
(SELECT GROUP_CONCAT(name) FROM products WHERE LOCATE(LOWER(name), LOWER(p.text)) > 0) AS 'products'
FROM patents p;
The only way I can see doing this with a reasonable performance is a full text search. I've seldom done these myself (maybe 3 times in 20+ years now); so I'll defer to someone else w/ more experience.
Using https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html as a starting point.
Provided the full text index has been created, it may be something as simple as:
SELECT pat.patent_ID, group_concat(P.Name)
FROM patents pat
CROSS JOIN products p
WHERE MATCH (pat.text)
AGAINST (p.name IN NATURAL LANGUAGE MODE)
GROUP BY pat.patent_ID;
Since every product has to be checked against every patent, we have to cross join, which means roughly 10 million × 88 million = 880 trillion row combinations; that alone is a lot. The more reading I do on this, however, the more I realize we're dealing with unstructured data in an RDBMS. By its nature that's not an ideal fit, and there may be much more optimized methods to handle this outside of an RDBMS; otherwise we have to spend the time to structure the data in the RDBMS so it can be more effective with the indexes (such as splitting the text into its own rows, one per word, for indexing).
Lastly, do we really need to look for ALL products? The sheer size of the data involved on both sides means this is going to take time in a database that doesn't handle unstructured data well.
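For a sense of what handling this outside the RDBMS looks like, here is a deliberately naive Python sketch using the example from the question. `find_products` is a hypothetical helper, and plain case-insensitive substring containment is an assumption about what counts as a match (so "valve" would also match inside "valves"). At the real data sizes this nested loop would be far too slow; a multi-pattern matcher or a search engine would be needed instead.

```python
def find_products(text, product_names):
    """Return each product name that occurs in text, once, in list order."""
    lowered = text.lower()
    found = []
    for name in product_names:
        if name.lower() in lowered and name not in found:
            found.append(name)
    return found


products = ["valve", "a/c fan", "farmed salmon"]
patents = [
    "This patent deals with a new approach to air-conditioned fan. "
    "With some new valve the a/c fan is so much better. The new valve is great.",
    "This patent has no product names in it.",
    "This patent talks about farmed salmon.",
]

# Keyed by patent row number; rows without any product name are skipped.
matches = {}
for number, text in enumerate(patents, start=1):
    names = find_products(text, products)
    if names:
        matches[number] = names

for number, names in matches.items():
    print(number, ", ".join(names))
```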
Edit
Scratch the below as it will not be able to handle the load effectively. But keeping it for posterity.
I think concat() and group_concat() may do the trick.
We join where patents.text is like the product name, generating multiple rows. The group_concat then combines these rows into one record. I'm not sure where "Operation" comes from in your result.
SELECT pat.text, group_concat(p.name) AS Product
FROM patents pat
INNER JOIN products p
ON pat.text LIKE concat('%', p.name, '%')
GROUP BY pat.text
However, don't expect this to be fast: we're doing a wildcard search with a % on both ends, so no index can be used.
I need to create a large scale DB Model for a web application that will be multilingual.
One question I run into every time I think about this is how to handle multiple translations for a field. An example case:
The table for language levels, which administrators can edit from the backend, can have multiple items like: basic, advance, fluent, mattern... In the near future there will probably be one more type. The admin goes to the backend and adds a new level, and it is sorted into the right position, but how do I handle all the translations for the end users?
Another problem with internationalizing a database is that user studies can differ between the USA, the UK, Germany and so on: every country has its own levels (probably equivalent to another country's, but still different). And what about billing?
How do you model this at a large scale?
Here is the way I would design the database:
[Schema diagram not reproduced here. Visualization by DB Designer Fork.]
The i18n table only contains a PK, so that any table just has to reference this PK to internationalize a field. The table translation is then in charge of linking this generic ID with the correct list of translations.
locale.id_locale is a VARCHAR(5) so it can hold both the short (en) and the full (en_US) locale syntaxes.
currency.id_currency is a CHAR(3) to hold ISO 4217 codes.
You can find two examples: page and newsletter. Both of these admin-managed entities need to internationalize their fields, respectively title/description and subject/content.
Here is an example query:
select
t_subject.tx_translation as subject,
t_content.tx_translation as content
from newsletter n
-- join for subject
inner join translation t_subject
on t_subject.id_i18n = n.i18n_subject
-- join for content
inner join translation t_content
on t_content.id_i18n = n.i18n_content
inner join locale l
-- condition for subject
on l.id_locale = t_subject.id_locale
-- condition for content
and l.id_locale = t_content.id_locale
-- locale condition
where l.id_locale = 'en_GB'
-- other conditions
and n.id_newsletter = 1
Note that this is a normalized data model. If you have a huge dataset, you could consider denormalizing it to optimize your queries. You can also use indexes to improve query performance (in some databases, foreign keys are indexed automatically, e.g. MySQL/InnoDB).
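The following sketch exercises the i18n/translation design on a tiny data set. Table and column names follow the answer; everything else is an assumption made for illustration: the column types, the sample texts, the `newsletter_texts` helper, and the use of sqlite3 in place of the real database. The locale filter is applied directly in each join rather than through the locale table, which should give the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE i18n (id_i18n INTEGER PRIMARY KEY);
CREATE TABLE locale (id_locale VARCHAR(5) PRIMARY KEY);
CREATE TABLE translation (
    id_i18n INTEGER NOT NULL REFERENCES i18n(id_i18n),
    id_locale VARCHAR(5) NOT NULL REFERENCES locale(id_locale),
    tx_translation TEXT NOT NULL,
    PRIMARY KEY (id_i18n, id_locale)
);
CREATE TABLE newsletter (
    id_newsletter INTEGER PRIMARY KEY,
    i18n_subject INTEGER NOT NULL REFERENCES i18n(id_i18n),
    i18n_content INTEGER NOT NULL REFERENCES i18n(id_i18n)
);
-- Sample data, invented for this sketch.
INSERT INTO i18n VALUES (1), (2);
INSERT INTO locale VALUES ('en_GB'), ('fr_FR');
INSERT INTO translation VALUES
    (1, 'en_GB', 'Hello'),
    (1, 'fr_FR', 'Bonjour'),
    (2, 'en_GB', 'Welcome to our newsletter.'),
    (2, 'fr_FR', 'Bienvenue dans notre newsletter.');
INSERT INTO newsletter VALUES (1, 1, 2);
""")


def newsletter_texts(conn, id_newsletter, id_locale):
    """Return (subject, content) for one newsletter in one locale, or None."""
    return conn.execute("""
        SELECT t_subject.tx_translation, t_content.tx_translation
        FROM newsletter n
        JOIN translation t_subject
          ON t_subject.id_i18n = n.i18n_subject AND t_subject.id_locale = ?
        JOIN translation t_content
          ON t_content.id_i18n = n.i18n_content AND t_content.id_locale = ?
        WHERE n.id_newsletter = ?
    """, (id_locale, id_locale, id_newsletter)).fetchone()


print(newsletter_texts(conn, 1, "en_GB"))
print(newsletter_texts(conn, 1, "fr_FR"))
```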
Some previous StackOverflow questions on this topic:
What are best practices for multi-language database design?
What's the best database structure to keep multilingual data?
Schema for a multilanguage database
How to use multilanguage database schema with ORM?
Some useful external resources:
Creating multilingual websites: Database Design
Multilanguage database design approach
Propel Gets I18n Behavior, And Why It Matters
The best approach often is, for every existing table, create a new table into which text items are moved; the PK of the new table is the PK of the old table together with the language.
In your case:
The table for language levels, which administrators can edit from the backend, can have multiple items like: basic, advance, fluent, mattern... In the near future there will probably be one more type. The admin goes to the backend and adds a new level, and it is sorted into the right position, but how do I handle all the translations for the end users?
Your existing table probably looks something like this:
+----+-------+---------+
| id | price | type |
+----+-------+---------+
| 1 | 299 | basic |
| 2 | 299 | advance |
| 3 | 399 | fluent |
| 4 | 0 | mattern |
+----+-------+---------+
It then becomes two tables:
+----+-------+ +----+------+-------------+
| id | price | | id | lang | type |
+----+-------+ +----+------+-------------+
| 1 | 299 | | 1 | en | basic |
| 2 | 299 | | 2 | en | advance |
| 3 | 399 | | 3 | en | fluent |
| 4 | 0 | | 4 | en | mattern |
+----+-------+ | 1 | fr | élémentaire |
| 2 | fr | avance |
| 3 | fr | couramment |
: : : :
+----+------+-------------+
Another problem with internationalizing a database is that user studies can differ between the USA, the UK, Germany and so on: every country has its own levels (probably equivalent to another country's, but still different). And what about billing?
All localisation can occur through a similar approach. Instead of just moving text fields to the new table, you could move any localisable fields - only those which are common to all locales will remain in the original table.
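A minimal sketch of this split, using only the rows shown in the tables above: the language-independent columns stay in one table and the text moves to a second table keyed by (id, lang). The table names `level` and `level_text` and the helper `levels_for` are hypothetical, and sqlite3 is assumed only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Columns common to all locales stay here.
CREATE TABLE level (id INTEGER PRIMARY KEY, price INTEGER NOT NULL);
-- Localisable text: the primary key is the old key plus the language.
CREATE TABLE level_text (
    id INTEGER NOT NULL REFERENCES level(id),
    lang TEXT NOT NULL,
    type TEXT NOT NULL,
    PRIMARY KEY (id, lang)
);
INSERT INTO level VALUES (1, 299), (2, 299), (3, 399), (4, 0);
INSERT INTO level_text VALUES
    (1, 'en', 'basic'), (2, 'en', 'advance'),
    (3, 'en', 'fluent'), (4, 'en', 'mattern'),
    (1, 'fr', 'élémentaire'), (2, 'fr', 'avance'), (3, 'fr', 'couramment');
""")


def levels_for(conn, lang):
    """Return (id, price, type) rows for the levels translated into lang."""
    return conn.execute("""
        SELECT l.id, l.price, t.type
        FROM level l
        JOIN level_text t ON t.id = l.id AND t.lang = ?
        ORDER BY l.id
    """, (lang,)).fetchall()


print(levels_for(conn, "en"))
print(levels_for(conn, "fr"))
```

With an inner join, a level that has no row for the requested language is simply left out; whether to fall back to a default language instead is a separate design decision.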
I'm trying to build a MySQL query that uses the rows in a lookup table as the columns in my result set.
LookupTable
id | AnalysisString
1 | color
2 | size
3 | weight
4 | speed
ScoreTable
id | lookupID | score | customerID
1 | 1 | A | 1
2 | 2 | C | 1
3 | 4 | B | 1
4 | 2 | A | 2
5 | 3 | A | 2
6 | 1 | A | 3
7 | 2 | F | 3
I'd like a query that would use the relevant lookupTable rows as columns in a query so that I can get a result like this:
customerID | color | size | weight | speed
1 | A | C | | B
2 | | A | A |
3 | A | F | |
The kicker of the problem is that there may be additional rows added to the LookupTable and the query should be dynamic and not have the Lookup IDs hardcoded. That is, this will work:
SELECT st.customerID,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=1 AND st.customerID = st1.customerID) AS color,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=2 AND st.customerID = st1.customerID) AS size,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=3 AND st.customerID = st1.customerID) AS weight,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=4 AND st.customerID = st1.customerID) AS speed
FROM ScoreTable st
GROUP BY st.customerID
Until there is a fifth row added to the LookupTable . . .
Perhaps I'm breaking the whole relational model and will have to resolve this in the backend PHP code?
Thanks for pointers/guidance.
tom
You have architected an EAV database. Prepare for a lot of pain when it comes to maintainability, efficiency and correctness. "This is one of the design anomalies in data modeling." (http://decipherinfosys.wordpress.com/2007/01/29/name-value-pair-design/)
The best solution would be to redesign the database into something more normal.
What you are trying to do is generally referred to as a cross-tabulation, or cross-tab, query. Some DBMSs support cross-tabs directly, but MySQL isn't one of them, AFAIK (there's a blog entry here depicting the arduous process of simulating the effect).
Three options come to mind for dealing with this:
1. Don't cross-tab at all. Instead, sort the output by row id, then AnalysisString, and generate the tabular output in your programming language.
2. Generate code on-the-fly in your programming language to emit the appropriate query.
3. Follow the blog I mention above to implement a server-side solution.
Also consider @Marek's answer, which suggests that you might be better off restructuring your schema. The advice is not a given, however. Sometimes, a key-value model is appropriate for the problem at hand.
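The first option (no cross-tab in SQL, build the grid in application code) might look like the sketch below. It is written in Python rather than the asker's PHP, the data is copied from the question's tables, and `pivot` is a hypothetical helper. The score rows are assumed to come from a plain SELECT of lookupID, score and customerID.

```python
from collections import defaultdict

# LookupTable: id -> AnalysisString
lookup = {1: "color", 2: "size", 3: "weight", 4: "speed"}

# ScoreTable rows as (lookupID, score, customerID)
scores = [
    (1, "A", 1), (2, "C", 1), (4, "B", 1),
    (2, "A", 2), (3, "A", 2),
    (1, "A", 3), (2, "F", 3),
]


def pivot(lookup, scores):
    """Return (header, rows) with one column per lookup entry."""
    columns = [lookup[key] for key in sorted(lookup)]
    by_customer = defaultdict(dict)
    for lookup_id, score, customer_id in scores:
        by_customer[customer_id][lookup[lookup_id]] = score
    rows = [
        # Missing scores become empty cells.
        [customer_id] + [cells.get(column, "") for column in columns]
        for customer_id, cells in sorted(by_customer.items())
    ]
    return ["customerID"] + columns, rows


header, rows = pivot(lookup, scores)
print(header)
for row in rows:
    print(row)
```

Because the columns are read from the lookup data at run time, a fifth row in LookupTable shows up as a fifth column with no code change.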