I'm working with articles scraped from online newspapers, using a MySQL database and Python. I want to use the pandas to_sql method on a DataFrame to append recently scraped articles to a MySQL table. It works pretty well, but I'm having a problem with the following:
Since the articles are automatically scraped from news sites, about 1% of them have issues (encoding problems, texts that are too long, and so on) and don't fit in the MySQL table fields. For some reason, the pandas to_sql method IGNORES these errors and discards the rows that do not fit. For example, I have the following MySQL table:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(255) | YES | | NULL | |
| description | text | YES | | NULL | |
| content | text | YES | | NULL | |
| link | varchar(300) | YES | | NULL | |
+--------------+--------------+------+-----+---------+----------------+
I also have a DataFrame that contains 15 rows and 4 columns (title, description, content, link).
If one of those rows has a title longer than 255 characters, it won't fit in the MySQL table. I expected an error when doing df.to_sql('press', con=con, index=False, if_exists='append'), so that I would know I have a problem to fix; but the actual result was that 14 ROWS were appended instead of 15.
This could work for me, but I need to know which row was discarded so I can flag it for later review. Is it possible to tell pandas to let me know which indexes are ignored?
Thanks!
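One way to find the offending rows before calling to_sql is to check each value against the column limits yourself. The following is a minimal sketch, not a pandas feature: the function name find_oversized_rows and the MAX_LENGTHS mapping are made up for illustration, with the limits copied from the table definition above. It works on plain dictionaries, such as those produced by df.to_dict("records").

```python
# Character limits taken from the table definition above (varchar columns only).
MAX_LENGTHS = {"title": 255, "link": 300}


def find_oversized_rows(records, max_lengths=MAX_LENGTHS):
    """Return (row_position, column, length) for every value that is too long.

    `records` is a list of dicts, e.g. the result of df.to_dict("records").
    """
    problems = []
    for position, row in enumerate(records):
        for column, limit in max_lengths.items():
            value = row.get(column)
            # Skip missing values (None); only strings can overflow a varchar.
            if isinstance(value, str) and len(value) > limit:
                problems.append((position, column, len(value)))
    return problems


rows = [
    {"title": "Short title", "link": "http://example.com/a"},
    {"title": "x" * 300, "link": "http://example.com/b"},
]
print(find_oversized_rows(rows))  # [(1, 'title', 300)]
```

The reported positions are row positions, not index labels, so they would need to be mapped back to the DataFrame index if it is not a simple 0..n-1 range. Flagged rows can then be set aside for review and only the remaining rows appended.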
Related
Say I have a mysql table which contains two columns, one is a job number, and the other contains the stderr of the respective job:
+----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------+------+-----+---------+-------+
| job | int(11) | YES | UNI | NULL | |
| jerrors | text | YES | | NULL | |
+----------+---------+------+-----+---------+-------+
The text contains newline characters.
I would like to select from jerrors only the lines containing a given word, i.e. something like
(\n.*word*.*\n)
How would I construct such a query? If I do
SELECT job, jerrors FROM mytable WHERE jerrors LIKE "%No such file%";
then I get the whole text in the column.
Here's a sample text
Error in <TCling::RegisterModule>: cannot find dictionary module A2Dict_rdict.pcm
Error in <TCling::RegisterModule>: cannot find dictionary module A1Dict_rdict.pcm
Error in <TCling::RegisterModule>: cannot find dictionary module SpectraDict_rdict.pcm
PedestalT0 File : conf/A2-roT0.job1685_0-run1685_2.dat: No such file or directory
[A1DataDecoder] WARNING: 12 Slots for 20 pedestals! Using '16' for last 19 pedestals ...
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libMathMore.so for ROOT::Math::GSMIntegrator
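If filtering inside SQL proves awkward, one fallback is to fetch the whole jerrors value and keep only the matching lines on the client side. This is a minimal sketch in Python, assuming the text has already been read from the table; the helper name lines_containing is made up for illustration.

```python
def lines_containing(text, needle):
    """Return the lines of `text` that contain `needle` (case-sensitive)."""
    return [line for line in text.splitlines() if needle in line]


# Two lines taken from the sample text above.
jerrors = (
    "Error in <TCling::RegisterModule>: cannot find dictionary module A2Dict_rdict.pcm\n"
    "PedestalT0 File : conf/A2-roT0.job1685_0-run1685_2.dat: No such file or directory\n"
)
for line in lines_containing(jerrors, "No such file"):
    print(line)
```

The LIKE query from the question can still be used to narrow down which jobs to fetch, so that only rows with at least one match are transferred.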
When I'm looking into new databases to explore what is there, usually I get tables with long column names but short contents, like:
mysql> select * from Seat limit 2;
+---------+---------------------+---------------+------------------+--------------+---------------+--------------+-------------+--------------+-------------+---------+---------+----------+------------+---------------+------------------+-----------+-------------+---------------+-----------------+---------------------+-------------------+-----------------+
| seat_id | seat_created | seat_event_id | seat_category_id | seat_user_id | seat_order_id | seat_item_id | seat_row_nr | seat_zone_id | seat_pmp_id | seat_nr | seat_ts | seat_sid | seat_price | seat_discount | seat_discount_id | seat_code | seat_status | seat_sales_id | seat_checked_by | seat_checked_date | seat_old_order_id | seat_old_status |
+---------+---------------------+---------------+------------------+--------------+---------------+--------------+-------------+--------------+-------------+---------+---------+----------+------------+---------------+------------------+-----------+-------------+---------------+-----------------+---------------------+-------------------+-----------------+
| 4897 | 2016-09-01 00:05:54 | 330 | 331 | NULL | NULL | NULL | 0 | NULL | NULL | 0 | NULL | NULL | NULL | 0.00 | NULL | NULL | free | NULL | NULL | 0000-00-00 00:00:00 | NULL | NULL |
| 4898 | 2016-09-01 00:05:54 | 330 | 331 | NULL | NULL | NULL | 0 | NULL | NULL | 0 | NULL | NULL | NULL | 0.00 | NULL | NULL | free | NULL | NULL | 0000-00-00 00:00:00 | NULL | NULL |
+---------+---------------------+---------------+------------------+--------------+---------------+--------------+-------------+--------------+-------------+---------+---------+----------+------------+---------------+------------------+-----------+-------------+---------------+-----------------+---------------------+-------------------+-----------------+
Since the header is longer than the contents of each row, I get wrapped, unaligned output which is hard to understand, especially when searching for little clues like fields that aren't being used and so on.
Is there any way to tell the mysql client to truncate column names automatically, for example to a maximum of 10 characters? The first 10 characters are usually enough to know which column they refer to.
Of course I could establish column aliases for that with AS, but if there are too many columns and you want to do a fast exploration, that would take too long for each table.
Another solution would be to tell mysql to remove the prefix seat_ from each column name, for example (of course, for each table I would need to change the prefix used).
I don't think there's any way to do that automatically. Some options are:
1) Use a graphical UI such as PhpMyAdmin to view the table contents. These typically allow you to adjust column widths.
2) End the query with \G instead of ;:
mysql> SELECT * FROM seat LIMIT 2\G
This will display each row vertically, one column per line, instead of as a horizontal table:
seat_id: 4897
seat_created: 2016-09-01 00:05:54
seat_event_id: 330
...
I often use \G for tables with lots of columns because reading the horizontal format can be difficult, especially when it wraps around on the terminal.
3) Use the less pager in a mode that doesn't wrap lines. You can then scroll left and right with the arrow keys.
mysql> pager less -S
See How to better display MySQL table on Terminal
You can skip the column names completely by running the MySQL client with the -N or --skip-column-names option. Then the width of your columns will be determined by the widest data, not the column name. But there would be no row for the column names.
You can also use column aliases to set your own column names, but you'd have to enter these yourself manually.
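Another workaround is to do the truncation outside the mysql client. The sketch below is one possible approach in Python, assuming the column names and rows have already been fetched with some database driver; format_table and max_header are hypothetical names, not options of any MySQL tool.

```python
def format_table(headers, rows, max_header=10):
    """Render rows as text, truncating each column name to `max_header` chars."""
    short = [h[:max_header] for h in headers]
    cells = [["NULL" if v is None else str(v) for v in row] for row in rows]
    # Each column is as wide as its widest cell or its truncated header.
    widths = [
        max([len(name)] + [len(row[i]) for row in cells])
        for i, name in enumerate(short)
    ]
    lines = [" | ".join(name.ljust(w) for name, w in zip(short, widths))]
    for row in cells:
        lines.append(" | ".join(cell.ljust(w) for cell, w in zip(row, widths)))
    return "\n".join(lines)


print(format_table(
    ["seat_id", "seat_category_id", "seat_status"],
    [(4897, 331, "free"), (4898, 331, "free")],
))
```

Stripping a common prefix such as seat_ instead of truncating would only require changing the line that builds the shortened names.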
It might sound silly but I'm just curious.
I have a table named posts:
+----------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| title | varchar(50) | YES | | NULL | |
| body | text | YES | | NULL | |
| created | datetime | YES | | NULL | |
| modified | datetime | YES | | NULL | |
+----------+------------------+------+-----+---------+----------------+
The values:
+----+-----------------------+----------------------------------------+---------------------+---------------------+
| id | title | body | created | modified |
+----+-----------------------+----------------------------------------+---------------------+---------------------+
| 2 | A title once again!!! | And the post body follows. Tralalalala | 2013-06-03 13:13:44 | 2013-06-05 09:36:51 |
| 3 | Title strikes back | This is really exciting! Not. | 2013-06-03 13:13:46 | NULL |
| 11 | Tomcat | Tommy boy!!! FFF | 2013-06-04 16:33:22 | 2013-06-04 16:48:40 |
| 12 | FFD | dsfdsf | 2013-06-04 16:48:56 | 2013-06-04 16:55:50 |
| 13 | fdf | dfdsf | 2013-06-04 16:57:47 | 2013-06-05 09:36:54 |
| 14 | GGD | dsfdsf | 2013-06-04 17:02:33 | 2013-06-04 17:02:33 |
| 15 | GG# | dsfdsfff322 | 2013-06-05 09:36:20 | 2013-06-05 09:36:28 |
+----+-----------------------+----------------------------------------+---------------------+---------------------+
Let's say I want to search for rows that contain the value Th (not case sensitive) regardless of the field. This is like making a quick search function.
Normally I would do something like : SELECT * FROM posts WHERE title LIKE '%Th%' OR body LIKE '%Th%'
I did not include the other fields because obviously they are not gonna accept those values.
I wanna know if there's a shortcut to this? Like SELECT * FROM posts LIKE '%Th%'.
Please advise. Thanks.
Using plain old SQL you need to specify all the column names you wish to include.
If you want more search-box-like behavior, I'd suggest looking at MySQL's fulltext functions; see:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
The SQL language is based on the presumption of the schema being known. Thus, there is no "search any column" type of functionality. How would it work against non-text columns? What about columns of different collations? Aside from the language not having a feature, specifying the columns expresses your intent to the next developer and that as much as anything should be an overriding consideration.
Other answers have covered that you need to specify all the columns. Here is an alternative formulation that is a bit shorter:
SELECT *
FROM posts
WHERE concat(title, ' ', body) LIKE '%Th%'
If you are looking for an exact match, then you can do:
select *
from posts
where 'Th' in (title, body)
No, there is no shortcut that avoids using a WHERE clause and specifying the columns. The query engine cannot know what to filter, or which columns to filter on, unless you specify them in the WHERE clause.
If you want a custom shortcut - you can write a function which takes a single parameter (the search string) and returns the required fields.
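Such a shortcut can also live in application code. The following is a rough sketch, assuming a Python client whose driver uses %s placeholders (as PyMySQL and mysqlclient do); build_search_query is a hypothetical helper, not part of any library.

```python
import re


def build_search_query(table, columns, term):
    """Build a parameterized 'search these columns' query.

    Returns (sql, params). Identifiers cannot be passed as parameters, so
    they are validated against a conservative pattern instead.
    """
    if not columns:
        raise ValueError("at least one column is required")
    for name in [table, *columns]:
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unsafe identifier: {name!r}")
    where = " OR ".join(f"`{col}` LIKE %s" for col in columns)
    sql = f"SELECT * FROM `{table}` WHERE {where}"
    # The same pattern is bound once per column.
    params = [f"%{term}%"] * len(columns)
    return sql, params


sql, params = build_search_query("posts", ["title", "body"], "Th")
print(sql)     # SELECT * FROM `posts` WHERE `title` LIKE %s OR `body` LIKE %s
print(params)  # ['%Th%', '%Th%']
```

The result would be passed to the driver as cursor.execute(sql, params). Note that any % or _ inside the search term still acts as a LIKE wildcard in this sketch.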
I'm afraid there isn't.
Not sure what your use case is... does this alternative approach work for your use case?
mysql -u{user} -p{password} -h{hostname} {database_name} -B -e "{query}" | grep "{search_string}"
It connects to the database and runs the specified query, returns query results in new lines, fields separated by tab stop. Then use Unix utility grep to filter returned rows.
Let's suppose I have a table of movies:
+------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| title | tinytext | YES | | NULL | |
| synopsis | text | YES | | NULL | |
| year | int(4) | YES | | NULL | |
| ISBN | varchar(13) | YES | | NULL | |
| category | tinytext | YES | | NULL | |
| author | tinytext | YES | | NULL | |
| theme | tinytext | YES | | NULL | |
| edition | int(2) | YES | | NULL | |
| search | text | YES | | NULL | |
+------------+---------------------+------+-----+---------+----------------+
In this example, I'm using search column as a summary of the table. So, a possible record would be like the following:
+------------+-------------------------------------------------------------+
| Field | Value |
+------------+-------------------------------------------------------------+
| id | 1 |
| title | Awesome Book |
| synopsis | This is a cool book with a cool history |
| year | 2013 |
| ISBN | 1234567890123 |
| category | Horror |
| author | John Doe |
| theme | Programmer goes insane |
| edition | 2nd |
| search | 2013 horror john doe awesome book this is a cool book (...) |
+------------+-------------------------------------------------------------+
This column search will be the one scanned when a search is made. Notice that it has all the words of other fields, in lower case, and possibly some extra words to help on a search.
I have two questions about it:
1) Knowing that this column is a text field and can get really big, is it okay to index it? Will it improve the performance as expected? Why?
2) Despite the index, is it a good idea to use this method to search or is it better to try every column on my query? How can I improve it?
OBS: I don't really have this table, it's just for example purposes. Please ignore any error in datatypes or syntax I may have done.
1) Knowing that this column is a text field and can get really big, is
it okay to index it? Will it improve the performance as expected? Why?
Yes, you can index it, but no, it won't improve performance for this kind of search. An ordinary index on a string-type column only helps when the query matches the start of the column value (for example LIKE '2013 horror john%'). So in your case, someone searching for '2013 horror john' would hit the index, but someone searching for 'horror john 2013' would not.
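The reason can be illustrated with a sorted list, which is a simplified stand-in for an ordinary index. This is an analogy, not how MySQL is implemented, and the function names are invented for the example. A prefix can be located by binary search, while a word in the middle of the value forces a look at every entry.

```python
from bisect import bisect_left


def prefix_lookup(sorted_values, prefix):
    """Find values starting with `prefix` using binary search on sorted data."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    # All values sharing the prefix sit next to each other in sorted order.
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break
        matches.append(value)
    return matches


def substring_scan(values, needle):
    """Find values containing `needle`; sort order does not help here."""
    return [value for value in values if needle in value]


index = sorted([
    "2013 horror john doe awesome book",
    "2012 comedy jane roe funny book",
    "2013 drama john smith sad book",
])
print(prefix_lookup(index, "2013 horror"))  # jumps straight to the match
print(substring_scan(index, "john"))        # has to inspect all three entries
```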
2) Despite the index, is it a good idea to use this method to search
or is it better to try every column on my query? How can I improve it?
As Gordon Linoff writes, the best solution is probably full text searching. It is very fast for text searches, deals with "fuzzy" matching, and generally allows you to write a search function similar to the way Google works.
Indexing the search column is not helpful.
What you may want is full text search capabilities on the column, which you can read about here.
Which you use for search depends on whether the searches will be using context. If someone searches for "Clinton", do you want them to restrict the search to authors named "Clinton" or to books about "Clinton"? If you don't care about the context, then full text on one field is quite reasonable.
I need to add: you don't need to put all the search terms in a separate field to use full text search. You can create a full text index on multiple columns. This gives you the flexibility of using full text searches with context (by looking only in specific columns) or without context (by looking in all of them). Your question was about the search column in particular, but that is not the best way to implement the functionality that you are looking for.
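A multi-column full text index and a matching query could be generated along these lines. This is only a sketch under assumptions: the table name movies, the index name ft_title_author, and both helper functions are made up for illustration, and the %s placeholder assumes a Python driver such as PyMySQL or mysqlclient.

```python
def fulltext_index_ddl(table, columns, index_name):
    """Return a MySQL statement creating one FULLTEXT index over `columns`."""
    column_list = ", ".join(columns)
    return f"CREATE FULLTEXT INDEX {index_name} ON {table} ({column_list})"


def fulltext_query(table, columns):
    """Return a parameterized MATCH ... AGAINST query over `columns`."""
    column_list = ", ".join(columns)
    return (
        f"SELECT * FROM {table} "
        f"WHERE MATCH({column_list}) AGAINST (%s IN NATURAL LANGUAGE MODE)"
    )


columns = ["title", "author"]
print(fulltext_index_ddl("movies", columns, "ft_title_author"))
print(fulltext_query("movies", columns))
```

In general, the column list given to MATCH() has to correspond to the column list of an existing FULLTEXT index, so searching with context (only some columns) and without context (all of them) would typically mean creating one index per column combination.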
I am saving a serialized object to a mysql database blob.
After inserting some test objects and then trying to view the table, I am presented with lots of garbage and "PuTTYPuTTY" several times.
I believe this has something to do with character encoding and the blob containing strange characters.
I just want to check whether this is going to cause problems with my database, or whether this is just a problem with PuTTY showing the data.
Description of the QuizTable:
+-------------+-------------+-------------------+------+-----+---------+----------------+---------------------------------+-------------------------------------------------------------------------------------------------------------------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------------+-------------+-------------------+------+-----+---------+----------------+---------------------------------+-------------------------------------------------------------------------------------------------------------------+
| classId | varchar(20) | latin1_swedish_ci | NO | | NULL | | select,insert,update,references | FK related to the ClassTable. This way each Class in the ClassTable is associated with its quiz in the QuizTable. |
| quizId | int(11) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | This is the quiz number associated with the quiz. |
| quizObject | blob | NULL | NO | | NULL | | select,insert,update,references | This is the actual quiz object. |
| quizEnabled | tinyint(1) | NULL | NO | | NULL | | select,insert,update,references | |
+-------------+-------------+-------------------+------+-----+---------+----------------+---------------------------------+-------------------------------------------------------------------------------------------------------------------+
What I see when I try to view the table contents:
select * from QuizTable;
questionTextq ~ xp sq ~ w
t q1a1t q1a2xt 1t q1sq ~ sq ~ w
t q2a1t q2a2t q2a3xt 2t q2xt test3 | 1 |
+-------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
3 rows in set (0.00 sec)
I believe you can use the hex function on blobs as well as strings. You can run a query like this.
Select HEX(quizObject) From QuizTable Where....
PuTTY is reacting to what it thinks are terminal control character strings in your output stream. These strings allow the remote host to change something about the local terminal without redrawing the entire screen, such as setting the title, positioning the cursor, clearing the screen, etc.
It just so happens that when you try to 'display' binary data like this, a lot of it ends up containing these control sequences.
You'll get this reaction catting binary files as well.
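The same idea as HEX() can be applied on the client side when a blob is read from a program rather than from the mysql prompt. A small sketch in Python, where hex_preview is a hypothetical helper and the payload is an invented example rather than data from the question:

```python
def hex_preview(blob, limit=16):
    """Return the first `limit` bytes of `blob` as uppercase hex, like HEX()."""
    return blob[:limit].hex().upper()


# ESC ] 0 ; hi BEL is a "set window title" sequence in xterm-style terminals.
payload = b"\x1b]0;hi\x07"
print(hex_preview(payload))  # 1B5D303B686907
print(repr(payload))         # b'\x1b]0;hi\x07'
```

Printing the hex form or the repr keeps the raw control bytes away from the terminal, so nothing in the blob can be interpreted as a command.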
blob will completely ignore any character encoding settings you have. It's really intended for storing binary objects like images or zip files.
If this field will only contain text, I'd suggest using a text field.