I tried to open the CSV file (http://archive.ics.uci.edu/ml/machine-learning-databases/00222/) with the pandas module, but the read_csv command did not parse the file properly.
import pandas
bankfull = pandas.read_csv('bank-full.csv')
print(bankfull.head())
and the result looks like this:
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y
0 58;"management";"married";"tertiary";"no";2143...
1 44;"technician";"single";"secondary";"no";29;"...
How can I fix the code so the CSV file is imported as a proper pandas DataFrame?
Thank you!
You need to set the separator argument sep=';'; the default is a comma ','. You can check the docs for read_csv:
pd.read_csv('bank-full.csv', sep=';')
Out[27]:
age job marital education default balance housing loan \
0 58 management married tertiary no 2143 yes no
1 44 technician single secondary no 29 yes no
2 33 entrepreneur married secondary no 2 yes yes
3 47 blue-collar married unknown no 1506 yes no
4 33 unknown single unknown no 1 no no
5 35 management married tertiary no 231 yes no
6 28 management single tertiary no 447 yes yes
7 42 entrepreneur divorced tertiary yes 2 yes no
8 58 retired married primary no 121 yes no
9 43 technician single secondary no 593 yes no
10 41 admin. divorced secondary no 270 yes no
11 29 admin. single secondary no 390 yes no
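If you don't know the delimiter in advance, you can also detect it with the standard library's csv.Sniffer before handing the file to pandas. A minimal sketch (the 4096-byte sample size is an arbitrary choice):

import csv
import pandas

# Sniff the dialect from a sample of the file, then reuse its delimiter.
with open('bank-full.csv', newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(4096))

bankfull = pandas.read_csv('bank-full.csv', sep=dialect.delimiter)
print(bankfull.head())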
I have data coming from the backend in the following format:
INPUT DATA FORMAT -
MODULE     FRQ   VAL
Module1    1     12
Module1    2     34
Module1    3     43
Module2    1     56
Module2    2     49
Module2    3     11
Module3    1     21
Module3    2     55
Module3    3     66
OUTPUT
Module 1 | 12 | 34 | 43
Module 2 | 56 | 49 | 11
How can I display this in a table using Angular? The modules are dynamic, not fixed, so if there are 8 modules the table should display 8 rows, following the output format above. I'd also like to know whether such a solution is possible.
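The reshaping itself is a plain group-by, independent of Angular; here is a minimal sketch in Python (in an Angular app the same grouping would live in the component and feed an *ngFor over the rows):

from itertools import groupby

# Flat (module, frq, val) rows as they come from the backend.
rows = [
    ("Module1", 1, 12), ("Module1", 2, 34), ("Module1", 3, 43),
    ("Module2", 1, 56), ("Module2", 2, 49), ("Module2", 3, 11),
    ("Module3", 1, 21), ("Module3", 2, 55), ("Module3", 3, 66),
]

# Group by module name; sorting first keeps each group's values in FRQ order.
table = {
    module: [val for _, _, val in group]
    for module, group in groupby(sorted(rows), key=lambda r: r[0])
}
print(table)
# {'Module1': [12, 34, 43], 'Module2': [56, 49, 11], 'Module3': [21, 55, 66]}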
I want to perform OCR on images like this one:
It is a table with numerical data with commas as decimal separators.
It is not noisy, contrast is good, black text on white background.
As an additional preprocessing step, in order to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues) and pass only that single cell image to tesseract.
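Roughly what that per-cell preprocessing looks like with OpenCV (a sketch; the Otsu threshold and the 10 px border width are assumptions, and "cell.png" stands in for one cropped cell):

import cv2

# Load one cropped cell, binarize it, and pad it with a white border
# so the glyphs never touch the image edge.
cell = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)
_, binarized = cv2.threshold(cell, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
padded = cv2.copyMakeBorder(binarized, 10, 10, 10, 10,
                            cv2.BORDER_CONSTANT, value=255)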
I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:
Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.
There exist a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and all values of --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
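That sweep, roughly as described (a sketch; "cell.png" is a placeholder for one of the cell images):

from itertools import product

import cv2
import pytesseract

img = cv2.imread("cell.png")
for oem, psm, lang in product(range(4), range(14), ("eng", "deu")):
    try:
        txt = pytesseract.image_to_string(
            img, lang=lang, config="--oem {} --psm {}".format(oem, psm))
        print(oem, psm, lang, repr(txt.strip()))
    except pytesseract.TesseractError:
        pass  # skip the combinations that throw errors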
Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".
Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".
Any suggestions what else might be helpful in improving the output quality of tesseract here?
My tesseract version:
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE
Solution
Divide the image into 5 different rows
Apply division-normalization to each row
Set psm to 6 (Assume a single uniform block of text.)
Read
From the original image, we can see there are 5 different rows.
In each iteration, we will take one row, apply division-normalization, and read it.
We need to set the image indexes carefully.
import cv2
from pytesseract import image_to_string
img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
start_index = 0
end_index = int(h/5)
Question: Why do we declare start and end indexes?
We want to read a single row in each iteration. Let's see an example:
The current image height and width are 645 and 1597 pixels.
Divide the image based on the indexes:

start-index   end-index
0             129
129           258 (129 + 129)
258           387 (258 + 129)
387           516 (387 + 129)
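The arithmetic behind these boundaries is just h // 5; as a quick check (h = 645 from above):

h = 645
step = h // 5  # 129
print([(i * step, (i + 1) * step) for i in range(5)])
# [(0, 129), (129, 258), (258, 387), (387, 516), (516, 645)]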
Let's see whether the images are readable:

start-index   end-index   image
0             129         (image)
129           258         (image)
258           387         (image)
387           516         (image)
Nope, they are not suitable for reading; maybe a little adjustment will help. For example:
start-index   end-index   image
0             129 - 20    (image)
109           218         (image)
218           327         (image)
327           436         (image)
436           545         (image)
545           654         (image)
Now they are suitable for reading.
When we apply the division-normalization to each row:
start-index   end-index   image
0             109         (image)
109           218         (image)
218           327         (image)
327           436         (image)
436           545         (image)
545           654         (image)
Now when we read:
1,7 | 57 | 71 | 59 | .70 | 65
| 57 | 1,5 | 71 | 59 | 70 | 65
| 71 | 59 | 1,3 | 57 | 70 | 60
| 71 | 59 | 56 | 1,3 | 70 | 60
| 72 | 66 | 71 | 59 | 1,2 | 56
| 72 | 66 | 71 | 59 | 56 | 4,3
Code:
import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

# Each row is int(h/5) - 20 pixels tall (the adjusted indexes from the tables above).
start_index = 0
end_index = int(h / 5) - 20

for i in range(0, 6):
    # Crop the current row from the grayscale image.
    gry_crp = gry[start_index:end_index, 0:w]

    # Division-normalization: divide the row by a heavily blurred copy of
    # itself to flatten the background before OCR.
    blr = cv2.GaussianBlur(gry_crp, (145, 145), 0)
    div = cv2.divide(gry_crp, blr, scale=192)

    txt = image_to_string(div, config="--psm 6")
    print(txt)

    # Move the window down to the next row.
    start_index = end_index
    end_index = start_index + int(h / 5) - 20
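To turn the printed text into numbers, one option is to split each line on "|" and treat the comma as a decimal point (a sketch; parse_row is a hypothetical helper, not part of the answer):

def parse_row(line):
    values = []
    for cell in line.split("|"):
        cell = cell.strip().lstrip(".")  # drop stray leading dots such as ".70"
        if cell:
            values.append(float(cell.replace(",", ".")))
    return values

print(parse_row("1,7 | 57 | 71 | 59 | .70 | 65"))
# [1.7, 57.0, 71.0, 59.0, 70.0, 65.0]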
I have stored some hierarchical category data where each category is related to others; the trick is that a single category can have multiple parents (maximum 3, minimum 0).
The table structures are:
category table
id - Primary Key
name - Name of the Category
ref_id - Reference ID used for the relationships
id   name                          ref_id
1    everything                    -1
2    computing                     0
3    artificial intelligence       1
4    data science                  2
5    machine learning (ML)         3
6    programming                   4
7    web technologies              5
8    programming languages         7
9    content technologies          8
10   operating systems             9
11   algorithms                    10
12   software development systems  102
category_relation table

id   child_ref_id   parent_ref_id
1    0              -1
2    1              0
3    2              0
4    3              1
5    3              2
6    4              102
7    5              0
8    7              4
9    8              0
10   9              0
11   10             0
12   10             4
13   102            0
As you can see, the relationships are pretty complicated: algorithms has two parents, computing and programming; similarly, machine learning (ML) also has two parents, artificial intelligence and data science.
How can I retrieve all the children of a specific category, e.g. computing? I need to retrieve all the children down to the third level, i.e. programming languages and algorithms.
MySQL dump of the database: https://github.com/codersrb/multi-parent-hierarchy/blob/main/taxonomy.sql
Assuming the data structure is fixed with a good PK, in MySQL 8.x you can do:
with recursive
n (id, name, ref_id, lvl) as (
select id, name, ref_id, 1 from category where id = 2 -- starting node
union all
select c.id, c.name, c.ref_id, n.lvl + 1
from n
join category_relation r on r.parent_ref_id = n.ref_id
join category c on c.ref_id = r.child_ref_id
)
select * from n where lvl <= 3
Result:
id name ref_id lvl
---- --------------------------------------- ------- ---
2 computing 0 1
3 artificial intelligence 1 2
4 data science 2 2
7 web technologies 5 2
9 content technologies 8 2
10 operating systems 9 2
11 algorithms 10 2
62 information science 61 2
103 software / systems development 102 2
165 scientific computing 165 2
296 image processing 316 2
297 text processing 317 2
301 Google 321 2
322 computer vision 343 2
5 machine learning (ML) 3 3
5 machine learning (ML) 3 3
6 programming 4 3
18 models 17 3
21 classification 20 3
27 data preparation 26 3
28 data analysis 27 3
29 imbalanced datasets 28 3
50 visualization 49 3
61 information retrieval 60 3
68 k-means 67 3
71 Random Forest algorithm 70 3
104 project management 103 3
105 software development methodologies 104 3
107 web development 106 3
113 kNN model 112 3
132 CRISP-DM methodology 131 3
143 data 142 3
153 SMOTE 153 3
154 MSMOTE 154 3
157 backward feature elimination 157 3
158 forward feature selection 158 3
176 deep feature synthesis (DFS) 177 3
196 unsupervised learning 197 3
210 mean-shift 211 3
212 DBSCAN 213 3
246 naïve Bayes algorithm 247 3
248 decision tree algorithm 249 3
249 support vector machine (SVM) algorithm 250 3
251 neural networks 252 3
252 artificial neural networks (ANN) 253 3
281 deep learning 300 3
281 deep learning 300 3
285 image classification 304 3
285 image classification 304 3
286 natural language processing (NLP) 305 3
286 natural language processing (NLP) 305 3
288 text representation 307 3
294 visual recognition 314 3
295 optical character recognition (OCR) 315 3
295 optical character recognition (OCR) 315 3
296 image processing 316 3
298 machine translation (MT) 318 3
299 speech recognition 319 3
300 TensorFlow 320 3
302 R 322 3
304 Android 324 3
322 computer vision 343 3
323 object detection 344 3
324 instance segmentation 345 3
325 edge detection 346 3
326 image filters 347 3
327 feature maps 348 3
328 stride 349 3
329 padding 350 3
335 text preprocessing 356 3
336 tokenization 357 3
337 case normalization 358 3
338 removing punctuation 359 3
339 stop words 360 3
340 stemming 361 3
341 lemmatization 362 3
342 Porter algorithm 363 3
350 word2vec 371 3
351 Skip-gram 372 3
364 convnets 385 3
404 multiplicative update algorithm 716 3
If you want to remove duplicates you can use DISTINCT. For example:
with recursive
n (id, name, ref_id, lvl) as (
select id, name, ref_id, 1 from category where id = 2 -- starting node
union all
select c.id, c.name, c.ref_id, n.lvl + 1
from n
join category_relation r on r.parent_ref_id = n.ref_id
join category c on c.ref_id = r.child_ref_id
)
select distinct * from n where lvl <= 3
See running example at DB Fiddle.
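For intuition, the same three-level walk can be sketched in Python (the data below is a hypothetical, trimmed stand-in for the two tables; the SQL above is the actual answer):

# ref_id -> name, a small excerpt of the category table
category = {
    0: "computing", 1: "artificial intelligence", 2: "data science",
    3: "machine learning (ML)", 4: "programming", 10: "algorithms",
}
# (child_ref_id, parent_ref_id) rows from category_relation
relations = [(1, 0), (2, 0), (3, 1), (3, 2), (10, 0), (10, 4)]

children = {}
for child, parent in relations:
    children.setdefault(parent, []).append(child)

def descendants(start_ref, max_depth=2):
    found, frontier = [], [start_ref]
    for _ in range(max_depth):
        frontier = [c for ref in frontier for c in children.get(ref, [])]
        found.extend(frontier)
    # dict.fromkeys de-duplicates while keeping order, like DISTINCT above
    return [category.get(ref, "?") for ref in dict.fromkeys(found)]

print(descendants(0))
# ['artificial intelligence', 'data science', 'algorithms', 'machine learning (ML)']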
I want to read a CSV file in Octave which has a date column and 4 columns of integers. I have used:
[num,txt,raw] = dlmread('Mitteilungen_data.csv');
ID = num(:,1) ;
DATE = datestr (date, yyyy-mm-dd) ;
FK_OBSERVERS= num(:,2) ;
GROUPS = num(:,3) ;
SUNSPOTS = num(:,4) ;
WOLF = num(:,5) ;
dn=datenum(DATE,'YYYY-MM-DD');
plot(dn,WOLF)
Sample Data:
ID DATE FK_OBSERVERS GROUPS SUNSPOTS WOLF
4939 1612-01-17 11 5 11 61
83855 1612-01-18 85 2 2 22
4940 1612-01-20 11 4 5 45
4941 1612-01-21 11 4 7 47
4942 1612-01-23 11 3 5 35
4943 1612-01-24 11 3 6 36
4944 1612-01-25 11 6 13 73
4945 1612-01-27 11 3 6 36
83856 1612-01-28 85 NULL NULL NULL
4946 1612-01-29 11 3 6 36
4947 1612-01-30 11 4 8 48
4948 1612-02-02 11 5 8 58
4949 1612-02-05 11 4 7 47
4950 1612-02-06 11 3 7 37
4951 1612-02-10 11 5 7 57
4952 1612-02-12 11 3 4 34
4953 1612-02-13 11 2 2 22
4954 1612-02-14 11 3 3 33
The DATE column is showing an error: "element number 2 undefined in return list". How can I fix this?
You are using
[num, txt, raw] = dlmread( %...
but dlmread does not return three outputs. Type help dlmread in your console to see the syntax.
What does seem to return these three arguments is the xlsread command. Perhaps you copied code that used xlsread?
However, even so, I would still use csv2cell. Type csv2cell('data.csv') (where data.csv is the name of your file) to see what kind of output it gives.
Before you can use any of the commands defined in the io package, you need to load it into your workspace:
pkg load io
Does using NULL to fill empty cells in my tables make searches/queries faster?
For example, is this
90 2 65 2011-04-08 NULL NULL
134 2 64 2011-04-13 NULL 07:00:00
135 2 64 2011-04-13 NULL 07:00:00
136 2 64 2011-04-13 NULL 22:45:00
137 2 64 2011-04-14 NULL 19:30:00
better than
90 2 65 2011-04-08
134 2 64 2011-04-13 07:00:00
135 2 64 2011-04-13 07:00:00
136 2 64 2011-04-13 22:45:00
137 2 64 2011-04-14 19:30:00
If anyone could tell me any specific advantage to using NULL (performance, good practice, etc) that would be much appreciated.
There is a semantic difference.
NULL means "value unknown or not applicable".
If this describes the data for that column, then use it.
The empty string means "value known, and it's 'nothingness'".
If this describes the data for that column, then use it. Of course, this applies only to strings; it's unusual to have such a scenario for other data types, but usually for numeric fields a value of 0 would be appropriate here.
In short, it depends mostly on what your fields mean. Worry about performance and profiling later, when you know what data you're representing. (NULL vs "" is never going to be your bottleneck, though.)