How should I treat null values when doing data analysis? - data-analysis

I am currently analysing an employee layoffs dataset available on Kaggle. It has 1574 records and 9 columns. One of the columns, named "Total_laid_off" (how many employees were laid off from the organisation; datatype: Int), has 442 records with missing values. What should I do in this case to treat the missing values? Should I replace them with the median value, or should I drop the rows with missing values from the dataset?
I have a similar question for another column, named "percentage of employees laid off" (the percentage of employees fired out of the total workforce). In this case as well, should I replace the 552 missing values with the median percentage value?
What would be the best course of action?
I personally thought I should replace the missing values with the median, because dropping so many rows would result in a significant loss of information.
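If you do go with median imputation, here is a minimal sketch, assuming the dataset has been loaded into a MySQL 8 table named layoffs with a column total_laid_off (both names are assumptions, not from the dataset itself); MySQL has no built-in MEDIAN aggregate, so the median is derived with window functions:
-- Median of the non-missing values: average the middle one or two rows.
SELECT AVG(total_laid_off) AS median_laid_off
FROM (
    SELECT total_laid_off,
           ROW_NUMBER() OVER (ORDER BY total_laid_off) AS rn,
           COUNT(*) OVER () AS cnt
    FROM layoffs
    WHERE total_laid_off IS NOT NULL
) ranked
WHERE rn IN (FLOOR((cnt + 1) / 2), FLOOR(cnt / 2) + 1);
The same pattern applies to the percentage column; the resulting value can then be written into the NULL rows with an UPDATE.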

Related

Microsoft Access - Look-up tables that only show records not previously matched

Context: I have three tables.
Table #1: Tbl_TraumaCodes: It marks all the dates, times, and hospital beds where a medical team is alerted to go treat a patient with a serious traumatic injury.
Table #2: Tbl_Location: Lists the date, time, and location (area of the hospital, bed number) where a patient was with an identifying number.
Table #3: Tbl_Outcomes: Has an identifying number paired with discharge outcomes.
Task: I need to match, with a reasonable amount of surety, records in Tbl_TraumaCodes with Tbl_Outcomes.
Matching Tbl_Location and Tbl_Outcomes is easy and automatic through a matching query using the identifying number. Matching Tbl_Location records with Tbl_TraumaCodes will create the link I need.
I designed a look-up table in Tbl_Location where the date, time, and location of records from Tbl_TraumaCode appear so that I can match them. However, the times that are supposed to correspond between Tbl_Location and Tbl_TraumaCode are not exactly the same. The times are roughly within the same ballpark (usually +/- 30 min).
Problem: I have thousands of records to match. There may be only 10 records on a given day, which allows me to limit the options when I type in, say, July 1st in the look-up table. Not every item in Tbl_Location will have a matching item in Tbl_TraumaCode. That means I may have to match 10 records when there are 40 extra records to work with. It's incorrect to assign an item (time) in Tbl_TraumaCode to more than one item in Tbl_Location. My goal is to reduce the potential for human error.
Is there a way to make the records from the look-up table that are already assigned to a record within Tbl_Location NOT display in the look-up field? I thought about drawing the look-up table from a query, but I don’t know how I would create a TraumaCode query that only displays records that aren’t matched in another table. I also don't know if it would impact the previously assigned records.
I avail myself of the collective wisdom and humbly thank you.
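One way to approach this is Access's "unmatched records" pattern: base the look-up field's row source on a query that keeps only Tbl_TraumaCodes records with no assignment yet. A minimal sketch, assuming hypothetical field names TraumaID (the key of Tbl_TraumaCodes) and AssignedTraumaID (the field in Tbl_Location that the look-up fills in):
SELECT Tbl_TraumaCodes.*
FROM Tbl_TraumaCodes
LEFT JOIN Tbl_Location
ON Tbl_TraumaCodes.TraumaID = Tbl_Location.AssignedTraumaID
WHERE Tbl_Location.AssignedTraumaID Is Null;
As for the previously assigned records: a combo box stores its bound value, so rows assigned earlier keep their stored ID when they drop out of this list, though the combo may no longer render display text for values missing from its row source.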

SQL counting the top occurrences of substrings separated by commas in a column

I have a column in MySQL with a list of comma-separated names of varying lengths. Some example values would be ,bob,joe,mike, or ,steve,bill,dan,.
I'm looking to sort by the names that occur the most across all rows and to count how many times they occur. For example, it could return that joe is the most common name with x occurrences across all of the rows and that bob is the second most common name with y occurrences.
Is there an effective way to go about this, or am I better off storing each name individually as its own record? This table has records added to it quite often, so if I could cut down on the size that would be ideal.
I would definitely go for storing these values as one row each in the 'name' column of a one-to-many table. That way you can use aggregate functions easily.
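A minimal sketch of that design, with illustrative table and column names (record_id stands in for whatever key ties the names back to the original row):
CREATE TABLE names (
    record_id INT NOT NULL,
    name VARCHAR(50) NOT NULL
);

-- Most frequent names across all records:
SELECT name, COUNT(*) AS occurrences
FROM names
GROUP BY name
ORDER BY occurrences DESC;
Once the names are one per row, the GROUP BY query answers the original question directly, with no string splitting needed.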

Unmatched Query with Limits

I have a database that uses a unique ID for each transaction. The transaction ID is the last two digits of a year followed by a four-digit sequential number (e.g. 0100 to 9999). That number resets back to 0100 at the start of each year. Not all numbers are used each year. For example, the last transaction in 2012 was 12-0409; in 2011 it was 11-0500. These numbers are not currently generated in the database but are created manually. I am in the process of getting them to switch to using automation, but in the meantime I have to create patches to fix errors.
In the database, I have one table and one query. The query ([Offer Check]) lists the Transaction ID ([HL#]) and shows just the last four digits in two formats, one as a number ([NumList]) and one as text ([TextList]). The table is a basic table that lists all the numbers between 0100 and 9999. I am trying to create a query that allows me to identify which Transaction IDs are missing, i.e. I have 13-0250 and 13-0252 but not 13-0251. I can create the query that identifies which numbers are missing; however, it also lists all the numbers past the latest Transaction ID. How can I limit the query to the current maximum Transaction ID?
This is what I have so far.
SELECT YearlyOfferIds.YOID
FROM YearlyOfferIds
LEFT JOIN [Offer Check]
ON YearlyOfferIds.[YOID] = [Offer Check].[TextList]
WHERE [Offer Check].TextList Is Null;
And I'm trying to add this, or something that does the same thing.
SELECT Max([Offer Check].NumList) AS MaxOfNumList
FROM [Offer Check];
Your second query, SELECT Max(..., can be translated into a DMax expression.
DMax("NumList", "Offer Check")
My hunch is you can use that DMax in your first query's WHERE clause to limit the rows returned from YearlyOfferIds. Unfortunately, I don't know the name of the YearlyOfferIds field which you want to compare with the maximum [Offer Check].NumList. So I'll just call that field some_field.
WHERE
YearlyOfferIds.some_field <= DMax("NumList", "Offer Check")
AND [Offer Check].TextList Is Null
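Putting the two together, the combined query would look something like this; some_field is still a placeholder for whichever YearlyOfferIds field holds the number to compare:
SELECT YearlyOfferIds.YOID
FROM YearlyOfferIds
LEFT JOIN [Offer Check]
ON YearlyOfferIds.[YOID] = [Offer Check].[TextList]
WHERE YearlyOfferIds.some_field <= DMax("NumList", "Offer Check")
AND [Offer Check].TextList Is Null;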

Generate a query that shows how many times questions were answered wrong

I have a table named countwronganswer with columns cwa_id and question_num. How can I generate, with a query, a table that shows two columns: one listing all the question_num values and a second listing the number of cwa_id entries related to each question_num?
Question Number | Total # of Mistakes
1               | 12
2               | 22
...etc
ATTENTION: This question was asked without awareness that COUNT or GROUP BY existed, because of my knowledge level at the time. COUNT() and GROUP BY were the key to generating the second column of totals, which I was not fully aware of; therefore any attempt at that point in time to write the code for the data would have been close to meaningless. Vote up if you think this is useful or it resolved your issue.
Probably something like this
SELECT question_num, COUNT(cwa_id) total_mistakes
FROM countwronganswer
GROUP BY question_num
SELECT question_num, COUNT(cwa_id)
FROM tableName
GROUP BY question_num

Storing ids in a MySQL link table

I have a table "link_tabl" in which I want to link three other tables by id. So in every row I have a triplets (id_1, id_2, id_3). I could create for every element of the triplet a column and everything would be fine.
But I want more: =)
I need to respect one more "dimension". There is an Algorthm who creates the triplets (the linkings between the tables). The algorithm sometimes outputs different linkings.
Example:
table_person represents a person.
table_task represents a task.
table_loc represents a location.
So a triplet of ids (p, t, l) means: a certain person did something at some location.
The tuple (person, task) is not changed by the algorithm; it is given. The algorithm outputs a location l for a tuple (p, t). But sometimes the algorithm determines different locations for such a tuple. I want to store in a table the last 10 triplets for every tuple (person, task).
What would be the best approach for that?
I thought of something like:
IF there is a tuple (p, t) ALREADY stored in link_table, ADD the id of the location into the next free slot (column) of the row.
If all 10 columns are already full, delete the first value, move every value from column i to column i-1, and store the new value in the last column.
ELSE add a new row.
But I don't know if this is a good approach and, if it is, how to realise it...
Own partial solution
I figured out that I could make two columns: one which stores the author id, and one which stores the task id. And by
...
UNIQUE INDEX (auth_id, task_id)
...
I could index them. So now I just have to figure out how to move values from column i to i-1 elegantly. =)
Kind regards
Aufwind
I would store the output of the algorithm in rows, with a date indicator. The requirement to only consider the last 10 records sounds fairly arbitrary - and I wouldn't enshrine it in my column layout. It also makes some standard relational tools redundant - for instance, the question "how many locations exist for person x and task y" couldn't be answered with a "count", but only by looking at which column is null.
So, I'd recommend something like:
personID | taskID | locationID | dateCreated
1        | 1      | 1          | 1 April 20:20:10
1        | 1      | 2          | 1 April 20:20:11
1        | 1      | 3          | 1 April 20:20:12
The "only 10" requirement could be enforced by using "top 10" in select queries; you could even embed that in a view if necessary.