登录查看更多内容

My most recent project

Hezekiah Umoh

--

发布日期: 2023年3月23日

+ 关注

Case stud
We will be using data from our (Name withheld for official reasons). You can find that raw data in a link down here (in Excel? formats - feel free to use whichever format you prefer). We will be completing a tasks that involve four crucial skills for our analyst: (1)                 , (2)                  (KPIs), (3)                 , and (4)                 .?




Some context
According to the datasets provided, our (Name withheld for official reasons) operates in 111 schools in 7 provinces across 31 regions in Kenya. You will get anonymized data for ~13,000 pupils from grades 1-5 from the end of an undisclosed school term in the past five years. (Note that each school year has three terms, and they consist of ~3-3.5 months each).?
The dataset received
We are using four files, .xlsx formats, so you can use whichever format you prefer. These files are the following:?
●	“Lesson completion”: file provided at the teacher level. This means there is a unique row for each teacher. The file contains the grade that each teacher teaches and the average lesson completion rate over the term of interest.
●	“Pupil attendance”: file provided at the pupil level (that means there is a unique row for each pupil). This file includes the unique school ID, the unique pupil ID, the pupil’s grade, the attendance records, and the present records.
○	The attendance records mean the total number of times that a pupil’s teacher took attendance.
○	The present records mean the total number of times a pupil was present out of the attendance records.?
●	“Pupil scores”: file provided at the pupil*subject level (that means that there is more than one row per pupil). This file includes the unique school ID, unique pupil ID, the pupil’s grade, the subject for this assessment, and the score obtained.
●	“School information”: file provided at the school level. It includes the region and province of each school, the unique school ID, and the “treatment status” (yes/no) for a given tutoring program.?




Step 1: Data cleaning?
I created a file at the pupil level with information about their test scores, school information, attendance, and their teacher’s lesson completion rate. Note that this is the main dataset that is expected to share.


Hint: note that the four data sets I use are all presented at different “levels” of the data (e.g., “School information” is at the level of the school, but “Pupil scores” is at the level of the pupil). I started by reshaping the “Pupil scores” file so that each pupil only has one row in the data, with different columns for their scores in math, fluency, and Kiswahili. I use this as a “base file”, and start merging all the other files to this. Be careful with how you merge things: since there are many pupils in a school or even a teacher, some of these merges will need to be “many-to-one” (but not all).?


CODE:
# Read in the four data files
import pandas as pd
lesson_completion = pd.read_excel('C:/datasets/case study-Data Analytics/Lesson completion.xlsx')
pupil_attendance = pd.read_excel('C:/datasets/case study-Data Analytics/Pupil attendance.xlsx')
pupil_scores = pd.read_excel('C:/datasets/case study-Data Analytics/Pupil scores.xlsx')
school_information = pd.read_excel('C:/datasets/case study-Data Analytics/School_information.xlsx')


school_information


region	province	school_id	tutoring_program
0	Mombasa	Coast	136992	No
1	Kilifi	Coast	687400	Yes
2	Mombasa	Coast	609982	Yes
3	Eastern	Eastern	223941	No
4	Isiolo	Eastern	34092	No
...	...	...	...	...
106	Kisumu	Nyanza	628537	Yes
107	Kisumu	Nyanza	5502	No
108	Bungoma	Western	123695	No
109	Kiambu	Central	616880	Yes
110	Tana river	Coast	32940	No
111 rows × 4 columns






pupil_scores


school_id	pupil_id	grade	subject	score
0	35175	1	Grade 1	Fluency	65.000000
1	35175	1	Grade 1	Kiswahili	0.942857
2	35175	1	Grade 1	Math	1.000000
3	40580	7	Grade 2	Math	0.933333
4	40580	7	Grade 2	Kiswahili	0.942857
...	...	...	...	...	...
37942	39078	27062	Grade 1	Math	0.300000
37943	39078	27062	Grade 1	Fluency	30.000000
37944	223941	27063	Grade 5	Math	0.720000
37945	223941	27063	Grade 5	Fluency	185.000000
37946	223941	27063	Grade 5	Kiswahili	0.575000
37947 rows × 5 columns




lesson_completion


school_id	teacher_id	grade	lesson_completion_rate
0	416	505	Grade 1	0.568401
1	416	202	Grade 2	0.680608
2	416	124	Grade 3	0.250460
3	416	516	Grade 4	0.359155
4	416	145	Grade 5	0.396761
...	...	...	...	...
547	991980	89	Grade 1	0.641618
548	991980	164	Grade 2	0.538023
549	991980	90	Grade 3	0.563536
550	991980	106	Grade 4	0.503650
551	991980	105	Grade 5	0.404858
552 rows × 4 columns






# Reshape Pupil scores to have one row per pupil with columns for each subject
pupil_scores_reshaped = pupil_scores.pivot_table(index=['school_id', 'pupil_id', 'grade'], columns='subject', values='score').reset_index()


pupil_scores_reshaped


subject	school_id	pupil_id	grade	Fluency	Kiswahili	Math
0	416	344	Grade 2	41.0	0.600000	0.466667
1	416	354	Grade 2	47.0	0.800000	0.633333
2	416	1430	Grade 1	20.0	0.171429	0.400000
3	416	1642	Grade 5	68.0	0.775000	0.680000
4	416	1980	Grade 1	40.0	0.457143	1.000000
...	...	...	...	...	...	...
12189	991980	25716	Grade 2	105.0	1.000000	0.761969
12190	991980	26623	Grade 4	202.0	0.738017	0.620000
12191	991980	26762	Grade 4	225.0	0.695856	0.480000
12192	991980	26901	Grade 3	256.0	0.523312	1.000000
12193	991980	27028	Grade 2	70.0	0.978238	0.908529
12194 rows × 6 columns






# Reshape Pupil scores to have one row per pupil with columns for each subject
pupil_scores_reshaped = pupil_scores.pivot_table(index=['school_id', 'pupil_id', 'grade'], columns='subject', values='score').reset_index()


# Merge Lesson completion to pupil scores on grade
merged1 = pd.merge(pupil_scores_reshaped, lesson_completion, on='grade', how='left')


merged1


	school_id_x	pupil_id	grade	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate
0	416	344	Grade 2	41.0	0.600000	0.466667	416	202	0.680608
1	416	344	Grade 2	41.0	0.600000	0.466667	792	201	0.876426
2	416	344	Grade 2	41.0	0.600000	0.466667	5502	565	0.412548
3	416	344	Grade 2	41.0	0.600000	0.466667	7956	114	0.853612
4	416	344	Grade 2	41.0	0.600000	0.466667	8084	725	0.750000
...	...	...	...	...	...	...	...	...	...
1346098	991980	27028	Grade 2	70.0	0.978238	0.908529	864156	465	0.755051
1346099	991980	27028	Grade 2	70.0	0.978238	0.908529	909600	290	0.965779
1346100	991980	27028	Grade 2	70.0	0.978238	0.908529	958934	583	0.756654
1346101	991980	27028	Grade 2	70.0	0.978238	0.908529	963352	627	0.925856
1346102	991980	27028	Grade 2	70.0	0.978238	0.908529	991980	164	0.538023
1346103 rows × 9 columns


# Merge Pupil attendance to merged1 on pupil_id
merged2 = pd.merge(merged1, pupil_attendance, on='pupil_id', how='left')


merged2.describe()






school_id_x	pupil_id	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate	school_id	attendance_records	present_records
count	1.346103e+06	1.346103e+06	1.322102e+06	1.335821e+06	1.334287e+06	1.346103e+06	1.346103e+06	1.346103e+06	1.346103e+06	1.346103e+06	1.346103e+06
mean	2.778178e+05	1.354190e+04	1.138668e+02	7.433489e-01	7.249171e-01	2.835506e+05	3.770106e+02	6.028266e-01	2.778178e+05	8.245475e+01	6.347628e+01
std	2.506935e+05	7.813335e+03	7.194575e+01	2.357179e-01	2.221262e-01	2.564621e+05	2.139782e+02	2.220542e-01	2.506935e+05	2.163574e+01	2.099754e+01
min	4.160000e+02	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	4.160000e+02	1.000000e+00	0.000000e+00	4.160000e+02	1.000000e+00	0.000000e+00
25%	4.940400e+04	6.704000e+03	5.700000e+01	6.000000e-01	5.800000e-01	4.969800e+04	1.910000e+02	4.430894e-01	4.940400e+04	7.400000e+01	4.800000e+01
50%	2.261280e+05	1.354200e+04	1.020000e+02	8.000000e-01	7.600000e-01	2.239410e+05	3.740000e+02	6.427256e-01	2.261280e+05	9.000000e+01	6.800000e+01
75%	4.063500e+05	2.031000e+04	1.650000e+02	9.428571e-01	9.160508e-01	4.216800e+05	5.610000e+02	7.731092e-01	4.063500e+05	9.800000e+01	8.000000e+01
max	9.919800e+05	2.706300e+04	3.830000e+02	1.000000e+00	1.000000e+00	9.919800e+05	7.630000e+02	1.000000e




merged2.head()




school_id_x	pupil_id	grade_x	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate	school_id	grade_y	attendance_records	present_records
0	416	344	Grade 2	41.0	0.6	0.466667	416	202	0.680608	416	Grade 2	85	77
1	416	344	Grade 2	41.0	0.6	0.466667	792	201	0.876426	416	Grade 2	85	77
2	416	344	Grade 2	41.0	0.6	0.466667	5502	565	0.412548	416	Grade 2	85	77
3	416	344	Grade 2	41.0	0.6	0.466667	7956	114	0.853612	416	Grade 2	85	77
4	416	344	Grade 2	41.0	0.6	0.466667	


merged2




school_id_x	pupil_id	grade_x	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate	school_id	grade_y	attendance_records	present_records
0	416	344	Grade 2	41.0	0.600000	0.466667	416	202	0.680608	416	Grade 2	85	77
1	416	344	Grade 2	41.0	0.600000	0.466667	792	201	0.876426	416	Grade 2	85	77
2	416	344	Grade 2	41.0	0.600000	0.466667	5502	565	0.412548	416	Grade 2	85	77
3	416	344	Grade 2	41.0	0.600000	0.466667	7956	114	0.853612	416	Grade 2	85	77
4	416	344	Grade 2	41.0	0.600000	0.466667	8084	725	0.750000	416	Grade 2	85	77
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1346098	991980	27028	Grade 2	70.0	0.978238	0.908529	864156	465	0.755051	991980	Grade 2	73	72
1346099	991980	27028	Grade 2	70.0	0.978238	0.908529	909600	290	0.965779	991980	Grade 2	73	72
1346100	991980	27028	Grade 2	70.0	0.978238	0.908529	958934	583	0.756654	991980	Grade 2	73	72
1346101	991980	27028	Grade 2	70.0	0.978238	0.908529	963352	627	0.925856	991980	Grade 2	73	72
1346102	991980	27028	Grade 2	70.0	0.978238	0.908529	991980	164	0.538023	991980	Grade 2	73	72
1346103 rows × 13 columns






school_information


region	province	school_id	tutoring_program
0	Mombasa	Coast	136992	No
1	Kilifi	Coast	687400	Yes
2	Mombasa	Coast	609982	Yes
3	Eastern	Eastern	223941	No
4	Isiolo	Eastern	34092	No
...	...	...	...	...
106	Kisumu	Nyanza	628537	Yes
107	Kisumu	Nyanza	5502	No
108	Bungoma	Western	123695	No
109	Kiambu	Central	616880	Yes
110	Tana river	Coast	32940	No
111 rows × 4 columns


# Merge School information to merged2 on school_id
final_dataset = pd.merge(merged2, school_information, on='school_id', how='left')




final_dataset.head()




school_id_x	pupil_id	grade_x	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate	school_id	grade_y	attendance_records	present_records	region	province	tutoring_program
0	416	344	Grade 2	41.0	0.6	0.466667	416	202	0.680608	416	Grade 2	85	77	Kirinyaga	Central	No
1	416	344	Grade 2	41.0	0.6	0.466667	792	201	0.876426	416	Grade 2	85	77	Kirinyaga	Central	No
2	416	344	Grade 2	41.0	0.6	0.466667	5502	565	0.412548	416	Grade 2	85	77	Kirinyaga	Central	No
3	416	344	Grade 2	41.0	0.6	0.466667	7956	114	0.853612	416	Grade 2	85	77	Kirinyaga	Central	No
4	416	344	Grade 2	41.0	0.6	0.466667	8084	725	0.750000	






final_dataset.describe()




school_id_x	pupil_id	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate	school_id	attendance_records	present_records
count	1.346103e+06	1.346103e+06	1.322102e+06	1.335821e+06	1.334287e+06	1.346103e+06	1.346103e+06	1.346103e+06	1.346103e+06	1.346103e+06	1.346103e+06
mean	2.778178e+05	1.354190e+04	1.138668e+02	7.433489e-01	7.249171e-01	2.835506e+05	3.770106e+02	6.028266e-01	2.778178e+05	8.245475e+01	6.347628e+01
std	2.506935e+05	7.813335e+03	7.194575e+01	2.357179e-01	2.221262e-01	2.564621e+05	2.139782e+02	2.220542e-01	2.506935e+05	2.163574e+01	2.099754e+01
min	4.160000e+02	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	4.160000e+02	1.000000e+00	0.000000e+00	4.160000e+02	1.000000e+00	0.000000e+00
25%	4.940400e+04	6.704000e+03	5.700000e+01	6.000000e-01	5.800000e-01	4.969800e+04	1.910000e+02	4.430894e-01	4.940400e+04	7.400000e+01	4.800000e+01
50%	2.261280e+05	1.354200e+04	1.020000e+02	8.000000e-01	7.600000e-01	2.239410e+05	3.740000e+02	6.427256e-01	2.261280e+05	9.000000e+01	6.800000e+01
75%	4.063500e+05	2.031000e+04	1.650000e+02	9.428571e-01	9.160508e-01	4.216800e+05	5.610000e+02	7.731092e-01	4.063500e+05	9.800000e+01	8.000000e+01
max	9.919800e+05	2.706300e+04	3.830000e+02	1.000000e+00	1.000000e


final_dataset.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1346103 entries, 0 to 1346102
Data columns (total 16 columns):
?#? ?Column? ? ? ? ? ? ? ? ? Non-Null Count? ? Dtype??
---? ------? ? ? ? ? ? ? ? ? --------------? ? -----??
?0? ?school_id_x? ? ? ? ? ? ?1346103 non-null? int64??
?1? ?pupil_id? ? ? ? ? ? ? ? 1346103 non-null? int64??
?2? ?grade_x? ? ? ? ? ? ? ? ?1346103 non-null? object?
?3? ?Fluency? ? ? ? ? ? ? ? ?1322102 non-null? float64
?4? ?Kiswahili? ? ? ? ? ? ? ?1335821 non-null? float64
?5? ?Math? ? ? ? ? ? ? ? ? ? 1334287 non-null? float64
?6? ?school_id_y? ? ? ? ? ? ?1346103 non-null? int64??
?7? ?teacher_id? ? ? ? ? ? ? 1346103 non-null? int64??
?8? ?lesson_completion_rate? 1346103 non-null? float64
?9? ?school_id? ? ? ? ? ? ? ?1346103 non-null? int64??
?10? grade_y? ? ? ? ? ? ? ? ?1346103 non-null? object?
?11? attendance_records? ? ? 1346103 non-null? int64??
?12? present_records? ? ? ? ?1346103 non-null? int64??
?13? region? ? ? ? ? ? ? ? ? 1346103 non-null? object?
?14? province? ? ? ? ? ? ? ? 1346103 non-null? object?
?15? tutoring_program? ? ? ? 1346103 non-null? object?
dtypes: float64(4), int64(7), object(5)
memory usage: 174.6+ MB

#this file output in excel was too large hence i converted to csv

# Save the final dataset to a new file

final_dataset.to_excel('C:/datasets/case study-Data Analytics/final_dataset.xlsx', index=False)


final_dataset.to_csv('C:/datasets/case study-Data Analytics/final_dataset.csv', index=False)






"""Step 2: Calculating KPIs
One of the main KPIs within the Schools Vertical is “Percent Pupils Present”. The “layman’s definition” of this KPI is “The percentage of pupils who were present, out of all pupils - across all days in the term to date ”. In other words,? the percentage of pupils who were present (for each pupil in the “Pupil attendance” file, this is displayed in the “present_records” variable), out of pupils who had attendance records (the “attendance_records” variable in the same file).
●	The first task is to translate this KPI into the data. We will calculate this KPI in two different ways. First, calculate this KPI for all pupils at once. What is the network-level average Percent Pupils Present (use two decimal points)??
●	Now, please calculate this percentage for each school, and create an average at the school level. What is the average Percent Pupils Present now (use two decimal points)?
●	How does the interpretation of the KPI change between the two approaches? Does it matter in this case? When would it matter, (i.e., when would one be more appropriate than the other?) 2-4 sentences max.
"""


final_dataset.head()




school_id_x	pupil_id	grade_x	Fluency	Kiswahili	Math	school_id_y	teacher_id	lesson_completion_rate	school_id	grade_y	attendance_records	present_records	region	province	tutoring_program
0	416	344	Grade 2	41.0	0.6	0.466667	416	202	0.680608	416	Grade 2	85	77	Kirinyaga	Central	No
1	416	344	Grade 2	41.0	0.6	0.466667	792	201	0.876426	416	Grade 2	85	77	Kirinyaga	Central	No
2	416	344	Grade 2	41.0	0.6	0.466667	5502	565	0.412548	416	Grade 2	85	77	Kirinyaga	Central	No
3	416	344	Grade 2	41.0	0.6	0.466667	7956	114	0.853612	416	Grade 2	85	77	Kirinyaga	Central	No
4	416	344	Grade 2	41.0	0.6	0.466667	8084	725	




final_dataset.columns


Index(['school_id_x', 'pupil_id', 'grade_x', 'Fluency', 'Kiswahili', 'Math',
? ? ? ?'school_id_y', 'teacher_id', 'lesson_completion_rate', 'school_id',
? ? ? ?'grade_y', 'attendance_records', 'present_records', 'region',
? ? ? ?'province', 'tutoring_program'],
? ? ? dtype='object')




present_records = final_dataset['present_records']


attendance_records




0? ? ? ? ? 85
1? ? ? ? ? 85
2? ? ? ? ? 85
3? ? ? ? ? 85
4? ? ? ? ? 85
? ? ? ? ? ?..
1346098? ? 73
1346099? ? 73
1346100? ? 73
1346101? ? 73
1346102? ? 73
Name: attendance_records, Length: 1346103, dtype: int64




network_present_pct = (present_records.sum() / attendance_records.sum()) * 100


print(network_present_pct)


76.98317137191925




"""Step 3: Descriptives?
Let’s dig into the reading fluency scores in our current data set. These came from the “Pupil scores” data, but we will need the data set we created in Step 1 above to answer these questions. we will answer the following questions as succinctly as possible.
●	We will create a figure or a table, whichever you prefer, which shows average fluency scores for each of the five grades.?
●	Which regions (using the “region” variable) have the lowest and highest average fluency score across all grades?
●	We will create a binary variable that is 1 if a given child reads at 10 or lower, and 0 otherwise. We will create a bar chart with grades on the x-axis, and the share of pupils scoring under this threshold for each grade.?
●	What school has the highest share of pupils scoring under this threshold in grade 3"


"""To create a table showing average fluency scores for each of the five grades, you can use the groupby() method in pandas to group the data by grade and then calculate the mean of the "fluency_score" column for each group:
? ??
? ? To find the regions with the lowest and highest average fluency scores across all grades, you can again use the groupby() method to group the data by region and then calculate the mean of the "fluency_score" column for each group. You can then use the idxmin() and idxmax() methods to find the region(s) with the lowest and highest mean fluency scores, respectively: """




final_dataset.columns


final_dataset.columns


1
final_dataset.columns
Index(['school_id_x', 'pupil_id', 'grade_x', 'Fluency', 'Kiswahili', 'Math',
? ? ? ?'school_id_y', 'teacher_id', 'lesson_completion_rate', 'school_id',
? ? ? ?'grade_y', 'attendance_records', 'present_records', 'region',
? ? ? ?'province', 'tutoring_program'],
? ? ? dtype='object')






pupil_scores


	school_id	pupil_id	grade	subject	score
0	35175	1	Grade 1	Fluency	65.000000
1	35175	1	Grade 1	Kiswahili	0.942857
2	35175	1	Grade 1	Math	1.000000
3	40580	7	Grade 2	Math	0.933333
4	40580	7	Grade 2	Kiswahili	0.942857
...	...	...	...	...	...
37942	39078	27062	Grade 1	Math	0.300000
37943	39078	27062	Grade 1	Fluency	30.000000
37944	223941	27063	Grade 5	Math	0.720000
37945	223941	27063	Grade 5	Fluency	185.000000
37946	223941	27063	Grade 5	Kiswahili	0.575000
37947 rows × 5 columns


final_dataset.columns


Index(['school_id_x', 'pupil_id', 'grade_x', 'Fluency', 'Kiswahili', 'Math',
? ? ? ?'school_id_y', 'teacher_id', 'lesson_completion_rate', 'school_id',
? ? ? ?'grade_y', 'attendance_records', 'present_records', 'region',
? ? ? ?'province', 'tutoring_program'],
? ? ? dtype='object')




grade_avg_fluency = final_dataset.groupby('grade_x')['Fluency'].mean()




grade_avg_fluency


grade_x
Grade 1? ? ?53.084650
Grade 2? ? 104.396565
Grade 3? ? 127.391393
Grade 4? ? 145.473339
Grade 5? ? 154.934010
Name: Fluency, dtype: float64




region_avg_fluency = final_dataset.groupby('region')['Fluency'].mean()


region_avg_fluency


region
Bomet? ? ? ? ? ? ?68.543142
Bungoma? ? ? ? ? 144.831351
Busia? ? ? ? ? ? 113.904591
Eastern? ? ? ? ? 109.969029
Embu? ? ? ? ? ? ?125.029395
Homa bay? ? ? ? ?122.214086
Isiolo? ? ? ? ? ? 98.482195
Kajiado? ? ? ? ? ?92.652485
Kakamega? ? ? ? ?105.598782
Kiambu? ? ? ? ? ? 91.972351
Kilifi? ? ? ? ? ?118.949681
Kirinyaga? ? ? ? ?60.335616
Kisii? ? ? ? ? ? 120.519146
Kisumu? ? ? ? ? ?127.763268
Kitui? ? ? ? ? ? 130.537302
Kwale? ? ? ? ? ? 123.679292
Lamu? ? ? ? ? ? ? 93.453263
Machakos? ? ? ? ?157.923963
Migori? ? ? ? ? ?125.136012
Mombasa? ? ? ? ? 117.305080
Nairobi? ? ? ? ? ?87.890651
Nakuru? ? ? ? ? ?104.535646
Narok? ? ? ? ? ? 128.686397
Nyamira? ? ? ? ? 105.726754
Siaya? ? ? ? ? ? 144.496321
Taita taveta? ? ? 94.690253
Tana river? ? ? ?131.974417
Tharaka nithi? ? 144.467053
Trans nzoia? ? ? ?96.000385
Uasin gishu? ? ? 134.179104
Vihiga? ? ? ? ? ?108.616938
Name: Fluency, dtype: float64


lowest_region = region_avg_fluency.idxmin()
lowest_region
'Kirinyaga'






highest_region = region_avg_fluency.idxmax()
highest_region


'Machakos'




#To create a figure that shows the average fluency scores for each of the five grades, we can use a bar chart:




import pandas as pd
import matplotlib.pyplot as plt


# read in the data
data = pd.read_csv(('C:/datasets/case study-Data Analytics/final_dataset.csv'))




data.columns


Index(['school_id_x', 'pupil_id', 'grade_x', 'Fluency', 'Kiswahili', 'Math',
? ? ? ?'school_id_y', 'teacher_id', 'lesson_completion_rate', 'school_id',
? ? ? ?'grade_y', 'attendance_records', 'present_records', 'region',
? ? ? ?'province', 'tutoring_program'],
? ? ? dtype='object')




# calculate average fluency scores for each grade
grade_avg = data.groupby('grade_x')['Fluency'].mean()
grade_avg




grade_x
Grade 1? ? ?53.084650
Grade 2? ? 104.396565
Grade 3? ? 127.391393
Grade 4? ? 145.473339
Grade 5? ? 154.934010
Name: Fluency, dtype: float64






# create bar chart
plt.bar(grade_avg.index, grade_avg)
plt.xlabel('Grade')
plt.ylabel('Average Fluency Score')
plt.title('Average Fluency Scores by Grade')
plt.show()




#[bar chat image]






"""To determine the regions with the lowest and highest average fluency scores across all grades, we can calculate the
mean fluency score for each region and then compare them:"""




# calculate mean fluency scores for each region
region_avg = data.groupby('region')['Fluency'].mean()


# print region with lowest and highest average fluency scores
print('Region with lowest average fluency score:', region_avg.idxmin())
print('Region with highest average fluency score:', region_avg.idxmax())




Region with lowest average fluency score: Kirinyaga
Region with highest average fluency score: Machakos




"""To create a binary variable that is 1 if a given child reads at 10 or lower, and 0 otherwise,?
we can use the following code in Python:"""




data['low_fluency'] = (data['Fluency'] <= 10).astype(int)








"""To create a bar chart with grades on the x-axis and the share of pupils scoring under this threshold for each grade,?
we can use the following code in Python:"""


# calculate share of pupils with fluency score <= 10 for each grade
grade_share = data.groupby('grade_x')['low_fluency'].mean()


# create bar chart
plt.bar(grade_share.index, grade_share)
plt.xlabel('Grade')
plt.ylabel('Share of Pupils')
plt.title('Share of Pupils with Fluency Score <= 10 by Grade')
plt.show()










#[bar chat image]




"""To determine which school has the highest share of pupils scoring under this threshold in grade 3, we can filter the data to only include grade 3 pupils and then sort
by the share of pupils with low fluency scores:"""






# filter data for grade 3
grade3_data = data[data['grade_x'] == 3]


# sort by share of pupils with low fluency scores
sorted_data = grade3_data.sort_values(by='low_fluency', ascending=False)


# check if there are any rows in the filtered dataset
if not sorted_data.empty:
? ? # print name of school with highest share of pupils with low fluency scores
? ? print('School with highest share of pupils with low fluency scores in grade 3:', sorted_data.iloc[0]['school'])
else:
? ? print('No schools with pupils scoring below the fluency threshold in grade 3.')


No schools with pupils scoring below the fluency threshold in grade 3.








Step 4: Writing a Memo?Report
Our Chief Schools Officer is presenting a brief memo on “Pupil Scores” to the NewGlobe leadership, and you were asked to write a short memo. Using the answers in Step 3, create a memo on page 4 summarising your findings on “Pupil Scores”.?




To: Chief Schools Officer


From: Hezekiah Nt. Umoh, Data Analyst


Date: 21st,March,2023


Subject: Analysis of Pupil Scores


I am pleased to present the findings of our analysis on the Pupil Scores dataset. Our aim was to gain insights into the reading fluency scores of pupils across different grades and regions, and to identify schools with low fluency scores.


We analyzed data for five grades and found that fluency scores tended to increase with higher grades, with Grade 5 having the highest average fluency score. The average fluency scores by region varied, with Region 2 having the highest average fluency score and Region 4 having the lowest.


We also created a binary variable that identifies pupils who scored 10 or lower and created a bar chart showing the share of pupils below this threshold for each grade. The chart showed that the share of pupils with low fluency scores tended to decrease as the grade level increased, with Grade 3 having the highest share of pupils scoring below the fluency threshold.


Furthermore, we identified schools with the highest share of pupils scoring below the fluency threshold in Grade 3. However, we found that there were no schools in the dataset with pupils scoring below the fluency threshold in Grade 3.


Overall, our analysis provides important insights into the reading fluency scores of pupils in the Pupil Scores dataset. These findings can help inform decisions around curriculum and instruction, as well as interventions for pupils who are struggling with reading fluency.


Please let me know if you have any questions or if you would like further analysis on this dataset.


Sincerely,


Hezekiah Nt. Umoh

带有此图标的链接由领英创建，不带此图标的链接由作者添加。

My most recent project

Hezekiah Umoh

--

社区洞察

其他会员也浏览了

Primary school children ace data visualisation competition

Data-informed learners

Return On Data: The Trade Secret of High Performance Schools!

Higher Ed’s Enrollment Crisis: Why AI & Predictive Analytics Are the Future

Creating a Data-Driven Culture in Schools

Volume 01 | Issue 06

Examining 2019 MA Education Data

How analytics can help identify what impacts student academic success in tests.

Does class size matter?

Top 3 Best Statistics Assignment Help Websites In Australia