The application of Python for analysis and visualization of survey form data and the use of Spearman's correlation
I analyzed the data from a survey form focused on educational topics. For professional reasons, I can't disclose the specific questions. However, I will describe the methodology used to process the form data.
This methodology includes analyzing the frequency of responses and applying Spearman’s correlation to assess the relationships between variables.
import pandas as pd #1
import seaborn as sns #2
import matplotlib.pyplot as plt #3
# Read the file
data = pd.read_csv('pesquisa.csv') #4
data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].str.title().str.strip()
correcoes = {
"Matemática": "Cadeira de Matemática",
"Cibernética": "Cadeira de Cibernética"
}
data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].replace(correcoes)
data['Qual seu Curso/Cadeira'].unique() # Show the unique values in the column after standardization
# Remove the 'Carimbo de data/hora' (timestamp) column
data = data.drop('Carimbo de data/hora', axis=1)
# Display the first rows to verify
data.head()
# Replace NaN with 'Não Respondido' (not answered) in all columns
data = data.fillna('Não Respondido')
data.head()
data.rename(columns={
'Qual seu Curso/Cadeira': 'P1',
'Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira.': 'P2',
}, inplace=True)
print(data.columns)
# Compute percentages for question P2 (average number of military personnel assigned to set up a Formal Test Process in the course/department)
percentuais_p2 = data['P2'].value_counts(normalize=True) * 100
# Create a bar chart of the percentages
plt.figure(figsize=(10, 6))
sns.barplot(x=percentuais_p2.index, y=percentuais_p2.values)
plt.title('Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira-P2')
plt.xlabel('Quantidade de militares')
plt.ylabel('Percentual')
plt.show()
# Check the answer options in order to build the mappings
print(data['P2'].unique())
print(data['P3'].unique())
print(data['P4'].unique())
print(data['P15'].unique())
# Mappings
mapeamento_p2 = {'1 - 2': 1, '3 - 5': 2, '6 - 10': 3, 'Mais de 10': 4}
mapeamento_p3 = {'Menos de 2 horas': 1, '2 a 3 horas': 2, 'Mais de 3 horas': 3}
mapeamento_p4 = {'Menos de 1 horas': 1, '1 a 2 horas': 2, 'Mais de 2 horas': 3}
mapeamento_p15 = {'Sim': 1, 'Não': 0, 'Não tenho certeza': 0.5}
# Apply the mappings
data['P2_num'] = data['P2'].map(mapeamento_p2)
data['P3_num'] = data['P3'].map(mapeamento_p3)
data['P4_num'] = data['P4'].map(mapeamento_p4)
data['P15_num'] = data['P15'].map(mapeamento_p15)
correlacao_spearman = data[['P2_num', 'P3_num', 'P4_num', 'P15_num']].corr(method='spearman')
print(correlacao_spearman)
That is the complete code. Below, I explain what each part does.
The survey file was uploaded to Google Colab. Let's go through the code step by step.
import pandas as pd #1
import seaborn as sns #2
import matplotlib.pyplot as plt #3
# Read the file
data = pd.read_csv('pesquisa.csv') #4
Explanation:
#1 import pandas as pd: This line imports the Pandas library and names it as pd. Pandas is a Python library used for data manipulation and analysis. It provides data structures like DataFrames, which are essentially tables, making it easier to read, manipulate, and analyze tabular data.
#2 import seaborn as sns: Here, the Seaborn library is imported and named as sns. Seaborn is a data visualization library based on Matplotlib, providing a high-level interface for drawing more attractive and informative statistical graphics.
#3 import matplotlib.pyplot as plt: This line imports the pyplot module from the Matplotlib library and names it as plt. Matplotlib is a plotting library for Python, which allows for the creation of static, animated, and interactive graphics and visualizations.
#4 data = pd.read_csv('pesquisa.csv'): This line reads a CSV file named 'pesquisa.csv' using the read_csv function of the Pandas library and stores the result in a variable named data. The variable data now holds a DataFrame, which is a two-dimensional data structure with rows and columns, similar to a spreadsheet or a database table.
This code, therefore, sets up the environment for data analysis by importing the necessary libraries and reading data from a CSV file for further manipulation and analysis.
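To try this step without the original file, here is a minimal sketch that reads a small in-memory CSV instead. The CSV text and the name csv_texto are invented for illustration; only the two column names come from the code in this article.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'pesquisa.csv' (the real survey data is not public)
csv_texto = (
    "Carimbo de data/hora,Qual seu Curso/Cadeira\n"
    "2024/01/01 10:00:00,matemática\n"
    "2024/01/01 10:05:00,\n"
)

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_texto))

print(data.shape)         # (rows, columns)
print(data.isna().sum())  # empty cells are read as NaN
```

Checking the shape and the number of missing values right after loading is a quick way to confirm the file was read as expected.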
2. We begin processing the form. The 'Qual seu Curso/Cadeira' column needs to be fixed first, because users typed their answers without any standardization.
data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].str.title().str.strip()
Explanation:
2.1 - The expression data['Qual seu Curso/Cadeira'] is used to specify a particular column in the DataFrame named 'data': data is the DataFrame, and the square brackets with the column name select that column.
Therefore, data['Qual seu Curso/Cadeira'] is the way to reference and manipulate the column 'Qual seu Curso/Cadeira' in the DataFrame 'data'. By using this expression in conjunction with methods like .str.title() and .str.strip(), you are applying specific transformations to all the data in that column and updating the DataFrame with these new values.
2.2 - The methods .str.title() and .str.strip() are used in this line of code for data cleaning and standardization: .str.title() capitalizes the first letter of each word and lowercases the rest, so entries that differ only in letter case become identical, and .str.strip() removes leading and trailing whitespace.
In summary, these methods are used to ensure the quality and uniformity of the data. Clean and standardized data are fundamental for reliable and accurate analyses, as inconsistencies in data can lead to misinterpretations or technical difficulties during the analysis.
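The effect of the two methods can be seen on a few made-up entries. This is only a sketch; the values in cursos are hypothetical and not taken from the survey.

```python
import pandas as pd

# Hypothetical free-text answers with inconsistent case and stray spaces
cursos = pd.Series(['  matemática', 'MATEMÁTICA ', 'cadeira de cibernética'])

# Same chain as in the article: title-case first, then trim whitespace
limpo = cursos.str.title().str.strip()

print(limpo.tolist())
# ['Matemática', 'Matemática', 'Cadeira De Cibernética']
```

Note that .str.title() also capitalizes short words such as "de", so a fully typed name comes out as 'Cadeira De Cibernética'. If the standardized labels use a lowercase "de", such entries may need an extra correction.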
3. In this part of the code, a series of operations are performed to standardize and correct the entries in a specific column of a DataFrame. Let’s break it down:
correcoes = {
"Matemática": "Cadeira de Matemática",
"Cibernética": "Cadeira de Cibernética"
}
data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].replace(correcoes)
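data['Qual seu Curso/Cadeira'].unique() # Show the unique values in the column after standardization
3.1 - Creation of the correcoes Dictionary: each key is a value as it appears in the column, and each value is the standardized name that should replace it.
3.2 - Replacing Values in the DataFrame: .replace(correcoes) substitutes every entry that exactly matches a key with the corresponding value and leaves all other entries unchanged.
3.3 - Displaying Unique Values: .unique() returns the distinct values in the column, which makes it easy to check whether any non-standard entries remain.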
In summary, this part of the code is focused on cleaning and standardizing the data in the ‘Qual seu Curso/Cadeira’ column by replacing certain values with standardized ones and then verifying the result by listing all the unique values present in the column after these operations.
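A small sketch of the replacement step, using invented entries, shows that a dictionary passed to .replace() matches whole cell values only. The Series cursos is hypothetical; the dictionary is the one from the article.

```python
import pandas as pd

# Hypothetical column values after title-casing and stripping
cursos = pd.Series([
    'Matemática',
    'Cadeira de Matemática',
    'Cibernética',
    'Matemática Aplicada',
])

correcoes = {
    'Matemática': 'Cadeira de Matemática',
    'Cibernética': 'Cadeira de Cibernética',
}

# Only cells equal to a key are replaced; partial matches are left alone
padronizado = cursos.replace(correcoes)

print(padronizado.unique())
```

Here 'Matemática Aplicada' is untouched because it is not exactly equal to any key, which is why checking .unique() afterwards is worthwhile.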
4. In this part of the code, two main operations are performed on the DataFrame ‘data’:
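# Remove the 'Carimbo de data/hora' (timestamp) column
data = data.drop('Carimbo de data/hora', axis=1)
# Display the first rows to verify
data.head()
4.1 - Removal of the 'Carimbo de data/hora' Column: .drop('Carimbo de data/hora', axis=1) returns a new DataFrame without that column. The argument axis=1 indicates that a column, not a row, is being removed.
4.2 - Displaying the First Rows of the DataFrame: data.head() shows the first five rows by default, which is enough to confirm that the column is gone.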
In summary, this part of the code focuses on cleaning the DataFrame ‘data’ by removing a specific column that may not be necessary for subsequent analysis and then visually verifying the result of this operation by displaying the first rows of the modified DataFrame
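As a sketch of the same operation on invented data, the snippet below uses the columns= keyword, which is equivalent to passing axis=1. The DataFrame df and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical two-column form export
df = pd.DataFrame({
    'Carimbo de data/hora': ['2024/01/01 10:00:00'],
    'P1': ['Cadeira de Matemática'],
})

# drop returns a new DataFrame; df itself is not modified
sem_carimbo = df.drop(columns=['Carimbo de data/hora'])

# errors='ignore' makes the call safe to repeat, e.g. when a notebook cell is re-run
sem_carimbo = sem_carimbo.drop(columns=['Carimbo de data/hora'], errors='ignore')

print(list(sem_carimbo.columns))
```

Without errors='ignore', running the drop a second time on the already cleaned DataFrame raises a KeyError, which is a common surprise in notebooks such as Google Colab.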
5. This part of the code performs two main operations on the DataFrame data: it replaces NaN values and displays the first rows of the DataFrame.
# Replace NaN with 'Não Respondido' (not answered) in all columns
data = data.fillna('Não Respondido')
data.head()
5.1 - Replacing NaN with 'Não Respondido' in All Columns: .fillna('Não Respondido') returns a copy of the DataFrame in which every missing value is replaced by that string.
5.2 - Displaying the First Rows of the DataFrame: data.head() is called again to confirm that no NaN values remain in the rows shown.
In summary, this part of the code focuses on addressing missing data in the DataFrame data, replacing 'NaN' values with a more descriptive string ('N?o Respondido'), and then visually verifying the result of this operation by displaying the first rows of the modified DataFrame.
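The following sketch, on invented answers, counts the missing values before filling them, so the number of unanswered questions is known. The DataFrame df and the name preenchido are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; np.nan marks a question left blank
df = pd.DataFrame({
    'P2': ['1 - 2', np.nan, '3 - 5'],
    'P3': [np.nan, '2 a 3 horas', 'Menos de 2 horas'],
})

# Number of unanswered cells per question, before filling
faltantes = df.isna().sum()
print(faltantes)

# Same call as in the article, applied to every column at once
preenchido = df.fillna('Não Respondido')
print(preenchido)
```

After this step 'Não Respondido' is counted as a category of its own in frequency tables, which may or may not be what is wanted for a given chart.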
6. In this part of the code, we simplify the column names. Long question titles make the code confusing and hard to read, so we replaced the questions from the form with short labels like P1, P2, P3..., which are easier to reference in the code.
data.rename(columns={
'Qual seu Curso/Cadeira': 'P1',
'Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira.': 'P2',
}, inplace=True)
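print(data.columns)
In this part of the code, the DataFrame data is modified to rename some of its columns. The .rename() method is used with the columns parameter to specify the desired changes in the column names. Here's what each part does:
6.1 - Column Renaming: the dictionary passed to columns maps each original question text to its short label, and inplace=True applies the change directly to data instead of returning a new DataFrame.
6.2 - Displaying Updated Column Names: print(data.columns) lists the column names so the new labels can be checked.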
In summary, this part of the code is used to simplify the names of the columns, making the DataFrame easier to manipulate and understand, especially when dealing with long or complex column names.
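One possible refinement, sketched below on invented data, is to keep the renaming dictionary in a variable so that the full question text can still be looked up later, for example for chart titles. The names rotulos and titulos and the second column name are hypothetical.

```python
import pandas as pd

# Hypothetical form export with long question texts as column names
df = pd.DataFrame({
    'Qual seu Curso/Cadeira': ['Cadeira de Matemática'],
    'Pergunta longa hipotética': ['1 - 2'],
})

# Long question text -> short label
rotulos = {
    'Qual seu Curso/Cadeira': 'P1',
    'Pergunta longa hipotética': 'P2',
}
df = df.rename(columns=rotulos)

# Reverse lookup: short label -> original question text
titulos = {curto: longo for longo, curto in rotulos.items()}

print(list(df.columns))
print(titulos['P1'])
```

This keeps the code short while preserving the link between each label and the question it stands for.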
7. This code snippet analyzes and visualizes the data for question P2 of the survey, which asks about the "Average number of military personnel involved in setting up a Formal Test Process for your Course/department". Let's detail each part of the code:
# Compute percentages for question P2 (average number of military personnel assigned to set up a Formal Test Process in the course/department)
percentuais_p2 = data['P2'].value_counts(normalize=True) * 100
# Create a bar chart of the percentages
plt.figure(figsize=(10, 6))
sns.barplot(x=percentuais_p2.index, y=percentuais_p2.values)
plt.title('Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira-P2')
plt.xlabel('Quantidade de militares')
plt.ylabel('Percentual')
plt.show()
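7.1 - Calculating Percentages: value_counts(normalize=True) returns the proportion of responses in each category, and multiplying by 100 converts the proportions to percentages.
7.2 - Setting Up the Bar Chart: plt.figure(figsize=(10, 6)) creates a figure of 10 by 6 inches, sns.barplot draws one bar per category with the percentage as its height, the title and axis labels are set, and plt.show() displays the chart.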
In summary, this code calculates the percentage of responses for each category in question P2 and then creates a bar chart to visualize these percentages, making it easier to interpret the data. The chart helps to understand how responses to question P2 are distributed among different categories of the number of military personnel involved.
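The percentage calculation can be checked on a handful of invented answers. Because value_counts orders categories by frequency, this sketch also reindexes the result so the ranges appear in their natural order. The Series p2 and the list ordem are hypothetical.

```python
import pandas as pd

# Hypothetical P2 answers, invented for illustration
p2 = pd.Series([
    '1 - 2', '3 - 5', '1 - 2', 'Mais de 10',
    '3 - 5', '1 - 2', '6 - 10', '1 - 2',
])

# Share of each answer as a percentage of all responses
percentuais = p2.value_counts(normalize=True) * 100

# Restore the natural order of the ranges for the x axis
ordem = ['1 - 2', '3 - 5', '6 - 10', 'Mais de 10']
percentuais = percentuais.reindex(ordem)

print(percentuais)
```

Passing percentuais.index and percentuais.values to sns.barplot, as in the article, would then draw the bars from the smallest range to the largest.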
8. In this part of the code, we process the responses to apply Spearman’s correlation.
# Check the answer options in order to build the mappings
print(data['P2'].unique())
print(data['P3'].unique())
print(data['P4'].unique())
print(data['P15'].unique())
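# Mappings
mapeamento_p2 = {'1 - 2': 1, '3 - 5': 2, '6 - 10': 3, 'Mais de 10': 4}
mapeamento_p3 = {'Menos de 2 horas': 1, '2 a 3 horas': 2, 'Mais de 3 horas': 3}
mapeamento_p4 = {'Menos de 1 horas': 1, '1 a 2 horas': 2, 'Mais de 2 horas': 3}
mapeamento_p15 = {'Sim': 1, 'Não': 0, 'Não tenho certeza': 0.5}
# Apply the mappings
data['P2_num'] = data['P2'].map(mapeamento_p2)
data['P3_num'] = data['P3'].map(mapeamento_p3)
data['P4_num'] = data['P4'].map(mapeamento_p4)
data['P15_num'] = data['P15'].map(mapeamento_p15)
# Spearman correlation between the numeric columns
correlacao_spearman = data[['P2_num', 'P3_num', 'P4_num', 'P15_num']].corr(method='spearman')
print(correlacao_spearman)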
8.1 - This part of the code focuses on identifying all the different response options provided for questions P2, P3, P4, and P15 in the survey.
8.2 - After identifying the response options for each question, the code defines a mapping for each. This mapping is used to convert textual responses into numerical values, which facilitates subsequent statistical analysis.
Each mapping is then applied to their respective columns in the DataFrame to transform the textual responses into numerical values, facilitating analyses such as correlations, frequency analyses, and others.
In summary, this code is an essential preparation for data analysis, where the first part identifies the responses to be mapped, and the second part effectively performs the mapping, converting categorical data into numerical for subsequent analyses.
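The behavior of .map() with a dictionary can be sketched on invented P15 answers. The Series respostas is hypothetical; the mapping is the one used in the article.

```python
import pandas as pd

# Hypothetical P15 answers, including one unanswered question
respostas = pd.Series(['Sim', 'Não', 'Não tenho certeza', 'Não Respondido'])

mapeamento_p15 = {'Sim': 1, 'Não': 0, 'Não tenho certeza': 0.5}

# Values that are not keys of the dictionary become NaN
numerico = respostas.map(mapeamento_p15)

print(numerico)
print(numerico.isna().sum())  # number of answers that were not mapped
```

Counting the NaN values after mapping is a simple check: a count larger than the number of unanswered questions suggests that an answer option is spelled differently in the data than in the dictionary.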
9. Application of Mappings to the Responses of Questions
The first main action in the code is the application of specific mappings to the responses of questions P2, P3, P4, and P15. This process converts the textual (categorical) responses of these questions into numerical values.
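9.1 - How Does the Mapping Application Work? For each question, .map() looks up every response in the corresponding dictionary and stores the resulting number in a new column (P2_num, P3_num, P4_num, P15_num). Responses that are not keys of the dictionary, such as 'Não Respondido', become NaN.
9.2 - Calculation of the Spearman Correlation
The second main action is the calculation of the Spearman correlation between the resulting numerical variables from questions P2, P3, P4, and P15.
9.2.1 - What is the Spearman Correlation? It is a rank-based measure of how strongly two variables are monotonically related. It ranges from -1 to 1 and is suited to ordinal data such as the coded answer ranges used here.
9.2.2 - How is the Correlation Calculated? .corr(method='spearman') computes the coefficient for every pair of the selected columns and returns the results as a correlation matrix. Missing values are excluded pair by pair.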
In summary, this code snippet first converts the textual responses of selected questions into numerical values (facilitating statistical analysis) and then uses these numerical values to calculate and explore relationships between the variables through the Spearman correlation.
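To make the calculation concrete, the sketch below computes the Spearman coefficient on invented numeric codes in two ways: with .corr(method='spearman'), and by ranking the complete pairs and taking the Pearson correlation of the ranks, which is the definition of the coefficient. The DataFrame df and its values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric codes; None marks an unanswered question
df = pd.DataFrame({
    'P2_num': [1, 2, 2, 3, 4, 4],
    'P3_num': [1, 1, 2, 3, 3, None],
})

# Spearman correlation matrix, as computed in the article
spearman = df.corr(method='spearman')

# By definition: rank the complete pairs (ties get average ranks),
# then compute the Pearson correlation of the ranks
completos = df.dropna()
manual = completos.rank().corr(method='pearson')

print(spearman.loc['P2_num', 'P3_num'])
print(manual.loc['P2_num', 'P3_num'])
```

Both prints show the same value, which illustrates that the coefficient depends only on the order of the answers and not on the particular numbers chosen in the mappings, as long as the order is preserved.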
The result for the Spearman correlation between questions P2, P3, P4, and P15 was the following: