The application of Python for analysis and visualization of survey form data and the use of Spearman's correlation

I analyzed the data from a survey form focused on educational topics. For professional reasons, I can’t disclose the specific questions, but I will describe the methodology used to process the form data.

This methodology includes analyzing the frequency of responses and applying Spearman’s correlation to assess the relationships between variables.

import pandas as pd #1
import seaborn as sns #2
import matplotlib.pyplot as plt #3

# Read the file
data = pd.read_csv('pesquisa.csv') #4

data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].str.title().str.strip()
correcoes = {
    
    "Matemática": "Cadeira de Matemática",
    "Cibernética": "Cadeira de Cibernética"
    
}
data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].replace(correcoes)
data['Qual seu Curso/Cadeira'].unique()  # Show the unique values in the column after standardization

# Drop the 'Carimbo de data/hora' (timestamp) column
data = data.drop('Carimbo de data/hora', axis=1)
# Show the first rows to check
data.head()

# Replace NaN with 'Não Respondido' in all columns
data = data.fillna('Não Respondido')
data.head()

data.rename(columns={
    'Qual seu Curso/Cadeira': 'P1',
    'Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira.': 'P2',

}, inplace=True)
print(data.columns)

# Compute percentages for question P2 (average number of military personnel involved in setting up a Formal Test Process)
percentuais_p2 = data['P2'].value_counts(normalize=True) * 100

# Create a bar chart of the percentages
plt.figure(figsize=(10, 6))
sns.barplot(x=percentuais_p2.index, y=percentuais_p2.values)
plt.title('Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira-P2')
plt.xlabel('Quantidade de militares')
plt.ylabel('Percentual')
plt.show()

# Check the response options in order to build the mappings
print(data['P2'].unique())
print(data['P3'].unique())
print(data['P4'].unique())
print(data['P15'].unique())

# Mappings
mapeamento_p2 = {'1 - 2': 1, '3 - 5': 2, '6 - 10': 3, 'Mais de 10': 4}
mapeamento_p3 = {'Menos de 2 horas': 1, '2 a 3 horas': 2, 'Mais de 3 horas': 3}
mapeamento_p4 = {'Menos de 1 horas': 1, '1 a 2 horas': 2, 'Mais de 2 horas': 3}
mapeamento_p15 = {'Sim': 1, 'Não': 0, 'Não tenho certeza': 0.5}

# Apply the mappings
data['P2_num'] = data['P2'].map(mapeamento_p2)
data['P3_num'] = data['P3'].map(mapeamento_p3)
data['P4_num'] = data['P4'].map(mapeamento_p4)
data['P15_num'] = data['P15'].map(mapeamento_p15)

correlacao_spearman = data[['P2_num', 'P3_num', 'P4_num', 'P15_num']].corr(method='spearman')
print(correlacao_spearman)        

This is the complete code. Below, I explain what each part does.


The survey form was uploaded to Google Colab. Let’s go through it step by step.

1. First, we import the libraries we need and then read the form, which is in .csv format.

import pandas as pd #1
import seaborn as sns #2
import matplotlib.pyplot as plt #3
# Read the file
data = pd.read_csv('pesquisa.csv') #4        

Explanation:

#1 import pandas as pd: This line imports the Pandas library under the alias pd. Pandas is a Python library used for data manipulation and analysis. It provides data structures like DataFrames, which are essentially tables, making it easier to read, manipulate, and analyze tabular data.

#2 import seaborn as sns: Here, the Seaborn library is imported under the alias sns. Seaborn is a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.

#3 import matplotlib.pyplot as plt: This line imports the pyplot module from the Matplotlib library under the alias plt. Matplotlib is a plotting library for Python, which allows for the creation of static, animated, and interactive graphics and visualizations.

#4 data = pd.read_csv('pesquisa.csv'): This line reads a CSV file named 'pesquisa.csv' using the read_csv function of the Pandas library and stores the result in a variable named data. The variable data now holds a DataFrame, which is a two-dimensional data structure with rows and columns, similar to a spreadsheet or a database table.

This code, therefore, sets up the environment for data analysis by importing the necessary libraries and reading data from a CSV file for further manipulation and analysis.
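Right after loading, a quick sanity check can catch problems early. The sketch below uses a small made-up CSV held in memory as a stand-in for pesquisa.csv (the rows are invented for illustration and are not the real survey data):

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'pesquisa.csv' (not the real survey data)
csv_text = (
    "Carimbo de data/hora,Qual seu Curso/Cadeira\n"
    "2024/01/01 10:00:00, matemática \n"
    "2024/01/01 10:05:00,CIBERNÉTICA\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type inferred for each column
print(sample.head())  # first rows of the table
```

Checking the shape and the column types at this point makes it easier to notice a wrong delimiter or an unexpected column before any cleaning is done.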

2. We begin processing the form. The first column needs fixing because users typed their answers freely, without any standardization.

data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].str.title().str.strip()        

Explanation:

2.1 - The expression data['Qual seu Curso/Cadeira'] is used to specify a particular column in the DataFrame named 'data'. Let's break down this expression to understand it better:

  • data: This is the name of the DataFrame you are working with. A DataFrame is a two-dimensional data structure, similar to a table, provided by the Pandas library in Python. It consists of rows and columns, where each column can contain data of a specific type.
  • ['Qual seu Curso/Cadeira']: This syntax is used to access a specific column in a DataFrame. By placing the name of the column inside brackets and quotes, you are indicating that you want to work with that particular column. In this case, the column you are accessing is named 'Qual seu Curso/Cadeira'.
  • data['Qual seu Curso/Cadeira'] = This part of the code is assigning a new value to the column 'Qual seu Curso/Cadeira' in the DataFrame 'data'. Whatever is on the right side of the equals sign (=) is calculated and then assigned back to the same column. This means you are modifying the original data of that column in the DataFrame.

Therefore, data['Qual seu Curso/Cadeira'] is the way to reference and manipulate the column 'Qual seu Curso/Cadeira' in the DataFrame 'data'. By using this expression in conjunction with methods like .str.title() and .str.strip(), you are applying specific transformations to all the data in that column and updating the DataFrame with these new values.

2.2 - The methods .str.title() and .str.strip() are used in this line of code for several reasons related to data cleaning and standardization:

  • Text standardization with .str.title(): Data collected, especially in forms, often exhibit variations in how they are typed. For instance, the names of courses or chairs might be entered in uppercase, lowercase, or a mix of both. The .str.title() method is used to standardize these texts by converting the first letter of each word to uppercase and the rest to lowercase. This ensures that all data in the column have a consistent format, which is crucial for accurate analyses, especially if comparisons or groupings based on these texts are to be performed.
  • Removal of unnecessary spaces with .str.strip(): When collecting data, it's common for extra spaces to be accidentally inserted at the beginning or end of entries. These spaces can cause issues during data analysis, such as incorrect identification of unique categories or errors when trying to match or compare strings. The .str.strip() method removes these spaces, ensuring that strings are compared and processed correctly.

In summary, these methods are used to ensure the quality and uniformity of the data. Clean and standardized data are fundamental for reliable and accurate analyses, as inconsistencies in data can lead to misinterpretations or technical difficulties during the analysis.
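To see what the two methods do in practice, here is a minimal sketch on a few made-up entries (the values are invented for illustration):

```python
import pandas as pd

# Made-up entries showing typical inconsistencies in free-text answers
cursos = pd.Series(["  matemática", "CIBERNÉTICA ", "Cadeira de matemática"])

# Title-case every word, then remove leading and trailing spaces
limpo = cursos.str.title().str.strip()

print(limpo.tolist())
```

Note that .str.title() capitalizes every word, including short ones such as "de", so "Cadeira de matemática" becomes "Cadeira De Matemática". It is worth keeping this in mind when writing any later corrections.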

3. In this part of the code, a series of operations are performed to standardize and correct the entries in a specific column of a DataFrame. Let’s break it down:

correcoes = {
    
    "Matemática": "Cadeira de Matemática",
    "Cibernética": "Cadeira de Cibernética"
    
}
data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].replace(correcoes)
data['Qual seu Curso/Cadeira'].unique()  # Show the unique values in the column after standardization

3.1 - Creation of the correcoes Dictionary:

  • correcoes = { "Matemática": "Cadeira de Matemática", "Cibernética": "Cadeira de Cibernética" }
  • This dictionary, named correcoes, maps certain values to their corrected forms. For example, every instance of "Matemática" in the data will be replaced with "Cadeira de Matemática", and similarly for "Cibernética".

3.2 - Replacing Values in the DataFrame:

  • data['Qual seu Curso/Cadeira'] = data['Qual seu Curso/Cadeira'].replace(correcoes)
  • This line applies the corrections defined in the correcoes dictionary to the 'Qual seu Curso/Cadeira' column of the DataFrame data.
  • The .replace() method is used here to substitute the values in the DataFrame according to the mapping provided in the correcoes dictionary.

3.3 - Displaying Unique Values:

  • data['Qual seu Curso/Cadeira'].unique()
  • This line of code is used to display the unique values in the ‘Qual seu Curso/Cadeira’ column after the standardization and corrections have been applied.
  • The .unique() method is a Pandas function that returns an array of all unique values in the specified column, which is useful for verifying that the data has been standardized and corrected as intended.

In summary, this part of the code is focused on cleaning and standardizing the data in the ‘Qual seu Curso/Cadeira’ column by replacing certain values with standardized ones and then verifying the result by listing all the unique values present in the column after these operations.
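The following sketch shows the replacement on made-up values. With a plain dictionary, .replace() matches whole cell values rather than parts of a string, which is why checking .unique() afterwards is useful:

```python
import pandas as pd

# Made-up column values after title-casing and stripping
cursos = pd.Series(["Matemática", "Cibernética", "Cadeira De Matemática"])

correcoes = {
    "Matemática": "Cadeira de Matemática",
    "Cibernética": "Cadeira de Cibernética",
}

# Only cells that are exactly equal to a dictionary key are replaced
corrigido = cursos.replace(correcoes)

print(corrigido.unique())
```

In this example the third value is left untouched because it is not a key of the dictionary, so .unique() still lists it as a separate category that may need its own correction.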

4. In this part of the code, two main operations are performed on the DataFrame ‘data’:

# Drop the 'Carimbo de data/hora' (timestamp) column
data = data.drop('Carimbo de data/hora', axis=1)
# Show the first rows to check
data.head()

4.1 - Removal of the ‘Carimbo de data/hora’ Column:

  • data = data.drop('Carimbo de data/hora', axis=1)
  • This line of code removes the column titled ‘Carimbo de data/hora’ from the DataFrame ‘data’.
  • The .drop() method is used to delete the specified column. The argument 'Carimbo de data/hora' indicates the name of the column to be removed, and the parameter axis=1 specifies that the operation should be performed on the columns (in contrast to axis=0, which would apply to rows).
  • After executing this line, the DataFrame ‘data’ will no longer contain the ‘Carimbo de data/hora’ column.

4.2 - Displaying the First Rows of the DataFrame:

  • data.head()
  • This line of code is used to display the first rows of the DataFrame ‘data’ after the removal of the column.
  • The .head() method is a Pandas function that returns the first five rows of a DataFrame by default. It is a useful tool for quickly checking the content and structure of the DataFrame, especially after making modifications like removing a column.

In summary, this part of the code focuses on cleaning the DataFrame ‘data’ by removing a specific column that may not be necessary for subsequent analysis and then visually verifying the result of this operation by displaying the first rows of the modified DataFrame.
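As a small illustration on a made-up one-row table, the same removal can also be written with the columns keyword, which some readers find easier to read than axis=1:

```python
import pandas as pd

# Made-up one-row table with a timestamp column
df = pd.DataFrame({
    "Carimbo de data/hora": ["2024/01/01 10:00:00"],
    "P1": ["Cadeira de Matemática"],
})

# Equivalent to df.drop('Carimbo de data/hora', axis=1)
sem_carimbo = df.drop(columns="Carimbo de data/hora")

print(sem_carimbo.columns.tolist())
```

.drop() returns a new DataFrame and leaves the original unchanged, which is why the article's code assigns the result back to data.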

5. This part of the code performs two main operations on the DataFrame data: replacing ‘NaN’ values and displaying the first rows.

# Replace NaN with 'Não Respondido' in all columns
data = data.fillna('Não Respondido')
data.head()

5.1 - Replacing ‘NaN’ with ‘Não Respondido’ in All Columns:

  • data = data.fillna('Não Respondido')
  • This line of code uses the .fillna() method to replace all 'NaN' (Not a Number, i.e., a missing or undefined value) in the DataFrame data with the text 'Não Respondido' ("Not Answered").
  • The .fillna() method is a Pandas function that allows for replacing missing values in a DataFrame or series. By passing 'Não Respondido' as an argument, all places in the DataFrame data containing 'NaN' are filled with this string.
  • This operation is useful for handling missing data in a way that makes it more understandable and manageable, especially in situations where the absence of a response might be significant for analysis.

5.2 - Displaying the First Rows of the DataFrame:

  • data.head()
  • This line of code is used to display the first rows of the DataFrame data after replacing the 'NaN' values.
  • As mentioned earlier, the .head() method returns the first five rows of a DataFrame. It is a useful tool to check if the desired changes have been correctly applied to the DataFrame.

In summary, this part of the code focuses on addressing missing data in the DataFrame data, replacing 'NaN' values with a more descriptive string ('Não Respondido'), and then visually verifying the result of this operation by displaying the first rows of the modified DataFrame.
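Before filling the gaps, it can be helpful to count how many answers are missing in each column. A minimal sketch on made-up answers (the column contents are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up answers with one missing value in each column
df = pd.DataFrame({
    "P2": ["1 - 2", np.nan, "3 - 5"],
    "P3": [np.nan, "2 a 3 horas", "Mais de 3 horas"],
})

# Number of missing answers per column, before filling
faltantes = df.isna().sum()
print(faltantes)

# Replace every missing value with the label used in the article
preenchido = df.fillna("Não Respondido")
print(preenchido)
```

Counting first keeps a record of the non-response rate, which is no longer visible as NaN once the label has been filled in.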

6. In this part of the code, we make a simplification. Long question titles can make the code confusing and hard to read, while short labels like P1, P2, P3... simplify referencing the variables. We therefore renamed the questions from the form used in this work:

data.rename(columns={
    'Qual seu Curso/Cadeira': 'P1',
    'Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira.': 'P2',

}, inplace=True)
print(data.columns)

In this part of the code, the DataFrame data is modified to rename some of its columns. The .rename() method is used with the columns parameter to specify the desired changes in the column names. Here's what each line does:

6.1 - Column Renaming:

  • data.rename(columns={'Qual seu Curso/Cadeira': 'P1', 'Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira.': 'P2',}, inplace=True)
  • This line changes the name of the column ‘Qual seu Curso/Cadeira’ to ‘P1’ and the name of the column ‘Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira.’ to ‘P2’.
  • The argument inplace=True indicates that the modification should be made directly in the DataFrame data, without the need to create a new copy.

6.2 - Displaying Updated Column Names:

  • print(data.columns)
  • This line prints the names of the columns of the DataFrame data after the renaming.
  • This is to verify that the columns have been renamed as expected.

In summary, this part of the code is used to simplify the names of the columns, making the DataFrame easier to manipulate and understand, especially when dealing with long or complex column names.
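The renaming can be sketched on a small made-up table. The second column name below is a hypothetical placeholder, not one of the real survey questions:

```python
import pandas as pd

# Made-up table; "Pergunta longa de exemplo" is a placeholder question title
df = pd.DataFrame({
    "Qual seu Curso/Cadeira": ["Cadeira de Matemática"],
    "Pergunta longa de exemplo": ["1 - 2"],
})

nomes_curtos = {
    "Qual seu Curso/Cadeira": "P1",
    "Pergunta longa de exemplo": "P2",
}

# Without inplace=True, rename returns a renamed copy
renomeado = df.rename(columns=nomes_curtos)

print(renomeado.columns.tolist())
```

Keeping the dictionary in its own variable also gives a handy lookup table from short labels back to the full question text when writing up results.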

7. This code snippet analyzes and visualizes the data for question P2 of the survey, which asks about the “Average number of military personnel involved in setting up a Formal Test Process for your Course/department”. Let’s detail each part of the code:

# Compute percentages for question P2 (average number of military personnel involved in setting up a Formal Test Process)
percentuais_p2 = data['P2'].value_counts(normalize=True) * 100

# Create a bar chart of the percentages
plt.figure(figsize=(10, 6))
sns.barplot(x=percentuais_p2.index, y=percentuais_p2.values)
plt.title('Quantidade média de militares empenhados para a montagem de um Processo de Prova Formal no seu Curso/cadeira-P2')
plt.xlabel('Quantidade de militares')
plt.ylabel('Percentual')
plt.show()        

7.1 - Calculating Percentages:

  • data['P2'].value_counts(normalize=True): This method counts how often each unique response appears in the 'P2' column of the data DataFrame. With normalize=True, it returns each response's proportion of the total (a fraction between 0 and 1) instead of the raw count.
  • * 100: Converts the proportions into percentages. For example, a proportion of 0.25 is converted into 25%.
  • The result is stored in percentuais_p2, which now contains the percentages for each unique response to question P2.

7.2 - Setting Up the Bar Chart:

  • plt.figure(figsize=(10, 6)): Sets the size of the chart figure. In this case, the width is 10 inches and the height is 6 inches (values that can be adjusted for better viewing).
  • sns.barplot(...): Uses the Seaborn library (referred to as sns) to create a bar chart.
  • x=percentuais_p2.index: Sets the labels of the X-axis of the chart, which are the unique responses to question P2.
  • y=percentuais_p2.values: Sets the values of the Y-axis of the chart, which are the percentages for each response.
  • plt.title(...): Adds a title to the chart.
  • plt.xlabel('Quantidade de militares'): Sets the label for the X-axis (categorization of responses).
  • plt.ylabel('Percentual'): Sets the label for the Y-axis (percentage of responses).
  • plt.show(): Displays the chart.

In summary, this code calculates the percentage of responses for each category in question P2 and then creates a bar chart to visualize these percentages, making it easier to interpret the data. The chart helps to understand how responses to question P2 are distributed among different categories of the number of military personnel involved.
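The percentage calculation can be checked on a handful of made-up answers. Because value_counts() sorts by frequency, the sketch also reorders the result into the natural order of the categories, which is an optional extra step that is not part of the code above:

```python
import pandas as pd

# Made-up answers to P2
p2 = pd.Series(["1 - 2", "3 - 5", "1 - 2", "Mais de 10"])

# Share of each answer, expressed as a percentage
percentuais = p2.value_counts(normalize=True) * 100

# Optional: put the categories in their natural order for plotting;
# categories nobody chose get 0 instead of being left out
ordem = ["1 - 2", "3 - 5", "6 - 10", "Mais de 10"]
ordenado = percentuais.reindex(ordem, fill_value=0)

print(ordenado)
```

Passing ordenado.index and ordenado.values to the bar chart would then show the bars from the smallest to the largest group size rather than from the most to the least frequent answer.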

8. In this part of the code, we process the responses to apply Spearman’s correlation.

# Check the response options in order to build the mappings
print(data['P2'].unique())
print(data['P3'].unique())
print(data['P4'].unique())
print(data['P15'].unique())

# Mappings
mapeamento_p2 = {'1 - 2': 1, '3 - 5': 2, '6 - 10': 3, 'Mais de 10': 4}
mapeamento_p3 = {'Menos de 2 horas': 1, '2 a 3 horas': 2, 'Mais de 3 horas': 3}
mapeamento_p4 = {'Menos de 1 horas': 1, '1 a 2 horas': 2, 'Mais de 2 horas': 3}
mapeamento_p15 = {'Sim': 1, 'Não': 0, 'Não tenho certeza': 0.5}

# Apply the mappings
data['P2_num'] = data['P2'].map(mapeamento_p2)
data['P3_num'] = data['P3'].map(mapeamento_p3)
data['P4_num'] = data['P4'].map(mapeamento_p4)
data['P15_num'] = data['P15'].map(mapeamento_p15)

correlacao_spearman = data[['P2_num', 'P3_num', 'P4_num', 'P15_num']].corr(method='spearman')
print(correlacao_spearman)

8.1 - This part of the code focuses on identifying all the different response options provided for questions P2, P3, P4, and P15 in the survey.

  • print(data['P2'].unique()): This command prints all the unique response options that appear in the 'P2' column of the DataFrame data. This helps to understand all the possible answers for question P2.
  • The same process is repeated for questions P3, P4, and P15 using data['P3'].unique(), data['P4'].unique(), and data['P15'].unique(). Each of these commands prints the unique response options for their respective questions.

8.2 - After identifying the response options for each question, the code defines a mapping for each. This mapping is used to convert textual responses into numerical values, which facilitates subsequent statistical analysis.

  • P2: The mapping for question P2 (mapeamento_p2) converts the response options into numbers. For example, '1 - 2' is mapped to 1, '3 - 5' to 2, and so on.
  • P3: Similarly, mapeamento_p3 maps the response options of P3 into numerical values. 'Menos de 2 horas' (less than 2 hours) is mapped to 1, '2 a 3 horas' (2 to 3 hours) to 2, and so on.
  • P4: mapeamento_p4 does the same for P4, with 'Menos de 1 horas' (less than 1 hour) mapped to 1, etc.
  • P15: For question P15, mapeamento_p15 assigns 1 to 'Sim' (yes), 0 to 'Não' (no), and 0.5 to 'Não tenho certeza' (not sure).

Each mapping is then applied to its respective column in the DataFrame to transform the textual responses into numerical values, facilitating analyses such as correlations, frequency analyses, and others.

In summary, this code is an essential preparation for data analysis, where the first part identifies the responses to be mapped, and the second part effectively performs the mapping, converting categorical data into numerical for subsequent analyses.

9. Application of Mappings to the Responses of Questions

The first main action in the code is the application of specific mappings to the responses of questions P2, P3, P4, and P15. This process converts the textual (categorical) responses of these questions into numerical values.

9.1 - How Does the Mapping Application Work?

  • For each question (P2, P3, P4, and P15), you have a predefined mapping (like mapeamento_p2, mapeamento_p3, etc.).
  • The .map() method is used to apply this mapping to the corresponding column in the DataFrame data.
  • For example, data['P2'].map(mapeamento_p2) transforms all the responses in the 'P2' column according to the mapping defined in mapeamento_p2. This means each textual response in 'P2' is replaced by its corresponding numerical value as defined in the mapping.
  • This process is repeated for questions P3, P4, and P15.
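A minimal sketch of the mapping step on made-up answers is shown below. One detail worth knowing is that .map() returns NaN for any value that has no entry in the dictionary, which here includes the 'Não Respondido' label:

```python
import pandas as pd

# Made-up answers, including the label used for missing responses
p2 = pd.Series(["1 - 2", "Mais de 10", "Não Respondido"])

mapeamento_p2 = {"1 - 2": 1, "3 - 5": 2, "6 - 10": 3, "Mais de 10": 4}

# Values without a key in the mapping become NaN
p2_num = p2.map(mapeamento_p2)

print(p2_num)
```

Those NaN values are skipped when pandas computes the correlation, so unanswered questions do not receive an arbitrary numeric code.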

9.2 - Calculation of the Spearman Correlation

The second main action is the calculation of the Spearman correlation between the resulting numerical variables from questions P2, P3, P4, and P15.

9.2.1 - What is the Spearman Correlation?

  • The Spearman correlation is a statistical measure that assesses the strength and direction of the relationship between two variables that are at least of ordinal nature.
  • Unlike the Pearson correlation, which measures linear association and is usually paired with an assumption of normally distributed variables, the Spearman correlation is non-parametric: it is based on the ranks of the values, so it captures any monotonic relationship and suits ordinal data such as survey answers.

9.2.2 - How is the Correlation Calculated?

  • data[['P2_num', 'P3_num', 'P4_num', 'P15_num']].corr(method='spearman'): This command calculates the Spearman correlation for all possible combinations between the variables 'P2_num', 'P3_num', 'P4_num', and 'P15_num'.
  • The result is a correlation matrix, where each cell shows the Spearman correlation coefficient between two variables. Values close to +1 or -1 indicate a strong positive or negative correlation, respectively, while values close to 0 indicate little or no correlation.

In summary, this code snippet first converts the textual responses of selected questions into numerical values (facilitating statistical analysis) and then uses these numerical values to calculate and explore relationships between the variables through the Spearman correlation.
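To make the "based on ranks" idea concrete, the sketch below uses made-up numeric codes (not the survey data) and checks that the Spearman coefficient equals the Pearson coefficient computed on the ranked values:

```python
import pandas as pd

# Made-up numeric codes standing in for two mapped questions
df = pd.DataFrame({
    "P2_num": [1, 2, 2, 3, 4, 4],
    "P3_num": [1, 1, 2, 3, 3, 2],
})

# Spearman correlation computed directly
spearman = df.corr(method="spearman")

# Same quantity obtained by ranking first (ties get the average rank)
# and then applying the Pearson correlation to the ranks
via_ranks = df.rank().corr(method="pearson")

print(spearman)
print(via_ranks)
```

The two matrices match, which is a useful way to convince yourself that only the ordering of the codes matters, not the particular numbers chosen in the mappings.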

The result for the Spearman correlation between questions P2, P3, P4, and P15 was the following:

[Figure: Heatmap of the Spearman Correlation]
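The plotting code for the heatmap is not shown in this article. A heatmap of this kind could be drawn along the following lines, assuming seaborn's heatmap function and using made-up numeric codes in place of the real survey columns:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up numeric codes standing in for the mapped survey columns
exemplo = pd.DataFrame({
    "P2_num": [1, 2, 2, 3, 4, 4],
    "P3_num": [1, 1, 2, 3, 3, 2],
    "P4_num": [3, 2, 2, 1, 1, 2],
    "P15_num": [0, 0.5, 1, 1, 0, 1],
})
correlacao = exemplo.corr(method="spearman")

# Fix the color scale to [-1, 1] so colors are comparable across charts
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(correlacao, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Spearman correlation")
plt.show()
```

With the real data, correlacao_spearman from the code above would take the place of correlacao.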


