Exploring the Fascinating World of Exoplanets: A Statistical and Machine Learning Analysis
Introduction:
The study of exoplanets has captivated the minds of scientists and space enthusiasts alike, unveiling new realms of possibilities beyond our solar system. In this article, we embark on an exciting journey of data analysis, hypothesis testing, and machine learning to unravel the mysteries of exoplanets.
I. Exploratory Data Analysis
1. Exploring the Dataset:
Let's start by delving into the dataset of exoplanets provided by NASA. Below is a snapshot of the first 10 lines from the dataset:
The dataset comes with a variety of columns that provide diverse information. Here are the names of the columns:
Each column carries essential information about the exoplanets. Here's an overview of the column details:
Let's take a look at some key statistical information about the dataset:
2. Hypothesis Tests
I. Analysis of Exoplanet Discoveries by Discovery Method
Our first hypothesis revolves around the methods of exoplanet discovery and their differences in terms of discovery counts. We will employ a Chi-Square Test or ANOVA to compare the frequencies of discoveries by method.
Chi-Square and ANOVA:
Kruskal-Wallis
I initially ran the chi-square and ANOVA tests and found that their outcomes disagreed. Plotting the distributions for each label (discovery method) showed that they are not normally distributed, which violates an assumption of ANOVA. I therefore replaced ANOVA with the non-parametric Kruskal-Wallis test and re-evaluated the results.
The results indicate that the discovery methods do differ significantly in their discovery counts, which supports the hypothesis.
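As a rough illustration of the Kruskal-Wallis step, here is a minimal sketch that computes the H statistic by hand, including the usual tie correction. The sample values and the idea of treating yearly discovery counts per method as the groups are my assumptions for the example, not the data used above. A library routine such as scipy.stats.kruskal would also return the p-value.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction (no p-value)."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j; their mean is (i + 1 + j) / 2.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical yearly discovery counts for three methods (toy numbers).
transit = [120, 340, 510, 95]
radial_velocity = [40, 55, 62, 48]
imaging = [2, 5, 3, 4]
h_stat = kruskal_wallis_h(transit, radial_velocity, imaging)
print(f"H = {h_stat:.3f}")
```

Under the null hypothesis, H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom.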
II. The distribution of exoplanet masses follows a normal distribution.
Shapiro-Wilk Test.
In summary, the Shapiro-Wilk test on the exoplanet masses yielded a p-value of 1.0, which by itself suggests no significant departure from normality. However, the Q-Q plot curves sharply upward at its upper end, indicating deviations from normality, particularly in the tails. So while the Shapiro-Wilk test did not reject normality, visual inspection of the Q-Q plot hints at heavy tails, which could reflect outliers or rare events in the exoplanet mass data. Because the test and the plot disagree, the p-value should be treated with caution. Researchers should consider outlier identification, data transformation, or robust statistical methods to limit the influence of these outliers and non-normal characteristics in further analyses.
III. There is a positive correlation between the exoplanet mass and the mass of the host star.
Pearson's Correlation Coefficient
Based on the analysis of the correlation between exoplanet mass and host star mass, we observed a Pearson correlation coefficient of approximately 0.26. This positive value indicates a weak correlation between the two variables. Although the relationship is not strong, the presence of a positive correlation suggests that, in general, exoplanets with larger masses tend to orbit host stars with larger masses. However, it's important to note that other factors may influence this relationship and that correlation does not necessarily imply a cause-and-effect relationship between exoplanet and star masses.
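To make the coefficient concrete, here is a minimal sketch of how Pearson's r can be computed from two columns. The mass values below are made-up toy numbers, not rows from the NASA dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance_sum = sum(a * b for a, b in zip(dx, dy))
    var_x_sum = sum(a * a for a in dx)
    var_y_sum = sum(b * b for b in dy)
    if var_x_sum == 0 or var_y_sum == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance_sum / math.sqrt(var_x_sum * var_y_sum)


# Hypothetical toy values: planet masses and host star masses.
planet_mass = [0.5, 1.2, 3.0, 0.8, 2.1]
star_mass = [0.9, 1.0, 1.3, 0.7, 1.1]
r = pearson_r(planet_mass, star_mass)
print(f"Pearson r = {r:.3f}, r squared = {r * r:.3f}")
```

Squaring r gives the share of variance that a linear relationship would explain. For r of about 0.26 that is roughly 0.07, which is why the correlation is described as weak.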
IV. The masses of exoplanets differ between systems with different numbers of stars.
After conducting the Student's t-test to compare exoplanet masses between systems with 1 and 2 host stars, we observed a t-statistic value of approximately -1.57 and a p-value of about 0.12. The p-value is greater than the usual significance level of 0.05, indicating that there is not enough evidence to reject the null hypothesis that the exoplanet masses are equal between the two groups. Therefore, I did not find statistically significant differences in exoplanet masses between systems with different numbers of host stars.
The Results:
T-Statistic: -1.5736487102572945
P-Value: 0.11562735525096766
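For readers who want to see what goes into the t-statistic, here is a minimal sketch of the pooled-variance (Student's) version. The two samples are hypothetical toy values. The p-value is left out because it needs the t distribution, which a library routine such as scipy.stats.ttest_ind provides.

```python
import math
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Pooled-variance (Student's) t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    df = n_a + n_b - 2
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    if pooled_var == 0:
        raise ValueError("both samples are constant; t is undefined")
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Hypothetical planet masses for systems with one and two host stars.
one_star = [0.8, 1.1, 0.9, 1.4, 1.0]
two_stars = [1.2, 1.5, 0.9, 1.6]
t_stat, dof = student_t(one_star, two_stars)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A negative t simply means the first group has the smaller mean. The pooled version assumes both groups share a similar variance.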
V. The relationship between semi-major axis and orbital period follows Kepler's law.
The analysis shows a strong positive correlation between the semi-major axis and the orbital period of the exoplanets. A polynomial curve fits the data points well, indicating a non-linear relationship between the two variables. This is consistent with Kepler's third law, which relates the orbital period of a planet to the size of its orbit. We can therefore conclude that Kepler's law accurately describes the relationship between semi-major axis and orbital period for the analyzed exoplanets.
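Kepler's third law says that the square of the period is proportional to the cube of the semi-major axis, so on log-log axes the relationship should be a straight line with slope 1.5. The sketch below checks that slope with an ordinary least-squares fit. It uses rounded Solar System values (AU and years) as a stand-in, not the exoplanet dataset, and for exoplanets the host star mass also enters the proportionality constant.

```python
import math


def loglog_slope(semi_major_axes, periods):
    """Least-squares slope and intercept of log(period) vs log(axis)."""
    if len(semi_major_axes) != len(periods) or len(periods) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    xs = [math.log(a) for a in semi_major_axes]
    ys = [math.log(p) for p in periods]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all semi-major axes are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Mercury, Venus, Earth, Mars, Jupiter (rounded values).
axes_au = [0.387, 0.723, 1.000, 1.524, 5.203]
periods_yr = [0.241, 0.615, 1.000, 1.881, 11.862]
slope, intercept = loglog_slope(axes_au, periods_yr)
print(f"slope = {slope:.4f} (Kepler predicts 1.5), intercept = {intercept:.4f}")
```

A fitted slope close to 1.5 is a more direct check of Kepler's law than a good polynomial fit, since many non-linear relationships can be fitted well by a polynomial.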
VI. The distributions of visual magnitude, infrared magnitudes, and Gaia magnitude are different.
Results:
KS Test Results:
Visual vs Infrared - KS Statistic: 0.5279374522923377, P-Value: 0.0
Visual vs Gaia - KS Statistic: 0.08053122400252233, P-Value: 6.873588918671671e-97
Infrared vs Gaia - KS Statistic: 0.49627150013308535, P-Value: 0.0
These results indicate that magnitudes measured in different wavelength ranges differ from one another in a statistically significant way. This could be related to the varying sensitivities of the measurement instruments in each wavelength range, light absorption by the interstellar medium, and other variables that affect observations at different wavelengths. Therefore, when comparing magnitudes across wavelength ranges, it is important to consider the potential sources of variation that could contribute to these differences.
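The KS statistic reported above is the largest vertical gap between the two empirical cumulative distribution functions. Here is a minimal sketch of that computation. The magnitude values are hypothetical toy numbers, and only the statistic is computed. A library routine such as scipy.stats.ks_2samp would also return the p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (maximum ECDF gap)."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / n
        cdf_b = bisect_right(b_sorted, value) / m
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical visual and infrared magnitudes for a handful of stars.
visual_mag = [12.1, 13.4, 11.8, 14.0, 12.9]
infrared_mag = [10.2, 11.0, 9.7, 11.9, 10.8]
d_stat = ks_statistic(visual_mag, infrared_mag)
print(f"KS statistic = {d_stat:.3f}")
```

A statistic of 0 means the two samples have identical empirical distributions, and 1 means they do not overlap at all.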
VII. There is a relationship between the distance and the visual magnitude of host stars.
Results:
Regression Slope: 0.0027953678529972206
Regression Intercept: 11.771645119985003
R-squared: 0.3336357728105161
P-Value: 0.0
Overall, the results suggest that there is a statistically significant but relatively weak positive relationship between distance and visual magnitude of host stars. The R-squared value indicates that other factors not included in the analysis may also contribute to the variability in visual magnitude.
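One possible reason for the modest R-squared is that, for a star of fixed luminosity, apparent magnitude grows with the logarithm of distance (the distance modulus), not linearly. The sketch below compares a straight-line fit against distance with one against log10 of distance. The data are synthetic: every star is assumed to have the same absolute magnitude of 4.83 (roughly the Sun's), which is not true of the real host stars.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares: returns (slope, intercept, r_squared)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("fit is undefined for constant input")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic stars: same absolute magnitude, distances in parsecs (assumed).
ABSOLUTE_MAG = 4.83
distances_pc = [10, 50, 100, 500, 1000]
apparent_mag = [ABSOLUTE_MAG + 5 * math.log10(d / 10) for d in distances_pc]

lin_slope, lin_intercept, lin_r2 = linear_fit(distances_pc, apparent_mag)
log_slope, log_intercept, log_r2 = linear_fit(
    [math.log10(d) for d in distances_pc], apparent_mag
)
print(f"magnitude vs distance:        R^2 = {lin_r2:.3f}")
print(f"magnitude vs log10(distance): R^2 = {log_r2:.3f}, slope = {log_slope:.2f}")
```

On real data the log fit will not be perfect, since host stars differ in luminosity, but comparing the two fits is a quick way to see how much of the unexplained variance comes from the choice of a linear model.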
VIII. There is a temporal trend in the discoveries of exoplanets over the years.
The significant increases in exoplanet discoveries in 2014 and 2016 could be attributed to various factors, including advancements in observation techniques, improvements in data analysis methods, and the launch of new space telescopes or missions that were particularly effective at detecting exoplanets during those years. Additionally, collaborative efforts among different research groups, increased funding, and dedicated exoplanet discovery missions could have also played a role in boosting the number of discoveries during those specific years. It would be beneficial to investigate historical records, scientific publications, and announcements related to exoplanet research during those years to gain a better understanding of the specific factors that contributed to the observed increases.
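To spot years like these in the data, one simple approach is to count discoveries per year and look at the largest year-over-year increases. This is a minimal sketch with made-up years. The real dataset would supply one discovery year per planet.

```python
from collections import Counter


def yearly_counts(discovery_years):
    """Number of discoveries per year, ordered by year."""
    return dict(sorted(Counter(discovery_years).items()))


def largest_jumps(counts, top=2):
    """Years with the largest increase over the previous year in the data.

    Only years present in `counts` are compared; a missing year is not
    treated as zero discoveries.
    """
    years = sorted(counts)
    changes = {
        year: counts[year] - counts[prev] for prev, year in zip(years, years[1:])
    }
    return sorted(changes, key=changes.get, reverse=True)[:top]


# Hypothetical discovery years, one entry per planet.
toy_years = [2012, 2013, 2013, 2014, 2014, 2014, 2014,
             2015, 2016, 2016, 2016, 2016, 2016]
counts = yearly_counts(toy_years)
print(counts)
print("largest increases:", largest_jumps(counts))
```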
IX. There is a difference in stellar properties among different discovery methods.
The box plots and the clustering analysis show evidence of groups of similar and differing stellar properties across the discovery methods in the exoplanet dataset.
Note: The clustering was performed after applying a dimensionality-reduction algorithm.
3. Machine Learning Analysis
Machine learning techniques were used to determine a planet's year length and its orbital distance from the host star. After some data analysis and design choices, I used a linear regression approach for the results.
I. Pearson's Correlation Matrix
After choosing the variables with strong positive or negative correlations with our output variable (orbital period), I built a simple RNN architecture trained with a mean squared error loss to analyze the results. Then, using Keras Tuner, I selected the best hyperparameters and reorganized the architecture accordingly.
4. Explaining The AI.
But this is not enough. After training the neural network, I tried to visualize what was happening during training, in order to understand the influence of each variable in each layer. This is an example of how we can visualize this.
These results could help us better understand the physics behind complex data, systems, and problems, and could also support progress in technology and scientific discovery.
Other visualizations
5. Conclusion and Next Steps
I'm Eager to Connect and Collaborate!
At the intersection of data science, astronomy, and exploration lies a universe of endless possibilities. As I delve deeper into the mysteries of exoplanets, I invite fellow researchers, scientists, and enthusiasts to join in this cosmic journey. Your insights, feedback, and collaborative spirit can fuel new discoveries and foster innovative thinking. Let's forge connections that transcend the boundaries of space and knowledge.
If you're as passionate about exoplanetary science as I am, I'd be thrilled to connect with you. Feel free to explore my profile and drop me a message. Whether it's sharing your perspectives, discussing new research directions, or simply indulging in the wonders of the cosmos, I'm excited to engage in fruitful conversations with like-minded individuals.
Let's embark on a shared voyage of curiosity, exploration, and discovery. Together, we can unlock the secrets of the universe, one dataset at a time.
Connect with me on LinkedIn: [Your LinkedIn Profile Link]
Stay curious, stay inspired, and let's chart the course to the stars together! Let's keep data!
6. References