Exploring the Fascinating World of Exoplanets: A Statistical and Machine Learning Analysis
https://www.seti.org/press-release/new-exoplanet-detection-program-citizen-scientists

Introduction:

The study of exoplanets has captivated the minds of scientists and space enthusiasts alike, unveiling new realms of possibilities beyond our solar system. In this article, we embark on an exciting journey of data analysis, hypothesis testing, and machine learning to unravel the mysteries of exoplanets.


I. Exploratory Data Analysis

1. Exploring the Dataset:

Let's start by delving into the dataset of exoplanets provided by NASA. Below is a snapshot of the first 10 lines from the dataset:

Visualization of First 10 rows of dataset.

The dataset comes with a variety of columns that provide diverse information. Here are the names of the columns:

Name of the Columns of source dataset.

Each column carries essential information about the exoplanets. Here's an overview of the column details:

Columns info.

Let's take a look at some key statistical information about the dataset:

Summary statistics of the dataset.


2. Hypothesis Tests

I. Analysis of Exoplanet Discoveries by Discovery Method

Our first hypothesis revolves around the methods of exoplanet discovery and their differences in terms of discovery counts. We will employ a Chi-Square Test or ANOVA to compare the frequencies of discoveries by method.

Chi-Square and ANOVA:

Results of Test.
The list of Methods.
Distribution of Methods.
Fit distribution of methods.

Kruskal-Wallis

Kruskal-Wallis results.

I initially ran the chi-square and ANOVA tests and found that their outcomes disagreed. Plotting the distributions of the labels showed that they are clearly not normally distributed, which violates an assumption of ANOVA. I therefore replaced ANOVA with the Kruskal-Wallis test, a non-parametric alternative, and re-evaluated the results.


The Kruskal-Wallis result indicates that the exoplanet discovery methods differ significantly in their discovery counts, which supports the hypothesis.
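
As a rough illustration of the two tests, here is a minimal sketch on synthetic data. The column names `discoverymethod` and `disc_year`, the synthetic sample, and the choice to compare yearly discovery counts per method in the Kruskal-Wallis test are all assumptions made for this example, not necessarily what was done in the analysis above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the exoplanet table; column names are assumptions.
rng = np.random.default_rng(0)
methods = rng.choice(
    ["Transit", "Radial Velocity", "Imaging"], size=600, p=[0.7, 0.25, 0.05]
)
years = rng.integers(2000, 2024, size=600)  # years 2000..2023
df = pd.DataFrame({"discoverymethod": methods, "disc_year": years})

# Chi-square goodness of fit: are discoveries spread evenly across methods?
counts = df["discoverymethod"].value_counts()
chi2_stat, chi2_p = stats.chisquare(counts.to_numpy())

# Kruskal-Wallis: do the yearly discovery counts differ between methods?
yearly = df.groupby(["discoverymethod", "disc_year"]).size()
groups = [yearly.loc[m].to_numpy() for m in counts.index]
kw_stat, kw_p = stats.kruskal(*groups)

print(f"Chi-square: stat={chi2_stat:.1f}, p={chi2_p:.3g}")
print(f"Kruskal-Wallis: stat={kw_stat:.1f}, p={kw_p:.3g}")
```

Kruskal-Wallis compares ranks rather than means, so it does not rely on the normality assumption that ANOVA makes.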

II. The distribution of exoplanet masses follows a normal distribution.

Shapiro-Wilk Test.

Distribution of Exoplanets Masses.

In summary, the Shapiro-Wilk test on the exoplanet masses returned a p-value of 1.0, which taken at face value means that normality is not rejected. A p-value of exactly 1.0 is unusual for real data, however, and is worth double-checking, for example for missing values in the input. The Q-Q plot tells a different story: it curves sharply upward at the high end, a departure from normality that points to a heavy right tail, possibly caused by outliers or a small number of very massive planets. Normality should therefore not be assumed on the strength of the test alone. Further analyses should consider outlier identification, a data transformation such as taking the logarithm of the mass, or robust statistical methods to limit the influence of these outliers and of the non-normal shape of the distribution.
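
The sketch below shows one way to run such a check, using synthetic right-skewed data. The lognormal sample, the injected missing values, and the log10 transformation are assumptions chosen for illustration. Missing values are dropped before testing, since they can distort or invalidate the result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic, right-skewed "masses" with some missing values (assumed data).
masses = rng.lognormal(mean=2.0, sigma=1.5, size=500)
masses[::25] = np.nan

clean = masses[~np.isnan(masses)]               # drop missing values first
w_raw, p_raw = stats.shapiro(clean)             # test the raw masses
w_log, p_log = stats.shapiro(np.log10(clean))   # test log10(mass)

# Q-Q data against a normal distribution; r close to 1 means the points
# lie close to a straight line.
(osm, osr), (qq_slope, qq_intercept, qq_r) = stats.probplot(clean, dist="norm")

print(f"raw masses:  W={w_raw:.3f}, p={p_raw:.3g}, Q-Q r={qq_r:.3f}")
print(f"log10(mass): W={w_log:.3f}, p={p_log:.3g}")
```

Plotting `osm` against `osr` reproduces the Q-Q plot, so the numerical test and the visual check can be compared on the same cleaned sample.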

III. There is a positive correlation between the exoplanet mass and the mass of the host star.

Pearson's Correlation Coefficient

Correlation Plot Between Exoplanet and Host Star Masses.

The Pearson correlation coefficient between exoplanet mass and host star mass is approximately 0.26, a weak positive correlation. On average, more massive exoplanets tend to orbit more massive host stars, but the relationship is far from strong. Other factors may influence it, and correlation does not imply a cause-and-effect relationship between exoplanet and star masses.
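
For readers who want to reproduce this kind of check, here is a minimal sketch with synthetic data. The variable names, the noise level, and the added Spearman rank correlation are assumptions for illustration, and the synthetic values are not physically realistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic data in arbitrary units: a weak linear trend plus large scatter.
star_mass = rng.uniform(0.5, 2.0, size=400)
planet_mass = 50 * star_mass + rng.normal(0, 80, size=400)

r, p_pearson = stats.pearsonr(star_mass, planet_mass)       # linear association
rho, p_spearman = stats.spearmanr(star_mass, planet_mass)   # rank-based association

print(f"Pearson r={r:.2f} (p={p_pearson:.3g})")
print(f"Spearman rho={rho:.2f} (p={p_spearman:.3g})")
```

Spearman's coefficient is less sensitive to outliers and skew, which can matter when the mass distribution has a heavy tail.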

IV. The masses of exoplanets differ between systems with different numbers of stars.

A Student's t-test comparing exoplanet masses between systems with 1 and 2 host stars gave a t-statistic of approximately -1.57 and a p-value of about 0.12. Because the p-value is greater than the usual significance level of 0.05, there is not enough evidence to reject the null hypothesis that the mean exoplanet mass is the same in the two groups. I therefore did not find a statistically significant difference in exoplanet masses between single-star and two-star systems.

The Results:

T-Statistic: -1.5736487102572945

P-Value: 0.11562735525096766
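
A comparison like this can be sketched as follows. The data are synthetic, and the Welch and Mann-Whitney variants are added here as assumptions, as alternatives that are often considered when group sizes are unequal or the data are skewed. They are not part of the analysis above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic planet masses for the two groups (assumed distributions).
single_star = rng.lognormal(mean=3.0, sigma=1.0, size=300)  # 1 host star
two_star = rng.lognormal(mean=3.1, sigma=1.0, size=60)      # 2 host stars

# Student's t-test (assumes equal variances in both groups).
t_student, p_student = stats.ttest_ind(single_star, two_star)
# Welch's t-test (does not assume equal variances).
t_welch, p_welch = stats.ttest_ind(single_star, two_star, equal_var=False)
# Mann-Whitney U test (rank-based, no normality assumption).
u_stat, p_mw = stats.mannwhitneyu(single_star, two_star, alternative="two-sided")

print(f"Student: t={t_student:.2f}, p={p_student:.3f}")
print(f"Welch:   t={t_welch:.2f}, p={p_welch:.3f}")
print(f"Mann-Whitney: U={u_stat:.0f}, p={p_mw:.3f}")
```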

V. The relationship between semi-major axis and orbital period follows Kepler's law.

Correlation Between Semi-Major Axis and Orbital Period (output variable)
Exponential Fit Between Semi-Major Axis and Orbital Period.

The Semi-Major Axis and Orbital Period of the analyzed exoplanets show a strong positive correlation. A polynomial curve fitted to the data follows the observed points closely, indicating a non-linear relationship between the two variables. This is consistent with Kepler's third law, under which the square of the orbital period is proportional to the cube of the semi-major axis. The results therefore support the idea that Kepler's Law describes the relationship between Semi-Major Axis and Orbital Period well, although a good curve fit alone does not confirm the specific 3/2 power-law exponent.
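
A more direct check of Kepler's third law is to fit a straight line in log-log space: if P^2 is proportional to a^3, then log10(P) = 1.5 * log10(a) + constant, so the fitted slope should be close to 1.5. The sketch below uses synthetic data for planets around a one-solar-mass star. The units (AU and days) and the scatter are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
a_au = rng.uniform(0.02, 5.0, size=300)  # semi-major axis in AU (synthetic)
# Kepler's third law for a one-solar-mass star: P[yr] = a[AU]**1.5,
# with a small multiplicative scatter added to mimic measurement noise.
period_days = 365.25 * a_au**1.5 * rng.lognormal(0.0, 0.05, size=300)

# Straight-line fit in log-log space; the slope is the power-law exponent.
slope, intercept = np.polyfit(np.log10(a_au), np.log10(period_days), deg=1)

print(f"fitted exponent: {slope:.3f} (Kepler predicts 1.5)")
print(f"period at 1 AU:  {10**intercept:.1f} days")
```

On real data the host star masses vary, which adds scatter around the line, so the exponent is expected to be close to 1.5 rather than exactly equal to it.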

VI. The distributions of visual magnitude, infrared magnitudes, and Gaia magnitude are different.

Distribution of Magnitudes.

Results:

KS Test Results:

Visual vs Infrared - KS Statistic: 0.5279374522923377, P-Value: 0.0

Visual vs Gaia - KS Statistic: 0.08053122400252233, P-Value: 6.873588918671671e-97

Infrared vs Gaia - KS Statistic: 0.49627150013308535, P-Value: 0.0

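These results indicate that the distributions of magnitudes measured in different wavelength ranges differ significantly. The p-values reported as 0.0 are smaller than floating-point precision can represent, not exactly zero. The size of the difference varies: the KS statistic is about 0.5 for both comparisons involving the infrared magnitude but only about 0.08 for visual versus Gaia, so those two distributions are much more alike, and with a sample this large even a small difference is statistically significant. The differences could be related to varying sensitivities of measurement instruments in each wavelength range, light absorption by the interstellar medium, and other variables affecting observations at different wavelengths. When comparing magnitudes across wavelength ranges, it is therefore important to consider these potential sources of variation.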
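
The pairwise comparison can be sketched with the two-sample Kolmogorov-Smirnov test. The three synthetic magnitude samples below are assumptions for illustration, built so that the visual and Gaia-like samples are similar while the infrared-like sample is shifted.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Synthetic magnitudes (assumed distributions, not real measurements).
visual = rng.normal(12.5, 2.0, size=2000)
samples = {
    "Visual": visual,
    "Infrared": rng.normal(10.5, 2.0, size=2000),
    "Gaia": visual - rng.normal(0.3, 0.2, size=2000),
}

# Run the two-sample KS test on every pair of bands.
results = {}
for name_a, name_b in combinations(samples, 2):
    results[f"{name_a} vs {name_b}"] = stats.ks_2samp(samples[name_a], samples[name_b])

for pair, res in results.items():
    print(f"{pair} - KS Statistic: {res.statistic:.3f}, P-Value: {res.pvalue:.3g}")
```

The KS statistic is the largest gap between the two empirical cumulative distributions, so it also serves as a rough effect size alongside the p-value.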

VII. There is a relationship between the distance and the visual magnitude of host stars.

Relation between Distance and Visual Magnitude.

Results:

Regression Slope: 0.0027953678529972206

Regression Intercept: 11.771645119985003

R-squared: 0.3336357728105161

P-Value: 0.0

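Overall, the results show a statistically significant positive relationship between the distance and the visual magnitude of host stars: more distant stars tend to appear fainter, which corresponds to a larger magnitude. The R-squared of about 0.33 means that distance explains roughly a third of the variance in visual magnitude, so other factors not included in the analysis, such as the intrinsic brightness of each star, also contribute to the variability.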
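
Part of the remaining scatter may come from the shape of the relationship itself. Apparent magnitude depends on the logarithm of distance through the distance modulus, m = M + 5 * log10(d) - 5 with d in parsecs, so a fit against log10(distance) can be compared with the linear fit. The sketch below does this on synthetic data. The distance range and the spread in absolute magnitude are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
dist_pc = rng.uniform(10, 2000, size=1000)   # distance in parsecs (synthetic)
abs_mag = rng.normal(4.8, 1.0, size=1000)    # assumed absolute magnitudes
vmag = abs_mag + 5 * np.log10(dist_pc) - 5   # distance modulus

fit_linear = stats.linregress(dist_pc, vmag)          # magnitude vs distance
fit_log = stats.linregress(np.log10(dist_pc), vmag)   # magnitude vs log10(distance)

print(f"linear:   slope={fit_linear.slope:.4f}, R^2={fit_linear.rvalue**2:.3f}")
print(f"log-dist: slope={fit_log.slope:.2f}, R^2={fit_log.rvalue**2:.3f}")
```

In this synthetic setup the log-distance fit recovers a slope near 5, the value expected from the distance modulus.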

VIII. There is a temporal trend in the discoveries of exoplanets over the years.

Histogram of Temporal Analysis.

The sharp increases in exoplanet discoveries in 2014 and 2016 stand out in the histogram. Both years appear to coincide with large batches of validated planets announced from NASA's Kepler mission, which is a likely main driver. More generally, increases like these can come from advancements in observation techniques, improvements in data analysis methods, and the launch of space telescopes or missions that are particularly effective at detecting exoplanets. Collaborative efforts among research groups, increased funding, and dedicated exoplanet discovery missions could also have played a role. Investigating historical records, scientific publications, and announcements from those years would give a better understanding of the specific factors behind the observed increases.
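
One simple way to flag such years programmatically is to compare each year's count with a rolling median of its neighbours. The sketch below uses synthetic discovery years with two injected spikes. The column name `disc_year`, the 5-year window, and the factor of 2 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic discovery years: a flat background plus two injected spikes.
background = rng.integers(1995, 2024, size=1000)  # years 1995..2023
years = np.concatenate([background, np.full(300, 2014), np.full(400, 2016)])
df = pd.DataFrame({"disc_year": years})

# Discoveries per year, in chronological order.
per_year = df["disc_year"].value_counts().sort_index()

# Flag years whose count is more than twice the local rolling median.
baseline = per_year.rolling(window=5, center=True, min_periods=1).median()
spikes = per_year[per_year > 2 * baseline]

print(spikes)
```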

IX. There is a difference in stellar properties among different discovery methods.

Box Plot of Properties.
Clustering.
K-Means Clustering.

The box plots and the clustering analysis show evidence of groups of similar and dissimilar stellar properties across discovery methods in the exoplanet dataset.

Note: The clustering was performed after applying a dimensionality reduction algorithm.
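
A pipeline of this kind can be sketched with scikit-learn. The dimension reduction algorithm is not named above, so PCA is used here as an assumption. The three stellar features and the two synthetic groups are also made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# Two synthetic groups of host stars; the columns are assumed features:
# effective temperature [K], radius [solar radii], mass [solar masses].
cool = rng.normal([4000, 0.6, 0.6], [200, 0.05, 0.05], size=(150, 3))
hot = rng.normal([6000, 1.3, 1.2], [200, 0.10, 0.10], size=(150, 3))
X = np.vstack([cool, hot])

# Standardize so that no feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)
# Reduce to two dimensions (PCA is an assumption, see the note above).
X_2d = PCA(n_components=2).fit_transform(X_scaled)
# Cluster in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

print("cluster sizes:", np.bincount(labels))
```

Comparing the cluster labels with the discovery method of each planet, for example in a cross-tabulation, would show whether the clusters line up with the methods.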


3. Machine Learning Analysis

Machine learning techniques were used to determine a planet's year length (its orbital period) and its orbital distance from the host star. After some data analysis and design choices, I used a linear regression approach for the results.

I. Pearson's Correlation Matrix

Pearson's Correlation Matrix

After selecting the variables with the strongest correlations (positive or negative) with the output variable (orbital period), I built a simple RNN architecture trained with a mean squared error loss to analyze the results. Then, using Keras Tuner, I selected the best hyperparameters and reorganized the architecture accordingly.
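
The variable selection step can be sketched with pandas. The column names, the synthetic table, and the cutoff of 0.5 on the absolute correlation are assumptions for illustration. The threshold actually used is not stated above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
n = 500
a = rng.uniform(0.05, 5.0, n)
# Synthetic table; the column names are assumptions for this example.
df = pd.DataFrame({
    "pl_orbsmax": a,                         # semi-major axis
    "pl_orbper": 365.25 * a**1.5,            # orbital period (target)
    "st_teff": rng.normal(5500, 500, n),     # unrelated to the target here
    "sy_dist": rng.uniform(10, 1000, n),     # unrelated to the target here
})

target = "pl_orbper"
corr = df.corr(method="pearson")

# Rank features by absolute correlation with the target, so strong
# negative correlations are kept as well as strong positive ones.
strength = corr[target].drop(target).abs().sort_values(ascending=False)
selected = strength[strength > 0.5].index.tolist()

print(strength)
print("selected features:", selected)
```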

Initial Architecture
Final Architecture


The Results
Loss Function

4. Explaining The AI.

But this is not enough. After training the neural network, I tried to visualize what was happening during training, in order to understand the influence of each variable in each layer. This is an example of how we can visualize this.


These results could help us better understand the physics behind complex data, systems, and problems, and could also support technological and scientific development and discovery.


Other visualizations



5. Conclusion and Next Steps

I'm Eager to Connect and Collaborate!

At the intersection of data science, astronomy, and exploration lies a universe of endless possibilities. As I delve deeper into the mysteries of exoplanets, I invite fellow researchers, scientists, and enthusiasts to join in this cosmic journey. Your insights, feedback, and collaborative spirit can fuel new discoveries and foster innovative thinking. Let's forge connections that transcend the boundaries of space and knowledge.

If you're as passionate about exoplanetary science as I am, I'd be thrilled to connect with you. Feel free to explore my profile and drop me a message. Whether it's sharing your perspectives, discussing new research directions, or simply indulging in the wonders of the cosmos, I'm excited to engage in fruitful conversations with like-minded individuals.

Let's embark on a shared voyage of curiosity, exploration, and discovery. Together, we can unlock the secrets of the universe, one dataset at a time.

Connect with me on LinkedIn: [Your LinkedIn Profile Link]

Stay curious, stay inspired, and let's chart the course to the stars together! Let's keep data!



More articles by Yan Barros
