Peer-Reviewed

Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game

Received: 23 September 2021     Accepted: 15 October 2021     Published: 17 November 2021
Abstract

To discover whether the sports media are biased, for example by favoring and mentioning certain teams more often than an impartial account would, we recorded the statistics of Toronto Blue Jays players and collected the team's news and highlight articles. Because baseball places particular weight on statistics, the first part of the project tried to determine whether the media's focus on certain players is related to their performance or to their fame and popularity. The project first created a word cloud from the keywords in the game highlight articles. We then chose the best and worst player of the day for every game based solely on the statistics. One interesting finding was that some of the players chosen most often as the best player were also often chosen as the worst player, depending on the day. We compared the names mentioned most often in the news with the names we chose; the two lists had some names in common, but the news list also contained questionable names. Then, to develop a machine learning model that selects the player of the game from the statistics, we used a heatmap to identify the key factors in choosing the best player. According to the heatmap, the key factors for a batter were RBIs and hits, while for a pitcher they were Innings Played and Runs Allowed. We tested multiple machine learning models to see which had the highest accuracy, and after several trials, Logistic Regression appeared to predict the player of the game most accurately. Finally, a sentence bank was created for the program. A sample sentence was provided so that the program could insert the statistics of each game into it and produce a written summary. With one sentence per player, the program could write a summary of every player and also pick and report the best and worst players of the game.
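The sentence-bank step described in the abstract can be illustrated with a minimal sketch. It is not the author's program: the template wording, the `BatterLine` fields, the `score` rule (hits plus RBIs, chosen only because the abstract names those two as key batter factors), and the player names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatterLine:
    """One batter's line from a box score (hypothetical field names)."""
    name: str
    at_bats: int
    hits: int
    rbi: int


# Hypothetical sentence-bank entry: a sample sentence with slots for statistics.
TEMPLATE = "{name} went {hits}-for-{at_bats} with {rbi} RBI."


def score(line: BatterLine) -> int:
    # Assumed scoring rule for illustration only: hits plus RBIs.
    return line.hits + line.rbi


def write_summary(lines: list[BatterLine]) -> str:
    """Fill the template once per player, then name the best and worst."""
    sentences = [
        TEMPLATE.format(name=l.name, hits=l.hits, at_bats=l.at_bats, rbi=l.rbi)
        for l in lines
    ]
    best = max(lines, key=score)   # on ties, the first player listed wins
    worst = min(lines, key=score)  # on ties, the first player listed wins
    sentences.append(f"Best player of the game: {best.name}.")
    sentences.append(f"Worst player of the game: {worst.name}.")
    return " ".join(sentences)


# Made-up box score, not real game data.
box_score = [
    BatterLine("Player A", at_bats=4, hits=3, rbi=2),
    BatterLine("Player B", at_bats=4, hits=0, rbi=0),
    BatterLine("Player C", at_bats=3, hits=1, rbi=1),
]
print(write_summary(box_score))
```

In the project itself the best and worst players were picked by a trained classifier rather than a fixed rule; the hand-written `score` function only stands in for that step so the example stays self-contained.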

Published in American Journal of Data Mining and Knowledge Discovery (Volume 6, Issue 2)
DOI 10.11648/j.ajdmkd.20210602.12
Page(s) 24-30
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Data Science, Baseball, Machine Learning, NLP

Cite This Article
  • APA Style

    Jung-Hun Baeck. (2021). Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game. American Journal of Data Mining and Knowledge Discovery, 6(2), 24-30. https://doi.org/10.11648/j.ajdmkd.20210602.12


    ACS Style

    Jung-Hun Baeck. Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game. Am. J. Data Min. Knowl. Discov. 2021, 6(2), 24-30. doi: 10.11648/j.ajdmkd.20210602.12


    AMA Style

    Jung-Hun Baeck. Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game. Am J Data Min Knowl Discov. 2021;6(2):24-30. doi: 10.11648/j.ajdmkd.20210602.12


  • @article{10.11648/j.ajdmkd.20210602.12,
      author = {Jung-Hun Baeck},
      title = {Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game},
      journal = {American Journal of Data Mining and Knowledge Discovery},
      volume = {6},
      number = {2},
      pages = {24-30},
      doi = {10.11648/j.ajdmkd.20210602.12},
      url = {https://doi.org/10.11648/j.ajdmkd.20210602.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajdmkd.20210602.12},
     year = {2021}
    }
    


  • TY  - JOUR
    T1  - Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game
    AU  - Jung-Hun Baeck
    Y1  - 2021/11/17
    PY  - 2021
    N1  - https://doi.org/10.11648/j.ajdmkd.20210602.12
    DO  - 10.11648/j.ajdmkd.20210602.12
    T2  - American Journal of Data Mining and Knowledge Discovery
    JF  - American Journal of Data Mining and Knowledge Discovery
    JO  - American Journal of Data Mining and Knowledge Discovery
    SP  - 24
    EP  - 30
    PB  - Science Publishing Group
    SN  - 2578-7837
    UR  - https://doi.org/10.11648/j.ajdmkd.20210602.12
    VL  - 6
    IS  - 2
    ER  - 


Author Information
  • St. Mark’s School, Southborough, Massachusetts, United States
