Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game

Jung-Hun Baeck

doi:10.11648/j.ajdmkd.20210602.12

Peer-Reviewed

Published in American Journal of Data Mining and Knowledge Discovery (Volume 6, Issue 2)

Received: 23 September 2021 Accepted: 15 October 2021 Published: 17 November 2021

Abstract

To discover biases in sports media, such as favoring and mentioning certain teams more often than others, we recorded the statistics of Toronto Blue Jays players and collected news and highlights articles about the team. Because baseball places particular weight on statistics, the first part of the project tried to determine whether the media's focus on certain players is related to their performance or to their fame and popularity. The project first created a word cloud from the keywords in the game highlights articles. We then chose the best and worst player of the day for every game based solely on the statistics. One interesting finding was that some of the players chosen most often as the best player were also often chosen as the worst player, depending on the day. We compared the names mentioned most often in the news with the ones we chose: the two lists had some names in common, but the news also featured some questionable names. Then, to develop a machine learning model that selects the player of the game from the statistics, we used a heatmap to identify the key factors in choosing the best player. According to the heatmap, the key factors for a batter were RBIs and hits, while for a pitcher they were innings played and runs allowed. We tested multiple machine learning models to see which had the highest accuracy, and after several trials, logistic regression appeared to predict the player of the game most accurately. Finally, a sentence bank was created for the program. A sample sentence was provided so that the program could insert the statistics of each game into it and produce a written summary. With one sentence per player, the program could write a summary of every player and also state who the best and worst players of the game were.
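The sentence-bank step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the template wording, the `BatterLine` fields, the fictitious player names, and in particular the `score` function, which simply adds hits and RBIs (the two batter factors the abstract names) as a stand-in for the trained model. It is not the author's implementation.

```python
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical sentence bank: one template per kind of sentence.
# The wording is illustrative and not taken from the paper.
SENTENCE_BANK = {
    "batter": "{name} went {hits}-for-{at_bats} with {rbi} RBI.",
    "best": "{name} was the best player of the game.",
    "worst": "{name} was the worst player of the game.",
}


@dataclass
class BatterLine:
    """One batter's line from a box score (assumed, simplified fields)."""
    name: str
    at_bats: int
    hits: int
    rbi: int


def score(line: BatterLine) -> int:
    # Assumed stand-in for the trained model: hits plus RBIs.
    return line.hits + line.rbi


def write_summary(lines: List[BatterLine]) -> str:
    # Fill the per-player template with each batter's statistics.
    sentences = [SENTENCE_BANK["batter"].format(**asdict(line)) for line in lines]
    # Pick the best and worst players by the stand-in score.
    best = max(lines, key=score)
    worst = min(lines, key=score)
    sentences.append(SENTENCE_BANK["best"].format(name=best.name))
    sentences.append(SENTENCE_BANK["worst"].format(name=worst.name))
    return " ".join(sentences)


# Fictitious box score used only to exercise the templates.
box_score = [
    BatterLine("Player A", at_bats=4, hits=2, rbi=3),
    BatterLine("Player B", at_bats=4, hits=0, rbi=0),
    BatterLine("Player C", at_bats=3, hits=1, rbi=0),
]
print(write_summary(box_score))
```

Keeping the templates in a dictionary separates the wording from the selection logic, so the scoring function could be swapped for a fitted classifier without touching the text generation.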

Published in	American Journal of Data Mining and Knowledge Discovery (Volume 6, Issue 2)
DOI	10.11648/j.ajdmkd.20210602.12
Page(s)	24-30
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Data Science, Baseball, Machine Learning, NLP

Cite This Article

APA Style

Jung-Hun Baeck. (2021). Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game. American Journal of Data Mining and Knowledge Discovery, 6(2), 24-30. https://doi.org/10.11648/j.ajdmkd.20210602.12

ACS Style

Jung-Hun Baeck. Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game. Am. J. Data Min. Knowl. Discov. 2021, 6(2), 24-30. doi: 10.11648/j.ajdmkd.20210602.12

AMA Style

Jung-Hun Baeck. Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game. Am J Data Min Knowl Discov. 2021;6(2):24-30. doi: 10.11648/j.ajdmkd.20210602.12

BibTeX

@article{10.11648/j.ajdmkd.20210602.12,
  author = {Jung-Hun Baeck},
  title = {Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game},
  journal = {American Journal of Data Mining and Knowledge Discovery},
  volume = {6},
  number = {2},
  pages = {24-30},
  doi = {10.11648/j.ajdmkd.20210602.12},
  url = {https://doi.org/10.11648/j.ajdmkd.20210602.12},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajdmkd.20210602.12},
  year = {2021}
}

RIS

TY  - JOUR
T1  - Implementing Auto Highlight Writer by Predicting the Best and the Worst Players in a Baseball Game
AU  - Jung-Hun Baeck
Y1  - 2021/11/17
PY  - 2021
N1  - https://doi.org/10.11648/j.ajdmkd.20210602.12
DO  - 10.11648/j.ajdmkd.20210602.12
T2  - American Journal of Data Mining and Knowledge Discovery
JF  - American Journal of Data Mining and Knowledge Discovery
JO  - American Journal of Data Mining and Knowledge Discovery
SP  - 24
EP  - 30
PB  - Science Publishing Group
SN  - 2578-7837
UR  - https://doi.org/10.11648/j.ajdmkd.20210602.12
VL  - 6
IS  - 2
ER  -

Author Information

Jung-Hun Baeck

St. Mark’s School, Southborough, Massachusetts, United States
