Pokemon Go is (was…) a very popular mobile game in which players act as Pokemon Trainers and try to catch various Pokemon around the globe. The aim of this blog post is to summarise the statistics of the Pokemon and see if we can pick out the best. The techniques covered in this post include simple data loading, cleaning, summarising and visualisation. The data can be found on Kaggle, posted by Alberto Barradas. This analysis was carried out as a team effort with Dorian, Revathy and Andrea.
Exploratory Data Analysis (EDA) is a key part of the data science workflow. Its purpose is to gain a better understanding of the structure of the data we are working with. This is important because it usually reveals the distributions of the features and the relationships between them. Although it looks very simple, very often this is already enough to solve real world problems. For example, a business can find out the age group of its main customers just by plotting a histogram. EDA is even more important when a data scientist would like to do something more complicated, such as performing machine learning tasks. It reveals the underlying data structure and so helps the data scientist decide on an approach. This matters because different algorithms have different requirements and assumptions which, if not met, can drastically affect performance.
Load and Clean
I will be working with Python in a Jupyter Notebook. The Notebook allows instant display of results which is very convenient for data science work.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    %matplotlib inline
    # The line above is a 'magic' which allows plots to be shown in the Notebook.

    df_go = pd.read_csv('pokemonGO.csv')
The code above shows how one can load a data set into the Notebook. Pandas is used to store the data as a data frame, which allows us to manipulate the contents easily. CSV stands for comma-separated values (although nowadays such files are not limited to commas: some use tabs and others use vertical bars). Pandas can transform these files into a nicely tabulated format just like the ones we commonly see in Excel.
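To illustrate the point about delimiters, here is a minimal sketch that reads a tab-separated and a pipe-separated snippet using the `sep` argument of `pd.read_csv`. The inline text and the values in it are made up for the example and are not part of the Kaggle data set.

```python
import io
import pandas as pd

# Made-up sample text (assumption): not taken from the Kaggle file
tab_text = "Name\tMax CP\tMax HP\nA\t100\t50\nB\t200\t80\n"
pipe_text = "Name|Max CP|Max HP\nA|100|50\nB|200|80\n"

# `sep` tells pandas which character separates the columns
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(df_tab)
# Both delimiters should give the same data frame
print(df_tab.equals(df_pipe))
```

For a real file, the same `sep` argument can be passed together with the file name instead of the `io.StringIO` wrapper.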
Using the head() method we can look at the first 5 rows of the data frame.
It seems that we have identification information about the Pokemon (Pokemon No. and Name) and the properties of the Pokemon (Types, CP and HP). For those who are not familiar with the game, CP is related to how hard a Pokemon hits and HP is related to how tough it is. In other words, they are the ‘attack’ and ‘defence’ properties respectively. Using the head method is very helpful because it shows us what the data actually looks like. In this particular data frame we have strings and integers for different features. It is a great way of checking data without potentially printing out a looooonnnnnnnng list.
Another way to look at the overall data structure is by applying the info() method. Replacing the .head() with .info() we will get a result like this.
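As a rough sketch of these two calls, the snippet below builds a tiny stand-in data frame and inspects it. The frame `toy`, its names and its numbers are all invented for illustration; only the column names mirror the ones mentioned in this post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_go (assumption): names and numbers are invented
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Type 1": ["Grass", "Fire", "Water", "Normal", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, np.nan, np.nan, "Flying", np.nan],
    "Max CP": [1000, 900, 1500, 700, 2600, 1200],
    "Max HP": [80, 70, 100, 400, 130, 90],
})

print(toy.head())    # first 5 rows by default
print(toy.head(2))   # or ask for a specific number of rows
toy.info()           # column names, dtypes and non-null counts

# Count the missing values per column directly
missing = toy.isna().sum()
print(missing)
```

The `isna().sum()` line is an extra that is not used in the original analysis, but it gives the number of missing values per column without having to subtract the non-null counts by hand.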
The info() output tells us the name and type of data we have for each column (e.g. int64, or object, which can be interpreted as strings). More importantly, it tells us how many non-null values exist in each column. This is very useful because most machine learning algorithms (e.g. those in Scikit-learn) DO NOT accept null values as inputs, so if one has to use this data for a model, the missing values must be dealt with! In this case we will fill them with something: a string that says there is no second type for this Pokemon. This is done by the .fillna() method. Sometimes we can instead delete the rows that have missing values altogether using the .dropna() method. However, one must be careful with such an approach as information could be thrown away unnecessarily! We print the head() again and notice the change of value from NaN to No_Type_2.
    df_go['Type 2'] = df_go['Type 2'].fillna('No_Type_2')
Descriptive Statistics and Visualisations
To compute the statistics we can use code such as the following:
    df_go['Max CP'].mean()                # Compute the mean
    df_go['Max HP'].median()              # Compute the median
    df_go['Pokemon No.'].std()            # Compute the standard deviation
    np.percentile(df_go['Max HP'], 25)    # Compute the percentile (in this case 25%)
Knowing the statistics is important as they can be used to perform hypothesis tests. A more convenient way of doing the above would be using the .describe() method in Pandas. This would return a statistical summary for the data frame.
Note that the string columns are automatically excluded!
However, sometimes these statistics do not tell the whole story. This is because different distributions can have very similar means and standard deviations. This is when visualisations come into play.
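A minimal sketch of .describe() on an invented frame is shown below (the values in `toy` are made up). By default only the numeric columns are summarised; passing `include="all"` is one way to bring the string columns back in.

```python
import pandas as pd

# Toy data (assumption): invented names and values
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type 1": ["Fire", "Water", "Fire", "Grass"],
    "Max CP": [900, 1500, 2600, 1200],
    "Max HP": [70, 100, 130, 90],
})

# Default behaviour: count, mean, std, min, quartiles and max
# for the numeric columns only
summary = toy.describe()
print(summary)

# include="all" also reports count, unique, top and freq for string columns
all_summary = toy.describe(include="all")
print(all_summary)
```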
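The following sketch illustrates this with two synthetic samples, one symmetric and one skewed, that are rescaled to share the same mean and standard deviation. The samples are randomly generated for the example and the helper `standardise` is a name introduced here; neither has anything to do with the Pokemon data.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardise(x):
    """Rescale a sample to mean 0 and standard deviation 1 (hypothetical helper)."""
    return (x - x.mean()) / x.std()

# Two synthetic samples with very different shapes
symmetric = standardise(rng.normal(size=5000))
skewed = standardise(rng.exponential(size=5000))

# Mean and standard deviation agree, yet the medians do not
for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    print(name, round(sample.mean(), 3), round(sample.std(), 3),
          round(np.median(sample), 3))
```

A histogram of each sample would make the difference obvious at a glance, even though the two summary numbers are identical.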
From the summaries above we know that there were 151 unique Pokemon available at the time the data was collected. Since we have no previous information about CP and HP, we cannot immediately tell whether those values are reasonable. Therefore the data will be plotted and explored.
    fig = plt.figure(figsize=(16, 6))
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122)
    df_go['Max CP'].hist(ax=ax1)
    df_go['Max HP'].hist(ax=ax2)
    ax1.set(xlabel='Max CP', ylabel='Count', title='CP distribution')
    ax2.set(xlabel='Max HP', ylabel='Count', title='HP distribution')
    plt.show()
From the plots we can see that both distributions are slightly positively skewed. The values of both CP and HP sit relatively close together without any particularly obvious outlier. In order to find out relationships between features, a scatter plot can be used. If the plotted points form a line or trend, it is highly likely that there is a relationship between the features. It would be interesting to see whether there is any relationship between CP values and HP values. Common sense tells us that if you are tough, you are probably strong as well.
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111)
    ax.scatter('Max CP', 'Max HP', data=df_go)
    ax.set(xlabel='Max CP', ylabel='Max HP', title='Relationship between CP and HP')
    plt.show()
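A scatter plot shows a trend visually; a correlation coefficient is one way to put a number on it. The sketch below uses the `corr` method of a Pandas series, which computes the Pearson correlation by default. This step is not part of the original analysis, and the values in `toy` are invented.

```python
import pandas as pd

# Toy data (assumption): invented CP and HP values
toy = pd.DataFrame({
    "Max CP": [700, 900, 1000, 1200, 1500, 2600],
    "Max HP": [60, 70, 80, 90, 100, 130],
})

# Pearson correlation: close to +1 means a strong positive linear trend,
# close to 0 means no linear trend, close to -1 a strong negative one
r = toy["Max CP"].corr(toy["Max HP"])
print(round(r, 3))
```

With the real data frame, the same call would be `df_go['Max CP'].corr(df_go['Max HP'])`.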
The scatter plot shows that there is in fact a relationship between those features. The tougher the Pokemon is, the harder it hits! Finally, it would be interesting to see whether different types of Pokemon have different abilities! This can be shown by box plots. Box plots are informative as they indicate where the range, quartiles and median lie. Normally distributed data would be symmetric about the horizontal line at the median (which in that case is also the mean).
    plt.figure(figsize=(10, 7))
    sns.boxplot(x='Type 1', y='Max HP', data=df_go)
    plt.show()
What a colourful plot! It should be noted that this box plot is created by Seaborn rather than the commonly used Matplotlib. Apart from the interesting colours, I find the implementation of box plots in Seaborn to be a lot simpler compared to Matplotlib. This is a very important point to emphasise because although some modules might have similar functions, the implementations can be very different. It is very beneficial to know more than one way to achieve the same thing in order to maximise productivity depending on the environment. As for the interpretation of the graph, it can be seen that both Fairy type and Ice type are very strong! However, the toughest one by far is a Normal type Pokemon. After some investigation, it turns out to be Chansey!
(Image from: http://bulbapedia.bulbagarden.net/wiki/Chansey_(Pokémon))
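The post does not show how the investigation was done, so the following is only one possible way to do it, sketched on invented data. Grouping by type gives the median that each box in the plot is drawn around, and `idxmax` locates the row with the single largest HP value.

```python
import pandas as pd

# Toy data (assumption): invented names and HP values
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Type 1": ["Normal", "Normal", "Fire", "Fire", "Water", "Water"],
    "Max HP": [400, 90, 70, 130, 100, 95],
})

# Median HP per type, highest first
median_hp = (toy.groupby("Type 1")["Max HP"]
                .median()
                .sort_values(ascending=False))
print(median_hp)

# Row holding the single largest Max HP value
toughest = toy.loc[toy["Max HP"].idxmax()]
print(toughest["Name"], toughest["Type 1"])
```

Applied to the real data frame, `df_go.loc[df_go['Max HP'].idxmax()]` would return the row of the toughest Pokemon.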
In summary, this post has covered some basic techniques in data loading, examining, cleaning and visualising. We have also discussed the importance of these operations. Although rather simple, they remain some of the most important tools in the data science tool box. The fact that we have already generated some insights from the data (such as the toughest Pokemon type) shows this. It is always worth remembering that great things are built up from many simple yet critical components! In the next post we will look into the geographical data related to Pokemon Go and how long the Pokemon hang around. Until then, happy coding!