introduction_pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=\"white\">.</font> | <font color=\"white\">.</font> | <font color=\"white\">.</font>\n",
    "-- | -- | --\n",
    "![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size=\"+3\">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)\n",
    "\n",
    "---\n",
    "\n",
    "<center><h1><font color=\"red\" size=\"+3\">Introduction to Pandas</font></h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%html\n",
    "<!DOCTYPE html>\n",
    "<html lang=\"en\">\n",
    "  <head> </head>\n",
    "  <body>\n",
    "<script src=\"https://bot.voiceatlas.mysmce.com/v1/chatlas.js\"></script>\n",
    "<app-chatlas\n",
    "\tatlas-id=\"f759a188-f8bb-46bb-9046-3b1b961bd6aa\"\n",
    "\twidget-background-color=\"#3f51b5ff\"\n",
    "\twidget-text-color=\"#ffffffff\"\n",
    "\twidget-title=\"Chatlas\">\n",
    "</app-chatlas>\n",
    "  </body>\n",
    "</html>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color=\"red\">Useful References</font>\n",
    "- <a href=\"https://bitbucket.org/hrojas/learn-pandas/src/master/\">Learn Pandas</a> (by Hernan Rojas)\n",
    "- <a href=\"https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/\"> Python Pandas Tutorial: A Complete Introduction for Beginners</a>\n",
    "- <a href=\"https://www.python-course.eu/pandas.php\">Introduction into Pandas</a>\n",
    "- <a href=\"http://earthpy.org/pandas-basics.html\">Time series analysis with pandas</a>\n",
    "- <a href=\"https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html\">Working with Time Series</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![fig_logo](https://miro.medium.com/max/3200/1*9v51-jsfHtk6fgAIYLoiHQ.jpeg)\n",
    "Image Source: pandas.pydata.org\n",
    "\n",
    "## <font color=\"red\">What is Pandas?</font>\n",
    "+ `Pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\n",
    "+ Some key features:\n",
    "    - Fast and efficient DataFrame object with default and customized indexing.\n",
    "    - Tools for loading data into in-memory data objects from different file formats.\n",
    "    - Data alignment and integrated handling of missing data.\n",
    "    - Reshaping and pivoting of data sets.\n",
    "    - Label-based slicing, indexing and subsetting of large data sets.\n",
    "    - Columns from a data structure can be deleted or inserted.\n",
    "    - Group by data for aggregation and transformations.\n",
    "    - High performance merging and joining of data.\n",
    "    - Time Series functionality.\n",
    "+ Able to manipulate several <a href=\"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\">types of files</a>, including CSVs, TSVs , JSONs, HTML, xlsx, HDF5, Python Pickle, among others.\n",
    "* Is compatible with many of the other data analysis libraries, like Scikit-Learn, Matplotlib, NumPy, and more. \n",
    "\n",
    "Some of key features of `Pandas` are captured in the diagram below:\n",
    "\n",
    "![fig_features](https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2019/04/Python-Pandas-Features.jpg)\n",
    "Image Source: data-flair.training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What Will be Covered\n",
    "\n",
    "1. Pandas data structures\n",
    "2. Grouby Function\n",
    "3. Reading remote CSV files and tables.\n",
    "4. Cleaning and formatting data\n",
    "5. Manipulating time series data\n",
    "6. Performing statistical calculations\n",
    "7. Visualizing the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Standard imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import datetime\n",
    "import pandas as pd\n",
    "print('Using pandas version ',pd.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Notebook settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "# Only 10 rows of data will be displayed\n",
    "pd.set_option(\"max_rows\", 10) \n",
    "\n",
    "# Print floating point numbers using fixed point notation\n",
    "np.set_printoptions(suppress=True)\n",
    "\n",
    "# Set figure size\n",
    "LARGE_FIGSIZE = (8, 6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Graphics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#from seaborn import set_style\n",
    "#set_style(\"darkgrid\")\n",
    "import seaborn as sns\n",
    "sns.set(style='ticks', context='talk')\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color=\"red\">`pandas` Data Structures\n",
    "- Pandas data structures are similar to numpy ndarrays but with extra functionality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1D data structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A <font color='red'>Series</font>  is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. \n",
    "\n",
    "Think of a Series as a cross between a list and a dict."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/pandas_series.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A series can be constructed with the `pd.Series` constructor (passing a list or array of values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_list = [5, 8, 13, 0.1, -5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a list to create a Numpy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = np.array(my_list)\n",
    "print(type(a))\n",
    "print(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a list to create a Pandas Series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr = pd.Series(my_list)\n",
    "print(type(sr))\n",
    "print(sr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "...get default index values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### NumPy arrays as backend of Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Contains an array of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr.values  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- If nothing else is specified, the values are labeled with their index number. \n",
    "- The Pandas Series will then have an associated array of data labels from `0`, to `N-1`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sr.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_rows = list(range(5))\n",
    "print(my_rows)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sr.index.values "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain statistical information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### More on the index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename the index values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr.index = ['A','B','C','D','E']\n",
    "print(sr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or pass the index values during Pandas series creation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr1 = pd.Series(my_list, index=['A','B','C','D','E'])\n",
    "print(sr1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Numpy Array has an implicitly defined integer index used to access the values while the Pandas Series has an explicitly defined index associated with the values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get value at position `n` in series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(sr[3])  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `iloc` (integer location) to get value at position `n`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sr.iloc[3]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Value at given index using dictionary-like syntax"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sr.loc['D'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also create a Pandas Series from a dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr2 = pd.Series(dict(A=5, B=8, C=13, D=0.1, E=-5))\n",
    "sr2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also, create a Pandas Series from a scalar data. But, if you pass a single value with multiple indexes, the value will be same for all the indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr3 = pd.Series(10.5, index=['A','B','C','D','E'])\n",
    "print(sr3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2D data structures\n",
    "\n",
    "Pandas: <font color='red'>DataFrame</font> is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.\n",
    "\n",
    "A <font color='red'>DataFrame</font> is like a sequence of aligned <font color='red'>Series</font> objects, i.e. they share the same index.\n",
    "\n",
    "![title](img/pandas_df.png)\n",
    "\n",
    "\n",
    "A DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(data=[[5, True, 'x', 2.7],\n",
    "                        [8, True, 'y', 3.1],\n",
    "                        [13,False,'z', np.NaN],\n",
    "                        [1, False, 'a', 0.1],\n",
    "                        [-5, True, 'b', -2]],\n",
    "                  index=['A','B','C','D','E'],\n",
    "                  columns=['num', 'bool', 'str', 'real'])\n",
    "print(type(df))\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display the first few rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display the last few rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the number of rows and columns as a tuple:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the type of each column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain basic data information (column count, number of values in each column, data type of each column, etc.):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain descriptive statistics of each numeric column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can pass the argument `include='object'` to return the descriptive statistics of categorical (object) columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe(include='object')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get list of column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the index values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get specific column(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['num']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['num','real']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Label-based selection\n",
    "\n",
    "Get specific row(s) by name(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['C']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[['B', 'D']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['A':'E':2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get specific row(s) and column(s) by name(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['A':'D':2, ['num', 'real']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['A':'C', 'num':'real']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Index-based selection\n",
    "\n",
    "Get specific row(s) by position(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[1:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get specific row(s) and column(s) by position(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[[2,4], [1,3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display one random row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sample()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select columns based on datatype:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.select_dtypes(include='object')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply masking:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.real > 1.0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.real == 3.1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Problem with `NaN`:\n",
    "- In Python (and NumPy), the `nan`'s don’t compare to equal. \n",
    "- Pandas/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`.\n",
    "- A scalar equality comparison versus a `None/np.nan` doesn’t provide useful information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.real"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.real == np.NaN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `isnull` method to find out which dataframe entries are '`NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isnull().values.any()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create a DataFrame from a 2D Numpy array\n",
    "\n",
    "Given a two-dimensional array of data, we can create a dataframe with any specified column and index names. If left out, an integer index will be used for each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_nparray = np.random.rand(3, 2)\n",
    "print(\"Numpy array: \", my_nparray)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a Pandas dataframe using a Numpy array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pddf = pd.DataFrame(my_nparray,\n",
    "                    columns=['foo', 'bar'],\n",
    "                    index=['a', 'b', 'c'])\n",
    "pddf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a NumPy array from a Pandas DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_nparray2 = pddf.to_numpy()\n",
    "print(f\"Numpy array: {my_nparray2}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a Pandas dataframe using Pandas series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdsr1 = pd.Series(np.random.rand(3))\n",
    "print(\"First_Series: \\n\", pdsr1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdsr2 = pd.Series(np.random.rand(3))\n",
    "print(\"Second_Series: \\n\", pdsr2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = pd.DataFrame(dict(First_Series = pdsr1, Second_Series = pdsr2))\n",
    "print(df1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color='green'>Exercise</font>\n",
    "In the above Pandas dataframe, relabel the index as `['Row0', 'Row1', 'Row2']`."
   ]