Skip to content
Snippets Groups Projects
introduction_pandas.ipynb 83.2 KiB
Newer Older
Jules Kouatchou's avatar
Jules Kouatchou committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "<font color=\"white\">.</font> | <font color=\"white\">.</font> | <font color=\"white\">.</font>\n",
    "-- | -- | --\n",
    "![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size=\"+3\">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "\n",
    "---\n",
    "\n",
    "<center><h1><font color=\"red\" size=\"+3\">Introduction to Pandas</font></h1></center>"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%html\n",
    "<!DOCTYPE html>\n",
    "<html lang=\"en\">\n",
    "  <head> </head>\n",
    "  <body>\n",
    "<script src=\"https://bot.voiceatlas.mysmce.com/v1/chatlas.js\"></script>\n",
    "<app-chatlas\n",
    "\tatlas-id=\"f759a188-f8bb-46bb-9046-3b1b961bd6aa\"\n",
    "\twidget-background-color=\"#3f51b5ff\"\n",
    "\twidget-text-color=\"#ffffffff\"\n",
    "\twidget-title=\"Chatlas\">\n",
    "</app-chatlas>\n",
    "  </body>\n",
    "</html>"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color=\"red\">Useful References</font>\n",
    "- <a href=\"https://bitbucket.org/hrojas/learn-pandas/src/master/\">Learn Pandas</a> (by Hernan Rojas)\n",
    "- <a href=\"https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/\"> Python Pandas Tutorial: A Complete Introduction for Beginners</a>\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "- <a href=\"https://www.python-course.eu/pandas.php\">Introduction into Pandas</a>\n",
    "- <a href=\"http://earthpy.org/pandas-basics.html\">Time series analysis with pandas</a>\n",
    "- <a href=\"https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html\">Working with Time Series</a>"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "![fig_logo](https://miro.medium.com/max/3200/1*9v51-jsfHtk6fgAIYLoiHQ.jpeg)\n",
    "Image Source: pandas.pydata.org\n",
    "\n",
    "## <font color=\"red\">What is Pandas?</font>\n",
    "+ `Pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "+ Some key features:\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "    - Fast and efficient DataFrame object with default and customized indexing.\n",
    "    - Tools for loading data into in-memory data objects from different file formats.\n",
    "    - Data alignment and integrated handling of missing data.\n",
    "    - Reshaping and pivoting of data sets.\n",
    "    - Label-based slicing, indexing and subsetting of large data sets.\n",
    "    - Columns from a data structure can be deleted or inserted.\n",
    "    - Group by data for aggregation and transformations.\n",
    "    - High performance merging and joining of data.\n",
    "    - Time Series functionality.\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "+ Able to manipulate several <a href=\"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\">types of files</a>, including CSVs, TSVs , JSONs, HTML, xlsx, HDF5, Python Pickle, among others.\n",
    "* Is compatible with many of the other data analysis libraries, like Scikit-Learn, Matplotlib, NumPy, and more. \n",
    "\n",
    "Some of key features of `Pandas` are captured in the diagram below:\n",
    "\n",
    "![fig_features](https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2019/04/Python-Pandas-Features.jpg)\n",
    "Image Source: data-flair.training"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What Will be Covered\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "\n",
    "1. Pandas data structures\n",
    "2. Grouby Function\n",
    "3. Reading remote CSV files and tables.\n",
    "4. Cleaning and formatting data\n",
    "5. Manipulating time series data\n",
    "6. Performing statistical calculations\n",
    "7. Visualizing the data"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "### Standard imports"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "import numpy as np\n",
    "import datetime\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "import pandas as pd\n",
    "print('Using pandas version ',pd.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Notebook settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "# Only 10 rows of data will be displayed\n",
    "pd.set_option(\"max_rows\", 10) \n",
    "\n",
    "# Print floating point numbers using fixed point notation\n",
    "np.set_printoptions(suppress=True)\n",
    "\n",
    "# Set figure size\n",
    "LARGE_FIGSIZE = (8, 6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Graphics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#from seaborn import set_style\n",
    "#set_style(\"darkgrid\")\n",
    "import seaborn as sns\n",
    "sns.set(style='ticks', context='talk')\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color=\"red\">`pandas` Data Structures\n",
    "- Pandas data structures are similar to numpy ndarrays but with extra functionality."
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1D data structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A <font color='red'>Series</font>  is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. \n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "\n",
    "Think of a Series as a cross between a list and a dict."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/pandas_series.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A series can be constructed with the `pd.Series` constructor (passing a list or array of values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_list = [5, 8, 13, 0.1, -5]"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a list to create a Numpy array:"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "a = np.array(my_list)\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "print(type(a))\n",
    "print(a)"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a list to create a Pandas Series:"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "sr = pd.Series(my_list)\n",
    "print(type(sr))\n",
    "print(sr)"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "...get default index values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### NumPy arrays as backend of Pandas"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Contains an array of data:"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "sr.values  "
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "- If nothing else is specified, the values are labeled with their index number. \n",
    "- The Pandas Series will then have an associated array of data labels from `0`, to `N-1`:"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "sr.index"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_rows = list(range(5))\n",
    "print(my_rows)"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sr.index.values "
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain statistical information:"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "sr.describe()"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### More on the index"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename the index values:"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "sr.index = ['A','B','C','D','E']\n",
    "print(sr)"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or pass the index values during Pandas series creation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "sr1 = pd.Series(my_list, index=['A','B','C','D','E'])\n",
    "print(sr1)"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Numpy Array has an implicitly defined integer index used to access the values while the Pandas Series has an explicitly defined index associated with the values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get value at position `n` in series"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "print(sr[3])  "
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Use `iloc` (integer location) to get value at position `n`"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "print(sr.iloc[3]) "
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Value at given index using dictionary-like syntax"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "print(sr.loc['D'])"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also create a Pandas Series from a dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr2 = pd.Series(dict(A=5, B=8, C=13, D=0.1, E=-5))\n",
    "sr2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also, create a Pandas Series from a scalar data. But, if you pass a single value with multiple indexes, the value will be same for all the indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sr3 = pd.Series(10.5, index=['A','B','C','D','E'])\n",
    "print(sr3)"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "#### 2D data structures\n",
    "\n",
    "Pandas: <font color='red'>DataFrame</font> is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.\n",
    "\n",
    "A <font color='red'>DataFrame</font> is like a sequence of aligned <font color='red'>Series</font> objects, i.e. they share the same index.\n",
    "\n",
    "![title](img/pandas_df.png)\n",
    "\n",
    "\n",
    "A DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(data=[[5, True, 'x', 2.7],\n",
    "                        [8, True, 'y', 3.1],\n",
    "                        [13,False,'z', np.NaN],\n",
    "                        [1, False, 'a', 0.1],\n",
    "                        [-5, True, 'b', -2]],\n",
    "                  index=['A','B','C','D','E'],\n",
    "                  columns=['num', 'bool', 'str', 'real'])\n",
    "print(type(df))\n",
    "df"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Display the first few rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Display the last few rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.tail()"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Get the number of rows and columns as a tuple:"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the type of each column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Obtain basic data information (column count, number of values in each column, data type of each column, etc.):"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "df.info()"
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Obtain descriptive statistics of each numeric column:"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "df.describe()"
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can pass the argument `include='object'` to return the descriptive statistics of categorical (object) columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe(include='object')"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get list of column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the index values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get specific column(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['num']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['num','real']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "#### Label-based selection\n",
    "\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Get specific row(s) by name(s):"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['C']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[['B', 'D']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['A':'E':2]"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get specific row(s) and column(s) by name(s):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['A':'D':2, ['num', 'real']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc['A':'C', 'num':'real']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "#### Index-based selection\n",
    "\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Get specific row(s) by position(s):"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[1:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "Get specific row(s) and column(s) by position(s):"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "df.iloc[[2,4], [1,3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display one random row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sample()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select columns based on datatype:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.select_dtypes(include='object')"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply masking:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.real > 1.0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.real == 3.1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Problem with `NaN`:\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "- In Python (and NumPy), the `nan`'s don’t compare to equal. \n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "- Pandas/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`.\n",
    "- A scalar equality comparison versus a `None/np.nan` doesn’t provide useful information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.real"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.real == np.NaN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `isnull` method to find out which dataframe entries are '`NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isnull().values.any()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isnull().sum()"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create a DataFrame from a 2D Numpy array\n",
    "\n",
    "Given a two-dimensional array of data, we can create a dataframe with any specified column and index names. If left out, an integer index will be used for each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_nparray = np.random.rand(3, 2)\n",
    "print(\"Numpy array: \", my_nparray)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a Pandas dataframe using a Numpy array"
Jules Kouatchou's avatar
Jules Kouatchou committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pddf = pd.DataFrame(my_nparray,\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "                    columns=['foo', 'bar'],\n",
    "                    index=['a', 'b', 'c'])\n",
Jules Kouatchou's avatar
Jules Kouatchou committed
    "pddf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a NumPy array from a Pandas DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_nparray2 = pddf.to_numpy()\n",
    "print(f\"Numpy array: {my_nparray2}\")"
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a Pandas dataframe using Pandas series"
   ]
  },
Jules Kouatchou's avatar
Jules Kouatchou committed
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdsr1 = pd.Series(np.random.rand(3))\n",
    "print(\"First_Series: \\n\", pdsr1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "pdsr2 = pd.Series(np.random.rand(3))\n",
    "print(\"Second_Series: \\n\", pdsr2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "df1 = pd.DataFrame(dict(First_Series = pdsr1, Second_Series = pdsr2))\n",
    "print(df1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Jules Kouatchou's avatar
Jules Kouatchou committed
    "## <font color='green'>Exercise</font>\n",
    "In the above Pandas dataframe, relabel the index as `['Row0', 'Row1', 'Row2']`."
   ]