Newer
Older
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"white\">.</font> | <font color=\"white\">.</font> | <font color=\"white\">.</font>\n",
"-- | -- | --\n",
" | <h1><font size=\"+3\">ASTG Python Courses</font></h1> | \n",
"\n",
"<center><h1><font color=\"red\" size=\"+3\">Introduction to Pandas</font></h1></center>"
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%html\n",
"<!DOCTYPE html>\n",
"<html lang=\"en\">\n",
" <head> </head>\n",
" <body>\n",
"<script src=\"https://bot.voiceatlas.mysmce.com/v1/chatlas.js\"></script>\n",
"<app-chatlas\n",
"\tatlas-id=\"f759a188-f8bb-46bb-9046-3b1b961bd6aa\"\n",
"\twidget-background-color=\"#3f51b5ff\"\n",
"\twidget-text-color=\"#ffffffff\"\n",
"\twidget-title=\"Chatlas\">\n",
"</app-chatlas>\n",
" </body>\n",
"</html>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <font color=\"red\">Useful References</font>\n",
"- <a href=\"https://bitbucket.org/hrojas/learn-pandas/src/master/\">Learn Pandas</a> (by Hernan Rojas)\n",
"- <a href=\"https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/\"> Python Pandas Tutorial: A Complete Introduction for Beginners</a>\n",
"- <a href=\"https://www.python-course.eu/pandas.php\">Introduction into Pandas</a>\n",
"- <a href=\"http://earthpy.org/pandas-basics.html\">Time series analysis with pandas</a>\n",
"- <a href=\"https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html\">Working with Time Series</a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Image Source: pandas.pydata.org\n",
"\n",
"## <font color=\"red\">What is Pandas?</font>\n",
"+ `Pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\n",
" - Fast and efficient DataFrame object with default and customized indexing.\n",
" - Tools for loading data into in-memory data objects from different file formats.\n",
" - Data alignment and integrated handling of missing data.\n",
" - Reshaping and pivoting of data sets.\n",
" - Label-based slicing, indexing and subsetting of large data sets.\n",
" - Columns from a data structure can be deleted or inserted.\n",
" - Group by data for aggregation and transformations.\n",
" - High performance merging and joining of data.\n",
" - Time Series functionality.\n",
"+ Able to manipulate several <a href=\"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\">types of files</a>, including CSVs, TSVs , JSONs, HTML, xlsx, HDF5, Python Pickle, among others.\n",
"* Is compatible with many of the other data analysis libraries, like Scikit-Learn, Matplotlib, NumPy, and more. \n",
"\n",
"Some of key features of `Pandas` are captured in the diagram below:\n",
"\n",
"\n",
"Image Source: data-flair.training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"1. Pandas data structures\n",
"2. Grouby Function\n",
"3. Reading remote CSV files and tables.\n",
"4. Cleaning and formatting data\n",
"5. Manipulating time series data\n",
"6. Performing statistical calculations\n",
"7. Visualizing the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
"import pandas as pd\n",
"print('Using pandas version ',pd.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Notebook settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"# Only 10 rows of data will be displayed\n",
"pd.set_option(\"max_rows\", 10) \n",
"\n",
"# Print floating point numbers using fixed point notation\n",
"np.set_printoptions(suppress=True)\n",
"\n",
"# Set figure size\n",
"LARGE_FIGSIZE = (8, 6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Graphics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#from seaborn import set_style\n",
"#set_style(\"darkgrid\")\n",
"import seaborn as sns\n",
"sns.set(style='ticks', context='talk')\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <font color=\"red\">`pandas` Data Structures\n",
"- Pandas data structures are similar to numpy ndarrays but with extra functionality."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1D data structures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A <font color='red'>Series</font> is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. \n",
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
"\n",
"Think of a Series as a cross between a list and a dict."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A series can be constructed with the `pd.Series` constructor (passing a list or array of values)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_list = [5, 8, 13, 0.1, -5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use a list to create a Numpy array:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use a list to create a Pandas Series:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sr = pd.Series(my_list)\n",
"print(type(sr))\n",
"print(sr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...get default index values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### NumPy arrays as backend of Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Contains an array of data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- If nothing else is specified, the values are labeled with their index number. \n",
"- The Pandas Series will then have an associated array of data labels from `0`, to `N-1`:"
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_rows = list(range(5))\n",
"print(my_rows)"
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"sr.index.values "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain statistical information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### More on the index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename the index values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sr.index = ['A','B','C','D','E']\n",
"print(sr)"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or pass the index values during Pandas series creation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sr1 = pd.Series(my_list, index=['A','B','C','D','E'])\n",
"print(sr1)"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Numpy Array has an implicitly defined integer index used to access the values while the Pandas Series has an explicitly defined index associated with the values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get value at position `n` in series"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `iloc` (integer location) to get value at position `n`"
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Value at given index using dictionary-like syntax"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also create a Pandas Series from a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sr2 = pd.Series(dict(A=5, B=8, C=13, D=0.1, E=-5))\n",
"sr2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also, create a Pandas Series from a scalar data. But, if you pass a single value with multiple indexes, the value will be same for all the indexes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sr3 = pd.Series(10.5, index=['A','B','C','D','E'])\n",
"print(sr3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
"#### 2D data structures\n",
"\n",
"Pandas: <font color='red'>DataFrame</font> is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.\n",
"\n",
"A <font color='red'>DataFrame</font> is like a sequence of aligned <font color='red'>Series</font> objects, i.e. they share the same index.\n",
"\n",
"\n",
"\n",
"\n",
"A DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(data=[[5, True, 'x', 2.7],\n",
" [8, True, 'y', 3.1],\n",
" [13,False,'z', np.NaN],\n",
" [1, False, 'a', 0.1],\n",
" [-5, True, 'b', -2]],\n",
" index=['A','B','C','D','E'],\n",
" columns=['num', 'bool', 'str', 'real'])\n",
"print(type(df))\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the first few rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the last few rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the type of each column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.dtypes"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain basic data information (column count, number of values in each column, data type of each column, etc.):"
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain descriptive statistics of each numeric column:"
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can pass the argument `include='object'` to return the descriptive statistics of categorical (object) columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.describe(include='object')"
]
},
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get list of column names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the index values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get specific column(s):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['num']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.num"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[['num','real']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc['C']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc[['B', 'D']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc['A':'E':2]"
]
},
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get specific row(s) and column(s) by name(s):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc['A':'D':2, ['num', 'real']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc['A':'C', 'num':'real']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.iloc[2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.iloc[1:4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
"df.iloc[[2,4], [1,3]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display one random row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.sample()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select columns based on datatype:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.select_dtypes(include='object')"
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply masking:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[df.real > 1.0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[df.real == 3.1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem with `NaN`:\n",
"- In Python (and NumPy), the `nan`'s don’t compare to equal. \n",
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
"- Pandas/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`.\n",
"- A scalar equality comparison versus a `None/np.nan` doesn’t provide useful information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.real"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.real == np.NaN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the `isnull` method to find out which dataframe entries are '`NaN`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.isnull()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.isnull().values.any()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a DataFrame from a 2D Numpy array\n",
"\n",
"Given a two-dimensional array of data, we can create a dataframe with any specified column and index names. If left out, an integer index will be used for each."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_nparray = np.random.rand(3, 2)\n",
"print(\"Numpy array: \", my_nparray)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a Pandas dataframe using a Numpy array"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pddf = pd.DataFrame(my_nparray,\n",
" columns=['foo', 'bar'],\n",
" index=['a', 'b', 'c'])\n",
"pddf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a NumPy array from a Pandas DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_nparray2 = pddf.to_numpy()\n",
"print(f\"Numpy array: {my_nparray2}\")"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a Pandas dataframe using Pandas series"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdsr1 = pd.Series(np.random.rand(3))\n",
"print(\"First_Series: \\n\", pdsr1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Second_Series: \\n\", pdsr2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.DataFrame(dict(First_Series = pdsr1, Second_Series = pdsr2))\n",
"print(df1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <font color='green'>Exercise</font>\n",
"In the above Pandas dataframe, relabel the index as `['Row0', 'Row1', 'Row2']`."
]