Pandas is a Python library used for data analysis and manipulation. It includes data structures and tools suitable for working with, cleaning and processing large datasets.
It’s free, open source, and incredibly flexible.
In this guide, we’ll take a look at the basics of Pandas, including the initial setup.
Installing Pandas
Pandas is not part of the Python core, so you’ll probably need to install it. There are a few ways you can do this.
In this guide, we’ll focus on a more generic setup, installing packages using pip. It’s as simple as running:
pip install pandas
On a fresh environment, this installs the latest available version of Pandas. If Pandas is already installed, pip leaves the existing version in place unless you add the --upgrade flag.
If you’re planning to go further with your Python data analysis, you might want to take a look at using Anaconda instead. Anaconda includes several packages useful for data analysis, as well as the conda package manager.
More details of the Anaconda installation process can be found here.
It’s possible to use both pip and conda together, though this is beyond the scope of this tutorial, and is often not recommended.
Whichever method you choose, continue to the next step once you have Pandas installed.
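To confirm the installation worked, a quick check like the one below can help. It is only a sketch: it imports the library and prints the installed version, which will vary depending on when and how you installed it.

```python
# Confirm that Pandas can be imported and report which version is installed.
import pandas

installed_version = pandas.__version__
print(installed_version)
```

If the import fails with a ModuleNotFoundError, Pandas was most likely installed into a different Python environment from the one running the script.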
Importing Pandas in your Project
In order to load data into our Python project, we first need to import the freshly-installed Pandas library. Add the following line to the top of your project.
import pandas as pd
While you’re free to alias Pandas using any name you’d like, we’d recommend sticking with pd. It’s standard practice to do this, and you’ll often find other tutorials using it.
Loading Data Using Pandas
We’ll start by loading data from a CSV file. Download the Iris data set from here, and make a note of the path to the downloaded file.
This file can be loaded into our code using read_csv().
df = pd.read_csv('path/to/csv/file.csv')
Replace the placeholder path with the path to the downloaded file on your system. There are other options that can be passed to read_csv(), but for now, this is enough to load our data.
read_csv() returns a data structure called a ‘data frame’. A data frame can be thought of as the Pandas representation of a table. Again, you can use any variable name here, though it’s common to use df somewhere in the name.
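As an illustration of those other options, the sketch below reads a small CSV from an in-memory string instead of a file on disk, so it runs without any download. The column names and values are made up for the example and are not taken from the real Iris file; usecols and nrows are two of the optional read_csv() parameters.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the column names and values are hypothetical.
csv_text = io.StringIO(
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "7.0,3.2,versicolor\n"
    "6.3,3.3,virginica\n"
)

# usecols limits which columns are loaded; nrows limits how many rows are read.
iris_df = pd.read_csv(csv_text, usecols=['sepal_length', 'species'], nrows=2)

print(iris_df.shape)  # (2, 2): two rows, two columns
```

Passing a file path as the first argument works the same way; the in-memory buffer is only used here to keep the example self-contained.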
If you have data in a dictionary, it’s possible to use it to create a data frame.
data = {
'Name': ['John', 'Emma', 'Sam'],
'Age': [25, 28, 30],
'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)
However you’ve chosen to load the data, you can display the table in an interactive session or notebook by evaluating the variable containing the data frame. In a script, pass it to print() instead.
df
The output should look something like the one below.
   Name  Age      City
0  John   25  New York
1  Emma   28    London
2   Sam   30     Paris
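For larger tables, it can be more practical to look at just a few rows. The sketch below rebuilds the example data frame and shows a handful of ways to inspect it; head() and shape are standard parts of the data frame API.

```python
import pandas as pd

data = {
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)

# In a script, a bare `df` displays nothing, so print it explicitly.
print(df)

# head(n) returns only the first n rows, which helps with large data sets.
first_two = df.head(2)
print(first_two)

# shape is a (rows, columns) tuple.
print(df.shape)  # (3, 3)
```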
Querying the DataFrame
With data loaded into a data frame, it’s possible to run queries to extract or analyze the data.
If you have used either spreadsheet or database software, some of the queries will probably look familiar.
Accessing Values
To access values for a single column, the data frame can be treated like a dictionary, with the column name as the key.
name = df['Name']
For multiple columns, column names must be provided inside a list.
fields = df[['Name', 'Age']]
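The two forms return different kinds of object, which is easy to miss. A minimal sketch, using the example data from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'City': ['New York', 'London', 'Paris']
})

# A single column name returns a Series (one column of values).
name = df['Name']
print(type(name).__name__)    # Series

# A list of column names returns a smaller DataFrame.
fields = df[['Name', 'Age']]
print(type(fields).__name__)  # DataFrame

print(name.tolist())          # ['John', 'Emma', 'Sam']
```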
Fetching data by row can be done using either loc or iloc.
row = df.loc[0]
There’s a small but important difference between the two.
loc searches based on the row index, the numbers on the left-hand side of the previous table output. If not provided, indices are added automatically, counting up from 0.
iloc, on the other hand, uses the row’s position in the current order of the table.
On an unaltered table, the two will return the same value.
The two differ once the table has been altered. For example, sorting the table by ‘Name’ will change the order of records, but the indices will not be altered.
   Name  Age      City
1  Emma   28    London
0  John   25  New York
2   Sam   30     Paris
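In this instance, we will see the following results. They are shown here as table rows for readability; Pandas itself prints a single row as a Series, with one field per line.
## df.loc[0]
   Name  Age      City
0  John   25  New York
## df.iloc[0]
   Name  Age    City
1  Emma   28  London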
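The difference can be reproduced with a short, runnable sketch. It sorts the example data by name and then looks up row 0 both ways; the final step, reset_index(drop=True), is an optional extra that renumbers the index to match the new order.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'City': ['New York', 'London', 'Paris']
})

# Sorting changes the row order but keeps each row's original index label.
sorted_df = df.sort_values('Name')
print(sorted_df.index.tolist())  # [1, 0, 2]

by_label = sorted_df.loc[0]      # the row whose index label is 0
by_position = sorted_df.iloc[0]  # the row currently in first position

print(by_label['Name'])     # John
print(by_position['Name'])  # Emma

# Renumbering the index makes loc and iloc agree again.
renumbered = sorted_df.reset_index(drop=True)
print(renumbered.loc[0]['Name'])  # Emma
```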
It’s also possible to add conditions to filter the rows in the data frame.
filtered_rows = df[df['Age'] > 25]
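Under the hood, the comparison produces a column of True/False values, and the data frame keeps only the rows marked True. The sketch below shows that, and also how conditions might be combined with & (and) or | (or). Each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'City': ['New York', 'London', 'Paris']
})

# The comparison on its own yields a boolean Series, one value per row.
is_older = df['Age'] > 25
print(is_older.tolist())  # [False, True, True]

# Indexing with that Series keeps only the rows where it is True.
filtered_rows = df[is_older]

# Combine conditions with & or |, wrapping each one in parentheses.
older_not_paris = df[(df['Age'] > 25) & (df['City'] != 'Paris')]
print(older_not_paris['Name'].tolist())  # ['Emma']
```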
Getting Basic Statistics
Pandas simplifies the process of calculating statistics on your data. You can compute various metrics like the mean, maximum, minimum, and more with ease. For example, to find the average age in your data frame, you can use the mean() method.
average_age = df['Age'].mean()
Similarly, to determine the maximum salary in a data frame that has a Salary column, you can use max().
max_salary = df['Salary'].max()
Functions are available to compute a range of other column statistics, such as the median, sum, minimum, value counts and the number of unique values.
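A few of these are sketched below against the example data. The method names (median(), sum(), min(), value_counts(), nunique() and describe()) are all part of Pandas; the data is the small table used earlier in this guide.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'City': ['New York', 'London', 'Paris']
})

median_age = df['Age'].median()  # middle value of the column
total_age = df['Age'].sum()      # all values added together
youngest = df['Age'].min()       # smallest value

# value_counts() tallies how often each distinct value appears.
city_counts = df['City'].value_counts()

# nunique() gives the number of distinct values.
distinct_cities = df['City'].nunique()

print(median_age, total_age, youngest, distinct_cities)

# describe() summarises a numeric column in one call.
print(df['Age'].describe())
```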
Column Manipulation
Pandas provides a versatile set of tools for column manipulation. This is useful for altering and restructuring your data frame if you’d like to perform further analysis.
Column Renaming
To rename a column, use the rename() method. For instance, if you want to rename the “Age” column to “Years,” you can do so like this:
df = df.rename(columns={'Age': 'Years'})
Renaming columns can improve the clarity and readability of your data.
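rename() returns a new data frame rather than changing the existing one, which is why the result is assigned back to df above. The sketch below illustrates this and renames two columns in one call; “Location” is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'City': ['New York', 'London', 'Paris']
})

# Several columns can be renamed at once by adding more entries to the mapping.
renamed = df.rename(columns={'Age': 'Years', 'City': 'Location'})

print(list(renamed.columns))  # ['Name', 'Years', 'Location']

# The original data frame keeps its column names until you reassign it.
print(list(df.columns))       # ['Name', 'Age', 'City']
```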
Column Deletion
Sometimes, you may want to remove unnecessary columns from your data frame. The drop() method is used for this purpose.
df = df.drop(columns=['Salary'])
Dropping columns is useful when you need to focus on specific information or reduce data dimensionality.
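Here is a small, self-contained sketch of drop(). The Salary figures are invented for the example. By default, dropping a column that does not exist raises a KeyError; the optional errors='ignore' argument skips missing columns instead.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30],
    'Salary': [50000, 60000, 70000]  # hypothetical values
})

# drop() returns a new data frame without the listed columns.
trimmed = df.drop(columns=['Salary'])
print(list(trimmed.columns))  # ['Name', 'Age']

# 'Salary' is already gone here, so errors='ignore' avoids a KeyError.
still_trimmed = trimmed.drop(columns=['Salary'], errors='ignore')
print(list(still_trimmed.columns))  # ['Name', 'Age']
```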
Column Creation
You can create a new column by performing operations on existing columns. For example, if the Salary column holds a monthly figure, you can calculate the annual income of individuals in your data frame as follows:
df['Annual Income'] = df['Salary'] * 12
This not only works with numeric values, but can also be used to copy and manipulate strings. In addition, you can use any data-type-specific functions (e.g. formatting DateTime fields) when creating your new value.
It’s also possible to manually specify values, supplying one value per row.
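As a rough illustration of the string and date cases, the sketch below builds new columns from text and from a date column. The “Joined” column and its values are hypothetical and exist only for this example; the .str and .dt accessors and pd.to_datetime() are standard Pandas features.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'City': ['New York', 'London'],
    'Joined': ['2021-03-15', '2022-11-02']  # hypothetical dates, stored as text
})

# String columns can be combined with + just like numbers.
df['Label'] = df['Name'] + ' (' + df['City'] + ')'

# The .str accessor exposes string functions for a whole column.
df['Name Upper'] = df['Name'].str.upper()

# Convert the text to real dates, then use the .dt accessor on the result.
df['Joined'] = pd.to_datetime(df['Joined'])
df['Joined Year'] = df['Joined'].dt.year

print(df['Label'].tolist())        # ['John (New York)', 'Emma (London)']
print(df['Joined Year'].tolist())  # [2021, 2022]
```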
df['Salary'] = [50000, 60000, 70000]
Updating Values
A column in a data frame can be treated much like a variable. To update its values, assign the result of a function or other operation back to the column. The apply() method accepts a lambda function, which lets you run a custom calculation on each value.
df['Age'] = df['Age'].apply(lambda x: x + 1)
The specified calculation will be applied to all values in this field.
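For simple arithmetic like this, the same result can usually be written without apply() by operating on the column directly. The sketch below compares the two approaches on the example ages; apply() remains useful when the logic does not reduce to a simple expression.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [25, 28, 30]
})

# apply() calls the lambda once for every value in the column.
with_apply = df['Age'].apply(lambda x: x + 1)

# Arithmetic on the column itself is applied to every value in one step.
direct = df['Age'] + 1

print(with_apply.tolist())  # [26, 29, 31]
print(direct.tolist())      # [26, 29, 31]
```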
Conclusion
This only scratches the surface of what is possible with data frames, but it is enough to get you started. Check out our more advanced tutorials for more in-depth queries.