- 11 Mar 2024
- 6 Minutes to read
- Print
- DarkLight
Basics
- Updated on 11 Mar 2024
- 6 Minutes to read
- Print
- DarkLight
Pandas is a part of the SciPy (Scientific Python) technology stack. It allows easy and efficient data processing and visualization. Below are some examples of some uses for Pandas:
- tabular data, consisting of rows and columns similar to SQL tables or spreadsheets
- statistical and other algorithmic analysis
- plotting and visualization
- time series and time-date calculations
- importing and exporting data
There are three primary data structures that Pandas uses:
- series (1 dimensional)
- data frame (2 dimensional)
- panel (3 dimensional)
A series is very much like an array in which there is an index associated with a value. An index of a series can be any data type and not just the position of the array.
Data frames are the primary object Pandas uses, and they are essentially a series of series and are similar to a spreadsheet.
Panels are generally not needed for compound queries but they might be useful in very specific applications.
Getting a Data Frame
The primary method of getting a data frame through IMAT is using the get_hits()
function.
Query:
dataframe = get_hits('()UNIVERSE', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex')
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | |
0 | 99541 | 2 | male |
1 | 99541 | 2 | male |
2 | 99541 | 3 | male |
3 | 99541 | 2 | male |
4 | 99541 | 0 | male |
This data frame pulled out five records from the "Records" store and displayed the requested information about each record. The first column is the index which defaults to the position of each row. The other three columns are the fields requested in the query: patient_id, age and gender.
There are other methods to create a data frame which can be found on the Pandas documentation website.
One example is to create a data frame using a dictionary:
Query:
import pandas as pd
dataframe = pd.DataFrame({'col1': [0,1,2,3], 'col2': ['a', 'b', 'c', 'd']})
print(dataframe)
Response:
col1 | col2 | |
0 | 0 | a |
1 | 1 | b |
2 | 2 | c |
3 | 3 | d |
Accessing Columns
Some data frames can be quite large and the report writer may wish to view a specific column to check for all values in the column or to use for statistical analysis.
Refer to the first query and response in Getting a Data Frame; these will be used in the following examples.
Using the previous dataframe query, enter the following code:
Query:
age=dataframe['extract.age']
print(age)
Response:
0 | 2 |
1 | 2 |
2 | 3 |
3 | 2 |
4 | 0 |
Name: extract.age, dtype:int64 |
This is returning a series object. To get a Python list object from any series, use series.tolist()
:
Query:
age = dataframe['extract.age'].tolist()
print(age)
Response:
[2, 2, 3, 2, 0]
It is possible to access multiple columns to get a smaller data frame:
Query:
dataframe = dataframe[['extract.age', 'extract.patient_sex']]
print(dataframe)
Response:
extract.age | extract.patient_sex | |
0 | 2 | male |
1 | 2 | male |
2 | 3 | male |
3 | 2 | male |
4 | 0 | male |
Iterating Over a Data Frame
Sometimes it is necessary to do a complex operation on each row of the data frame. To do this, use .iterrows()
, which returns a tuple of the index and the row. Once again, refer to the dataframe query and response at the beginning of Getting a Dataframe.
Query:
for index, row in dataframe.iterrows():
print("INDEX = %s\nROW = %s\n" % (index, row))
Response:
INDEX = 0
ROW = extract.patient_id 99541
extract.age 2
extract.patient_sex male
Name: 0, type: object
INDEX = 1
ROW = extract.patient_id 99541
extract.age 2
extract.patient_sex male
Name: 1, type: object
INDEX = 2
ROW = extract.patient_id 99541
extract.age 3
extract.patient_sex male
Name: 2, type: object
INDEX = 3
ROW = extract.patient_id 99541
extract.age 2
extract.patient_sex male
Name: 3, type: object
INDEX = 4
ROW = extract.patient_id 99541
extract.age 0
extract.patient_sex male
Name: 4, type: object
Selecting Rows
There are many methods of selecting rows from a data frame. The five most common ways are below. Print the function directly, print(dataframe.iloc[2])
.
- Select by index label:
dataframe.loc[dataframe['extract.patient_id'].isin([1668])] #will return all rows that have a patient ID of 1668
- Select by index location:
dataframe.iloc[2] #will return the third row (Python starts with 0, not 1)
- Select by index label or index location:
dataframe.ix[] #primarily label-based but will fall back to location based
- Select by slices:
dataframe[0:3] #will return the first three rows(0, 1, 2)
- Select by []:
dataframe[dataframe['extract.patient_id'].isin([1668])] #will work like dataframe.loc["label"] dataframe[1] #will work like dataframe.iloc[1]
When using .ix[]
or []
there can be some confusion between labels and locations. For example, there might be a label called 2 and a location at 2. In these cases, it is better to be specific using .loc[]
or .iloc[]
.
Two useful functions for series (or data frame columns) are .idxmax()
and .idxmin()
. They return the index location of the max or min value in the series:
print(dataframe['extract.age'].idxmax())
This displays 2 because it is the index of the row where the extract age is 3, which is the largest age value.
Sorting
After the original data frame has been established, it is possible to sort the table by any field desired. The following query and response are used for the examples in this section as well as in the Filtering and Removing Duplicates sections:
Query:
dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex, extract.patient_full_name')
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name | |
0 | 94451 | 43 | female | Gail Hill |
1 | 94451 | 43 | female | Gail Hill |
2 | 92830 | 36 | female | Rhoda Sweet |
3 | 92721 | 13 | female | Jescie Lyons |
4 | 92570 | 70 | male | Lyle Cash |
Given the data frame above, it may be desirable to sort by the patient_full_name column. Add this code after the query above has been written:
dataframe = dataframe.sort_values(by=['extract.patient_full_name'])
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name | |
0 | 94451 | 43 | female | Gail Hill |
1 | 94451 | 43 | female | Gail Hill |
3 | 92721 | 13 | female | Jescie Lyons |
4 | 92570 | 70 | male | Lyle Cash |
2 | 92830 | 36 | female | Rhoda Sweet |
It is also possible to sort in descending order:
dataframe = dataframe.sort_values(by=['extract.patient_full_name'], ascending=False)
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name | |
2 | 92830 | 36 | female | Rhoda Sweet |
4 | 92570 | 70 | male | Lyle Cash |
3 | 92721 | 13 | female | Jescie Lyons |
0 | 94451 | 43 | female | Gail Hill |
1 | 94451 | 43 | female | Gail Hill |
Filtering
Filtering is both very common and very easy to do with Pandas. Filtering can be done by any field in the data frame. Remember to use proper Python comparators and quotes around any string. Use the same base query and responses from the Sorting section for the following examples.
To filter the data frame to print only those that are over 40 years old, use the following code:
Query:
dataframe = dataframe[dataframe['extract.age'] > 40]
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name | |
0 | 94451 | 43 | female | Gail Hill |
1 | 94451 | 43 | female | Gail Hill |
4 | 92570 | 70 | male | Lyle Cash |
To understand how this works, examine the inner bracket expression, dataframe['extract.age'] > 40
. This is essentially the same as writing the code like this:
above40=dataframe['extract.age'] > 40
print(above40)
Response:
0 | True |
1 | True |
2 | False |
3 | False |
4 | True |
Name: extract.age, type: bool |
This is returning a series which shows the index of the row correlating to that index matched the Boolean expression > 40
.
The expression could be rewritten as dataframe = dataframe[above40]
.
This displays each index and row where the index of above40 is True.
Removing Duplicates
Pandas has a .drop_duplicates()
function which can easily remove duplicate information.
Use the .drop_duplicates()
function without any parameters to remove only complete duplicates:
dataframe = dataframe.drop_duplicates()
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name | |
0 | 94451 | 43 | female | Gail Hill |
2 | 92830 | 36 | female | Rhoda Sweet |
3 | 92721 | 13 | female | Jescie Lyons |
4 | 92570 | 70 | male | Lyle Cash |
Notice the second occurrence of Gail Hill has been removed.
It is also possible to select duplicates based on columns. In this example, all rows with a gender of female is considered a duplicate and can be removed. To do so, the subset
parameter must be used:
dataframe = dataframe.drop_duplicates(subset=['extract.patient_sex'])
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name | |
0 | 94451 | 43 | female | Gail Hill |
4 | 92570 | 70 | male | Lyle Cash |
NaN Values
NaN (Not a Number) Values are used to specify that a value is missing, frequently NaNs need to be removed or substituted for certain algorithms.
Query:
dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis')
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_full_name | extract.diagnosis | |
0 | 94451 | 43 | Gail Hill | NaN |
1 | 94451 | 43 | Gail Hill | NaN |
2 | 92830 | 36 | Rhoda Sweet | NaN |
3 | 92721 | 13 | Jescie Lyons | NaN |
4 | 92570 | 70 | Lyle Cash | NaN |
Use the .dropna()
function to remove rows that contain a single NaN value or if every value is NaN:
dataframe = dataframe.dropna(how='any')
print(dataframe)
Response:
Columns: [extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis]
Index: []
This removed all rows because they all contained NaN.
The .dropna()
function can also be used with columns, but parameters must be used:
dataframe = dataframe.dropna(how='any', axis=1)
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_full_name | |
0 | 94451 | 43 | Gail Hill |
1 | 94451 | 43 | Gail Hill |
2 | 92830 | 36 | Rhoda Sweet |
3 | 92721 | 13 | Jescie Lyons |
4 | 92570 | 70 | Lyle Cash |
Sometimes it is necessary to replace the NaN values with another value such as a negative number or 0. Below Not Found replaces all NaN values.
dataframe = dataframe.fillna(value="Not Found")
print(dataframe)
Response:
extract.patient_id | extract.age | extract.patient_full_name | extract.diagnosis | |
0 | 94451 | 43 | Gail Hill | Not Found |
1 | 94451 | 43 | Gail Hill | Not Found |
2 | 92830 | 36 | Rhoda Sweet | Not Found |
3 | 92721 | 13 | Jescie Lyons | Not Found |
4 | 92570 | 70 | Lyle Cash | Not Found |
There are also methods such as forward fill which will use the last found value.
Exporting
Pandas makes exporting the data simple. The export functions all start with to_
. The most common exports used in IMAT are to_csv
, to_html
, and to_json
.
Indexing
Sometimes a user needs to make one of the columns in a data frame into the index, for example dates and IDs are common index columns.
The set_index()
function will set the index:
dataframe = dataframe.set_index('extract.patient_id')
extract.patient_id | extract.age | extract.patient_full_name | extract.weight_standard |
94451 | 43 | Gail Hill | 170.0 |
94451 | 43 | Gail Hill | 170.0 |
92830 | 36 | Rhoda Sweet | 220.5 |
92721 | 13 | Jescie Lyons | 138.0 |
92570 | 70 | Lyle Cash | 257.0 |
The reset_index()
does the opposite of set_index()
and moves the index column into a new column and use a sequential index.
dataframe = dataframe.reset_index()
extract.age | extract.patient_full_name | extract.patient_id | extract.weight_standard | |
0 | 43 | Gail Hill | 94451 | 170.0 |
1 | 43 | Gail Hill | 94451 | 170.0 |
2 | 36 | Rhoda Sweet | 92830 | 220.5 |
3 | 13 | Jescie Lyons | 92721 | 138.0 |
4 | 70 | Lyle Cash | 92570 | 257.0 |