Basics
  • 11 Mar 2024
  • 6 Minutes to read
  • Dark
    Light

Basics

  • Dark
    Light

Article summary

Pandas is a part of the SciPy (Scientific Python) technology stack. It allows easy and efficient data processing and visualization. Below are some examples of some uses for Pandas:

  • tabular data, consisting of rows and columns similar to SQL tables or spreadsheets
  • statistical and other algorithmic analysis
  • plotting and visualization
  • time series and time-date calculations
  • importing and exporting data

There are three primary data structures that Pandas uses:

  • series (1 dimensional)
  • data frame (2 dimensional)
  • panel (3 dimensional)

A series is very much like an array in which there is an index associated with a value. An index of a series can be any data type and not just the position of the array.

Data frames are the primary object Pandas uses, and they are essentially a series of series and are similar to a spreadsheet.

Panels are generally not needed for compound queries but they might be useful in very specific applications.


Getting a Data Frame

The primary method of getting a data frame through IMAT is using the get_hits() function.

Query:

dataframe = get_hits('()UNIVERSE', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex')
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sex
0995412male
1995412male
2995413male
3995412male
4995410male

This data frame pulled out five records from the "Records" store and displayed the requested information about each record. The first column is the index which defaults to the position of each row. The other three columns are the fields requested in the query: patient_id, age and gender.


There are other methods to create a data frame which can be found on the Pandas documentation website.

One example is to create a data frame using a dictionary:

Query:

import pandas as pd
dataframe = pd.DataFrame({'col1': [0,1,2,3], 'col2': ['a', 'b', 'c', 'd']})
print(dataframe)

Response:


col1col2
00a
11b
22c
33d

Accessing Columns

Some data frames can be quite large and the report writer may wish to view a specific column to check for all values in the column or to use for statistical analysis.

Refer to the first query and response in Getting a Data Frame; these will be used in the following examples.

Using the previous dataframe query, enter the following code:

Query:

age=dataframe['extract.age']
print(age)

Response:

02
12
23
32
40
Name: extract.age, dtype:int64

This is returning a series object. To get a Python list object from any series, use series.tolist():

Query:

age = dataframe['extract.age'].tolist()
print(age)

Response:

 [2, 2, 3, 2, 0]

It is possible to access multiple columns to get a smaller data frame:

Query:

dataframe = dataframe[['extract.age', 'extract.patient_sex']]
print(dataframe)

Response:


extract.ageextract.patient_sex
02male
12male
23male
32male
40male

Iterating Over a Data Frame

Sometimes it is necessary to do a complex operation on each row of the data frame. To do this, use .iterrows(), which returns a tuple of the index and the row. Once again, refer to the dataframe query and response at the beginning of Getting a Dataframe.

Query:

for index, row in dataframe.iterrows():
    print("INDEX = %s\nROW = %s\n" % (index, row))

Response:

INDEX = 0
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 0, type: object

INDEX = 1
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 1, type: object

INDEX = 2
ROW = extract.patient_id  99541
extract.age      3
extract.patient_sex   male
Name: 2, type: object

INDEX = 3
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 3, type: object

INDEX = 4
ROW = extract.patient_id  99541
extract.age      0
extract.patient_sex   male
Name: 4, type: object

Selecting Rows

There are many methods of selecting rows from a data frame. The five most common ways are below. Print the function directly, print(dataframe.iloc[2]).

  • Select by index label:
    dataframe.loc[dataframe['extract.patient_id'].isin([1668])] #will return all rows that have a patient ID of 1668
  • Select by index location:
    dataframe.iloc[2] #will return the third row (Python starts with 0, not 1)
  • Select by index label or index location:
    dataframe.ix[] #primarily label-based but will fall back to location based
  • Select by slices:
    dataframe[0:3] #will return the first three rows(0, 1, 2)
  • Select by []:
    dataframe[dataframe['extract.patient_id'].isin([1668])] #will work like dataframe.loc["label"]
    dataframe[1] #will work like dataframe.iloc[1]

When using .ix[] or [] there can be some confusion between labels and locations. For example, there might be a label called 2 and a location at 2. In these cases, it is better to be specific using .loc[] or .iloc[].

Two useful functions for series (or data frame columns) are .idxmax() and .idxmin(). They return the index location of the max or min value in the series:

print(dataframe['extract.age'].idxmax())

This displays 2 because it is the index of the row where the extract age is 3, which is the largest age value.


Sorting

After the original data frame has been established, it is possible to sort the table by any field desired. The following query and response are used for the examples in this section as well as in the Filtering and Removing Duplicates sections:

Query:

dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex, extract.patient_full_name')
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sexextract.patient_full_name
09445143femaleGail Hill
19445143femaleGail Hill
29283036femaleRhoda Sweet
39272113femaleJescie Lyons
49257070maleLyle Cash

Given the data frame above, it may be desirable to sort by the patient_full_name column. Add this code after the query above has been written:

dataframe = dataframe.sort_values(by=['extract.patient_full_name'])
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sexextract.patient_full_name
09445143femaleGail Hill
19445143femaleGail Hill
39272113femaleJescie Lyons
49257070maleLyle Cash
29283036femaleRhoda Sweet

It is also possible to sort in descending order:

dataframe = dataframe.sort_values(by=['extract.patient_full_name'], ascending=False)
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sexextract.patient_full_name
29283036femaleRhoda Sweet
49257070maleLyle Cash
39272113femaleJescie Lyons
09445143femaleGail Hill
19445143femaleGail Hill

Filtering

Filtering is both very common and very easy to do with Pandas. Filtering can be done by any field in the data frame. Remember to use proper Python comparators and quotes around any string. Use the same base query and responses from the Sorting section for the following examples.

To filter the data frame to print only those that are over 40 years old, use the following code:

Query:

dataframe = dataframe[dataframe['extract.age'] > 40]
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sexextract.patient_full_name
09445143femaleGail Hill
19445143femaleGail Hill
49257070maleLyle Cash

To understand how this works, examine the inner bracket expression, dataframe['extract.age'] > 40. This is essentially the same as writing the code like this:

above40=dataframe['extract.age'] > 40
print(above40)

Response:

0True
1True
2False
3False
4True
Name: extract.age, type: bool

This is returning a series which shows the index of the row correlating to that index matched the Boolean expression  > 40.

The expression could be rewritten as dataframe = dataframe[above40].

This displays each index and row where the index of above40 is True.


Removing Duplicates

Pandas has a .drop_duplicates() function which can easily remove duplicate information.

Use the .drop_duplicates() function without any parameters to remove only complete duplicates:

dataframe = dataframe.drop_duplicates()
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sexextract.patient_full_name
09445143femaleGail Hill
29283036femaleRhoda Sweet
39272113femaleJescie Lyons
49257070maleLyle Cash

Notice the second occurrence of Gail Hill has been removed.


It is also possible to select duplicates based on columns. In this example, all rows with a gender of female is considered a duplicate and can be removed. To do so, the subset parameter must be used:

dataframe = dataframe.drop_duplicates(subset=['extract.patient_sex'])
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_sexextract.patient_full_name
09445143femaleGail Hill
49257070maleLyle Cash

NaN Values

NaN (Not a Number) Values are used to specify that a value is missing, frequently NaNs need to be removed or substituted for certain algorithms.

Query:

dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis')
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_full_nameextract.diagnosis
09445143Gail HillNaN
19445143Gail HillNaN
29283036Rhoda SweetNaN
39272113Jescie LyonsNaN
49257070Lyle CashNaN

Use the .dropna() function to remove rows that contain a single NaN value or if every value is NaN:

dataframe = dataframe.dropna(how='any')
print(dataframe)

Response:

Columns: [extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis]
Index: []

This removed all rows because they all contained NaN.


The .dropna() function can also be used with columns, but parameters must be used:

dataframe = dataframe.dropna(how='any', axis=1)
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_full_name
09445143Gail Hill
19445143Gail Hill
29283036Rhoda Sweet
39272113Jescie Lyons
49257070Lyle Cash

Sometimes it is necessary to replace the NaN values with another value such as a negative number or 0. Below Not Found replaces all NaN values.

dataframe = dataframe.fillna(value="Not Found")
print(dataframe)

Response:


extract.patient_idextract.ageextract.patient_full_nameextract.diagnosis
09445143Gail HillNot Found
19445143Gail HillNot Found
29283036Rhoda SweetNot Found
39272113Jescie LyonsNot Found
49257070Lyle CashNot Found

There are also methods such as forward fill which will use the last found value.


Exporting

Pandas makes exporting the data simple. The export functions all start with to_. The most common exports used in IMAT are to_csv, to_html, and to_json.


Indexing

Sometimes a user needs to make one of the columns in a data frame into the index, for example dates and IDs are common index columns.

The set_index() function will set the index:

dataframe = dataframe.set_index('extract.patient_id')
extract.patient_idextract.ageextract.patient_full_nameextract.weight_standard
9445143Gail Hill170.0
9445143Gail Hill170.0
9283036Rhoda Sweet220.5
9272113Jescie Lyons138.0
9257070Lyle Cash257.0

The reset_index() does the opposite of set_index() and moves the index column into a new column and use a sequential index.

dataframe = dataframe.reset_index()
extract.ageextract.patient_full_nameextract.patient_idextract.weight_standard
043Gail Hill94451170.0
143Gail Hill94451170.0
236Rhoda Sweet92830220.5
313Jescie Lyons92721138.0
470Lyle Cash92570257.0

Was this article helpful?