Basics

Pandas is a part of the SciPy (Scientific Python) technology stack. It allows easy and efficient data processing and visualization. Below are some examples of some uses for Pandas:

tabular data, consisting of rows and columns similar to SQL tables or spreadsheets
statistical and other algorithmic analysis
plotting and visualization
time series and time-date calculations
importing and exporting data

There are three primary data structures that Pandas uses:

series (1 dimensional)
data frame (2 dimensional)
panel (3 dimensional)

A series is very much like an array in which there is an index associated with a value. An index of a series can be any data type and not just the position of the array.

Data frames are the primary object Pandas uses, and they are essentially a series of series and are similar to a spreadsheet.

Panels are generally not needed for compound queries but they might be useful in very specific applications.

Getting a Data Frame

The primary method of getting a data frame through IMAT is using the get_hits() function.

Query:

dataframe = get_hits('()UNIVERSE', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex')
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex
0	99541	2	male
1	99541	2	male
2	99541	3	male
3	99541	2	male
4	99541	0	male

This data frame pulled out five records from the "Records" store and displayed the requested information about each record. The first column is the index which defaults to the position of each row. The other three columns are the fields requested in the query: patient_id, age and gender.

There are other methods to create a data frame which can be found on the Pandas documentation website.

One example is to create a data frame using a dictionary:

Query:

import pandas as pd
dataframe = pd.DataFrame({'col1': [0,1,2,3], 'col2': ['a', 'b', 'c', 'd']})
print(dataframe)

Response:

	col1	col2
0	0	a
1	1	b
2	2	c
3	3	d

Accessing Columns

Some data frames can be quite large and the report writer may wish to view a specific column to check for all values in the column or to use for statistical analysis.

Refer to the first query and response in Getting a Data Frame; these will be used in the following examples.

Using the previous dataframe query, enter the following code:

Query:

age=dataframe['extract.age']
print(age)

Response:

0	2
1	2
2	3
3	2
4	0
Name: extract.age, dtype:int64

This is returning a series object. To get a Python list object from any series, use series.tolist():

Query:

age = dataframe['extract.age'].tolist()
print(age)

Response:

 [2, 2, 3, 2, 0]

It is possible to access multiple columns to get a smaller data frame:

Query:

dataframe = dataframe[['extract.age', 'extract.patient_sex']]
print(dataframe)

Response:

	extract.age	extract.patient_sex
0	2	male
1	2	male
2	3	male
3	2	male
4	0	male

Iterating Over a Data Frame

Sometimes it is necessary to do a complex operation on each row of the data frame. To do this, use .iterrows(), which returns a tuple of the index and the row. Once again, refer to the dataframe query and response at the beginning of Getting a Dataframe.

Query:

for index, row in dataframe.iterrows():
    print("INDEX = %s\nROW = %s\n" % (index, row))

Response:

INDEX = 0
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 0, type: object

INDEX = 1
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 1, type: object

INDEX = 2
ROW = extract.patient_id  99541
extract.age      3
extract.patient_sex   male
Name: 2, type: object

INDEX = 3
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 3, type: object

INDEX = 4
ROW = extract.patient_id  99541
extract.age      0
extract.patient_sex   male
Name: 4, type: object

Selecting Rows

There are many methods of selecting rows from a data frame. The five most common ways are below. Print the function directly, print(dataframe.iloc[2]).

Select by index label:

dataframe.loc[dataframe['extract.patient_id'].isin([1668])] #will return all rows that have a patient ID of 1668

Select by index location:

dataframe.iloc[2] #will return the third row (Python starts with 0, not 1)

Select by index label or index location:

dataframe.ix[] #primarily label-based but will fall back to location based

Select by slices:

dataframe[0:3] #will return the first three rows(0, 1, 2)

Select by []:

dataframe[dataframe['extract.patient_id'].isin([1668])] #will work like dataframe.loc["label"]
dataframe[1] #will work like dataframe.iloc[1]

When using .ix[] or [] there can be some confusion between labels and locations. For example, there might be a label called 2 and a location at 2. In these cases, it is better to be specific using .loc[] or .iloc[].

Two useful functions for series (or data frame columns) are .idxmax() and .idxmin(). They return the index location of the max or min value in the series:

print(dataframe['extract.age'].idxmax())

This displays 2 because it is the index of the row where the extract age is 3, which is the largest age value.

Sorting

After the original data frame has been established, it is possible to sort the table by any field desired. The following query and response are used for the examples in this section as well as in the Filtering and Removing Duplicates sections:

Query:

dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex, extract.patient_full_name')
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex	extract.patient_full_name
0	94451	43	female	Gail Hill
1	94451	43	female	Gail Hill
2	92830	36	female	Rhoda Sweet
3	92721	13	female	Jescie Lyons
4	92570	70	male	Lyle Cash

Given the data frame above, it may be desirable to sort by the patient_full_name column. Add this code after the query above has been written:

dataframe = dataframe.sort_values(by=['extract.patient_full_name'])
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex	extract.patient_full_name
0	94451	43	female	Gail Hill
1	94451	43	female	Gail Hill
3	92721	13	female	Jescie Lyons
4	92570	70	male	Lyle Cash
2	92830	36	female	Rhoda Sweet

It is also possible to sort in descending order:

dataframe = dataframe.sort_values(by=['extract.patient_full_name'], ascending=False)
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex	extract.patient_full_name
2	92830	36	female	Rhoda Sweet
4	92570	70	male	Lyle Cash
3	92721	13	female	Jescie Lyons
0	94451	43	female	Gail Hill
1	94451	43	female	Gail Hill

Filtering

Filtering is both very common and very easy to do with Pandas. Filtering can be done by any field in the data frame. Remember to use proper Python comparators and quotes around any string. Use the same base query and responses from the Sorting section for the following examples.

To filter the data frame to print only those that are over 40 years old, use the following code:

Query:

dataframe = dataframe[dataframe['extract.age'] > 40]
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex	extract.patient_full_name
0	94451	43	female	Gail Hill
1	94451	43	female	Gail Hill
4	92570	70	male	Lyle Cash

To understand how this works, examine the inner bracket expression, dataframe['extract.age'] > 40. This is essentially the same as writing the code like this:

above40=dataframe['extract.age'] > 40
print(above40)

Response:

0	True
1	True
2	False
3	False
4	True
Name: extract.age, type: bool

This is returning a series which shows the index of the row correlating to that index matched the Boolean expression > 40.

The expression could be rewritten as dataframe = dataframe[above40].

This displays each index and row where the index of above40 is True.

Removing Duplicates

Pandas has a .drop_duplicates() function which can easily remove duplicate information.

Use the .drop_duplicates() function without any parameters to remove only complete duplicates:

dataframe = dataframe.drop_duplicates()
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex	extract.patient_full_name
0	94451	43	female	Gail Hill
2	92830	36	female	Rhoda Sweet
3	92721	13	female	Jescie Lyons
4	92570	70	male	Lyle Cash

Notice the second occurrence of Gail Hill has been removed.

It is also possible to select duplicates based on columns. In this example, all rows with a gender of female is considered a duplicate and can be removed. To do so, the subset parameter must be used:

dataframe = dataframe.drop_duplicates(subset=['extract.patient_sex'])
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_sex	extract.patient_full_name
0	94451	43	female	Gail Hill
4	92570	70	male	Lyle Cash

NaN Values

NaN (Not a Number) Values are used to specify that a value is missing, frequently NaNs need to be removed or substituted for certain algorithms.

Query:

dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis')
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_full_name	extract.diagnosis
0	94451	43	Gail Hill	NaN
1	94451	43	Gail Hill	NaN
2	92830	36	Rhoda Sweet	NaN
3	92721	13	Jescie Lyons	NaN
4	92570	70	Lyle Cash	NaN

Use the .dropna() function to remove rows that contain a single NaN value or if every value is NaN:

dataframe = dataframe.dropna(how='any')
print(dataframe)

Response:

Columns: [extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis]
Index: []

This removed all rows because they all contained NaN.

The .dropna() function can also be used with columns, but parameters must be used:

dataframe = dataframe.dropna(how='any', axis=1)
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_full_name
0	94451	43	Gail Hill
1	94451	43	Gail Hill
2	92830	36	Rhoda Sweet
3	92721	13	Jescie Lyons
4	92570	70	Lyle Cash

Sometimes it is necessary to replace the NaN values with another value such as a negative number or 0. Below Not Found replaces all NaN values.

dataframe = dataframe.fillna(value="Not Found")
print(dataframe)

Response:

	extract.patient_id	extract.age	extract.patient_full_name	extract.diagnosis
0	94451	43	Gail Hill	Not Found
1	94451	43	Gail Hill	Not Found
2	92830	36	Rhoda Sweet	Not Found
3	92721	13	Jescie Lyons	Not Found
4	92570	70	Lyle Cash	Not Found

There are also methods such as forward fill which will use the last found value.

Exporting

Pandas makes exporting the data simple. The export functions all start with to_. The most common exports used in IMAT are to_csv, to_html, and to_json.

Indexing

Sometimes a user needs to make one of the columns in a data frame into the index, for example dates and IDs are common index columns.

The set_index() function will set the index:

dataframe = dataframe.set_index('extract.patient_id')

extract.patient_id	extract.age	extract.patient_full_name	extract.weight_standard
94451	43	Gail Hill	170.0
94451	43	Gail Hill	170.0
92830	36	Rhoda Sweet	220.5
92721	13	Jescie Lyons	138.0
92570	70	Lyle Cash	257.0

The reset_index() does the opposite of set_index() and moves the index column into a new column and use a sequential index.

dataframe = dataframe.reset_index()

extract.age	extract.patient_full_name	extract.patient_id	extract.weight_standard
0	43	Gail Hill	94451	170.0
1	43	Gail Hill	94451	170.0
2	36	Rhoda Sweet	92830	220.5
3	13	Jescie Lyons	92721	138.0
4	70	Lyle Cash	92570	257.0