Rows and Columns

At times it becomes necessary to manipulate the data in columns and rows to clean up data or make accessing the data more efficient.

Apply

The .apply() function applies a function to every row or column in a data frame. There are two main applications for using .apply():

inplace changes
new column creation

An inplace change generally involves cleaning up the data or converting the data type. For example, there is an extra comma found in the full_name column in the table below. It is possible to remove that comma using an inplace change.

	extract.age	extract.patient_full_name	extract.patient_id
0	43	Gail Hill,	94451
1	43	Gail Hill,	94451
2	36	Rhoda Sweet,	92830
3	13	Jescie Lyons,	92721
4	70	Lyle Cash,	92570

To remove the comma, we can create a custom function and use .apply():

dataframe['extract.patient_full_name'] = dataframe['extract.patient_full_name'].apply(lambda value: value.replace(',', ''))

The code above is applying a lambda function, which removes commas from the extract.patient_full_name column and then overwrites the old extract.patient_full_name column.

	extract.age	extract.patient_full_name	extract.patient_id
0	43	Gail Hill	94451
1	43	Gail Hill	94451
2	36	Rhoda Sweet	92830
3	13	Jescie Lyons	92721
4	70	Lyle Cash	92570

Another use for .apply is to convert columns to numeric, date, or object (string) types by including the line import pandas as pd at the beginning of the report. By converting strings to numeric objects, the values can be used in mathematical functions.

import pandas as pd
...
dataframe = dataframe.apply(pd.to_numeric, errors="ignore")

Splitting Columns

Splitting a column can be useful if a column contains multiple pieces of information or if a complex operation needs an intermediate calculation for simplicity or efficiency.

In the above table, it may be desirable to split the full name into two columns for the first and last names. To do so, use the .apply() function once again:

dataframe['extract.patient_first_name'] = dataframe['extract.patient_full_name'].apply(lambda value: value.split()[0])
dataframe['extract.patient_last_name'] = dataframe['extract.patient_full_name'].apply(lambda value: value.split()[1])

	extract.age	extract.patient_full_name	extract.patient_id	extract.patient_first_name	extract.patient_last_name
0	43	Gail Hill	94451	Gail	Hill
1	43	Gail Hill	94451	Gail	Hill
2	36	Rhoda Sweet	92830	Rhoda	Sweet
3	13	Jescie Lyons	92721	Jescie	Lyons
4	70	Lyle Cash	92570	Lyle	Cash

Combining Columns

It is also possible to combine columns. In the example below, the first name and patient name concatenate to become a single column:

dataframe['nameId'] = dataframe.apply(lambda row: row['extract.patient_first_name'] + str(row['extract.patient_id']), axis=1)

This time the .apply() function is working across the entire row so that multiple columns can be used.

	extract.age	extract.patient_full_name	extract.patient_id	extract.patient_first_name	extract.patient_last_name	nameId
0	43	Gail Hill	94451	Gail	Hill	Gail94451
1	43	Gail Hill	94451	Gail	Hill	Gail94451
2	36	Rhoda Sweet	92830	Rhoda	Sweet	Rhoda92830
3	13	Jescie Lyons	92721	Jescie	Lyons	Jescie92721
4	70	Lyle Cash	92570	Lyle	Cash	Lyle92570

Useful Column Functions

The list below contains some of the more useful pre-built functions available when using a data frame. Either assign a variable to these functions, dataframe_sum = dataframe['extract.age'].sum() or print the function directly, print(dataframe['extract.age'].sum()).

mean
dataframe['extract.age'].mean() returns 41.0
sum
dataframe['extract.age'].sum() returns 205
unique
dataframe['extract.age'].unique() returns [43 36 13 70]
tolist
dataframe['extract.age'].tolist() returns [43, 43, 36, 13, 70]

describe
dataframe['extract.age'].describe()returns the following information:

count     5.000000
mean     41.000000
std      20.359273
min      13.000000
25%      36.000000
50%      43.000000
75%      43.000000
max      70.000000
Name: extract.age, dtype: float64

dataframe.describe() returns the following table:

	extract.age	extract.patient_id
count	5.000000	5.000000
mean	41.000000	93404.600000
std	20.359273	959.679269
min	13.000000	92570.000000
25%	36.000000	92721.000000
50%	43.000000	93830.000000
75%	43.000000	94451.000000
max	70.000000	94451.000000

Stack and Unstack

The .stack() function creates a multi-index based on the columns. To print the function, use print(dataframe.stack()), where dataframe is the variable assigned to the query.

0	extract.age	43
	extract.patient_full_name	Gail Hill
	extract.patient_id	94451
	extract.patient_first_name	Gail
	extract.patient_last_name	Hill
	nameId	Gail94451
1	extract.age	43
	extract.patient_full_name	Gail Hill
	extract.patient_id	94451
	extract.patient_first_name	Gail
	extract.patient_last_name	Hill
	nameId	Gail94451
2	extract.age	36
	extract.patient_full_name	Rhoda Sweet
	extract.patient_id	92830
	extract.patient_first_name	Rhoda
	extract.patient_last_name	Sweet
	nameId	Rhoda92830
3	extract.age	13
	extract.patient_full_name	Jescie Lyons
	extract.patient_id	92721
	extract.patient_first_name	Jescie
	extract.patient_last_name	Lyons
	nameId	Jescie92721
4	extract.age	70
	extract.patient_full_name	Lyle Cash
	extract.patient_id	92570
	extract.patient_first_name	Lyle
	extract.patient_last_name	Cash
	nameId	Lyle92570
dtype: object

dataframe.loc[0, 'extract.age'] returns 43

The .unstack() function produces the opposite effect of the .stack() function.

Transposing Columns to Rows

The transpose() function switches the rows and columns.

	extract.age	extract.patient_full_name	extract.patient_id	extract.patient_first_name	extract.patient_last_name	nameId
0	43	Gail Hill	94451	Gail	Hill	Gail94451
1	43	Gail Hill	94451	Gail	Hill	Gail94451
2	36	Rhoda Sweet	92830	Rhoda	Sweet	Rhoda92830
3	13	Jescie Lyons	92721	Jescie	Lyons	Jescie92721
4	70	Lyle Cash	92570	Lyle	Cash	Lyle92570

Using the code below transposes the table:

dataframe = dataframe.transpose()

	0	1	2	3	4
extract.age	43	43	36	13	70
extract.patient_full_name	Gail Hill	Gail Hill	Rhoda Sweet	Jescie Lyons	Lyle Cash
extract.patient_id	94451	94451	92830	92721	92570
extract.patient_first_name	Gail	Gail	Rhoda	Jescie	Lyle
extract.patient_last_name	Hill	Hill	Sweet	Lyons	Cash
nameId	Gail94451	Gail94451	Rhoda92830	Jescie92721	Lyle92570