Machine Learning 101 - Opening

Since I have been focusing on data science to get comfortable with the computational side of biotechnology, I will start learning machine learning using Python's scikit-learn library. Today I will share some of what I know about machine learning, along with some new things I just learned from the Udacity course Intro to Machine Learning.

According to Andrew Ng, machine learning is the science of getting computers to act without being explicitly programmed. As the name suggests, the machine learns by seeing and analysing the data it has been given. Making a machine learn or decide by itself is a long process.

There are dozens of cool examples of machine learning in use. You are probably well aware of the face-tagging mechanism on Facebook and on our smartphones; they use machine learning to recognize faces. Other examples include self-driving cars, some robotics projects, and search engines like Google. Nowadays we are surrounded by machine learning technologies.

With this post I officially start learning and sharing Machine Learning. Join me on this enjoyable journey. Share your knowledge, ask questions, make contributions. See you soon…


Python Machine Learning Books

Python is a very popular language for machine learning. The machine learning libraries and frameworks in Python (especially around the SciPy stack) are maturing quickly. They may not be as feature rich as R, but they are robust enough for small to medium scale production implementation.

If you are a Python programmer looking to get into machine learning or you are generally interested to get into machine learning via Python, then I want to use this post to point out some key books you might find useful on your journey.

This is by no means a complete list of books, but I think they are the pick of the books you should look at if you are interested in machine learning in Python.

Machine Learning in Python

Building Machine Learning Systems with Python (2013): Master the art of machine learning with Python and build effective machine learning systems with this intensive hands-on guide.

Learning scikit-learn: Machine Learning in Python (2013): Experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library.

Machine Learning in Action (2012): Machine Learning in Action is a unique book that blends the foundational theories of machine learning with the practical realities of building tools for everyday data analysis. You’ll use the flexible Python programming language to build programs that implement algorithms for data classification, forecasting, recommendations, and higher-level features like summarization and simplification.

Programming Collective Intelligence: Building Smart Web 2.0 Applications (2007): This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet.

Machine Learning: An Algorithmic Perspective (2011): The field is ready for a text that not only demonstrates how to use the algorithms that make up machine learning methods, but also provides the background needed to understand how and why these algorithms work. Machine Learning: An Algorithmic Perspective is that text.

Specialty Machine Learning in Python

Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More (2013): You’ll learn how to acquire, analyze, and summarize data from all corners of the social web, including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs.

Natural Language Processing with Python (2009): This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.

Programming Computer Vision with Python: Tools and algorithms for analyzing images (2012): If you want a basic understanding of computer vision’s underlying theory and algorithms, this hands-on introduction is the ideal place to start. You’ll learn techniques for object recognition, 3D reconstruction, stereo imaging, augmented reality, and other computer vision applications as you follow clear examples written in Python.

Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2012): It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.

Things that I hate to Teach in Python

I am currently teaching a Python class for total beginners. My students are not only new to Python; most of them are new to programming altogether. I chose to teach Python because I think it is the best programming language for beginners. Throughout the class, from which we took a two-week break, I talked about Python's basic data structures, functions, and control statements. Although these are complicated concepts in the computational world, I think Python makes them easier to understand. The following is a blog post from Python for Biologists, whose author describes almost every difficulty that I am dealing with.

I realized that, while I’ve spent a lot of time talking about why Python is a great language, I have a number of pet peeves that I’ve never written down. I’m not talking about the usual problems, like Python’s relative lack of performance or lack of compile-time type checking – these things are deliberate design trade-offs and changing them would involve making Python not-Python. I’m talking about the small things that cause friction, especially in a teaching environment.

Note: I realize that there are good reasons for all these things to be the way they are, so don’t take this too seriously….

1. Floating point vs. integer division

Anyone who’s written in Python for any length of time probably types this line automatically without really thinking about it:

from __future__ import division

but take a moment to consider how you would explain what’s going on in this piece of code to a beginner. In order to really understand what’s happening here, you have to know about:

  • Python’s system for importing modules
  • Python’s system for grouping modules into packages
  • the fact that there are different versions of Python with slightly different behavior
  • the difference between floating-point and integer numbers
  • the mechanisms of operator overloading, whereby we can define the behavior of things like + and / for different types
  • the concept of polymorphic functions and operators, which allow us to treat different classes the same, some of the time

Explaining all this to someone who has never written a line of code before is unlikely to be productive, but none of the alternatives are particularly attractive either. We can just present this as a magic piece of code and save the explanation for later (this is normally what I do). We can instruct students to use explicit floating point numbers:

answer = float(4)/3
answer = 4.0/3

but eventually they will forget, use integers, and find that it works some of the time. We can carefully craft our examples and exercises to avoid the need for floating point division, but this is setting students up for pain further down the line. We can use the command-line argument -Q to force floating-point division, or just use Python 3 for teaching, but both of these options will cause confusion once the student goes back to their own environment.
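For readers who are already on Python 3 (an assumption; the post above is written with Python 2 in mind), the distinction can be shown in a few lines. This is only a minimal sketch and the variable names are illustrative:

```python
# Python 3 semantics: "/" is always true division, "//" is floor division.
true_quotient = 4 / 3          # float result, even for two integers
floor_quotient = 4 // 3        # integer result, rounded down
explicit_float = float(4) / 3  # the explicit-float workaround gives the same float

print(true_quotient)
print(floor_quotient)
```

Under Python 2 without the __future__ import, the first line would instead behave like the floor division on the second line, which is exactly the surprise described above.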

2. split() vs. join()

“OK class, this is how we take a string and split it up into a list of strings using a fixed delimiter:”

sentence = "The all-England summarize Proust competition"
words = sentence.split(" ")

“So I guess, logically, to put the words back together again we just say:

sentence = words.join(" ")

right? Look at that elegant symmetry…… Wait a minute, you’re telling me it doesn’t work like that? The list and the delimiter actually go the other way around, so that we have to write this ugly line?

sentence = " ".join(words)

Wow, that just looks wrong.”

Yes, I know that there are good reasons for collection classes to only have methods that are type-agnostic, but would it really be so bad to just str() everything?
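The "type-agnostic" point can be made concrete: join() belongs to the separator string and only accepts string items, so a list of other types has to be converted first. A small sketch, with illustrative variable names:

```python
words = ["The", "all-England", "summarize", "Proust", "competition"]
sentence = " ".join(words)   # the separator string does the joining

numbers = [1, 2, 3]
try:
    " ".join(numbers)        # join() refuses non-string items
except TypeError as error:
    join_error = error

# Converting each item explicitly is the usual workaround.
joined_numbers = " ".join(str(n) for n in numbers)
print(sentence)
print(joined_numbers)
```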

3. Exhaustible files

It’s perfectly logical that you shouldn’t be able to iterate through a file object twice without re-opening it….. once you know a fair bit about how iteration is actually implemented in Python. As a beginner, though, it’s a bit like Python is giving with one hand and taking away with the other – you can use an opened file object just like a list, except in this one specific but very important way:

my_list = [1,2,3,4]
for number in my_list:
    do_something(number)
# second loop works just as you'd expect
for number in my_list:
    do_something_else(number)
my_file = open("some.input")
for line in my_file:
    do_something(line)
# second loop silently never runs
for line in my_file:
    do_something_else(line)

This problem also rears its ugly head when students try to iterate over a file having already consumed its contents using read():

my_file = open("some.input")
my_contents = my_file.read()
....
# this loop silently never runs
for line in my_file:
    do_something(line)

That second line can be difficult to spot for student and teacher alike when there are many intervening lines between it and the loop.
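One way to show students what is going on, sketched here under the assumption that the file is small enough to handle comfortably, is to run three passes over the same file object and rewind it with seek(0) before the third. The temporary file and its contents are made up for the example:

```python
import os
import tempfile

# Create a small throwaway input file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".input", delete=False) as handle:
    handle.write("first\nsecond\nthird\n")
    path = handle.name

with open(path) as my_file:
    first_pass = [line.strip() for line in my_file]
    second_pass = [line.strip() for line in my_file]  # file is exhausted: empty list
    my_file.seek(0)                                    # rewind to the start
    third_pass = [line.strip() for line in my_file]   # lines are available again

os.remove(path)
print(first_pass, second_pass, third_pass)
```

Reading the lines into a list once, as in first_pass, and looping over that list afterwards avoids the problem altogether.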

4. Lambda expressions

OK, this one is more annoying when writing code than when teaching it, since I rarely get round to talking about functional programming in introductory courses. I totally get why there should be a big, obvious flag when we are doing something clever (which lambda expressions generally are). Nevertheless, it seems a shame to have a style of coding that lends itself to elegant brevity marred by so many unnecessary keystrokes.

I think that the reason this bugs me so much is that I first got into functional programming by way of Groovy, which has (to me) a very pleasing syntax for anonymous functions (actually closures):

{x,y -> x**y}

compared to Python:

lambda x,y : x**y

Of course, Python lessens the sting of having to type lambda with its various comprehensions:

squares = map(lambda x : x**2, range(10))
squares = [x**2 for x in range(10)]

so I can’t complain too loudly.
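One caveat if you try the map() and list comprehension forms on Python 3 (an assumption; the post targets Python 2): map() returns a lazy iterator rather than a list, so the two are only equivalent once the first is wrapped in list(). A quick sketch with illustrative names:

```python
squares_from_map = list(map(lambda x: x**2, range(10)))
squares_from_comprehension = [x**2 for x in range(10)]

# Both approaches produce the same list of squares.
print(squares_from_map == squares_from_comprehension)
```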

5. Variables aren’t declared

It’s just way too easy for beginners to make a typo that brings their progress to a screeching halt. Consider this real-life example from my most recent course:

positions = [0]
for pos in [12,54,76,103]:
    postions  = positions + [pos]
print(positions) # prints [0] rather than [0,12,54,76,103]

Leaving aside that this particular example could have been salvaged by using positions.append(), it took way too long for us to track down the typo. In real-life code, this is the kind of thing that would ideally be caught by unit testing. This is one (rare!) case in which I pine for the old days of teaching Perl – use strict and my would have taken care of this type of problem.
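The remark about positions.append() is worth spelling out. With append(), the misspelled name is read rather than assigned, so Python raises a NameError straight away instead of silently creating a new variable. A small sketch of both behaviours (the names silent_result and caught are only for illustration):

```python
positions = [0]
for pos in [12, 54, 76, 103]:
    postions = positions + [pos]   # typo: silently creates a new variable
silent_result = list(positions)    # still [0]

positions = [0]
del postions                       # forget the accidental variable from above
try:
    for pos in [12, 54, 76, 103]:
        postions.append(pos)       # same typo, but now the name must already exist
except NameError as error:
    caught = error

print(silent_result)
print(type(caught).__name__)
```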

Data Cleaning Helpers in R language

A friendly post from R-Bloggers author Christopher Gandrud. It helped me a lot, and I believe you will find it as helpful as I did.

As I go about cleaning and merging data sets with R I often end up creating and using simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation and others can possibly benefit from simplified and (hopefully) more intuitive code.

I’ve talked about some of the commands in DataCombine in previous posts. In this post I’ll give examples for a few more that I’ve added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11.

Here is a brief run down of the functions covered in this post:

FindReplace: a function to replace multiple patterns found in a character string column of a data frame.

MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.

rmExcept: removes all objects from a work space except those specified by the user.

FindReplace

Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:

ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE", "Hamburg, DE", "Oslo, NO"), b = c(8, 0.1, 3, 2, 1))

Ok, now I want to replace the UK and DE parts of the strings with England and Germany. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:

Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))

Now I can just use FindReplace to make the replacements all at once:

library(DataCombine)

ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces, from = "from", to = "to", exact = FALSE)

# Show changes
ABNewDF

##                  a   b
## 1  London, England 8.0
## 2  Oxford, England 0.1
## 3  Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5         Oslo, NO 1.0

If you set exact = TRUE then FindReplace will only replace exact pattern matches. Also, you can set vector = TRUE to return only a vector of the column you replaced (the Var column), rather than the whole data frame.

MoveFront

On occasion I’ve wanted to move a few variables to the front of a data frame. The MoveFront function makes this pretty simple. It only has two arguments: data and Var. The data argument is the data frame and Var is a character vector with the columns I want to move to the front of the data frame in the order that I want them. Here is an example:

# Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)

names(OldOrder)

## [1] "A" "B" "C"

# Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
names(NewOrder2)

## [1] "B" "A" "C"

rmExcept

Finally, sometimes I want to clean up my work space and only keep specific objects. I want to remove everything else. This is straightforward with rmExcept. For example:

# Create objects
A <- 1
B <- 2
C <- 3

# Remove all objects except for A
rmExcept("A")

## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces

# Show workspace
ls()

## [1] "A"

You can set the environment you want to clean up with the environ argument. By default it is your global environment.

10 reasons to choose Ubuntu over Windows 8

Last week I faced a real dilemma: whether to use Ubuntu or Windows 8 as my primary operating system. I know I could have set up a dual boot, and I did have Ubuntu on a USB drive. Here is a post from the PCWorld website.

Microsoft’s Windows 8 dominated countless headlines in the weeks leading up to its launch late last month, but October saw the debut of another major operating system as well.

Canonical’s Ubuntu 12.10 “Quantal Quetzal” arrived a week ahead of its competitor, in fact, accompanied by a challenge: “Avoid the pain of Windows 8.” That slogan appeared on the Ubuntu home page for the first few hours after the OS’s official launch, and attracted considerable attention.

Apparently Canonical decided to tone down its message later in the day—the slogan now reads “Your wish is our command”—but it seems fair to say that the underlying challenge remains.

Ubuntu comes with a variety of software packages, including Firefox, Thunderbird, and the full-featured productivity suite LibreOffice.

Window of opportunity

Ubuntu is a widely popular open-source Linux distribution with eight years of maturity under its belt, and more than 20 million users. Of the roughly 5 percent of desktop OSs accounted for by Linux, at least one survey suggests that about half are Ubuntu. (Windows, meanwhile, accounts for about 84 percent.)

The timing of this latest Ubuntu release couldn’t be better for Windows users faced with the paradigm-busting Windows 8 and the big decision of whether to take the plunge.

Initial uptake of Windows 8 has been unenthusiastic, according to reports, and a full 80 percent of businesses will never adopt it, Gartner predicts. As a result, Microsoft’s big gamble may be desktop Linux’s big opportunity.

So, now that Canonical has thrown down the gauntlet, let’s take a closer look at Ubuntu 12.10 to see how it compares with Windows 8 from a business user’s perspective.

Perhaps the biggest surprise for many users of Windows 8’s mobile-style Modern UI is that it has no Start button.

1. Unity vs. Modern UI

Both Microsoft and Canonical have received considerable flak for the default user interfaces in their respective OSs. In Microsoft’s case, of course, it’s the Modern UI, formerly known as Metro; in Canonical’s case, it’s Unity. Both are designed with touchscreens in mind, and borrow heavily from the mobile world.

By removing the Start button and overhauling the way users interact with the operating system, Windows 8’s Modern interface poses a considerable challenge for users, who face a significant learning curve.

Unity, on the other hand, became a default part of Ubuntu back in April 2011 with Ubuntu 11.04 “Natty Narwhal.” It has definitely undergone growing pains, but more than a year has passed, and Canonical has revised the interface accordingly. Although it still has numerous critics, most people concede that it has matured and improved. Some observers, in fact, have even suggested that it may feel more familiar to many longtime Windows users than does Windows 8.

One advantage of Ubuntu Linux is that it supports multiple workspaces.

2. Customizability

Linux has long been known for its virtually limitless customizability, but given the current controversy surrounding desktop interfaces, that feature has become more salient than ever.

This is a point on which Windows 8 and Ubuntu differ considerably. Yes, Windows 8 does allow users to customize some aspects of their environment, such as by specifying the size of Live Tile icons, moving commonly used tiles to the left side of the screen, or grouping tiles by program type.

Most of the changes you can make in Windows 8, however, are largely cosmetic, and they don’t include a built-in way to set the OS to boot to the traditional Windows desktop. A growing assortment of third-party utilities such as Pokki can restore that capability, but otherwise you’re stuck with Modern UI. Windows 8 offers what you might call a “tightly coupled” interface—in other words, one that you can’t change substantially.

Microsoft’s Windows Store was sparsely populated at launch, but company executives have said that the number of apps will increase quickly.

Ubuntu’s Unity, in contrast, is more of a loosely coupled UI. First and foremost, you can easily replace it with any one of several free alternatives, including KDE, Xfce, LXDE, GNOME 3 Shell, Cinnamon, and MATE.

Also available for Unity are third-party customization tools, including the increasingly popular Ubuntu Tweak, while a raft of “look” sites are available for myriad Linux interfaces with a variety of themes to change the desktop’s appearance.

The rule of thumb with Linux in general and Ubuntu in particular is, if you don’t like it, swap in something else. Also worth mentioning is the fact that Ubuntu supports multiple workspaces, essentially letting you run up to four different desktops; Windows 8 Pro does not.

3. Apps

Whereas Windows 8 Pro comes bundled with Microsoft’s Internet Explorer 10 browser, Ubuntu comes with a wide assortment of open-source software packages such as Firefox, Thunderbird, LibreOffice, and more, offering both individual and business users a pretty full suite of functionality.

Similar to Microsoft’s SkyDrive, Ubuntu One allows users to back up and access their files from Ubuntu, Windows, the Web, or a mobile device.

Beyond those bundled programs, both Ubuntu and Windows 8 offer app stores to help users find the additional software they need.

Dating back to 2009, the Ubuntu Software Center now houses more than 40,000 apps, ranging from games to productivity tools to educational resources. In addition, by using Wine or CodeWeaver’s CrossOver, you can run Windows programs on top of Linux.

The Windows Store just launched with Windows 8, and at the time of its debut it included just over 9000 apps. Microsoft execs have said that they hope to provide 100,000 apps in the Windows Store within 90 days of the Windows launch.

Operating system binaries and drivers, however, will not come from the Windows Store. Rather, it will have both Windows RT (ARM) apps and Windows desktop (“legacy”) apps. Entries for legacy desktop apps in the Windows Store will take users to separate sites where they can purchase or download the apps. Ubuntu’s repository, on the other hand, centrally stores all operating system and app binaries and drivers.

As a result, aside from numbers, a key difference between the two app stores involves security. Ubuntu provides a GNU Privacy Guard (GnuPG) keyring-protected repository system wherein each application and driver has a unique keyring identity to verify its authenticity and integrity as having come only from the Ubuntu repo system. The keyring method of protection has been highly effective at ensuring that no rogue applications find their way into the repo—or onto users’ PCs.

Historically, Microsoft Windows has lacked such a keyring-protected repository. Although Microsoft does support its OS with monthly Windows Updates, no comparable third-party vendor support for updates exists. Because of this situation, users have had to venture online to obtain their own third-party-supported updates manually at separate websites. The Windows Store was developed to mitigate that risk and is specifically designed to curate apps, screen apps, and provide the capability to purchase apps. Time will tell how well it succeeds.

4. Hardware compatibility

To run Windows 8 on your PC, you’ll need a processor that’s 1GHz or faster with support for PAE, NX, and SSE2. You’ll also need a minimum of 1GB RAM for the 32-bit version or 2GB for the 64-bit version, along with 16GB (32-bit) or 20GB (64-bit) of space on your hard drive. For graphics processing, you’ll need a Microsoft DirectX 9-compatible graphics device with a WDDM driver, Microsoft says.

Of course, that’s the minimum. If you want to take advantage of Windows 8’s touch features, obviously you’ll need a multitouch device. To make the most of the software, you’ll want considerably more than that.

Ubuntu’s requirements, however, are much more modest: Canonical recommends 512MB of RAM, plus 5GB on the hard drive. You’ll also find versions such as Lubuntu and Xubuntu for lower-spec machines. In short, if hardware is a constraining factor for you, Ubuntu is most likely the better choice.

Microsoft’s SkyDrive service lets users upload and sync files to the cloud and then access them from virtually any browser or local device.

5. Cloud integration

Starting with the launch of Ubuntu One in 2009, the cloud has played a key role in Ubuntu Linux for some time, enabling users to store files online and sync them among computers and mobile devices, as well as to stream audio and music from the cloud to mobile devices.

Ubuntu One works on Windows, OS X, iOS, and Android, as well as on Ubuntu. Users of Ubuntu Linux get 5GB of Ubuntu One storage for free; 20GB costs $30 per year.

Beginning with Ubuntu 12.10, the OS also integrates Web apps and online searches directly into the Unity desktop for a more seamless experience.

With Windows 8, the cloud is coming to the forefront of Microsoft’s platform as well. For storage, Microsoft’s SkyDrive offers users 7GB of space for free. If you need more than that, you can have an extra 20GB for $10, 50GB for $25, or 100GB for $50 annually.

Storage isn’t the only benefit of the cloud, however. Beginning with this new release, the new Microsoft Account sign-in (formerly Live ID) lets you use a single username and password to establish common preferences among all the Windows-based hardware and services with which you work. The idea is to employ the cloud to connect your PCs, tablets, and smartphones through a common, user-specific experience.

Ubuntu doesn’t fully compete with Windows in this regard, since it doesn’t offer counterparts to Windows Phone 8 or Windows 8 RT that are tailored specifically to non-PC devices. However, Ubuntu for Android is in the works.

Offering a browser-based control panel, Ubuntu’s Landscape administrative tool can perform most Windows Active Directory tasks.

6. Security

Although Windows RT apps run within a sandboxed environment for greater security, Windows 8 Pro desktop legacy apps have no equivalent. Instead, third-party software developers are left to their own devices to add security measures to their apps.

Windows 8 and Ubuntu Linux provide their own firewalls, however, as well as the option for full disk encryption.

Despite the fact that Windows 8 Pro offers some security improvements over Windows 7, the new OS still carries forward with the WinNT legacy kernel, which is at least partially responsible for the litany of security issues Windows has suffered over the years.

To mitigate some of those issues, Microsoft created in conjunction with partnering OEMs Secure Boot, an extension to UEFI. Windows 8 now provides Secure Boot support on OEM systems, while Ubuntu 12.10 offers a raft of advanced security features such as support for installation with Secure Boot systems.

Additionally, Ubuntu Linux comes with Linux Security Modules (LSM) installed by default. Other security-enhancing measures include chroot, seccomp, seccomp-bpf, and the newest addition—LinuX Containers (LXC)—for third-party developers and users alike.

Just as an aside, it’s interesting to note that, each year at Pwn2Own, hackers get a chance to hack Windows and Apple Mac systems, but Linux is not included in the contest. No exploit can escalate against (and gain root privilege on) Ubuntu Linux running AppArmor-sandboxed Firefox.

7. Administrative tools

For administrative controls, Windows provides Active Directory, using dedicated Active Directory servers.

Canonical supports Active Directory as well, and Ubuntu Linux clients can join to an Active Directory Domain using third-party software such as Likewise Open or Centrify.

In addition, Canonical provides Landscape, an enterprise administrative tool of its own that can perform most Windows Active Directory tasks. Landscape presents an easy-to-use, browser-based control panel through which you can manage desktops, servers, and cloud instances.

Both Windows 8 and Ubuntu Linux 12.10 offer support for popular VPN protocols.

8. VPN support

Users who require virtual private network support will find it in both Windows 8 and Ubuntu 12.10.

In Ubuntu repositories, the provided utility is OpenVPN, which uses a custom security protocol based on SSL/TLS for key exchange. Both operating systems offer support for varied protocols, however, depending on site-specific and inter-site needs.

9. User support

Microsoft offers support for Windows 8 Pro users through its TechNet subscription service, which is priced starting at $149 per year.

Canonical offers Ubuntu Advantage service-level agreements starting at about $80 per year at the standard desktop level, including legal coverage and use of the Landscape administrative tool.

10. Price

Last but certainly not least, Ubuntu Linux is free, while Windows 8 Pro will reportedly cost $199 after the current introductory upgrade offer of $39 to $69 expires.

So which operating system is better for small-business users? The answer, of course, is in the eye of the beholder. If one thing is clear, however, it’s that any lead Windows may have once had over competing operating systems is shrinking every year. Depending on your needs, Ubuntu Linux 12.10 could provide a compelling alternative. If nothing else, it’s almost certainly worth your while to try it online or take it for a free test drive.

                               | Windows 8 Pro (x86)        | Ubuntu 12.10
License fee                    | $39 to $69 upgrade         | Free
CPU architectures supported    | x86, x86-64                | x86, x86-64, ARM, PPC
Minimum RAM                    | 1GB (32-bit), 2GB (64-bit) | 512MB
Minimum hard-disk space        | 20GB                       | 5GB
Concurrent multiuser support   | No                         | Yes
Workspaces                     | One                        | Two or more
Virtualization                 | Hyper-V                    | KVM
License                        | Not applicable             | GPL Open Source: Main, Non-GPL: Restricted
Productivity software included | None                       | LibreOffice
Graphics tools included        | No                         | Yes

Creating PDF Reports with Pandas, Jinja and WeasyPrint by Chris Moffitt

Introduction

Pandas is excellent at manipulating large amounts of data and summarizing it in multiple text and visual representations. Without much effort, pandas supports output to CSV, Excel, HTML, JSON and more. Where things get more difficult is if you want to combine multiple pieces of data into one document. For example, if you want to put two DataFrames on one Excel sheet, you need to use the Excel libraries to manually construct your output. It is certainly possible but not simple. This article will describe one method to combine multiple pieces of information into an HTML template and then convert it to a standalone PDF document using Jinja templates and WeasyPrint.

Before going too far through this article, I would recommend that you review the previous articles on Pandas Pivot Tables and the follow-on article on generating Excel reports from these tables. They explain the data set I am using and how to work with pivot tables.

The Process

As shown in the reporting article, it is very convenient to use Pandas to output data into multiple sheets in an Excel file or create multiple Excel files from pandas DataFrames. However, if you would like to combine multiple pieces of information into a single file, there are not many simple ways to do it straight from Pandas. Fortunately, the python environment has many options to help us out.

In this article, I’m going to use the following process flow to create a multi-page PDF document.

[Figure: Tool pipeline for generating PDF]

The nice thing about this approach is that you can substitute your own tools into this workflow. Don’t like Jinja? Plug in mako or your templating tool of choice. If you want to use another type of markup outside of HTML, go for it.
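To illustrate how swappable the templating step is, here is a minimal sketch that fills an HTML template using only the standard library's string.Template as a stand-in for Jinja. It is not the approach this article uses; the template text, the title, and the placeholder names are hypothetical, and the two figures are the CPU averages printed later in the article, rounded to two decimals. The PDF rendering step is left out entirely.

```python
from string import Template

# Hypothetical template; a real Jinja template would use {{ ... }} placeholders.
report_template = Template(
    "<html><body>"
    "<h1>$title</h1>"
    "<p>Average CPU quantity: $cpu_quantity</p>"
    "<p>Average CPU price: $cpu_price</p>"
    "</body></html>"
)

# Fill the placeholders with values computed elsewhere (hard-coded here).
html_out = report_template.substitute(
    title="Sales Funnel Report",
    cpu_quantity="1.89",
    cpu_price="51666.67",
)
print(html_out)
```

The resulting html_out string is what would be handed to the HTML-to-PDF tool in the next stage of the pipeline.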

The Tools

First, I decided to use HTML as the templating language because it is probably the simplest way to generate structured data and allow for relatively rich formatting. I also think everyone knows (or can figure out) enough HTML to generate a simple report. Also, I don’t have the desire to learn a whole new templating language. However, if you choose to use other markup languages, the flow should work the same.

I chose Jinja because I have experience with Django and it closely mirrors Django’s syntax. There are certainly other options out there so feel free to experiment with your options. I think for this approach there is nothing very complicated about our templates so any tool should work fine.

Finally, the most difficult part of this tool chain is figuring out how to render the HTML into PDF. I don’t feel like there is an optimal solution yet but I chose WeasyPrint because it is still being actively maintained and I found that I could get it working relatively easily. There are quite a few dependencies for it to work so I’ll be curious if people have any real challenges getting it to work on Windows. As an alternative, I have used xhtml2pdf in the past and it works well too. Unfortunately the documentation is a little lacking at this time but it has been around for a while and does generate PDFs effectively from HTML.

The Data

As discussed above, we’ll use the same data from my previous articles. In order to keep this all a self-contained article, here is how I import the data and generate a pivot table as well as some summary statistics of the average quantity and price of the CPU and Software sales.

Import modules, and read in the sales funnel information.

from __future__ import print_function
import pandas as pd
import numpy as np
df = pd.read_excel("sales-funnel.xlsx")
df.head()
   Account  Name                          Rep            Manager       Product      Quantity  Price  Status
0   714466  Trantow-Barrows               Craig Booker   Debra Henley  CPU                 1  30000  presented
1   714466  Trantow-Barrows               Craig Booker   Debra Henley  Software            1  10000  presented
2   714466  Trantow-Barrows               Craig Booker   Debra Henley  Maintenance         2   5000  pending
3   737550  Fritsch, Russel and Anderson  Craig Booker   Debra Henley  CPU                 1  35000  declined
4   146832  Kiehn-Spinka                  Daniel Hilton  Debra Henley  CPU                 2  65000  won

Pivot the data to summarize.

sales_report = pd.pivot_table(df, index=["Manager", "Rep", "Product"], values=["Price", "Quantity"],
                           aggfunc=[np.sum, np.mean], fill_value=0)
sales_report.head()
                                             sum                mean
                                           Price  Quantity     Price  Quantity
Manager       Rep            Product
Debra Henley  Craig Booker   CPU           65000         2     32500         1
                             Maintenance    5000         2      5000         2
                             Software      10000         1     10000         1
              Daniel Hilton  CPU          105000         4     52500         2
                             Software      10000         1     10000         1

Generate some overall descriptive statistics about the entire data set. In this case, we want to show the average quantity and price for CPU and Software sales.

print(df[df["Product"]=="CPU"]["Quantity"].mean())
print(df[df["Product"]=="CPU"]["Price"].mean())
print(df[df["Product"]=="Software"]["Quantity"].mean())
print(df[df["Product"]=="Software"]["Price"].mean())
1.88888888889
51666.6666667
1.0
10000.0
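
As an aside, the same averages can be computed in a single pass with groupby. Here is a minimal sketch of that idea. It uses a tiny made-up DataFrame (the rows are illustrative, not the real sales-funnel.xlsx contents) so that it runs on its own:

```python
import pandas as pd

# Illustrative rows only; the real data comes from sales-funnel.xlsx
sample = pd.DataFrame({
    "Product": ["CPU", "CPU", "Software", "Maintenance"],
    "Quantity": [1, 2, 1, 2],
    "Price": [30000, 65000, 10000, 5000],
})

# One row per product, one column per averaged measure
averages = sample.groupby("Product")[["Quantity", "Price"]].mean()

# Keep only the products we want to highlight in the report
print(averages.loc[["CPU", "Software"]])
```

This is just an alternative to the four print statements above; either approach gives the same numbers for a given data set.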

Ideally what we would like to do now is to split our data up by manager and include some of the summary statistics on a page to help understand how the individual results compare to the national averages.

DataFrame Options

I have one quick aside before we talk templates. For some quick and dirty needs, sometimes all you need to do is copy and paste the data. Fortunately a DataFrame has a to_clipboard() function that will copy the whole DataFrame to the clipboard which you can then easily paste into Excel. I have found this to be a really helpful option in certain situations.

The other option, which we will use later in the template, is to_html(). It generates a string containing a fully composed HTML table with minimal styling applied.
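
To make that concrete, here is a small sketch of what to_html() hands back. The DataFrame is a made-up stand-in for the real sales data:

```python
import pandas as pd

# Illustrative data; any DataFrame works the same way
sample = pd.DataFrame({"Product": ["CPU", "Software"],
                       "Price": [30000, 10000]})

# to_html() returns a plain Python string, it does not write a file
table_html = sample.to_html()

print(type(table_html))
print(table_html[:60])
```

Because the result is an ordinary string, it can be dropped straight into a template variable.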

Templating

Jinja templating is very powerful and supports a lot of advanced features such as sandboxed execution and auto-escaping that are not necessary for this application. These capabilities however will serve you well as your reports grow more complex or you choose to use Jinja for your web apps.

The other nice feature of Jinja is that it includes multiple builtin filters which will allow us to format some of our data in a way that is difficult to do within Pandas.

In order to use Jinja in our application, we need to do 3 things:

  • Create a template
  • Add variables into the template’s context
  • Render the template into HTML
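
As a rough sketch of those three steps, here is a throwaway example that keeps the template in memory with Jinja’s DictLoader instead of a file on disk. The template name hello.html and its variables are made up for illustration:

```python
from jinja2 import Environment, DictLoader

# 1. Create a template (held in memory here instead of on disk)
templates = {"hello.html": "<h1>{{ title }}</h1><p>{{ body }}</p>"}
env = Environment(loader=DictLoader(templates))
template = env.get_template("hello.html")

# 2. Add variables to the template's context
context = {"title": "Sales Report", "body": "Nothing to see yet"}

# 3. Render the template into an HTML string
html = template.render(context)
print(html)
```

The rest of this section follows the same pattern, only with a real file and real data.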

Here is a very simple template. Let’s call it myreport.html:

<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>{{ title }}</title>
</head>
<body>
    <h2>Sales Funnel Report - National</h2>
     {{ national_pivot_table }}
</body>
</html>

The two key portions of this code are {{ title }} and {{ national_pivot_table }}. They are essentially placeholders for variables that we will provide when we render the document.

To populate those variables, we need to create a Jinja environment and get our template:

from jinja2 import Environment, FileSystemLoader
env = Environment(loader=FileSystemLoader('.'))
template = env.get_template("myreport.html")

In the example above, I am assuming that the template is in the current directory but you could put the full path to a template location.
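
Here is a minimal sketch of loading from a specific directory instead of the current one. The directory is a temporary one created on the fly purely so the example runs anywhere; in practice you would point the loader at wherever your templates live:

```python
import os
import tempfile
from jinja2 import Environment, FileSystemLoader

# Hypothetical template directory, created here only for the demo
template_dir = tempfile.mkdtemp()
with open(os.path.join(template_dir, "myreport.html"), "w") as handle:
    handle.write("<title>{{ title }}</title>")

# Point the loader at an absolute directory rather than '.'
env = Environment(loader=FileSystemLoader(template_dir))
template = env.get_template("myreport.html")
print(template.render(title="Sales Funnel Report - National"))
```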

The env object tells Jinja where to find our templates. The other key component is how we pass content to the template: we create a dictionary called template_vars that contains all the variables we want to pass to it.

Note how the names of the variables match our templates.

template_vars = {"title" : "Sales Funnel Report - National",
                 "national_pivot_table": sales_report.to_html()}

The final step is to render the HTML with the variables included in the output. This will create a string that we will eventually pass to our PDF creation engine.

html_out = template.render(template_vars)

For the sake of brevity, I won’t show the full HTML but you should get the idea.

Generate PDF

The PDF creation portion is relatively simple as well. We need to do some imports and pass a string to the PDF generator.

from weasyprint import HTML
HTML(string=html_out).write_pdf("report.pdf")

This command creates a PDF report that looks something like this:

Unstyled pivot table output

Ugh. It’s cool that it’s a PDF but it is ugly. The main problem is that we don’t have any styling on it. The mechanism we have to use to style is CSS.

As an aside, I really don’t like CSS. Every time I start playing with it I feel like I spend more time monkeying with the presentation than I did getting the data summarized. I am open to ideas on how to make this look nicer but in the end, I decided to go the route of using a portion of blueprint CSS to have very simple styling that would work with the rendering engines.

For the rest of the article, I’ll be using blueprint’s typography.css as the basis for my style.css. What I like about this css is:

  • It is relatively small and easy to understand
  • It works well in the PDF engines without throwing errors and warnings
  • It includes basic table formatting that looks pretty decent

Let’s try re-rendering it with our updated stylesheet:

HTML(string=html_out).write_pdf("report.pdf", stylesheets=["style.css"])

Styled pivot table output

Just adding a simple stylesheet makes a huge difference!

There is still a lot more you can do with it but this shows how to make it at least serviceable for a start. As an aside, I think it would be pretty cool if someone that knew CSS way better than me developed an open sourced, simple CSS sheet we could use for report generation like this.

More Complex Templating

Up until now, we haven’t done anything different than if we had just generated a simple Excel sheet using to_excel() on a DataFrame.

In order to generate a more useful report, we are going to combine the summary statistics shown above as well as break out the report to include a separate PDF page per manager.

Let’s start with the updated template (myreport.html):

<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>{{ title }} </title>
</head>
<body>
<div class="container">
    <h2>Sales Funnel Report - National</h2>
     {{ national_pivot_table }}
    {% include "summary.html" %}
</div>
<div class="container">
    {% for manager in Manager_Detail %}
        <p style="page-break-before: always" ></p>
        <h2>Sales Funnel Report - {{manager.0}}</h2>
        {{manager.1}}
        {% include "summary.html" %}
    {% endfor %}
</div>
</body>
</html>

The first thing you’ll notice is that there is an include statement which mentions another file. The include allows us to bring in a snippet of HTML and use it repeatedly in different portions of the code. In this case the summary contains some simple national level stats we want to include on each report so that the managers can compare their performance to the national average.

Here is what summary.html looks like:

<h3>National Summary: CPUs</h3>
    <ul>
        <li>Average Quantity: {{CPU.0|round(1)}}</li>
        <li>Average Price: {{CPU.1|round(1)}}</li>
    </ul>
<h3>National Summary: Software</h3>
    <ul>
        <li>Average Quantity: {{Software.0|round(1)}}</li>
        <li>Average Price: {{Software.1|round(1)}}</li>
    </ul>

In this snippet, you’ll see that there are some additional variables we have access to: CPU and Software. Each of these is a Python list that holds the average quantity and price for CPU and Software sales, in that order.

You may also notice that we use a pipe | to round each value to 1 decimal place. This is one specific example of the use of Jinja’s filters.
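
Here is a tiny standalone sketch of the round filter, using an inline template string rather than a file. The variable name quantity is made up for the example:

```python
from jinja2 import Template

# The pipe sends the value on its left through the filter on its right
template = Template("Average Quantity: {{ quantity|round(1) }}")

rendered = template.render(quantity=1.88888888889)
print(rendered)
```

The rounding happens at render time, so the underlying number in Pandas is left untouched.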

There is also a for loop that allows us to display the details for each manager in our report. Jinja’s template language only includes a very small subset of code that alters the control flow. Basic for-loops are a mainstay of almost any template so they should make sense to most of you.
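
To show the loop on its own, here is a minimal sketch with an inline template. The manager names and the HTML fragments are placeholders; the only thing that matters is that each entry is a two-item list, which is why the template can use .0 and .1 to pick out the parts:

```python
from jinja2 import Template

# Each entry is a [name, html] pair, the shape the manager loop expects
managers = [["Manager A", "<p>table A</p>"],
            ["Manager B", "<p>table B</p>"]]

template = Template(
    "{% for manager in Manager_Detail %}"
    "<h2>{{ manager.0 }}</h2>{{ manager.1 }}\n"
    "{% endfor %}"
)

html = template.render(Manager_Detail=managers)
print(html)
```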

I want to call out one final piece of code that looks a little out of place:

<p style="page-break-before: always" ></p>

This is a simple CSS directive that I put in to make sure each manager’s section starts on a new page of the PDF. I had to do a little digging to figure out the best way to make the pages break so I thought I would include it to help others out.

Additional Stats

Now that we have gone through the templates, here is how to create the additional context variables used in the templates.

Here is a simple summary function:

def get_summary_stats(df,product):
    """
    For certain products we want National Summary level information on the reports
    Return a list of the average quantity and price
    """
    results = []
    results.append(df[df["Product"]==product]["Quantity"].mean())
    results.append(df[df["Product"]==product]["Price"].mean())
    return results
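
If you want to see the function in action, here is a small sketch. It uses a slightly tighter variant that filters once and averages both columns together, which is my own rewording rather than the version used in the rest of this article, and the sample rows are made up:

```python
import pandas as pd


def get_summary_stats(df, product):
    """Return [average quantity, average price] for one product."""
    # Filter once, then average both columns together
    subset = df.loc[df["Product"] == product, ["Quantity", "Price"]]
    return subset.mean().tolist()


# Illustrative rows only
sample = pd.DataFrame({
    "Product": ["CPU", "CPU", "Software"],
    "Quantity": [1, 2, 1],
    "Price": [30000, 65000, 10000],
})

cpu_stats = get_summary_stats(sample, "CPU")
print(cpu_stats)
```

Either version returns a two-item list, so the CPU.0 and CPU.1 lookups in summary.html work the same way.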

We also need to create the manager details:

manager_df = []
for manager in sales_report.index.get_level_values(0).unique():
    manager_df.append([manager, sales_report.xs(manager, level=0).to_html()])
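
In case the index juggling is not obvious, here is a minimal sketch of what get_level_values and xs do on a small multi-level pivot table. The names and numbers are invented for the example:

```python
import pandas as pd

# Illustrative rows only
sample = pd.DataFrame({
    "Manager": ["Manager A", "Manager A", "Manager B"],
    "Rep": ["Rep 1", "Rep 2", "Rep 3"],
    "Price": [30000, 10000, 65000],
})
report = sample.pivot_table(index=["Manager", "Rep"],
                            values="Price", aggfunc="sum")

# The first index level holds the manager names
managers = report.index.get_level_values(0).unique()

# xs() selects one manager's rows and drops that index level
detail = report.xs("Manager A", level=0)

print(list(managers))
print(detail)
```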

Finally, call the template with these variables:

template_vars = {"title" : "National Sales Funnel Report",
                 "CPU" : get_summary_stats(df, "CPU"),
                 "Software": get_summary_stats(df, "Software"),
                 "national_pivot_table": sales_report.to_html(),
                 "Manager_Detail": manager_df}
# Render our file and create the PDF using our css style file
html_out = template.render(template_vars)
HTML(string=html_out).write_pdf("report.pdf",stylesheets=["style.css"])

Here is the final PDF Report. I think it looks pretty decent for a simple report.

Ideas For Improvements

In the example above, we used the simple to_html() to generate our HTML. I suspect that when you start to do more of these you will want to have finer grained control over the output of your table.

There are a couple of options:

  • Pass a custom CSS class to to_html() using the classes argument
  • Use formatters to format the data
  • Pass the data directly to your template and use iterrows to manually construct your table
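
Here is a rough sketch of those three options on a made-up DataFrame. The class name report-table and the dollar format are assumptions I picked for illustration, not something the report above uses:

```python
import pandas as pd
from jinja2 import Template

# Illustrative data only
sample = pd.DataFrame({"Product": ["CPU", "Software"],
                       "Price": [30000, 10000]})

# Options 1 and 2: an extra CSS class plus a per-column formatter
styled = sample.to_html(classes="report-table",
                        formatters={"Price": "${:,.0f}".format},
                        index=False)

# Option 3: hand the DataFrame to the template and build rows by hand
row_template = Template(
    "<table>"
    "{% for idx, row in data.iterrows() %}"
    "<tr><td>{{ row['Product'] }}</td><td>{{ row['Price'] }}</td></tr>"
    "{% endfor %}"
    "</table>"
)
manual = row_template.render(data=sample)

print(styled)
print(manual)
```

The third option is the most work but gives you complete control over the markup of every cell.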

Final Program

In order to pull it all together, here is the full program:

"""
Generate PDF reports from data included in several Pandas DataFrames
From pbpython.com
"""
from __future__ import print_function
import pandas as pd
import numpy as np
import argparse
from jinja2 import Environment, FileSystemLoader
from weasyprint import HTML


def create_pivot(df, infile, index_list=["Manager", "Rep", "Product"], value_list=["Price", "Quantity"]):
    """
    Create a pivot table from a raw DataFrame and return it as a DataFrame
    """
    table = pd.pivot_table(df, index=index_list, values=value_list,
                           aggfunc=[np.sum, np.mean], fill_value=0)
    return table

def get_summary_stats(df,product):
    """
    For certain products we want National Summary level information on the reports
    Return a list of the average quantity and price
    """
    results = []
    results.append(df[df["Product"]==product]["Quantity"].mean())
    results.append(df[df["Product"]==product]["Price"].mean())
    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Generate PDF report')
    parser.add_argument('infile', type=argparse.FileType('r'),
                        help="report source file in Excel")
    parser.add_argument('outfile', type=argparse.FileType('w'),
                        help="output file in PDF")
    args = parser.parse_args()
    # Read in the file and get our pivot table summary
    df = pd.read_excel(args.infile.name)
    sales_report = create_pivot(df, args.infile.name)
    # Build the per-manager detail tables to include as well
    manager_df = []
    for manager in sales_report.index.get_level_values(0).unique():
        manager_df.append([manager, sales_report.xs(manager, level=0).to_html()])
    # Do our templating now
    # We can specify any directory for the loader but for this example, use current directory
    env = Environment(loader=FileSystemLoader('.'))
    template = env.get_template("myreport.html")
    template_vars = {"title" : "National Sales Funnel Report",
                     "CPU" : get_summary_stats(df, "CPU"),
                     "Software": get_summary_stats(df, "Software"),
                     "national_pivot_table": sales_report.to_html(),
                     "Manager_Detail": manager_df}
    # Render our file and create the PDF using our css style file
    html_out = template.render(template_vars)
    HTML(string=html_out).write_pdf(args.outfile.name,stylesheets=["style.css"])

You can also view the gist if you are interested and download a zip file of myreport.html, style.css and summary.html if you find it helpful.

Thanks for reading all the way to the end. As always, feedback is appreciated.