Data Cleaning Helpers in R language

Friendly post from R-Bloggers’ author: Christopher Gandrud… It helped me a lot. I believe you will find it as much as helpful as I did.

As I go about cleaning and merging data sets with R I often end up creating and using simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation and others can possibly benefit from simplified and (hopefully) more intuitive code.

I’ve talked about some of the commands in DataCombine in previous posts. In this post I’ll give examples for a few more that I’ve added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11.

Here is a brief run down of the functions covered in this post:

FindReplace: a function to replace multiple patterns found in a character string column of a data frame.

MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.

rmExcept: removes all objects from a work space except those specified by the user.

FindReplace

Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:

ABData <- data.frame(a = c(“London, UK”, “Oxford, UK”, “Berlin, DE”, “Hamburg, DE”, “Oslo, NO”), b = c(8, 0.1, 3, 2, 1))

Ok, now I want to replace the UK and DE parts of the strings with England and Germany. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:

Replaces <- data.frame(from = c(“UK”, “DE”), to = c(“England”, “Germany”))

Now I can just use FindReplace to make the replacements all at once:

library(DataCombine)

ABNewDF <- FindReplace(data = ABData, Var = “a”, replaceData = Replaces, from = “from”, to = “to”, exact = FALSE)

# Show changes
ABNewDF

## a b
## 1 London, England 8.0
## 2 Oxford, England 0.1
## 3 Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5 Oslo, NO 1.0

If you set exact = TRUE then FindReplace will only replace exact pattern matches. Also, you can set vector = TRUE to return only a vector of the column you replaced (the Var column), rather than the whole data frame.
MoveFront

On occasion I’ve wanted to move a few variables to the front of a data frame. The MoveFront function makes this pretty simple. It only has two arguments: data and Var. Data is the data frame and Var is a character vector with the columns I want to move to the front of the data frame in the order that I want them. Here is an example:

# Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)

names(OldOrder)

## [1] “A” “B” “C”

# Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c(“B”, “A”))
names(NewOrder2)

## [1] “B” “A” “C”

rmExcept

Finally, sometimes I want to clean up my work space and only keep specific objects. I want to remove everything else. This is straightforward with rmExcept. For example:

# Create objects
A <- 1
B <- 2
C <- 3

# Remove all objects except for A
rmExcept(“A”)

## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces

# Show workspace
ls()

## [1] “A”

You can set the environment you want to clean up with the environ argument. By default is is your global environment.

10 reasons to choose Ubuntu over Windows 8

During previous week I had a great dilemma to deal with, which is having Ubuntu or Windows 8 as a first operating system to use.   I know I could have double operating system, and I did have Ubuntu in my USB device. Here is a post from pcworld website.

Microsoft’s Windows 8 dominated countless headlines in the weeks leading up to its launch late last month, but October saw the debut of another major operating system as well.

Canonical’s Ubuntu 12.10 “Quantal Quetzal” arrived a week ahead of its competitor, in fact, accompanied by a challenge: “Avoid the pain of Windows 8.” That slogan appeared on the Ubuntu home page for the first few hours after the OS’s official launch, and attracted considerable attention.

Apparently Canonical decided to tone down its message later in the day—the slogan now reads “Your wish is our command“—but it seems fair to say that the underlying challenge remains.

Ubuntu comes with a variety of software packages, including Firefox, Thunderbird, and the full-featured productivity suite LibreOffice.

Window of opportunity

Ubuntu is a widely popular open-source Linux distribution with eight years of maturity under its belt, and more than 20 million users. Of the roughly 5 percent of desktop OSs accounted for by Linux, at least one survey suggests that about half are Ubuntu. (Windows, meanwhile, accounts for about 84 percent.)

The timing of this latest Ubuntu release couldn’t be better for Windows users faced with the paradigm-busting Windows 8 and the big decision of whether to take the plunge.

Initial uptake of Windows 8 has been unenthusiastic, according to reports, and a full 80 percent of businesses will never adopt it, Gartner predicts. As a result, Microsoft’s big gamble may be desktop Linux’s big opportunity.

So, now that Canonical has thrown down the gauntlet, let’s take a closer look at Ubuntu 12.10 to see how it compares with Windows 8 from a business user’s perspective.

Perhaps the biggest surprise for many users of Windows 8’s mobile-style Modern UI is that it has no Start button.

1. Unity vs. Modern UI

Both Microsoft and Canonical have received considerable flak for the default user interfaces in their respective OSs. In Microsoft’s case, of course, it’s the Modern UI, formerly known as Metro; in Canonical’s case, it’s Unity. Both are designed with touchscreens in mind, and borrow heavily from the mobile world.

By removing the Start button and overhauling the way users interact with the operating system, Windows 8’s Modern interface poses a considerable challenge for users, who face a significant learning curve.

Unity, on the other hand, became a default part of Ubuntu back in April 2011 with Ubuntu 11.04 “Natty Narwhal.” It has definitely undergone growing pains, but more than a year has passed, and Canonical has revised the interface accordingly. Although it still has numerous critics, most people concede that it has matured and improved. Some observers, in fact, have even suggested that it may feel more familiar to many longtime Windows users than does Windows 8.

One advantage of Ubuntu Linux is that it supports multiple workspaces.

2. Customizability

Linux has long been known for its virtually limitless customizability, but given the current controversy surrounding desktop interfaces, that feature has become more salient than ever.

This is a point on which Windows 8 and Ubuntu differ considerably. Yes, Windows 8 does allow users to customize some aspects of their environment, such as by specifying the size of Live Tile icons, moving commonly used tiles to the left side of the screen, or grouping tiles by program type.

Most of the changes you can make in Windows 8, however, are largely cosmetic, and they don’t include a built-in way to set the OS to boot to the traditional Windows desktop. A growing assortment of third-party utilities such as Pokki can restore that capability, but otherwise you’re stuck with Modern UI. Windows 8 offers what you might call a “tightly coupled” interface—in other words, one that you can’t change substantially.

Microsoft’s Windows Store was sparsely populated at launch, but company executives have said that the number of apps will increase quickly.

Ubuntu’s Unity, in contrast, is more of a loosely coupled UI. First and foremost, you can easily replace it with any one of several free alternatives, including KDE, Xfce, LXDE, GNOME 3 Shell, Cinnamon, and MATE.

Also available for Unity are third-party customization tools, including the increasingly popular Ubuntu Tweak, while a raft of “look” sites are available for myriad Linux interfaces with a variety of themes to change the desktop’s appearance.

The rule of thumb with Linux in general and Ubuntu in particular is, if you don’t like it, swap in something else. Also worth mentioning is the fact that Ubuntu supports multiple workspaces, essentially letting you run up to four different desktops; Windows 8 Pro does not.

3. Apps

Whereas Windows 8 Pro comes bundled with Microsoft’s Internet Explorer 10 browser, Ubuntu comes with a wide assortment of open-source software packages such as Firefox, Thunderbird, LibreOffice, and more, offering both individual and business users a pretty full suite of functionality.

Similar to Microsoft’s SkyDrive, Ubuntu One allows users to back up and access their files from Ubuntu, Windows, the Web, or a mobile device.

Beyond those bundled programs, both Ubuntu and Windows 8 offer app stores to help users find the additional software they need.

Dating back to 2009, the Ubuntu Software Center now houses more than 40,000 apps, ranging from games to productivity tools to educational resources. In addition, by using Wine or CodeWeaver’s CrossOver, you can run Windows programs on top of Linux.

The Windows Store just launched with Windows 8, and at the time of its debut it included just over 9000 apps. Microsoft execs have said that they hope to provide 100,000 apps in the Windows Store within 90 days of the Windows launch.

Operating system binaries and drivers, however, will not come from the Windows Store. Rather, it will have both Windows RT (ARM) apps and Windows desktop (“legacy”) apps. Entries for legacy desktop apps in the Windows Store will take users to separate sites where they can purchase or download the apps. Ubuntu’s repository, on the other hand, centrally stores all operating system and app binaries and drivers.

As a result, aside from numbers, a key difference between the two app stores involves security. Ubuntu provides a GNU Privacy Guard (GnuPG) keyring-protected repository system wherein each application and driver has a unique keyring identity to verify its authenticity and integrity as having come only from the Ubuntu repo system. The keyring method of protection has been highly effective at ensuring that no rogue applications find their way into the repo—or onto users’ PCs.

Historically, Microsoft Windows has lacked such a keyring-protected repository. Although Microsoft does support its OS with monthly Windows Updates, no comparable third-party vendor support for updates exists. Because of this situation, users have had to venture online to obtain their own third-party-supported updates manually at separate websites. The Windows Store was developed to mitigate that risk and is specifically designed to curate apps, screen apps, and provide the capability to purchase apps. Time will tell how well it succeeds.

4. Hardware compatibility

To run Windows 8 on your PC, you’ll need a processor that’s 1GHz or faster with support for PAE, NX, and SSE2. You’ll also need a minimum of 1GB RAM for the 32-bit version or 2GB for the 64-bit version, along with 16GB (32-bit) or 20GB (64-bit) of space on your hard drive. For graphics processing, you’ll need a Microsoft DirectX 9-compatible graphics device with a WDDM driver, Microsoft says.

Of course, that’s the minimum. If you want to take advantage of Windows 8’s touch features, obviously you’ll need a multitouch device. To make the most of the software, you’ll want considerably more than that.

Ubuntu’s requirements, however, are much more modest: Canonical recommends 512MB of RAM, plus 5GB on the hard drive. You’ll also find versions such as Lubuntu and Xubuntu for lower-spec machines. In short, if hardware is a constraining factor for you, Ubuntu is most likely the better choice.

Microsoft’s SkyDrive service lets users upload and sync files to the cloud and then access them from virtually any browser or local device.

5. Cloud integration

Starting with the launch of Ubuntu One in 2009, the cloud has played a key role in Ubuntu Linux for some time, enabling users to store files online and sync them among computers and mobile devices, as well as to stream audio and music from the cloud to mobile devices.

Ubuntu One works on Windows, OS X, iOS, and Android, as well as on Ubuntu. Users of Ubuntu Linux get 5GB of Ubuntu One storage for free; 20GB costs $30 per year.

Beginning with Ubuntu 12.10, the OS also integrates Web apps and online searches directly into the Unity desktop for a more seamless experience.

With Windows 8, the cloud is coming to the forefront of Microsoft’s platform as well. For storage, Microsoft’s SkyDrive offers users 7GB of space for free. If you need more than that, you can have an extra 20GB for $10, 50GB for $25, or 100GB for $50 annually.

Storage isn’t the only benefit of the cloud, however. Beginning with this new release, the new Microsoft Account sign-in (formerly Live ID) lets you use a single username and password to establish common preferences among all the Windows-based hardware and services with which you work. The idea is to employ the cloud to connect your PCs, tablets, and smartphones through a common, user-specific experience.

Ubuntu doesn’t fully compete with Windows in this regard, since it doesn’t offer counterparts to Windows Phone 8 or Windows 8 RT that are tailored specifically to non-PC devices. However, Ubuntu for Android is in the works.

Offering a browser-based control panel, Ubuntu’s Landscape administrative tool can perform most Windows Active Directory tasks.

6. Security

Although Windows RT apps run within a sandboxed environment for greater security, Windows 8 Pro desktop legacy apps have no equivalent. Instead, third-party software developers are left to their own devices to add security measures to their apps.

Windows 8 and Ubuntu Linux provide their own firewalls, however, as well as the option for full disk encryption.

Despite the fact that Windows 8 Pro offers some security improvements over Windows 7, the new OS still carries forward with the WinNT legacy kernel, which is at least partially responsible for the litany of security issues Windows has suffered over the years.

To mitigate some of those issues, Microsoft created in conjunction with partnering OEMs Secure Boot, an extension to UEFI. Windows 8 now provides Secure Boot support on OEM systems, while Ubuntu 12.10 offers a raft of advanced security features such as support for installation with Secure Boot systems.

Additionally, Ubuntu Linux comes with Linux Security Modules (LSM) installed by default. Other security-enhancing measures include chroot, seccomp, seccomp-bpf, and the newest addition—LinuX Containers (LXC)—for third-party developers and users alike.

Just as an aside, it’s interesting to note that, each year at Pwn2Own, hackers get a chance to hack Windows and Apple Mac systems, but Linux is not included in the contest. No exploit can escalate against (and gain root privilege on) Ubuntu Linux running AppArmor-sandboxed Firefox.

7. Administrative tools

For administrative controls, Windows provides Active Directory, using dedicated Active Directory servers.

Canonical supports Active Directory as well, and Ubuntu Linux clients can join to an Active Directory Domain using third-party software such as Likewise Open or Centrify.

In addition, Canonical provides Landscape, an enterprise administrative tool of its own that can perform most Windows Active Directory tasks. Landscape presents an easy-to-use, browser-based control panel through which you can manage desktops, servers, and cloud instances.

Both Windows 8 and Ubuntu Linux 12.10 offer support for popular VPN protocols.

8. VPN support

Users who require virtual private network support will find it in both Windows 8 and Ubuntu 12.10.

In Ubuntu repositories, the provided utility is OpenVPN, which uses a custom security protocol based on SSL/TLS for key exchange. Both operating systems offer support for varied protocols, however, depending on site-specific and inter-site needs.

9. User support

Microsoft offers support for Windows 8 Pro users through its TechNet subscription service, which is priced starting at $149 per year.

Canonical offers Ubuntu Advantage service-level agreements starting at about $80 per year at the standard desktop level, including legal coverage and use of the Landscape administrative tool.

10. Price

Last but certainly not least, Ubuntu Linux is free, while Windows 8 Pro will reportedly cost $199 after the current introductory upgrade offer of $39 to $69 expires.

So which operating system is better for small-business users? The answer, of course, is in the eye of the beholder. If one thing is clear, however, it’s that any lead Windows may have once had over competing operating systems is shrinking every year. Depending on your needs, Ubuntu Linux 12.10 could provide a compelling alternative. If nothing else, it’s almost certainly worth your while to try it online or take it for a free test drive.

Windows 8 Pro (x86) Ubuntu 12.10
License fee $39 to $69 upgrade Free
CPU architectures supported x86, x86-64 x86, x86-64, ARM, PPC
Minimum RAM 1GB, 2GB 512MB
Minimum hard-disk space 20GB 5GB
Concurrent multiuser support No Yes
Workspaces One Two or more
Virtualization Hyper-V KVM
License Not applicable GPL Open Source: Main, Non-GPL: Restricted
Productivity software included None LibreOffice
Graphics tools included No Yes

Creating PDF Reports with Pandas, Jinja and WeasyPrint by Chris Moffitt

Introduction

Pandas is excellent at manipulating large amounts of data and summarizing it in multiple text and visual representations. Without much effort, pandas supports output to CSV, Excel, HTML, json and more. Where things get more difficult is if you want to combine multiple pieces of data into one document. For example, if you want to put two DataFrames on one Excel sheet, you need to use the Excel libraries to manually construct your output. It is certainly possible but not simple. This article will describe one method to combine multiple pieces of information into an HTML template and then converting it to a standalone PDF document using Jinja templates and WeasyPrint.

Before going too far through this article, I would recommend that you review the previous articles on Pandas Pivot Tables and the follow-on article on generating Excel reports from these tables. They explain the data set I am using and how to work with pivot tables.

The Process

As shown in the reporting article, it is very convenient to use Pandas to output data into multiple sheets in an Excel file or create multiple Excel files from pandas DataFrames. However, if you would like to combine multiple pieces of information into a single file, there are not many simple ways to do it straight from Pandas. Fortunately, the python environment has many options to help us out.

In this article, I’m going to use the following process flow to create a multi-page PDF document.

Tool pipeline for generating PDF

The nice thing about this approach is that you can substitute your own tools into this workflow. Don’t like Jinja? Plug in mako or your templating tool of choice. If you want to use another type of markup outside of HTML, go for it.

The Tools

First, I decided to use HTML as the templating language because it is probably the simplest way to generate structured data and allow for relatively rich formatting. I also think everyone knows (or can figure out) enough HTML to generate a simple report. Also, I don’t have the desire to learn a whole new templating language. However, if you choose to use other markup languages, the flow should work the same.

I chose Jinja because I have experience with Django and it closely mirrors Django’s syntax. There are certainly other options out there so feel free to experiment with your options. I think for this approach there is nothing very complicated about our templates so any tool should work fine.

Finally, the most difficult part of this tool chain is figuring out how to render the HTML into PDF. I don’t feel like there is an optimal solution yet but I chose WeasyPrint because it is still being actively maintained and I found that I could get it working relatively easily. There are quite a few dependencies for it to work so I’ll be curious if people have any real challenges getting it to work on Windows. As an alternative, I have used xhtml2pdf in the past and it works well too. Unfortunately the documentation is a little lacking at this time but it has been around for a while and does generate PDF’s effectively from HTML.

The Data

As discussed above, we’ll use the same data from my previous articles. In order to keep this all a self-contained article, here is how I import the data and generate a pivot table as well as some summary statistics of the average quantity and price of the CPU and Software sales.

Import modules, and read in the sales funnel information.

from __future__ import print_function
import pandas as pd
import numpy as np
df = pd.read_excel("sales-funnel.xlsx")
df.head()
Account Name Rep Manager Product Quantity Price Status
0 714466 Trantow-Barrows Craig Booker Debra Henley CPU 1 30000 presented
1 714466 Trantow-Barrows Craig Booker Debra Henley Software 1 10000 presented
2 714466 Trantow-Barrows Craig Booker Debra Henley Maintenance 2 5000 pending
3 737550 Fritsch, Russel and Anderson Craig Booker Debra Henley CPU 1 35000 declined
4 146832 Kiehn-Spinka Daniel Hilton Debra Henley CPU 2 65000 won

Pivot the data to summarize.

sales_report = pd.pivot_table(df, index=["Manager", "Rep", "Product"], values=["Price", "Quantity"],
                           aggfunc=[np.sum, np.mean], fill_value=0)
sales_report.head()
sum mean
Price Quantity Price Quantity
Manager Rep Product
Debra Henley Craig Booker CPU 65000 2 32500 1
Maintenance 5000 2 5000 2
Software 10000 1 10000 1
Daniel Hilton CPU 105000 4 52500 2
Software 10000 1 10000 1

Generate some overall descriptive statistics about the entire data set. In this case, we want to show the average quantity and price for CPU and Software sales.

print(df[df["Product"]=="CPU"]["Quantity"].mean())
print(df[df["Product"]=="CPU"]["Price"].mean())
print(df[df["Product"]=="Software"]["Quantity"].mean())
print(df[df["Product"]=="Software"]["Price"].mean())
1.88888888889
51666.6666667
1.0
10000.0

Ideally what we would like to do now is to split our data up by manager and include some of the summary statistics on a page to help understand how the individual results compare to the national averages.

DataFrame Options

I have one quick aside before we talk templates. For some quick and dirty needs, sometimes all you need to do is copy and paste the data. Fortunately a DataFrame has a to_clipboard() function that will copy the whole DataFrame to the clipboard which you can then easily paste into Excel. I have found this to be a really helpful option in certain situations.

The other option we will use later in the template is the to_html() which will generate a string containing a fully composed HTML table with minimal styling applied.

Templating

Jinja templating is very powerful and supports a lot of advanced features such as sandboxed execution and auto-escaping that are not necessary for this application. These capabilities however will serve you well as your reports grow more complex or you choose to use Jinja for your web apps.

The other nice feature of Jinja is that it includes multiple builtin filters which will allow us to format some of our data in a way that is difficult to do within Pandas.

In order to use Jinja in our application, we need to do 3 things:

  • Create a template
  • Add variables into the templates context
  • Render the template into HTML

Here is a very simple template, let’s call it myreport.html :

<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>{{ title }}</title>
</head>
<body>
    <h2>Sales Funnel Report - National</h2>
     {{ national_pivot_table }}
</body>
</html>

The two keys portions of this code are the {{ title }} and {{ national_pivot_table }} . They are essentially placeholders for variables that we will provide when we render the document.

To populate those variable, we need to create a Jinja environment and get our template:

from jinja2 import Environment, FileSystemLoader
env = Environment(loader=FileSystemLoader('.'))
template = env.get_template("myreport.html")

In the example above, I am assuming that the template is in the current directory but you could put the full path to a template location.

The other key component is the creation of env . This variable is how we pass content to our template. We create a dictionary called template_var that contains all the variable we want to pass to the template.

Note how the names of the variables match our templates.

template_vars = {"title" : "Sales Funnel Report - National",
                 "national_pivot_table": sales_report.to_html()}

The final step is to render the HTML with the variables included in the output. This will create a string that we will eventually pass to our PDF creation engine.

html_out = template.render(template_vars)

For the sake of brevity, I won’t show the full HTML but you should get the idea.

Generate PDF

The PDF creation portion is relatively simple as well. We need to do some imports and pass a string to the PDF generator.

from weasyprint import HTML
HTML(string=html_out).write_pdf("report.pdf")

This command creates a PDF report that looks something like this:

Unstyled pivot table output

Ugh. It’s cool that it’s a PDF but it is ugly. The main problem is that we don’t have any styling on it. The mechanism we have to use to style is CSS.

As an aside, I really don’t like CSS. Every time I start playing with it I feel like I spend more time monkeying with the presentation than I did getting the data summarized. I am open to ideas on how to make this look nicer but in the end, I decided to go the route of using a portion of blueprint CSS to have very simple styling that would work with the rendering engines.

For the rest of the article, I’ll be using blue print’s typography.css as the basis for my style.css shown below. What I like about this css is:

  • It is relatively small and easy to understand
  • It works will in the PDF engines without throwing errors and warnings
  • It includes basic table formatting that looks pretty decent

Let’s try re-rendering it with our updated stylesheet:

HTML(string=html_out).write_pdf(args.outfile.name, stylesheets=["style.css"])
Styled pivot table output

Just adding a simple stylesheet makes a huge difference!

There is still a lot more you can do with it but this shows how to make it at least serviceable for a start. As an aside, I think it would be pretty cool if someone that knew CSS way better than me developed an open sourced, simple CSS sheet we could use for report generation like this.

More Complex Templating

Up until now, we haven’t done anything different than if we had just generated a simple Excel sheet using to_excel() on a DataFrame.

In order to generate a more useful report, we are going to combine the summary statistics shown above as well as break out the report to include a separate PDF page per manager.

Let’s start with the updated template (myreport.html):

<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>{{ title }} </title>
</head>
<body>
<div class="container">
    <h2>Sales Funnel Report - National</h2>
     {{ national_pivot_table }}
    {% include "summary.html" %}
</div>
<div class="container">
    {% for manager in Manager_Detail %}
        <p style="page-break-before: always" ></p>
        <h2>Sales Funnel Report - {{manager.0}}</h2>
        {{manager.1}}
        {% include "summary.html" %}
    {% endfor %}
</div>
</body>
</html>

The first thing you’ll notice is that there is an include statement which mentions another file. The include allows us to bring in a snippet of HTML and use it repeteadly in different portions of the code. In this case the summary contains some simple national level stats we want to include on each report so that the managers can compare their performance to the national average.

Here is what summary.html looks like:

<h3>National Summary: CPUs</h3>
    <ul>
        <li>Average Quantity: {{CPU.0|round(1)}}</li>
        <li>Average Price: {{CPU.1|round(1)}}</li>
    </ul>
<h3>National Summary: Software</h3>
    <ul>
        <li>Average Quantity: {{Software.0|round(1)}}</li>
        <li>Average Price: {{Software.1|round(1)}}</li>
    </ul>

In this snippet, you’ll see that there are some additional variables we have access to: CPU and Software . Each of these is a python list that includes the average quantity and price for CPU and Software sales.

You may also notice that we use a pipe | to round each value to 1 decimal place. This is one specific example of the use of Jinja’s filters.

There is also a for loop that allows us to display the details for each manager in our report. Jinja’s template language only includes a very small subset of code that alters the control flow. Basic for-loops are a mainstay of almost any template so they should make sense to most of you.

I want to call out one final piece of code that looks a little out of place:

<p style="page-break-before: always" ></p>

This is a simple CSS directive that I put in to make sure the CSS breaks on each page. I had to do a little digging to figure out the best way to make the pages break so I thought I would include it to help others out.

Additional Stats

Now that we have gone through the templates, here is how to create the additional context variables used in the templates.

Here is a simple summary function:

def get_summary_stats(df,product):
    """
    For certain products we want National Summary level information on the reports
    Return a list of the average quantity and price
    """
    results = []
    results.append(df[df["Product"]==product]["Quantity"].mean())
    results.append(df[df["Product"]==product]["Price"].mean())
    return results

We also need to create the manager details:

manager_df = []
for manager in sales_report.index.get_level_values(0).unique():
    manager_df.append([manager, sales_report.xs(manager, level=0).to_html()])

Finally, call the template with these variables:

template_vars = {"title" : "National Sales Funnel Report",
                 "CPU" : get_summary_stats(df, "CPU"),
                 "Software": get_summary_stats(df, "Software"),
                 "national_pivot_table": sales_report.to_html(),
                 "Manager_Detail": manager_df}
# Render our file and create the PDF using our css style file
html_out = template.render(template_vars)
HTML(string=html_out).write_pdf("report.pdf",stylesheets=["style.css"])

Here is the final PDF Report . I think it looks pretty decent for a simple report.

Ideas For Improvements

In the example above, we used the simple to_html() to generate our HTML. I suspect that when you start to do more of these you will want to have finer grained control over the output of your table.

There are a couple of options:

  • Pass a custom css class to_html using classes
  • Use formatters to format the data
  • Pass the data directly to your template and use iterrows to manually construct your table

Final Program

In order to pull it all together, here is the full program:

"""
Generate PDF reports from data included in several Pandas DataFrames
From pbpython.com
"""
from __future__ import print_function
import pandas as pd
import numpy as np
import argparse
from jinja2 import Environment, FileSystemLoader
from weasyprint import HTML


def create_pivot(df, infile, index_list=["Manager", "Rep", "Product"], value_list=["Price", "Quantity"]):
    """
    Create a pivot table from a raw DataFrame and return it as a DataFrame
    """
    table = pd.pivot_table(df, index=index_list, values=value_list,
                           aggfunc=[np.sum, np.mean], fill_value=0)
    return table

def get_summary_stats(df,product):
    """
    For certain products we want National Summary level information on the reports
    Return a list of the average quantity and price
    """
    results = []
    results.append(df[df["Product"]==product]["Quantity"].mean())
    results.append(df[df["Product"]==product]["Price"].mean())
    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Generate PDF report')
    parser.add_argument('infile', type=argparse.FileType('r'),
    help="report source file in Excel")
    parser.add_argument('outfile', type=argparse.FileType('w'),
    help="output file in PDF")
    args = parser.parse_args()
    # Read in the file and get our pivot table summary
    df = pd.read_excel(args.infile.name)
    sales_report = create_pivot(df, args.infile.name)
    # Get some national summary to include as well
    manager_df = []
    for manager in sales_report.index.get_level_values(0).unique():
        manager_df.append([manager, sales_report.xs(manager, level=0).to_html()])
    # Do our templating now
    # We can specify any directory for the loader but for this example, use current directory
    env = Environment(loader=FileSystemLoader('.'))
    template = env.get_template("myreport.html")
    template_vars = {"title" : "National Sales Funnel Report",
                     "CPU" : get_summary_stats(df, "CPU"),
                     "Software": get_summary_stats(df, "Software"),
                     "national_pivot_table": sales_report.to_html(),
                     "Manager_Detail": manager_df}
    # Render our file and create the PDF using our css style file
    html_out = template.render(template_vars)
    HTML(string=html_out).write_pdf(args.outfile.name,stylesheets=["style.css"])

You can also view the gist if you are interested amd download a zip file of myreport.html, style.css and summary.html if you find it helpful.

Thanks for reading all the way to the end. As always, feedback is appreciated.

Take note, it’s time to take notes

Great post from Richard Branson, I wanted to share with you…

In his experience, 99 per cent of people in leadership roles don’t take notes. What’s more, males are less likely to take notes than their female counterparts.

 

He recently met with 30 chief executives for a dinner-table conversation about closing the gender gap. They discussed how men can counteract bias in the workplace by speaking up and championing their female colleagues. It was a wonderfully eye-opening discussion, full of valuable insights; yet he was the only person who took notes the entire time – “and boy did I take notes, I ran out of white space and had to write over my notes, my hotel notepad, my report and even my name tag!”, he said.

Sheryl Sandberg, Facebook’s chief operating officer and the founder of LeanIn.Org, expressed her surprise over his note-taking in this excellent piece about women in the workplace co-written with Adam Grant. Sheryl is putting together a series on the subject on LinkedIn too, he’s looking forward to getting involved.

Back in their meeting, conversation came around to the subject of women being more likely to be note takers in meetings, because there is an unfair expectation on them to do support work. In other words as a society we expect the office housework to fall to a woman.

Not only is this unfair to women, but it’s also disadvantageous to men. It’s time for men to step up and do their share of support work. On top of counteracting gender bias in the work force, it will also give men a better understanding of what going on within the business and what needs to be done to make things run more effectively. Mentoring, training and note taking – these are wonderful development areas, which everyone, men and women alike, can greatly benefit from.

Note taking is one of his favorite pastimes,  which he stated with, “I can’t tell you where I’d be if I hadn’t had a pen on hand to write down my ideas (or more importantly, other people’s) as soon as they came to me.” Some of Virgin’s most successful companies have been born from random moments – if they hadn’t opened our notebooks, they would never have happened.

No matter how big, small, simple or complex an idea is, get it in writing. But don’t just take notes for the sake of taking notes, go through your ideas and turn them into actionable and measurable goals. If you don’t write your ideas down, they could leave your head before you even leave the room.

To counteract the gender bias, men shouldn’t take over the note taking from women, everyone should be taking notes!

Miller-Rabin Primality Test

Math ∩ Programming

Problem: Determine if a number is prime, with an acceptably small error rate.

Solution: (in Python)

Discussion: This algorithm is known as the Miller-Rabin primality test, and it was a very important breakthrough in the study of probabilistic algorithms.

Efficiently testing whether a number is prime is a crucial problem in cryptography, because the security of many cryptosystems depends on the use of large randomly chosen primes. Indeed, we’ve seen one on this blog already which is in widespread use: RSA. Randomized algorithms also have quite useful applications in general, because it’s often that a solution which is correct with probability, say, $latex 2^{-100}$ is good enough for practice.

But from a theoretical and historical perspective, primality testing lied at the center of a huge problem in complexity theory. In particular, it is unknown whether algorithms which have access to randomness and can output probably correct answers are more…

View original post 425 more words

Beginning Data Science with R by Manas A. Pathak, Springer

Why we should start learning R for getting into data science…

Compudicted

Beginnng Data Science With R

Continuing on with the Springer series on Computational Intelligence and Complexity I picked another book on the ever increasing in popularity R.

Besides, I read already several books from other publishers in 2014. The books were aiming at different levels, and at people from different professional backgrounds. Myself, a data practitioner, positioned rather away from being a data scientist, sitting closer to the server side, with periodic ETL or Business Intelligence development tasks at hand professional I started to realize the times have changed: each new project requires new depth and breaths of data analysis. Using Excel and its data add-ons does no longer cut it in. I was aware of tools as MATLAB, SAS and SPSS, but boy they cost!

I was always in love with data, linear, discrete algebra and statistics in general so for me R came to the natural choice. Learning tools as R (not just a…

View original post 459 more words

Using GitHub with R and RStudio

 

 

 

github-logo

 

This was an amazing post from molecular ecologist blog. I loved it and I would like to share with you…

A few weeks back, the Molecular Ecologist released an article about GitHub and also created an organization where you can fork or simply download code shared by the Molecular Ecology community. A few of you out there may still be skeptical about the benefits of using GitHub. Or you may find it confusing and not worth the bother. You may be thinking to yourself (well, at least, I was guilty of this) that all of your code is backed up on Dropbox, Google Drive, and three external hardrives – so what could possibly go wrong? The short answer is: lots! The longer answer is that there really are some tremendous advantages associated with using Git and GitHub that may not be immediately apparent.

Git is a version control system and allows you to save copies of your code throughout the entire developmental process. Git isn’t the only version control system out there (e.g., SVN), but it is one of the more popular implementations. GitHub allows you to push your code from your local workspace to be hosted online. GitHub, which seamlessly integrates with Git, allows you to 1.) keep copies of all of your code through time, 2.) compare code from various points in time (very useful for debugging), 3.) collaborate with people on the same project in a non-chaos inducing fashion, and 4.) keep copies of your code both locally and online (note that you should still officially back up all of your work). Still not convinced? I suggest you google ‘why should I use version control?’

Below, I show how to use GitHub with Rstudio and also show that it is equally easy to use GitHub with any simple file of code. Thus, the take home message for the day is ‘GitHub is easy and you should use it.’

RStudio is an excellent integrated development environment built specifically for R. It also contains version control for Git and SVN. Below I outline the simple steps to get RStudio working with GitHub.

  1. Setup a GitHub account here.
  2. Download and install Rstudio.
  3. Download and install the platform-specific version of Git (not GitHub), default options   work well.
  4. Configure Git with global commands. I have found this step necessary both times I     ran through this process. Open up the bash version of Git and type the following:         git config –global user.name “your GitHub account name”                                                     git config –global user.email “GitHubEmail@something.com”
  5. Open Rstudio and set the path to Git executable. Go to Tools > Options > Git/SVN                Screenshot 2013-11-12 09.53.56 - Copy

It is important that you find your git.exe file (as shown above). This may be located in any number of places depending on your operating system, but the location of your GIT install is a good first place to look.

Restart RStudio and that is all there is to it! There are some simple guidelines at the RStudio website, which may be helpful. Now that you have successfully installed everything, lets run through a quick example. There are four terms associated with Git that you must learn: repository, commit, push, and pull. A repository equals the location and name for all the files associated with a particular project. The first step is to log into your GitHub account and create a new repository. Make sure you check the box ‘Initialize this repository with a README.’ When you are done, you should be able to view the Repository like below:

Screenshot 2013-11-12 09.36.42 - Copy

Notice the box highlighted in red. That box is really important – remember it as the ‘red box’. Now, open Rstudio and go to Project > Create Project > Version Control > Git and you should see a screen like below:

Screenshot 2013-11-12 09.37.08 - Copy

In the Repository URL box, you should copy and paste the URL indicated in the ‘red box’ above. This is how Rstudio knows what repository to use and associates it with your new project files. In this box you can also set the project directory.  Now do some work in your new R project and create and save some files. The next step is to ‘commit’ your work – essentially making a copy of all of your script files (i.e., .R files) associated with the R project. To do this go to Tools > Version Control > Commit.  This brings up the following window:

Untitled

Here you can see that I have saved two files, test1 and test2. Now I simply check the files that  I want to commit and press the ‘commit’ button, highlighted with the green box. If I want to also move these files onto the GitHub servers, I will click on the red box, marked ‘push’.  Look at your repository online to double check that your files actually made it there. That is pretty much all there is to it. You can also use the ‘git’ box in the top right-hand corner of Rstudio to make commits or use the various keyboard shortcuts. One feature that I think would be useful is for a commit to be made every time you save a file. I haven’t figured out how to do this, so please post a comment if you know how – or if you think that this would actually be a bad idea in practice.

What if you decide that RStudio isn’t for you because you can’t live without Notepad++ or Sublime Text? No worries – GitHub is super easy to use on Mac or Windows (and, of course Linux, but you probably already knew that).  Simply download

GitHub for Windows or GitHub for Mac

Follow the installation directions.  Create a few files and use the GUI to commit and push your files (see screenshot below) – it couldn’t be easier!

Screenshot 2013-11-12 11.50.53

One advantage that I find to using RStudio is that everything is integrated, so it really takes no time at all to commit my R code and push it on to GitHub.  This extra convenience means that I make more frequent commits.  Remember that it is a good idea to commit and push often.  Well that’s about it.  Please feel free to contribute and pull from the Molecular Ecologist’s repositories – this resource will only get better as more people use it. Also, please add any tricks or tips to the comments below!