Recently in python Category

When you build a website, it is often filled with objects that use serial column types, usually auto-incrementing integers. Often you want to obscure these numbers since they may convey some business value: number of sales, user reviews, and so on. A better idea is to use an actual natural key, which exposes the domain name of the object instead of some numeric identifier. It's not always possible to produce a natural key for every object, and when you can't, consider obscuring the serial id.

This doesn't secure numbers that convey business value; it only conceals them from the casual observer. Here is an approach that uses the bit-mixing properties of exclusive or (XOR), some compression by conversion into "base36" (via reddit), and some bit shuffling that moves the least significant bit, which minimizes the serial appearance. You should be able to adapt this code to other bit sizes and shuffling patterns with small changes. Just note that I am using signed integers, and it is important to keep the high bit 0 to avoid negative numbers, which cannot be converted via the "base36" algorithm.

Twiddling bits in Python isn't fun, so I used the excellent bitstring module:

    from bitstring import Bits, BitArray

    #set the mask to whatever you want, just keep the high bit 0 (or use bitstring's uint)
    XOR_MASK = Bits(int=0x71234567, length=32)

    # base36 the reddit way 
    # https://github.com/reddit/reddit/blob/master/r2/r2/lib/utils/_utils.pyx
    # happens to be easy to convert back to an int using int('foo', 36)
    # int with base conversion is case insensitive
    def to_base(q, alphabet):
        if q < 0:
            raise ValueError("must supply a positive integer")
        l = len(alphabet)
        converted = []
        while q != 0:
            q, r = divmod(q, l)
            converted.insert(0, alphabet[r])
        return "".join(converted) or '0'

    def to36(q):
        return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')

    def shuffle(ba, start=(1,16,8), end=(16,32,16), reverse=False):
        """
        flip some bits around
        '0x10101010' -> '0x04200808'
        """
        b = BitArray(ba)
        if reverse:
            map(b.reverse, reversed(start), reversed(end))
        else:
            map(b.reverse, start, end)
        return b  

    def encode(num):
        """
        Encodes numbers to strings

        >>> encode(1)
        've4d47'

        >>> encode(2)
        've3b6v'
        """
        return to36((shuffle(BitArray(int=num,length=32)) ^ XOR_MASK).int)

    def decode(q):
        """
        decodes strings to numbers (case insensitive)

        >>> decode('ve3b6v')
        2

        >>> decode('Ve3b6V')
        2

        """    
        return (shuffle(BitArray(int=int(q,36),length=32) ^ XOR_MASK, reverse=True) ).int
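If you just want the round-trip idea without the bitstring dependency, here is a minimal sketch (Python 3, and my own simplification: XOR plus base36, with the shuffle step left out):

```python
# A dependency-free sketch of the XOR + base36 round trip, skipping
# the bit-shuffle step from the listing above. The mask and alphabet
# match the article; everything else is simplified.
ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'
XOR_MASK = 0x71234567  # keep the high bit 0

def to36(q):
    if q < 0:
        raise ValueError("must supply a positive integer")
    digits = []
    while q:
        q, r = divmod(q, 36)
        digits.insert(0, ALPHABET[r])
    return ''.join(digits) or '0'

def encode(num):
    # mask the serial id, then compress to base36
    return to36(num ^ XOR_MASK)

def decode(s):
    # int() with an explicit base is case insensitive
    return int(s, 36) ^ XOR_MASK
```

Without the shuffle, consecutive ids still differ only in the low bits of the masked value, which is exactly why the full version moves the least significant bit around.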

Removing A Django Application Completely with South

Let's pretend that the application you want to remove contains the following model in myapp/models.py:

    class SomeModel(models.Model):
        data = models.TextField()

Create the initial migration and apply it to the database:

    ./manage.py schemamigration --initial myapp
    ./manage.py migrate

To remove the models, edit myapp/models.py and remove all the model definitions.

Create the deleting migration:

    ./manage.py schemamigration myapp

Edit myapp/migrations/0002_auto__del_somemodel.py to remove the related content types:

    from django.contrib.contenttypes.models import ContentType
    ...
    def forwards(self, orm):

        # Deleting model 'SomeModel'
        db.delete_table('myapp_somemodel')
        for content_type in ContentType.objects.filter(app_label='myapp'):
            content_type.delete()

Migrate the app to remove the table, then fake a zero migration to clean out the South tables:

    ./manage.py migrate
    ./manage.py migrate myapp zero --fake

Remove the app from your settings.py, and it should now be fully gone.

Django has a built-in sitemap generation framework that uses views to build a sitemap on the fly. Sometimes your dataset is too large for this to work in a web application. Here is a management command that will generate a static sitemap and index for your models. You can extend it to handle multiple models.

import os.path 
from django.core.management.base import BaseCommand, CommandError
from django.contrib.sitemaps import GenericSitemap
from django.contrib.sites.models import Site
from django.template import loader 
from django.utils.encoding import smart_str

from myproject.models import MyModel


class Command(BaseCommand):
    help = """Generates the sitemaps for the site, pass in a output directory
    """

    def handle(self, *args, **options):
        if len(args) != 1:
            raise CommandError('You need to specify an output directory')
        directory = args[0]
        if not os.path.isdir(directory):
            raise CommandError('directory %s does not exist' % directory)
        #modify to meet your needs
        sitemap = GenericSitemap({'queryset': MyModel.objects.order_by('id'), 'date_field':'modified' })
        current_site = Site.objects.get_current()

        index_files = []
        paginator = sitemap.paginator
        for page_num in range(1, paginator.num_pages+1):
            filename = 'sitemap_%s.xml' % page_num
            file_path = os.path.join(directory,filename)
            index_files.append("http://%s/%s" % (current_site.domain, filename))
            print "Generating sitemap %s" % file_path
            with open(file_path, 'w') as site_mapfile:
                site_mapfile.write(smart_str(loader.render_to_string('sitemap.xml', {'urlset': sitemap.get_urls(page_num)})))
        sitemap_index = os.path.join(directory,'sitemap_index.xml')
        with open(sitemap_index, 'w') as site_index:
            print "Generating sitemap_index.xml %s" % sitemap_index
            site_index.write(loader.render_to_string('sitemap_index.xml', {'sitemaps': index_files}))
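Stripping away the Django-specific parts, the core of the command is just paginating a big URL list into numbered files plus an index. A minimal sketch in plain Python (the 50,000-per-file cap comes from the sitemap protocol; `write_sitemaps` and the plain-text file bodies are my stand-ins for the rendered templates):

```python
import os
import tempfile

SITEMAP_LIMIT = 50000  # max URLs per file allowed by the sitemap protocol

def write_sitemaps(urls, directory, domain, per_page=SITEMAP_LIMIT):
    """Chunk urls into numbered sitemap files; return index entries."""
    index_entries = []
    pages = [urls[i:i + per_page] for i in range(0, len(urls), per_page)]
    for page_num, page in enumerate(pages, start=1):
        filename = 'sitemap_%s.xml' % page_num
        with open(os.path.join(directory, filename), 'w') as f:
            f.write('\n'.join(page))  # stand-in for the rendered template
        index_entries.append('http://%s/%s' % (domain, filename))
    return index_entries

directory = tempfile.mkdtemp()
# with per_page=2, three URLs land in two files plus two index entries
entries = write_sitemaps(['/a', '/b', '/c'], directory, 'example.com',
                         per_page=2)
```

The management command above does the same thing, except the paginator comes from GenericSitemap and each page is rendered through the sitemap.xml template.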

A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past, but that seems like overkill here. Using an n-gram index is a nice and easy solution:

  1. easy_install ngram

  2. Build a file with all the city and state names, one per line, and place it in citystate.data:

         Redwood City, CA
         Redwood, VA
         ...

  3. Experiment (the .2 threshold is a little lax):

import string
import ngram
cityStateParser = ngram.NGram(
  items = (line.strip() for line in open('citystate.data')) ,
  N=3, iconv=string.lower, qconv=string.lower,  threshold=.2
)

Example:

cityStateParser.search('redwood')
[('Redwood VA', 0.5),
('Redwood NY', 0.5),
('Redwood MS', 0.5),
('Redwood City CA', 0.36842105263157893),
...
]

Notes: Because these are n-grams you might get overmatching when the state is part of an n-gram in the city, e.g. a search for "washington" would yield "Washington IN" with a better score than "Washington OK".
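The ngram package isn't required to see the scoring intuition. Here is a trigram-overlap sketch (Python 3; plain Jaccard similarity, a simplification of the package's actual measure):

```python
def trigrams(s):
    """Lower-cased character trigrams of a string, as a set."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(query, item):
    """Jaccard overlap of trigram sets: shared grams over all grams."""
    q, t = trigrams(query), trigrams(item)
    return len(q & t) / len(q | t)

# 'redwood' shares all five of its trigrams with both candidates, but
# the longer 'redwood city ca' contributes more grams to the union,
# so it scores lower -- the same ranking the ngram search produced.
print(similarity('redwood', 'redwood va'))       # 0.625
print(similarity('redwood', 'redwood city ca'))
```

The overmatching note above falls out of the same arithmetic: any grams the state abbreviation shares with the city name shrink the union and inflate the score.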

You might also want to read Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching (PDF download).

If this works for you, consider giving me a vote on StackOverflow.com

Cascading and Coroutines

Cascading looks quite interesting. Here is a Python program that does something similar to the example in Cascading's Technical Overview.

    #!/usr/bin/env python
    # encoding: utf-8
    import sys

    def coroutine(func):
        """
        A decorator that starts a coroutine automatically on call.
        see: http://www.dabeaz.com/coroutines/
        """
        def start(*args, **kwargs):
            cr = func(*args, **kwargs)
            cr.next()
            return cr
        return start

    def input(theFile, pipe):
        """
        pushes a file a line at a time to a coroutine pipe
        """
        for line in theFile:
            pipe.send(line)
        pipe.close()

    @coroutine
    def extract(expression, pipe, group = 0):
        """
        extract the group from a regex
        """
        import re
        r = re.compile(expression)
        while True:
            line = (yield)
            match = r.search(line)
            if match:
                pipe.send(match.group(0))

    @coroutine
    def sort(pipe):
        """
        sort the input on a pipe
        """
        import heapq
        heap = []
        try:
            while True:
                line = (yield)
                heapq.heappush(heap, line)
        except GeneratorExit:
            while heap:
                pipe.send(heapq.heappop(heap))

    @coroutine
    def group(groupPipe, pipe):
        """
        sends consecutive matching lines from pipe to groupPipe
        """
        cur = None
        g = None
        while True:
            line = (yield)
            if cur is None:
                g = groupPipe(pipe)
            elif cur != line:
                g.close()
                g = groupPipe(pipe)

            g.send(line)
            cur = line

    @coroutine
    def uniq(pipe):
        """
        implements uniq -c
        """
        lines = 0
        try:
            while True:
                line = (yield)
                lines += 1
        except GeneratorExit:
            pipe.send('%s\t%s' % (lines, line))

    @coroutine
    def output(theFile):
        while True:
            line = (yield)
            theFile.write(line + '\n')

    def main():
        input(sys.stdin,
            extract( r'^([^ ]+)',
                sort(
                    group( uniq,
                        output(sys.stdout)
                    )
                )
            )
        )

    if __name__ == '__main__':
        main()

You can achieve the same result with the unix command line:

    cat access.log | cut -d ' ' -f 1 | sort | uniq -c
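To see the pipeline shape without a log file, here is a trimmed, self-contained variant (Python 3; `collect` is my stand-in for the output stage, and the log lines are made up):

```python
import re

def coroutine(func):
    """Start a coroutine automatically on call (dabeaz's decorator)."""
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def extract(expression, pipe, group=0):
    """Send the matched group of each incoming line downstream."""
    r = re.compile(expression)
    while True:
        line = (yield)
        match = r.search(line)
        if match:
            pipe.send(match.group(group))

@coroutine
def collect(results):
    """Terminal stage: append everything received to a list."""
    while True:
        results.append((yield))

results = []
pipe = extract(r'^([^ ]+)', collect(results))
for line in ['1.2.3.4 GET /a', '1.2.3.4 GET /b', '5.6.7.8 GET /a']:
    pipe.send(line)
# results is now ['1.2.3.4', '1.2.3.4', '5.6.7.8'] -- the cut -f 1 step
```

Each stage is primed by the decorator and pushes values downstream with send(), which is the whole trick behind the cascade above.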

I was reading http://www.dabeaz.com/coroutines/ and thought coroutines were a natural fit for a twitter client. Here is a pretty simple version that just prints the public timeline every 60 seconds. Next up: removing the time.sleep and scheduling the followStatus function as a task so I can follow more than one stream at a time.

    #!/usr/bin/env python
    # encoding: utf-8
    import time
    import twitter

    def coroutine(func):
        """
        A decorator function that takes care of starting a coroutine
        automatically on call.

        see: http://www.dabeaz.com/coroutines/
        """
        def start(*args,**kwargs):
            cr = func(*args,**kwargs)
            cr.next()
            return cr
        return start

    @coroutine
    def statusPrinter():
        """
        Just prints twitter status messages to the screen
        """
        while True:
            status = (yield)
            print status.id, status.user.name, status.text

    def followStatus(twitterGetter, target, timeout = 60):
        """
        Follows a twitter status message that takes a since_id
        """
        since_id = None
        while True:
            statuses = twitterGetter(since_id=since_id)
            if statuses:
                # pretty sure these are always in order
                since_id = statuses[0].id
                for status in statuses:
                    target.send(status)
            # twitter caches for 60 seconds anyway
            time.sleep(timeout)

    def main():
        api = twitter.Api()
        followStatus(api.GetPublicTimeline, statusPrinter())

    if __name__ == '__main__':
        main()
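The since_id bookkeeping in followStatus can be exercised without the Twitter API by swapping in a fake fetcher (Python 3; `follow`, `fake_fetch`, and the dict statuses are all my stand-ins, and the sleep is dropped):

```python
def coroutine(func):
    """Start a coroutine automatically on call."""
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def collector(seen):
    """Terminal stage: remember every status it receives."""
    while True:
        seen.append((yield))

def follow(fetch, target, polls):
    """Poll fetch(since_id=...) a fixed number of times, remembering
    the newest id so each poll only yields unseen statuses."""
    since_id = None
    for _ in range(polls):
        batch = fetch(since_id=since_id)
        if batch:
            since_id = batch[0]['id']  # newest first; keep its id
            for status in batch:
                target.send(status)

# Fake timeline standing in for api.GetPublicTimeline: newest first,
# and only statuses newer than since_id are returned.
TIMELINE = [{'id': 3}, {'id': 2}, {'id': 1}]

def fake_fetch(since_id=None):
    if since_id is None:
        return TIMELINE
    return [s for s in TIMELINE if s['id'] > since_id]

seen = []
follow(fake_fetch, collector(seen), polls=2)
# the second poll finds nothing newer than id 3, so each status
# arrives exactly once
```

The same shape works for any polling API that accepts a "give me everything after X" cursor.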
 

About this Archive

This page is an archive of recent entries in the python category.
