Game Library Service / new module library demo

Quite simply because each of these steps is potentially heavy, so if a later step fails or doesn’t get it quite right, you do not need to redo the earlier steps. It also allows you to do a bit of hand-holding if you find that a particular MediaWiki page is troublesome.

Ideally, the conversion is done only once, so it doesn’t have to be fully automated. Of course, during development it will be run several times, and there I believe you can save some headaches by staging the process a bit.
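To make the staging idea concrete, here is a minimal sketch, not the converter's actual code. The helper name `staged`, the cache directory layout, and the use of JSON for intermediate results are all assumptions for illustration. Each stage writes its result to a file, and a rerun picks up that file instead of recomputing.

```python
import json
from pathlib import Path


def staged(cache_dir, name, compute):
    """Return the cached result of a stage, computing and storing it if absent.

    `name` identifies the stage (hypothetical, e.g. 'raw' or 'extracted');
    `compute` is a zero-argument callable producing JSON-serialisable data.
    """
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        # An earlier run (or a hand edit) already produced this stage
        return json.loads(path.read_text(encoding="utf-8"))

    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


def run_pipeline(cache_dir, fetch, extract):
    """Chain two stages; only the missing ones are recomputed."""
    raw = staged(cache_dir, "raw", fetch)
    return staged(cache_dir, "extracted", lambda: extract(raw))
```

Because the intermediate files are plain JSON, a troublesome page can be fixed by hand in the cached file, and the later stages then run against the corrected data.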

Then do

foo = 'aBC' 
Foo = foo.lower().capitalize()
assert Foo == 'Abc'

or

foo = 'aBC'
FoO = foo[0].upper()+foo[1:]
assert FoO == 'ABC'

BTW, it seems the presumption that module file names have the form Name-version does not always hold. Perhaps it’s better to just rely on the {{ModuleVersion2}} template content.

Yours,
Christian

I did.

I don’t see the advantage of replacing working code in a one-off unless that leads to finishing sooner. There is no long-term maintenance advantage to having nicer code for the converter because there is no long-term.

Would you point out what problem the code you suggested would fix?

User selectable - 10, 20, 50, all.

Would like more options in “sort by” (e.g., Era).

Also would like more control over search (maybe a basic and advanced search). For instance, I would like to search on “Developer = Korval”; “WW2 AND Area Movement”; “Block Game and Columbia”; etc.

Both are linked to what data fields/tags will be associated with the modules…
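As a rough illustration of how such searches depend on the stored fields and tags, here is a small sketch. The `Module` record, its `fields` and `tags` attributes, and the `search` function are hypothetical and say nothing about how the GLS actually stores or queries data.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    # Hypothetical record: free-form fields plus a set of tags
    title: str
    fields: dict = field(default_factory=dict)
    tags: set = field(default_factory=set)


def matches(module, field_filters, all_tags):
    """True if every field filter and every requested tag matches (case-insensitive)."""
    for key, wanted in field_filters.items():
        if module.fields.get(key, "").casefold() != wanted.casefold():
            return False
    have = {t.casefold() for t in module.tags}
    return all(t.casefold() in have for t in all_tags)


def search(modules, field_filters=None, all_tags=(), page_size=None, page=0):
    """Filter modules, then return one page; page_size=None means 'all'."""
    hits = [m for m in modules if matches(m, field_filters or {}, all_tags)]
    if page_size is None:
        return hits
    start = page * page_size
    return hits[start:start + page_size]
```

With this shape, “Developer = Korval” becomes `search(mods, {"developer": "Korval"})` and “WW2 AND Area Movement” becomes `search(mods, all_tags=["WW2", "Area Movement"])`, while a page size of 10, 20, or 50 is passed as `page_size`.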

Well, your code misses some modules, as I pointed out earlier, while the proposed code misses fewer. It is also much smaller, which makes it less error-prone :)

Below is a slightly updated version that also deals with the ol’-school {{ModuleFileTable}} template, etc.

#!/usr/bin/env python
from mwparserfromhell import parse
from json import dumps

def do_gameinfo(code):
    gi = code.filter_templates(matches=lambda n : n.name=='GameInfo')
    if not gi or len(gi) < 1:
        raise RuntimeError('No GameInfo')

    gi = gi[0]

    ret = {k: str(gi.get(k)).replace(f'{k}=','') for k in
           ['image',
            'publisher',
            'year',
            'era',
            'topic',
            'series',
            'scale',
            'players',
            'length']
           if gi.has(k)}

    code.replace(gi,'')

    return ret

def do_emails(text):
    main = parse(text)

    # Skip parameterless templates and the placeholder address
    eml = main.filter_templates()
    return [{'name': str(e.params[1]) if len(e.params) > 1 else '',
             'address': str(e.params[0])}
            for e in eml
            if e.params and str(e.params[0]) != 'someguy@example.com'
            ]
    
def do_modules(code):
    names = ['ModuleFilesTable2', # 0 
             'ModuleVersion2',    # 1
             'ModuleFile2',       # 2
             'ModuleFilesTable',  # 3
             'ModuleVersion',     # 4
             'ModuleFile'         # 5
             ]
    tmpl = code.filter_templates(matches=lambda n : n.name in names,
                                 recursive=False)

    
    tab = None
    cur = None
    for tm in tmpl:
        if tm.name in names[0::3]:
            tab = {}
            continue

        if tab is None:
            raise RuntimeError(f'{tm.name} seen before {",".join(names[0::3])}')


        if tm.name in names[1::3]:
            cur  = []
            key  = str(tm.get('version')).replace('version=','')
            tab[key] = cur
            continue

        if cur is None:
            raise RuntimeError('No current version')

        db = {k: str(tm.get(k)).replace(f'{k}=','').replace('\u200e','')
              for k in
              ['filename',
               'description',
               'date',
               'size',
               'compatibility']
              if tm.has(k)}

        db['maintainers'] = do_emails(str(tm.get('maintainer'))
                                      if tm.has('maintainer') else '')
        db['contributors'] = do_emails(str(tm.get('contributors'))
                                       if tm.has('contributors') else '')

        cur.append(db)

    for tm in tmpl:
        code.replace(tm, '')

    tmpl = code.filter_templates(matches=lambda n :
                                 n.name == 'ModuleContactInfo',
                                 recursive=False)
    
    for tm in tmpl:
        main = do_emails(str(tm.get('maintainer'))
                         if tm.has('maintainer') else '')
        cont = do_emails(str(tm.get('contributors')
                             if tm.has('contributors') else ''))

        # Page-level contacts apply to every file of every version;
        # tab is None when the page has no module table
        for ver,cur in (tab or {}).items():
            for db in cur:
                if main:
                    db.setdefault('maintainers', []).extend(main)
                if cont:
                    db.setdefault('contributors', []).extend(cont)
                
    for tm in tmpl:
        code.replace(tm, '')
        
    return tab

def do_gallery(code):
    tags = code.filter_tags(matches = lambda n: n.tag == 'gallery')

    if not tags:
        return []

    def extract(e):
        fields = e.split('|')
        img    = fields[0].replace('Image:','')
        alt    = '' if len(fields) < 2 else fields[1]

        return {'img': img, 'alt': alt}
            
    ret = [
        extract(e)
        for tag in tags
        for e in tag.contents.split('\n')
        if e != ''
    ]
    for tag in tags:
        code.replace(tag, '')

    return ret

def do_players(code):
    tags = code.filter_tags(matches = lambda n: n.tag == 'div')

    if not tags:
        return []

    ret = [
        do_emails(tag.contents) for tag in tags
        if tag.contents != ''
    ]

    for tag in tags:
        code.replace(tag, '')

    return ret

def do_readme(code):
    from tempfile import mkstemp
    from subprocess import Popen, PIPE
    from os import unlink
    

    tmp, tmpnam = mkstemp(text=True)
    with open(tmp,'w') as tmpfile:
        tmpfile.write(str(code))

    cmd = ['pandoc',
           '--from', 'mediawiki',
           '--to', 'markdown-simple_tables',
           tmpnam]
    out,err = Popen(cmd, stdout=PIPE,stderr=PIPE).communicate()

    unlink(tmpnam)

    return out.decode().replace(r'\|}','').replace(r'\_\_NOTOC\_\_','')

def convert(inp,md,js):
    
    text = inp.read()

    code     = parse(text)
    gameinfo = do_gameinfo(code)
    modules  = do_modules(code)
    gallery  = do_gallery(code)
    players  = do_players(code)
    readme   = do_readme(code)
    
    game     = {'info': gameinfo,
                'modules': modules,
                'gallery': gallery,
                'players': players }

    js.write(dumps(game,indent=2))
    md.write(readme)

    

if __name__ == '__main__':
    from argparse import ArgumentParser, FileType

    ap = ArgumentParser(description='Convert')
    ap.add_argument('input',type=FileType('r'),
                    help='Input MediaWiki')
    ap.add_argument('readme',type=FileType('w'),
                    help='Output Markdown')
    ap.add_argument('json',type=FileType('w'),
                    help='Output JSON')

    args = ap.parse_args()

    convert(args.input,args.readme,args.json)

Use it any way you like. I just think you can get the job done sooner if you use some of this.

Yours,
Christian

Well, then your earlier comment “Converting entire pages to markdown isn’t desirable” was a (deliberate?) straw-man, since you would know that that was not what the code does.

Yours,
Christian

I was replying specifically to your suggestion “to use existing tools such as pandoc, possibly with some pre-parsing in Python”, not commenting on your code there at all. I understood what I quote here not to bear on the code, and was explaining why using pandoc to convert whole pages won’t work. If it’s not what you intended, then I misunderstood.

It’s hard to know if this is the case when you don’t have the input to run against.

The wiki data is here. You can create the database that dump.py needs prior to running using this script.

Try it and see. I look forward to seeing the result.

Having tried this, I’ve noticed that it doesn’t remove the headers for the Files, Module Information, or Players sections from the readme output.

How does this ensure that only usernames and email addresses from the Players section are returned?

That’s true, it doesn’t. The reason is that the format of these is not necessarily easy to identify. Of course, one could take the generated Markdown and run some replacements on that, e.g.,

from re import sub
md = sub(r'#+ (Players|File|Screenshot).*','',md)

The code selects any and all {{email...}} templates from the <div>...</div> element and returns only the content of the templates. The <div> element itself is deleted from the input.

Running some simple tests on MediaWiki input, I see that I get all the information out, but looking at the site above, some “projects” are missing some information. That is partly because some of the MediaWiki pages are poorly formatted and your code doesn’t catch that (my code doesn’t always catch it either, especially when templates are wrongly formatted), and partly because extra invisible characters prevent names from matching other entries, and so on. In those cases, it may be very useful to be able to correct the MediaWiki pages by hand.
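For the invisible-character problem specifically, a small normalisation step before comparing names might help. This is only a sketch: the function name `normalize_name` is made up, and it assumes that the troublemakers are Unicode format characters (category `Cf`, which includes the left-to-right mark U+200E) and odd whitespace.

```python
import unicodedata


def normalize_name(name):
    """Drop invisible format characters and collapse runs of whitespace."""
    # Category 'Cf' covers marks such as U+200E and zero-width spaces
    visible = "".join(ch for ch in name if unicodedata.category(ch) != "Cf")
    # split() with no argument also treats non-breaking spaces as whitespace
    return " ".join(visible.split())
```

Comparing `normalize_name(a) == normalize_name(b)` rather than the raw strings should then let names match when they differ only in such characters.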

I would simply suggest using an already implemented parser for the MediaWiki pages rather than trying to roll your own.

Yours,
Christian

I found that one could remove headings either by filtering by section and then removing the entire section, or by filtering by headings and then removing just those when the rest of the section is still wanted.

E.g.,

def remove_headings(page):
    to_remove = [
        'Comments',
        'Module Information',
        'Files',
        'Screen Shots',
        'Screenshots'
    ]

    headings = page.filter_headings(matches='|'.join(to_remove))
    for h in headings:
        page.remove(h)

Yes, but it appears to be doing this for all <div> elements, not just the one in the Players section. It should get the Players section and filter_tags on that only.

A modified version is here.

It would be helpful to know which ones.

I’ve uploaded a new version of the database. Some things are different now. I’m not at all certain they’re better. (E.g., the packages are much worse for some projects, in that versions which should be grouped together in one package aren’t now.)

FWIW, I’ve always thought that some listings are clusterf*cks mainly because of the people uploading various module versions. Specifically the tendency to go down a chain of neverending decimal points, V1.3.6.6b, instead of just naming them V1, V2, V3, etc. Everyone thinks they’re a software developer.

The only minor issue I’ve ever had with the module system is the use of “The” as an alphabet identifier.

Game titles have a sort key in the GLS, so that won’t be a problem in the future. Everything starting with an article has automatically been given a sort key starting with the second word. If you browse to T, you’ll find that the only project sorted under “The” is “Their Finest Hour”, which is exactly where that should go.
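For illustration only, a sort key of that kind could be derived roughly as below. The article list and the function name `sort_key` are assumptions, not the GLS implementation.

```python
# Hypothetical list of leading articles to skip when sorting
ARTICLES = ("the", "a", "an")


def sort_key(title):
    """Return a lowercase key that skips a leading article, if any."""
    cleaned = title.strip()
    first, _, rest = cleaned.partition(" ")
    # Only whole-word articles are dropped, so 'Their' is left alone
    if rest and first.casefold() in ARTICLES:
        return rest.strip().casefold()
    return cleaned.casefold()
```

Since only a whole first word is compared against the article list, “Their Finest Hour” keeps its full title as the key and still sorts under T.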


I’ve returned the package grouping to how it had been earlier in the week.

What I could use help with is being made aware of specific conversion problems.

Somewhat more modules are now included in the converted pages.