Game Library Service / new module library demo

cholmcc · June 11, 2024, 12:28pm

Very simply because each of these steps is potentially heavy, so if a later step fails or doesn’t get it quite right, you do not need to redo the earlier steps. It also allows you to do a bit of hand-holding if you find that a particular MediaWiki page is a bit troublesome.

The conversion is probably meant to be done, ideally, once, so it doesn’t have to be a thing that can be automated fully. Of course, during development it will be done several times, and there I believe you can save some headaches by staging the process a bit.
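
A minimal sketch of that staging idea, assuming each stage's result can be stored as JSON. The names `run_stage` and `dump_page` and the one-file-per-stage layout are hypothetical, chosen only for illustration; they are not part of the converter.

```python
import json
import tempfile
from pathlib import Path

def run_stage(name, func, source, workdir):
    """Run one stage, or reuse its saved result if one already exists.

    The cache is keyed on the stage name only, so this sketch assumes
    one working directory per wiki page.
    """
    out = Path(workdir) / f'{name}.json'
    if out.exists():
        # An earlier run (or a hand-edited file) is reused as-is.
        return json.loads(out.read_text(encoding='utf-8'))
    result = func(source)
    out.write_text(json.dumps(result, indent=2), encoding='utf-8')
    return result

calls = []

def dump_page(title):
    # Stand-in for an expensive step such as fetching the wikitext.
    calls.append(title)
    return {'title': title, 'wikitext': '{{GameInfo}}'}

with tempfile.TemporaryDirectory() as workdir:
    first = run_stage('dump', dump_page, 'Afrika Korps', workdir)
    second = run_stage('dump', dump_page, 'Afrika Korps', workdir)
```

The second call reads the saved file instead of calling `dump_page` again, and the intermediate file can be corrected by hand before the later stages run.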

Then do

foo = 'aBC' 
Foo = foo.lower().capitalize()
assert Foo == 'Abc'

or

foo = 'aBC'
FoO = foo[0].upper()+foo[1:]
assert FoO == 'ABC'

BTW, it seems the presumption that module file names have the form Name-version does not always hold. Perhaps it’s better to just rely on the {{ModuleVersion2}} template content.

Yours,
Christian

uckelman · June 11, 2024, 12:45pm

I did.

I don’t see the advantage of replacing working code in a one-off unless that leads to finishing sooner. There is no long-term maintenance advantage to having nicer code for the converter because there is no long-term.

Would you point out what problem the code you suggested would fix?

Korval · June 11, 2024, 1:31pm

User selectable - 10, 20, 50, all.

Korval · June 11, 2024, 1:36pm

Would like more options in “sort by” (e.g., Era).

Also would like more control over search (maybe a basic and advanced search). For instance, I would like to search on “Developer = Korval”; “WW2 AND Area Movement”; “Block Game and Columbia”; etc.

Both are linked to what data fields/tags will be associated with the modules…

cholmcc · June 11, 2024, 1:37pm

Well, your code misses some modules, as I pointed out earlier, while the proposed code misses fewer, and it is much smaller, which makes it less error-prone.

Below is a slightly updated version that also deals with the ol’-school {{ModuleFilesTable}} template etc.

#!/usr/bin/env python
from mwparserfromhell import parse
from json import dumps

def do_gameinfo(code):
    # Compare with matches(): the parsed name may carry trailing whitespace.
    gi = code.filter_templates(matches=lambda n : n.name.matches('GameInfo'))
    if not gi:
        raise RuntimeError('No GameInfo')

    gi = gi[0]

    ret = {k: str(gi.get(k)).replace(f'{k}=','') for k in
           ['image',
            'publisher',
            'year',
            'era',
            'topic',
            'series',
            'scale',
            'players',
            'length']
           if gi.has(k)}

    code.replace(gi,'')

    return ret

def do_emails(text):
    main = parse(text)

    eml = main.filter_templates()
    return [{'name': str(e.params[1]) if len(e.params) > 1 else '',
             'address': str(e.params[0])}
            for e in eml
            if str(e.params[0]) != 'someguy@example.com'
            ]
    
def do_modules(code):
    names = ['ModuleFilesTable2', # 0 
             'ModuleVersion2',    # 1
             'ModuleFile2',       # 2
             'ModuleFilesTable',  # 3
             'ModuleVersion',     # 4
             'ModuleFile'         # 5
             ]
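    # Template names may carry trailing whitespace or newlines, so compare
    # with Wikicode.matches() rather than plain string equality.
    tmpl = code.filter_templates(matches=lambda n : n.name.matches(names),
                                 recursive=False)

    tab = None   # maps version -> list of file records
    cur = None   # file list of the version currently being filled
    for tm in tmpl:
        if tm.name.matches(names[0::3]):   # table templates
            tab = {}
            continue

        if tab is None:
            raise RuntimeError(f'{str(tm.name).strip()} seen before '
                               f'{",".join(names[0::3])}')

        if tm.name.matches(names[1::3]):   # version templates
            cur  = []
            key  = str(tm.get('version').value).strip()
            tab[key] = cur
            continue

        if cur is None:
            raise RuntimeError('No current version')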

        db = {k: str(tm.get(k)).replace(f'{k}=','').replace('\u200e','')
              for k in
              ['filename',
               'description',
               'date',
               'size',
               'compatibility']
              if tm.has(k)}
        
        db['maintainers'] = do_emails(str(tm.get('maintainer'))
                                      if tm.has('maintainer') else '')
        db['contributors'] = do_emails(str(tm.get('contributors')
                                           if tm.has('contributors') else ''))

        cur.append(db)

    for tm in tmpl:
        code.replace(tm, '')

    tmpl = code.filter_templates(matches=lambda n :
                                 n.name == 'ModuleContactInfo',
                                 recursive=False)
    
    for tm in tmpl:
        main = do_emails(str(tm.get('maintainer'))
                         if tm.has('maintainer') else '')
        cont = do_emails(str(tm.get('contributors')
                             if tm.has('contributors') else ''))

        # Add the page-level contacts to every file of every version.
        for ver,cur in (tab or {}).items():
            for db in cur:
                if main:
                    db.setdefault('maintainers', []).extend(main)
                if cont:
                    db.setdefault('contributors', []).extend(cont)
                
    for tm in tmpl:
        code.replace(tm, '')
        
    return tab

def do_gallery(code):
    tags = code.filter_tags(matches = lambda n: n.tag == 'gallery')

    if not tags:
        return []

    def extract(e):
        fields = e.split('|')
        img    = fields[0].replace('Image:','')
        alt    = '' if len(fields) < 2 else fields[1]

        return {'img': img, 'alt': alt}
            
    ret = [
        extract(e)
        for tag in tags
        for e in tag.contents.split('\n')
        if e != ''
    ]
    for tag in tags:
        code.replace(tag, '')

    return ret

def do_players(code):
    tags = code.filter_tags(matches = lambda n: n.tag == 'div')

    if not tags:
        return []

    ret = [
        do_emails(tag.contents) for tag in tags
        if tag.contents != ''
    ]

    for tag in tags:
        code.replace(tag, '')

    return ret

def do_readme(code):
    from tempfile import mkstemp
    from subprocess import Popen, PIPE
    from os import unlink
    

    tmp, tmpnam = mkstemp(text=True)
    with open(tmp,'w') as tmpfile:
        tmpfile.write(str(code))

    cmd = ['pandoc',
           '--from', 'mediawiki',
           '--to', 'markdown-simple_tables',
           tmpnam]
    out,err = Popen(cmd, stdout=PIPE,stderr=PIPE).communicate()

    unlink(tmpnam)

    return out.decode().replace(r'\|}','').replace(r'\_\_NOTOC\_\_','')

def convert(inp,md,js):
    
    text = inp.read()

    code     = parse(text)
    gameinfo = do_gameinfo(code)
    modules  = do_modules(code)
    gallery  = do_gallery(code)
    players  = do_players(code)
    readme   = do_readme(code)
    
    game     = {'info': gameinfo,
                'modules': modules,
                'gallery': gallery,
                'players': players }

    js.write(dumps(game,indent=2))
    md.write(readme)

    

if __name__ == '__main__':
    from argparse import ArgumentParser, FileType

    ap = ArgumentParser(description='Convert')
    ap.add_argument('input',type=FileType('r'),
                    help='Input MediaWiki')
    ap.add_argument('readme',type=FileType('w'),
                    help='Output Markdown')
    ap.add_argument('json',type=FileType('w'),
                    help='Output JSON')

    args = ap.parse_args()

    convert(args.input,args.readme,args.json)

Use it any way you like. I just think you can get the job done sooner if you use some of this.

Yours,
Christian

cholmcc · June 11, 2024, 1:40pm

Well, then your earlier comment “Converting entire pages to markdown isn’t desirable” was a (deliberate?) straw-man, since you would know that that was not what the code does.

Yours,
Christian

uckelman · June 11, 2024, 1:50pm

I was replying specifically to your suggestion “to use existing tools such as pandoc, possibly with some pre-parsing in Python”, not commenting on your code there at all. I understood what I quote here not to bear on the code, and was explaining why using pandoc to convert whole pages won’t work. If it’s not what you intended, then I misunderstood.

uckelman · June 11, 2024, 2:08pm

It’s hard to know if this is the case when you don’t have the input to run against.

The wiki data is here. You can create the database that dump.py needs prior to running using this script.

Try it and see. I look forward to seeing the result.

uckelman · June 11, 2024, 4:35pm

Having tried this, I’ve noticed that it doesn’t remove the headers for the Files, Module Information, or Players sections from the readme output.

uckelman · June 11, 2024, 9:56pm

cholmcc:

def do_players(code):
    tags = code.filter_tags(matches = lambda n: n.tag == 'div')

    if not tags:
        return []

    ret = [
        do_emails(tag.contents) for tag in tags
        if tag.contents != ''
    ]

    for tag in tags:
        code.replace(tag, '')

    return ret

How does this ensure that only usernames and email addresses from the Players section are returned?

cholmcc · June 13, 2024, 6:21am

That’s true, it doesn’t. The reason is that the format of these headers is not necessarily easy to identify. Of course, one could take the generated Markdown and run some replacements on that, e.g.,

from re import sub
md = sub(r'#+ (Players|File|Screenshot).*','',md)
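
A slightly more defensive variant of that replacement, as a sketch only. It assumes pandoc emits ATX-style (`#`) headings, and the heading names in the pattern are guesses based on the sections mentioned in this thread; `strip_headings` is a hypothetical helper name.

```python
import re

# Anchor at the start of a line so that '#' inside running text is left
# alone, and swallow the trailing newline of each removed heading.
HEADINGS = re.compile(r'^#+ (Players|Files?|Screen ?shots?)\b.*\n?',
                      re.MULTILINE | re.IGNORECASE)

def strip_headings(md):
    """Remove the matching heading lines; the section bodies are kept."""
    return HEADINGS.sub('', md)

sample = "# Files\n\nSome text\n\n## Players\n\nMore # Players text\n"
cleaned = strip_headings(sample)
```

Only the heading lines are dropped here; the text below each heading stays in the Markdown.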

The code selects any and all {{email...}} templates from the <div>...</div> element and returns only the content of the templates. The <div> element itself is deleted from the input.

Running some simple tests on MediaWiki input, I see that I get all the information out, but looking at the site above, I see that some “projects” are missing some information. This is partly because some of the MediaWiki pages are poorly formatted and your code doesn’t catch that (my code doesn’t always catch it either, especially when templates are wrongly formatted), and partly because of extra invisible characters that prevent names from matching other entries, and so on. In those cases, it may be very useful to be able to correct the MediaWiki pages by hand.

I would simply suggest using an already implemented parser for the MediaWiki pages rather than trying to roll your own.

Yours,
Christian

uckelman · June 13, 2024, 10:03am

I found that one could remove headings either by filtering by section and then removing the entire section, or by filtering by headings and then removing just those when the rest of the section is still wanted.

E.g.,

def remove_headings(page):
    to_remove = [
        'Comments',
        'Module Information',
        'Files',
        'Screen Shots',
        'Screenshots'
    ]

    headings = page.filter_headings(matches='|'.join(to_remove))
    for h in headings:
        page.remove(h)

Yes, but it appears to be doing this for all <div> elements, not just the one in the Players section. It should get the Players section and filter_tags on that only.

A modified version is here.

It would be helpful to know which ones.

uckelman · June 14, 2024, 8:34pm

I’ve uploaded a new version of the database. Some things are different now. I’m not at all certain they’re better. (E.g., the packages are much worse for some projects, in that versions which should be grouped together in one package aren’t now.)

RobS · June 14, 2024, 9:02pm

FWIW, I’ve always thought that some listings are clusterf*cks mainly because of the people uploading various module versions. Specifically the tendency to go down a chain of neverending decimal points, V1.3.6.6b, instead of just naming them V1, V2, V3, etc. Everyone thinks they’re a software developer.

The only minor issue I’ve ever had with the module system is the use of “The” as an alphabet identifier.

uckelman · June 14, 2024, 9:08pm

Game titles have a sort key in the GLS, so that won’t be a problem in the future. Everything starting with an article has automatically been given a sort key starting with the second word. If you browse to T, you’ll find that the only project sorted under “The” is “Their Finest Hour”, which is exactly where that should go.
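
A rough sketch of how such a sort key could be derived. This is not the GLS implementation; the list of articles and the name `sort_key` are assumptions made for illustration.

```python
# Assumption: only these English articles are skipped.
ARTICLES = ('the', 'a', 'an')

def sort_key(title):
    """Drop a leading article, then compare case-insensitively."""
    words = title.split()
    if len(words) > 1 and words[0].lower() in ARTICLES:
        words = words[1:]
    return ' '.join(words).casefold()

titles = ['The Russian Campaign', 'Their Finest Hour',
          'A Victory Lost', 'Afrika Korps']
ordered = sorted(titles, key=sort_key)
```

Because whole words are compared, "Their Finest Hour" keeps its first word and still sorts under T.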

uckelman · June 14, 2024, 11:38pm

I’ve returned the package grouping to how it had been earlier in the week.

What I could use help with is being made aware of specific conversion problems.

uckelman · June 18, 2024, 11:12pm

Somewhat more modules are now included in the converted pages.

uckelman · August 5, 2024, 1:21pm

I’m at the point of going over module pages searching for conversion problems now. The first fifty or so turned up four kinds of problem—which is good news. This means that solving these four problems will likely clear the bulk of all import problems outstanding.

I am hopeful that we can make the new game library during August or September. This should in no way be construed as a promise, however.

Korval · August 7, 2024, 5:41pm

Using a single digit versioning scheme is problematic.
Similarly, a “neverending” versioning scheme (as you describe) is problematic.

The standard 3 decimal construct, when used correctly, provides a LOT of valuable information for released versions that indicates the degree of change.

Major.Minor.Maintenance/Bug

I personally use an additional letter construct (a, b…z, aa, ab…az) for non-released developmental versions.

Proper configuration control is very helpful to both the developer(s) and end-users.

Personally, I’ve had bigger issues with people using a one-digit versioning scheme that mixes huge changes with minor bug fixes.
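
To make the Major.Minor.Maintenance idea concrete, here is a small sketch that reports the degree of change between two version strings. The accepted format (an optional leading "v", three numbers, an optional lowercase letter suffix) and the names `parse_version`, `change_level` and `suffix_key` are assumptions for illustration, not an existing convention of the module library.

```python
import re

_VERSION = re.compile(r'^[vV]?(\d+)\.(\d+)\.(\d+)([a-z]*)$')

def parse_version(text):
    """Split 'Major.Minor.Maintenance[letters]' into its four parts."""
    m = _VERSION.match(text.strip())
    if m is None:
        raise ValueError(f'not a Major.Minor.Maintenance version: {text!r}')
    major, minor, maint, suffix = m.groups()
    return int(major), int(minor), int(maint), suffix

def change_level(old, new):
    """Name the most significant part that differs between two versions."""
    o, n = parse_version(old), parse_version(new)
    for label, a, b in zip(('major', 'minor', 'maintenance'), o, n):
        if a != b:
            return label
    return 'development' if o[3] != n[3] else 'none'

def suffix_key(suffix):
    """Order development suffixes as a, b ... z, aa, ab ..."""
    return (len(suffix), suffix)

level = change_level('1.3.6', '1.4.0')
suffixes = sorted(['aa', 'b', 'z', 'a'], key=suffix_key)
```

A string such as V1.3.6.6b is rejected by this sketch, since it has four numeric parts.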

RobS · August 8, 2024, 12:14am

Okay, I stand corrected then. I didn’t know there was a “standard construct”. It always just seemed like a hodgepodge of numbers and letters to me.