JavaScript objects in Python

JavaScript objects

In JavaScript all objects are part of an inheritance tree. The create function adds a node to the inheritance tree.

    // A JavaScript object.
    js> root = {}

    // Prototype inheritance.
    js> create = function (obj) {
            var f = function () {return this;};
            f.prototype = obj;
            return new f;
        }

    js> a = create(root)
    js> b = create(a)

    js> a.name = 5
    js> a.name
    5
    js> b.name
    5

In Python, classes inherit in the same way.

    >>> root = type               # Most classes are instances of type.
    >>> class a(root): pass
    >>> class b(a): pass          # Class inheritance.
    >>> a.name = 5                # Just like JavaScript.
    >>> a.name
    5
    >>> b.name
    5

class explanation

In Python we can subclass anything whose type is type (or a subclass of type). A subclass (and its instances) inherits properties from the super-class.

    >>> type(root) == type(a) == type(b) == type
    True
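The inheritance lookup described above can also be seen with ordinary classes. The following is a minimal Python 3 sketch; the names Root, A and B are illustrative and not part of the original tutorial. It shows that an attribute set on a parent class is visible through the child, and that assigning on the child shadows the parent's value without changing it.

```python
# Hypothetical class names, chosen only for this illustration.
class Root:
    pass


class A(Root):
    pass


class B(A):
    pass


A.name = 5
inherited = B.name            # looked up on B, then found on A

B.name = 7                    # shadows A.name; A itself is unchanged
shadowed = (A.name, B.name)

del B.name                    # removing the shadow exposes A.name again
restored = B.name

# The lookup order Python follows for attributes of B.
mro_names = [cls.__name__ for cls in B.__mro__]

print(inherited, shadowed, restored, mro_names)
```

JavaScript's prototype chain shadows properties in the same way: assigning b.name creates an own property on b and leaves a.name untouched.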

Custom item methods

In JavaScript attribute and item access are the same.

    js> a = create(root)

    js> a.name = 5
    js> a['name']
    5

    js> a['key'] = 6
    js> a.key
    6

    js> a[1] = 6
    js> a['1']
    6

In Python we can define our own item methods. (The programmer owns the dot.)

    >>> class A(object):
    ...
    ...     def __getitem__(self, key):
    ...         return getattr(self, str(key))
    ...     def __setitem__(self, key, value):
    ...         return setattr(self, str(key), value)

    >>> a = A()
    >>> a.name = 5

    >>> a['name']
    5

    >>> a['key'] = 6
    >>> a.key
    6

    >>> a[1] = 6
    >>> a['1']
    6

Because type(a) is A, which has the special item methods, we get the special item behaviour.
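To illustrate why the item methods have to live on the type, here is a small Python 3 sketch. The class names Plain and WithItems are made up for this example. It attaches a __getitem__ function to an instance and shows that the square-bracket syntax ignores it, while the same method defined on the class works.

```python
class Plain:
    pass


p = Plain()
# Attaching the hook to the instance does not affect p[...]:
# implicit special-method lookup goes through type(p), not p itself.
p.__getitem__ = lambda key: 'from instance'
try:
    p['x']
    instance_hook_used = True
except TypeError:
    instance_hook_used = False


class WithItems:
    # Item access is forwarded to attribute access, as in the tutorial.
    def __getitem__(self, key):
        return getattr(self, str(key))

    def __setitem__(self, key, value):
        setattr(self, str(key), value)


w = WithItems()
w.name = 5
w[1] = 6                      # stored as the attribute named '1'
via_name = w['name']
via_item = w['1']

print(instance_hook_used, via_name, via_item)
```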

On metaclass

Using the previous definition, we cannot subclass a to create b.

    >>> class b(a): pass
    Traceback (most recent call last):
        class b(a): pass
    TypeError: Error when calling the metaclass bases
        object.__new__() takes no parameters

This is because a is not a type. The solution involves Python metaclasses (an advanced topic).

metaclass construction

We will subclass type, not object, and add to it the special item methods.

    >>> class ObjectType(type):
    ...
    ...     def __getitem__(self, key):
    ...         return getattr(self, str(key))
    ...
    ...     def __setitem__(self, key, value):
    ...         return setattr(self, str(key), value)

Here is a fancy way of calling ObjectType. (This is Python 2 syntax; Python 3 ignores __metaclass__ and expects class root(object, metaclass=ObjectType) instead.)

    >>> class root(object):
    ...     __metaclass__ = ObjectType

Here is a more direct (and equivalent) construction (create an instance of ObjectType, whose instances are objects).

    >>> root = ObjectType('root', (object,), {})
    >>> isinstance(root(), object)
    True

metaclass demonstration

    >>> class a(root): pass
    >>> class b(a): pass

    >>> a.name = 5
    >>> a.name
    5
    >>> b.name
    5
    >>> a['name']
    5
    >>> b['name']
    5

    >>> a[1] = 6
    >>> a['1']
    6

metaclass explanation

Because type(root) is a subclass of type we can subclass root.

    >>> issubclass(type(root), type)
    True

Because type(root) is ObjectType, which has special item methods, we get the special item behaviour.

    >>> type(root) == type(a) == type(b) == ObjectType
    True
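The sessions in this tutorial were written for Python 2. As a hedged sketch of how the same construction might look in Python 3, the block below uses the metaclass keyword in the class statement; the name direct_root is introduced here only to show the equivalent direct call.

```python
class ObjectType(type):
    # Item access on a class is forwarded to attribute access on that class.
    def __getitem__(cls, key):
        return getattr(cls, str(key))

    def __setitem__(cls, key, value):
        setattr(cls, str(key), value)


# Python 3 spelling of "__metaclass__ = ObjectType".
class root(metaclass=ObjectType):
    pass


# The direct construction: call the metaclass with name, bases and namespace.
direct_root = ObjectType('direct_root', (object,), {})


class a(root):
    pass


class b(a):
    pass


a.name = 5
a[1] = 6

print(a['name'], b['name'], b['1'], type(b) is ObjectType)
```

Subclasses pick up the metaclass of their base automatically, which is why a and b never mention ObjectType themselves.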

Never instantiated

We can’t call JavaScript objects (unless they are functions). But create creates ordinary JavaScript objects.

    js> a = create(root)
    js> a(1, 2, 3)
    TypeError: a is not a function

We will monkey-patch the previous Python class, to provide custom behaviour when called.

    >>> def raise_not_a_function(obj, *argv, **kwargs):
    ...     raise TypeError(obj.__name__ + ' is not a function')

    >>> ObjectType.__call__ = raise_not_a_function

    >>> a(1, 2, 3)
    Traceback (most recent call last):
        a(1, 2, 3)
    TypeError: a is not a function
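A self-contained Python 3 variant of the same idea is sketched below. Instead of monkey-patching, it defines __call__ directly on the metaclass; that is a choice made for this example, not the tutorial's own code.

```python
class ObjectType(type):
    # Calling a class normally creates an instance; here it is refused.
    def __call__(cls, *args, **kwargs):
        raise TypeError(cls.__name__ + ' is not a function')


class root(metaclass=ObjectType):
    pass


class a(root):
    pass


try:
    a(1, 2, 3)
except TypeError as exc:
    message = str(exc)

print(message)
```

Defining the classes still works, because creating a class calls the metaclass itself rather than the __call__ method it provides to its instances.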

Conclusion

JavaScript objects are like Python classes (because they inherit like Python classes).

In JavaScript, attribute and item access are the same. This is achieved in Python by providing custom item methods.

In Python the custom item methods must be placed on the type of the object (or a superclass of its type).

Ordinary JavaScript objects are not functions and cannot be called. A Python class can be called (to create an instance of the class). But we can override this behaviour by supplying a custom __call__ method on its type.

JavaScript objects are like Python classes with custom item methods (on the metaclass) that are never instantiated.


© Copyright 2011, Jonathan Fine. Revision 9c142f616b68.

Parsing JavaScript objects into Python data structures

Transforms JavaScript objects into Python data structures.

In web scraping, you sometimes need to transform JavaScript objects embedded in HTML pages into valid Python dictionaries. chompjs is a library designed to do that, as a more powerful replacement for the standard json.loads:

    >>> import chompjs
    >>> chompjs.parse_js_object("{a: 100}")
    {'a': 100}
    >>>
    >>> json_lines = """
    ... {'a': 12}
    ... {'b': 13}
    ... {'c': 14}
    ... """
    >>> for entry in chompjs.parse_js_objects(json_lines):
    ...     print(entry)
    ...
    {'a': 12}
    {'b': 13}
    {'c': 14}

Quickstart

1. installation

    $ git clone https://github.com/Nykakin/chompjs
    $ cd chompjs
    $ python setup.py build
    $ python setup.py install

Features

There are two functions available:

  • parse_js_object - tries to read the first JSON-like object encountered. Raises ValueError on failure.
  • parse_js_objects - returns a generator yielding all JSON-like objects encountered. Can be used to read JSON Lines. Does not raise on invalid input.

An example usage with scrapy:

    import chompjs
    import scrapy


    class MySpider(scrapy.Spider):
        # ...

        def parse(self, response):
            script_css = 'script:contains("__NEXT_DATA__")::text'
            script_pattern = r'__NEXT_DATA__ = (.*);'
            # warning: for some pages you need to pass replace_entities=True
            # into re_first to have JSON escaped properly
            script_text = response.css(script_css).re_first(script_pattern)
            try:
                json_data = chompjs.parse_js_object(script_text)
            except ValueError:
                self.log('Failed to extract data from {}'.format(response.url))
                return

            # work on json_data

Parsing of JSON5 objects is also supported:

    >>> data = """
    ... {
    ...   // comments
    ...   unquoted: 'and you can quote me on that',
    ...   singleQuotes: 'I can use "double quotes" here',
    ...   lineBreaks: "Look, Mom! \
    ... No \\n's!",
    ...   hexadecimal: 0xdecaf,
    ...   leadingDecimalPoint: .8675309, andTrailing: 8675309.,
    ...   positiveSign: +1,
    ...   trailingComma: 'in objects', andIn: ['arrays',],
    ...   "backwardsCompatible": "with JSON",
    ... }
    ... """
    >>> chompjs.parse_js_object(data)
    {'unquoted': 'and you can quote me on that', 'singleQuotes': 'I can use "double quotes" here', 'lineBreaks': "Look, Mom! No \n's!", 'hexadecimal': 912559, 'leadingDecimalPoint': 0.8675309, 'andTrailing': 8675309.0, 'positiveSign': '+1', 'trailingComma': 'in objects', 'andIn': ['arrays'], 'backwardsCompatible': 'with JSON'}

If the input string is not yet escaped and contains a lot of \\ characters, then the unicode_escape=True argument might help to sanitize it:

    >>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
    {'a': 12}

By default chompjs tries to start with the first { or [ character it finds, omitting the rest:

    >>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>')
    [1, 2, 3]

The json_params argument can be used to pass options to the underlying json.loads, such as strict or object_hook:

    >>> import decimal
    >>> import chompjs
    >>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
    [Decimal('23.2')]

In web scraping, data is often not present directly inside the HTML, but is instead provided as an embedded JavaScript object that is later used to initialize the page, for example:

    <html>
    <head>...</head>
    <body>
    ...
    <script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
    ...
    </body>
    </html>

The standard library function json.loads is usually sufficient to extract this data:

    >>> # scrapy shell file:///tmp/test.html
    >>> import json
    >>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
    >>> json.loads(script_text)
    {u'foo': u'bar'}
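For readers without a scrapy shell at hand, the same extraction can be sketched with the standard library alone. The snippet below is an illustration under assumptions: the HTML string mirrors the example page above, and a regular expression stands in for scrapy's CSS selector and re_first. The helper name extract_preloaded_state is hypothetical.

```python
import json
import re

# Stand-in for the page shown above; a real page would come from a response.
HTML = """<html>
<head></head>
<body>
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
</body>
</html>"""


def extract_preloaded_state(html):
    """Return the embedded object as a dict, or None if it is not found."""
    # Capture everything between the assignment and the closing script tag.
    match = re.search(r'__PRELOADED_STATE__=(.*?)</script>', html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


state = extract_preloaded_state(HTML)
print(state)
```

This only works because the embedded object happens to be strict JSON.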

The problem is that not all valid JavaScript objects are also valid JSON. For example, all of these strings are valid JavaScript objects but not valid JSON:

  • "{'a': 'b'}" is not valid JSON because it uses the ' character to quote
  • '{a: "b"}' is not valid JSON because the property name is not quoted at all
  • '{"a": [1, 2, 3,]}' is not valid JSON because there is an extra , character at the end of the array
  • '{"a": .99}' is not valid JSON because the float value lacks a leading 0

As a result, json.loads fails to extract any of those:

    >>> json.loads("{'a': 'b'}")
    Traceback (most recent call last):
      ...
    ValueError: Expecting property name: line 1 column 2 (char 1)
    >>> json.loads('{a: "b"}')
    Traceback (most recent call last):
      ...
    ValueError: Expecting property name: line 1 column 2 (char 1)
    >>> json.loads('{"a": [1, 2, 3,]}')
    Traceback (most recent call last):
      ...
    ValueError: No JSON object could be decoded
    >>> json.loads('{"a": .99}')
    Traceback (most recent call last):
      ...
    json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
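The failures can be reproduced in one go with a short Python 3 script. The four sample strings are illustrative inputs chosen to match the four cases listed above (single quotes, unquoted property name, trailing comma, missing leading zero); the exact error wording depends on the Python version, so the sketch only collects it rather than relying on it.

```python
import json

# Illustrative inputs, one per problem described in the list above.
samples = [
    "{'a': 'b'}",          # single-quoted strings
    '{a: "b"}',            # unquoted property name
    '{"a": [1, 2, 3,]}',   # trailing comma in the array
    '{"a": .99}',          # float without a leading zero
]

failures = []
for text in samples:
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.msg holds the parser's explanation; exc.pos is the offset.
        failures.append((text, exc.msg, exc.pos))

for text, msg, pos in failures:
    print('{!r}: {} (char {})'.format(text, msg, pos))
```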

The chompjs library was designed to bypass this limitation; it lets you scrape such JavaScript objects into proper Python dictionaries:

    >>> import chompjs
    >>>
    >>> chompjs.parse_js_object("{'a': 'b'}")
    {'a': 'b'}
    >>> chompjs.parse_js_object('{a: "b"}')
    {'a': 'b'}
    >>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
    {'a': [1, 2, 3]}
    >>> chompjs.parse_js_object('{"a": .99}')
    {'a': 0.99}

Internally, chompjs uses a parser written in C to iterate over the raw string, fixing its issues along the way. The final result is then passed down to the standard library's json.loads, which keeps it fast compared with full-blown JavaScript parsers such as demjson.

    >>> import json
    >>> import _chompjs
    >>>
    >>> _chompjs.parse('{a: 1}')
    '{"a":1}'
    >>> json.loads(_)
    {'a': 1}
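The fix-then-delegate idea can be illustrated with a toy normalizer in pure Python. This is emphatically not how chompjs is implemented (its parser is written in C and tracks string boundaries); the function naive_js_to_json below is a hypothetical helper that handles only three of the problems discussed here and would corrupt string values that happen to contain the same patterns.

```python
import json
import re


def naive_js_to_json(text):
    """Toy rewrite of a JavaScript object literal into JSON text.

    Handles unquoted keys, trailing commas and leading-dot floats only,
    and ignores the fact that these patterns may occur inside strings.
    """
    # {a: 1} -> {"a": 1}: quote bare identifiers that are followed by a colon.
    text = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)(\s*:)', r'\1"\2"\3', text)
    # [1, 2, 3,] -> [1, 2, 3]: drop a comma that precedes a closing bracket.
    text = re.sub(r',\s*([}\]])', r'\1', text)
    # .99 -> 0.99: add the leading zero JSON requires.
    text = re.sub(r'(?<![\w.])\.(\d)', r'0.\1', text)
    return text


inputs = ['{a: 1}', '{"a": [1, 2, 3,]}', '{"a": .99}']
# Repair each string first, then let the standard parser do the real work.
parsed = [json.loads(naive_js_to_json(text)) for text in inputs]
print(parsed)
```

Keeping json.loads as the final step means the repaired text is still validated by a strict parser, so a repair that goes wrong surfaces as an error instead of silently producing bad data.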

Pull requests are welcome.
